(This lesson was adapted from Kara Woo’s materials, which were adapted from Jeff Hollister’s materials.)

We first want to load some data. We’ll use the function `read.csv()` to load data from a CSV file.

`gapminder <- read.csv("http://kbroman.org/datacarp/gapminder.csv")`

In the “environment pane”, you’ll now see that your workspace contains an object `gapminder`, which is a rectangle of data with 1704 rows and 6 columns.

Use `head()` to look at the first few rows.

`head(gapminder)`

These data concern the life expectancy, population, and GDP per capita for different countries for every 5th year from 1952–2007. GDP is in 2005 US dollars.

There’s also a function `tail()` to look at the last few rows. Both `head()` and `tail()` have arguments that allow you to control how many rows are shown. For example, to look at the last 20 rows of the data:

`tail(gapminder, 20)`
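As a minimal sketch of how the row-count argument behaves, the following uses a small made-up data frame (`toy`, which is not part of the gapminder data) so that it runs without a download. The argument is called `n`, and a negative `n` means "everything except that many rows".

```r
# A small made-up data frame standing in for gapminder (hypothetical values)
toy <- data.frame(id = 1:10, value = (1:10)^2)

first3 <- head(toy, 3)          # first 3 rows
last2 <- tail(toy, n = 2)       # last 2 rows; the argument can be named explicitly
all_but_last2 <- head(toy, -2)  # negative n: all rows except the last 2

first3
last2
nrow(all_but_last2)
```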

Challenge: The function `str()` tells you about the structure of a data object. Use `str()` with the `gapminder` data.

- How many countries are there?
- What is the “class” of this data object?

Other useful functions for learning about a data set are `dim()`, `nrow()`, and `ncol()`.
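Here is a short sketch of what these three functions return, again on a made-up data frame (`toy` and its values are hypothetical) rather than the real data.

```r
# Made-up data frame with 4 rows and 3 columns (hypothetical values)
toy <- data.frame(country = c("A", "B", "C", "D"),
                  year = c(1952, 1952, 2007, 2007),
                  pop = c(10, 20, 30, 40))

toy_dim <- dim(toy)  # integer vector: number of rows, then number of columns
toy_dim
nrow(toy)            # just the number of rows
ncol(toy)            # just the number of columns
```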

You can use square brackets to pull out individual values from the data frame.

```
gapminder[1,1]
gapminder[3,5]
```

The number before the comma is the row; the number after the comma is the column.

You can also pull out full rows or columns by leaving one of the two blank. Note you always need to include the comma.

```
gapminder[1000,]
gapminder[,3]
```

When you pull out a full column, you get a vector of values. When you pull out a full row, you get a data frame with one row.
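You can check this difference with `class()`. The sketch below uses a made-up data frame (`toy`, hypothetical values). It also shows the base R argument `drop = FALSE`, which is one way to keep a single column as a data frame if that is what you need.

```r
# Made-up data frame (hypothetical values)
toy <- data.frame(country = c("A", "B", "C"),
                  lifeExp = c(60.5, 71.2, 80.1))

one_col <- toy[, 2]               # a full column: a plain numeric vector
one_row <- toy[2, ]               # a full row: a data frame with one row
col_df <- toy[, 2, drop = FALSE]  # a full column kept as a one-column data frame

class(one_col)
class(one_row)
class(col_df)
```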

You can also refer to columns by name, either with a quoted name inside the square brackets or with the dollar sign (`$`).

```
gapminder[80,"lifeExp"]
gapminder$lifeExp[80]
```

You use vectors to grab slices of the data.

`gapminder[101:110, c("country", "year")]`

You can also use conditions.

`gapminder[gapminder$pop <= 100000, ]`
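To see why a condition works as a row index, it can help to look at the condition on its own: it is a logical vector with one `TRUE` or `FALSE` per row, and only the `TRUE` rows are kept. A minimal sketch with a made-up data frame (`toy`, hypothetical values):

```r
# Made-up data frame (hypothetical values)
toy <- data.frame(country = c("A", "B", "C", "D"),
                  pop = c(50000, 250000, 80000, 1200000))

is_small <- toy$pop <= 100000  # logical vector, one element per row
is_small

small <- toy[is_small, ]       # keep only the rows where the condition is TRUE
small

sum(is_small)                  # TRUE counts as 1, so this is the number of rows kept
```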

Challenge: Which of the following are *not* equivalent?

- `gapminder[50,4]` and `gapminder[50, "lifeExp"]`
- `gapminder[50,4]` and `gapminder[4, 50]`
- `gapminder[50,4]` and `gapminder$lifeExp[50]`

Challenge: Which countries have had life expectancies greater than 80?

dplyr is an R package that simplifies the “manipulation” of data frames in R. It helps to organize the process by defining a set of discrete actions that you may wish to perform:

- `filter`: filter a set of rows
- `select`: select a set of columns
- `arrange`: arrange/sort the rows
- `mutate`: add a “mutated” version of a column or columns
- `summarize`: summarize a column
- `group_by`: group rows by the value of some column or columns

We first need to load the dplyr package.

`library(dplyr)`

Let’s start with `filter`, for choosing some set of rows. For example, the following grabs all of the rows for Sweden.

`filter(gapminder, country == "Sweden")`

You can filter with multiple criteria.

`filter(gapminder, country=="Sweden", year < 1969)`

Challenge: What was the population of the United States in 1952?

What if we want Sweden for the years 1952 and 2007? There are two ways to do this. First, we can use the vertical bar (`|`), which stands for “or”.

`filter(gapminder, country=="Sweden", year==1952 | year==2007)`

Second, we can use the `%in%` operator.

`filter(gapminder, country=="Sweden", year %in% c(1952, 2007))`
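It may help to see what `%in%` returns on its own: for each element on the left, it says whether that element appears anywhere in the vector on the right. The sketch below uses a made-up data frame (`toy`, with invented population numbers) to check that the two approaches pick the same rows.

```r
library(dplyr)

# Made-up data frame (hypothetical population values, in millions)
toy <- data.frame(country = rep(c("Sweden", "Norway"), each = 3),
                  year = rep(c(1952, 1957, 2007), times = 2),
                  pop = c(7.1, 7.4, 9.0, 3.3, 3.5, 4.6))

# One TRUE/FALSE per element on the left-hand side
c(1952, 1957, 2007) %in% c(1952, 2007)  # TRUE FALSE TRUE

with_or <- filter(toy, country == "Sweden", year == 1952 | year == 2007)
with_in <- filter(toy, country == "Sweden", year %in% c(1952, 2007))

# Both approaches should select the same rows
isTRUE(all.equal(with_or, with_in))
```

With only two years the two forms are about equally easy to write. `%in%` tends to be more convenient as the list of values grows.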

We use `select` to select a set of columns. We can combine the two by saving the output of `filter` and then using that for `select`.

```
sweden <- filter(gapminder, country == "Sweden")
select(sweden, year, pop)
```

To use `filter` and then `select`, you need to send the output of one function into the next one. Above, we saved the result of `filter` and then used it when calling `select`. We could also have used nested functions.

`select( filter(gapminder, country=="Sweden"), year, pop)`

A more convenient way to do this is with the “pipe” operator, which looks like `%>%`. It comes from the magrittr package and is made available automatically when you load dplyr. In RStudio, the keyboard shortcut is `Ctrl-Shift-M`. With the pipe operator, the output of one function is passed directly as input to the next function.

```
gapminder %>%
  filter(country=="Sweden") %>%
  select(year, pop)
```
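The general rule is that `x %>% f(y)` means the same thing as `f(x, y)`: the left-hand side becomes the first argument of the function on the right. Here is a small sketch of that rule, using a plain numeric vector and a made-up data frame (`toy`, hypothetical values).

```r
library(dplyr)

# With a plain vector: these two lines compute the same thing
sum(sqrt(c(4, 9, 16)))
c(4, 9, 16) %>% sqrt() %>% sum()

# With a made-up data frame (hypothetical values)
toy <- data.frame(country = c("Sweden", "Sweden", "Norway"),
                  year = c(1952, 2007, 2007),
                  pop = c(7.1, 9.0, 4.6))

nested <- select(filter(toy, country == "Sweden"), year, pop)
piped <- toy %>%
  filter(country == "Sweden") %>%
  select(year, pop)

# The nested and piped versions should give the same result
isTRUE(all.equal(nested, piped))
```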

Challenge: Using pipes, subset the gapminder data to grab rows where `gdpPercap` is greater than or equal to 35,000. Retain the columns `country`, `year`, and `gdpPercap`.

We use `arrange` to sort the rows based on some column. For example, we could sort the results of that last challenge based on `gdpPercap`.

```
gapminder %>%
  filter(gdpPercap >= 35000) %>%
  select(country, year, gdpPercap) %>%
  arrange(gdpPercap)
```

The default is to sort from smallest to largest (“ascending”). To sort in the opposite order (“descending”), we use `desc()`.

```
gapminder %>%
  filter(gdpPercap >= 35000) %>%
  select(country, year, gdpPercap) %>%
  arrange(desc(gdpPercap))
```

We use `mutate` to create new columns based on the existing columns. For example, if we wanted a total GDP column, we could do the following:

`mutate(gapminder, total_gdp = gdpPercap * pop)`

You could pipe that into `head` to just see a few rows.

```
gapminder %>%
  mutate(total_gdp = gdpPercap * pop) %>%
  head
```
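One point worth checking for yourself: `mutate` returns a new data frame and leaves the original untouched, so you need to assign the result if you want to keep the new column. The sketch below uses a made-up data frame (`toy`, hypothetical values); the column name `gdp_share` is just an illustration of a new column that builds on one created earlier in the same call.

```r
library(dplyr)

# Made-up data frame (hypothetical values)
toy <- data.frame(country = c("A", "B"),
                  pop = c(1000, 2000),
                  gdpPercap = c(500, 1500))

with_total <- mutate(toy, total_gdp = gdpPercap * pop)

names(toy)         # the original is unchanged
names(with_total)  # the result has the extra column

# A column created in a mutate() call can be used later in the same call
with_both <- mutate(toy,
                    total_gdp = gdpPercap * pop,
                    gdp_share = total_gdp / sum(total_gdp))
with_both
```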

Challenge: Use `mutate` to calculate the total GDP in billions of dollars, retrieve just the results for the year 2007, and sort the rows so that the total GDP is in decreasing order.

We use `summarize()` to get summaries of the values in a column.

```
gapminder %>%
  filter(year==2007) %>%
  summarize(mean_pop = mean(pop))
```

You can include as many summaries as you want.

```
gapminder %>%
  filter(year==2007) %>%
  summarize(mean_pop=mean(pop), median_pop=median(pop),
            min_pop=min(pop), max_pop=max(pop))
```

Most commonly, what we want to do is get group-specific summaries. We think of this as the “split-apply-combine” approach, where we split the data by the values in a column, apply some function to each group, and then combine the results. We use `group_by` to do the splitting, and then `summarize` to calculate the summary and combine the results.

For example, here is the average population per country, by continent, in the year 2007.

```
gapminder %>%
  filter(year==2007) %>%
  group_by(continent) %>%
  summarize(mean_pop=mean(pop))
```

I always like to sort the results.

```
gapminder %>%
  filter(year==2007) %>%
  group_by(continent) %>%
  summarize(mean_pop=mean(pop)) %>%
  arrange(desc(mean_pop))
```

You can use `n()` to get the counts in each group.

```
gapminder %>%
  filter(year==2007) %>%
  group_by(continent) %>%
  summarize(mean_pop=mean(pop), n=n()) %>%
  arrange(desc(mean_pop))
```
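To convince yourself of what split-apply-combine is doing, it can be useful to try it on data small enough to check by hand. The sketch below uses a made-up data frame (`toy`, with hypothetical continents `X` and `Y`) with two rows in one group and three in the other.

```r
library(dplyr)

# Made-up data frame (hypothetical values): 2 rows for X, 3 rows for Y
toy <- data.frame(continent = c("X", "X", "Y", "Y", "Y"),
                  pop = c(10, 30, 5, 10, 15))

by_cont <- toy %>%
  group_by(continent) %>%                   # split: one group per continent
  summarize(mean_pop = mean(pop), n = n())  # apply and combine: one row per group

# X: mean of 10 and 30 is 20, with n = 2
# Y: mean of 5, 10, and 15 is 10, with n = 3
by_cont
```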

Challenge: What was the average life expectancy (`lifeExp`) by continent in 2007?

You can use `group_by` with multiple columns, for example to get the average GDP per capita by continent and by year.

```
gapminder %>%
  group_by(continent, year) %>%
  summarize(mean_gdpPercap=mean(gdpPercap))
```

Challenge: Calculate the overall GDP per capita by continent in the years 1952 and 2007.

- Use `filter()` to pull out the rows with `year %in% c(1952,2007)`.
- Use `mutate()` to calculate the total GDP for each country.
- Use `group_by()` and `summarize()` (with `sum()`) to calculate the total GDP and total population for each continent.
- Use `mutate()` to calculate the GDP per capita for each continent.

```
gapminder %>%
  filter(year %in% c(1952, 2007)) %>%
  mutate(total_gdp=pop*gdpPercap) %>%
  group_by(continent, year) %>%
  # pop*1.0 converts pop to double, which guards against integer overflow in sum()
  summarize(sum_gdp=sum(total_gdp), sum_pop=sum(pop*1.0)) %>%
  mutate(overall_gdpPercap=sum_gdp/sum_pop) %>%
  arrange(desc(overall_gdpPercap))
```
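Why go to the trouble of summing total GDP and total population, rather than just taking `mean(gdpPercap)` within each continent? The plain mean gives every country equal weight, while the overall figure weights each country by its population. The sketch below shows the difference on a made-up data frame (`toy`, with one hypothetical continent `X` containing a large poor country and a small rich one). The use of base R's `weighted.mean()` is an addition here, included only to show that it gives the same answer as the ratio of sums.

```r
library(dplyr)

# Made-up data frame (hypothetical values)
toy <- data.frame(continent = c("X", "X"),
                  country = c("Big", "Small"),
                  pop = c(900, 100),
                  gdpPercap = c(1000, 11000))

comparison <- toy %>%
  group_by(continent) %>%
  summarize(mean_gdpPercap = mean(gdpPercap),                     # equal weight per country
            overall_gdpPercap = sum(gdpPercap * pop) / sum(pop),  # total GDP / total population
            weighted = weighted.mean(gdpPercap, pop))             # population-weighted mean

# mean_gdpPercap is (1000 + 11000) / 2 = 6000
# overall_gdpPercap is (900000 + 1100000) / 1000 = 2000
comparison
```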