dplyr

(This lesson was adapted from Kara Woo’s materials, which were adapted from Jeff Hollister’s materials.)

Load data, etc.

We first want to load some data. We’ll use the function read.csv() to load data from a CSV file.

gapminder <- read.csv("http://kbroman.org/datacarp/gapminder.csv")

In the “environment pane”, you’ll now see that your workspace contains an object gapminder, which is a rectangle of data with 1704 rows and 6 columns.

Use head() to look at the first few rows.

head(gapminder)

These data concern the life expectancy, population, and GDP per capita for different countries for every 5th year from 1952–2007. GDP is in 2005 US dollars.

There’s also a function tail() to look at the last few rows. And both head() and tail() have arguments that allow you to control how many rows are shown. For example, to look at the last 20 rows of the data:

tail(gapminder, 20)

Challenge: The function str() tells you about the structure of a data object. Use str() with the gapminder data.

How many countries are there?

What is the “class” of this data object?

Other useful functions for learning about a data set are dim(), nrow(), and ncol().
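As a quick illustration of these three functions, here is a minimal sketch on a tiny made-up data frame. The name toy and its values are invented for this example and are not part of the gapminder data.

```r
# A tiny made-up data frame (hypothetical values, for illustration only)
toy <- data.frame(id    = 1:4,
                  group = c("a", "a", "b", "b"),
                  value = c(2.5, 3.0, 4.5, 6.0))

dim(toy)    # number of rows, then number of columns
nrow(toy)   # number of rows only
ncol(toy)   # number of columns only
```

The same calls work on gapminder once it has been loaded.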

Subsetting a data frame

You can use square brackets to pull out individual values from the data frame.

gapminder[1,1]
gapminder[3,5]

The number before the comma is the row; the number after the comma is the column.

You can also pull out full rows or columns by leaving one of the two blank. Note you always need to include the comma.

gapminder[1000,]
gapminder[,3]

When you pull out a full column, you get a vector of values. When you pull out a full row, you get a data frame with one row.
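The difference between the two cases can be checked directly. The sketch below uses a small made-up data frame (toy is a hypothetical name, not part of the lesson's data) and also shows drop = FALSE, a base R option that keeps a single column as a data frame.

```r
# Made-up values, for illustration only
toy <- data.frame(id = 1:3, value = c(2.5, 3.0, 4.5))

one_col    <- toy[, 2]                # single column: a plain numeric vector
one_row    <- toy[2, ]                # single row: still a data frame
one_col_df <- toy[, 2, drop = FALSE]  # drop = FALSE keeps the data frame structure

is.data.frame(one_col)     # FALSE
is.data.frame(one_row)     # TRUE
is.data.frame(one_col_df)  # TRUE
```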

You can also refer to columns by name, either inside the square brackets or with the dollar-sign operator.

gapminder[80,"lifeExp"]
gapminder$lifeExp[80]

You can use vectors to grab slices of the data.

gapminder[101:110, c("country", "year")]

You can also use conditions.

gapminder[gapminder$pop <= 100000, ]
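It may help to see what the condition produces on its own: a logical vector with one TRUE or FALSE per row, which the square brackets then use to keep only the TRUE rows. Here is a minimal sketch with made-up values (toy, keep, and small are hypothetical names).

```r
# Made-up populations, for illustration only
toy <- data.frame(name = c("a", "b", "c", "d"),
                  pop  = c(50000, 250000, 80000, 1200000))

keep <- toy$pop <= 100000   # logical vector: TRUE FALSE TRUE FALSE
keep

small <- toy[keep, ]        # keeps only the rows where keep is TRUE
small
```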

Challenge: Which of the following are not equivalent?

gapminder[50,4] and gapminder[50, "lifeExp"]

gapminder[50,4] and gapminder[4, 50]

gapminder[50,4] and gapminder$lifeExp[50]

Challenge: Which countries have had life expectancies greater than 80?

Data manipulation with dplyr

dplyr is an R package that simplifies the “manipulation” of data frames in R. It helps to organize the process by defining a set of discrete actions that you may wish to perform:

filter - filter a set of rows
select - select a set of columns
arrange - arrange/sort the rows
mutate - add a “mutated” version of a column or columns
summarize - summarize a column
group_by - group rows by the value of some column or columns

We first need to load the dplyr package.

library(dplyr)

Let’s start with filter, for choosing some set of rows. For example, the following grabs all of the rows for Sweden.

filter(gapminder, country == "Sweden")

You can filter with multiple criteria.

filter(gapminder, country=="Sweden", year < 1969)

Challenge: What was the population of the United States in 1952?

What if we want Sweden for the years 1952 and 2007? There are two ways to do this. First we can use the vertical bar (|) which stands for “or”.

filter(gapminder, country=="Sweden", year==1952 | year==2007)

Second, we can use the %in% operator.

filter(gapminder, country=="Sweden", year %in% c(1952, 2007))
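To see that the two approaches agree, here is a small sketch on a plain vector of years (the vector years is made up for this example). It also shows why == with a two-element vector is not a substitute for %in%: the shorter vector is recycled and compared position by position.

```r
# A made-up vector of years, for illustration only
years <- c(1952, 1957, 2007, 1952)

by_or <- years == 1952 | years == 2007   # TRUE FALSE TRUE TRUE
by_in <- years %in% c(1952, 2007)        # TRUE FALSE TRUE TRUE
identical(by_or, by_in)                  # TRUE

# Recycling compares 1952, 2007, 1952, 2007 against years element by element
by_eq <- years == c(1952, 2007)          # TRUE FALSE FALSE FALSE
```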

We use select to select a set of columns. We can combine the two by saving the output of filter and then using that for select.

sweden <- filter(gapminder, country == "Sweden")
select(sweden, year, pop)

Pipes

To use filter and then select, you need to send the output of one function into the next one. Above, we saved the result of filter and then used it when calling select. We could also have used nested functions.

select( filter(gapminder, country=="Sweden"), year, pop)

A more convenient way to do this is with the “pipe” operator, %>%, which comes from the magrittr package and becomes available when you load dplyr. In RStudio, the keyboard shortcut Ctrl-Shift-M inserts it. With the pipe operator, the output of one function is passed as the first argument of the next function.

gapminder %>%
  filter(country=="Sweden") %>%
  select(year, pop)
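The following sketch checks that the nested and piped forms give the same result. It assumes dplyr is installed and uses a small made-up data frame (toy and its numbers are invented for this example).

```r
library(dplyr)

# Made-up values, for illustration only
toy <- data.frame(country = c("Sweden", "Sweden", "Norway"),
                  year    = c(1952, 2007, 2007),
                  pop     = c(7.1, 9.0, 4.6))

# Nested calls: read from the inside out
nested <- select(filter(toy, country == "Sweden"), year, pop)

# Piped calls: read from top to bottom
piped <- toy %>%
  filter(country == "Sweden") %>%
  select(year, pop)

all.equal(nested, piped)   # compare the two results
```

In other words, x %>% f(y) is another way of writing f(x, y).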

Challenge: Using pipes, subset the gapminder data to grab rows where gdpPercap is greater than or equal to 35,000. Retain the columns country, year, and gdpPercap.

More dplyr functions

We use arrange to sort the rows based on some column. For example, we could sort the results of that last challenge based on gdpPercap.

gapminder %>%
  filter(gdpPercap >= 35000) %>%
  select(country, year, gdpPercap) %>%
  arrange(gdpPercap)

The default is to sort from smallest to largest (“ascending”). To sort in the opposite order (“descending”), we use desc().

gapminder %>%
  filter(gdpPercap >= 35000) %>%
  select(country, year, gdpPercap) %>%
  arrange(desc(gdpPercap))

We use mutate to create new columns based on the existing columns. For example, if we wanted a total GDP column, we could do the following:

mutate(gapminder, total_gdp = gdpPercap * pop)

You could pipe that into head to just see a few rows.

gapminder %>%
  mutate(total_gdp = gdpPercap * pop) %>%
  head()

Challenge: Use mutate to calculate the total GDP in billions of dollars, retrieve just the results for the year 2007, and sort the rows so that the total GDP is in decreasing order.

We use summarize() to get summaries of the values in a column.

gapminder %>%
  filter(year==2007) %>%
  summarize(mean_pop = mean(pop))

You can include as many summaries as you want.

gapminder %>%
  filter(year==2007) %>%
  summarize(mean_pop=mean(pop), median_pop=median(pop),
            min_pop=min(pop), max_pop=max(pop))

Split-apply-combine

Most commonly, what we want to do is get group-specific summaries. We think of this as the “split-apply-combine” approach, where we split the data by the values in a column, apply some function to each group, and then combine the results. We use group_by to do the splitting, and then summarize to calculate the summary and combine the results.

For example, we can calculate the average population per country, by continent, in the year 2007.

gapminder %>%
  filter(year==2007) %>%
  group_by(continent) %>%
  summarize(mean_pop=mean(pop))
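To make the three steps concrete, here is a rough sketch of split-apply-combine done by hand in base R on made-up data. The names toy, pieces, and means are hypothetical; group_by and summarize carry out the same steps for you in one pipeline.

```r
# Made-up values, for illustration only
toy <- data.frame(continent = c("A", "A", "B", "B", "B"),
                  pop       = c(10, 20, 30, 60, 90))

# Split: one vector of populations per continent
pieces <- split(toy$pop, toy$continent)

# Apply mean() to each piece and combine the results into a named vector
means <- sapply(pieces, mean)
means   # A = 15, B = 60
```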

I always like to sort the results.

gapminder %>%
  filter(year==2007) %>%
  group_by(continent) %>%
  summarize(mean_pop=mean(pop)) %>%
  arrange(desc(mean_pop))

You can use n() to get the counts in each group.

gapminder %>%
  filter(year==2007) %>%
  group_by(continent) %>%
  summarize(mean_pop=mean(pop), n=n()) %>%
  arrange(desc(mean_pop))

Challenge: What was the average life expectancy (lifeExp) by continent in 2007?

You can use group_by with multiple columns, for example to get the average GDP per capita by continent and by year.

gapminder %>%
  group_by(continent, year) %>%
  summarize(mean_gdpPercap=mean(gdpPercap))

Challenge: Calculate the overall GDP per capita by continent in the years 1952 and 2007.

Use filter() to pull out the rows with year %in% c(1952,2007).

Use mutate() to calculate the total GDP for each country.

Use group_by() and summarize() (with sum()) to calculate the total GDP and total population for each continent.

Use mutate() to calculate the GDP per capita for each continent.

gapminder %>%
  filter(year %in% c(1952, 2007)) %>%
  mutate(total_gdp=pop*gdpPercap) %>%
  group_by(continent, year) %>%
  summarize(sum_gdp=sum(total_gdp), sum_pop=sum(pop*1.0)) %>%
  mutate(overall_gdpPercap=sum_gdp/sum_pop) %>%
  arrange(desc(overall_gdpPercap))
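The lesson does not say why pop is multiplied by 1.0 inside sum(). A likely reason, assuming read.csv() stored pop as integers, is that continent totals can exceed the largest value R's integer type can hold, and multiplying by 1.0 converts the values to doubles first. A minimal sketch of that behavior (big, int_sum, and dbl_sum are hypothetical names):

```r
# Two integers whose sum exceeds the integer range
big <- c(.Machine$integer.max, 1L)

# Summing integers that overflow gives NA (R also emits a warning)
int_sum <- suppressWarnings(sum(big))

# Multiplying by 1.0 converts to double, so the sum is computed without overflow
dbl_sum <- sum(big * 1.0)

int_sum   # NA
dbl_sum   # 2147483648
```

Writing sum(as.numeric(pop)) would be an equivalent and perhaps more explicit alternative.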