Learning Objectives

I’ve placed these reduced data on the web, so you can download them directly.

reduced <- read.csv("http://kbroman.org/datacarp/portal_data_reduced.csv")

Or download the file and then load it

download.file("http://kbroman.org/datacarp/portal_data_reduced.csv",
              "CleanData/portal_data_reduced.csv")
reduced <- read.csv("CleanData/portal_data_reduced.csv")

Plotting with ggplot2

There are two main systems for making plots in R: “base graphics” (which are the traditional plotting functions distributed with R) and ggplot2, written by Hadley Wickham following Leland Wilkinson’s book Grammar of Graphics. We’re going to show you how to use ggplot2. It’s may seem a bit complicated at first, but once you get a hang of it, you’ll be able to make really useful visualizations quite rapidly.

We first need to load the dplyr and ggplot2 packages.

library(ggplot2)
library(dplyr)

I’ll assume the data are available, and we’ll focus on the “cleaned” version, reduced.

Let’s first make a scatterplot of hindfoot length vs weight. Here’s the code to do it.

ggplot(reduced, aes(x = weight, y = hindfoot_length)) + geom_point()

Two key concepts in the grammar of graphics: aesthetics map features of the data (for example, the weight variable) to features of the visualization (for example the y-axis coordinate), and geoms concern what actually gets plotted (here, each row in the data becomes a point in the plot).

Another key aspect of ggplot2: the ggplot() function creates a graphics object; additional controls are added with the + operator. The actual plot is made when the object is printed.

p1 <- ggplot(reduced, aes(x=weight, y=hindfoot_length))
p2 <- p1 + geom_point()
print(p2)

If we saved the pieces like this, we could apply other options afterwards. For example, if we wanted weight on a log scale:

p2 + scale_x_log10()

This makes it kind of easy to try out different things. For example, we could plot the x-axis on a square root scale.

p2 + scale_x_sqrt()

Challenge

Make a scatterplot of hindfoot_length vs weight, but only for the species_id, "DM".

Other aesthetics

For scatterplot, additional aesthetics include shape, size, color, and “alpha” (for transparency of points).

Let’s make a template for our plot, to make modifications easier.

surveys_plot <- ggplot(reduced, aes(x = weight, y = hindfoot_length))
surveys_plot + geom_point(alpha = 0.1)

surveys_plot + geom_point(alpha = 0.1, color = "slateblue")

surveys_plot + geom_point(alpha = 0.1, color = "slateblue", size=0.5)

Things get more interesting when we assign these aesthetics to data.

surveys_plot + geom_point(aes(color = species_id))

Challenge

Use dplyr to calculate the mean weight and hindfoot_length as well as the sample size for each species.

Make a scatterplot of mean hindfoot_length vs mean weight, with the sizes of the points corresponding to the sample size.

Layers

You can use geom_line to make a line plot. For example, we could plot the counts of species by year.

count_by_year <- reduced %>%
    group_by(year) %>%
    tally

p <- ggplot(count_by_year, aes(x=year, y=n))
p + geom_line()

You can use both geom_line and geom_point to make a line plot with points at the data values.

p + geom_line() + geom_point()

This brings up another important concept with ggplot2: layers. A given plot can have multiple layers of geometric objects, plotted one on top of the other.

If you make the lines and points different colors, we can see that the points are placed on top of the lines.

p + geom_line(color="lightblue") + geom_point(color="violetred")

If we switch the order of geom_point and geom_line, we’ll reverse the layers.

p + geom_point(color="violetred") + geom_line(color="lightblue")

Note that aesthetics included in the call to ggplot() (or completely separately) are made to be the defaults for all layers, but we can separately control the aesthetics for each layer. For example, we could color the points by year:

p + geom_line() + geom_point(aes(color=year))

Compare that to the following:

p + geom_line() + geom_point() + aes(color=year)

Challenge

Make a plot of counts of species_id "DM" and "DS" by year.

Groups

Suppose, in that last challenge, we’d wanted to have black lines but the points colored by species. We might have done this:

counts_dm_ds <- reduced %>% filter(species_id %in% c("DM", "DS")) %>%
    group_by(species_id, year) %>% tally
p <- ggplot(counts_dm_ds, aes(x=year, y=n))
p + geom_line() + geom_point(aes(color=species_id))

The points get connected left-to-right, which is not what we want.

If we make the color=species_id aesthetic global, we don’t have this problem.

p + geom_line() + geom_point() + aes(color=species_id)

Alternatively, we can use the group aesthetic, which indicates that certain data points go together. This way the lines can be a constant color.

p + geom_line(aes(group=species_id)) + geom_point(aes(color=species_id))

We could also make the group aesthetic global

p + aes(group=species_id) + geom_line() + geom_point(aes(color=species_id))

Univariate geoms

We’ve focused so far on scatterplots, but one can also create one-dimensional summaries, such as histograms or boxplots.

Challenge

Try using geom_histogram() to make a histogram visualization of the distribution of weight.

Hint: You want weight as the x-axis aesthetic. Try specifying bins in geom_histogram().

Boxplot

Visualising the distribution of weight within each species.

ggplot(reduced, aes(x = species_id, y = hindfoot_length)) +
    geom_boxplot()

By adding points to boxplot, we can have a better idea of the number of measurements and of their distribution:

ggplot(reduced, aes(x = species_id, y = hindfoot_length)) +
    geom_boxplot(alpha = 0) +
    geom_jitter(alpha = 0.3, color = "tomato")

Notice how the boxplot layer is behind the jitter layer? What do you need to change in the code to put the boxplot in front of the points such that it’s not hidden.

Challenge

A variant on the box plot is the violin plot. Use geom_violin() to make violin plots of hindfoot_length by species_id.

Faceting

ggplot has a special technique called faceting that allows to split one plot into multiple plots based on a factor included in the dataset. We will use it to make one plot for a time series for each species.

yearly_counts <- reduced %>% group_by(year, species_id) %>% tally
ggplot(yearly_counts, aes(x = year, y = n, group = species_id, colour = species_id)) +
    geom_line() +
    facet_wrap(~ species_id)

Now we would like to split line in each plot by sex of each individual measured. To do that we need to make counts in data frame grouped by sex.

Challenge

  • Calculate counts grouped by year, species_id, and sex

  • make the faceted plot splitting further by sex (within each panel)

  • color by sex rather than species

Suppose I make a similar plot of average weight by species:

yearly_weight <- reduced %>%
                 group_by(year, species_id, sex) %>%
                 summarise(avg_weight = mean(weight, na.rm = TRUE))
ggplot(yearly_weight, aes(x=year, y=avg_weight, color = species_id, group = species_id)) +
    geom_line() +
    facet_wrap(~ species_id)

Why do we see those steps in the plot?

Oops need to group by sex

yearly_weight <- reduced %>%
                 group_by(year, species_id, sex) %>%
                 summarise(avg_weight = mean(weight, na.rm = TRUE))
ggplot(yearly_weight, aes(x=year, y=avg_weight, color = sex, group = sex)) +
    geom_line() +
    facet_wrap(~ species_id)

facet_grid

The facet_wrap geometry extracts plots into an arbitrary number of dimensions to allow them to cleanly fit on one page. On the other hand, the facet_grid geometry allows you to explicitly specify how you want your plots to be arranged via formula notation (rows ~ columns; a . can be used as a placeholder that indicates only one row or column).

## One column, facet by rows
yearly_weight %>% filter(species_id %in% c("DM", "DO", "DS")) %>%
    ggplot(aes(x=year, y=avg_weight, color = species_id, group = species_id)) +
    geom_line() +
    facet_grid(sex ~ .)

# One row, facet by column
yearly_weight %>% filter(species_id %in% c("DM", "DO", "DS")) %>%
    ggplot(aes(x=year, y=avg_weight, color = species_id, group = species_id)) +
    geom_line() +
    facet_grid( ~ sex)

# separate panel for each sex and species
yearly_weight %>% filter(species_id %in% c("DM", "DO", "DS")) %>%
    ggplot(aes(x=year, y=avg_weight, color = species_id, group = species_id)) +
    geom_line() +
    facet_grid(species_id ~ sex)

Saving plots to a file

If you want to save a plot, to share with others, use the ggsave function.

The default is to save the last plot that you created, but I think it’s safer to first save the plot as an object and pass that to ggsave. Also give the height and width in inches.

p <- ggplot(reduced, aes(x=weight, y=hindfoot_length)) + geom_point()
ggsave("scatter.png", p, height=6, width=8)

The image file type is taken from the file name extension. To make a PDF instead:

ggsave("scatter.pdf", p, height=6, width=8)

Use scale to adjust the sizes of things, for example for a talk/poster versus a paper/report. Use scale < 1 to make the various elements bigger relative to the plotting area.

ggsave("scatter_2.png", p, height=6, width=8, scale=0.8)

Customizing plots

Axis limits

When faceting, the different panels are given common x- and y-axis limits. If we were to create separate plots (say one for each country), we would need to do a bit extra to ensure that common axis limits are used.

Recall the scale_x_log10() function that we had used to create the log scale for the x axis. This can take an argument limits (a vector of length 2) defining the minimum and maximum values plotted.

There is also a scale_y_log10() function, but if you want to change the y-axis limits without going to a log scale, you would use scale_y_continuous(). (Similarly, there’s a scale_x_continuous.)

For example, to plot the data for China, using axis limits defined by the full data, we’d do the following:

xrange <- range(reduced$weight)
yrange <- range(reduced$hindfoot_length)

p <- reduced %>% filter(species_id=="DM") %>%
    ggplot(aes(x=weight, y=hindfoot_length)) +
    geom_point()
p + scale_x_log10(limits=xrange) +
    scale_y_continuous(limits=yrange)

Color choices

If you don’t like the choices for point colors, you can customize them in a number of ways. First, you can use scale_color_manual() with a vector of your preferred choices. (If it’s fill rather than color that you want to change, you’ll need to use scale_fill_manual().)

p <- reduced %>% filter(species_id %in% c("DM", "DS", "DO")) %>%
    ggplot(aes(x=weight, y=hindfoot_length)) +
    geom_point(aes(color=species_id))

colors <- c("blue", "green", "orange")
p + scale_color_manual(values=colors)

You can also use RGB hex values.

hexcolors <- c("#001F3F", "#0074D9", "#01FF70")
p + scale_color_manual(values=hexcolors)

Themes

Not everyone gray background and such in the default ggplot plots.

But you can apply one of a variety of “themes” to control the overall appearance of plots.

One that a lot of people like is theme_bw(). Add it to a plot, and the overall appearance changes.

p + theme_bw()