Learning Objectives

Presentation of the Survey Data

We will continue to look at the species and weight of animals caught in plots in a study area in Arizona over time. The dataset is stored as a CSV file: each row holds information for a single animal, and the columns represent:

Column Description
record_id Unique id for the observation
month month of observation
day day of observation
year year of observation
plot_id ID of a particular plot
species_id 2-letter code
sex sex of animal (“M”, “F”)
hindfoot_length length of the hindfoot in mm
weight weight of the animal in grams
genus genus of animal
species species of animal
taxa e.g. Rodent, Reptile, Bird, Rabbit
plot_type type of plot

The data are available at http://kbroman.org/datacarp/portal_data_joined.csv.

We first use download_file() to download the data into the data/ subdirectory:


We then use read.csv() to load the data into R:

surveys <- read.csv('data/portal_data_joined.csv')

We can actually use read.csv to grab the data directly from the web, but it’s probably best to download a copy first.

surveys <- read.csv("http://kbroman.org/datacarp/portal_data_joined.csv")

The data are stored in what’s called a “data frame”. It’s a big rectangle, with rows being observations and columns being variables. The different columns can be different types (numeric, character, etc.), but they’re all the same length.

Use head() to view the first few rows.


Use tail() to view the first few rows.


Use str() to look at the structure of the data.



Based on the output of str(surveys), can you answer the following questions?

  • What is the class of the object surveys?
  • How many rows and how many columns are in this object?
  • How many species have been recorded during these surveys?

As you can see, many of the columns consist of integers, however, the columns species and sex are of a special class called a factor. Before we learn more about the data.frame class, let’s talk about factors. They are very useful but not necessarily intuitive, and therefore require some attention.


Factors are used to represent categorical data. Factors can be ordered or unordered, and understanding them is necessary for statistical analysis and for plotting.

Factors are stored as integers, and have labels associated with these unique integers. While factors look (and often behave) like character vectors, they are actually integers under the hood, and you need to be careful when treating them like strings.

Once created, factors can only contain a pre-defined set of values, known as levels. By default, R always sorts levels in alphabetical order. For instance, if you use factor() to create a factor with 2 levels:

sex <- factor(c("male", "female", "female", "male"))

R will assign 1 to the level "female" and 2 to the level "male" (because f comes before m, even though the first element in this vector is "male"). You can check this by using the function levels(), and check the number of levels using nlevels():


Sometimes, the order of the factors does not matter, other times you might want to specify a particular order.

food <- factor(c("low", "high", "medium", "high", "low", "medium", "high"))
food <- factor(food, levels=c("low", "medium", "high"))

Converting factors

If you need to convert a factor to a character vector, you use as.character(x).

Converting factors where the levels appear as numbers (such as concentration levels) to a numeric vector is a little trickier. One method is to convert factors to characters and then numbers. function. Compare:

f <- factor(c(1, 5, 10, 2))
as.numeric(f)               ## wrong! and there is no warning...
as.numeric(as.character(f)) ## works...


The function table() tabulates observations.

expt <- c("treat1", "treat2", "treat1", "treat3", "treat1",
          "control", "treat1", "treat2", "treat3")
expt <- factor(expt)
  • In which order are the treatments listed?
  • How can you recreate this table with “control” listed last instead of first?


The default when reading in data with read.csv(), columns with text get turned into factors.

You can avoid this with the argument stringsAsFactors=FALSE.

surveys_chr <- read.csv("data/portal_data_joined.csv", stringsAsFactors=FALSE)

Then when you look at the result of str(), you’ll see that the previously factor columns are now chr.


Constructing data frames “by hand”

You can also create a data frame manually with the function data.frame(). This function can also take the argument stringsAsFactors. Compare the output of these examples, and compare the difference between when the data are being read as character, and when they are being read as factor.

df1 <- data.frame(animal=c("dog", "cat", "sea cucumber", "sea urchin"),
                  feel=c("furry", "furry", "squishy", "spiny"),
                  weight=c(45, 8, 1.1, 0.8))
#> 'data.frame':    4 obs. of  3 variables:
#>  $ animal: Factor w/ 4 levels "cat","dog","sea cucumber",..: 2 1 3 4
#>  $ feel  : Factor w/ 3 levels "furry","spiny",..: 1 1 3 2
#>  $ weight: num  45 8 1.1 0.8
df2 <- data.frame(animal=c("dog", "cat", "sea cucumber", "sea urchin"),
                  feel=c("furry", "furry", "squishy", "spiny"),
                  weight=c(45, 8, 1.1, 0.8), stringsAsFactors=FALSE)
#> 'data.frame':    4 obs. of  3 variables:
#>  $ animal: chr  "dog" "cat" "sea cucumber" "sea urchin"
#>  $ feel  : chr  "furry" "furry" "squishy" "spiny"
#>  $ weight: num  45 8 1.1 0.8


There are a few mistakes in this hand crafted data.frame, can you spot and fix them? Don’t hesitate to experiment!

author_book <- data.frame(author_first=c("Charles", "Ernst", "Theodosius"),
                          author_last=c(Darwin, Mayr, Dobzhansky),
                          year=c(1942, 1970))

Inspecting data frames

We already saw how the functions head() and str() can be useful to check the content and the structure of a data.frame. Here is a non-exhaustive list of functions to get a sense of the content/structure of the data.

Note: most of these functions are “generic”, they can be used on other types of objects besides data.frame.

Indexing, Sequences, and Subsetting

We pulled out parts of a vector by indexing with square brackets. We can do the same thing with data frames, but we need to provide two values: row and column, with a comma between them.

For example, to get the element in the 1st row, 1st column:


To get the element in the 2nd row, 7th column:


To get the entire 2nd row, leave the column part blank:


And to get the entire 7th column, leave the row part blank:

sex <- surveys[,7]

You can also refer to columns by name, in multiple ways.

sex <- surveys[, "sex"]
ssex <- surveys[["sex"]]
sex <- surveys$sex

Treating blanks as missing

Another aside: it’s probably best to treat those blanks as missing (NA). To do that, use the argument na.strings when reading the data. na.strings can be a vector of multiple character strings. We need that a missing value code can never exist as a valid value, because they all will be converted to the missing data code NA. And note that the default for na.strings is "NA", which will cause problems if "NA" is a valid value for your data (e.g., as an abbreviation "North America").

surveys_noblanks <- read.csv("data/portal_data_joined.csv", na.strings="")


As with vectors, you can also use logical vectors when indexing.

weights_males <- surveys[surveys$sex == 'M', "weight"]
mean(weights_males, na.rm=TRUE)

mean(surveys[surveys$sex == 'F', "weight"], na.rm=TRUE)

Or you can use numeric vectors. To pull out larger slices, it’s helpful to have ways of creating sequences of numbers.

First, the operator : gives you a sequence of consecutive values.


seq is more flexible.

seq(1, 10, by=2)
seq(5, 10, length.out=3)
seq(50, by=5, length.out=10)
seq(1, 8, by=3) # sequence stops to stay below upper limit
seq(10, 2, by=-2)  # can also go backwards

To get slices of our data frame, we can include a vector for the row or column indexes (or both)

surveys[1:3, 7]   # first three elements in the 7th column
surveys[1, 1:3]   # first three columns in the first row
surveys[2:4, 6:7] # rows 2-4, columns 6-7


  1. The function nrow() on a data.frame returns the number of rows. Use it, in conjuction with seq() to create a new data.frame called surveys_by_10 that includes every 10th row of the survey data frame starting at row 10 (10, 20, 30, …)

  2. Create a data.frame containing only the observations from the year 1999 of the surveys dataset.