Learning Objectives
- load external data (CSV files) in memory using the survey table (surveys.csv) as an example
- explore the structure and the content of a data frame in R
- understand what factors are and how to manipulate them
- understand the concept of a data.frame
- use sequences
- know how to access any element of a data.frame
We will continue to look at the species and weight of animals caught in plots in a study area in Arizona over time. The dataset is stored as a CSV file: each row holds information for a single animal, and the columns represent:
Column | Description |
---|---|
record_id | Unique id for the observation |
month | month of observation |
day | day of observation |
year | year of observation |
plot_id | ID of a particular plot |
species_id | 2-letter code |
sex | sex of animal (“M”, “F”) |
hindfoot_length | length of the hindfoot in mm |
weight | weight of the animal in grams |
genus | genus of animal |
species | species of animal |
taxa | e.g. Rodent, Reptile, Bird, Rabbit |
plot_type | type of plot |
The data are available at http://kbroman.org/datacarp/portal_data_joined.csv.
We first use download.file() to download the data into the data/ subdirectory:
download.file("http://kbroman.org/datacarp/portal_data_joined.csv",
              "data/portal_data_joined.csv")
We then use read.csv() to load the data into R:
surveys <- read.csv('data/portal_data_joined.csv')
We can actually use read.csv() to grab the data directly from the web, but it’s probably best to download a copy first.
surveys <- read.csv("http://kbroman.org/datacarp/portal_data_joined.csv")
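Since keeping a local copy is the recommended route, the download step can be made safe to re-run. The following is only a sketch: fetch_if_missing() is a hypothetical helper name, not something the lesson or base R provides, and it assumes you want to skip the download whenever the destination file already exists.

```r
# Hypothetical helper (an assumption for illustration, not part of the lesson):
# download a file only if there is no local copy yet.
fetch_if_missing <- function(url, destfile) {
  # Create the target directory (e.g. "data/") if it does not exist yet
  dir.create(dirname(destfile), showWarnings = FALSE, recursive = TRUE)
  # Only go to the network when the file is not already on disk
  if (!file.exists(destfile)) {
    download.file(url, destfile)
  }
  # Return the local path invisibly so it can be passed on to read.csv()
  invisible(destfile)
}
```

Calling fetch_if_missing("http://kbroman.org/datacarp/portal_data_joined.csv", "data/portal_data_joined.csv") and then reading the local file with read.csv() would touch the network only on the first run.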
The data are stored in what’s called a “data frame”. It’s a big rectangle, with rows being observations and columns being variables. The different columns can be different types (numeric, character, etc.), but they’re all the same length.
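To make the "rectangle" idea concrete, here is a small sketch using a made-up data frame (the object toy and its values are invented for illustration and are not taken from the survey data). Each column has its own type, but all columns have the same length.

```r
# A toy data frame with made-up values (not from the survey data)
toy <- data.frame(id = 1:3,
                  weight = c(40.5, 48.0, 29.3),
                  species_id = c("NL", "DM", "PF"),
                  stringsAsFactors = FALSE)

sapply(toy, class)   # one class per column: integer, numeric, character
sapply(toy, length)  # every column has the same length (3)
```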
Use head() to view the first few rows.
head(surveys)
Use tail() to view the last few rows.
tail(surveys)
Use str() to look at the structure of the data.
str(surveys)
Based on the output of str(surveys), can you answer the following questions?
- … surveys?
As you can see, many of the columns consist of integers; however, the columns species and sex are of a special class called a factor. Before we learn more about the data.frame class, let’s talk about factors. They are very useful but not necessarily intuitive, and therefore require some attention.
Factors are used to represent categorical data. Factors can be ordered or unordered, and understanding them is necessary for statistical analysis and for plotting.
Factors are stored as integers, and have labels associated with these unique integers. While factors look (and often behave) like character vectors, they are actually integers under the hood, and you need to be careful when treating them like strings.
Once created, factors can only contain a pre-defined set of values, known as levels. By default, R always sorts levels in alphabetical order. For instance, if you use factor() to create a factor with 2 levels:
sex <- factor(c("male", "female", "female", "male"))
R will assign 1 to the level "female" and 2 to the level "male" (because f comes before m, even though the first element in this vector is "male"). You can check this by using the function levels(), and check the number of levels using nlevels():
levels(sex)
nlevels(sex)
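To see the "integers under the hood" directly, you can strip a factor down to its integer codes and then map the codes back to their labels. This is a small sketch that reuses the sex example from above:

```r
sex <- factor(c("male", "female", "female", "male"))

typeof(sex)                    # "integer": the underlying storage type
as.integer(sex)                # 2 1 1 2: the codes ("female" is 1, "male" is 2)
levels(sex)[as.integer(sex)]   # "male" "female" "female" "male": codes mapped back to labels
```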
Sometimes the order of the levels does not matter; other times you might want to specify a particular order.
food <- factor(c("low", "high", "medium", "high", "low", "medium", "high"))
levels(food)
food <- factor(food, levels=c("low", "medium", "high"))
levels(food)
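Setting the levels as above controls the order in which they are listed, but the factor is still unordered. If you also want R to treat the levels as ranked, factor() accepts ordered=TRUE. A minimal sketch (the name food_ord is introduced here only for illustration):

```r
food_ord <- factor(c("low", "high", "medium", "high", "low", "medium", "high"),
                   levels = c("low", "medium", "high"), ordered = TRUE)

is.ordered(food_ord)         # TRUE
food_ord[1] < food_ord[2]    # TRUE: "low" ranks below "high"
max(food_ord)                # high
```

Comparisons such as < are only meaningful for ordered factors; on an unordered factor they return NA with a warning.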
If you need to convert a factor to a character vector, you use as.character(x).
Converting factors where the levels appear as numbers (such as concentration levels) to a numeric vector is a little trickier. One method is to convert factors to characters and then to numbers. Compare:
f <- factor(c(1, 5, 10, 2))
as.numeric(f) ## wrong! and there is no warning...
as.numeric(as.character(f)) ## works...
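To see why the first call goes wrong, it helps to look at the levels next to the integer codes. The sketch below also shows an alternative idiom, indexing the converted levels by the factor, which gives the same result as going through as.character():

```r
f <- factor(c(1, 5, 10, 2))

levels(f)                  # "1" "2" "5" "10"
as.numeric(f)              # 1 3 4 2: the integer codes, not the original values
as.numeric(levels(f))[f]   # 1 5 10 2: convert the levels once, then index by the codes
```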
The function table() tabulates observations.
expt <- c("treat1", "treat2", "treat1", "treat3", "treat1",
          "control", "treat1", "treat2", "treat3")
expt <- factor(expt)
table(expt)
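- … “control” listed last instead of first?
When reading in data with read.csv(), columns with text get turned into factors by default in versions of R before 4.0.0 (from R 4.0.0 onward, the default is stringsAsFactors=FALSE, so text columns stay as character). In any version, you can avoid the conversion with the argument stringsAsFactors=FALSE.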
surveys_chr <- read.csv("data/portal_data_joined.csv", stringsAsFactors=FALSE)
Then when you look at the result of str(), you’ll see that the previously factor columns are now chr.
str(surveys_chr)
You can also create a data frame manually with the function data.frame(). This function can also take the argument stringsAsFactors. Compare the output of these examples, and note the difference between when the data are read as character and when they are read as factor.
df1 <- data.frame(animal=c("dog", "cat", "sea cucumber", "sea urchin"),
                  feel=c("furry", "furry", "squishy", "spiny"),
                  weight=c(45, 8, 1.1, 0.8),
                  stringsAsFactors=TRUE)  # explicit, since the default became FALSE in R 4.0.0
str(df1)
#> 'data.frame': 4 obs. of 3 variables:
#> $ animal: Factor w/ 4 levels "cat","dog","sea cucumber",..: 2 1 3 4
#> $ feel : Factor w/ 3 levels "furry","spiny",..: 1 1 3 2
#> $ weight: num 45 8 1.1 0.8
df2 <- data.frame(animal=c("dog", "cat", "sea cucumber", "sea urchin"),
                  feel=c("furry", "furry", "squishy", "spiny"),
                  weight=c(45, 8, 1.1, 0.8), stringsAsFactors=FALSE)
str(df2)
#> 'data.frame': 4 obs. of 3 variables:
#> $ animal: chr "dog" "cat" "sea cucumber" "sea urchin"
#> $ feel : chr "furry" "furry" "squishy" "spiny"
#> $ weight: num 45 8 1.1 0.8
There are a few mistakes in this hand-crafted data.frame. Can you spot and fix them? Don’t hesitate to experiment!
author_book <- data.frame(author_first=c("Charles", "Ernst", "Theodosius"),
author_last=c(Darwin, Mayr, Dobzhansky),
year=c(1942, 1970))
We already saw how the functions head() and str() can be useful to check the content and the structure of a data.frame. Here is a non-exhaustive list of functions to get a sense of the content/structure of the data.
- dim() - returns a vector with the number of rows in the first element, and the number of columns as the second element (the dimensions of the object)
- nrow() - returns the number of rows
- ncol() - returns the number of columns
- head() - shows the first 6 rows
- tail() - shows the last 6 rows
- names() - returns the column names (synonym of colnames() for data.frame objects)
- rownames() - returns the row names
- str() - structure of the object and information about the class, length and content of each column
- summary() - summary statistics for each column
Note: most of these functions are “generic”; they can be used on other types of objects besides data.frame.
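As a quick illustration of the functions listed above, here they are applied to a tiny made-up data frame (the object toy and its values are invented for this sketch; the same calls work on surveys):

```r
# Toy data with made-up values, including one missing weight
toy <- data.frame(plot_id = c(2L, 2L, 3L, 7L),
                  weight = c(40, NA, 48, 29))

dim(toy)        # 4 2: rows first, then columns
nrow(toy)       # 4
ncol(toy)       # 2
names(toy)      # "plot_id" "weight"
rownames(toy)   # "1" "2" "3" "4"
str(toy)        # class, length and content of each column
summary(toy)    # per-column summary statistics, including the count of NAs
```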
We pulled out parts of a vector by indexing with square brackets. We can do the same thing with data frames, but we need to provide two values: row and column, with a comma between them.
For example, to get the element in the 1st row, 1st column:
surveys[1,1]
To get the element in the 2nd row, 7th column:
surveys[2,7]
To get the entire 2nd row, leave the column part blank:
surveys[2,]
And to get the entire 7th column, leave the row part blank:
sex <- surveys[,7]
You can also refer to columns by name, in multiple ways.
sex <- surveys[, "sex"]
sex <- surveys[["sex"]]
sex <- surveys$sex
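A quick way to convince yourself that these three forms return the same thing is to compare them on a small made-up data frame (toy and the variable names below are invented for this sketch):

```r
toy <- data.frame(record_id = 1:3,
                  sex = c("M", "F", "M"),
                  stringsAsFactors = FALSE)

by_bracket <- toy[, "sex"]
by_double  <- toy[["sex"]]
by_dollar  <- toy$sex

identical(by_bracket, by_double)   # TRUE
identical(by_double, by_dollar)    # TRUE

# Single brackets without a comma behave differently:
# they return a one-column data.frame rather than a vector
class(toy["sex"])                  # "data.frame"
```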
Another aside: it’s probably best to treat blank fields in the data as missing (NA). To do that, use the argument na.strings when reading the data. na.strings can be a vector of multiple character strings. A missing value code must never also occur as a valid value, because every occurrence will be converted to the missing data code NA. And note that the default for na.strings is "NA", which will cause problems if "NA" is a valid value for your data (e.g., as an abbreviation for "North America").
surveys_noblanks <- read.csv("data/portal_data_joined.csv", na.strings="")
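Because na.strings can hold several codes, you can treat both blanks and the literal text NA as missing. The sketch below uses a few made-up lines of CSV passed through the text argument of read.csv(), so it runs without any file (the data are invented, not taken from the survey):

```r
# Made-up CSV content: a blank sex in row 2, a literal NA weight in row 3
csv_text <- "record_id,sex,weight
1,M,40
2,,48
3,F,NA"

toy <- read.csv(text = csv_text, na.strings = c("", "NA"),
                stringsAsFactors = FALSE)

toy$sex             # "M" NA "F": the blank became NA
toy$weight          # 40 48 NA
is.na(toy$sex)      # FALSE TRUE FALSE
```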
As with vectors, you can also use logical vectors when indexing.
weights_males <- surveys[surveys$sex == 'M', "weight"]
mean(weights_males, na.rm=TRUE)
mean(surveys[surveys$sex == 'F', "weight"], na.rm=TRUE)
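One thing to watch for with logical indexing: if the column you test contains NA (as it would after reading the data with na.strings=""), the comparison itself yields NA, and indexing with NA produces NA entries. The following sketch on made-up data shows the effect and one common way around it, wrapping the condition in which():

```r
# Toy data with made-up values: one missing sex, one missing weight
toy <- data.frame(sex = c("M", "F", NA, "M"),
                  weight = c(40, 35, 50, NA),
                  stringsAsFactors = FALSE)

is_male <- toy$sex == "M"
is_male                          # TRUE FALSE NA TRUE

toy[is_male, "weight"]           # 40 NA NA: the NA in the index adds an NA entry
toy[which(is_male), "weight"]    # 40 NA: which() keeps only the TRUE positions

mean(toy[which(is_male), "weight"], na.rm = TRUE)   # 40
```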
Or you can use numeric vectors. To pull out larger slices, it’s helpful to have ways of creating sequences of numbers.
First, the operator : gives you a sequence of consecutive values.
1:10
10:1
5:8
seq() is more flexible.
seq(1, 10, by=2)
seq(5, 10, length.out=3)
seq(50, by=5, length.out=10)
seq(1, 8, by=3) # sequence stops to stay below upper limit
seq(10, 2, by=-2) # can also go backwards
To get slices of our data frame, we can include a vector for the row or column indexes (or both):
surveys[1:3, 7] # first three elements in the 7th column
surveys[1, 1:3] # first three columns in the first row
surveys[2:4, 6:7] # rows 2-4, columns 6-7
The function nrow() on a data.frame returns the number of rows. Use it, in conjunction with seq(), to create a new data.frame called surveys_by_10 that includes every 10th row of the surveys data frame starting at row 10 (10, 20, 30, …).
Create a data.frame containing only the observations from the year 1999 of the surveys dataset.