Including datasets

It can be useful to include example datasets in your R package, to use in examples or vignettes or to illustrate a data format.

If your example datasets are enormous, you might want to make a separate package just for the data. Examples of data packages include Hadley Wickham’s babynames, nycflights13, and usdanutrients packages.

To include datasets with your package, create a data subdirectory and place your datasets there as .RData files (the extension .rda also works). Use the save function to create such files, as follows:

save(mydata, file="data/mydata.RData")
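
As a minimal sketch of the whole step, the following builds a small data frame and saves it under data/. The object name mydata, its columns, and the temporary directory standing in for the package root are all made up for illustration; in a real package you would run save from the package's top-level directory.

```r
# Hypothetical package root; a temporary directory stands in for it here
pkg_root <- file.path(tempdir(), "mypkg")
dir.create(file.path(pkg_root, "data"), recursive = TRUE, showWarnings = FALSE)

# A made-up dataset, for illustration only
mydata <- data.frame(id = 1:3, weight = c(21.5, 19.8, 23.1))

# Save it under data/, naming the file after the object
data_file <- file.path(pkg_root, "data", "mydata.RData")
save(mydata, file = data_file)

# Check: load the file into a fresh environment and see what it contains
e <- new.env()
loaded <- load(data_file, envir = e)
```

Here load returns the names of the objects it restored, so loaded is a quick way to confirm what the file holds.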

Next, create a .R file with Roxygen2 comments that will produce the documentation for the dataset, and place the file in the R subdirectory with all of your other .R files. Here’s an example, for the dataset grav in my R/qtlcharts package; see grav-data.R.

#' Arabidopsis QTL data on gravitropism
#'
#' Data from a QTL experiment on gravitropism in
#' Arabidopsis, with data on 162 recombinant inbred lines (Ler x
#' Cvi). The outcome is the root tip angle (in degrees) at two-minute
#' increments over eight hours.
#'
#' @docType data
#'
#' @usage data(grav)
#'
#' @format An object of class \code{"cross"}; see \code{\link[qtl]{read.cross}}.
#'
#' @keywords datasets
#'
#' @references Moore et al. (2013) Genetics 195:1077-1086
#' (\href{https://www.ncbi.nlm.nih.gov/pubmed/23979570}{PubMed})
#'
#' @source \href{https://phenome.jax.org/projects/Moore1b}{QTL Archive}
#'
#' @examples
#' data(grav)
#' times <- attr(grav, "time")
#' phe <- grav$pheno
#' \donttest{iplotCurves(phe, times)}
"grav"

This is much like documenting a function, but we need to include @docType data and @usage data(grav), and where the function definition would ordinarily go, we just include a line with the name of the dataset as a character string.

You’ll want to describe the @format of the data, and it’s good to include the @source (where you got it) and @references. And everyone likes @examples.
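
For a dataset that is a plain data frame, the @format field is a natural place to list the columns. Here is a sketch for a hypothetical dataset mydata with made-up columns id and weight; the title, description, and source are placeholders, not taken from any real package.

```r
#' Body weights for three example subjects
#'
#' A small made-up dataset, used only to illustrate
#' how a data frame might be documented.
#'
#' @docType data
#'
#' @usage data(mydata)
#'
#' @format A data frame with 3 rows and 2 columns:
#' \describe{
#'   \item{id}{Subject identifier (integer)}
#'   \item{weight}{Body weight in grams (numeric)}
#' }
#'
#' @keywords datasets
#'
#' @source Simulated for illustration (placeholder)
#'
#' @examples
#' data(mydata)
#' summary(mydata$weight)
"mydata"
```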

That’s it! Put .RData datasets in data/ and add Roxygen2 documentation in a .R file in R/.

Well, one more thing: you might also want to include the following line in the DESCRIPTION file for your package:

LazyData: true

If you do this, the datasets in your package will be immediately available when the package is loaded; there’ll be no need to use data(). The data isn’t actually loaded into R until you use it (that’s what “lazy load” means).
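
To see the difference in practice, here is a small sketch that uses R's built-in datasets package as a stand-in for your own package (an assumption made so the example runs anywhere; datasets lazy-loads its data). It fetches the same dataset once through data() and once directly by name.

```r
# The built-in 'datasets' package is used here only as a stand-in
# for a package that lazy-loads its data.

# Explicit route: data() copies the dataset into an environment,
# here a scratch environment so the workspace stays clean
e <- new.env()
utils::data("mtcars", package = "datasets", envir = e)
explicit <- e$mtcars

# Lazy-loading route: refer to the dataset directly by name
lazy <- datasets::mtcars

# Compare the two results
same <- isTRUE(all.equal(explicit, lazy))
```

With LazyData: true in your own package, users would similarly be able to refer to the dataset by name after library(), without calling data() first.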

Okay, one more thing: you can also include R code (in a .R file) in the data directory, and also tabular data as .txt or .csv files. (See Data in packages in the Writing R Extensions manual.) An advantage to this is that the data could be viewed on GitHub, if you put your package there. And you might use such .txt or .csv files to demonstrate file formats and how to load data into R.
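
One detail worth checking if you go the .csv route: according to the help page for data(), files in data/ ending in .csv are read with read.table(..., header = TRUE, sep = ";", as.is = FALSE), so semicolon-separated files are expected there. The sketch below writes a tiny semicolon-separated file and reads it back the same way; the file name, columns, and temporary directory are made up for illustration.

```r
# Hypothetical package root; a temporary directory stands in for it here
pkg_root <- file.path(tempdir(), "mypkg_csv")
dir.create(file.path(pkg_root, "data"), recursive = TRUE, showWarnings = FALSE)

# A made-up table, for illustration only
strains <- data.frame(strain = c("A", "B", "C"), weight = c(21.5, 19.8, 23.1))

# Write it as a semicolon-separated .csv file under data/
csv_file <- file.path(pkg_root, "data", "strains.csv")
write.table(strains, csv_file, sep = ";", row.names = FALSE, quote = FALSE)

# Read it back with the arguments that ?data lists for .csv files
reread <- read.table(csv_file, header = TRUE, sep = ";", as.is = FALSE)
```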

Now go to the page about connecting to other R packages.