R/qtl2 (aka qtl2) is a reimplementation of the QTL analysis software R/qtl, to better handle high-dimensional data and complex cross designs.

Also see the related packages, qtl2convert (for converting data among the R/qtl2, DOQTL, and R/qtl formats), and qtl2fst (for storing genotype probabilities on disk, for rapid access with reduced memory usage).

R/qtl2 contains a lot of functions. To identify functions of interest, see the organized list of R/qtl2 functions, with brief descriptions.

Installation

Install R/qtl2 from CRAN:

install.packages("qtl2")

Data file format

The input data file formats for R/qtl cannot handle complex crosses, and so for R/qtl2, we have defined a new format for the data files. We’ll describe it here briefly; for details, see the separate vignette on the input file format.

QTL mapping data consists of a set of tables of data: marker genotypes, phenotypes, marker maps, etc. In the new format, these different tables are in separate comma-delimited (CSV) files. In each file, the first column is a set of IDs for the rows, and the first row is a set of IDs for the columns. For example, the phenotype data file will have individual IDs in the first column and phenotype names in the first row.

A few important changes in the tabular data:

  • We will use not just the genetic marker map, but also a physical map (if available).
  • Previously, phenotypes and covariates were combined. In the new format, we separate numeric phenotypes from the often non-numeric covariates.
  • We define a table of “phenotype covariates.” These are metadata describing the phenotypes. For example, in the case of a phenotype measured over time, one column in the phenotype covariate data could be the time of measurement. For gene expression data, we would have columns representing chromosome and physical position of genes, as well as gene IDs.

In addition to the set of CSV files with the primary data, we need a separate “control” file with various control parameters (or metadata), including the names of all of the other data files and the genotype codes used in the genotype data file. The control file is in a specific format using either YAML or JSON; these are human-readable text files for representing relatively complex data.

A big advantage of this control file scheme is that it greatly simplifies the function for reading in the data. That function, read_cross2(), has a single argument: the name (with path) of the control file. So you can read in data like this:

library(qtl2)
grav2 <- read_cross2("~/my_data/grav2.yaml")

The large number of files is a bit cumbersome, so we’ve made it possible to use a zip file containing all of the data files, and to read that zip file directly. There’s even a function for creating the zip file:

zip_datafiles("~/my_data/grav2.yaml")

This zip_datafiles() function will read the control file to identify all of the relevant data files and then zip them up into a file with the same name and location, but with the extension .zip rather than .yaml or .json.

To read the data back in, we use the same read_cross2() function, providing the name (and path) of the zip file rather than the control file.

grav2 <- read_cross2("~/my_data/grav2.zip")

This can even be done with remote files.

grav2 <- read_cross2("https://kbroman.org/qtl2/assets/sampledata/grav2/grav2.zip")

Of course, the other advantage of the zip file is that it is compressed and so smaller than the combined set of CSV files.

The control file may be confusing for some users. To assist in its construction, there’s a function write_control_file() that takes the large set of control parameters as input and then writes the YAML or JSON control file in the appropriate format.

Sample data sets

The R/qtl2 web site includes sample data files in the new format. Zipped versions of these datasets are included with the qtl2 package and can be loaded into R using the read_cross2() function.

In the qtl2 package source, the sample zip files are located in qtl2/inst/extdata. In the installed version of the package, they are in qtl2/extdata, within whatever directory your R packages were installed. The R function system.file() can be used to construct the path to these files.

For example, one of the sample data sets concerns a gravitropism phenotype in a set of Arabidopsis recombinant inbred lines (RIL), from Moore et al. (2013) Genetics 195:1077-1086. The data are in qtl2/extdata/grav2.zip, which can be loaded as follows:

library(qtl2)
grav2 <- read_cross2( system.file("extdata", "grav2.zip", package="qtl2") )

Additional sample data sets, including data on Diversity Outbred (DO) mice, are available at https://github.com/rqtl/qtl2data.

Calculating genotype probabilities

The first basic task in QTL analysis is to calculate conditional genotype probabilities, given the observed marker data, at each putative QTL position. This is accomplished with the calc_genoprob() function. Unlike the corresponding function in R/qtl, calc.genoprob(), the result is not inserted back into the input cross object, but is returned as a list of three-dimensional arrays (one per chromosome). Each 3d array of probabilities is arranged as individuals × genotypes × positions.

If we wish to perform QTL calculations at positions between markers (so called “pseudomarkers”), we first need to insert such positions into the genetic map with the function insert_pseudomarkers(). Unlike R/qtl, the map is kept separate from the genotype probabilities.

We’ll use the iron dataset from Grant et al. (2006) Hepatology 44:174-185 (an intercross) as an example. We first load the data:

library(qtl2)
iron <- read_cross2( system.file("extdata", "iron.zip", package="qtl2") )

We then use insert_pseudomarkers() to insert pseudomarkers into the genetic map, which we grab from the iron object as iron$gmap:

map <- insert_pseudomarkers(iron$gmap, step=1)

And next we use calc_genoprob() to calculate the QTL genotype probabilities.

pr <- calc_genoprob(iron, map, error_prob=0.002)

To speed up the calculations with large datasets on a multi-core machine, you can use the argument cores. With cores=0, the number of available cores will be detected via parallel::detectCores(). Otherwise, specify the number of cores as a positive integer.

pr <- calc_genoprob(iron, map, error_prob=0.002, cores=4)

The genome scan functions (see below) use genotype probabilities as well as a matrix of phenotypes. If you wished to perform a genome scan via an additive allele model, you would first convert the genotype probabilities to allele probabilities, using the function genoprob_to_alleleprob().

apr <- genoprob_to_alleleprob(pr)

Calculating a kinship matrix

If you wish to perform a genome scan by a linear mixed model, accounting for the relationships among individuals (in other words, including a random polygenic effect), you’ll need to calculate a kinship matrix for the individuals. This is accomplished with the calc_kinship() function. It takes the genotype probabilities as input.

kinship <- calc_kinship(pr)

By default, the genotype probabilities are converted to allele probabilities, and the kinship matrix is calculated as the proportion of shared alleles. To use genotype probabilities instead, use use_allele_probs=FALSE. Further, by default we calculate the kinship matrix using both the autosomes and the X chromosome. To omit the X chromosome, use omit_x=TRUE.

In calculating the kinship matrix, one may wish to eliminate the effect of varying marker density across the genome, and only use the probabilities along the grid of pseudomarkers (defined by the step argument to insert_pseudomarkers()). To do so, we need to first use calc_grid() to determine the grid of pseudomarkers, and then probs_to_grid() to omit probabilities for positions that are not on the grid.

grid <- calc_grid(iron$gmap, step=1)
pr_grid <- probs_to_grid(pr, grid)
kinship_grid <- calc_kinship(pr_grid)

If, for your linear mixed model genome scan, you wish to use the “leave one chromosome out” (LOCO) method (scan each chromosome using a kinship matrix that is calculated using data from all other chromosomes), use type="loco" in the call to calc_kinship().

kinship_loco <- calc_kinship(pr, "loco")

On a multi-core machine, you can get some speed-up via the cores argument, as with calc_genoprob().

kinship_loco <- calc_kinship(pr, "loco", cores=4)

Special covariates for the X chromosome

In a QTL scan of the X chromosome, special covariates (such as sex) may need to be included under the null hypothesis of no QTL, to avoid spurious evidence of linkage. (See Broman et al. (2006) Genetics 174:2151-2158.)

The particular X chromosome covariates depend on the cross, and can be obtained with the function get_x_covar().

Xcovar <- get_x_covar(iron)

Performing a genome scan

To perform a genome scan by Haley-Knott regression (Haley and Knott 1992), use the function scan1().

scan1() takes as input the genotype probabilities, a matrix of phenotypes, optional additive and interactive covariates, and the special X chromosome covariates. Another option is to provide a vector of weights.

out <- scan1(pr, iron$pheno, Xcovar=Xcovar)

On a multi-core machine, you can get some speed-up via the cores argument, as with calc_genoprob() and calc_kinship().

out <- scan1(pr, iron$pheno, Xcovar=Xcovar, cores=4)

The output of scan1() is a matrix of LOD scores, positions × phenotypes.

The function plot_scan1() can be used to plot the LOD curves. You can just write plot(), as there’s an S3 method plot.scan1() and the output of scan1() has class "scan1". Use the argument lodcolumn to indicate which column to plot. Unlike the plot.scanone() function in R/qtl, the plot_scan1() function will only plot one set of LOD curves at a time, and you need to provide the marker/pseudomarker map (created by insert_pseudomarkers()).

par(mar=c(5.1, 4.1, 1.1, 1.1))
ymx <- maxlod(out) # overall maximum LOD score
plot(out, map, lodcolumn=1, col="slateblue", ylim=c(0, ymx*1.02))
plot(out, map, lodcolumn=2, col="violetred", add=TRUE)
legend("topleft", lwd=2, col=c("slateblue", "violetred"), colnames(out), bg="gray90")

Finding LOD peaks

The function find_peaks() can be used to identify a set of LOD peaks that exceed some threshold. It can also provide LOD support or Bayes credible intervals, by using the arguments drop (the amount to drop in the LOD support intervals) or prob (the nominal coverage for the Bayes credible intervals).

You need to provide both the scan1() output as well as the marker/pseudomarker map.

find_peaks(out, map, threshold=4, drop=1.5)
##   lodindex lodcolumn chr  pos     lod ci_lo ci_hi
## 1        1     liver   2 56.8  4.9576  48.1  73.2
## 2        1     liver   7 50.1  4.0508  13.1  53.6
## 3        1     liver  16 28.6  7.6816   6.6  40.4
## 4        2    spleen   8 13.6  4.3029   0.0  32.7
## 5        2    spleen   9 56.6 12.0632  53.6  61.2

The find_peaks() function can also pick out multiple peaks on a chromosome: each peak must exceed the chosen threshold, and the argument peakdrop indicates the amount that the LOD curve must drop below the lowest of two adjacent peaks. Use this feature with caution.

find_peaks(out, map, threshold=4, peakdrop=1.8, drop=1.5)
##   lodindex lodcolumn chr  pos     lod ci_lo ci_hi
## 1        1     liver   2 56.8  4.9576  48.1  73.2
## 2        1     liver   7 25.1  4.0400  13.1  28.4
## 3        1     liver   7 50.1  4.0508  31.7  53.6
## 4        1     liver  16 28.6  7.6816   6.6  40.4
## 5        2    spleen   8 13.6  4.3029   0.0  32.7
## 6        2    spleen   9 56.6 12.0632  53.6  61.2

The functions lod_int() and bayes_int() can be used to derive the LOD support or Bayes credible intervals for QTL, for a specific chromosome and LOD score column. For example, to obtain the Bayes interval for the locus on chromosome 9 for the second phenotype (“spleen”):

bayes_int(out, map, lodcolumn=2, chr=9, prob=0.95)
##   ci_lo  pos ci_hi
## 1  53.6 56.6  61.2

Both lod_int() and bayes_int() take a peakdrop argument, if you wish to try to identify multiple peaks on a chromosome. Again, use this feature with caution.

lod_int(out, map, lodcolumn=1, chr=7, peakdrop=1.8, drop=1.5)
##   ci_lo  pos ci_hi
## 1  13.1 25.1  28.4
## 2  31.7 50.1  53.6

Each row is a different peak; the columns are the lower interval endpoint, the estimated QTL position, and the upper interval endpoint.

By default in find_peaks(), bayes_int(), and lod_int(), the interval endpoints are at markers. This is controlled with the argument expand2markers, which defaults to TRUE. To allow the interval endpoints to be at positions between markers (“pseudomarkers”), include expand2markers=FALSE.

lod_int(out, map, lodcolumn=1, chr=7, peakdrop=1.8, drop=1.5, expand2markers=FALSE)
##   ci_lo  pos ci_hi
## 1  13.1 25.1  28.4
## 2  34.1 50.1  53.6

Performing a genome scan with a linear mixed model

To perform a genome scan using a linear mixed model, accounting for relationships among individuals using a random polygenic effect, you also use the function scan1(); you just need to provide the argument kinship, a kinship matrix (or, for the LOCO method, a list of kinship matrices).

out_pg <- scan1(pr, iron$pheno, kinship, Xcovar=Xcovar)

Again, on a multi-core machine, you can get some speed-up using the cores argument.

out_pg <- scan1(pr, iron$pheno, kinship, Xcovar=Xcovar, cores=4)

For the LOCO (leave one chromosome out) method, provide the list of kinship matrices as obtained from calc_kinship() with method="loco".

out_pg_loco <- scan1(pr, iron$pheno, kinship_loco, Xcovar=Xcovar)

To plot the results, we again use plot_scan1(), or just type plot().

Here is a plot of the LOD scores, by Haley-Knott regression and the linear mixed model using either the standard kinship matrix or the LOCO method.

color <- c("slateblue", "violetred", "green3")
par(mar=c(4.1, 4.1, 1.6, 1.1))
ymx <- max(maxlod(out), maxlod(out_pg), maxlod(out_pg_loco))
for(i in 1:2) {
    plot(out, map, lodcolumn=i, col=color[1], main=colnames(iron$pheno)[i],
              ylim=c(0, ymx*1.02))
    plot(out_pg, map, lodcolumn=i, col=color[2], add=TRUE)
    plot(out_pg_loco, map, lodcolumn=i, col=color[3], add=TRUE, lty=2)
    legend("topleft", lwd=2, col=color, c("H-K", "LMM", "LOCO"), bg="gray90", lty=c(1,1,2))
}