R/qtl2 user guide

R/qtl2 (aka qtl2) is a reimplementation of the QTL analysis software R/qtl, to better handle high-dimensional data and complex cross designs.

We expect that basic analyses with R/qtl2 will generally be performed in “batch” (for example, on a cluster) rather than interactively. And so the software is split into three parts: qtl2geno for genotype probability calculations, qtl2scan for QTL scans, and qtl2plot for data visualization. A further package, qtl2convert, contains functions for converting data among the R/qtl2, DOQTL, and R/qtl formats, for example to convert genotype probabilities produced by DOQTL to the format needed by qtl2scan, or to convert qtl2scan results to the format produced by scanone in R/qtl, so that they may be graphed with the R/qtl functions.

Installation

R/qtl2 is not yet available on CRAN, but it can be installed from a mini-CRAN at rqtl.org.

install.packages("qtl2", repos="https://rqtl.org/qtl2cran")

The qtl2 package is inspired by the tidyverse package; it is basically empty, but when you install it, the qtl2geno, qtl2scan, qtl2plot, and qtl2convert packages, plus a bunch of dependencies, will be installed.

Alternatively, you can install R/qtl2 from its source on GitHub. (But note that compiling the C++ code can be rather slow.)

On Windows, you’ll need Rtools.

On Mac OS X, you’ll need the command-line developer tools, as well as gfortran.

You then need to install the devtools package, plus a set of package dependencies: yaml, jsonlite, data.table, and RcppEigen. (Additional, secondary dependencies will also be installed.)

install.packages(c("devtools", "yaml", "jsonlite", "data.table", "RcppEigen"))

Finally, install R/qtl2 using devtools::install_github().

library(devtools)
install_github("rqtl/qtl2")

Data file format

The input data file formats for R/qtl cannot handle complex crosses, and so for R/qtl2, we have defined a new format for the data files. We’ll describe it here briefly; for details, see the separate vignette on the input file format.

QTL mapping data consists of a set of tables of data: marker genotypes, phenotypes, marker maps, etc. In the new format, these different tables are in separate comma-delimited (CSV) files. In each file, the first column is a set of IDs for the rows, and the first row is a set of IDs for the columns. For example, the phenotype data file will have individual IDs in the first column and phenotype names in the first row.

A few important changes in the tabular data:

In addition to the set of CSV files with the primary data, we need a separate “control” file with various control parameters (or metadata), including the names of all of the other data files and the genotype codes used in the genotype data file. The control file is in a specific format using either YAML or JSON; these are human-readable text files for representing relatively complex data.

A big advantage of this control file scheme is that it greatly simplifies the function for reading in the data. That function, read_cross2(), has a single argument: the name (with path) of the control file. So you can read in data like this:

library(qtl2geno)
grav2 <- read_cross2("~/my_data/grav2.yaml")

Note: it can sometimes confusing which packages you need to load. If you install the (largely empty) qtl2 package, you can type just

library(qtl2)

and the three main packages, qtl2geno, qtl2scan, and qtl2plot, will be loaded.

The large number of files is a bit cumbersome, so we’ve made it possible to use a zip file containing all of the data files, and to read that zip file directly. There’s even a function for creating the zip file:

zip_datafiles("~/my_data/grav2.yaml")

This zip_datafiles() function will read the control file to identify all of the relevant data files and then zip them up into a file with the same name and location, but with the extension .zip rather than .yaml or .json.

To read the data back in, we use the same read_cross2() function, providing the name (and path) of the zip file rather than the control file.

grav2 <- read_cross2("~/my_data/grav2.zip")

This can even be done with remote files.

grav2 <- read_cross2("http://kbroman.org/qtl2/assets/sampledata/grav2/grav2.zip")

Of course, the other advantage of the zip file is that it is compressed and so smaller than the combined set of CSV files.

The control file may be confusing for some users. To assist in its construction, there’s a function write_control_file() that takes the large set of control parameters as input and then writes the YAML control file in the appropriate format.

Sample data sets

The R/qtl2 web site includes sample data files in the new format. Zipped versions of these datasets are included with the qtl2geno package and can be loaded into R using the read_cross2() function.

In the qtl2geno package source, the sample zip files are located in qtl2geno/inst/extdata. In the installed version of the package, they are in qtl2geno/extdata, within whatever directory your R packages were installed. The R function system.file() can be used to construct the path to these files.

For example, one of the sample data sets concerns a gravitropism phenotype in a set of Arabidopsis recombinant inbred lines (RIL), from Moore et al. (2013) Genetics 195:1077-1086. The data are in qtl2geno/extdata/grav2.zip, which can be loaded as follows:

library(qtl2geno)
grav2 <- read_cross2( system.file("extdata", "grav2.zip", package="qtl2geno") )

Additional sample data sets, including data on Diversity Outbred (DO) mice, are available at https://github.com/rqtl/qtl2data.

Calculating genotype probabilities

The first basic task in QTL analysis is to calculate conditional genotype probabilities, given the observed marker data, at each putative QTL position. This is accomplished with the calc_genoprob() function in the qtl2geno package. Unlike the corresponding function in R/qtl, calc.genoprob(), the result is not inserted back into the input cross object, but is returned as a list of three-dimensional arrays (one per chromosome). Each 3d array of probabilities is arranged as individuals × genotypes × positions.

If we wish to perform QTL calculations at positions between markers (so called “pseudomarkers”), we first need to insert such positions into the genetic map with the function insert_pseudomarkers(). Unlike [R/qtl], the map is kept separate from the genotype probabilities.

We’ll use the iron dataset from Grant et al. (2006) Hepatology 44:174-185 (an intercross) as an example. We first load the data:

library(qtl2geno)
iron <- read_cross2( system.file("extdata", "iron.zip", package="qtl2geno") )

(Note: you can use library(qtl2) to load the three main packages, qtl2geno, qtl2scan, and qtl2plot, all at once.)

We then use insert_pseudomarkers() to insert pseudomarkers into the genetic map, which we grab from the iron object as iron$gmap:

map <- insert_pseudomarkers(iron$gmap, step=1)

And next we use calc_genoprob() to calculate the QTL genotype probabilities.

pr <- calc_genoprob(iron, map, err=0.002)

To speed up the calculations with large datasets on a multi-core machine, you can use the argument cores. With cores=0, the number of available cores will be detected via parallel::detectCores(). Otherwise, specify the number of cores as a positive integer.

pr <- calc_genoprob(iron, map, err=0.002, cores=4)

The genome scan functions (see below) use genotype probabilities as well as a matrix of phenotypes. If you wished to perform a genome scan via an additive allele model, you would first convert the genotype probabilities to allele probabilities, using the function genoprob_to_alleleprob().

apr <- genoprob_to_alleleprob(pr)

Calculating a kinship matrix

If you wish to perform a genome scan by a linear mixed model, accounting for the relationships among individuals (in other words, including a random polygenic effect), you’ll need to calculate a kinship matrix for the individuals. This is accomplished with the calc_kinship() function in qtl2geno. It takes the genotype probabilities as input.

kinship <- calc_kinship(pr)

By default, the genotype probabilities are converted to allele probabilities, and the kinship matrix is calculated as the proportion of shared alleles. To use genotype probabilities instead, use use_allele_probs=FALSE. Further, by default we omit the X chromosome and only use the autosomes. To include the X chromosome, use omit_x=FALSE.

In calculating the kinship matrix, one may wish to eliminate the effect of varying marker density across the genome, and only use the probabilities along the grid of pseudomarkers (defined by the step argument to insert_pseudomarkers(). To do so, we need to first use calc_grid() to determine the grid of pseudomarkers, and then probs_to_grid() to probabilities for positions that are not on the grid.

grid <- calc_grid(iron$gmap, step=1)
pr_grid <- probs_to_grid(pr, grid)
kinship_grid <- calc_kinship(pr_grid)

If, for your linear mixed model genome scan, you wish to use the “leave one chromosome out” (LOCO) method (scan each chromosome using a kinship matrix that is calculated using data from all other chromosomes), use type="loco" in the call to calc_kinship().

kinship_loco <- calc_kinship(pr, "loco")

On a multi-core machine, you can get some speed-up via the cores argument, as with calc_genoprob().

kinship_loco <- calc_kinship(pr, "loco", cores=4)

Special covariates for the X chromosome

In a QTL scan of the X chromosome, special covariates (such as sex) may need to be included under the null hypothesis of no QTL, to avoid spurious evidence of linkage. (See Broman et al. (2006) Genetics 174:2151-2158.)

The particular X chromosome covariates depends on the cross, and can be obtained with the qtl2geno function get_x_covar().

Xcovar <- get_x_covar(iron)

Performing a genome scan

To perform a genome scan by Haley-Knott regression (Haley and Knott 1992), use the function scan1() in qtl2scan. (The functions above were all from qtl2geno; from here forward, the functions are all from qtl2scan.)

scan1() takes as input the genotype probabilities, a matrix of phenotypes, and then optional additive and interactive covariates, and the special X chromosome covariates. Another option is to provide a vector of weights.

library(qtl2scan)
out <- scan1(pr, iron$pheno, Xcovar=Xcovar)

On a multi-core machine, you can get some speed-up via the cores argument, as with calc_genoprob() and calc_kinship().

out <- scan1(pr, iron$pheno, Xcovar=Xcovar, cores=4)

The output of scan1() is a matrix of LOD scores, positions × phenotypes. (Well, actually, the output is a list including "lod" which is this matrix of LOD scores, but also including the genetic map and other details.)

The function plot_scan1() in the qtl2plot package can be used to plot the LOD curves. You can just write plot(), as there’s an S3 method plot.scan1() and the output of scan1() has class "scan1". Use the argument lodcolumn to indicate which column to plot. Unlike the plot.scanone() function in R/qtl, the plot_scan1() function will only plot one set of LOD curves at a time, and you need to provide the marker/pseudomarker map (created by insert_pseudomarkers()).

library(qtl2plot)
par(mar=c(5.1, 4.1, 1.1, 1.1))
ymx <- maxlod(out) # overall maximum LOD score
plot(out, map, lodcolumn=1, col="slateblue", ylim=c(0, ymx*1.02))
plot(out, map, lodcolumn=2, col="violetred", add=TRUE)
legend("topleft", lwd=2, col=c("slateblue", "violetred"), colnames(out), bg="gray90")

Finding LOD peaks

The function find_peaks() in the qtl2scan package can be used to identify a set of LOD peaks that exceed some threshold. It can also provide LOD support or Bayes credible intervals, by using the arguments drop (the amount to drop in the LOD support intervals) or prob (the nominal coverage for the Bayes credible intervals).

You need to provide both the scan1() output as well as the marker/pseudomarker map.

find_peaks(out, map, threshold=4, drop=1.5)
##   lodindex lodcolumn chr  pos       lod ci_lo ci_hi
## 1        1     liver   2 56.8  4.957564  48.1  73.2
## 2        1     liver   7 50.1  4.050766  13.1  53.6
## 3        1     liver  16 28.6  7.681569   6.6  40.4
## 4        2    spleen   8 13.6  4.302919   0.0  32.7
## 5        2    spleen   9 56.6 12.063226  53.6  61.2

The find_peaks() function can also pick out multiple peaks on a chromosome: each peak must exceed the chosen threshold, and the argument peakdrop indicates the amount that the LOD curve must drop between the lowest of two adjacent peaks. Use this feature with caution.

find_peaks(out, map, threshold=4, peakdrop=1.8, drop=1.5)
##   lodindex lodcolumn chr  pos       lod ci_lo ci_hi
## 1        1     liver   2 56.8  4.957564  48.1  73.2
## 2        1     liver   7 25.1  4.040021  13.1  28.4
## 3        1     liver   7 50.1  4.050766  31.7  53.6
## 4        1     liver  16 28.6  7.681569   6.6  40.4
## 5        2    spleen   8 13.6  4.302919   0.0  32.7
## 6        2    spleen   9 56.6 12.063226  53.6  61.2

The functions lod_int() and bayes_int() can be used to derive the LOD support or Bayes credible intervals for QTL, for a specific chromosome and LOD score column. For example, to obtain the Bayes interval for the locus on chromosome 9 for the second phenotype (“spleen”):

bayes_int(out, map, lodcolumn=2, chr=9, prob=0.95)
##   ci_lo  pos ci_hi
## 1  53.6 56.6  61.2

Both lod_int() and bayes_int() take a peakdrop argument, if you wish to try to identify multiple peaks on a chromosome. Again, use this feature with caution.

lod_int(out, map, lodcolumn=1, chr=7, peakdrop=1.8, drop=1.5)
##   ci_lo  pos ci_hi
## 1  13.1 25.1  28.4
## 2  31.7 50.1  53.6

Each row is a different peak; the columns are the lower interval endpoint, the estimated QTL position, and the upper interval endpoint.

Performing a genome scan with a linear mixed model

To perform a genome scan using a linear mixed model, accounting for relationships among individuals using a random polygenic effect, you also use the function scan1; you just need to provide the argument kinship, a kinship matrix (or, for the LOCO method, a list of kinship matrices).

out_pg <- scan1(pr, iron$pheno, kinship, Xcovar=Xcovar)

Again, on a multi-core machine, you can get some speed-up using the cores argument.

out_pg <- scan1(pr, iron$pheno, kinship, Xcovar=Xcovar, cores=4)

For the LOCO (leave one chromosome out) method, provide the list of kinship matrices as obtained from calc_kinship() with method="loco".

out_pg_loco <- scan1(pr, iron$pheno, kinship_loco, Xcovar=Xcovar)

To plot the results, we again use plot_scan1() from the qtl2plot package, or just type plot().

Here is a plot of the LOD scores, by Haley-Knott regression and the linear mixed model using either the standard kinship matrix or the LOCO method.

color <- c("slateblue", "violetred", "green3")
par(mar=c(4.1, 4.1, 1.6, 1.1))
ymx <- max(maxlod(out), maxlod(out_pg), maxlod(out_pg_loco))
for(i in 1:2) {
    plot(out, map, lodcolumn=i, col=color[1], main=colnames(iron$pheno)[i],
              ylim=c(0, ymx*1.02))
    plot(out_pg, map, lodcolumn=i, col=color[2], add=TRUE)
    plot(out_pg_loco, map, lodcolumn=i, col=color[3], add=TRUE, lty=2)
    legend("topleft", lwd=2, col=color, c("H-K", "LMM", "LOCO"), bg="gray90", lty=c(1,1,2))
}

For the liver phenotype (top panel), the three methods give quite different results. The linear mixed model with an overall kinship matrix gives much lower LOD scores than the other two methods. On chromosomes with some evidence of a QTL, the LOCO method gives higher LOD scores than Haley-Knott, except on chromosome 16 where it gives lower LOD scores.

For the spleen phenotype (bottom panel), the linear mixed model with an overall kinship matrix again gives much lower LOD scores than the other two methods. However, in this case Haley-Knott regression and the LOCO method give quite similar results.

Performing a genome scan with binary traits

The genome scans above were performed assuming that the residual variation followed a normal distribution. This will often provide reasonable results even if the residuals are not normal, but an important special case is that of a binary trait, with values 0 and 1, which is best treated differently. The scan1 function can perform a genome scan with binary traits by logistic regression, using the argument model="binary". (The default value for the model argument is "normal".) At present, we can not account for relationships among individuals in this analysis.

Let’s first turn our two phenotypes into binary traits by thresholding at the median. One would generally not do this in practice; this is just for illustration.

bin_pheno <- apply(iron$pheno, 2, function(a) as.numeric(a > median(a)))
rownames(bin_pheno) <- rownames(iron$pheno)

We now perform the genome scan as before, including model="binary" to indicates that the phenotypes are binary traits with values 0 and 1.

out_bin <- scan1(pr, bin_pheno, Xcovar=Xcovar, model="binary")

Here is a plot of the two LOD curves.

par(mar=c(5.1, 4.1, 1.1, 1.1))
ymx <- maxlod(out_bin)
plot(out_bin, map, lodcolumn=1, col="slateblue", ylim=c(0, ymx*1.02))
plot(out_bin, map, lodcolumn=2, col="violetred", add=TRUE)
legend("topleft", lwd=2, col=c("slateblue", "violetred"), colnames(out_bin), bg="gray90")

We can use find_peaks as before.

find_peaks(out_bin, map, threshold=3.5, drop=1.5)
##   lodindex lodcolumn chr  pos      lod ci_lo ci_hi
## 1        1     liver   2 56.8 3.630279  48.1  73.2
## 2        1     liver   8 38.0 3.507654  17.3  69.9
## 3        1     liver  15 49.2 3.778869  16.4  49.2
## 4        1     liver  16 27.6 7.506900   6.6  40.4
## 5        2    spleen   9 57.6 9.252779  53.6  61.2

Performing a permutation test

To perform a permutation test to establish the statistical significance of the results of a genome scan, use the function scan1perm(). (In R/qtl, a single function, scanone(), was used for both performing a genome scan and for getting permutation-based significance thresholds, but in R/qtl2, we’ve decided to make two separate functions).

The scan1perm() function takes the same arguments as scan1(), plus additional arguments to control the permutations:

As with scan1(), you may provide a kinship matrix (or vector of kinship matrices, for the “leave one chromosome out” (loco) approach), in order to fit linear mixed models to account for accounting for the relationships among individuals (in other words, including a random polygenic effect). If kinship is unspecified, the function performs ordinary Haley-Knott regression.

To perform a permutation test with the iron data, we do the following:

operm <- scan1perm(pr, iron$pheno, Xcovar=Xcovar, n_perm=1000)

Note the need to specify special covariates for the X chromosome (via Xcovar), to be included under the null hypothesis of no QTL. And note that when these are provided, the default is to perform a stratified permutation test, using strata defined by the rows in Xcovar. In general, when the X chromosome is considered, one will wish to stratify at least by sex.

Also note that, as with scan1(), you can speed up the calculations on a multi-core machine by specifying the argument cores. With cores=0, the number of available cores will be detected via parallel::detectCores(). Otherwise, specify the number of cores as a positive integer. For large datasets, be mindful of the amount of memory that will be needed; you may need to use fewer than the maximum number of cores, to avoid going beyond the available memory.

operm <- scan1perm(pr, iron$pheno, Xcovar=Xcovar, n_perm=1000, cores=0)

To get estimated significance thresholds, use the function summary().

summary(operm)
## LOD thresholds (1000 permutations)
##      liver spleen
## 0.05  3.46   3.46

The default is to return the 5% significance thresholds. Thresholds for other (or for multiple) significance levels can be obtained via the alpha argument.

summary(operm, alpha=c(0.2, 0.05))
## LOD thresholds (1000 permutations)
##      liver spleen
## 0.2   2.63   2.64
## 0.05  3.46   3.46

To obtain autosome/X chromosome-specific significance thresholds, specify perm_Xsp=TRUE. In this case, you need to provide chromosome lengths, which may be obtained with the function chr_lengths().

operm2 <- scan1perm(pr, iron$pheno, Xcovar=Xcovar, n_perm=1000,
                    perm_Xsp=TRUE, chr_lengths=chr_lengths(map))

Separate permutations are performed for the autosomes and X chromosome, and considerably more permutation replicates are needed for the X chromosome. The computations take about twice as much time. See Broman et al. (2006) Genetics 174:2151-2158.

The significance thresholds are again derived via summary():

summary(operm2, alpha=c(0.2, 0.05))
## Autosome LOD thresholds (1000 permutations)
##      liver spleen
## 0.2   2.65   2.54
## 0.05  3.42   3.22
## 
## X chromosome LOD thresholds (28243 permutations)
##      liver spleen
## 0.2    3.1   4.02
## 0.05   3.9   5.18

Permutations for a genome scan with a linear mixed model-based are performed by specifying the kinship argument. We can use the “leave one chromosome out” (loco) method by providing kinship_loco, the list of kinship matrices calculated above with calc_kinship().

operm3 <- scan1perm(pr, iron$pheno, kinship_loco, Xcovar=Xcovar, n_perm=1000,
                    perm_Xsp=TRUE, chr_lengths=chr_lengths(map))

Here are the estimated significance thresholds:

summary(operm3, alpha=c(0.2, 0.05))
## Autosome LOD thresholds (1000 permutations)
##      liver spleen
## 0.2   2.64   2.62
## 0.05  3.29   3.29
## 
## X chromosome LOD thresholds (28243 permutations)
##      liver spleen
## 0.2   3.14   4.37
## 0.05  3.82   5.50

As with scan1, we can use scan1perm with binary traits, using the argument model="binary". Again, this can’t be used with a kinship matrix, but all of the other arguments can be applied.

operm_bin <- scan1perm(pr, bin_pheno, Xcovar=Xcovar, model="binary",
                       n_perm=1000, perm_Xsp=TRUE, chr_lengths=chr_lengths(map))

Here are the estimated 5% and 20% significance thresholds.

summary(operm_bin, alpha=c(0.2, 0.05))
## Autosome LOD thresholds (1000 permutations)
##      liver spleen
## 0.2   2.60   2.63
## 0.05  3.33   3.41
## 
## X chromosome LOD thresholds (28243 permutations)
##      liver spleen
## 0.2   3.16   3.06
## 0.05  3.86   3.77

Estimated QTL effects

The scan1() function return only LOD scores. To obtain estimated QTL effects, use the function scan1coef(). This function takes a single phenotype and the genotype probabilities for a single chromosome and returns a matrix with the estimated coefficients at each putative QTL location along the chromosome.

For example, to get the estimated effects on chromosome 2 for the liver phenotype, we’d do the following:

c2eff <- scan1coef(pr[,"2"], iron$pheno[,"liver"])

The result is a matrix, 39 positions × 3 genotypes. An attribute, "map" contains the positions of the calculations. To plot the effects, use the function plot_coef() from the qtl2plot. There is again an S3 method function plot.scan1coef(), so one can just type plot(). Use the argument columns to indicate which coefficient columns to plot.

par(mar=c(4.1, 4.1, 1.1, 2.6), las=1)
col <- c("slateblue", "violetred", "green3")
plot(c2eff, map["2"], columns=1:3, col=col)
last_coef <- unclass(c2eff)[nrow(c2eff),] # pull out last coefficients
for(i in seq(along=last_coef))
    axis(side=4, at=last_coef[i], names(last_coef)[i], tick=FALSE, col.axis=col[i])

The default is to provide phenotype averages for each genotype group. If instead you want additive and dominance effects, you can provide a square matrix of contrasts, as follows:

c2effB <- scan1coef(pr[,"2"], iron$pheno[,"liver"],
                    contrasts=cbind(mu=c(1,1,1), a=c(-1, 0, 1), d=c(-0.5, 1, -0.5)))

The result will then contain the estimates of mu, a, and d. Here’s a plot of the additive and dominance effects, which are in the second and third columns.

par(mar=c(4.1, 4.1, 1.1, 2.6), las=1)
plot(c2effB, map["2"], columns=2:3, col=col)
last_coef <- unclass(c2effB)[nrow(c2effB),2:3] # last two coefficients
for(i in seq(along=last_coef))
    axis(side=4, at=last_coef[i], names(last_coef)[i], tick=FALSE, col.axis=col[i])

If you provide a kinship matrix to scan1coef(), it fits a linear mixed model (LMM) to account for a residual polygenic effect. Here let’s use the kinship matrix from the LOCO method.

c2eff_pg <- scan1coef(pr[,"2"], iron$pheno[,"liver"], kinship_loco[["2"]])

Here’s a plot of the estimates.

par(mar=c(4.1, 4.1, 1.1, 2.6), las=1)
col <- c("slateblue", "violetred", "green3")
plot(c2eff_pg, map["2"], columns=1:3, col=col, ylab="Phenotype average")
last_coef <- unclass(c2eff_pg)[nrow(c2eff_pg),]
for(i in seq(along=last_coef))
    axis(side=4, at=last_coef[i], names(last_coef)[i], tick=FALSE, col.axis=col[i])

You can also get estimated additive and dominance effects, using a matrix of contrasts.

c2effB_pg <- scan1coef(pr[,"2"], iron$pheno[,"liver"], kinship_loco[["2"]],
                       contrasts=cbind(mu=c(1,1,1), a=c(-1, 0, 1), d=c(-0.5, 1, -0.5)))

Here’s a plot of the results.

par(mar=c(4.1, 4.1, 1.1, 2.6), las=1)
plot(c2effB_pg, map["2"], columns=2:3, col=col)
last_coef <- unclass(c2effB_pg)[nrow(c2effB_pg),2:3]
for(i in seq(along=last_coef))
    axis(side=4, at=last_coef[i], names(last_coef)[i], tick=FALSE, col.axis=col[i])

Another option for estimating the QTL effects is to treat them as random effects and calculate Best Linear Unbiased Predictors (BLUPs). This is particularly valuable for multi-parent populations such as the Collaborative Cross and Diversity Outbred mice, where the large number of possible genotypes at a QTL lead to considerable variability in the effect estimates. To calculate BLUPs, use scan1blup(); it takes the same arguments as scan1coef(), including the option of a kinship matrix to account for a residual polygenic effect.

c2blup <- scan1blup(pr[,"2"], iron$pheno[,"liver"], kinship_loco[["2"]])

Here is a plot of the BLUPs (as dashed curves) alongside the standard estimates. Note that

par(mar=c(4.1, 4.1, 1.1, 2.6), las=1)
col <- c("slateblue", "violetred", "green3")
ylim <- range(c(c2blup, c2eff))+c(-1,1)
plot(c2eff, map["2"], columns=1:3, col=col, ylab="Phenotype average", ylim=ylim,
     xlab="Chr 2 position")
plot(c2blup, map["2"], columns=1:3, col=col, add=TRUE, lty=2)
last_coef <- unclass(c2eff)[nrow(c2eff),]
for(i in seq(along=last_coef))
    axis(side=4, at=last_coef[i], names(last_coef)[i], tick=FALSE, col.axis=col[i])

The scan1coef function can also provide estimated QTL effects for binary traits, with model="binary". (However, scan1blup has not yet been implemented for binary traits.)

c2eff_bin <- scan1coef(pr[,"2"], bin_pheno[,"liver"], model="binary")

Here’s a plot of the effects. They’re a bit tricky to interpret, as their basically log odds ratios.

par(mar=c(4.1, 4.1, 1.1, 2.6), las=1)
col <- c("slateblue", "violetred", "green3")
plot(c2eff_bin, map["2"], columns=1:3, col=col)
last_coef <- unclass(c2eff_bin)[nrow(c2eff_bin),] # pull out last coefficients
for(i in seq(along=last_coef))
    axis(side=4, at=last_coef[i], names(last_coef)[i], tick=FALSE, col.axis=col[i])

SNP association

For multi-parent crosses, it can be useful to collapse the genotype or allele probabilities according to the founder genotypes of the various SNPs in the region of a QTL.

QTL analysis in Diversity Outbred mice

To illustrate this sort of SNP association analysis, we’ll consider some Diversity Outbred mouse data. The Diversity Outcross (DO) mice are an advanced intercross population derived from the same eight founder strains as the Collaborative Cross (CC). See Svenson et al. (2012) and Gatti et al. (2014).

We’ll consider a subset of the data from Recla et al. (2014), available as part of the qtl2data github repository. (The full data are in DO_Recla; the directory DOex contains a reduced set, with just three chromosomes, one phenotype (OF_immobile_pct, percent immobile in the open field test), and a reduced set of markers.

You can download the data from a single zip file, as follows:

file <- paste0("https://raw.githubusercontent.com/rqtl/",
               "qtl2data/master/DOex/DOex.zip")
DOex <- read_cross2(file)

Let’s quickly whip through a basic analysis.

We first calculate genotype probabilities and convert them to allele probabilities. We’ll just use marker locations and not insert any pseudomarkers.

pr <- calc_genoprob(DOex, error_prob=0.002)
apr <- genoprob_to_alleleprob(pr)

We calculate kinship matrices (using the “loco” method, though with the caveat that here we are only considering genotypes on three chromosomes).

k <- calc_kinship(apr, "loco")

We create a numeric covariate for sex; be sure to include the individual IDs as names.

sex <- (DOex$covar$Sex == "male")*1
names(sex) <- rownames(DOex$covar)

We perform a genome scan with a linear mixed model (adjusting for a residual polygenic effect), with sex as an additive covariate.

out <- scan1(apr, DOex$pheno, k, sex)

Here’s a plot of the results.

par(mar=c(4.1, 4.1, 0.6, 0.6))
plot(out, DOex$gmap)

There’s a strong peak on chromosome 2. Let’s look at the QTL effects. We estimate them with scan1coef(). We need to subset the allele probabilities and the list of kinship matrices.

coef_c2 <- scan1coef(apr[,"2"], DOex$pheno, k[["2"]], sex)

For the DO, with 8 QTL alleles, we can use the function plot_coefCC in the R/qtl2plot package, which plots the 8 allele effects in the “official” Collaborative Cross (CC) colors. (Well, actually slightly modified colors, because I think the official colors are kind of ugly.) The strong locus seems to be mostly due to the NZO allele. Note that CCcolors is a vector of colors included in the qtl2plot package; there’s also a CCorigcolors object with the official colors.

par(mar=c(4.1, 4.1, 0.6, 0.6))
plot_coefCC(coef_c2, DOex$gmap["2"], bgcolor="gray95")
legend("bottomleft", col=CCcolors, names(CCcolors), ncol=2, lwd=2, bg="gray95")

SNP associations

Okay, now finally we get to the SNP associations. We have a large peak on chromosome 2, and we want to look at individual SNPs in the region of the locus.

Well, actually, we first need to find the location of the inferred QTL. The peak LOD score on chromosome 2 occurs at 52.4 cM. But to find nearby SNPs, we really want to know the Mbp position. The calculations were only performed at the marker positions, and so we need to find the peak marker and then find it’s physical location:

marker <- rownames(max(out, DOex$gmap, chr="2"))
peak_Mbp <- DOex$pmap[["2"]][marker]

The marker is at 97.5 Mbp.

Now we need to identify the SNPs in this region. We’ll focus on a 2 Mbp interval centered at 97.5 Mbp. We’re still working on how best to quickly access SNP data. In the meantime, we can grab a predefined table of SNPs that’s available in the qtl2data repository. It’s saved as an RDS file, which is a slight hassle to load over the web.

tmpfile <- tempfile()
file <- "https://raw.githubusercontent.com/rqtl/qtl2data/master/DOex/c2_snpinfo.rds"
download.file(file, tmpfile, quiet=TRUE)
snpinfo <- readRDS(tmpfile)
unlink(tmpfile)

Here’s the first few rows of the data. The columns are the SNP name, the chromosome, the Mbp position (in Mouse genome build 38), the alleles (with the B6 allele before the | and any other alleles after; in the case of multiple alternate alleles, they are separated by /). Finally, there are eight columns of genotypes for the 8 CC founder strains, coded as 1/3.

head(snpinfo)
##        snp_id chr  pos_Mbp alleles AJ B6 129 NOD NZO CAST PWK WSB
## 1 rs221396738   2 96.50001     C|T  1  1   1   1   1    3   1   1
## 2 rs264175039   2 96.50022     A|C  1  1   1   1   3    1   1   1
## 3 rs227493750   2 96.50028     C|T  1  1   1   1   3    1   1   1
## 4 rs229722012   2 96.50034     C|G  1  1   1   1   3    3   1   3
## 5  rs27379137   2 96.50044     C|T  1  1   1   1   1    3   1   1
## 6 rs227574143   2 96.50067     A|C  1  1   1   1   3    1   1   3

We first convert the founder genotypes to a “strain distribution pattern” (SDP): an integer whose binary encoding corresponds to the 8 founders’ genotypes.

snpinfo$sdp <- calc_sdp(snpinfo[,-(1:4)])

We’ve added the SDP as an additional column.

head(snpinfo)
##        snp_id chr  pos_Mbp alleles AJ B6 129 NOD NZO CAST PWK WSB sdp
## 1 rs221396738   2 96.50001     C|T  1  1   1   1   1    3   1   1  32
## 2 rs264175039   2 96.50022     A|C  1  1   1   1   3    1   1   1  16
## 3 rs227493750   2 96.50028     C|T  1  1   1   1   3    1   1   1  16
## 4 rs229722012   2 96.50034     C|G  1  1   1   1   3    3   1   3 176
## 5  rs27379137   2 96.50044     C|T  1  1   1   1   1    3   1   1  32
## 6 rs227574143   2 96.50067     A|C  1  1   1   1   3    1   1   3 144

(Note that there’s also a function invert_sdp() for converting the SDPs back into founder genotypes.)

To perform the SNP association analysis, we first use the allele probabilities and the founder SNP genotypes to infer the SNP genotypes for the DO mice. That is, at each SNP, we want to collapse the eight founder allele probabilities to two SNP allele probabilities, using the the SNP genotypes of the founders.

We do this assuming that the allele probabilities were calculated sufficiently densely that they can be assumed to be constant in intervals. With this assumption, we then:

We further create a look-up table relating the full set of SNPs to the reduced set (one of each observed SDP in each interval).

We first need to identify the equivalent SNPs, using the function index_snps(). This requires a physical map of the markers/pseudomarkers used to calculate the genotype probabilities. We take this directly from the DOex object, as we’d calculated the allele probabilities only at the observed markers. If we’d also calculated probabilities at pseudomarker positions between markers, we’d need to use interpolation to get Mbp positions for the pseudomarkers. There’s a function interp_map() for assisting with that.

The index_snps() function takes the physical map and the snpinfo data frame, include the strain distribution patterns we calculated above. It inserts three new columns into the data frame ("index", "interval", and "on_map": indexes to a set of non-equivalent SNPs, map intervals in which the SNPs lie, and whether the SNPs correspond to marker/pseudomarker positions).

snpinfo <- index_snps(DOex$pmap, snpinfo)

We can then use the function genoprob_to_snpprob(), which takes the allele probabilities (or the full genotype probabilities, if you want to use a full 3-genotype model at each SNP), to collapse the genotype probabilities to SNP genotype probabilities.

snp_pr <- genoprob_to_snpprob(apr, snpinfo)

The output of this function, snp_pr, has the same form as the input apr object with allele probabilities, and can be used directly in a call to scan1(). And so we can now use the object to perform the SNP association analysis in the region, using the same linear mixed model. We need to be sure to use the correct kinship matrix.

out_snps <- scan1(snp_pr, DOex$pheno, k[["2"]], sex)

The function plot_snpasso() in the qtl2plot package can be used to plot the results, with points at each of the SNPs. The default is to plot all SNPs: We calculated LOD scores only at a set of distinct SNPs, but SNPs in the same interval with the same SDP will have the same LOD score. It takes the scan1() output plus the snpinfo data frame.

par(mar=c(4.1, 4.1, 0.6, 0.6))
plot_snpasso(out_snps, snpinfo)

To get a table of the SNPs with the largest LOD scores, use the function top_snps(). This will show all SNPs with LOD score within some amount (the default is 1.5) of the maximum SNP LOD score.

top_snps(out_snps, snpinfo)
##            snp_id chr  pos_Mbp alleles AJ B6 129 NOD NZO CAST PWK WSB sdp index interval on_map      lod
## 3264  rs212414861   2 96.75489     C|G  1  1   1   1   3    3   1   1  48  3264       64  FALSE 6.931185
## 3265  rs229578122   2 96.75494     T|A  1  1   1   1   3    3   1   1  48  3264       64  FALSE 6.931185
## 3266  rs254318131   2 96.75494     C|T  1  1   1   1   3    3   1   1  48  3264       64  FALSE 6.931185
## 3267  rs217679969   2 96.75495     G|T  1  1   1   1   3    3   1   1  48  3264       64  FALSE 6.931185
## 3268  rs238404461   2 96.75496     T|G  1  1   1   1   3    3   1   1  48  3264       64  FALSE 6.931185
## 3269  rs262749925   2 96.75497     C|G  1  1   1   1   3    3   1   1  48  3264       64  FALSE 6.931185
## 3271  rs231282689   2 96.75498   C|G/T  1  1   1   1   3    3   1   1  48  3264       64  FALSE 6.931185
## 3274  rs260286709   2 96.75518     G|A  1  1   1   1   3    3   1   1  48  3264       64  FALSE 6.931185
## 3275   rs27396282   2 96.75522     T|C  1  1   1   1   3    3   1   1  48  3264       64  FALSE 6.931185
## 3283  rs263841533   2 96.75541     T|C  1  1   1   1   3    3   1   1  48  3264       64  FALSE 6.931185
## 3288  rs231205920   2 96.75557     G|A  1  1   1   1   3    3   1   1  48  3264       64  FALSE 6.931185
## 3289  rs242885221   2 96.75656     T|C  1  1   1   1   3    3   1   1  48  3264       64  FALSE 6.931185
## 5242  rs586746690   2 96.88443     C|T  1  1   3   1   3    3   1   1  52  5242       64  FALSE 5.904999
## 5426   rs49002164   2 96.89204     G|A  1  1   3   1   3    3   1   1  52  5242       64  FALSE 5.904999
## 7170  rs244595995   2 96.96972     A|T  1  1   3   1   3    3   1   1  52  5242       64  FALSE 5.904999
## 16182 rs220351620   2 97.57962     A|G  1  1   1   1   3    1   1   1  16 16182       65  FALSE 5.994374
## 22474  rs52579091   2 98.07299     C|T  1  1   1   1   3    1   1   1  16 16182       65  FALSE 5.994374
## 23017 rs243489710   2 98.11061     C|T  1  1   1   1   3    1   1   1  16 16182       65  FALSE 5.994374
## 23184 rs244316263   2 98.12529     C|A  1  1   1   1   3    1   1   1  16 16182       65  FALSE 5.994374
## 23186 rs219729956   2 98.12534     C|A  1  1   1   1   3    1   1   1  16 16182       65  FALSE 5.994374
## 23360 rs235315566   2 98.20433     G|T  1  1   1   1   3    1   1   1  16 16182       65  FALSE 5.994374
## 23494 rs250167663   2 98.21481     C|T  1  1   1   1   3    1   1   1  16 16182       65  FALSE 5.994374
## 23518 rs234831418   2 98.21704     T|G  1  1   1   1   3    1   1   1  16 16182       65  FALSE 5.994374
## 23601 rs240832432   2 98.22253     C|A  1  1   1   1   3    1   1   1  16 16182       65  FALSE 5.994374
## 23649 rs220815439   2 98.22511     G|A  1  1   1   1   3    1   1   1  16 16182       65  FALSE 5.994374
## 23830 rs579950897   2 98.24240     C|T  1  1   1   1   3    1   1   1  16 16182       65  FALSE 5.994374
## 26015 rs224994851   2 98.39510     A|T  1  1   1   1   3    1   1   1  16 16182       65  FALSE 5.994374
## 26213 rs248208898   2 98.42249     C|T  1  1   1   1   3    1   1   1  16 16182       65  FALSE 5.994374
## 26554 rs245525925   2 98.45411     C|A  1  1   1   1   3    1   1   1  16 16182       65  FALSE 5.994374
## 26891 rs229631954   2 98.47723     C|T  1  1   1   1   3    1   1   1  16 16182       65  FALSE 5.994374

The top SNPs all have NZO and CAST with a common allele, different from the other 6 founders. The next-best SNPs have NZO with a unique allele. Note that there’s one SNP with two alternate alleles (C|G/T). We are requiring that SNPs have just two alleles, and so we group the alternate alleles together, though there’s not a good reason for this.

We can highlight these top SNPs in the SNP association plot using the drop argument.

par(mar=c(4.1, 4.1, 0.6, 0.6))
plot_snpasso(out_snps, snpinfo, drop=1.5)