R/qtl2 user guide

R/qtl2 (aka qtl2) is a reimplementation of the QTL analysis software R/qtl, to better handle high-dimensional data and complex cross designs.

Also see the related packages, qtl2convert (for converting data among the R/qtl2, DOQTL, and R/qtl formats), and qtl2fst (for storing genotype probabilities on disk, for rapid access with reduced memory usage).

R/qtl2 contains a lot of functions. To identify functions of interest, see the organized list of R/qtl2 functions, with brief descriptions.

Installation

Install R/qtl2 from CRAN:

install.packages("qtl2")

Data file format

The input data file formats for R/qtl cannot handle complex crosses, and so for R/qtl2, we have defined a new format for the data files. We’ll describe it here briefly; for details, see the separate vignette on the input file format.

QTL mapping data consists of a set of tables of data: marker genotypes, phenotypes, marker maps, etc. In the new format, these different tables are in separate comma-delimited (CSV) files. In each file, the first column is a set of IDs for the rows, and the first row is a set of IDs for the columns. For example, the phenotype data file will have individual IDs in the first column and phenotype names in the first row.

A few important changes in the tabular data:

We will use not just the genetic marker map, but also a physical map (if available).
Previously, phenotypes and covariates were combined. In the new format, we separate numeric phenotypes from the often non-numeric covariates.
We define a table of “phenotype covariates.” These are metadata describing the phenotypes. For example, in the case of a phenotype measured over time, one column in the phenotype covariate data could be the time of measurement. For gene expression data, we would have columns representing chromosome and physical position of genes, as well as gene IDs.

In addition to the set of CSV files with the primary data, we need a separate “control” file with various control parameters (or metadata), including the names of all of the other data files and the genotype codes used in the genotype data file. The control file is in a specific format using either YAML or JSON; these are human-readable text files for representing relatively complex data.

A big advantage of this control file scheme is that it greatly simplifies the function for reading in the data. That function, read_cross2(), has a single argument: the name (with path) of the control file. So you can read in data like this:

library(qtl2)
grav2 <- read_cross2("~/my_data/grav2.yaml")

The large number of files is a bit cumbersome, so we’ve made it possible to use a zip file containing all of the data files, and to read that zip file directly. There’s even a function for creating the zip file:

zip_datafiles("~/my_data/grav2.yaml")

This zip_datafiles() function will read the control file to identify all of the relevant data files and then zip them up into a file with the same name and location, but with the extension .zip rather than .yaml or .json.

To read the data back in, we use the same read_cross2() function, providing the name (and path) of the zip file rather than the control file.

grav2 <- read_cross2("~/my_data/grav2.zip")

This can even be done with remote files.

grav2 <- read_cross2("https://kbroman.org/qtl2/assets/sampledata/grav2/grav2.zip")

Of course, the other advantage of the zip file is that it is compressed and so smaller than the combined set of CSV files.

The control file may be confusing for some users. To assist in its construction, there’s a function write_control_file() that takes the large set of control parameters as input and then writes the YAML or JSON control file in the appropriate format.

Sample data sets

The R/qtl2 web site includes sample data files in the new format. Zipped versions of these datasets are included with the qtl2 package and can be loaded into R using the read_cross2() function.

In the qtl2 package source, the sample zip files are located in qtl2/inst/extdata. In the installed version of the package, they are in qtl2/extdata, within whatever directory your R packages were installed. The R function system.file() can be used to construct the path to these files.

For example, one of the sample data sets concerns a gravitropism phenotype in a set of Arabidopsis recombinant inbred lines (RIL), from Moore et al. (2013) Genetics 195:1077-1086. The data are in qtl2/extdata/grav2.zip, which can be loaded as follows:

library(qtl2)
grav2 <- read_cross2( system.file("extdata", "grav2.zip", package="qtl2") )

Additional sample data sets, including data on Diversity Outbred (DO) mice, are available at https://github.com/rqtl/qtl2data.

Calculating genotype probabilities

The first basic task in QTL analysis is to calculate conditional genotype probabilities, given the observed marker data, at each putative QTL position. This is accomplished with the calc_genoprob() function. Unlike the corresponding function in R/qtl, calc.genoprob(), the result is not inserted back into the input cross object, but is returned as a list of three-dimensional arrays (one per chromosome). Each 3d array of probabilities is arranged as individuals × genotypes × positions.

If we wish to perform QTL calculations at positions between markers (so called “pseudomarkers”), we first need to insert such positions into the genetic map with the function insert_pseudomarkers(). Unlike R/qtl, the map is kept separate from the genotype probabilities.

We’ll use the iron dataset from Grant et al. (2006) Hepatology 44:174-185 (an intercross) as an example. We first load the data:

library(qtl2)
iron <- read_cross2( system.file("extdata", "iron.zip", package="qtl2") )

We then use insert_pseudomarkers() to insert pseudomarkers into the genetic map, which we grab from the iron object as iron$gmap:

map <- insert_pseudomarkers(iron$gmap, step=1)

And next we use calc_genoprob() to calculate the QTL genotype probabilities.

pr <- calc_genoprob(iron, map, error_prob=0.002)

To speed up the calculations with large datasets on a multi-core machine, you can use the argument cores. With cores=0, the number of available cores will be detected via parallel::detectCores(). Otherwise, specify the number of cores as a positive integer.

pr <- calc_genoprob(iron, map, error_prob=0.002, cores=4)

The genome scan functions (see below) use genotype probabilities as well as a matrix of phenotypes. If you wished to perform a genome scan via an additive allele model, you would first convert the genotype probabilities to allele probabilities, using the function genoprob_to_alleleprob().

apr <- genoprob_to_alleleprob(pr)

Calculating a kinship matrix

If you wish to perform a genome scan by a linear mixed model, accounting for the relationships among individuals (in other words, including a random polygenic effect), you’ll need to calculate a kinship matrix for the individuals. This is accomplished with the calc_kinship() function. It takes the genotype probabilities as input.

kinship <- calc_kinship(pr)

By default, the genotype probabilities are converted to allele probabilities, and the kinship matrix is calculated as the proportion of shared alleles. To use genotype probabilities instead, use use_allele_probs=FALSE. Further, by default we calculate the kinship matrix using both the autosomes and the X chromosome. To omit the X chromosome, use omit_x=TRUE.

In calculating the kinship matrix, one may wish to eliminate the effect of varying marker density across the genome, and only use the probabilities along the grid of pseudomarkers (defined by the step argument to insert_pseudomarkers()). To do so, we need to first use calc_grid() to determine the grid of pseudomarkers, and then probs_to_grid() to omit probabilities for positions that are not on the grid.

grid <- calc_grid(iron$gmap, step=1)
pr_grid <- probs_to_grid(pr, grid)
kinship_grid <- calc_kinship(pr_grid)

If, for your linear mixed model genome scan, you wish to use the “leave one chromosome out” (LOCO) method (scan each chromosome using a kinship matrix that is calculated using data from all other chromosomes), use type="loco" in the call to calc_kinship().

kinship_loco <- calc_kinship(pr, "loco")

On a multi-core machine, you can get some speed-up via the cores argument, as with calc_genoprob().

kinship_loco <- calc_kinship(pr, "loco", cores=4)

Special covariates for the X chromosome

In a QTL scan of the X chromosome, special covariates (such as sex) may need to be included under the null hypothesis of no QTL, to avoid spurious evidence of linkage. (See Broman et al. (2006) Genetics 174:2151-2158.)

The particular X chromosome covariates depend on the cross, and can be obtained with the function get_x_covar().

Xcovar <- get_x_covar(iron)

Performing a genome scan

To perform a genome scan by Haley-Knott regression (Haley and Knott 1992), use the function scan1().

scan1() takes as input the genotype probabilities, a matrix of phenotypes, optional additive and interactive covariates, and the special X chromosome covariates. Another option is to provide a vector of weights.

out <- scan1(pr, iron$pheno, Xcovar=Xcovar)

On a multi-core machine, you can get some speed-up via the cores argument, as with calc_genoprob() and calc_kinship().

out <- scan1(pr, iron$pheno, Xcovar=Xcovar, cores=4)

The output of scan1() is a matrix of LOD scores, positions × phenotypes.

The function plot_scan1() can be used to plot the LOD curves. You can just write plot(), as there’s an S3 method plot.scan1() and the output of scan1() has class "scan1". Use the argument lodcolumn to indicate which column to plot. Unlike the plot.scanone() function in R/qtl, the plot_scan1() function will only plot one set of LOD curves at a time, and you need to provide the marker/pseudomarker map (created by insert_pseudomarkers()).

par(mar=c(5.1, 4.1, 1.1, 1.1))
ymx <- maxlod(out) # overall maximum LOD score
plot(out, map, lodcolumn=1, col="slateblue", ylim=c(0, ymx*1.02))
plot(out, map, lodcolumn=2, col="violetred", add=TRUE)
legend("topleft", lwd=2, col=c("slateblue", "violetred"), colnames(out), bg="gray90")

Finding LOD peaks

The function find_peaks() can be used to identify a set of LOD peaks that exceed some threshold. It can also provide LOD support or Bayes credible intervals, by using the arguments drop (the amount to drop in the LOD support intervals) or prob (the nominal coverage for the Bayes credible intervals).

You need to provide both the scan1() output as well as the marker/pseudomarker map.

find_peaks(out, map, threshold=4, drop=1.5)

##   lodindex lodcolumn chr  pos     lod ci_lo ci_hi
## 1        1     liver   2 56.8  4.9576  48.1  73.2
## 2        1     liver   7 50.1  4.0508  13.1  53.6
## 3        1     liver  16 28.6  7.6816   6.6  40.4
## 4        2    spleen   8 13.6  4.3029   0.0  32.7
## 5        2    spleen   9 56.6 12.0632  53.6  61.2

The find_peaks() function can also pick out multiple peaks on a chromosome: each peak must exceed the chosen threshold, and the argument peakdrop indicates the amount that the LOD curve must drop below the lowest of two adjacent peaks. Use this feature with caution.

find_peaks(out, map, threshold=4, peakdrop=1.8, drop=1.5)

##   lodindex lodcolumn chr  pos     lod ci_lo ci_hi
## 1        1     liver   2 56.8  4.9576  48.1  73.2
## 2        1     liver   7 25.1  4.0400  13.1  28.4
## 3        1     liver   7 50.1  4.0508  31.7  53.6
## 4        1     liver  16 28.6  7.6816   6.6  40.4
## 5        2    spleen   8 13.6  4.3029   0.0  32.7
## 6        2    spleen   9 56.6 12.0632  53.6  61.2

The functions lod_int() and bayes_int() can be used to derive the LOD support or Bayes credible intervals for QTL, for a specific chromosome and LOD score column. For example, to obtain the Bayes interval for the locus on chromosome 9 for the second phenotype (“spleen”):

bayes_int(out, map, lodcolumn=2, chr=9, prob=0.95)

##   ci_lo  pos ci_hi
## 1  53.6 56.6  61.2

Both lod_int() and bayes_int() take a peakdrop argument, if you wish to try to identify multiple peaks on a chromosome. Again, use this feature with caution.

lod_int(out, map, lodcolumn=1, chr=7, peakdrop=1.8, drop=1.5)

##   ci_lo  pos ci_hi
## 1  13.1 25.1  28.4
## 2  31.7 50.1  53.6

Each row is a different peak; the columns are the lower interval endpoint, the estimated QTL position, and the upper interval endpoint.

By default in find_peaks(), bayes_int(), and lod_int(), the interval endpoints are at markers. This is controlled with the argument expand2markers, which defaults to TRUE. To allow the interval endpoints to be at positions between markers (“pseudomarkers”), include expand2markers=FALSE.

lod_int(out, map, lodcolumn=1, chr=7, peakdrop=1.8, drop=1.5, expand2markers=FALSE)

##   ci_lo  pos ci_hi
## 1  13.1 25.1  28.4
## 2  34.1 50.1  53.6

Performing a genome scan with a linear mixed model

To perform a genome scan using a linear mixed model, accounting for relationships among individuals using a random polygenic effect, you also use the function scan1(); you just need to provide the argument kinship, a kinship matrix (or, for the LOCO method, a list of kinship matrices).

out_pg <- scan1(pr, iron$pheno, kinship, Xcovar=Xcovar)

Again, on a multi-core machine, you can get some speed-up using the cores argument.

out_pg <- scan1(pr, iron$pheno, kinship, Xcovar=Xcovar, cores=4)

For the LOCO (leave one chromosome out) method, provide the list of kinship matrices as obtained from calc_kinship() with method="loco".

out_pg_loco <- scan1(pr, iron$pheno, kinship_loco, Xcovar=Xcovar)

To plot the results, we again use plot_scan1(), or just type plot().

Here is a plot of the LOD scores, by Haley-Knott regression and the linear mixed model using either the standard kinship matrix or the LOCO method.

color <- c("slateblue", "violetred", "green3")
par(mar=c(4.1, 4.1, 1.6, 1.1))
ymx <- max(maxlod(out), maxlod(out_pg), maxlod(out_pg_loco))
for(i in 1:2) {
    plot(out, map, lodcolumn=i, col=color[1], main=colnames(iron$pheno)[i],
              ylim=c(0, ymx*1.02))
    plot(out_pg, map, lodcolumn=i, col=color[2], add=TRUE)
    plot(out_pg_loco, map, lodcolumn=i, col=color[3], add=TRUE, lty=2)
    legend("topleft", lwd=2, col=color, c("H-K", "LMM", "LOCO"), bg="gray90", lty=c(1,1,2))
}

For the liver phenotype (top panel), the three methods give quite different results. The linear mixed model with an overall kinship matrix gives much lower LOD scores than the other two methods. On chromosomes with some evidence of a QTL, the LOCO method gives higher LOD scores than Haley-Knott, except on chromosome 16 where it gives lower LOD scores.

For the spleen phenotype (bottom panel), the linear mixed model with an overall kinship matrix again gives much lower LOD scores than the other two methods. However, in this case Haley-Knott regression and the LOCO method give quite similar results.

Performing a genome scan with binary traits

The genome scans above were performed assuming that the residual variation followed a normal distribution. This will often provide reasonable results even if the residuals are not normal, but an important special case is that of a binary trait, with values 0 and 1, which is best treated differently. The scan1() function can perform a genome scan with binary traits by logistic regression, using the argument model="binary". (The default value for the model argument is "normal".) At present, we can not account for relationships among individuals in this analysis.

Let’s first turn our two phenotypes into binary traits by thresholding at the median. One would generally not do this in practice; this is just for illustration.

bin_pheno <- apply(iron$pheno, 2, function(a) as.numeric(a > median(a)))
rownames(bin_pheno) <- rownames(iron$pheno)

We now perform the genome scan as before, including model="binary" to indicate that the phenotypes are binary traits with values 0 and 1.

out_bin <- scan1(pr, bin_pheno, Xcovar=Xcovar, model="binary")

Here is a plot of the two sets of LOD curves.

par(mar=c(5.1, 4.1, 1.1, 1.1))
ymx <- maxlod(out_bin)
plot(out_bin, map, lodcolumn=1, col="slateblue", ylim=c(0, ymx*1.02))
plot(out_bin, map, lodcolumn=2, col="violetred", add=TRUE)
legend("topleft", lwd=2, col=c("slateblue", "violetred"), colnames(out_bin), bg="gray90")

We can use find_peaks as before.

find_peaks(out_bin, map, threshold=3.5, drop=1.5)

##   lodindex lodcolumn chr  pos    lod ci_lo ci_hi
## 1        1     liver   2 56.8 3.6303  48.1  73.2
## 2        1     liver   8 38.0 3.5077  17.3  69.9
## 3        1     liver  15 49.2 3.7789  16.4  49.2
## 4        1     liver  16 27.6 7.5069   6.6  40.4
## 5        2    spleen   9 57.6 9.2528  53.6  61.2

Performing a permutation test

To perform a permutation test to establish the statistical significance of the results of a genome scan, use the function scan1perm(). (In R/qtl, a single function, scanone(), was used for both performing a genome scan and for getting permutation-based significance thresholds, but in R/qtl2, we’ve decided to make two separate functions).

The scan1perm() function takes the same arguments as scan1(), plus additional arguments to control the permutations:

n_perm is the number of permutation replicates.
perm_Xsp controls whether to perform autosome/X chromosome specific permutations (with perm_Xsp=TRUE) or not (the default is to not).
perm_strata is a vector that defines the strata for a stratified permutation test.
chr_lengths is a vector of chromosome lengths, used in the case that perm_Xsp=TRUE.

As with scan1(), you may provide a kinship matrix (or vector of kinship matrices, for the “leave one chromosome out” (LOCO) approach), in order to fit linear mixed models to account for relationships among individuals (in other words, including a random polygenic effect). If kinship is unspecified, the function performs ordinary Haley-Knott regression.

To perform a permutation test with the iron data, we do the following:

operm <- scan1perm(pr, iron$pheno, Xcovar=Xcovar, n_perm=1000)

Note the need to specify special covariates for the X chromosome (via Xcovar), to be included under the null hypothesis of no QTL. And note that when these are provided, the default is to perform a stratified permutation test, using strata defined by the rows in Xcovar. In general, when the X chromosome is considered, one will wish to stratify at least by sex.

Also note that, as with scan1(), you can speed up the calculations on a multi-core machine by specifying the argument cores. With cores=0, the number of available cores will be detected via parallel::detectCores(). Otherwise, specify the number of cores as a positive integer. For large datasets, be mindful of the amount of memory that will be needed; you may need to use fewer than the maximum number of cores, to avoid going beyond the available memory.

operm <- scan1perm(pr, iron$pheno, Xcovar=Xcovar, n_perm=1000, cores=0)

To get estimated significance thresholds, use the function summary().

summary(operm)

## LOD thresholds (1000 permutations)
##      liver spleen
## 0.05  3.46   3.46

The default is to return the 5% significance thresholds. Thresholds for other (or for multiple) significance levels can be obtained via the alpha argument.

summary(operm, alpha=c(0.2, 0.05))

## LOD thresholds (1000 permutations)
##      liver spleen
## 0.2   2.63   2.64
## 0.05  3.46   3.46

To obtain autosome/X chromosome-specific significance thresholds, specify perm_Xsp=TRUE. In this case, you need to provide chromosome lengths, which may be obtained with the function chr_lengths().

operm2 <- scan1perm(pr, iron$pheno, Xcovar=Xcovar, n_perm=1000,
                    perm_Xsp=TRUE, chr_lengths=chr_lengths(map))

Separate permutations are performed for the autosomes and X chromosome, and considerably more permutation replicates are needed for the X chromosome. The computations take about twice as much time. See Broman et al. (2006) Genetics 174:2151-2158.

The significance thresholds are again derived via summary():

summary(operm2, alpha=c(0.2, 0.05))

## Autosome LOD thresholds (1000 permutations)
##      liver spleen
## 0.2   2.65   2.54
## 0.05  3.42   3.22
## 
## X chromosome LOD thresholds (28243 permutations)
##      liver spleen
## 0.2    3.1   4.02
## 0.05   3.9   5.18

Permutations for a genome scan with a linear mixed model-based are performed by specifying the kinship argument. We can use the “leave one chromosome out” (LOCO) method by providing kinship_loco, the list of kinship matrices calculated above with calc_kinship().

operm3 <- scan1perm(pr, iron$pheno, kinship_loco, Xcovar=Xcovar, n_perm=1000,
                    perm_Xsp=TRUE, chr_lengths=chr_lengths(map))

Here are the estimated significance thresholds:

summary(operm3, alpha=c(0.2, 0.05))

## Autosome LOD thresholds (1000 permutations)
##      liver spleen
## 0.2   2.64   2.62
## 0.05  3.29   3.29
## 
## X chromosome LOD thresholds (28243 permutations)
##      liver spleen
## 0.2   3.14   4.37
## 0.05  3.82   5.50

As with scan1, we can use scan1perm with binary traits, using the argument model="binary". Again, this can’t be used with a kinship matrix, but all of the other arguments can be applied.

operm_bin <- scan1perm(pr, bin_pheno, Xcovar=Xcovar, model="binary",
                       n_perm=1000, perm_Xsp=TRUE, chr_lengths=chr_lengths(map))

Here are the estimated 5% and 20% significance thresholds.

summary(operm_bin, alpha=c(0.2, 0.05))

## Autosome LOD thresholds (1000 permutations)
##      liver spleen
## 0.2   2.60   2.63
## 0.05  3.33   3.41
## 
## X chromosome LOD thresholds (28243 permutations)
##      liver spleen
## 0.2   3.16   3.06
## 0.05  3.86   3.77

Estimated QTL effects

The scan1() function returns only LOD scores. To obtain estimated QTL effects, use the function scan1coef(). This function takes a single phenotype and the genotype probabilities for a single chromosome and returns a matrix with the estimated coefficients at each putative QTL location along the chromosome.

For example, to get the estimated effects on chromosome 2 for the liver phenotype, we’d do the following:

c2eff <- scan1coef(pr[,"2"], iron$pheno[,"liver"])

The result is a matrix, 39 positions × 4 genotypes. To plot the effects, use the function plot_coef(). There is again an S3 method function plot.scan1coef(), so one can just type plot(), but in either case you need to provide the map for the relevant chromosome. Use the argument columns to indicate which coefficient columns to plot.

par(mar=c(4.1, 4.1, 1.1, 2.6), las=1)
col <- c("slateblue", "violetred", "green3")
plot(c2eff, map["2"], columns=1:3, col=col)
last_coef <- unclass(c2eff)[nrow(c2eff),] # pull out last coefficients
for(i in seq(along=last_coef))
    axis(side=4, at=last_coef[i], names(last_coef)[i], tick=FALSE, col.axis=col[i])

The default is to provide phenotype averages for each genotype group. If instead you want additive and dominance effects, you can provide a square matrix of contrasts, as follows:

c2effB <- scan1coef(pr[,"2"], iron$pheno[,"liver"],
                    contrasts=cbind(mu=c(1,1,1), a=c(-1, 0, 1), d=c(0, 1, 0)))

The result will then contain the estimates of mu, a, and d. Here’s a plot of the additive and dominance effects, which are in the second and third columns.

par(mar=c(4.1, 4.1, 1.1, 2.6), las=1)
plot(c2effB, map["2"], columns=2:3, col=col)
last_coef <- unclass(c2effB)[nrow(c2effB),2:3] # last two coefficients
for(i in seq(along=last_coef))
    axis(side=4, at=last_coef[i], names(last_coef)[i], tick=FALSE, col.axis=col[i])

If you provide a kinship matrix to scan1coef(), it fits a linear mixed model (LMM) to account for a residual polygenic effect. Here let’s use the kinship matrix from the LOCO method.

c2eff_pg <- scan1coef(pr[,"2"], iron$pheno[,"liver"], kinship_loco[["2"]])

Here’s a plot of the estimates.

par(mar=c(4.1, 4.1, 1.1, 2.6), las=1)
col <- c("slateblue", "violetred", "green3")
plot(c2eff_pg, map["2"], columns=1:3, col=col, ylab="Phenotype average")
last_coef <- unclass(c2eff_pg)[nrow(c2eff_pg),]
for(i in seq(along=last_coef))
    axis(side=4, at=last_coef[i], names(last_coef)[i], tick=FALSE, col.axis=col[i])

You can also get estimated additive and dominance effects, using a matrix of contrasts.

c2effB_pg <- scan1coef(pr[,"2"], iron$pheno[,"liver"], kinship_loco[["2"]],
                       contrasts=cbind(mu=c(1,1,1), a=c(-1, 0, 1), d=c(-0.5, 1, -0.5)))

Here’s a plot of the results.

par(mar=c(4.1, 4.1, 1.1, 2.6), las=1)
plot(c2effB_pg, map["2"], columns=2:3, col=col)
last_coef <- unclass(c2effB_pg)[nrow(c2effB_pg),2:3]
for(i in seq(along=last_coef))
    axis(side=4, at=last_coef[i], names(last_coef)[i], tick=FALSE, col.axis=col[i])

Another option for estimating the QTL effects is to treat them as random effects and calculate Best Linear Unbiased Predictors (BLUPs). This is particularly valuable for multi-parent populations such as the Collaborative Cross and Diversity Outbred mice, where the large number of possible genotypes at a QTL lead to considerable variability in the effect estimates. To calculate BLUPs, use scan1blup(); it takes the same arguments as scan1coef(), including the option of a kinship matrix to account for a residual polygenic effect.

c2blup <- scan1blup(pr[,"2"], iron$pheno[,"liver"], kinship_loco[["2"]])

Here is a plot of the BLUPs (as dashed curves) alongside the standard estimates. Note that the BLUPs are centered at 0, while the coefficient estimates are centered at the phenotype average.

par(mar=c(4.1, 4.1, 1.1, 2.6), las=1)
col <- c("slateblue", "violetred", "green3")
ylim <- range(c(c2blup, c2eff))+c(-1,1)
plot(c2eff, map["2"], columns=1:3, col=col, ylab="Phenotype average", ylim=ylim,
     xlab="Chr 2 position")
plot(c2blup, map["2"], columns=1:3, col=col, add=TRUE, lty=2)
last_coef <- unclass(c2eff)[nrow(c2eff),]
for(i in seq(along=last_coef))
    axis(side=4, at=last_coef[i], names(last_coef)[i], tick=FALSE, col.axis=col[i])

The scan1coef function can also provide estimated QTL effects for binary traits, with model="binary". (However, scan1blup has not yet been implemented for binary traits.)

c2eff_bin <- scan1coef(pr[,"2"], bin_pheno[,"liver"], model="binary")

Here’s a plot of the effects. They’re a bit tricky to interpret, as their basically log odds ratios.

par(mar=c(4.1, 4.1, 1.1, 2.6), las=1)
col <- c("slateblue", "violetred", "green3")
plot(c2eff_bin, map["2"], columns=1:3, col=col)
last_coef <- unclass(c2eff_bin)[nrow(c2eff_bin),] # pull out last coefficients
for(i in seq(along=last_coef))
    axis(side=4, at=last_coef[i], names(last_coef)[i], tick=FALSE, col.axis=col[i])

Finally, to plot the raw phenotypes against the genotypes at a single putative QTL position, you can use the function plot_pxg(). This takes a vector of genotypes as produced by the maxmarg() function, which picks the most likely genotype from a set of genotype probabilities, provided it is greater than some specified value (the argument minprob). Note that the “marg” in “maxmarg” stands for “marginal”, as this function is selecting the genotype at each position that has maximum marginal probability.

For example, we could get inferred genotypes at the chr 2 QTL for the liver phenotype (at 28.6 cM) as follows:

g <- maxmarg(pr, map, chr=2, pos=28.6, return_char=TRUE)

We use return_char=TRUE to have maxmarg() return a vector of character strings with the genotype labels.

We then plot the liver phenotype against these genotypes as follows:

par(mar=c(4.1, 4.1, 0.6, 0.6))
plot_pxg(g, iron$pheno[,"liver"], ylab="Liver phenotype")

We can use swap_axes=TRUE to have the phenotype on the x-axis. And we can use SEmult=2 to include mean ± 2 SE intervals.

par(mar=c(4.1, 4.1, 0.6, 0.6))
plot_pxg(g, iron$pheno[,"liver"], SEmult=2, swap_axes=TRUE, xlab="Liver phenotype")

SNP association

For multi-parent crosses, it can be useful to collapse the genotype or allele probabilities according to the founder genotypes of the various SNPs in the region of a QTL.

QTL analysis in Diversity Outbred mice

To illustrate this sort of SNP association analysis, we’ll consider some Diversity Outbred mouse data. The Diversity Outcross (DO) mice are an advanced intercross population derived from the same eight founder strains as the Collaborative Cross (CC). See Svenson et al. (2012), Churchill et al. (2012), and Gatti et al. (2014).

We’ll consider a subset of the data from Recla et al. (2014), available as part of the qtl2data github repository. (The full data are in DO_Recla; the directory DOex contains a reduced set, with just three chromosomes, one phenotype (OF_immobile_pct, percent immobile in the open field test), and a reduced set of markers.

You can download the data from a single zip file, as follows:

file <- "https://raw.githubusercontent.com/rqtl/qtl2data/main/DOex/DOex.zip"
DOex <- read_cross2(file)

Let’s quickly whip through a basic analysis.

We first calculate genotype probabilities and convert them to allele probabilities. We’ll just use marker locations and not insert any pseudomarkers.

pr <- calc_genoprob(DOex, error_prob=0.002)
apr <- genoprob_to_alleleprob(pr)

We calculate kinship matrices (using the “LOCO” method, though with the caveat that here we are only considering genotypes on three chromosomes).

k <- calc_kinship(apr, "loco")

We create a numeric covariate for sex; be sure to include the individual IDs as names.

sex <- (DOex$covar$Sex == "male")*1
names(sex) <- rownames(DOex$covar)

Note that you can create the vector and assign the names in one step using the basic R function setNames().

sex <- setNames( (DOex$covar$Sex == "male")*1, rownames(DOex$covar) )

We perform a genome scan with a linear mixed model (adjusting for a residual polygenic effect), with sex as an additive covariate.

out <- scan1(apr, DOex$pheno, k, sex)

Here’s a plot of the results.

par(mar=c(4.1, 4.1, 0.6, 0.6))
plot(out, DOex$gmap)

There’s a strong peak on chromosome 2. Let’s look at the QTL effects. We estimate them with scan1coef(). We need to subset the allele probabilities and the list of kinship matrices.

coef_c2 <- scan1coef(apr[,"2"], DOex$pheno, k[["2"]], sex)

For the DO, with 8 QTL alleles, we can use the function plot_coefCC, which plots the 8 allele effects in the “official” Collaborative Cross (CC) colors. (Well, actually slightly modified colors, because I think the official colors are kind of ugly.) The strong locus seems to be mostly due to the NZO allele. Note that qtl2 package includes the vector of colors for the CC as the object CCcolors; there’s also a CCorigcolors object with the official colors.

par(mar=c(4.1, 4.1, 0.6, 0.6))
plot_coefCC(coef_c2, DOex$gmap["2"], bgcolor="gray95", legend="bottomleft")

If you provide plot_coefCC() with the genome scan output, it will display the LOD curve below the coefficient estimates.

par(mar=c(4.1, 4.1, 0.6, 0.6))
plot_coefCC(coef_c2, DOex$gmap["2"], scan1_output=out, bgcolor="gray95", legend="bottomleft")

Connecting to SNP and gene databases

To perform SNP association analysis in the region of a QTL, we’ll need to grab data on all of the SNPs in the region, including their genotypes in the eight founder strains for the CC and DO populations. As a related task, we’ll also want to identify the genes in the region.

We want R/qtl2 to permit different ways of storing this information. For the CC and DO populations, we’ve prepared SQLite database files for the variants in the CC founders and for the MGI mouse gene annotations. But others might wish to use a different kind of database, or may wish to query an online database.

We provide a template for how to use R/qtl2 to connect to SNP and gene databases, with the functions create_variant_query_func() and create_gene_query_func(). Each returns a function for querying variant and gene databases, respectively. The query functions that are returned take just three arguments (chr, start, end end), and themselves return a data frame of variants on the one hand and genes on the other.

If you are analyzing the CC lines or a DO population, you can download the SQLite databases that we have prepared and made available on Figshare:

cc_variants.sqlite doi:10.6084/m9.figshare.5280229.v3, variants in the Collaborative Cross founders (3 GB)
mouse_genes.sqlite doi:10.6084/m9.figshare.5280238.v7 full set of mouse gene annotations (530 MB)
mouse_genes_mgi.sqlite doi:10.6084/m9.figshare.5286019.v8 just the MGI mouse gene annotations (12 MB)

For example, to download cc_variants.sqlite and mouse_genes_mgi.sqlite, we’d go to the figshare sites above and determine the URLs for downloading the files, and then do the following:

download.file("https://ndownloader.figshare.com/files/18533342", "cc_variants.sqlite")
download.file("https://ndownloader.figshare.com/files/17609252", "mouse_genes_mgi.sqlite")

Store them locally on your computer, wherever it’s convenient. For example, I’ve placed them in ~/Data/CCdb/.

To create a function for querying the CC variants, call create_variant_query_func() with the path to the cc_variants.sqlite file:

query_variants <- create_variant_query_func("~/Data/CCdb/cc_variants.sqlite")

To grab the variants in the interval 97-98 Mbp on chromosome 2, you’d then do the following:

variants_2_97.5 <- query_variants(2, 97, 98)

Similarly, to create a function for querying the MGI mouse gene annotations, you call create_gene_query_func() with the path to the mouse_genes_mgi.sqlite file:

query_genes <- create_gene_query_func("~/Data/CCdb/mouse_genes_mgi.sqlite")

To grab the genes overlapping the interval 97-98 Mbp on chromosome 2, you’d then do the following:

genes_2_97.5 <- query_genes(2, 97, 98)

The way we have set this up is a bit complicated, but it allows greatest flexibility on the part of the user. And for our own work, we like to have a local SQLite database, for rapid queries of SNPs and genes.

Database for build GRCm39

The above SQLite databases were prepared with mouse genome build 38 (mm10). If you are using the newer GRCm39 build, download the following database (created by Matt Vincent at the Jackson Lab), which includes both SNP/indel variants and genes:

fv.2021.snps.db3 doi:10.6084/m9.figshare.22630642.v1 (10.2 GB)

To download from R, use:

download.file("https://figshare.com/ndownloader/files/40157572", "fv.2021.snps.db3")

For querying founder variant information, you can again use create_variant_query_func(), though you need to specify not just the path to the SQLite file but also the id_field, as follows:

query_variants <- create_variant_query_func("~/Data/CCdb/fv.2021.snps.db3", id_field="variants_id")
variants_2_97.5 <- query_variants(2, 97, 98)

The variants database includes all SNPs and indels in the Sanger resequencing data, including some with low confidence. If you want to filter to those with full confidence, you could use the filter argument in create_variant_query_func():

query_variants <- create_variant_query_func("~/Data/CCdb/fv.2021.snps.db3", id_field="variants_id",
                                            filter="confidence==255")
variants_2_97.5 <- query_variants(2, 97, 98)

The fv.2021.snps.db3 file also contains ensembl gene information, but the field names are different from the defaults, so use create_gene_query_func() as follows:

query_genes <- create_gene_query_func("~/Data/CCdb/fv.2021.snps.db3", chr_field="chromosome", name_field="symbol",
                                      start_field="start_position", stop_field="end_position")

But you can then use the query_genes() function in the same way as before:

genes_2_97.5 <- query_genes(2, 97, 98)

SNP associations

Okay, now finally we get to the SNP associations. We have a large peak on chromosome 2, and we want to look at individual SNPs in the region of the locus.

Well, actually, we first need to find the location of the inferred QTL. The peak LOD score on chromosome 2 occurs at 52.4 cM. But to find nearby SNPs, we really want to know the Mbp position. The calculations were only performed at the marker positions, and so can use max(), giving both the scan1() output and the physical map, and then pull out the position from the results.

peak_Mbp <- max(out, DOex$pmap)$pos

The marker is at 96.8 Mbp. We’ll focus on a 2 Mbp interval centered at 96.8 Mbp.

We can pull out the variants in the 2 Mbp interval centered at 96.8 on chr 2 using the query function we defined above:

variants <- query_variants(2, peak_Mbp - 1, peak_Mbp + 1)

There are 27355 variants in the interval, including 27108 SNPs, 127 indels, and 120 structural variants. We’re treating all of them as biallelic markers (all but the major allele in the eight founder strains binned into a single allele). In the following, we’re going to just say “SNP” rather than “variant”, even though the variants include indels and structural variants.

After identifying the variants in the interval of interest, we use our genotype probabilities and the founder SNP genotypes to infer the SNP genotypes for the DO mice. That is, at each SNP, we want to collapse the eight founder allele probabilities to two SNP allele probabilities, using the SNP genotypes of the founders.

We do this assuming that the genotype probabilities were calculated sufficiently densely that they can be assumed to be constant in intervals. With this assumption, we then:

Find the interval for each SNP.
Reduce the SNPs to a “distinct” set: if two SNPs have the same strain distribution pattern (SDP; the pattern of alleles in the eight founders) and are in the same interval, by our assumption their allele probabilities will be the same.
Take the average of the allele probabilities at the two endpoints of each interval.
Collapse the 8 allele probabilities to two according to each observed SDP in the interval.

We further create a look-up table relating the full set of SNPs to the reduced set (one of each observed SDP in each interval).

All of these steps are combined into a single function scan1snps(), which takes the genotype probabilities, a physical map of those locations, the phenotype, the kinship matrix, covariates, the query function for grabbing the SNPs in a given interval, and the chromosome, start, and end positions for the interval. (If insert_pseudomarkers() was used to insert pseudomarkers, you will need to use interp_map() to get interpolated Mbp positions for the pseudomarkers.)

out_snps <- scan1snps(pr, DOex$pmap, DOex$pheno, k[["2"]], sex, query_func=query_variants,
                      chr=2, start=peak_Mbp-1, end=peak_Mbp+1, keep_all_snps=TRUE)

The output is a list with two components: lod is a matrix of LOD scores (with a single column, since we’re using just one phenotype), and snpinfo is a data frame with SNP information. With the argument keep_all_snps=TRUE, the snpinfo data frame contains information about all of the variants in the region with an index column indicating the equivalence classes.

The function plot_snpasso() can be used to plot the results, with points at each of the SNPs. The default is to plot all SNPs. In this case, there are 27355 variants in the region, but only 167 distinct ones.

par(mar=c(4.1, 4.1, 0.6, 0.6))
plot_snpasso(out_snps$lod, out_snps$snpinfo)

We can actually just type plot() rather than plot_snpasso(), because with the snpinfo table in place of a genetic map, plot() detects that a SNP association plot should be created.

par(mar=c(4.1, 4.1, 0.6, 0.6))
plot(out_snps$lod, out_snps$snpinfo)

We can use our query_genes() function to identify the genes in the region, and plot_genes() to plot their locations. But also plot_snpasso() can take the gene locations with the argument genes and then display them below the SNP association results. Here, we are also highlighting the top SNPs in the SNP association plot using the drop_hilit argument. SNPs with LOD score within drop_hilit of the maximum are shown in pink.

genes <- query_genes(2, peak_Mbp - 1, peak_Mbp + 1)
par(mar=c(4.1, 4.1, 0.6, 0.6))
plot(out_snps$lod, out_snps$snpinfo, drop_hilit=1.5, genes=genes)

To get a table of the SNPs with the largest LOD scores, use the function top_snps(). This will show all SNPs with LOD score within some amount (the default is 1.5) of the maximum SNP LOD score. We’re going to display just a subset of the columns.

top <- top_snps(out_snps$lod, out_snps$snpinfo)
print(top[,c(1, 8:15, 20)], row.names=FALSE)

##                     snp A_J C57BL_6J 129S1_SvImJ NOD_ShiLtJ NZO_HlLtJ CAST_EiJ PWK_PhJ WSB_EiJ    lod
##              rs33297153   1        1           2          1         2        3       1       1 8.2151
##             rs226751615   1        1           1          1         2        2       1       1 7.9496
##             rs230653109   1        1           1          1         2        2       1       1 7.9496
##             rs250407660   1        1           1          1         2        2       1       1 7.9496
##             rs248216267   1        1           1          1         2        2       1       1 7.9496
##             rs237564299   1        1           1          1         2        2       1       1 7.9496
##             rs255011323   1        1           1          1         2        2       1       1 7.9496
##             rs212448521   1        1           1          1         2        2       1       1 7.9496
##              rs27379206   1        1           1          1         2        2       1       1 7.9496
##              rs27379194   1        1           1          1         2        2       1       1 7.9496
##             rs244484646   1        1           1          1         2        2       1       1 7.9496
##             rs215855380   1        1           1          1         2        2       1       1 7.9496
##             rs236942088   1        1           1          1         2        2       1       1 7.9496
##             rs212414861   1        1           1          1         2        2       1       1 7.9496
##             rs229578122   1        1           1          1         2        2       1       1 7.9496
##             rs254318131   1        1           1          1         2        2       1       1 7.9496
##             rs217679969   1        1           1          1         2        2       1       1 7.9496
##             rs238404461   1        1           1          1         2        2       1       1 7.9496
##             rs262749925   1        1           1          1         2        2       1       1 7.9496
##             rs231282689   1        1           1          1         2        2       1       1 7.9496
##             rs260286709   1        1           1          1         2        2       1       1 7.9496
##              rs27396282   1        1           1          1         2        2       1       1 7.9496
##             rs263841533   1        1           1          1         2        2       1       1 7.9496
##             rs231205920   1        1           1          1         2        2       1       1 7.9496
##             rs242885221   1        1           1          1         2        2       1       1 7.9496
##             rs586746690   1        1           2          1         2        2       1       1 8.6351
##              rs49002164   1        1           2          1         2        2       1       1 8.6351
##  SV_2_96945406_96945414   1        1           1          1         2        3       1       1 8.2809
##             rs244595995   1        1           2          1         2        2       1       1 8.6351
##  SV_2_96993116_96993118   1        1           1          1         2        2       1       1 8.2809

The top SNPs all have 129, NZO, and CAST with a common allele, different from the other 5 founders. The next-best SNPs have just NZO and CAST with a common allele.

You can include a visualization of the strain distribution pattern of the SNPs in the association plot with the argument sdp_panel=TRUE.

par(mar=c(4.1, 4.1, 0.6, 0.6))
plot(out_snps$lod, out_snps$snpinfo, drop_hilit=1.5, genes=genes, sdp_panel=TRUE)

The scan1snps() function can also be used to perform a genome-wide SNP association scan, by providing a variant query function but leaving chr, start, and end unspecified. In this case it’s maybe best to use keep_all_snps=FALSE (the default) and only save the index SNPs.

out_gwas <- scan1snps(pr, DOex$pmap, DOex$pheno, k, sex, query_func=query_variants, cores=0)

We can make a Manhattan plot of the results as follows. We use altcol to define a color for alternate chromosomes and gap=0 to have no gap between chromosomes.

par(mar=c(4.1, 4.1, 0.6, 0.6))
plot(out_gwas$lod, out_gwas$snpinfo, altcol="green4", gap=0)

Note that while there are LOD scores on the X chromosome that are as large as those on chromosome 2, we’re allowing a genotype × sex interaction on the X chromosome and so the test has 3 degrees of freedom (versus 2 degrees of freedom on the autosomes) and so the LOD scores will naturally be larger. If we’d used the allele probabilities (apr above, calculated from genoprob_to_alleleprob()) rather than the genotype probabilities, we would be performing a test with 1 degree of freedom on both the X chromosome and the autosomes.

Karl Broman