R/qtl2 (aka qtl2) is a reimplementation of the QTL analysis software R/qtl, to better handle high-dimensional data and complex cross designs.
Also see the related packages, qtl2convert (for converting data among the R/qtl2, DOQTL, and R/qtl formats), and qtl2fst (for storing genotype probabilities on disk, for rapid access with reduced memory usage).
R/qtl2 contains a lot of functions. To identify functions of interest, see the organized list of R/qtl2 functions, with brief descriptions.
The input data file formats for R/qtl cannot handle complex crosses, and so for R/qtl2, we have defined a new format for the data files. We’ll describe it here briefly; for details, see the separate vignette on the input file format.
QTL mapping data consists of a set of tables of data: marker genotypes, phenotypes, marker maps, etc. In the new format, these different tables are in separate comma-delimited (CSV) files. In each file, the first column is a set of IDs for the rows, and the first row is a set of IDs for the columns. For example, the phenotype data file will have individual IDs in the first column and phenotype names in the first row.
A few important changes in the tabular data:
In addition to the set of CSV files with the primary data, we need a separate “control” file with various control parameters (or metadata), including the names of all of the other data files and the genotype codes used in the genotype data file. The control file is in a specific format using either YAML or JSON; these are human-readable text files for representing relatively complex data.
A big advantage of this control file scheme is that it greatly
simplifies the function for reading in the data. That function,
read_cross2()
, has a single argument: the name
(with path) of the control file. So you can read in data like this:
library(qtl2)
grav2 <- read_cross2("~/my_data/grav2.yaml")
The large number of files is a bit cumbersome, so we’ve made it possible to use a zip file containing all of the data files, and to read that zip file directly. There’s even a function for creating the zip file:
zip_datafiles("~/my_data/grav2.yaml")
This zip_datafiles()
function will read the control file
to identify all of the relevant data files and then zip them up into a
file with the same name and location, but with the extension
.zip
rather than .yaml
or
.json
.
To read the data back in, we use the same read_cross2()
function, providing the name (and path) of the zip file rather than the
control file.
grav2 <- read_cross2("~/my_data/grav2.zip")
This can even be done with remote files.
grav2 <- read_cross2("https://kbroman.org/qtl2/assets/sampledata/grav2/grav2.zip")
Of course, the other advantage of the zip file is that it is compressed and so smaller than the combined set of CSV files.
The control file may be confusing for some users. To assist in its
construction, there’s a function write_control_file()
that
takes the large set of control parameters as input and then writes the
YAML or JSON control file in the appropriate format.
The R/qtl2 web site includes sample data
files in the new format. Zipped versions of these datasets are
included with the qtl2
package and can be loaded into R using the read_cross2()
function.
In the qtl2 package
source, the sample zip files are located in qtl2/inst/extdata
.
In the installed version of the package, they are in
qtl2/extdata
, within whatever directory your R packages
were installed. The R function system.file()
can be used to
construct the path to these files.
For example, one of the sample data sets concerns a gravitropism
phenotype in a set of Arabidopsis recombinant inbred lines (RIL), from
Moore et
al. (2013) Genetics 195:1077-1086. The data are in
qtl2/extdata/grav2.zip
, which can be loaded as follows:
library(qtl2)
grav2 <- read_cross2( system.file("extdata", "grav2.zip", package="qtl2") )
Additional sample data sets, including data on Diversity Outbred (DO) mice, are available at https://github.com/rqtl/qtl2data.
The first basic task in QTL analysis is to calculate conditional
genotype probabilities, given the observed marker data, at each putative
QTL position. This is accomplished with the calc_genoprob()
function. Unlike the corresponding function in R/qtl, calc.genoprob()
, the
result is not inserted back into the input cross object, but is returned
as a list of three-dimensional arrays (one per chromosome). Each 3d
array of probabilities is arranged as individuals × genotypes ×
positions.
If we wish to perform QTL calculations at positions between markers
(so called “pseudomarkers”), we first need to insert such positions into
the genetic map with the function insert_pseudomarkers()
.
Unlike R/qtl, the map is kept separate
from the genotype probabilities.
We’ll use the iron dataset from Grant et al. (2006) Hepatology 44:174-185 (an intercross) as an example. We first load the data:
library(qtl2)
iron <- read_cross2( system.file("extdata", "iron.zip", package="qtl2") )
We then use insert_pseudomarkers()
to insert
pseudomarkers into the genetic map, which we grab from the
iron
object as iron$gmap
:
map <- insert_pseudomarkers(iron$gmap, step=1)
And next we use calc_genoprob()
to calculate the QTL
genotype probabilities.
pr <- calc_genoprob(iron, map, error_prob=0.002)
To speed up the calculations with large datasets on a multi-core
machine, you can use the argument cores
. With
cores=0
, the number of available cores will be detected via
parallel::detectCores()
. Otherwise, specify the number of
cores as a positive integer.
pr <- calc_genoprob(iron, map, error_prob=0.002, cores=4)
The genome scan functions (see below) use genotype probabilities as
well as a matrix of phenotypes. If you wished to perform a genome scan
via an additive allele model, you would first convert the genotype
probabilities to allele probabilities, using the function
genoprob_to_alleleprob()
.
apr <- genoprob_to_alleleprob(pr)
If you wish to perform a genome scan by a linear mixed model,
accounting for the relationships among individuals (in other words,
including a random polygenic effect), you’ll need to calculate a kinship
matrix for the individuals. This is accomplished with the
calc_kinship()
function. It takes the genotype
probabilities as input.
kinship <- calc_kinship(pr)
By default, the genotype probabilities are converted to allele
probabilities, and the kinship matrix is calculated as the proportion of
shared alleles. To use genotype probabilities instead, use
use_allele_probs=FALSE
. Further, by default we calculate
the kinship matrix using both the autosomes and the X chromosome. To
omit the X chromosome, use omit_x=TRUE
.
In calculating the kinship matrix, one may wish to eliminate the
effect of varying marker density across the genome, and only use the
probabilities along the grid of pseudomarkers (defined by the
step
argument to insert_pseudomarkers()
). To
do so, we need to first use calc_grid()
to determine the
grid of pseudomarkers, and then probs_to_grid()
to omit
probabilities for positions that are not on the grid.
grid <- calc_grid(iron$gmap, step=1)
pr_grid <- probs_to_grid(pr, grid)
kinship_grid <- calc_kinship(pr_grid)
If, for your linear mixed model genome scan, you wish to use the
“leave one chromosome out” (LOCO) method (scan each chromosome using a
kinship matrix that is calculated using data from all other
chromosomes), use type="loco"
in the call to
calc_kinship()
.
kinship_loco <- calc_kinship(pr, "loco")
On a multi-core machine, you can get some speed-up via the
cores
argument, as with calc_genoprob()
.
kinship_loco <- calc_kinship(pr, "loco", cores=4)
In a QTL scan of the X chromosome, special covariates (such as sex) may need to be included under the null hypothesis of no QTL, to avoid spurious evidence of linkage. (See Broman et al. (2006) Genetics 174:2151-2158.)
The particular X chromosome covariates depend on the cross, and can
be obtained with the function get_x_covar()
.
Xcovar <- get_x_covar(iron)
To perform a genome scan by Haley-Knott regression (Haley and Knott
1992), use the function scan1()
.
scan1()
takes as input the genotype probabilities, a
matrix of phenotypes, optional additive and interactive covariates, and
the special X chromosome covariates. Another option is to provide a
vector of weights.
out <- scan1(pr, iron$pheno, Xcovar=Xcovar)
On a multi-core machine, you can get some speed-up via the
cores
argument, as with calc_genoprob()
and
calc_kinship()
.
out <- scan1(pr, iron$pheno, Xcovar=Xcovar, cores=4)
The output of scan1()
is a matrix of LOD scores,
positions × phenotypes.
The function plot_scan1()
can be used to plot the LOD
curves. You can just write plot()
, as there’s an S3 method
plot.scan1()
and the output of scan1()
has
class "scan1"
. Use the argument lodcolumn
to
indicate which column to plot. Unlike the plot.scanone()
function in R/qtl, the
plot_scan1()
function will only plot one set of LOD curves
at a time, and you need to provide the marker/pseudomarker map (created
by insert_pseudomarkers()
).
par(mar=c(5.1, 4.1, 1.1, 1.1))
ymx <- maxlod(out) # overall maximum LOD score
plot(out, map, lodcolumn=1, col="slateblue", ylim=c(0, ymx*1.02))
plot(out, map, lodcolumn=2, col="violetred", add=TRUE)
legend("topleft", lwd=2, col=c("slateblue", "violetred"), colnames(out), bg="gray90")
The function find_peaks()
can be used to identify a set
of LOD peaks that exceed some threshold. It can also provide LOD support
or Bayes credible intervals, by using the arguments drop
(the amount to drop in the LOD support intervals) or prob
(the nominal coverage for the Bayes credible intervals).
You need to provide both the scan1()
output as well as
the marker/pseudomarker map.
find_peaks(out, map, threshold=4, drop=1.5)
## lodindex lodcolumn chr pos lod ci_lo ci_hi
## 1 1 liver 2 56.8 4.9576 48.1 73.2
## 2 1 liver 7 50.1 4.0508 13.1 53.6
## 3 1 liver 16 28.6 7.6816 6.6 40.4
## 4 2 spleen 8 13.6 4.3029 0.0 32.7
## 5 2 spleen 9 56.6 12.0632 53.6 61.2
The find_peaks()
function can also pick out multiple
peaks on a chromosome: each peak must exceed the chosen threshold, and
the argument peakdrop
indicates the amount that the LOD
curve must drop below the lowest of two adjacent peaks. Use this feature
with caution.
find_peaks(out, map, threshold=4, peakdrop=1.8, drop=1.5)
## lodindex lodcolumn chr pos lod ci_lo ci_hi
## 1 1 liver 2 56.8 4.9576 48.1 73.2
## 2 1 liver 7 25.1 4.0400 13.1 28.4
## 3 1 liver 7 50.1 4.0508 31.7 53.6
## 4 1 liver 16 28.6 7.6816 6.6 40.4
## 5 2 spleen 8 13.6 4.3029 0.0 32.7
## 6 2 spleen 9 56.6 12.0632 53.6 61.2
The functions lod_int()
and bayes_int()
can
be used to derive the LOD support or Bayes credible intervals for QTL,
for a specific chromosome and LOD score column. For example, to obtain
the Bayes interval for the locus on chromosome 9 for the second
phenotype (“spleen”):
bayes_int(out, map, lodcolumn=2, chr=9, prob=0.95)
## ci_lo pos ci_hi
## 1 53.6 56.6 61.2
Both lod_int()
and bayes_int()
take a
peakdrop
argument, if you wish to try to identify multiple
peaks on a chromosome. Again, use this feature with caution.
lod_int(out, map, lodcolumn=1, chr=7, peakdrop=1.8, drop=1.5)
## ci_lo pos ci_hi
## 1 13.1 25.1 28.4
## 2 31.7 50.1 53.6
Each row is a different peak; the columns are the lower interval endpoint, the estimated QTL position, and the upper interval endpoint.
By default in find_peaks()
, bayes_int()
,
and lod_int()
, the interval endpoints are at markers. This
is controlled with the argument expand2markers
, which
defaults to TRUE
. To allow the interval endpoints to be at
positions between markers (“pseudomarkers”), include
expand2markers=FALSE
.
lod_int(out, map, lodcolumn=1, chr=7, peakdrop=1.8, drop=1.5, expand2markers=FALSE)
## ci_lo pos ci_hi
## 1 13.1 25.1 28.4
## 2 34.1 50.1 53.6
To perform a genome scan using a linear mixed model, accounting for
relationships among individuals using a random polygenic effect, you
also use the function scan1()
; you just need to provide the
argument kinship
, a kinship matrix (or, for the LOCO
method, a list of kinship matrices).
out_pg <- scan1(pr, iron$pheno, kinship, Xcovar=Xcovar)
Again, on a multi-core machine, you can get some speed-up using the
cores
argument.
out_pg <- scan1(pr, iron$pheno, kinship, Xcovar=Xcovar, cores=4)
For the LOCO (leave one chromosome out) method, provide the list of
kinship matrices as obtained from calc_kinship()
with
method="loco"
.
out_pg_loco <- scan1(pr, iron$pheno, kinship_loco, Xcovar=Xcovar)
To plot the results, we again use plot_scan1()
, or just
type plot()
.
Here is a plot of the LOD scores, by Haley-Knott regression and the linear mixed model using either the standard kinship matrix or the LOCO method.
color <- c("slateblue", "violetred", "green3")
par(mar=c(4.1, 4.1, 1.6, 1.1))
ymx <- max(maxlod(out), maxlod(out_pg), maxlod(out_pg_loco))
for(i in 1:2) {
plot(out, map, lodcolumn=i, col=color[1], main=colnames(iron$pheno)[i],
ylim=c(0, ymx*1.02))
plot(out_pg, map, lodcolumn=i, col=color[2], add=TRUE)
plot(out_pg_loco, map, lodcolumn=i, col=color[3], add=TRUE, lty=2)
legend("topleft", lwd=2, col=color, c("H-K", "LMM", "LOCO"), bg="gray90", lty=c(1,1,2))
}
For the liver phenotype (top panel), the three methods give quite different results. The linear mixed model with an overall kinship matrix gives much lower LOD scores than the other two methods. On chromosomes with some evidence of a QTL, the LOCO method gives higher LOD scores than Haley-Knott, except on chromosome 16 where it gives lower LOD scores.
For the spleen phenotype (bottom panel), the linear mixed model with an overall kinship matrix again gives much lower LOD scores than the other two methods. However, in this case Haley-Knott regression and the LOCO method give quite similar results.
The genome scans above were performed assuming that the residual
variation followed a normal distribution. This will often provide
reasonable results even if the residuals are not normal, but an
important special case is that of a binary trait, with values 0 and 1,
which is best treated differently. The scan1()
function can
perform a genome scan with binary traits by logistic regression, using
the argument model="binary"
. (The default value for the
model
argument is "normal"
.) At present, we
can not account for relationships among individuals in this
analysis.
Let’s first turn our two phenotypes into binary traits by thresholding at the median. One would generally not do this in practice; this is just for illustration.
bin_pheno <- apply(iron$pheno, 2, function(a) as.numeric(a > median(a)))
rownames(bin_pheno) <- rownames(iron$pheno)
We now perform the genome scan as before, including
model="binary"
to indicate that the phenotypes are binary
traits with values 0 and 1.
out_bin <- scan1(pr, bin_pheno, Xcovar=Xcovar, model="binary")
Here is a plot of the two sets of LOD curves.
par(mar=c(5.1, 4.1, 1.1, 1.1))
ymx <- maxlod(out_bin)
plot(out_bin, map, lodcolumn=1, col="slateblue", ylim=c(0, ymx*1.02))
plot(out_bin, map, lodcolumn=2, col="violetred", add=TRUE)
legend("topleft", lwd=2, col=c("slateblue", "violetred"), colnames(out_bin), bg="gray90")
We can use find_peaks
as before.
find_peaks(out_bin, map, threshold=3.5, drop=1.5)
## lodindex lodcolumn chr pos lod ci_lo ci_hi
## 1 1 liver 2 56.8 3.6303 48.1 73.2
## 2 1 liver 8 38.0 3.5077 17.3 69.9
## 3 1 liver 15 49.2 3.7789 16.4 49.2
## 4 1 liver 16 27.6 7.5069 6.6 40.4
## 5 2 spleen 9 57.6 9.2528 53.6 61.2
To perform a permutation test to establish the statistical
significance of the results of a genome scan, use the function
scan1perm()
. (In R/qtl, a
single function, scanone()
, was used for both performing a
genome scan and for getting permutation-based significance thresholds,
but in R/qtl2, we’ve decided to
make two separate functions).
The scan1perm()
function takes the same arguments as
scan1()
, plus additional arguments to control the
permutations:
n_perm
is the number of permutation replicates.perm_Xsp
controls whether to perform autosome/X
chromosome specific permutations (with perm_Xsp=TRUE
) or
not (the default is to not).perm_strata
is a vector that defines the strata for a
stratified permutation test.chr_lengths
is a vector of chromosome lengths, used in
the case that perm_Xsp=TRUE
.As with scan1()
, you may provide a kinship matrix (or
vector of kinship matrices, for the “leave one chromosome out” (LOCO)
approach), in order to fit linear mixed models to account for
relationships among individuals (in other words, including a random
polygenic effect). If kinship
is unspecified, the function
performs ordinary Haley-Knott regression.
To perform a permutation test with the iron
data, we do
the following:
operm <- scan1perm(pr, iron$pheno, Xcovar=Xcovar, n_perm=1000)
Note the need to specify special covariates for the X chromosome (via
Xcovar
), to be included under the null hypothesis of no
QTL. And note that when these are provided, the default is to perform a
stratified permutation test, using strata defined by the rows in
Xcovar
. In general, when the X chromosome is considered,
one will wish to stratify at least by sex.
Also note that, as with scan1()
, you can speed up the
calculations on a multi-core machine by specifying the argument
cores
. With cores=0
, the number of available
cores will be detected via parallel::detectCores()
.
Otherwise, specify the number of cores as a positive integer. For large
datasets, be mindful of the amount of memory that will be needed; you
may need to use fewer than the maximum number of cores, to avoid going
beyond the available memory.
operm <- scan1perm(pr, iron$pheno, Xcovar=Xcovar, n_perm=1000, cores=0)
To get estimated significance thresholds, use the function
summary()
.
summary(operm)
## LOD thresholds (1000 permutations)
## liver spleen
## 0.05 3.46 3.46
The default is to return the 5% significance thresholds. Thresholds
for other (or for multiple) significance levels can be obtained via the
alpha
argument.
summary(operm, alpha=c(0.2, 0.05))
## LOD thresholds (1000 permutations)
## liver spleen
## 0.2 2.63 2.64
## 0.05 3.46 3.46
To obtain autosome/X chromosome-specific significance thresholds,
specify perm_Xsp=TRUE
. In this case, you need to provide
chromosome lengths, which may be obtained with the function
chr_lengths()
.
operm2 <- scan1perm(pr, iron$pheno, Xcovar=Xcovar, n_perm=1000,
perm_Xsp=TRUE, chr_lengths=chr_lengths(map))
Separate permutations are performed for the autosomes and X chromosome, and considerably more permutation replicates are needed for the X chromosome. The computations take about twice as much time. See Broman et al. (2006) Genetics 174:2151-2158.
The significance thresholds are again derived via
summary()
:
summary(operm2, alpha=c(0.2, 0.05))
## Autosome LOD thresholds (1000 permutations)
## liver spleen
## 0.2 2.65 2.54
## 0.05 3.42 3.22
##
## X chromosome LOD thresholds (28243 permutations)
## liver spleen
## 0.2 3.1 4.02
## 0.05 3.9 5.18
Permutations for a genome scan with a linear mixed model-based are
performed by specifying the kinship
argument. We can use
the “leave one chromosome out” (LOCO) method by providing
kinship_loco
, the list of kinship matrices calculated above
with calc_kinship()
.
operm3 <- scan1perm(pr, iron$pheno, kinship_loco, Xcovar=Xcovar, n_perm=1000,
perm_Xsp=TRUE, chr_lengths=chr_lengths(map))
Here are the estimated significance thresholds:
summary(operm3, alpha=c(0.2, 0.05))
## Autosome LOD thresholds (1000 permutations)
## liver spleen
## 0.2 2.64 2.62
## 0.05 3.29 3.29
##
## X chromosome LOD thresholds (28243 permutations)
## liver spleen
## 0.2 3.14 4.37
## 0.05 3.82 5.50
As with scan1
, we can use scan1perm
with
binary traits, using the argument model="binary"
. Again,
this can’t be used with a kinship matrix, but all of the other arguments
can be applied.
operm_bin <- scan1perm(pr, bin_pheno, Xcovar=Xcovar, model="binary",
n_perm=1000, perm_Xsp=TRUE, chr_lengths=chr_lengths(map))
Here are the estimated 5% and 20% significance thresholds.
summary(operm_bin, alpha=c(0.2, 0.05))
## Autosome LOD thresholds (1000 permutations)
## liver spleen
## 0.2 2.60 2.63
## 0.05 3.33 3.41
##
## X chromosome LOD thresholds (28243 permutations)
## liver spleen
## 0.2 3.16 3.06
## 0.05 3.86 3.77
The scan1()
function returns only LOD scores. To obtain
estimated QTL effects, use the function scan1coef()
. This
function takes a single phenotype and the genotype probabilities for a
single chromosome and returns a matrix with the estimated coefficients
at each putative QTL location along the chromosome.
For example, to get the estimated effects on chromosome 2 for the liver phenotype, we’d do the following:
c2eff <- scan1coef(pr[,"2"], iron$pheno[,"liver"])
The result is a matrix, 39 positions × 4 genotypes. To plot the
effects, use the function plot_coef()
. There is again an S3
method function plot.scan1coef()
, so one can just type
plot()
, but in either case you need to provide the map for
the relevant chromosome. Use the argument columns
to
indicate which coefficient columns to plot.
par(mar=c(4.1, 4.1, 1.1, 2.6), las=1)
col <- c("slateblue", "violetred", "green3")
plot(c2eff, map["2"], columns=1:3, col=col)
last_coef <- unclass(c2eff)[nrow(c2eff),] # pull out last coefficients
for(i in seq(along=last_coef))
axis(side=4, at=last_coef[i], names(last_coef)[i], tick=FALSE, col.axis=col[i])
The default is to provide phenotype averages for each genotype group. If instead you want additive and dominance effects, you can provide a square matrix of contrasts, as follows:
c2effB <- scan1coef(pr[,"2"], iron$pheno[,"liver"],
contrasts=cbind(mu=c(1,1,1), a=c(-1, 0, 1), d=c(0, 1, 0)))
The result will then contain the estimates of mu
,
a
, and d
. Here’s a plot of the additive and
dominance effects, which are in the second and third columns.
par(mar=c(4.1, 4.1, 1.1, 2.6), las=1)
plot(c2effB, map["2"], columns=2:3, col=col)
last_coef <- unclass(c2effB)[nrow(c2effB),2:3] # last two coefficients
for(i in seq(along=last_coef))
axis(side=4, at=last_coef[i], names(last_coef)[i], tick=FALSE, col.axis=col[i])
If you provide a kinship matrix to scan1coef()
, it fits
a linear mixed model (LMM) to account for a residual polygenic effect.
Here let’s use the kinship matrix from the LOCO method.
c2eff_pg <- scan1coef(pr[,"2"], iron$pheno[,"liver"], kinship_loco[["2"]])
Here’s a plot of the estimates.
par(mar=c(4.1, 4.1, 1.1, 2.6), las=1)
col <- c("slateblue", "violetred", "green3")
plot(c2eff_pg, map["2"], columns=1:3, col=col, ylab="Phenotype average")
last_coef <- unclass(c2eff_pg)[nrow(c2eff_pg),]
for(i in seq(along=last_coef))
axis(side=4, at=last_coef[i], names(last_coef)[i], tick=FALSE, col.axis=col[i])
You can also get estimated additive and dominance effects, using a matrix of contrasts.
c2effB_pg <- scan1coef(pr[,"2"], iron$pheno[,"liver"], kinship_loco[["2"]],
contrasts=cbind(mu=c(1,1,1), a=c(-1, 0, 1), d=c(-0.5, 1, -0.5)))
Here’s a plot of the results.
par(mar=c(4.1, 4.1, 1.1, 2.6), las=1)
plot(c2effB_pg, map["2"], columns=2:3, col=col)
last_coef <- unclass(c2effB_pg)[nrow(c2effB_pg),2:3]
for(i in seq(along=last_coef))
axis(side=4, at=last_coef[i], names(last_coef)[i], tick=FALSE, col.axis=col[i])
Another option for estimating the QTL effects is to treat them as
random effects and calculate Best Linear Unbiased Predictors (BLUPs).
This is particularly valuable for multi-parent populations such as the
Collaborative Cross and Diversity Outbred mice, where the large number
of possible genotypes at a QTL lead to considerable variability in the
effect estimates. To calculate BLUPs, use scan1blup()
; it
takes the same arguments as scan1coef()
, including the
option of a kinship matrix to account for a residual polygenic
effect.
c2blup <- scan1blup(pr[,"2"], iron$pheno[,"liver"], kinship_loco[["2"]])
Here is a plot of the BLUPs (as dashed curves) alongside the standard estimates. Note that the BLUPs are centered at 0, while the coefficient estimates are centered at the phenotype average.
par(mar=c(4.1, 4.1, 1.1, 2.6), las=1)
col <- c("slateblue", "violetred", "green3")
ylim <- range(c(c2blup, c2eff))+c(-1,1)
plot(c2eff, map["2"], columns=1:3, col=col, ylab="Phenotype average", ylim=ylim,
xlab="Chr 2 position")
plot(c2blup, map["2"], columns=1:3, col=col, add=TRUE, lty=2)
last_coef <- unclass(c2eff)[nrow(c2eff),]
for(i in seq(along=last_coef))
axis(side=4, at=last_coef[i], names(last_coef)[i], tick=FALSE, col.axis=col[i])
The scan1coef
function can also provide estimated QTL
effects for binary traits, with model="binary"
. (However,
scan1blup
has not yet been implemented for binary
traits.)
c2eff_bin <- scan1coef(pr[,"2"], bin_pheno[,"liver"], model="binary")
Here’s a plot of the effects. They’re a bit tricky to interpret, as their basically log odds ratios.
par(mar=c(4.1, 4.1, 1.1, 2.6), las=1)
col <- c("slateblue", "violetred", "green3")
plot(c2eff_bin, map["2"], columns=1:3, col=col)
last_coef <- unclass(c2eff_bin)[nrow(c2eff_bin),] # pull out last coefficients
for(i in seq(along=last_coef))
axis(side=4, at=last_coef[i], names(last_coef)[i], tick=FALSE, col.axis=col[i])
Finally, to plot the raw phenotypes against the genotypes at a single
putative QTL position, you can use the function plot_pxg()
.
This takes a vector of genotypes as produced by the
maxmarg()
function, which picks the most likely genotype
from a set of genotype probabilities, provided it is greater than some
specified value (the argument minprob
). Note that the
“marg
” in “maxmarg
” stands for “marginal”, as
this function is selecting the genotype at each position that has
maximum marginal probability.
For example, we could get inferred genotypes at the chr 2 QTL for the liver phenotype (at 28.6 cM) as follows:
g <- maxmarg(pr, map, chr=2, pos=28.6, return_char=TRUE)
We use return_char=TRUE
to have maxmarg()
return a vector of character strings with the genotype labels.
We then plot the liver phenotype against these genotypes as follows:
par(mar=c(4.1, 4.1, 0.6, 0.6))
plot_pxg(g, iron$pheno[,"liver"], ylab="Liver phenotype")
We can use swap_axes=TRUE
to have the phenotype on the
x-axis. And we can use SEmult=2
to include mean ± 2 SE
intervals.
par(mar=c(4.1, 4.1, 0.6, 0.6))
plot_pxg(g, iron$pheno[,"liver"], SEmult=2, swap_axes=TRUE, xlab="Liver phenotype")
For multi-parent crosses, it can be useful to collapse the genotype or allele probabilities according to the founder genotypes of the various SNPs in the region of a QTL.
To illustrate this sort of SNP association analysis, we’ll consider some Diversity Outbred mouse data. The Diversity Outcross (DO) mice are an advanced intercross population derived from the same eight founder strains as the Collaborative Cross (CC). See Svenson et al. (2012), Churchill et al. (2012), and Gatti et al. (2014).
We’ll consider a subset of the data from Recla et
al. (2014), available as part of the qtl2data github repository.
(The full data are in DO_Recla
;
the directory DOex
contains a reduced set, with just three chromosomes, one phenotype
(OF_immobile_pct
, percent immobile in the open field test),
and a reduced set of markers.
You can download the data from a single zip file, as follows:
file <- "https://raw.githubusercontent.com/rqtl/qtl2data/main/DOex/DOex.zip"
DOex <- read_cross2(file)
Let’s quickly whip through a basic analysis.
We first calculate genotype probabilities and convert them to allele probabilities. We’ll just use marker locations and not insert any pseudomarkers.
pr <- calc_genoprob(DOex, error_prob=0.002)
apr <- genoprob_to_alleleprob(pr)
We calculate kinship matrices (using the “LOCO” method, though with the caveat that here we are only considering genotypes on three chromosomes).
k <- calc_kinship(apr, "loco")
We create a numeric covariate for sex; be sure to include the individual IDs as names.
sex <- (DOex$covar$Sex == "male")*1
names(sex) <- rownames(DOex$covar)
Note that you can create the vector and assign the names in one step
using the basic R function setNames()
.
sex <- setNames( (DOex$covar$Sex == "male")*1, rownames(DOex$covar) )
We perform a genome scan with a linear mixed model (adjusting for a residual polygenic effect), with sex as an additive covariate.
out <- scan1(apr, DOex$pheno, k, sex)
Here’s a plot of the results.
par(mar=c(4.1, 4.1, 0.6, 0.6))
plot(out, DOex$gmap)
There’s a strong peak on chromosome 2. Let’s look at the QTL effects.
We estimate them with scan1coef()
. We need to subset the
allele probabilities and the list of kinship matrices.
coef_c2 <- scan1coef(apr[,"2"], DOex$pheno, k[["2"]], sex)
For the DO, with 8 QTL alleles, we can use the function
plot_coefCC
, which plots the 8 allele effects in the
“official” Collaborative Cross (CC) colors. (Well, actually
slightly modified colors, because I think the official colors
are kind of ugly.) The strong locus seems to be mostly due to the NZO
allele. Note that qtl2 package includes the vector of colors for the CC
as the object CCcolors
; there’s also a
CCorigcolors
object with the official colors.
par(mar=c(4.1, 4.1, 0.6, 0.6))
plot_coefCC(coef_c2, DOex$gmap["2"], bgcolor="gray95", legend="bottomleft")
If you provide plot_coefCC()
with the genome scan
output, it will display the LOD curve below the coefficient
estimates.
par(mar=c(4.1, 4.1, 0.6, 0.6))
plot_coefCC(coef_c2, DOex$gmap["2"], scan1_output=out, bgcolor="gray95", legend="bottomleft")
To perform SNP association analysis in the region of a QTL, we’ll need to grab data on all of the SNPs in the region, including their genotypes in the eight founder strains for the CC and DO populations. As a related task, we’ll also want to identify the genes in the region.
We want R/qtl2 to permit different ways of storing this information. For the CC and DO populations, we’ve prepared SQLite database files for the variants in the CC founders and for the MGI mouse gene annotations. But others might wish to use a different kind of database, or may wish to query an online database.
We provide a template for how to use R/qtl2 to connect to SNP and
gene databases, with the functions
create_variant_query_func()
and
create_gene_query_func()
. Each returns a function
for querying variant and gene databases, respectively. The query
functions that are returned take just three arguments (chr
,
start
, end end
), and themselves return a data
frame of variants on the one hand and genes on the other.
If you are analyzing the CC lines or a DO population, you can download the SQLite databases that we have prepared and made available on Figshare:
cc_variants.sqlite
doi:10.6084/m9.figshare.5280229.v3, variants in the Collaborative
Cross founders (3 GB)mouse_genes.sqlite
doi:10.6084/m9.figshare.5280238.v7 full set of mouse gene
annotations (530 MB)mouse_genes_mgi.sqlite
doi:10.6084/m9.figshare.5286019.v8 just the MGI mouse gene
annotations (12 MB)For example, to download cc_variants.sqlite
and
mouse_genes_mgi.sqlite
, we’d go to the figshare sites above
and determine the URLs for downloading the files, and then do the
following:
download.file("https://ndownloader.figshare.com/files/18533342", "cc_variants.sqlite")
download.file("https://ndownloader.figshare.com/files/17609252", "mouse_genes_mgi.sqlite")
Store them locally on your computer, wherever it’s convenient. For
example, I’ve placed them in ~/Data/CCdb/
.
To create a function for querying the CC variants, call
create_variant_query_func()
with the path to the
cc_variants.sqlite
file:
query_variants <- create_variant_query_func("~/Data/CCdb/cc_variants.sqlite")
To grab the variants in the interval 97-98 Mbp on chromosome 2, you’d then do the following:
variants_2_97.5 <- query_variants(2, 97, 98)
Similarly, to create a function for querying the MGI mouse gene
annotations, you call create_gene_query_func()
with the
path to the mouse_genes_mgi.sqlite
file:
query_genes <- create_gene_query_func("~/Data/CCdb/mouse_genes_mgi.sqlite")
To grab the genes overlapping the interval 97-98 Mbp on chromosome 2, you’d then do the following:
genes_2_97.5 <- query_genes(2, 97, 98)
The way we have set this up is a bit complicated, but it allows greatest flexibility on the part of the user. And for our own work, we like to have a local SQLite database, for rapid queries of SNPs and genes.
The above SQLite databases were prepared with mouse genome build 38 (mm10). If you are using the newer GRCm39 build, download the following database (created by Matt Vincent at the Jackson Lab), which includes both SNP/indel variants and genes:
To download from R, use:
download.file("https://figshare.com/ndownloader/files/40157572", "fv.2021.snps.db3")
For querying founder variant information, you can again use
create_variant_query_func()
, though you need to specify not
just the path to the SQLite file but also the id_field
, as
follows:
query_variants <- create_variant_query_func("~/Data/CCdb/fv.2021.snps.db3", id_field="variants_id")
variants_2_97.5 <- query_variants(2, 97, 98)
The variants database includes all SNPs and indels in the Sanger
resequencing data, including some with low confidence. If you want to
filter to those with full confidence, you could use the
filter
argument in
create_variant_query_func()
:
query_variants <- create_variant_query_func("~/Data/CCdb/fv.2021.snps.db3", id_field="variants_id",
filter="confidence==255")
variants_2_97.5 <- query_variants(2, 97, 98)
The fv.2021.snps.db3
file also contains ensembl gene information, but the
field names are different from the defaults, so use
create_gene_query_func()
as follows:
query_genes <- create_gene_query_func("~/Data/CCdb/fv.2021.snps.db3", chr_field="chromosome", name_field="symbol",
start_field="start_position", stop_field="end_position")
But you can then use the query_genes()
function in the
same way as before:
genes_2_97.5 <- query_genes(2, 97, 98)
Okay, now finally we get to the SNP associations. We have a large peak on chromosome 2, and we want to look at individual SNPs in the region of the locus.
Well, actually, we first need to find the location of the inferred
QTL. The peak LOD score on chromosome 2 occurs at 52.4 cM. But to find
nearby SNPs, we really want to know the Mbp position. The calculations
were only performed at the marker positions, and so can use
max()
, giving both the scan1()
output and the
physical map, and then pull out the position from the results.
peak_Mbp <- max(out, DOex$pmap)$pos
The marker is at 96.8 Mbp. We’ll focus on a 2 Mbp interval centered at 96.8 Mbp.
We can pull out the variants in the 2 Mbp interval centered at 96.8 on chr 2 using the query function we defined above:
variants <- query_variants(2, peak_Mbp - 1, peak_Mbp + 1)
There are 27355 variants in the interval, including 27108 SNPs, 127 indels, and 120 structural variants. We’re treating all of them as biallelic markers (all but the major allele in the eight founder strains binned into a single allele). In the following, we’re going to just say “SNP” rather than “variant”, even though the variants include indels and structural variants.
After identifying the variants in the interval of interest, we use our genotype probabilities and the founder SNP genotypes to infer the SNP genotypes for the DO mice. That is, at each SNP, we want to collapse the eight founder allele probabilities to two SNP allele probabilities, using the SNP genotypes of the founders.
We do this assuming that the genotype probabilities were calculated sufficiently densely that they can be assumed to be constant in intervals. With this assumption, we then:
We further create a look-up table relating the full set of SNPs to the reduced set (one of each observed SDP in each interval).
All of these steps are combined into a single function
scan1snps()
, which takes the genotype probabilities, a
physical map of those locations, the phenotype, the kinship matrix,
covariates, the query function for grabbing the SNPs in a given
interval, and the chromosome, start, and end positions for the interval.
(If insert_pseudomarkers()
was used to insert
pseudomarkers, you will need to use interp_map()
to get
interpolated Mbp positions for the pseudomarkers.)
out_snps <- scan1snps(pr, DOex$pmap, DOex$pheno, k[["2"]], sex, query_func=query_variants,
chr=2, start=peak_Mbp-1, end=peak_Mbp+1, keep_all_snps=TRUE)
The output is a list with two components: lod
is a
matrix of LOD scores (with a single column, since we’re using just one
phenotype), and snpinfo
is a data frame with SNP
information. With the argument keep_all_snps=TRUE
, the
snpinfo
data frame contains information about all
of the variants in the region with an index
column
indicating the equivalence classes.
The function plot_snpasso()
can be used to plot the
results, with points at each of the SNPs. The default is to plot
all SNPs. In this case, there are 27355 variants in the
region, but only 167 distinct ones.
par(mar=c(4.1, 4.1, 0.6, 0.6))
plot_snpasso(out_snps$lod, out_snps$snpinfo)
We can actually just type plot()
rather than
plot_snpasso()
, because with the snpinfo
table
in place of a genetic map, plot()
detects that a SNP
association plot should be created.
par(mar=c(4.1, 4.1, 0.6, 0.6))
plot(out_snps$lod, out_snps$snpinfo)
We can use our query_genes()
function to identify the
genes in the region, and plot_genes()
to plot their
locations. But also plot_snpasso()
can take the gene
locations with the argument genes
and then display them
below the SNP association results. Here, we are also highlighting the
top SNPs in the SNP association plot using the drop_hilit
argument. SNPs with LOD score within drop_hilit
of the
maximum are shown in pink.
genes <- query_genes(2, peak_Mbp - 1, peak_Mbp + 1)
par(mar=c(4.1, 4.1, 0.6, 0.6))
plot(out_snps$lod, out_snps$snpinfo, drop_hilit=1.5, genes=genes)
To get a table of the SNPs with the largest LOD scores, use the
function top_snps()
. This will show all SNPs with LOD score
within some amount (the default is 1.5) of the maximum SNP LOD score.
We’re going to display just a subset of the columns.
top <- top_snps(out_snps$lod, out_snps$snpinfo)
print(top[,c(1, 8:15, 20)], row.names=FALSE)
## snp A_J C57BL_6J 129S1_SvImJ NOD_ShiLtJ NZO_HlLtJ CAST_EiJ PWK_PhJ WSB_EiJ lod
## rs33297153 1 1 2 1 2 3 1 1 8.2151
## rs226751615 1 1 1 1 2 2 1 1 7.9496
## rs230653109 1 1 1 1 2 2 1 1 7.9496
## rs250407660 1 1 1 1 2 2 1 1 7.9496
## rs248216267 1 1 1 1 2 2 1 1 7.9496
## rs237564299 1 1 1 1 2 2 1 1 7.9496
## rs255011323 1 1 1 1 2 2 1 1 7.9496
## rs212448521 1 1 1 1 2 2 1 1 7.9496
## rs27379206 1 1 1 1 2 2 1 1 7.9496
## rs27379194 1 1 1 1 2 2 1 1 7.9496
## rs244484646 1 1 1 1 2 2 1 1 7.9496
## rs215855380 1 1 1 1 2 2 1 1 7.9496
## rs236942088 1 1 1 1 2 2 1 1 7.9496
## rs212414861 1 1 1 1 2 2 1 1 7.9496
## rs229578122 1 1 1 1 2 2 1 1 7.9496
## rs254318131 1 1 1 1 2 2 1 1 7.9496
## rs217679969 1 1 1 1 2 2 1 1 7.9496
## rs238404461 1 1 1 1 2 2 1 1 7.9496
## rs262749925 1 1 1 1 2 2 1 1 7.9496
## rs231282689 1 1 1 1 2 2 1 1 7.9496
## rs260286709 1 1 1 1 2 2 1 1 7.9496
## rs27396282 1 1 1 1 2 2 1 1 7.9496
## rs263841533 1 1 1 1 2 2 1 1 7.9496
## rs231205920 1 1 1 1 2 2 1 1 7.9496
## rs242885221 1 1 1 1 2 2 1 1 7.9496
## rs586746690 1 1 2 1 2 2 1 1 8.6351
## rs49002164 1 1 2 1 2 2 1 1 8.6351
## SV_2_96945406_96945414 1 1 1 1 2 3 1 1 8.2809
## rs244595995 1 1 2 1 2 2 1 1 8.6351
## SV_2_96993116_96993118 1 1 1 1 2 2 1 1 8.2809
The top SNPs all have 129, NZO, and CAST with a common allele, different from the other 5 founders. The next-best SNPs have just NZO and CAST with a common allele.
You can include a visualization of the strain distribution pattern of
the SNPs in the association plot with the argument
sdp_panel=TRUE
.
par(mar=c(4.1, 4.1, 0.6, 0.6))
plot(out_snps$lod, out_snps$snpinfo, drop_hilit=1.5, genes=genes, sdp_panel=TRUE)
The scan1snps()
function can also be used to perform a
genome-wide SNP association scan, by providing a variant query function
but leaving chr
, start
, and end
unspecified. In this case it’s maybe best to use
keep_all_snps=FALSE
(the default) and only save the index
SNPs.
out_gwas <- scan1snps(pr, DOex$pmap, DOex$pheno, k, sex, query_func=query_variants, cores=0)
We can make a Manhattan plot of the results as follows. We use
altcol
to define a color for alternate chromosomes and
gap=0
to have no gap between chromosomes.
par(mar=c(4.1, 4.1, 0.6, 0.6))
plot(out_gwas$lod, out_gwas$snpinfo, altcol="green4", gap=0)
Note that while there are LOD scores on the X chromosome that are as
large as those on chromosome 2, we’re allowing a genotype × sex
interaction on the X chromosome and so the test has 3 degrees of freedom
(versus 2 degrees of freedom on the autosomes) and so the LOD scores
will naturally be larger. If we’d used the allele probabilities
(apr
above, calculated from
genoprob_to_alleleprob()
) rather than the genotype
probabilities, we would be performing a test with 1 degree of freedom on
both the X chromosome and the autosomes.