R/qtl2 (aka qtl2) is a reimplementation of the QTL analysis software R/qtl, to better handle high-dimensional data and complex cross designs.
Also see the related packages, qtl2convert (for converting data among the R/qtl2, DOQTL, and R/qtl formats), and qtl2fst (for storing genotype probabilities on disk, for rapid access with reduced memory usage).
R/qtl2 contains a lot of functions. To identify functions of interest, see the organized list of R/qtl2 functions, with brief descriptions.
The input data file formats for R/qtl cannot handle complex crosses, and so for R/qtl2, we have defined a new format for the data files. We’ll describe it here briefly; for details, see the separate vignette on the input file format.
QTL mapping data consists of a set of tables of data: marker genotypes, phenotypes, marker maps, etc. In the new format, these different tables are in separate comma-delimited (CSV) files. In each file, the first column is a set of IDs for the rows, and the first row is a set of IDs for the columns. For example, the phenotype data file will have individual IDs in the first column and phenotype names in the first row.
A few important changes in the tabular data:
In addition to the set of CSV files with the primary data, we need a separate “control” file with various control parameters (or metadata), including the names of all of the other data files and the genotype codes used in the genotype data file. The control file is in a specific format using either YAML or JSON; these are human-readable text files for representing relatively complex data.
A big advantage of this control file scheme is that it greatly
simplifies the function for reading in the data. That function,
read_cross2()
, has a single argument: the name
(with path) of the control file. So you can read in data like this:
library(qtl2)
grav2 <- read_cross2("~/my_data/grav2.yaml")
The large number of files is a bit cumbersome, so we’ve made it possible to use a zip file containing all of the data files, and to read that zip file directly. There’s even a function for creating the zip file:
zip_datafiles("~/my_data/grav2.yaml")
This zip_datafiles()
function will read the control file
to identify all of the relevant data files and then zip them up into a
file with the same name and location, but with the extension
.zip
rather than .yaml
or
.json
.
To read the data back in, we use the same read_cross2()
function, providing the name (and path) of the zip file rather than the
control file.
grav2 <- read_cross2("~/my_data/grav2.zip")
This can even be done with remote files.
grav2 <- read_cross2("https://kbroman.org/qtl2/assets/sampledata/grav2/grav2.zip")
Of course, the other advantage of the zip file is that it is compressed and so smaller than the combined set of CSV files.
The control file may be confusing for some users. To assist in its
construction, there’s a function write_control_file()
that
takes the large set of control parameters as input and then writes the
YAML or JSON control file in the appropriate format.
The R/qtl2 web site includes sample data
files in the new format. Zipped versions of these datasets are
included with the qtl2
package and can be loaded into R using the read_cross2()
function.
In the qtl2 package
source, the sample zip files are located in qtl2/inst/extdata
.
In the installed version of the package, they are in
qtl2/extdata
, within whatever directory your R packages
were installed. The R function system.file()
can be used to
construct the path to these files.
For example, one of the sample data sets concerns a gravitropism
phenotype in a set of Arabidopsis recombinant inbred lines (RIL), from
Moore et
al. (2013) Genetics 195:1077-1086. The data are in
qtl2/extdata/grav2.zip
, which can be loaded as follows:
library(qtl2)
grav2 <- read_cross2( system.file("extdata", "grav2.zip", package="qtl2") )
Additional sample data sets, including data on Diversity Outbred (DO) mice, are available at https://github.com/rqtl/qtl2data.
The first basic task in QTL analysis is to calculate conditional
genotype probabilities, given the observed marker data, at each putative
QTL position. This is accomplished with the calc_genoprob()
function. Unlike the corresponding function in R/qtl, calc.genoprob()
, the
result is not inserted back into the input cross object, but is returned
as a list of three-dimensional arrays (one per chromosome). Each 3d
array of probabilities is arranged as individuals × genotypes ×
positions.
If we wish to perform QTL calculations at positions between markers
(so called “pseudomarkers”), we first need to insert such positions into
the genetic map with the function insert_pseudomarkers()
.
Unlike R/qtl, the map is kept separate
from the genotype probabilities.
We’ll use the iron dataset from Grant et al. (2006) Hepatology 44:174-185 (an intercross) as an example. We first load the data:
library(qtl2)
iron <- read_cross2( system.file("extdata", "iron.zip", package="qtl2") )
We then use insert_pseudomarkers()
to insert
pseudomarkers into the genetic map, which we grab from the
iron
object as iron$gmap
:
map <- insert_pseudomarkers(iron$gmap, step=1)
And next we use calc_genoprob()
to calculate the QTL
genotype probabilities.
pr <- calc_genoprob(iron, map, error_prob=0.002)
To speed up the calculations with large datasets on a multi-core
machine, you can use the argument cores
. With
cores=0
, the number of available cores will be detected via
parallel::detectCores()
. Otherwise, specify the number of
cores as a positive integer.
pr <- calc_genoprob(iron, map, error_prob=0.002, cores=4)
The genome scan functions (see below) use genotype probabilities as
well as a matrix of phenotypes. If you wished to perform a genome scan
via an additive allele model, you would first convert the genotype
probabilities to allele probabilities, using the function
genoprob_to_alleleprob()
.
apr <- genoprob_to_alleleprob(pr)
If you wish to perform a genome scan by a linear mixed model,
accounting for the relationships among individuals (in other words,
including a random polygenic effect), you’ll need to calculate a kinship
matrix for the individuals. This is accomplished with the
calc_kinship()
function. It takes the genotype
probabilities as input.
kinship <- calc_kinship(pr)
By default, the genotype probabilities are converted to allele
probabilities, and the kinship matrix is calculated as the proportion of
shared alleles. To use genotype probabilities instead, use
use_allele_probs=FALSE
. Further, by default we calculate
the kinship matrix using both the autosomes and the X chromosome. To
omit the X chromosome, use omit_x=TRUE
.
In calculating the kinship matrix, one may wish to eliminate the
effect of varying marker density across the genome, and only use the
probabilities along the grid of pseudomarkers (defined by the
step
argument to insert_pseudomarkers()
). To
do so, we need to first use calc_grid()
to determine the
grid of pseudomarkers, and then probs_to_grid()
to omit
probabilities for positions that are not on the grid.
grid <- calc_grid(iron$gmap, step=1)
pr_grid <- probs_to_grid(pr, grid)
kinship_grid <- calc_kinship(pr_grid)
If, for your linear mixed model genome scan, you wish to use the
“leave one chromosome out” (LOCO) method (scan each chromosome using a
kinship matrix that is calculated using data from all other
chromosomes), use type="loco"
in the call to
calc_kinship()
.
kinship_loco <- calc_kinship(pr, "loco")
On a multi-core machine, you can get some speed-up via the
cores
argument, as with calc_genoprob()
.
kinship_loco <- calc_kinship(pr, "loco", cores=4)
In a QTL scan of the X chromosome, special covariates (such as sex) may need to be included under the null hypothesis of no QTL, to avoid spurious evidence of linkage. (See Broman et al. (2006) Genetics 174:2151-2158.)
The particular X chromosome covariates depend on the cross, and can
be obtained with the function get_x_covar()
.
Xcovar <- get_x_covar(iron)
To perform a genome scan by Haley-Knott regression (Haley and Knott
1992), use the function scan1()
.
scan1()
takes as input the genotype probabilities, a
matrix of phenotypes, optional additive and interactive covariates, and
the special X chromosome covariates. Another option is to provide a
vector of weights.
out <- scan1(pr, iron$pheno, Xcovar=Xcovar)
On a multi-core machine, you can get some speed-up via the
cores
argument, as with calc_genoprob()
and
calc_kinship()
.
out <- scan1(pr, iron$pheno, Xcovar=Xcovar, cores=4)
The output of scan1()
is a matrix of LOD scores,
positions × phenotypes.
The function plot_scan1()
can be used to plot the LOD
curves. You can just write plot()
, as there’s an S3 method
plot.scan1()
and the output of scan1()
has
class "scan1"
. Use the argument lodcolumn
to
indicate which column to plot. Unlike the plot.scanone()
function in R/qtl, the
plot_scan1()
function will only plot one set of LOD curves
at a time, and you need to provide the marker/pseudomarker map (created
by insert_pseudomarkers()
).
par(mar=c(5.1, 4.1, 1.1, 1.1))
ymx <- maxlod(out) # overall maximum LOD score
plot(out, map, lodcolumn=1, col="slateblue", ylim=c(0, ymx*1.02))
plot(out, map, lodcolumn=2, col="violetred", add=TRUE)
legend("topleft", lwd=2, col=c("slateblue", "violetred"), colnames(out), bg="gray90")
The function find_peaks()
can be used to identify a set
of LOD peaks that exceed some threshold. It can also provide LOD support
or Bayes credible intervals, by using the arguments drop
(the amount to drop in the LOD support intervals) or prob
(the nominal coverage for the Bayes credible intervals).
You need to provide both the scan1()
output as well as
the marker/pseudomarker map.
find_peaks(out, map, threshold=4, drop=1.5)
## lodindex lodcolumn chr pos lod ci_lo ci_hi
## 1 1 liver 2 56.8 4.9576 48.1 73.2
## 2 1 liver 7 50.1 4.0508 13.1 53.6
## 3 1 liver 16 28.6 7.6816 6.6 40.4
## 4 2 spleen 8 13.6 4.3029 0.0 32.7
## 5 2 spleen 9 56.6 12.0632 53.6 61.2
The find_peaks()
function can also pick out multiple
peaks on a chromosome: each peak must exceed the chosen threshold, and
the argument peakdrop
indicates the amount that the LOD
curve must drop below the lowest of two adjacent peaks. Use this feature
with caution.
find_peaks(out, map, threshold=4, peakdrop=1.8, drop=1.5)
## lodindex lodcolumn chr pos lod ci_lo ci_hi
## 1 1 liver 2 56.8 4.9576 48.1 73.2
## 2 1 liver 7 25.1 4.0400 13.1 28.4
## 3 1 liver 7 50.1 4.0508 31.7 53.6
## 4 1 liver 16 28.6 7.6816 6.6 40.4
## 5 2 spleen 8 13.6 4.3029 0.0 32.7
## 6 2 spleen 9 56.6 12.0632 53.6 61.2
The functions lod_int()
and bayes_int()
can
be used to derive the LOD support or Bayes credible intervals for QTL,
for a specific chromosome and LOD score column. For example, to obtain
the Bayes interval for the locus on chromosome 9 for the second
phenotype (“spleen”):
bayes_int(out, map, lodcolumn=2, chr=9, prob=0.95)
## ci_lo pos ci_hi
## 1 53.6 56.6 61.2
Both lod_int()
and bayes_int()
take a
peakdrop
argument, if you wish to try to identify multiple
peaks on a chromosome. Again, use this feature with caution.
lod_int(out, map, lodcolumn=1, chr=7, peakdrop=1.8, drop=1.5)
## ci_lo pos ci_hi
## 1 13.1 25.1 28.4
## 2 31.7 50.1 53.6
Each row is a different peak; the columns are the lower interval endpoint, the estimated QTL position, and the upper interval endpoint.
By default in find_peaks()
, bayes_int()
,
and lod_int()
, the interval endpoints are at markers. This
is controlled with the argument expand2markers
, which
defaults to TRUE
. To allow the interval endpoints to be at
positions between markers (“pseudomarkers”), include
expand2markers=FALSE
.
lod_int(out, map, lodcolumn=1, chr=7, peakdrop=1.8, drop=1.5, expand2markers=FALSE)
## ci_lo pos ci_hi
## 1 13.1 25.1 28.4
## 2 34.1 50.1 53.6
To perform a genome scan using a linear mixed model, accounting for
relationships among individuals using a random polygenic effect, you
also use the function scan1()
; you just need to provide the
argument kinship
, a kinship matrix (or, for the LOCO
method, a list of kinship matrices).
out_pg <- scan1(pr, iron$pheno, kinship, Xcovar=Xcovar)
Again, on a multi-core machine, you can get some speed-up using the
cores
argument.
out_pg <- scan1(pr, iron$pheno, kinship, Xcovar=Xcovar, cores=4)
For the LOCO (leave one chromosome out) method, provide the list of
kinship matrices as obtained from calc_kinship()
with
method="loco"
.
out_pg_loco <- scan1(pr, iron$pheno, kinship_loco, Xcovar=Xcovar)
To plot the results, we again use plot_scan1()
, or just
type plot()
.
Here is a plot of the LOD scores, by Haley-Knott regression and the linear mixed model using either the standard kinship matrix or the LOCO method.
color <- c("slateblue", "violetred", "green3")
par(mar=c(4.1, 4.1, 1.6, 1.1))
ymx <- max(maxlod(out), maxlod(out_pg), maxlod(out_pg_loco))
for(i in 1:2) {
plot(out, map, lodcolumn=i, col=color[1], main=colnames(iron$pheno)[i],
ylim=c(0, ymx*1.02))
plot(out_pg, map, lodcolumn=i, col=color[2], add=TRUE)
plot(out_pg_loco, map, lodcolumn=i, col=color[3], add=TRUE, lty=2)
legend("topleft", lwd=2, col=color, c("H-K", "LMM", "LOCO"), bg="gray90", lty=c(1,1,2))
}