## R/qtl2 user guide

R/qtl2 (aka qtl2) is a reimplementation of the QTL analysis software R/qtl, to better handle high-dimensional data and complex cross designs.

We expect that basic analyses with R/qtl2 will generally be performed in “batch” (for example, on a cluster) rather than interactively. And so the software is split into three parts: qtl2geno for genotype probability calculations, qtl2scan for QTL scans, and qtl2plot for data visualization. A further package, qtl2convert, contains functions for converting data among the R/qtl2, DOQTL, and R/qtl formats, for example to convert genotype probabilities produced by DOQTL to the format needed by qtl2scan, or to convert qtl2scan results to the format produced by `scanone`

in R/qtl, so that they may be graphed with the R/qtl functions.

## Installation

R/qtl2 is not yet available on CRAN, but it can be installed from a mini-CRAN at rqtl.org.

`install.packages("qtl2", repos="https://rqtl.org/qtl2cran")`

The qtl2 package is inspired by the tidyverse package; it is basically empty, but when you install it, the qtl2geno, qtl2scan, qtl2plot, and qtl2convert packages, plus a bunch of dependencies, will be installed.

Alternatively, you can install R/qtl2 from its source on GitHub. (But note that compiling the C++ code can be rather slow.)

On *Windows*, you’ll need Rtools.

On *Mac OS X*, you’ll need the command-line developer tools, as well as gfortran.

You then need to install the devtools package, plus a set of package dependencies: yaml, jsonlite, data.table, and RcppEigen. (Additional, secondary dependencies will also be installed.)

`install.packages(c("devtools", "yaml", "jsonlite", "data.table", "RcppEigen"))`

Finally, install R/qtl2 using `devtools::install_github()`

.

```
library(devtools)
install_github("rqtl/qtl2")
```

## Data file format

The input data file formats for R/qtl cannot handle complex crosses, and so for R/qtl2, we have defined a new format for the data files. We’ll describe it here briefly; for details, see the separate vignette on the input file format.

QTL mapping data consists of a set of tables of data: marker genotypes, phenotypes, marker maps, etc. In the new format, these different tables are in separate comma-delimited (CSV) files. In each file, the first column is a set of IDs for the rows, and the first row is a set of IDs for the columns. For example, the phenotype data file will have individual IDs in the first column and phenotype names in the first row.

A few important changes in the tabular data:

- We will use not just the genetic marker map, but also a physical map (if available).
- Previously, phenotypes and covariates were combined. In the new format, we separate numeric phenotypes from the often non-numeric covariates.
- We define a table of “phenotype covariates.” These are metadata describing the phenotypes. For example, in the case of a phenotype measured over time, one column in the phenotype covariate data could be the time of measurement. For gene expression data, we would have columns representing chromosome and physical position of genes, as well as gene IDs.

In addition to the set of CSV files with the primary data, we need a separate “control” file with various control parameters (or metadata), including the names of all of the other data files and the genotype codes used in the genotype data file. The control file is in a specific format using either YAML or JSON; these are human-readable text files for representing relatively complex data.

A big advantage of this control file scheme is that it greatly simplifies the function for reading in the data. That function, `read_cross2()`

, has a *single* argument: the name (with path) of the control file. So you can read in data like this:

```
library(qtl2geno)
grav2 <- read_cross2("~/my_data/grav2.yaml")
```

*Note*: it can sometimes confusing which packages you need to load. If you install the (largely empty) qtl2 package, you can type just

`library(qtl2)`

and the three main packages, qtl2geno, qtl2scan, and qtl2plot, will be loaded.

The large number of files is a bit cumbersome, so we’ve made it possible to use a zip file containing all of the data files, and to read that zip file directly. There’s even a function for creating the zip file:

`zip_datafiles("~/my_data/grav2.yaml")`

This `zip_datafiles()`

function will read the control file to identify all of the relevant data files and then zip them up into a file with the same name and location, but with the extension `.zip`

rather than `.yaml`

or `.json`

.

To read the data back in, we use the same `read_cross2()`

function, providing the name (and path) of the zip file rather than the control file.

`grav2 <- read_cross2("~/my_data/grav2.zip")`

This can even be done with remote files.

`grav2 <- read_cross2("http://kbroman.org/qtl2/assets/sampledata/grav2/grav2.zip")`

Of course, the other advantage of the zip file is that it is *compressed* and so smaller than the combined set of CSV files.

The control file may be confusing for some users. To assist in its construction, there’s a function `write_control_file()`

that takes the large set of control parameters as input and then writes the YAML control file in the appropriate format.

## Sample data sets

The R/qtl2 web site includes sample data files in the new format. Zipped versions of these datasets are included with the qtl2geno package and can be loaded into R using the `read_cross2()`

function.

In the qtl2geno package source, the sample zip files are located in `qtl2geno/inst/extdata`

. In the installed version of the package, they are in `qtl2geno/extdata`

, within whatever directory your R packages were installed. The R function `system.file()`

can be used to construct the path to these files.

For example, one of the sample data sets concerns a gravitropism phenotype in a set of Arabidopsis recombinant inbred lines (RIL), from Moore et al. (2013) Genetics 195:1077-1086. The data are in `qtl2geno/extdata/grav2.zip`

, which can be loaded as follows:

```
library(qtl2geno)
grav2 <- read_cross2( system.file("extdata", "grav2.zip", package="qtl2geno") )
```

Additional sample data sets, including data on Diversity Outbred (DO) mice, are available at https://github.com/rqtl/qtl2data.

## Calculating genotype probabilities

The first basic task in QTL analysis is to calculate conditional genotype probabilities, given the observed marker data, at each putative QTL position. This is accomplished with the `calc_genoprob()`

function in the qtl2geno package. Unlike the corresponding function in R/qtl, `calc.genoprob()`

, the result is not inserted back into the input cross object, but is returned as a list of three-dimensional arrays (one per chromosome). Each 3d array of probabilities is arranged as individuals × genotypes × positions.

If we wish to perform QTL calculations at positions between markers (so called “pseudomarkers”), we first need to insert such positions into the genetic map with the function `insert_pseudomarkers()`

. Unlike [R/qtl], the map is kept separate from the genotype probabilities.

We’ll use the iron dataset from Grant et al. (2006) Hepatology 44:174-185 (an intercross) as an example. We first load the data:

```
library(qtl2geno)
iron <- read_cross2( system.file("extdata", "iron.zip", package="qtl2geno") )
```

(*Note*: you can use `library(qtl2)`

to load the three main packages, qtl2geno, qtl2scan, and qtl2plot, all at once.)

We then use `insert_pseudomarkers()`

to insert pseudomarkers into the genetic map, which we grab from the `iron`

object as `iron$gmap`

:

`map <- insert_pseudomarkers(iron$gmap, step=1)`

And next we use `calc_genoprob()`

to calculate the QTL genotype probabilities.

`pr <- calc_genoprob(iron, map, err=0.002)`

To speed up the calculations with large datasets on a multi-core machine, you can use the argument `cores`

. With `cores=0`

, the number of available cores will be detected via `parallel::detectCores()`

. Otherwise, specify the number of cores as a positive integer.

`pr <- calc_genoprob(iron, map, err=0.002, cores=4)`

The genome scan functions (see below) use genotype probabilities as well as a matrix of phenotypes. If you wished to perform a genome scan via an additive allele model, you would first convert the genotype probabilities to allele probabilities, using the function `genoprob_to_alleleprob()`

.

`apr <- genoprob_to_alleleprob(pr)`

## Calculating a kinship matrix

If you wish to perform a genome scan by a linear mixed model, accounting for the relationships among individuals (in other words, including a random polygenic effect), you’ll need to calculate a kinship matrix for the individuals. This is accomplished with the `calc_kinship()`

function in qtl2geno. It takes the genotype probabilities as input.

`kinship <- calc_kinship(pr)`

By default, the genotype probabilities are converted to allele probabilities, and the kinship matrix is calculated as the proportion of shared alleles. To use genotype probabilities instead, use `use_allele_probs=FALSE`

. Further, by default we omit the X chromosome and only use the autosomes. To include the X chromosome, use `omit_x=FALSE`

.

In calculating the kinship matrix, one may wish to eliminate the effect of varying marker density across the genome, and only use the probabilities along the grid of pseudomarkers (defined by the `step`

argument to `insert_pseudomarkers()`

. To do so, we need to first use `calc_grid()`

to determine the grid of pseudomarkers, and then `probs_to_grid()`

to probabilities for positions that are not on the grid.

```
grid <- calc_grid(iron$gmap, step=1)
pr_grid <- probs_to_grid(pr, grid)
kinship_grid <- calc_kinship(pr_grid)
```

If, for your linear mixed model genome scan, you wish to use the “leave one chromosome out” (LOCO) method (scan each chromosome using a kinship matrix that is calculated using data from all other chromosomes), use `type="loco"`

in the call to `calc_kinship()`

.

`kinship_loco <- calc_kinship(pr, "loco")`

On a multi-core machine, you can get some speed-up via the `cores`

argument, as with `calc_genoprob()`

.

`kinship_loco <- calc_kinship(pr, "loco", cores=4)`

## Special covariates for the X chromosome

In a QTL scan of the X chromosome, special covariates (such as sex) may need to be included under the null hypothesis of no QTL, to avoid spurious evidence of linkage. (See Broman et al. (2006) Genetics 174:2151-2158.)

The particular X chromosome covariates depends on the cross, and can be obtained with the qtl2geno function `get_x_covar()`

.

`Xcovar <- get_x_covar(iron)`

## Performing a genome scan

To perform a genome scan by Haley-Knott regression (Haley and Knott 1992), use the function `scan1()`

in qtl2scan. (The functions above were all from qtl2geno; from here forward, the functions are all from qtl2scan.)

`scan1()`

takes as input the genotype probabilities, a matrix of phenotypes, and then optional additive and interactive covariates, and the special X chromosome covariates. Another option is to provide a vector of weights.

```
library(qtl2scan)
out <- scan1(pr, iron$pheno, Xcovar=Xcovar)
```

On a multi-core machine, you can get some speed-up via the `cores`

argument, as with `calc_genoprob()`

and `calc_kinship()`

.

`out <- scan1(pr, iron$pheno, Xcovar=Xcovar, cores=4)`

The output of `scan1()`

is a matrix of LOD scores, positions × phenotypes. (Well, actually, the output is a list including `"lod"`

which is this matrix of LOD scores, but also including the genetic map and other details.)

The function `plot_scan1()`

in the qtl2plot package can be used to plot the LOD curves. You can just write `plot()`

, as there’s an S3 method `plot.scan1()`

and the output of `scan1()`

has class `"scan1"`

. Use the argument `lodcolumn`

to indicate which column to plot. Unlike the `plot.scanone()`

function in R/qtl, the `plot_scan1()`

function will only plot one set of LOD curves at a time, and you need to provide the marker/pseudomarker map (created by `insert_pseudomarkers()`

).

```
library(qtl2plot)
par(mar=c(5.1, 4.1, 1.1, 1.1))
ymx <- maxlod(out) # overall maximum LOD score
plot(out, map, lodcolumn=1, col="slateblue", ylim=c(0, ymx*1.02))
plot(out, map, lodcolumn=2, col="violetred", add=TRUE)
legend("topleft", lwd=2, col=c("slateblue", "violetred"), colnames(out), bg="gray90")
```

## Finding LOD peaks

The function `find_peaks()`

in the qtl2scan package can be used to identify a set of LOD peaks that exceed some threshold. It can also provide LOD support or Bayes credible intervals, by using the arguments `drop`

(the amount to drop in the LOD support intervals) or `prob`

(the nominal coverage for the Bayes credible intervals).

You need to provide both the `scan1()`

output as well as the marker/pseudomarker map.

`find_peaks(out, map, threshold=4, drop=1.5)`

```
## lodindex lodcolumn chr pos lod ci_lo ci_hi
## 1 1 liver 2 56.8 4.957564 48.1 73.2
## 2 1 liver 7 50.1 4.050766 13.1 53.6
## 3 1 liver 16 28.6 7.681569 6.6 40.4
## 4 2 spleen 8 13.6 4.302919 0.0 32.7
## 5 2 spleen 9 56.6 12.063226 53.6 61.2
```

The `find_peaks()`

function can also pick out multiple peaks on a chromosome: each peak must exceed the chosen threshold, and the argument `peakdrop`

indicates the amount that the LOD curve must drop between the lowest of two adjacent peaks. Use this feature with caution.

`find_peaks(out, map, threshold=4, peakdrop=1.8, drop=1.5)`

```
## lodindex lodcolumn chr pos lod ci_lo ci_hi
## 1 1 liver 2 56.8 4.957564 48.1 73.2
## 2 1 liver 7 25.1 4.040021 13.1 28.4
## 3 1 liver 7 50.1 4.050766 31.7 53.6
## 4 1 liver 16 28.6 7.681569 6.6 40.4
## 5 2 spleen 8 13.6 4.302919 0.0 32.7
## 6 2 spleen 9 56.6 12.063226 53.6 61.2
```

The functions `lod_int()`

and `bayes_int()`

can be used to derive the LOD support or Bayes credible intervals for QTL, for a specific chromosome and LOD score column. For example, to obtain the Bayes interval for the locus on chromosome 9 for the second phenotype (“spleen”):

`bayes_int(out, map, lodcolumn=2, chr=9, prob=0.95)`

```
## ci_lo pos ci_hi
## 1 53.6 56.6 61.2
```

Both `lod_int()`

and `bayes_int()`

take a `peakdrop`

argument, if you wish to try to identify multiple peaks on a chromosome. Again, use this feature with caution.

`lod_int(out, map, lodcolumn=1, chr=7, peakdrop=1.8, drop=1.5)`

```
## ci_lo pos ci_hi
## 1 13.1 25.1 28.4
## 2 31.7 50.1 53.6
```

Each row is a different peak; the columns are the lower interval endpoint, the estimated QTL position, and the upper interval endpoint.

## Performing a genome scan with a linear mixed model

To perform a genome scan using a linear mixed model, accounting for relationships among individuals using a random polygenic effect, you also use the function `scan1`

; you just need to provide the argument `kinship`

, a kinship matrix (or, for the LOCO method, a list of kinship matrices).

`out_pg <- scan1(pr, iron$pheno, kinship, Xcovar=Xcovar)`

Again, on a multi-core machine, you can get some speed-up using the `cores`

argument.

`out_pg <- scan1(pr, iron$pheno, kinship, Xcovar=Xcovar, cores=4)`

For the LOCO (leave one chromosome out) method, provide the list of kinship matrices as obtained from `calc_kinship()`

with `method="loco"`

.

`out_pg_loco <- scan1(pr, iron$pheno, kinship_loco, Xcovar=Xcovar)`

To plot the results, we again use `plot_scan1()`

from the qtl2plot package, or just type `plot()`

.

Here is a plot of the LOD scores, by Haley-Knott regression and the linear mixed model using either the standard kinship matrix or the LOCO method.

```
color <- c("slateblue", "violetred", "green3")
par(mar=c(4.1, 4.1, 1.6, 1.1))
ymx <- max(maxlod(out), maxlod(out_pg), maxlod(out_pg_loco))
for(i in 1:2) {
plot(out, map, lodcolumn=i, col=color[1], main=colnames(iron$pheno)[i],
ylim=c(0, ymx*1.02))
plot(out_pg, map, lodcolumn=i, col=color[2], add=TRUE)
plot(out_pg_loco, map, lodcolumn=i, col=color[3], add=TRUE, lty=2)
legend("topleft", lwd=2, col=color, c("H-K", "LMM", "LOCO"), bg="gray90", lty=c(1,1,2))
}
```

For the liver phenotype (top panel), the three methods give quite different results. The linear mixed model with an overall kinship matrix gives much lower LOD scores than the other two methods. On chromosomes with some evidence of a QTL, the LOCO method gives higher LOD scores than Haley-Knott, except on chromosome 16 where it gives lower LOD scores.

For the spleen phenotype (bottom panel), the linear mixed model with an overall kinship matrix again gives much lower LOD scores than the other two methods. However, in this case Haley-Knott regression and the LOCO method give quite similar results.

## Performing a genome scan with binary traits

The genome scans above were performed assuming that the residual variation followed a normal distribution. This will often provide reasonable results even if the residuals are not normal, but an important special case is that of a binary trait, with values 0 and 1, which is best treated differently. The `scan1`

function can perform a genome scan with binary traits by logistic regression, using the argument `model="binary"`

. (The default value for the `model`

argument is `"normal"`

.) At present, we *can not* account for relationships among individuals in this analysis.

Let’s first turn our two phenotypes into binary traits by thresholding at the median. One would generally *not* do this in practice; this is just for illustration.

```
bin_pheno <- apply(iron$pheno, 2, function(a) as.numeric(a > median(a)))
rownames(bin_pheno) <- rownames(iron$pheno)
```

We now perform the genome scan as before, including `model="binary"`

to indicates that the phenotypes are binary traits with values 0 and 1.

`out_bin <- scan1(pr, bin_pheno, Xcovar=Xcovar, model="binary")`

Here is a plot of the two LOD curves.

```
par(mar=c(5.1, 4.1, 1.1, 1.1))
ymx <- maxlod(out_bin)
plot(out_bin, map, lodcolumn=1, col="slateblue", ylim=c(0, ymx*1.02))
plot(out_bin, map, lodcolumn=2, col="violetred", add=TRUE)
legend("topleft", lwd=2, col=c("slateblue", "violetred"), colnames(out_bin), bg="gray90")
```

We can use `find_peaks`

as before.

`find_peaks(out_bin, map, threshold=3.5, drop=1.5)`

```
## lodindex lodcolumn chr pos lod ci_lo ci_hi
## 1 1 liver 2 56.8 3.630279 48.1 73.2
## 2 1 liver 8 38.0 3.507654 17.3 69.9
## 3 1 liver 15 49.2 3.778869 16.4 49.2
## 4 1 liver 16 27.6 7.506900 6.6 40.4
## 5 2 spleen 9 57.6 9.252779 53.6 61.2
```

## Performing a permutation test

To perform a permutation test to establish the statistical significance of the results of a genome scan, use the function `scan1perm()`

. (In R/qtl, a single function, `scanone()`

, was used for both performing a genome scan and for getting permutation-based significance thresholds, but in R/qtl2, we’ve decided to make two separate functions).

The `scan1perm()`

function takes the same arguments as `scan1()`

, plus additional arguments to control the permutations:

`n_perm`

is the number of permutation replicates.`perm_Xsp`

controls whether to perform autosome/X chromosome specific permutations (with`perm_Xsp=TRUE`

) or not (the default is to not).`perm_strata`

is a vector that defines the strata for a stratified permutation test.`chr_lengths`

is a vector of chromosome lengths, used in the case that`perm_Xsp=TRUE`

.

As with `scan1()`

, you may provide a kinship matrix (or vector of kinship matrices, for the “leave one chromosome out” (loco) approach), in order to fit linear mixed models to account for accounting for the relationships among individuals (in other words, including a random polygenic effect). If `kinship`

is unspecified, the function performs ordinary Haley-Knott regression.

To perform a permutation test with the `iron`

data, we do the following:

`operm <- scan1perm(pr, iron$pheno, Xcovar=Xcovar, n_perm=1000)`

Note the need to specify special covariates for the X chromosome (via `Xcovar`

), to be included under the null hypothesis of no QTL. And note that when these are provided, the default is to perform a stratified permutation test, using strata defined by the rows in `Xcovar`

. In general, when the X chromosome is considered, one will wish to stratify at least by sex.

Also note that, as with `scan1()`

, you can speed up the calculations on a multi-core machine by specifying the argument `cores`

. With `cores=0`

, the number of available cores will be detected via `parallel::detectCores()`

. Otherwise, specify the number of cores as a positive integer. For large datasets, be mindful of the amount of memory that will be needed; you may need to use fewer than the maximum number of cores, to avoid going beyond the available memory.

`operm <- scan1perm(pr, iron$pheno, Xcovar=Xcovar, n_perm=1000, cores=0)`

To get estimated significance thresholds, use the function `summary()`

.

`summary(operm)`

```
## LOD thresholds (1000 permutations)
## liver spleen
## 0.05 3.46 3.46
```

The default is to return the 5% significance thresholds. Thresholds for other (or for multiple) significance levels can be obtained via the `alpha`

argument.

`summary(operm, alpha=c(0.2, 0.05))`

```
## LOD thresholds (1000 permutations)
## liver spleen
## 0.2 2.63 2.64
## 0.05 3.46 3.46
```

To obtain autosome/X chromosome-specific significance thresholds, specify `perm_Xsp=TRUE`

. In this case, you need to provide chromosome lengths, which may be obtained with the function `chr_lengths()`

.

```
operm2 <- scan1perm(pr, iron$pheno, Xcovar=Xcovar, n_perm=1000,
perm_Xsp=TRUE, chr_lengths=chr_lengths(map))
```

Separate permutations are performed for the autosomes and X chromosome, and considerably more permutation replicates are needed for the X chromosome. The computations take about twice as much time. See Broman et al. (2006) Genetics 174:2151-2158.

The significance thresholds are again derived via `summary()`

:

`summary(operm2, alpha=c(0.2, 0.05))`

```
## Autosome LOD thresholds (1000 permutations)
## liver spleen
## 0.2 2.65 2.54
## 0.05 3.42 3.22
##
## X chromosome LOD thresholds (28243 permutations)
## liver spleen
## 0.2 3.1 4.02
## 0.05 3.9 5.18
```

Permutations for a genome scan with a linear mixed model-based are performed by specifying the `kinship`

argument. We can use the “leave one chromosome out” (loco) method by providing `kinship_loco`

, the list of kinship matrices calculated above with `calc_kinship()`

.

```
operm3 <- scan1perm(pr, iron$pheno, kinship_loco, Xcovar=Xcovar, n_perm=1000,
perm_Xsp=TRUE, chr_lengths=chr_lengths(map))
```

Here are the estimated significance thresholds:

`summary(operm3, alpha=c(0.2, 0.05))`

```
## Autosome LOD thresholds (1000 permutations)
## liver spleen
## 0.2 2.64 2.62
## 0.05 3.29 3.29
##
## X chromosome LOD thresholds (28243 permutations)
## liver spleen
## 0.2 3.14 4.37
## 0.05 3.82 5.50
```

As with `scan1`

, we can use `scan1perm`

with binary traits, using the argument `model="binary"`

. Again, this can’t be used with a kinship matrix, but all of the other arguments can be applied.

```
operm_bin <- scan1perm(pr, bin_pheno, Xcovar=Xcovar, model="binary",
n_perm=1000, perm_Xsp=TRUE, chr_lengths=chr_lengths(map))
```

Here are the estimated 5% and 20% significance thresholds.

`summary(operm_bin, alpha=c(0.2, 0.05))`

```
## Autosome LOD thresholds (1000 permutations)
## liver spleen
## 0.2 2.60 2.63
## 0.05 3.33 3.41
##
## X chromosome LOD thresholds (28243 permutations)
## liver spleen
## 0.2 3.16 3.06
## 0.05 3.86 3.77
```

## Estimated QTL effects

The `scan1()`

function return only LOD scores. To obtain estimated QTL effects, use the function `scan1coef()`

. This function takes a single phenotype and the genotype probabilities for a single chromosome and returns a matrix with the estimated coefficients at each putative QTL location along the chromosome.

For example, to get the estimated effects on chromosome 2 for the liver phenotype, we’d do the following:

`c2eff <- scan1coef(pr[,"2"], iron$pheno[,"liver"])`

The result is a matrix, 39 positions × 3 genotypes. An attribute, `"map"`

contains the positions of the calculations. To plot the effects, use the function `plot_coef()`

from the qtl2plot. There is again an S3 method function `plot.scan1coef()`

, so one can just type `plot()`

. Use the argument `columns`

to indicate which coefficient columns to plot.

```
par(mar=c(4.1, 4.1, 1.1, 2.6), las=1)
col <- c("slateblue", "violetred", "green3")
plot(c2eff, map["2"], columns=1:3, col=col)
last_coef <- unclass(c2eff)[nrow(c2eff),] # pull out last coefficients
for(i in seq(along=last_coef))
axis(side=4, at=last_coef[i], names(last_coef)[i], tick=FALSE, col.axis=col[i])
```

The default is to provide phenotype averages for each genotype group. If instead you want additive and dominance effects, you can provide a square matrix of *contrasts*, as follows:

```
c2effB <- scan1coef(pr[,"2"], iron$pheno[,"liver"],
contrasts=cbind(mu=c(1,1,1), a=c(-1, 0, 1), d=c(-0.5, 1, -0.5)))
```

The result will then contain the estimates of `mu`

, `a`

, and `d`

. Here’s a plot of the additive and dominance effects, which are in the second and third columns.

```
par(mar=c(4.1, 4.1, 1.1, 2.6), las=1)
plot(c2effB, map["2"], columns=2:3, col=col)
last_coef <- unclass(c2effB)[nrow(c2effB),2:3] # last two coefficients
for(i in seq(along=last_coef))
axis(side=4, at=last_coef[i], names(last_coef)[i], tick=FALSE, col.axis=col[i])
```

If you provide a kinship matrix to `scan1coef()`

, it fits a linear mixed model (LMM) to account for a residual polygenic effect. Here let’s use the kinship matrix from the LOCO method.

`c2eff_pg <- scan1coef(pr[,"2"], iron$pheno[,"liver"], kinship_loco[["2"]])`

Here’s a plot of the estimates.

```
par(mar=c(4.1, 4.1, 1.1, 2.6), las=1)
col <- c("slateblue", "violetred", "green3")
plot(c2eff_pg, map["2"], columns=1:3, col=col, ylab="Phenotype average")
last_coef <- unclass(c2eff_pg)[nrow(c2eff_pg),]
for(i in seq(along=last_coef))
axis(side=4, at=last_coef[i], names(last_coef)[i], tick=FALSE, col.axis=col[i])
```

You can also get estimated additive and dominance effects, using a matrix of contrasts.

```
c2effB_pg <- scan1coef(pr[,"2"], iron$pheno[,"liver"], kinship_loco[["2"]],
contrasts=cbind(mu=c(1,1,1), a=c(-1, 0, 1), d=c(-0.5, 1, -0.5)))
```

Here’s a plot of the results.

```
par(mar=c(4.1, 4.1, 1.1, 2.6), las=1)
plot(c2effB_pg, map["2"], columns=2:3, col=col)
last_coef <- unclass(c2effB_pg)[nrow(c2effB_pg),2:3]
for(i in seq(along=last_coef))
axis(side=4, at=last_coef[i], names(last_coef)[i], tick=FALSE, col.axis=col[i])
```

Another option for estimating the QTL effects is to treat them as random effects and calculate Best Linear Unbiased Predictors (BLUPs). This is particularly valuable for multi-parent populations such as the Collaborative Cross and Diversity Outbred mice, where the large number of possible genotypes at a QTL lead to considerable variability in the effect estimates. To calculate BLUPs, use `scan1blup()`

; it takes the same arguments as `scan1coef()`

, including the option of a kinship matrix to account for a residual polygenic effect.

`c2blup <- scan1blup(pr[,"2"], iron$pheno[,"liver"], kinship_loco[["2"]])`

Here is a plot of the BLUPs (as dashed curves) alongside the standard estimates. Note that

```
par(mar=c(4.1, 4.1, 1.1, 2.6), las=1)
col <- c("slateblue", "violetred", "green3")
ylim <- range(c(c2blup, c2eff))+c(-1,1)
plot(c2eff, map["2"], columns=1:3, col=col, ylab="Phenotype average", ylim=ylim,
xlab="Chr 2 position")
plot(c2blup, map["2"], columns=1:3, col=col, add=TRUE, lty=2)
last_coef <- unclass(c2eff)[nrow(c2eff),]
for(i in seq(along=last_coef))
axis(side=4, at=last_coef[i], names(last_coef)[i], tick=FALSE, col.axis=col[i])
```

The `scan1coef`

function can also provide estimated QTL effects for binary traits, with `model="binary"`

. (However, `scan1blup`

has not yet been implemented for binary traits.)

`c2eff_bin <- scan1coef(pr[,"2"], bin_pheno[,"liver"], model="binary")`

Here’s a plot of the effects. They’re a bit tricky to interpret, as their basically log odds ratios.

```
par(mar=c(4.1, 4.1, 1.1, 2.6), las=1)
col <- c("slateblue", "violetred", "green3")
plot(c2eff_bin, map["2"], columns=1:3, col=col)
last_coef <- unclass(c2eff_bin)[nrow(c2eff_bin),] # pull out last coefficients
for(i in seq(along=last_coef))
axis(side=4, at=last_coef[i], names(last_coef)[i], tick=FALSE, col.axis=col[i])
```

## SNP association

For multi-parent crosses, it can be useful to collapse the genotype or allele probabilities according to the founder genotypes of the various SNPs in the region of a QTL.

### QTL analysis in Diversity Outbred mice

To illustrate this sort of SNP association analysis, we’ll consider some Diversity Outbred mouse data. The Diversity Outcross (DO) mice are an advanced intercross population derived from the same eight founder strains as the Collaborative Cross (CC). See Svenson et al. (2012) and Gatti et al. (2014).

We’ll consider a subset of the data from Recla et al. (2014), available as part of the qtl2data github repository. (The full data are in `DO_Recla`

; the directory `DOex`

contains a reduced set, with just three chromosomes, one phenotype (`OF_immobile_pct`

, percent immobile in the open field test), and a reduced set of markers.

You can download the data from a single zip file, as follows:

```
file <- paste0("https://raw.githubusercontent.com/rqtl/",
"qtl2data/master/DOex/DOex.zip")
DOex <- read_cross2(file)
```

Let’s quickly whip through a basic analysis.

We first calculate genotype probabilities and convert them to allele probabilities. We’ll just use marker locations and not insert any pseudomarkers.

```
pr <- calc_genoprob(DOex, error_prob=0.002)
apr <- genoprob_to_alleleprob(pr)
```

We calculate kinship matrices (using the “loco” method, though with the caveat that here we are only considering genotypes on three chromosomes).

`k <- calc_kinship(apr, "loco")`

We create a numeric covariate for sex; be sure to include the individual IDs as names.

```
sex <- (DOex$covar$Sex == "male")*1
names(sex) <- rownames(DOex$covar)
```

We perform a genome scan with a linear mixed model (adjusting for a residual polygenic effect), with sex as an additive covariate.

`out <- scan1(apr, DOex$pheno, k, sex)`

Here’s a plot of the results.

```
par(mar=c(4.1, 4.1, 0.6, 0.6))
plot(out, DOex$gmap)
```

There’s a strong peak on chromosome 2. Let’s look at the QTL effects. We estimate them with `scan1coef()`

. We need to subset the allele probabilities and the list of kinship matrices.

`coef_c2 <- scan1coef(apr[,"2"], DOex$pheno, k[["2"]], sex)`

For the DO, with 8 QTL alleles, we can use the function `plot_coefCC`

in the R/qtl2plot package, which plots the 8 allele effects in the “official” Collaborative Cross (CC) colors. (Well, actually *slightly* modified colors, because I think the official colors are kind of ugly.) The strong locus seems to be mostly due to the NZO allele. Note that `CCcolors`

is a vector of colors included in the qtl2plot package; there’s also a `CCorigcolors`

object with the *official* colors.

```
par(mar=c(4.1, 4.1, 0.6, 0.6))
plot_coefCC(coef_c2, DOex$gmap["2"], bgcolor="gray95")
legend("bottomleft", col=CCcolors, names(CCcolors), ncol=2, lwd=2, bg="gray95")
```

### SNP associations

Okay, now finally we get to the SNP associations. We have a large peak on chromosome 2, and we want to look at individual SNPs in the region of the locus.

Well, actually, we first need to find the location of the inferred QTL. The peak LOD score on chromosome 2 occurs at 52.4 cM. But to find nearby SNPs, we really want to know the Mbp position. The calculations were only performed at the marker positions, and so we need to find the peak marker and then find it’s physical location:

```
marker <- rownames(max(out, DOex$gmap, chr="2"))
peak_Mbp <- DOex$pmap[["2"]][marker]
```

The marker is at 97.5 Mbp.

Now we need to identify the SNPs in this region. We’ll focus on a 2 Mbp interval centered at 97.5 Mbp. We’re still working on how best to quickly access SNP data. In the meantime, we can grab a predefined table of SNPs that’s available in the qtl2data repository. It’s saved as an RDS file, which is a slight hassle to load over the web.

```
tmpfile <- tempfile()
file <- "https://raw.githubusercontent.com/rqtl/qtl2data/master/DOex/c2_snpinfo.rds"
download.file(file, tmpfile, quiet=TRUE)
snpinfo <- readRDS(tmpfile)
unlink(tmpfile)
```

Here’s the first few rows of the data. The columns are the SNP name, the chromosome, the Mbp position (in Mouse genome build 38), the alleles (with the B6 allele before the `|`

and any other alleles after; in the case of multiple alternate alleles, they are separated by `/`

). Finally, there are eight columns of genotypes for the 8 CC founder strains, coded as `1`

/`3`

.

`head(snpinfo)`

```
## snp_id chr pos_Mbp alleles AJ B6 129 NOD NZO CAST PWK WSB
## 1 rs221396738 2 96.50001 C|T 1 1 1 1 1 3 1 1
## 2 rs264175039 2 96.50022 A|C 1 1 1 1 3 1 1 1
## 3 rs227493750 2 96.50028 C|T 1 1 1 1 3 1 1 1
## 4 rs229722012 2 96.50034 C|G 1 1 1 1 3 3 1 3
## 5 rs27379137 2 96.50044 C|T 1 1 1 1 1 3 1 1
## 6 rs227574143 2 96.50067 A|C 1 1 1 1 3 1 1 3
```

We first convert the founder genotypes to a “strain distribution pattern” (SDP): an integer whose binary encoding corresponds to the 8 founders’ genotypes.

`snpinfo$sdp <- calc_sdp(snpinfo[,-(1:4)])`

We’ve added the SDP as an additional column.

`head(snpinfo)`

```
## snp_id chr pos_Mbp alleles AJ B6 129 NOD NZO CAST PWK WSB sdp
## 1 rs221396738 2 96.50001 C|T 1 1 1 1 1 3 1 1 32
## 2 rs264175039 2 96.50022 A|C 1 1 1 1 3 1 1 1 16
## 3 rs227493750 2 96.50028 C|T 1 1 1 1 3 1 1 1 16
## 4 rs229722012 2 96.50034 C|G 1 1 1 1 3 3 1 3 176
## 5 rs27379137 2 96.50044 C|T 1 1 1 1 1 3 1 1 32
## 6 rs227574143 2 96.50067 A|C 1 1 1 1 3 1 1 3 144
```

(Note that there’s also a function `invert_sdp()`

for converting the SDPs back into founder genotypes.)

To perform the SNP association analysis, we first use the allele probabilities and the founder SNP genotypes to infer the SNP genotypes for the DO mice. That is, at each SNP, we want to collapse the eight founder allele probabilities to two SNP allele probabilities, using the the SNP genotypes of the founders.

We do this assuming that the allele probabilities were calculated sufficiently densely that they can be assumed to be constant in intervals. With this assumption, we then:

- Find the interval for each SNP.
- Reduce the SNPs to a “distinct” set: if two SNPs have the same SDP and are in the same interval, by our assumption their allele probabilities will be the same.
- Take the average of the allele probabilities at the two endpoints of each interval.
- Collapse the 8 allele probabilities to two according to each observed SDP in the interval.

We further create a look-up table relating the full set of SNPs to the reduced set (one of each observed SDP in each interval).

We first need to identify the equivalent SNPs, using the function `index_snps()`

. This requires a physical map of the markers/pseudomarkers used to calculate the genotype probabilities. We take this directly from the `DOex`

object, as we’d calculated the allele probabilities only at the observed markers. If we’d also calculated probabilities at pseudomarker positions between markers, we’d need to use interpolation to get Mbp positions for the pseudomarkers. There’s a function `interp_map()`

for assisting with that.

The `index_snps()`

function takes the physical map and the `snpinfo`

data frame, include the strain distribution patterns we calculated above. It inserts three new columns into the data frame (`"index"`

, `"interval"`

, and `"on_map"`

: indexes to a set of non-equivalent SNPs, map intervals in which the SNPs lie, and whether the SNPs correspond to marker/pseudomarker positions).

`snpinfo <- index_snps(DOex$pmap, snpinfo)`

We can then use the function `genoprob_to_snpprob()`

, which takes the allele probabilities (or the full genotype probabilities, if you want to use a full 3-genotype model at each SNP), to collapse the genotype probabilities to SNP genotype probabilities.

`snp_pr <- genoprob_to_snpprob(apr, snpinfo)`

The output of this function, `snp_pr`

, has the same form as the input `apr`

object with allele probabilities, and can be used directly in a call to `scan1()`

. And so we can now use the object to perform the SNP association analysis in the region, using the same linear mixed model. We need to be sure to use the correct kinship matrix.

`out_snps <- scan1(snp_pr, DOex$pheno, k[["2"]], sex)`

The function `plot_snpasso()`

in the qtl2plot package can be used to plot the results, with points at each of the SNPs. The default is to plot **all** SNPs: We calculated LOD scores only at a set of distinct SNPs, but SNPs in the same interval with the same SDP will have the same LOD score. It takes the `scan1()`

output plus the `snpinfo`

data frame.

```
par(mar=c(4.1, 4.1, 0.6, 0.6))
plot_snpasso(out_snps, snpinfo)
```

To get a table of the SNPs with the largest LOD scores, use the function `top_snps()`

. This will show all SNPs with LOD score within some amount (the default is 1.5) of the maximum SNP LOD score.

`top_snps(out_snps, snpinfo)`

```
## snp_id chr pos_Mbp alleles AJ B6 129 NOD NZO CAST PWK WSB sdp index interval on_map lod
## 3264 rs212414861 2 96.75489 C|G 1 1 1 1 3 3 1 1 48 3264 64 FALSE 6.931185
## 3265 rs229578122 2 96.75494 T|A 1 1 1 1 3 3 1 1 48 3264 64 FALSE 6.931185
## 3266 rs254318131 2 96.75494 C|T 1 1 1 1 3 3 1 1 48 3264 64 FALSE 6.931185
## 3267 rs217679969 2 96.75495 G|T 1 1 1 1 3 3 1 1 48 3264 64 FALSE 6.931185
## 3268 rs238404461 2 96.75496 T|G 1 1 1 1 3 3 1 1 48 3264 64 FALSE 6.931185
## 3269 rs262749925 2 96.75497 C|G 1 1 1 1 3 3 1 1 48 3264 64 FALSE 6.931185
## 3271 rs231282689 2 96.75498 C|G/T 1 1 1 1 3 3 1 1 48 3264 64 FALSE 6.931185
## 3274 rs260286709 2 96.75518 G|A 1 1 1 1 3 3 1 1 48 3264 64 FALSE 6.931185
## 3275 rs27396282 2 96.75522 T|C 1 1 1 1 3 3 1 1 48 3264 64 FALSE 6.931185
## 3283 rs263841533 2 96.75541 T|C 1 1 1 1 3 3 1 1 48 3264 64 FALSE 6.931185
## 3288 rs231205920 2 96.75557 G|A 1 1 1 1 3 3 1 1 48 3264 64 FALSE 6.931185
## 3289 rs242885221 2 96.75656 T|C 1 1 1 1 3 3 1 1 48 3264 64 FALSE 6.931185
## 5242 rs586746690 2 96.88443 C|T 1 1 3 1 3 3 1 1 52 5242 64 FALSE 5.904999
## 5426 rs49002164 2 96.89204 G|A 1 1 3 1 3 3 1 1 52 5242 64 FALSE 5.904999
## 7170 rs244595995 2 96.96972 A|T 1 1 3 1 3 3 1 1 52 5242 64 FALSE 5.904999
## 16182 rs220351620 2 97.57962 A|G 1 1 1 1 3 1 1 1 16 16182 65 FALSE 5.994374
## 22474 rs52579091 2 98.07299 C|T 1 1 1 1 3 1 1 1 16 16182 65 FALSE 5.994374
## 23017 rs243489710 2 98.11061 C|T 1 1 1 1 3 1 1 1 16 16182 65 FALSE 5.994374
## 23184 rs244316263 2 98.12529 C|A 1 1 1 1 3 1 1 1 16 16182 65 FALSE 5.994374
## 23186 rs219729956 2 98.12534 C|A 1 1 1 1 3 1 1 1 16 16182 65 FALSE 5.994374
## 23360 rs235315566 2 98.20433 G|T 1 1 1 1 3 1 1 1 16 16182 65 FALSE 5.994374
## 23494 rs250167663 2 98.21481 C|T 1 1 1 1 3 1 1 1 16 16182 65 FALSE 5.994374
## 23518 rs234831418 2 98.21704 T|G 1 1 1 1 3 1 1 1 16 16182 65 FALSE 5.994374
## 23601 rs240832432 2 98.22253 C|A 1 1 1 1 3 1 1 1 16 16182 65 FALSE 5.994374
## 23649 rs220815439 2 98.22511 G|A 1 1 1 1 3 1 1 1 16 16182 65 FALSE 5.994374
## 23830 rs579950897 2 98.24240 C|T 1 1 1 1 3 1 1 1 16 16182 65 FALSE 5.994374
## 26015 rs224994851 2 98.39510 A|T 1 1 1 1 3 1 1 1 16 16182 65 FALSE 5.994374
## 26213 rs248208898 2 98.42249 C|T 1 1 1 1 3 1 1 1 16 16182 65 FALSE 5.994374
## 26554 rs245525925 2 98.45411 C|A 1 1 1 1 3 1 1 1 16 16182 65 FALSE 5.994374
## 26891 rs229631954 2 98.47723 C|T 1 1 1 1 3 1 1 1 16 16182 65 FALSE 5.994374
```

The top SNPs all have NZO and CAST with a common allele, different from the other 6 founders. The next-best SNPs have NZO with a unique allele. Note that there’s one SNP with two alternate alleles (`C|G/T`

). We are requiring that SNPs have just two alleles, and so we group the alternate alleles together, though there’s not a good reason for this.

We can highlight these top SNPs in the SNP association plot using the `drop`

argument.

```
par(mar=c(4.1, 4.1, 0.6, 0.6))
plot_snpasso(out_snps, snpinfo, drop=1.5)
```