R/qtl2 (aka qtl2) is a reimplementation of the QTL analysis software R/qtl, to better handle high-dimensional data and complex cross designs.
In this document, we’ll describe the technical details of interest to developers wishing to contribute to R/qtl2.
There are three basic aspects to the software:
QTLCross
. We’re using Rcpp to connect from R to C++.We are redefining the basic data structures to be more general (to handle more complex crosses, such as the Collaborative Cross, MAGIC lines, and Diversity Outcross), and to incorporate more information, such as annotations of the phenotypes (for example, with gene expression data, the gene identifiers, physical locations of genes, and the tissue that was measured).
We are also redefining the input data file formats.
The basic data structure in R/qtl, the "cross"
class, has been redesigned.
A basic principle in the new design is to have the data as close as possible in the form in which it will be used, and to have a more-flat (i.e., less nested) data structure. So don’t entangle the marker maps with the genotype data, and pull out the sex information rather than have to find it within the phenotype data whenever it is needed.
A key design decision concerns whether to split out the genotype data and marker maps by chromosome, or to keep them as a simpler table. I’m choosing to have these data split out by chromosome, as much of the analysis tends to proceed by chromosome, and it tends to be easier to combine then to split.
The data remains a list, now of class "cross2"
. It has many components (some of them optional) in unspecified order; the names of the components are what matter.
"crosstype"
)Previously, the cross type (e.g., "bc"
for backcross and "f2"
for intercross) was part of the "class"
. This was a bad idea.
Now, the data will have a component "crosstype"
that is a single character string with the cross type:
"bc"
for backcross"f2"
for intercross"riself"
for 2-way RIL by selfing"risib"
for 2-way RIL by sib-mating"dh"
for doubled haploids (like a backcross, but with genotype codes AA/BB)"haploid"
for haploids (like a backcross, but with genotype codes A/B)"ail"
for 2-way advanced intercross lines"mwriself"
for 2k-way RIL by selfing"mwrisib"
for 2k-way RIL by sib-mating"preriself"
for partially-inbred 2k-way RIL by selfing"prerisib"
for partially-inbred 2k-way RIL by sib-mating"do"
for diversity outbred"hs"
for 8-way heterogeneous stock"riself4"
, "riself8"
, and "riself16"
for 4-, 8-, and 16-way RIL by selfing"risib4"
and "risib8"
for 4- and 8-way RIL by sib-mating"geno"
)The previous data structure had deep nesting of information; we’re going to split that out and make things more flat. The genotype data will now be a list of matrices of integers. Each component of the list is a chromosome; the names are the chromosome names. Each chromosome has a matrix of n_ind
x n_markers
.
The column names are the marker names, and the row names are the line identifiers (hereafter “IDs”).
The genotypes are observed marker genotypes, represented as integers. Missing values are coded as 0 (not NA
, as in R/qtl). For simple crosses, the autosomal genotypes are coded as before.
1
/2
for (AA/BB or A/B)1
/2
for autosomes (for AA/AB)1
/2
/3
/4
/5
for autosomes (for AA/AB/BB/notBB/notAA)However, we’re changing the encoding of X chromosome genotypes to have males coded as if they were homozygous. In an intercross, the X chromosome encodings are 1
/2
/3
for females (1
/2
for the forward direction and 2
/3
for the reverse), and 1
/3
for males. In a backcross, the females are 1
/2
and the males are 1
/3
.
For crosses with > 2 founders, I’m expecting SNPs, and intending, initially, to assume that the markers are diallelic, encoded 1
/2
/3
(with 2
being the heterozygote).
All of the above is under the assumption that we’re using genotypes calls as the basic marker genotype information. Of course, we will also want to handle genotype-by-sequencing (GBS) data (which might be represented as a pair of allele counts), or array intensity information (which would be represented as a pair of allele intensities). It seems best to have separate data structures for these cases, perhaps named geno_gbs
and geno_int
(int
for “intensity”). These could be three-dimensional arrays (n_ind
× n_markers
× 2
), with the third dimension corresponding to the two alleles.
"founder_geno"
)For crosses with > 2 founders, we will have a separate set of genotypes on the founders. This will again be a list of matrices, each matrix being the data for a chromosome, of size n_founders
× n_markers
. I expect these to be diallelic markers, such as SNPs. We will encode them as 1
/2
/3
(allowing heterozygotes, though I’ll probably treat the hets as missing values).
"is_x_chr"
)A logical vector of length n_chromosomes
indicates which of the chromosomes is the X chromosome.
"is_female"
)For the treatment of the X chromosome, we need access to sex of the individuals. We’ll have a logical vector ("is_female"
) indicating which individuals are female. I prefer the logical vector as it’s less susceptible to confusion. (Is 0 female and 1 male, or the other way around?) This will have length n_ind
, with the name attribute being the individual IDs.
"cross_info"
)For many cross types, and particularly for the treatment of the X chromosome, we need individual-level information about the nature of the cross. For an intercross, this is like pgm
(for paternal grandmother) in R/qtl. For the Collaborative Cross, we need the order of the founders in the set of crosses that led to each line. For the AIL and the Diversity Outcross, we need to know the number of generations of outbreeding.
This "cross_info"
component will be a matrix of integers with n_ind
rows, and with the number of columns depending on the cross type.
This information is highly cross-type-specific. We’ll leave the details to the discussion of the format of input files.
"gmap"
)The genetic map of the markers is a list of numeric vectors; each vector corresponds to a chromosome and gives the locations of markers in centiMorgans (cM), with the names attribute being the marker names. The markers should be in increasing order.
"pmap"
)We will also allow (and perhaps expect) a physical map of the markers. This will have the same form as the genetic map (with the same chromosomes, the same markers, and with markers in the same order), but with positions in Mbp. (Or perhaps we should use vectors of integers, with positions in basepairs?)
"pheno"
)We will separate out the numeric phenotypes from messier covariates. (In many cases, we want to perform QTL analysis on a large set of phenotypes, and having other stuff, like individual IDs, mixed in there can make things cumbersome). The phenotype data will be a numeric matrix of size n_individuals
× n_phenotypes
.
Row names are the individual IDs and column names are the phenotype names.
"covar"
)Covariate information, often non-numeric, will form a separate data frame, of size n_individuals
× n_covariates
. The columns can be of mixed modes (numeric, factors, character strings, etc.).
Row names are the individual IDs and column names are the covariate names.
"phenocovar"
)We will have a separate data frame of “phenotype covariates.” These are metadata describing the phenotypes. The dimension is n_phenotypes
× n_phenocovar
.
For example, in the case of a phenotype measured over time, one column in the phenotype covariate data frame could be the time of measurement.
For gene expression data, we would have columns representing chromosome and physical position of genes, as well as gene IDs. In the case of gene expression on multiple tissues, there could be a column representing the tissue. Or we might have both gene expression and proteomic measurements, and so a column indicating expression vs protein.
"alleles"
)The last bit is a vector of single-character strings, with allele codes for the founders, to be used in various summaries and data visualizations.
"genoprob"
)A critical piece of derived data is the conditional QTL genotype probabilities, given the observed marker data. This will be a list of three-dimensional arrays; each array corresponds to a chromosome and is of dimension n_ind
× n_positions
× n_genotypes
.
An important consideration is the encoding of the genotypes. For crosses with two founder lines, this is straightforward except perhaps for the X chromosome.
For the X chromosome, I’m using phase-known genotypes for the females and separating the male hemizygous genotypes. For example, for a backcross, the genotype codes are 1
/2
/3
/4
for AA/AB/AY/BY. For an intercross, the genotype codes are 1
/2
/3
/4
/5
/6
for AA/AB/BA/BB/AY/BY.
For crosses with > 2 founders and heterozygous offspring (such as the Diversity Outcross), genotypes will be encoded as integers in the following way, for the phase-unknown case:
## A B C D E F G H
## A 1 2 4 7 11 16 22 29
## B 2 3 5 8 12 17 23 30
## C 4 5 6 9 13 18 24 31
## D 7 8 9 10 14 19 25 32
## E 11 12 13 14 15 20 26 33
## F 16 17 18 19 20 21 27 34
## G 22 23 24 25 26 27 28 35
## H 29 30 31 32 33 34 35 36
If \(a_1\) and \(a_2\) are the two alleles, then we take \(m = \max(a_1, a_2)\) and \(d = |a_1 - a_2|\), and the encoding is \(\binom{m+1} 2 - d\).
In the phase-known case, we can fill out the lower triangle the way we filled out the upper triangle and diagonal for the phase-unknown case. With \(n\) denoting the number of alleles, and with \(a_1\), \(a_2\), \(m\), and \(d\) as before, and assuming \(a_1 > a_2\), we take \(\binom{m} 2 - d + \binom{n+1} 2 + 1\).
## A B C D E F G H
## A 1 2 4 7 11 16 22 29
## B 37 3 5 8 12 17 23 30
## C 38 39 6 9 13 18 24 31
## D 40 41 42 10 14 19 25 32
## E 43 44 45 46 15 20 26 33
## F 47 48 49 50 51 21 27 34
## G 52 53 54 55 56 57 28 35
## H 58 59 60 61 62 63 64 36
Decoding the genotypes to the allele pair is relatively straightforward, with a loop over columns.
For simple cross types, we can use the file formats for R/qtl, use qtl::read.cross
to read in the data, and then use a conversion function (qtl2::convert2cross2
) to convert the data into the new format.
For more complex crosses, we need to define a new format. I was persuaded by Aaron Wolen’s idea of a “tidy” format for R/qtl, with three separate CSV files, one for phenotypes, one for genotypes, and one for the genetic map.
Another important idea is from Pjotr Prins’s qtab format: the inclusion of metadata, such as genotype encodings, with the primary data. This will simplify the handling of multiple files and will help to avoid mistakes.
And so the basic idea for the new format is to have a separate file for each part of the primary data (genotypes, founder genotypes, genetic map, physical map, phenotypes, covariates, and phenotype covariates), and then a control file which specifies the names of all of those files, the genotype encodings and missing value codes, and things like the name of the sex column within the covariate data (and the encodings for the sexes) and which chromosome is the X chromosome.
A key advantage of the control file scheme is that it greatly simplifies the function for reading in the data. That function, read_cross2()
, has a single argument: the name (with path) of the control file. The large number of files is a bit cumbersome, so we’ve made it possible to use a [zip file](https://en.wikipedia.org/wiki/Zip_(file_format) containing all of the data files, and to read that zip file directly (with the same function, read_cross2()
). The function zip_datafiles()
can be used to create the zip file.
I describe the details of the input files in a separate vignette. Here I’ll give a brief sketch of the structure of the files.
The control file can be in YAML or JSON formats. Both are human-readable text file for representing relatively complex data. YAML tends to be more readable, but JSON is more robust.
Here’s an example, for a sample intercross dataset, in YAML.
# Data from Grant et al. (2006) Hepatology 44:174-185
# Abstract of paper at PubMed: https://www.ncbi.nlm.nih.gov/pubmed/16799992
# Available as part of R/qtl book package, https://github.com/kbroman/qtlbook
crosstype: f2
geno: iron_geno.csv
pheno: iron_pheno.csv
phenocovar: iron_phenocovar.csv
covar: iron_covar.csv
gmap: iron_gmap.csv
alleles:
- S
- B
genotypes:
SS: 1
SB: 2
BB: 3
sex:
covar: sex
f: female
m: male
cross_info:
covar: cross_direction
(SxB)x(SxB): 0
(BxS)x(BxS): 1
x_chr: X
na.strings:
- '-'
- NA
The order of the information is not important, but the names of things are critical.
crosstype
indicates the cross type. geno
, pheno
, phenocovar
, covar
, and gmap
indicate the names of the files for the different major pieces of data, all expected to be within the same directory as the YAML control file.
alleles
indicates the two single-character allele codes for the founders. The initial dashes are just to indicate that the S
and B
form a vector.
genotypes
gives the genotype codes used in the genotype data file. SS
, SB
, and BB
are the codes used, to be converted to 1
, 2
, and 3
, respectively. The key: value
structure is for an associative array.
sex
contains information about the name of the covariate that represents sex as well as the codes used: the sexes in the iron_covar.csv
file are coded as f
and m
, and we want to indicate which one is female
and which is male
.
The format of the control file is maybe a bit technical for some users, so there’s a function write_control_file()
that takes the control parameters (including file names) as input and writes the YAML file in the correct form.
Again, I don’t want to get into too much detail here. All of the other files are in a simple CSV format. Each is a simple matrix with row names in the first column and column names in the first row.
Genotypes are as individuals × markers, with the first column being individual IDs and the first row being marker names. The founder genotypes are similar, but with founder lines as the rows.
The phenotype and covariate data are as individuals × variables. The phenotype file must be strictly numeric, while the covariate file can be a mixture of types. The first column in each must be the individual IDs.
The phenotype covariate information is a matrix of phenotypes × phenotype covariates. The first column contains the phenotype names and the first row contains the names of the phenotype covariates.
The genetic and physical maps are in separate files with three columns: marker names, chromosome IDs, and positions.
The individual-to-line mapping ("linemap"
) would most likely be a column in the covariate data and would be represented in the YAML file much as sex
is above.
The last piece is "cross_info"
. For an intercross, this can just be a column in the covariate data (as it is for the example above). For more complex crosses, it will be a matrix with the rows being individuals, and with the first column being individual IDs.
For more detail, see the input file format vignette.
est_map
, we potentially need a separate phase-known class (explain why, and how this is done)