R/qtl2 (aka qtl2) is a reimplementation of the QTL analysis software R/qtl, to better handle high-dimensional data and complex cross designs.

The input data file formats for R/qtl cannot handle complex crosses, and so for R/qtl2, we need to define a new input file format. This document describes the details.

For simple cross types, we can continue to use the file formats for R/qtl, use qtl::read.cross() to read in the data, and then use a conversion function (qtl2::convert2cross2()) to convert the data into the new format.
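As a minimal sketch of that route, the following uses the hyper backcross data that ships with R/qtl as a stand-in for the result of qtl::read.cross() on your own files. It assumes that both the qtl and qtl2 packages are installed.

```r
library(qtl)    # R/qtl: read.cross() and the example dataset "hyper"
library(qtl2)   # R/qtl2: convert2cross2()

# Load an example backcross in R/qtl's "cross" format
# (stands in for the result of qtl::read.cross() on your own files)
data(hyper)

# Convert to the R/qtl2 "cross2" format
hyper2 <- convert2cross2(hyper)
```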

For more complex crosses, we need to define a new format. I was persuaded by Aaron Wolen’s idea of a “tidy” format for R/qtl, with three separate CSV files, one for phenotypes, one for genotypes, and one for the genetic map.

Another important idea is from Pjotr Prins’s qtab format: the inclusion of metadata, such as genotype encodings, with the primary data. This will simplify the handling of multiple files and will help to avoid mistakes.

And so the basic idea for the new format is to have a separate file for each part of the primary data: genotypes, founder genotypes, genetic map, physical map, phenotypes, covariates, and phenotype covariates. A control file (in YAML or JSON format) then specifies the names of all of those files, the genotype encodings and missing value codes, and details such as the name of the sex column within the covariate data (and the encodings for the sexes) and which chromosome is the X chromosome.

Reading the data files

Before discussing the boring file specifications, let’s consider briefly how the data are read into R.

A key advantage of the control file scheme is that it greatly simplifies the function for reading in the data. That function, read_cross2(), has a single argument: the name (with path) of the control file. So you can read in data like this:

library(qtl2)
grav2 <- read_cross2("~/my_data/grav2.yaml")

The large number of files is a bit cumbersome, so we’ve made it possible to use a zip file containing all of the data files, and to read that zip file directly. There’s even a function for creating the zip file:

zip_datafiles("~/my_data/grav2.yaml")

The zip_datafiles() function will read the control file to identify all of the relevant data files and then zip them up into a file with the same name and location, but with the extension .zip rather than .yaml or .json.

To read the data back in, we use the same read_cross2() function, providing the name (and path) of the zip file rather than the control file.

grav2 <- read_cross2("~/my_data/grav2.zip")

This can even be done with remote files.

grav2 <- read_cross2("https://kbroman.org/qtl2/assets/sampledata/grav2/grav2.zip")

Of course, the other advantage of the zip file is that it is compressed and so smaller than the combined set of CSV files.

Format of the data files

The bulk of the data is in a set of comma-delimited (CSV) files. In addition, a control file (in YAML or JSON format), contained in the same directory as the CSV files, specifies the file names and other control parameters (such as genotype and sex encodings). Sample data files are available at the R/qtl2 website. We’ll discuss the CSV files first.

CSV files

The comma-delimited (CSV) files are each in the form of a simple matrix, with the first column being a set of IDs and the first row being a set of variable names.

Missing value codes will be specified in the control file (as na.strings, with default value "NA") and will apply across all files, so a missing value code for one file cannot be an allowed value in another file.

The CSV files can include a header with a set of comment lines initiated by a value specified in the control file as comment.char (with default value "#"). The first such line could be a description of the contents of the file. These comment lines can include the expected number of rows and columns, like this:

# This file contains blah, blah, blah...
# nrow 25012
# ncol 91

The number of rows (nrow) includes only the data rows (not the comment rows, nor the row with variable names). On the other hand, the number of columns (ncol) does include the column with individual IDs.
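To make the counting convention concrete, here is a small base-R sketch that writes a toy file with such a header and checks the declared dimensions against what is read. The file contents are made up, and the helper declared() is hypothetical; this only illustrates the convention and is not how R/qtl2 itself reads the files.

```r
# Write a toy genotype file with a comment header (made-up data)
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "# Toy genotype file for illustration",
  "# nrow 3",
  "# ncol 4",
  "id,m1,m2,m3",
  "ind1,SS,SB,BB",
  "ind2,SB,-,SS",
  "ind3,BB,SB,NA"
), csv_path)

# Read it, skipping comment lines and treating "NA" and "-" as missing
geno <- read.csv(csv_path, comment.char = "#", na.strings = c("NA", "-"),
                 colClasses = "character")

# Hypothetical helper: pull a declared dimension out of the comment header
header <- grep("^#", readLines(csv_path), value = TRUE)
declared <- function(key) {
  pattern <- paste0("^# ", key, " +")
  as.integer(sub(pattern, "", grep(pattern, header, value = TRUE)))
}

# nrow counts data rows only; ncol includes the ID column
stopifnot(nrow(geno) == declared("nrow"),
          ncol(geno) == declared("ncol"))
```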

All of these CSV files may be transposed relative to the form described below. You just need to include, in the control file, a line like

geno_transposed: true

Genotype and founder genotype data

The genotype data file is a matrix of individuals × markers. The first column is the individual IDs; the first row is the marker names. The founder genotypes (if needed) are in the same form, with founder lines as rows and markers as columns, and with founder IDs in the first column.

Phenotype and covariate data

We split the numeric phenotypes from the mixed-mode covariates, as two separate CSV files. Each file forms a matrix of individuals × phenotypes (or covariates), with the first column being individual IDs and the first row being phenotype or covariate names. Sex and line IDs (if needed) can be columns in the covariate data.

Phenotype covariates

A separate CSV file contains phenotype covariate data, as phenotypes × phenotype covariates. The first column contains phenotype names, and the first row contains the names of the phenotype covariates.

Note: The genotype, founder genotype, phenotype, covariate, and phenotype covariate data can be split across multiple files. For example, the genotype data could be split by chromosome. The individual IDs must appear in each file; these are used to combine the files.

Genetic and physical maps

Genetic and physical maps of the genotyped markers are provided as separate CSV files, each with three columns: marker, chromosome, and position. The first row should contain the column names marker,chr,pos, but its contents are ignored. In the genetic map file, positions should be in centiMorgans (cM). In the physical map file, positions should be in megabasepairs (Mbp).

Cross information

The "cross_info" data specifies details of the cross that generated each individual and is a numeric matrix with individuals as rows (the same number of rows as in the genotype data) and with columns depending on the cross type.

For simple cross types (e.g., "f2", an intercross between two inbred lines), this cross information may be included as a column in the covariate data. More generally, the cross information will be a separate CSV file. For example, for a set of Collaborative Cross (CC) lines, we will want a matrix with eight columns, which indicate the order of the founders in the crosses that generated each CC line.

So, in general, the cross information will be in a CSV file with individuals as rows and a set of columns that define the cross information for that cross type. The first column contains individual IDs and the first row contains column names. Details on the column information are provided in the cross-type-specific information, below.

Control file

The new input file format includes a text-based control file (in YAML or JSON format) to specify the names of all of the other files as well as various control parameters such as genotype and sex encodings and codes for missing values. We use YAML because it is flexible, readable, and easy to import into R. We also allow JSON; though it is often less human-readable, it can be less prone to errors.

The format of the control file is a bit technical. We describe the details here and also provide a function write_control_file() that takes the detailed specifications as input and constructs the control file in the correct format.

We’ll start with an example: the control file for the sample intercross data.

# Data from Grant et al. (2006) Hepatology 44:174-185
# Abstract of paper at PubMed: https://www.ncbi.nlm.nih.gov/pubmed/16799992
# Available as part of R/qtl book package, https://github.com/kbroman/qtlbook
crosstype: f2
geno: iron_geno.csv
pheno: iron_pheno.csv
phenocovar: iron_phenocovar.csv
covar: iron_covar.csv
gmap: iron_gmap.csv
alleles:
- S
- B
genotypes:
  SS: 1
  SB: 2
  BB: 3
sex:
  covar: sex
  f: female
  m: male
cross_info:
  covar: cross_direction
  (SxB)x(SxB): 0
  (BxS)x(BxS): 1
x_chr: X
na.strings:
- '-'
- NA

Any line that begins with a “#” is treated as a comment and ignored. It’s good to include some comments at the top of the file, describing the dataset.

The order of things within the file is not important, but the names of things are critical.

Much of the information is represented as key-value pairs, as “key: value.” For example, the cross type is indicated with a line like

crosstype: f2

The “key” is “crosstype” and the “value” is “f2.” This indicates that the data are for an F2 intercross between two inbred lines.
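A control file like the intercross example could also be generated from R with write_control_file(). The following is a sketch, assuming the qtl2 package is installed; the argument names are as I understand them from the function's help page (see ?write_control_file to confirm), and the file is written to a temporary directory.

```r
library(qtl2)

# Write to a temporary directory so the working directory is left untouched
control_file <- file.path(tempdir(), "iron.yaml")

write_control_file(control_file,
                   crosstype = "f2",
                   geno_file = "iron_geno.csv",
                   gmap_file = "iron_gmap.csv",
                   pheno_file = "iron_pheno.csv",
                   covar_file = "iron_covar.csv",
                   phenocovar_file = "iron_phenocovar.csv",
                   geno_codes = c(SS = 1L, SB = 2L, BB = 3L),
                   alleles = c("S", "B"),
                   sex_covar = "sex",
                   sex_codes = c(f = "female", m = "male"),
                   crossinfo_covar = "cross_direction",
                   crossinfo_codes = c("(SxB)x(SxB)" = 0L, "(BxS)x(BxS)" = 1L),
                   xchr = "X",
                   na.strings = c("-", "NA"),
                   overwrite = TRUE)

# Inspect the result
cat(readLines(control_file), sep = "\n")
```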

File names

The names of the basic CSV files are indicated with lines like

geno: iron_geno.csv

This indicates that the genotype data are in the file iron_geno.csv. The files are expected to be in the same directory as the control file. They could be placed in separate directories, with the file names being paths relative to the location of the control file, but this is not recommended (or well tested).

The “keys” for the different files are the following:

geno:         genotype_filename
founder_geno: founder_genotype_filename
pheno:        phenotype_filename
covar:        covariate_filename
phenocovar:   phenotype_covariate_filename
gmap:         genetic_map_filename
pmap:         physical_map_filename

Most of these files are optional; if a particular file is not used, the corresponding key can be omitted from the control file.

If the data for a section is split into multiple files (for example, if the genotypes are split into chromosome-specific files), then a vector of file names should be provided. For example:

geno:
- geno1.csv
- geno2.csv
- genoX.csv
founder_geno:
- founder_geno1.csv
- founder_geno2.csv
- founder_genoX.csv

X chromosome

If one of the chromosomes is to be treated as the X chromosome, there should be a line like

x_chr: X

This specifies the chromosome ID for the X chromosome (X in this case).

Allele labels

To add labels in summary tables and plots, provide a vector of single-character allele labels, with one for each founder line. For example,

alleles:
- S
- B

This list of items, each beginning with a hyphen and a space, is the YAML format for a vector. It is equivalent to the R code c("S", "B").

You could also write this line as

alleles: [S, B]

which is an alternative format for vectors in YAML.

Genotype codes

The control file should contain a record with “genotypes:” that specifies the genotype encodings. Here’s an example:

genotypes:
  SS: 1
  SB: 2
  BB: 3

For each possible genotype code, indent and provide a “key: value” pair, with the key being the code used in the genotype and founder genotype files, and the value being an integer to which the genotype should be converted.

The above example would be suitable for a backcross or intercross. For a backcross, the second homozygote (BB in this case) is only needed in the case that there are X chromosome genotypes for males.

For RIL, we would use something like

genotypes:
  BB: 1
  DD: 2

For crosses with multiple parents, the genotype file should contain genotype calls for a set of SNPs, and there should be a corresponding founder genotype file with genotypes of the founders at those SNPs. A common set of genotype codes needs to be used for all SNPs. In particular, the genotypes cannot be encoded as AA, CC, GG, TT, AC, AG, because then, e.g., CC would need to be treated as 1 for some SNPs and 3 for others. Instead, code the genotypes with something like AA, AB, BB, and then include the following in the control file:

genotypes:
  AA: 1
  AB: 2
  BB: 3
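One way to do such a recoding in base R is sketched below. The helper recode_snp() and the choice of which nucleotide counts as allele A are hypothetical and for illustration only; in practice the designation of A and B at each SNP must be applied consistently to both the individuals and the founders.

```r
# Hypothetical helper (not part of R/qtl2): recode nucleotide calls at one SNP
# to the common codes AA/AB/BB, given the nucleotides designated as A and B.
recode_snp <- function(calls, allele_a, allele_b) {
  hom_a <- paste0(allele_a, allele_a)
  hom_b <- paste0(allele_b, allele_b)
  het   <- c(paste0(allele_a, allele_b), paste0(allele_b, allele_a))

  out <- rep(NA_character_, length(calls))  # unrecognized calls stay missing
  out[calls %in% hom_a] <- "AA"
  out[calls %in% het]   <- "AB"
  out[calls %in% hom_b] <- "BB"
  out
}

# A C/T SNP: here C is (arbitrarily) taken as allele A and T as allele B
calls <- c("CC", "CT", "TT", "TC", NA, "GG")
recoded <- recode_snp(calls, allele_a = "C", allele_b = "T")
print(recoded)
```

With this coding, CC maps to AA at this SNP but could map to BB at another SNP where C is the B allele, which is exactly why the raw nucleotide codes cannot be used directly.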

Sex

Sex can be provided as a column in the covariate file or as a separate file.

If it is a column in the covariate file, the control file should have a section that looks like this:

sex:
  covar: sex
  f: female
  m: male

Here, “covar: sex” indicates that the column name used in the covariate file is “sex.” If the column name were “Sex,” you would write “covar: Sex.”

The other two “key: value” pairs are the encodings used for sex, with the “keys” being the codes used in the covariate file and the “values” being female and male. So this indicates that sex was encoded as f for females and m for males. If, instead, the sex covariate had 0 for females and 1 for males, you would use:

sex:
  covar: sex
  0: female
  1: male

Sex information can also be provided as a separate file. In this case, the file should have two columns: individual ID, and sex. Further, the part of the control file dealing with sex should look like this:

sex:
  file: sex_filename
  f: female
  m: male

So instead of a line with “covar:,” use “file:” followed by the name of the file (e.g., “file: iron_sex.csv”). You must still provide the sex encodings, as before.

Cross information

For simple crosses (e.g., an intercross), cross information can be a single column within the covariate file. In this case, include something like the following in the control file:

cross_info:
  covar: cross_direction
  (SxB)x(SxB): 0
  (BxS)x(BxS): 1

This is much like the information for sex. The “covar:” line indicates the name of the column in the covariate data that corresponds to the cross information. The other two lines indicate the encodings of the cross information as “key: value” pairs, where “key” is the code used in the cross information column and “value” is the integer to which it should be converted.

More generally, the cross information would be contained in a separate comma-delimited file. For simple crosses, in which the cross information is a single column, the file may use codes other than the integers that are ultimately needed; the control file then gives the translation, and the relevant section should look like this:

cross_info:
  file: crossinfo_filename
  (SxB)x(SxB): 0
  (BxS)x(BxS): 1

For more complex crosses (e.g., the Collaborative Cross), the cross information spans multiple columns and we require that the user have set this up in advance (i.e., no translation of encodings will be performed). In this case the relevant section of the control file looks like this:

cross_info:
  file: crossinfo_filename

Or, more simply, you could write:

cross_info: crossinfo_filename

Missing value codes

To indicate the set of codes that are to be treated as missing values in the genotype, founder genotype, phenotype, covariate, and phenotype covariate files, define na.strings within the control file:

na.strings:
- NA
- '-'

A hyphen needs to be surrounded by single or double quotes. Many other character strings (such as NA) do not. This is the same construction as for the allele codes above; the list with hyphens followed by a space is the YAML format for a vector. You could also write:

na.strings: [NA, '-']

which is another way to define a vector with YAML.

Field separator

If the data files use a separator other than a comma (e.g., a semicolon, or the vertical bar (|), which I like because it is seldom present in data), indicate the separator within the control file, as follows:

sep: '|'

A vertical bar needs to be surrounded by single or double quotes. A semicolon doesn’t, but it doesn’t hurt to quote it.

Detailed specifications for each cross type

Backcross

The cross type for a backcross is "bc". The genotype codes are 1 for homozygotes and 2 for heterozygotes, and so the yaml control file will contain lines like the following:

genotypes:
  AA: 1
  AB: 2

The values on the left-hand side are the genotype codes in the genotype data file; the numbers on the right are the numeric codes they’ll be converted into.

The equivalent genotype code specification for a json control file is:

"genotypes": {"AA": 1, "AB": 2}

If the X chromosome is included, we will need a sex covariate. No cross information covariate is needed.

F2 intercross

The cross type for an intercross between two inbred strains is "f2".

The genotypes will be encoded as 1, 2, and 3 for say AA, AB, and BB. The yaml control file will contain lines like the following:

genotypes:
  AA: 1
  AB: 2
  BB: 3

The values on the left-hand side are the genotype codes in the genotype data file; the numbers on the right are the numeric codes they’ll be converted into. The equivalent json specification is:

"genotypes": {"AA": 1, "AB": 2, "BB": 3}

If the X chromosome is included in the data, we will need a sex covariate as well as cross information. The cross information for an intercross indicates the direction of the cross, with 0 corresponding to the forward direction, for example (A×B)×(A×B) with females listed first, and 1 corresponding to the reverse direction, for example (B×A)×(B×A). What matters is the direction of the cross that produced the F1 males, since the F2 females get an intact X chromosome from their F1 father, which is A for the forward direction and B for the reverse. Different individuals can come from different directions, and you indicate the encodings with lines like the following:

cross_info:
  covar: cross_direction
  (AxB)x(AxB): 0
  (BxA)x(BxA): 1

Here, we’re saying that the covariate file has a column named cross_direction that contains the cross information, and that the forward direction is encoded as (AxB)x(AxB) and the reverse direction as (BxA)x(BxA).
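The translation that these encodings describe amounts to a lookup from the codes in the covariate column to integers. Here is a base-R sketch with made-up covariate values; it illustrates the idea and is not R/qtl2's actual implementation.

```r
# Encodings as they would appear under cross_info in the control file
codes <- c("(AxB)x(AxB)" = 0L,   # forward direction
           "(BxA)x(BxA)" = 1L)   # reverse direction

# Made-up values of the cross_direction covariate for four individuals
cross_direction <- c("(AxB)x(AxB)", "(BxA)x(BxA)", "(BxA)x(BxA)", "(AxB)x(AxB)")

# Look up the integer code for each individual
converted <- unname(codes[cross_direction])
print(converted)
```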

You can have multiple codes for the cross direction information. For example, the cross_direction covariate might use four possible codes, including not just (AxB)x(AxB) and (BxA)x(BxA), but also (BxA)x(AxB), and (AxB)x(BxA).

In this case, the yaml control file should contain the following:

cross_info:
  covar: cross_direction
  (AxB)x(AxB): 0
  (BxA)x(AxB): 0
  (BxA)x(BxA): 1
  (AxB)x(BxA): 1

The equivalent json specification is:

"cross_info":{"covar":"cross_direction", "(AxB)x(AxB)":0, "(BxA)x(AxB)":0, "(BxA)x(BxA)":1, "(AxB)x(BxA)":1}

If you’ve coded the cross direction as in R/qtl, with a covariate pgm with values 0 and 1, your yaml file would contain the following lines.

cross_info:
  covar: pgm
  0: 0
  1: 1

The equivalent json is:

"cross_info": {"covar": "pgm", "0": 0, "1": 1}

Example R/qtl2 data files for F2 intercrosses are among the sample data files at the R/qtl2 website.

Recombinant inbred lines (RIL)

The cross types for recombinant inbred lines derived from two inbred strains are "riself" and "risib" for RIL by selfing and sib-mating, respectively.

The genotypes will be encoded as 1 and 2, and so the yaml control file needs lines like:

genotypes:
  AA: 1
  BB: 2

The json equivalent is:

"genotypes": {"AA":1, "BB":2}

For RIL by selfing ("riself"), no X chromosome is allowed. For RIL by sib-mating ("risib") with an X chromosome, no sex covariate is needed, as we’re working with genotypes that correspond to lines rather than individuals, and the males and females within a line will have the same X chromosome; it’s just that the females will have two copies of it.

But for RIL by sib-mating with the X chromosome, we do need a cross information covariate which indicates the direction in which the cross was performed: 0 for the forward direction A×B (where a female A was crossed to a male B) and 1 for the reverse direction B×A (where a female B was crossed to a male A). If the cross information column is absent, we assume that all individuals came from the forward direction.

The yaml control file needs lines to indicate the cross information covariate and the encodings; if there is a covariate named cross_direction with values AxB and BxA, the yaml lines would be like this:

cross_info:
  covar: cross_direction
  AxB: 0
  BxA: 1

The equivalent json is:

"cross_info": {"covar": "cross_direction", "AxB": 0, "BxA": 1}

We don’t have example data files for RIL by sib-mating, but the qtl2 package includes data for a set of Arabidopsis RIL by selfing from Moore et al. (2013) Genetics 195:1077-1086, grav2. Here’s a zip file: grav2.zip.

Doubled haploids (DH)

The cross type for doubled haploids is "dh". We don’t allow an X chromosome, and don’t require sex or cross information. The genotypes will be encoded as 1 and 2, just as for recombinant inbred lines, and the yaml or json control file will have lines for genotypes that look just like those for riself and risib.

Haploids

The cross type for haploids is "haploid". As with doubled haploids, we don’t require sex or cross information, and the genotypes are coded as 1 and 2. The only difference between the "haploid" and "dh" cross types is the labels that are attached to the two genotypes in the output of various analyses: single-letter codes for haploids and double-letter homozygotes for doubled haploids.

Advanced intercross lines (AIL)

The cross type for advanced intercross lines derived from two inbred strains is "ail". Genotypes are coded 1, 2, and 3, just as with the F2 intercross (cross type "f2"). If an X chromosome is included, we need a sex covariate just as for the intercross.

Cross information needs to include two columns: the number of generations (for example, 12 for an AIL individual from generation F12), and the cross direction. Cross direction takes 3 possible values: 0 if the initial F1 individuals all came from an A×B cross (with a female A crossed to a male B), 1 if the initial F1 individuals all came from a B×A cross (with a female B crossed to a male A), and 2 if the initial F1 individuals were a balanced mixture of the two cross directions.

Because the cross information needs two columns, it can’t be provided through a covariate but rather needs to be included as a separate file, and the columns won’t be re-coded in any way; the first column needs to contain the generation numbers and the second column needs to contain cross direction coded as 0, 1, and 2.
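As a sketch of what such a file might look like, the following base-R code builds a small two-column cross information table and writes it as a CSV file. The individual IDs, the values, and the column names ngen and cross_direction are made up for illustration; the file is written to a temporary directory.

```r
# Made-up cross information for three AIL individuals
crossinfo <- data.frame(
  id = c("ind1", "ind2", "ind3"),
  ngen = c(12L, 12L, 10L),             # generation number (e.g., 12 for F12)
  cross_direction = c(0L, 1L, 2L)      # 0 = AxB, 1 = BxA, 2 = balanced mixture
)

# Cross direction must already be coded as 0, 1, or 2 (no re-coding is done)
stopifnot(all(crossinfo$cross_direction %in% 0:2))

# Write with IDs in the first column and column names in the first row
crossinfo_path <- file.path(tempdir(), "mycross_crossinfo.csv")
write.csv(crossinfo, crossinfo_path, row.names = FALSE, quote = FALSE)

cat(readLines(crossinfo_path), sep = "\n")
```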

If cross information is stored in a file mycross_crossinfo.csv, the yaml control file should contain a line like:

cross_info: mycross_crossinfo.csv

The json equivalent is the same but needs quotes:

"cross_info": "mycross_crossinfo.csv"

Heterogeneous stock (HS)

The cross type for heterogeneous stock (HS) derived from 8 inbred strains is "hs". HS is similar to AIL, but starting with 8 inbred strains rather than 2.

For HS and other multi-parent populations, we need SNP genotypes for both the individuals and the founders. They’ll be coded as 1, 2, and 3, with 1 and 3 being the two homozygotes and 2 the heterozygote. In the genotype and founder genotype files, the genotypes need to have a consistent coding scheme (like A, H, and B) for all markers. While the data might arrive with nucleotide codes, they need to be recoded before they can be read into R. (The R/qtl2convert package includes some tools to assist with this.)

The yaml control file will include lines similar to those for the F2 intercross:

genotypes:
  A: 1
  H: 2
  B: 3

The json equivalent is the following:

"genotypes": {"A":1, "H": 2, "B":3}

The genotype codes need not be single-character codes; for example, you could use AA, AB, and BB.

A single cross info covariate is needed: the generation number for each individual. This could be included with the other covariates, or it can be in a separate file.

If the X chromosome is included, a sex covariate is also needed.

Diversity Outbreds (DO)

The cross type for the Diversity Outbred mouse population is "do". This is similar to HS, but with a specific design in which random mating started with individuals taken from intermediate generations in the formation of a set of 8-way recombinant inbred lines. (See Svenson et al (2012) Genetics 190:437-447.)

The needs regarding genotypes, founder genotypes, cross information, and sex are all the same as for HS. In particular, the cross information is the number of generations of random mating, and sex is needed if the X chromosome is included.

Example R/qtl2 data files for two populations of DO mice are among the sample data files at the R/qtl2 website.

Multi-parent recombinant inbred lines

R/qtl2 can handle multi-parent recombinant inbred lines by selfing in the case of 4, 8, or 16 founder lines ("riself4", "riself8", and "riself16"), and by sib-mating in the case of 4 or 8 founder lines ("risib4" and "risib8"). The Collaborative Cross (CC) is an example of 8-way RIL by sib-mating ("risib8"). Many MAGIC populations are also examples of multi-way RIL by sibling mating.

As with "hs" and "do", we need SNP genotypes for both the individuals and the founders. They’ll be coded as 1 and 3 in these cases (any heterozygotes, 2, in either the individuals’ or founders’ genotypes will be ignored). In the genotype and founder genotype files, they might be coded as A and B, in which case the yaml control file will need lines like:

genotypes:
  A: 1
  B: 3
The json equivalent is the following:

"genotypes": {"A":1, "B":3}

For these crosses, we need cross information, in a file separate from any other covariates, that indicates the order in which the founder lines were crossed when producing each line. This will have as many columns as there are founders (for example, 8 columns for "risib8") and each row should be a permutation of the integers from 1 to the number of founders. For example, for 8-way RIL with founders A–H, say that a line was derived by the cross [(A×H) × (C×D)] × [(E×B) × (F×G)], with females always listed first. Then the row in the cross information for that line would be:

line_1,1,8,3,4,5,2,6,7

The first field (“line_1”) is the line ID. The integers 1–8 indicate the eight founder lines, as listed in the order in the founder genotype data files.
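The row above can be derived mechanically from the cross order. In this base-R sketch, the founder labels A to H are assumed to be listed in the same order as in the founder genotype file, and the variable names are just for illustration.

```r
# Founders in the order used in the founder genotype file (assumed A to H)
founders <- LETTERS[1:8]

# Order of founders in the cross [(AxH) x (CxD)] x [(ExB) x (FxG)]
funnel <- c("A", "H", "C", "D", "E", "B", "F", "G")

# Translate founder labels to their integer positions
cross_row <- match(funnel, founders)

# Each row must be a permutation of 1..8
stopifnot(identical(sort(cross_row), seq_along(founders)))

# The corresponding line of the cross information file
csv_line <- paste(c("line_1", cross_row), collapse = ",")
print(csv_line)
```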

As with 2-way RIL by sib-mating ("risib"), no sex covariate is needed or will be used.

19-way MAGIC lines

The "magic19" cross type is specific for the 19-way Arabidopsis MAGIC lines of Kover et al (2009) PLoS Genet 5:e1000551. These were derived from a full diallel (in both directions) among 19 strains followed by 3 generations of random mating (to F4) and then selfing for 6 generations.

As with "hs", "do", and the multi-parent RIL, we need both individual genotypes and genotypes on the 19 founder lines, and these must be SNPs and will be coded as 1 and 3 (with any heterozygotes treated as missing).

No cross info is needed, because the population was derived using a full diallel followed by random mating.

Example R/qtl2 data files for this population are among the sample data files at the R/qtl2 website.

6-way doubled haploids

The "dh6" cross type is for a set of maize MAGIC populations developed at the Wisconsin Crop Innovation Center: A diallel of 6 founders followed by random mating for some number of generations and then derivation of doubled haploids.

Again, we need both individual genotypes and genotypes on the 6 founder lines, and these must be SNPs and will be coded as 1 and 3 (with any heterozygotes treated as missing).

We need one cross information covariate: the number of generations of random mating prior to forming the doubled haploids. The yaml control file might include lines like the following:

cross_info:
  covar: n_gen

The json equivalent is:

"cross_info": {"covar": "n_gen"}

DOF1

The "dof1" cross type is for a population formed by crossing a single inbred strain to each of multiple diversity outbred (DO) mice. And so these offspring have one chromosome from a DO mouse and one chromosome from the other inbred strain.

The founder genotype data should include nine strains: the eight founders of the DO, followed by the inbred strain to which the DO mice were crossed.

As with the DO, a single cross information covariate is needed, with the number of generations for the DO parent of each DOF1 individual. Sex will also be needed if the X chromosome is included.

The genotype probabilities calculated for DOF1 mice will have just eight columns: we just keep track of the founder origin on the DO chromosome.

3-way advanced intercross lines

The "ail3" cross type is for an advanced intercross population formed from three inbred strains.

The founder genotype data should include three strains. A single cross information covariate is needed, the number of AIL generations. A sex covariate is needed if the X chromosome is included.

General advanced intercross lines

The "genail" cross type seeks to serve as an approximation for a variety of multi-parent heterozygous populations. The specific cross type to use is "genail[n]" where [n] is to be replaced by the number of founder lines, for example "genail8" for an 8-way AIL population.

We imagine that each individual is derived from a specified number of generations of random outbreeding with large population sizes, starting with a population that consists of the founders in specified proportions, $$\alpha_i$$ for founder $$i$$.

The cross information file should have one row per individual and should give the number of generations of outbreeding followed by integer values $$a_i$$ such that $$\alpha_i = a_i / \sum a_i$$. None of the $$a_i$$ may be 0.
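As a small worked example in base R, consider a hypothetical "genail3" individual with 10 generations of outbreeding and founder weights 2, 1, 1. The ID, the values, and the variable names are made up, and the row layout assumes the usual convention of individual IDs in the first column.

```r
n_gen <- 10L              # generations of outbreeding (made-up value)
a <- c(2L, 1L, 1L)        # integer weights a_i, one per founder

# None of the a_i may be 0
stopifnot(all(a > 0))

# Founder proportions: alpha_i = a_i / sum(a_i)
alpha <- a / sum(a)
print(alpha)

# The corresponding row of the cross information file
csv_line <- paste(c("ind1", n_gen, a), collapse = ",")
print(csv_line)
```

Here the first founder contributes half of the starting population and the other two a quarter each.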

Founder genotype data are also required.

General recombinant inbred lines

The "genril" cross type is like the "genail" cross type except for homozygous lines. It seeks to serve as an approximation for a variety of multi-parent recombinant inbred populations. The specific cross type to use is "genril[n]" where [n] is to be replaced by the number of founder lines, for example "genril8" for an 8-way RIL population.

We imagine that the individual lines are doubled haploids derived following a specified number of generations of random outbreeding with large population sizes, starting with a population that consists of the founders in specified proportions, $$\alpha_i$$ for founder $$i$$.

The cross information file should have one row per line and should give the number of generations of outbreeding followed by integer values $$a_i$$ such that $$\alpha_i = a_i / \sum a_i$$. None of the $$a_i$$ may be 0.

Founder genotype data are also required.