R/qtl2 (aka qtl2) is a reimplementation of the QTL analysis software R/qtl, to better handle high-dimensional data and complex cross designs.
The input data file formats for R/qtl cannot handle complex crosses, and so for R/qtl2, we need to define a new input file format. This document describes the details.
For simple cross types, we can continue to use the file formats for R/qtl, use
qtl::read.cross() to read in the data, and then use a conversion function (
qtl2geno::convert2cross2()) to convert the data into the new format.
For more complex crosses, we need to define a new format. I was persuaded by Aaron Wolen’s idea of a “tidy” format for R/qtl, with three separate CSV files, one for phenotypes, one for genotypes, and one for the genetic map.
Another important idea is from Pjotr Prins’s qtab format: the inclusion of metadata, such as genotype encodings, with the primary data. This will simplify the handling of multiple files and will help to avoid mistakes.
And so the basic idea for the new format is to have a separate file for each part of the primary data (genotypes, founder genotypes, genetic map, physical map, phenotypes, covariates, and phenotype covariates), and then a control file (in YAML or JSON format) which specifies the names of all of those files, the genotype encodings and missing value codes, and things like the name of the sex column within the covariate data (and the encodings for the sexes) and which chromosome is the X chromosome.
Before discussing the boring file specifications, let’s consider briefly how the data are read into R.
A key advantage of the control file scheme is that it greatly simplifies the function for reading in the data. That function,
read_cross2(), has a single argument: the name (with path) of the control file. So you can read in data like this:
library(qtl2geno) grav2 <- read_cross2("~/my_data/grav2.yaml")
The large number of files is a bit cumbersome, so we’ve made it possible to use a zip file containing all of the data files, and to read that zip file directly. There’s even a function for creating the zip file:
zip_datafiles() function will read the control file to identify all of the relevant data files and then zip them up into a file with the same name and location, but with the extension
.zip rather than
To read the data back in, we use the same
read_cross2() function, providing the name (and path) of the zip file rather than the control file.
grav2 <- read_cross2("~/my_data/grav2.zip")
This can even be done with remote files.
grav2 <- read_cross2("http://kbroman.org/qtl2/assets/sampledata/grav2/grav2.zip")
Of course, the other advantage of the zip file is that it is compressed and so smaller than the combined set of CSV files.
The bulk of the data is in a set of comma-delimited (CSV) files. In addition, a control file (in YAML or JSON format), contained in the same directory as the CSV files, specifies the file names and other control parameters (such as genotype and sex encodings). Sample data files are available at the R/qtl2 website. We’ll discuss the CSV files first.
The comma-delimited (CSV) files are each in the form of a simple matrix, with the first column being a set of IDs and the first row being a set of variable names.
Missing value codes will be specified in the control file (as
na.strings, with default value
"NA") and will apply across all files, so a missing value code for one file cannot be an allowed value in another file.
The CSV files can include a header with a set of comment lines initiated by a value specified in the control file as
comment.char (with default value
"#"). The first such line could be a description of the contents of the file. These comment lines can include the expected number of rows and columns, like this:
# This file contains blah, blah, blah... # nrow 25012 # ncol 91
The number of rows (
nrow) includes only the data rows (not the comment rows, nor the row with variable names). On the other hand, the number of columns (
ncol) does include the column with individual IDs.
All of these CSV files may be transposed relative to the form described below. You just need to include, in the control file, a line like
The genotype data file is a matrix of individuals × markers. The first column is the individual IDs; the first row is the marker names. The founder genotypes (if needed) are in the same form, with founder lines as rows and markers as columns, and with founder IDs in the first column.
We split the numeric phenotypes from the mixed-mode covariates, as two separate CSV files. Each file forms a matrix of individuals × phenotypes (or covariates), with the first column being individual IDs and the first row being phenotype or covariate names. Sex and line IDs (if needed) can be columns in the covariate data.
A separate CSV file contains phenotype covariate data, as phenotypes × phenotype covariates. The first column contains phenotype names, and the first row contains the names of the phenotype covariates.
Note: The genotype, founder genotype, phenotype, covariate, and phenotype covariate data can be split across multiple files. For example, the genotype data could be split by chromosome. The individual IDs must appear in each file; these are used to combine the files.
Genetic and physical maps of the genotyped markers will be as separate CSV files, each with three columns: marker, chromosome, and position. The first row should be
marker,chr,pos but will be ignored. In the genetic map file, positions should be in centiMorgans (cM). In the physical map file, positions should be in megabasepairs (Mbp).
"cross_info" data specifies details of the cross that generated each individual and is a numeric matrix with individuals as rows (the same number of rows as in the genotype data) and with columns depending on the cross type.
For simple cross types (e.g.,
"f2", an intercross between two inbred lines), this cross information may be included as a column in the covariate data. More generally, the cross information will be a separate CSV file. For example, for a set of Collaborative Cross (CC) lines, we will want a matrix with eight columns, which indicate the order of the founders in the crosses that generated each CC line.
So, in general, the cross information will be in a CSV file with individuals as rows and a set of columns that define the cross information for that cross type. The first column contains individual IDs and the first row contains column names. Details on the column information are provided in the cross-type-specific information, below.
The new input file format includes a text-based control file (in YAML or JSON format) to specify the names of all of the other files as well as various control parameters such as genotype and sex encodings and codes for missing values. We use YAML because it is flexible, readable, and easy to import into R. We also allow JSON; though it is often less human-readable, it can be less prone to errors.
The format of the control file is a bit technical. We describe the details here and also provide a function
write_control_file() that takes the detailed specifications as input and contructs the control file in the correct format.
We’ll start with an example: the control file for the sample intercross data.
# Data from Grant et al. (2006) Hepatology 44:174-185 # Abstract of paper at PubMed: http://www.ncbi.nlm.nih.gov/pubmed/16799992 # Available as part of R/qtl book package, https://github.com/kbroman/qtlbook crosstype: f2 geno: iron_geno.csv pheno: iron_pheno.csv phenocovar: iron_phenocovar.csv covar: iron_covar.csv gmap: iron_gmap.csv alleles: - S - B genotypes: SS: 1 SB: 2 BB: 3 sex: covar: sex f: female m: male cross_info: covar: cross_direction (SxB)x(SxB): 0 (BxS)x(BxS): 1 x_chr: X na.strings: - '-' - NA
Any line that begins with a “
#” is treated as a comment and ignored. It’s good to include some comments at the top of the file, describing the dataset.
The order of things within the file is not important, but the names of things are critical.
Much of the information is represented as key-value pairs, as “
key: value.” For example, the cross type is indicated with a line like
key” is “
crosstype” and the “
value” is “
f2.” This indicates that the data are for an F2 intercross between two inbred lines.
The names of the basic CSV files are indicated with lines like
This indicates that the genotype data are in the file
iron_geno.csv. The files are expected to be in the same directory as the control file. They could be placed in separate directories, with the file names being paths relative to the location of the control file, but this is not recommended (or well tested).
The “keys” for the different files are the following:
geno: genotype_filename founder_geno: founder_genotype_filename pheno: phenotype_filename covar: covariate_filename phenocovar: phenotype_covariate_filename gmap: genetic_map_filename pmap: physical_map_filename
Most of these files are optional; if a particular file is not used, the corresponding key can be omitted from the control file.
If the data for a section is split into multiple files (for example, if the genotypes are split into chromosome-specific files), then a vector of file names should be provided. For example:
geno: - geno1.csv - geno2.csv - genoX.csv founder_geno: - founder_geno1.csv - founder_geno2.csv - founder_genoX.csv
If one of the chromosomes is to be treated as the X chromosome, there should be a line like
This specifies the chromosome ID for the X chromosome (
X in this case).
To add labels in summary tables and plots, provide a vector of single-character allele labels, with one for each founder line. For example,
alleles: - S - B
This list of items, each beginning with a hyphen and a space, is the YAML format for a vector. It is equivalent to the R code
You could also write this line as
alleles: [S, B]
which is an alternative format for vectors in YAML.
The control file should contain a record with “
genotypes:” that specifies the genotype encodings. Here’s an example:
genotypes: SS: 1 SB: 2 BB: 3
For each possible genotype code, indent and provide a “
key: value” pair, with the key being the code used in the genotype and founder genotype files, and the value being an integer to which the genotype should be converted.
The above example would be suitable for a backcross or intercross. For a backcross, the second homozygote (
BB in this case) is only needed in the case that there are X chromosome genotypes for males.
For RIL, we would use something like
genotypes: BB: 1 DD: 2
For crosses with multiple parents, the genotype file should contain genotype calls for a set of SNPs, and there should be a corresponding founder genotype file with genotypes of the founders at those SNPs. A common set of genotype codes needs to be used for all SNPs. In particular, the genotypes cannot be encoded as
AG, because then, e.g.,
CC would need to be treated as
1 for some SNPs and
3 for others. Instead, code the genotypes with something like
BB, and then include the following in the control file:
genotypes: AA: 1 AB: 2 BB: 3
Sex can be provided as a column in the covariate file or as a separate file.
If it is a column in the covariate file, the control file should have a section that looks like this:
sex: covar: sex f: female m: male
covar: sex” indicates that the column name used in the covariate file is “
sex.” If the column name were “
Sex,” you would write “
The other two “
key: value” pairs are the encodings used for sex, with the “keys” being the codes used in the covariate file and the “values” being
male. So this indicates that sex was encoded as
f for females and
m for males. If, instead, the sex covariate had
0 for females and
1 for males, you would use:
sex: covar: sex 0: female 1: male
Sex information can also be provided as a separate file. In this case, the file should have two columns: individual ID, and sex. Further, the part of the control file dealing with sex should look like this:
sex: file: sex_filename f: female m: male
So instead of a line with “
covar:,” use “
file:” followed by the name of the file (e.g., “
file: iron_sex.csv”). You must still provide the sex encodings, as before.
For simple crosses (e.g., an intercross), cross information can be a single column within the covariate file. In this case, include something like the following in the control file:
cross_info: covar: cross_direction (SxB)x(SxB): 0 (BxS)x(BxS): 1
This is much like the information for sex. The “
covar:” line indicates the name of the column in the covariate data that corresponds to the cross information. The other two lines indicate the encodings of the cross information as “
key: value” pairs, where “
key” is the code used in the cross information column and “
value” is the integer to which it should be converted.
More generally, the cross information would be contained in a separate comma-delimited file. For simple crosses, in which the cross information is a single column, we allow it to be encoded differently from what is needed, and the control file information should look like this:
cross_info: file: crossinfo_filename (SxB)x(SxB): 0 (BxS)x(BxS): 1
For more complex crosses (e.g., the Collaborative Cross), the cross information spans multiple columns and we require that the user have set this up in advance (i.e., no translation of encodings will be performed). In this case the relevant section of the control file looks like this:
cross_info: file: crossinfo_filename
Or, more simply, you could write:
To indicate the set of codes that are to be treated as missing values in the genotype, founder genotype, phenotype, covariate, and phenotype covariate files, define
na.strings within the control file:
na.strings: - NA - '-'
A hyphen needs to be surrounded in single- or double-quotes. Many other character strings (such as
NA) do not. This is a similar contruction as for the allele codes above; the list with hyphens followed a space is the YAML format for a vector. You could also write:
na.strings: [NA, '-']
which is another way to define a vector with YAML.
If the data files use a separator other than a comma (e.g., a semi-colon, or the vertical bar (
|) which I like because it is seldom present in data), indicate the separator within the control file, as follows:
A vertical bar needs to be surrounded by single- or double-quotes. A semicolon doesn’t, but it doesn’t hurt if you do.
[to be provided, eventually]