The input data file formats for R/qtl cannot handle complex crosses, and so for R/qtl2, we need to define a new input file format. This document describes the details.
For simple cross types, we can continue to use the file formats for R/qtl, use
qtl::read.cross() to read in the data, and then use a conversion function (
qtl2::convert2cross2()) to convert the data into the new format.
For more complex crosses, we need to define a new format. I was persuaded by Aaron Wolen’s idea of a “tidy” format for R/qtl, with three separate CSV files, one for phenotypes, one for genotypes, and one for the genetic map.
Another important idea is from Pjotr Prins’s qtab format: the inclusion of metadata, such as genotype encodings, with the primary data. This will simplify the handling of multiple files and will help to avoid mistakes.
And so the basic idea for the new format is to have a separate file for each part of the primary data (genotypes, founder genotypes, genetic map, physical map, phenotypes, covariates, and phenotype covariates), and then a control file (in YAML or JSON format) which specifies the names of all of those files, the genotype encodings and missing value codes, and things like the name of the sex column within the covariate data (and the encodings for the sexes) and which chromosome is the X chromosome.