Preparing Diversity Outbred (DO) mouse data for R/qtl2
Studies using Diversity Outbred (DO) mice often involve SNP genotype data from the MegaMUGA or GigaMUGA arrays from GeneSeek. To analyze such data with R/qtl2, you first need to do some data reorganization and pre-processing. In this document, we will attempt to explain what to do.
Founder genotypes and SNP maps
With DO data, an important component of the R/qtl2 input files is the SNP genotypes for the eight founder lines. We also need the genetic and physical positions of the SNP markers.
has prepared such data for the MegaMUGA and GigaMUGA arrays and made
them available at ftp.jax.org/MUGA. The files with names
MM_ are for the MegaMUGA array; the files with names
GM_ are for the GigaMUGA array.
MM_primary_files.zip(doi:10.6084/m9.figshare.5404717.v1), containing all of the files for the MegaMUGA array
GM_primary_files.zip(doi:10.6084/m9.figshare.5404675.v1), containing all of the files for the GigaMUGA array
The files at ftp.jax.org/MUGA and in these
zip files are all in
.Rdata format, for
R. For the GigaMUGA data, the key files
GM_snps.Rdata, containing SNP locations, and
GM_geno.Rdata, containing SNP genotypes for the Collaborative Cross
(CC) founder strains (which are also the founders of the DO), as
To use these data with R/qtl2, we need to reformat the files and
recode the SNP genotypes (for example, as
placed an R script to do this at the figshare site,
This uses the R/qtl2convert
package, in particular
find_consensus_geno() to determine consensus
genotypes for each founder strain at each SNP from the multiple
individuals that were genotyped,
find_unique_geno() to determine the
SNP alleles, and
encode_geno() to encode the SNP genotypes as
The processed files are also at figshare:
Note: if you’re using the latest mouse genome build (GRCm39), you’ll want the revised version of these files:
GM_ files are for the MegaMUGA and GigaMUGA arrays,
MMnGM_ file (produced with the R script
merges the SNPs for the two arrays, for DO projects that have some
mice genotyped with the MegaMUGA array and others genotyped with the
Each of these zip files contain a series of CSV files. For example, for the GigaMUGA data, we have:
GM_allelecodes.csv, the allele encodings for each SNP
GM_foundergeno*.csv, the CC founder genotypes, encoded as
-, with one file for each chromosome
GM_gmap*.csv, the genetic map locations for the SNPS (in cM), with one file for each chromosome
GM_pmap*.csv, the physical map locations for the SNPS (in Mbp), with one file for each chromosome
GM_info.csv, information for all SNPs, including the genetic and physical locations and
tier(an indication of SNP quality)
If your DO data includes just the GigaMUGA array, you’ll
If you’ve used just the MegaMUGA array, you’ll need the
If your project includes some MegaMUGA and some GigaMUGA arrays,
you’ll need the
(Again, for mouse genome build GRCm39, you’ll want the revised version
of these files, mentioned above and also on figshare.)
Download the appropriate file and unzip it somewhere.
Encoding the DO genotypes
Now to the actual DO genotypes. GeneSeek will provide a zip file that
contains a series of files. We’ll use only the
which is the biggest of them, with one line for each SNP and for each
sample. For 200 mice, it’ll be about 1 GB, and maybe a quarter of that
size when compressed as a zip file.
We provide an example R script,
geneseek2qtl2.R, to convert this file
into what’s needed for R/qtl2. It does two things:
- Encodes the genotypes as
A/H/B/-and writes them to a series of CSV files, one per chromosome
- Extracts the SNP array intensities for the SNPs on the X and Y chromosome, which we find useful for verifying the sex of the mice.
To use this script, you’ll need to edit three lines near the top:
codefiledefines the path to the
MMnGM_version, if you’re using MegaMUGA or both arrays in your project)
ifilesdefines the path to your
FinalReport.txtfile. This can be a vector of such file paths, if your genotyping was performed in batches
ostemdefines the path and file “stem” for the output files. For example, if you use
ostem <- "Data/forqtl2", the output files will be placed in the
Datasubdirectory and will have names like
One potential issue to contend with, in reading in the genotype files from GeneSeek, is that these files may have varying headers at the top that you may need to skip over.
Another issue regards possible
recoding of the sample identifiers. For example, you may have one batch
where the samples are labeled like
DO-146 and another where they’re
146 and another where they’re labeled
Search for the following line, and do some reformatting of the sample
# Note: may need to revise the IDs in the second column
You’ll need to have the R/qtl2convert package installed. And note that, since this script reads in the full data into memory, you’ll need a computer with appreciable RAM.
The result of the
script will be a series of CSV files containing the re-coded
genotypes, with one file per chromosome, plus two files containing
the SNP array intensities for the X and Y chromosomes (one file per
allele per chromosome).
Preparing the phenotype and covariate data
You next need to prepare files with the phenotype data (with strictly numeric phenotypes) and covariate data (which can include non-numeric variables). Each should be arranged with samples as rows and variables as columns, and with the first column being the sample IDs, exactly as used in the genotype data.
The covariate data needs to contain a
sex column, plus a column
giving the generation number (e.g.,
ngen) for each DO animal.
For help in determining the generation numbers, see
lists the generation number for different dates of distribution of
mice from the Jackson Lab.
Preparing the control file
Finally, you need to prepare a control file which includes details of the file names and the various encodings and variable names.
write_control_file() to make the control
file. It’s a bit complicated, because there’s a lot of stuff to
specify, and some of the pieces are sort of confusing.
library(qtl2) chr <- c(1:19, "X") write_control_file("forqtl2.json", crosstype="do", description="My awesome DO project", founder_geno_file=paste0("GM/GM_foundergeno", chr, ".csv"), founder_geno_transposed=TRUE, gmap_file=paste0("GM/GM_gmap", chr, ".csv"), pmap_file=paste0("GM/GM_pmap", chr, ".csv"), geno_file=paste0("forqtl2_geno", chr, ".csv"), geno_transposed=TRUE, geno_codes=list(A=1, H=2, B=3), xchr="X", pheno_file="forqtl2_pheno.csv", covar_file="forqtl2_covar.csv", sex_covar="sex", sex_codes=list(F="Female", M="Male"), crossinfo_covar="ngen")
The first three arguments are the name of the file to be created (with
.yaml), the cross type (
"do" for DO
mice), and a description.
founder_geno_file gives the name of the file that contains the founder
genotypes, here a vector of file names since we have the genotypes
split by chromosome. Then
founder_geno_transposed=TRUE indicates that we
have the founder strains as columns and the markers as rows, rather
than the default orientation with strains as the rows.
pmap_file indicate the names of the
files that contain the genetic (
gmap) and physical (
positions, respectively. These are again each a vector of file names, since the
files are split by chromosome.
geno_file indicates the name of the file with the DO genotype data (again, a
vector of file names as they’re split by chromosome), and we use
geno_tranposed=TRUE because the
geneseek2qtl2.R script discussed above
writes the CSV files with individuals as columns and markers as rows.
geno_codes indicates the encodings for both the DO genotypes
and the founder genotypes. This may be a bit confusing, but we want to
make a list with the names being the codes used in the data and the
values being the integers 1, 2, and 3. So
xchr indicates the chromosome identifier for the X
covar_file arguments indicate the names of the
files with the phenotype and covariate data, respectively. We then
provide some details about specific covariates that are important: sex
sex_codes to explain the encodings, with
M being the codes used in the data and the values
"Male" indicating which sexes the codes correspond
to. And finally
crossinfo_covar indicates the name of the covariate
column that contains the cross information (
"ngen" for number of
generations, for cross type
It’s sometimes useful to create chromosome-specific control
files, for the case that you want to just load one chromosome worth of
data. You can do this by first defining
chr to be a single value,
chr <- "1". You can then use exactly the same
write_control_file() code, though perhaps changing the first line,
with the name of the file to be produced. For example:
library(qtl2) chr <- "1" write_control_file("forqtl2_chr1.json", crosstype="do", description="My awesome DO project", founder_geno_file=paste0("GM/GM_foundergeno", chr, ".csv"), founder_geno_transposed=TRUE, gmap_file=paste0("GM/GM_gmap", chr, ".csv"), pmap_file=paste0("GM/GM_pmap", chr, ".csv"), geno_file=paste0("forqtl2_geno", chr, ".csv"), geno_transposed=TRUE, geno_codes=list(A=1, H=2, B=3), xchr="X", pheno_file="forqtl2_pheno.csv", covar_file="forqtl2_covar.csv", sex_covar="sex", sex_codes=list(F="Female", M="Male"), crossinfo_covar="ngen")