Here we consider the data from Srivastava et al. (2017) on Collaborative Cross mouse data, and compare the use of an exact hidden Markov model (HMM) to a more approximate one. The data are available at zenodo, with additional supplementary information in Table S2 at the journal website.

The compiled files that we used are at GitHub.

Load packages and data

We first load R/qtl2 as well as the qtl2convert, qtl2fst, broman, data.table, readxl, and here packages.

library(qtl2)
library(qtl2convert)
library(broman)
library(here)
library(data.table)
library(readxl)

We create a cache directory to store stuff.

cachedir <- here("CCapp/_cache")
if(!dir.exists(cachedir)) dir.create(cachedir)

We download the data from the GitHub as a zip file with the data in R/qtl2 format.

url <- "https://raw.githubusercontent.com/rqtl/qtl2data/master/CC/cc.zip"
file <- file.path(cachedir, basename(url))
if(!file.exists(file)) download.file(url, file)

We load it with read_cross2().

cc <- read_cross2(file)

The data concern 69 Collaborative Cross mouse lines They were genotyped with the GigaMUGA array. There are a total of 110,054 informative markers.

We create a "genril8" version, to use the general (but approximate) HMM.

genril <- qtl2convert::cross2_ril_to_genril(cc)

Cross design information

The CC lines are 8-way recombinant inbred lines; each was formed by crossing eight founders in a specific “funnel” to bring the eight genomes together as rapidly as possible. The cross design turns out to not matter for the autosomes in eight-way RIL by sibling mating, but it is critical for the X chromosome, as only five of the eight alleles can contribute to the X chromosome. (In the cross [(A×B)×(C×D)]×[(E×F)×(G×H)], with the female parents listed first, only alleles A, B, C, E, and F should be present on the X chromosome.)

We download Table S2 from Srivastava et al. (2017), as well as the SupplementalData.zip file from zenodo, which contains CCStrains.csv.

url <- "http://www.genetics.org/highwire/filestream/438137/field_highwire_adjunct_files/10/TableS2.xlsx"
tabs2_file <- here("CCapp", basename(url))
if(!file.exists(tabs2_file)) download.file(url, tabs2_file)
url <- "https://zenodo.org/record/377036/files/SupplementalData.zip?download=1"
zipfile <- here("CCapp/SupplementalData.zip")
csvfile <- here("CCapp/CCStrains.csv")
if(!file.exists(csvfile)) {
    csvfile_in_zip <- "SupplmentalData/CCStrains.csv"
    if(!file.exists(zipfile)) download.file(url, zipfile)
    unzip(zipfile, csvfile_in_zip, exdir=dirname(csvfile), junkpaths=TRUE)
}
tabS2 <- as.data.frame( readxl::read_excel(tabs2_file) )
ccstr <- data.table::fread(csvfile, data.table=FALSE)

Table S2 provides funnel codes for 55 of the 69 CC lines. Both files also give information about the original of the mitochondria and X chromosome, and the total number of founders that contribute to each line. (13 lines have 6 or 7 founders: 8 have 6 founders and 5 have 7 founders.)

Key notes:

We used genotype probabilities provided by Srivastava et al. (2017) to construct cross funnels compatible with the inferred X chromosome genotypes as well as the mitochondria and Y chromosome where possible (see the qtl2data repository at GitHub).

Calculate genotype probabilities

We now turn to the genotype probability calculations. We insert pseudomarkers into the marker map so that no two markers/pseudomarkers are > 0.1 cM apart. We use interpolation to find the corresponding physical positions.

gmap <- insert_pseudomarkers(cc$gmap, step=0.1, stepwidth="max")
pmap <- interp_map(gmap, cc$gmap, cc$pmap)

Now we do the genome reconstruction, first with the more “exact” HMM. We assume 0.2% genotyping error and use the Carter-Falconer map function.

pr_cc <- calc_genoprob(cc, gmap, error_prob=0.002, map_function="c-f", cores=0)

We now do the same thing using the "genril8" cross type that gives the approximate HMM.

pr_genril <- calc_genoprob(genril, gmap, error_prob=0.002, map_function="c-f", cores=0)

Note that the calculation of the genotype probabilities using an 8-core linux laptop with 64 GB RAM, it took 28 sec with the "risib8" cross type and 17 sec with the "genril8" cross type, so the approximate method took about 62% as much time.

Observed differences

The differences in the genotype probabilities are negligible on the autosomes, but can be substantial on the X chromosome.

p_d <- sapply(names(pr_cc), function(chr) apply(abs(pr_cc[[chr]] - pr_genril[[chr]]), 1, max))
p_d_chr <- apply(p_d, 2, max)

The maximum difference on the autosomes was 0.0006. On the X chromosome, there are 7 lines with differences > 0.25, and 15 lines with differences > 0.1. These differences are largely cases where some founder strains are indistinguishable for a small region, and where the cross funnel for a CC line excludes one or more of such founders.

If we take the inferred allele to be that with the greatest probability, provided that it has probability > 0.95, we see no differences in the inferred alleles. That is, there are no cases where the more-exact model and the approximate model give high probability to different probabilities.

v_cc <- maxmarg(pr_cc, cores=0)
v_genril <- maxmarg(pr_genril, cores=0)

vv_cc <- do.call("cbind", v_cc)
vv_genril <- do.call("cbind", v_genril)

stopifnot( sum(vv_cc != vv_genril, na.rm=TRUE)==0 )

Also note that the with the approximate model (which puts no constraints on the allowed alleles on the X chromosome), there are no cases where the inferred allele differs from what would be expected under the assumed cross design.

obs_alleles <- t(apply(v_genril$X, 1, function(a) !is.na(match(1:8, a))))
exp_alleles <- t(apply(cc$cross_info, 1, function(a) {
                    result <- rep(FALSE, 8)
                    result[a[c(1,2,3,5,6)]] <- TRUE
                    result }))
stopifnot( !any(obs_alleles & !exp_alleles) )

There are some important X chromosome differences in the genotype probabilities by the two models, but they all look to be cases where two or more founder strains cannot be distinguished for some interval on the chromosome, and one or more are excluded by the cross design assumed in the more-exact model.

For example, consider the X chromosome genotype probabilities strain CC038/GeniUnc by the two methods:

par(mfrow=c(2,1), mar=c(4.1, 4.1, 2.1, 1.1), las=1)
str <- "CC038/GeniUnc"
plot_genoprob(pr_cc, pmap, str, "X", main="CC038 with more-exact model", yaxt="n",
              xlab="Chr X position (Mbp)")
axis(side=2, at=8:1, names(CCcolors))
plot_genoprob(pr_genril, pmap, str, "X", main="CC038 with approximate model", yaxt="n",
              xlab="Chr X position (Mbp)")
axis(side=2, at=8:1, names(CCcolors))

The results look identical except for a small segment at around 135 Mbp, which the more-exact model infers to be from NOD, while the approximate model splits it evenly between NOD and B6. Let’s look at the founder genotypes across that region.

# find the region (high prob for NOD and between 120-140 Mbp)
mar <- names(pmap$X)[pmap$X >= 120 & pmap$X <= 140 & pr_cc$X["CC038/GeniUnc","DD",] > 0.95]

# pull out the founder genotypes (no missing data)
fg <- cc$founder_geno$X[,mar]
stopifnot(all(!is.na(fg) & (fg == 1 | fg==3)) )
# number of differences in founder genotypes over the region
d <- matrix(0, nrow(fg), nrow(fg))
for(i in 1:(nrow(fg)-1)) for(j in (i+1):nrow(fg)) d[i,j] <- d[j,i] <- sum(fg[i,] != fg[j,])
nd <- length(mar)-d

In the region from 134.0 to 135.0 Mbp, these two founder strains are identical for 60 of 60 markers, while all of the other 6 strains differ from them at ≥ 23 markers (A/J and 129 are also identical over this region, but they differ from B6 and NOD at 26 markers.) The B6 founder is excluded based on the cross design for this strain (funnel code CEDHGABF).

Contrast these results with what we get if we use an incorrect cross edesign, for example assuming the cross where these two alleles are excluded, say that for CC020/GeniUnc, GFHDEACB.

# calc probability
cc038 <- cc["CC038/GeniUnc","X"]
cc038$cross_info[] <- as.numeric(cc["CC020/GeniUnc","X"]$cross_info[])
pr_cc038_wrong <- calc_genoprob(cc038, gmap["X"], error_prob=0.002, map_function="c-f")

# save for the paper
saveRDS(pr_cc038_wrong, file.path(cachedir, "probs_cc038_wrong.rds"))

# make the plot
par(mar=c(4.1, 4.1, 2.1, 1.1), las=1)
plot_genoprob(pr_cc038_wrong, pmap, chr="X", main="CC038 with wrong cross", yaxt="n",
              xlab="Chr X position (Mbp)")
axis(side=2, at=8:1, names(CCcolors))

# inferred crossovers
nxo_orig <- count_xo(v_cc)["CC038/GeniUnc", "X"]
nxo_wrong <- count_xo(maxmarg(pr_cc038_wrong))["CC038/GeniUnc", "X"]
nxo_genril <- count_xo(v_genril)["CC038/GeniUnc", "X"]

As you can see, the whole chromosome becomes a complete mess, as the cross design excludes not just B6 and NOD, but also 129. With this incorrect cross design, there is an inferred 39 crossovers on the X chromosome, versus 5 crossovers when the correct cross design is used. (The approximate model, which doesn’t exclude any alleles from the X chromosome, gives an inferred 3 crossovers, because it leaves that B6/NOD segment as un-inferred and our crude count of crossovers ignores the obvious double-crossover.)

Other differences

Are the other differences all like this? Let’s look at the 15 cases with probability differences > 0.1

On the left is a bivariate heatmap of the probabilities, with red indicating high probability with the more-exact model alone, blue indicating high probability with the approximation model alone, and purple indicating high probability by both models.

On the right are the differences in the probabilities.

main <- function(str="CC002/Unc") {
    ccinf <- cc$cross_info[str,]
    excl <- names(CCcolors)[sort(ccinf[c(4,7,8)])]
    incr <- names(CCcolors)[ccinf[3]]

    paste0(str, " (", incr, " high; ", paste(excl,collapse=","), " excluded)")
}

add_text <- function(pos, founder, pos_mid, val) {
    adj <- c(ifelse(pos > pos_mid, 1, 0), 0.5)
    xd <- ifelse(pos > pos_mid, -1, +1)*0.02*pos_rng
    text(pos + xd, val, names(CCcolors)[founder], adj=adj, col=CCcolors[founder])
    }

o <- order(p_d[,"X"], decreasing=TRUE)
par(mar=c(3.1,3.1,2.1,1.1), mfrow=c(1,2))
for(i in 1:sum(p_d[,"X"] > 0.1)) {
    str <- ind_ids(cc)[o[i]]

    plot_genoprobcomp(pr_cc, pr_genril, pmap, str, "X",
                      main=main(str), yaxt="n", xaxt="n", xlab="")
    axis(side=1, at=seq(20, 160, by=20), mgp=c(0, 0.3, 0), tick=FALSE)
    axis(side=2, at=8:1, labels=names(CCcolors), mgp=c(0, 0.2, 0), tick=FALSE, las=1)
    title(xlab="X chr position (Mbp)", mgp=c(1.8, 0, 0))

    d <- pr_cc[["X"]][o[i],,] - pr_genril[["X"]][o[i],,]
    grayplot(pmap$X, d[1,],
             ylim=c(-1, 1)*max(p_d[,"X"]), type="l", lwd=2, col=CCcolors[1],
             xlab="X chr position (Mbp)", ylab="difference in probability",
             main=str)
    for(j in 2:8) {
        lines(pmap$X, d[j,],
              lwd=2, col=CCcolors[j])
    }

    wh_max <- which(d==max(d), arr=TRUE)
    wh_min <- which(d==min(d), arr=TRUE)

    pos_max <- pmap$X[wh_max[2]]
    pos_min <- pmap$X[wh_min[2]]
    pos_mid <- median(range(pmap$X))
    pos_rng <- diff(range(pmap$X))

    add_text(pos_max, wh_max[1], pos_mid, d[wh_max[1], wh_max[2]])
    add_text(pos_min, wh_min[1], pos_mid, d[wh_min[1], wh_min[2]])
}

Summary

There are negligible differences on the autosomes.

The differences on the X chromosome are due to the use of the cross design, which can be helpful to exclude certain founder alleles, especially for regions where multiple founders are identical by descent. The other contributor to the differences concern the 2:1 prior for one of the founders (in the third slot of the cross funnel) versus the others.

Session info

Here are the details on the version of R and the versions of the R packages that I used.

sessionInfo()
## R version 4.1.0 (2021-05-18)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Pop!_OS 21.04
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8       
##  [4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
## [10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] readxl_1.3.1       data.table_1.14.0  here_1.0.1         broman_0.72-4      qtl2convert_0.25-1
## [6] qtl2_0.25-8       
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.7        highr_0.9         pillar_1.6.2      bslib_0.2.5.1     compiler_4.1.0   
##  [6] cellranger_1.1.0  jquerylib_0.1.4   tools_4.1.0       digest_0.6.27     bit_4.0.4        
## [11] jsonlite_1.7.2    RSQLite_2.2.7     evaluate_0.14     memoise_2.0.0     tibble_3.1.3     
## [16] lifecycle_1.0.0   pkgconfig_2.0.3   rlang_0.4.11      DBI_1.1.1         yaml_2.2.1       
## [21] parallel_4.1.0    xfun_0.24         fastmap_1.1.0     stringr_1.4.0     knitr_1.33       
## [26] vctrs_0.3.8       sass_0.4.0        bit64_4.0.5       rprojroot_2.0.2   R6_2.5.0         
## [31] fansi_0.5.0       qtl_1.48-1        rmarkdown_2.9     blob_1.2.2        magrittr_2.0.1   
## [36] htmltools_0.5.1.1 ellipsis_0.3.2    assertthat_0.2.1  utf8_1.2.2        stringi_1.7.3    
## [41] cachem_1.0.5      crayon_1.4.1