R/qtl2geno includes a reimplementation of the basic hidden Markov model code for the calculation of conditional genotype probabilities (given the available multipoint marker genotype data) and the estimation of genetic maps.

The new HMM code can potentially handle more general cases and uses C++ classes to simplify the extension to new cross types. I’ve also split out the forward and backward equations as separate functions, to reduce code duplication. And the basic calculations are separated by individual, which may allow for low-level parallel processing (i.e., concurrency). (I’m considered using RcppParallel for this.)

The new code for calculating conditional genotype probabilities is about the same as that in R/qtl. The new code for estimating the genetic map, on the other hand, is considerably slower than that in R/qtl. The present document provides detailed benchmarks on this point.

Genotype probabilities

First let’s look at the calculation of conditional genotype probabilities (given multiple genetic marker genotypes). We’ll consider simulated data, first of a backcross with 500 individuals (250 males and 250 females) and with markers at a 1 cM spacing on 20 chromosomes (including the X chromosome), and 5% missing data. We use qtl::sim.map() and qtl:sim.cross() to simulate the data.

set.seed(8440561)
library(qtl)
data(map10)
chrL <- round(summary(map10)$length[1:20])
map <- sim.map(len=chrL, n.mar=chrL+1, include.x=TRUE, eq.spacing=TRUE)
n <- 500
bc <- sim.cross(map, n.ind=n, type="bc", missing.prob=0.05)
bc$pheno$sex <- rep(0:1, each=n/2)

There’s a total of 1685 markers.

We use qtl2geno::convert2cross2() to convert the data to the format for R/qtl2. Also, with R/qtl2, I need to use insert_pseudomarkers to create a map with pseudomarkers inserted, prior to running calc_genoprob.

library(qtl2geno)
bc2 <- convert2cross2(bc)
bc2_pmarkmap <- qtl2geno::insert_pseudomarkers(bc2$gmap, step=0.5)

We use the microbenchmark package and run each of qtl::calc.genoprob() and qtl2geno::calc_genoprob2() 25 times, with calculations at a 0.5 cM spacing. Even though the data were simulated without genotyping errors, I’m going to do the calculations assuming a 0.2% genotype error rate.

library(microbenchmark)
probbc <- microbenchmark(qtl =  qtl::calc.genoprob(bc, step=0.5, error.prob=0.002),
                         qtl2 = qtl2geno::calc_genoprob(bc2, bc2_pmarkmap, error_prob=0.002),
                         times=25)
print(probbc, unit="s")
## Unit: seconds
##  expr       min       lq      mean    median        uq      max neval cld
##   qtl 1.0420229 1.049512 1.0897630 1.0588333 1.0749538 1.583892    25   b
##  qtl2 0.8447251 0.860328 0.8879964 0.8666601 0.8758329 1.076109    25  a

Let’s do the same for an intercross, also including the X chromosome, and with crosses in both directions and with both sexes. Again, with R/qtl2, I need to use insert_pseudomarkers to create a map with pseudomarkers inserted, prior to running calc_genoprob.

f2 <- sim.cross(map, n.ind=n, type="f2", missing.prob=0.05)
f2$pheno$sex <- rep(0:1, each=n/2)
f2$pheno$pgm <- rep(0:1, n/2)
f2_2 <- convert2cross2(f2)
f2_2_pmarkmap <- qtl2geno::insert_pseudomarkers(f2_2$gmap, step=0.5)

Now for the benchmarks; we’ll do the calculations just 10 times.

probf2 <- microbenchmark(qtl =  qtl::calc.genoprob(f2, step=0.5, error.prob=0.002),
                         qtl2 = qtl2geno::calc_genoprob(f2_2, f2_2_pmarkmap, error_prob=0.002),
                         times=10)
print(probf2, unit="s")
## Unit: seconds
##  expr      min       lq     mean   median       uq      max neval cld
##   qtl 2.506977 2.638704 2.644074 2.643813 2.650559 2.781655    10   b
##  qtl2 1.896741 1.906769 1.925714 1.911452 1.918307 2.036555    10  a

Genetic map estimation

Now let’s consider the estimation of inter-marker distances in the genetic map, with the functions qtl::est.map and qtl2geno::est_map.

First the backcross; to speed things up, we’ll look at just chromosomes 1 and X.

bc <- bc[c(1, "X"),]
bc2 <- convert2cross2(bc)

Now for the benchmarks. We’ll just do 5 replicates.

mapbc <- microbenchmark(qtl =  qtl::est.map(bc, error.prob=0.002, tol=1e-6, maxit=10000),
                        qtl2 = qtl2geno::est_map(bc2, error_prob=0.002, tol=1e-6, maxit=10000),
                        times=5)
print(mapbc, unit="s")
## Unit: seconds
##  expr      min       lq     mean   median       uq      max neval cld
##   qtl 1.505558 1.512556 1.594503 1.514041 1.517071 1.923289     5  a
##  qtl2 1.832302 1.835156 1.869817 1.848742 1.906856 1.926029     5   b

Now to the intercross. Again, we’ll look at just chromosomes 1 and X, and we’ll perform just 5 replicates.

f2 <- f2[c(1, "X"),]
f2_2 <- convert2cross2(f2)
mapf2 <- microbenchmark(qtl =  qtl::est.map(f2, error.prob=0.002, tol=1e-6, maxit=10000),
                        qtl2 = qtl2geno::est_map(f2_2, error_prob=0.002, tol=1e-6, maxit=10000),
                        times=5)
print(mapf2, unit="s")
## Unit: seconds
##  expr      min       lq     mean   median       uq      max neval cld
##   qtl 4.209004 4.216213 4.220859 4.218208 4.227645 4.233225     5   b
##  qtl2 3.468959 3.478439 3.512348 3.478599 3.481871 3.653874     5  a

For the backcross, the new code takes 1.2 times longer; for the intercross, the new code takes 0.8 times longer.

Why the differences in speed?

I made a number of changes to the HMM code between R/qtl and R/qtl2.

  • Split out the forward/backward equations as separate functions (to avoid code duplication and to make the code potentially clearer).
  • Calculations are by individual, with an eye towards low-level parallel processing
  • Things are a bit more general. For example, the transition probabilities in the Markov chain can be completely different for each individual, depending on the cross_info data. Also, the estimation of recombination fractions is potentially cross-specific; this allows us to handle RIL with the same basic code as a backcross.
  • Using R-style matrix objects (via Rcpp) rather than low-level C-type arrays.

We could just rely on parallel processing to speed things up. (I’ve implemented parallel calculations at the level of chromosomes, using the parallel package, but we could make more efficient use of available processors by splitting things out at a lower level.) Other alternatives are to have separate implementations for different groups of crosses, or to identify the set of observed cross_info values and pre-calculate emission and transition probabilities for each.

It’s also possible that some basic aspects of my code could be improved, even keeping the overall design fixed. I’m new to C++ and not entirely sure that I’m doing things in the best way.

Session information

The following shows the R and package versions I was using.

devtools::session_info()
## Session info --------------------------------------------------------------------------------------
##  setting  value
##  version  R version 3.4.1 (2017-06-30)
##  system   x86_64, darwin15.6.0
##  ui       X11
##  language en_US.UTF-8
##  collate  en_US.UTF-8
##  tz       America/Chicago
##  date     2017-09-15
## Packages ------------------------------------------------------------------------------------------
##  package        * version date       source
##  assertthat       0.2.0   2017-04-11 CRAN (R 3.4.0)
##  backports        1.1.0   2017-05-22 CRAN (R 3.4.0)
##  base           * 3.4.1   2017-07-07 local
##  broman         * 0.65-4  2017-05-28 local
##  codetools        0.2-15  2016-10-05 CRAN (R 3.4.0)
##  colorspace       1.3-2   2016-12-14 CRAN (R 3.4.0)
##  compiler         3.4.1   2017-07-07 local
##  data.table       1.10.4  2017-02-01 CRAN (R 3.4.0)
##  datasets       * 3.4.1   2017-07-07 local
##  devtools         1.13.3  2017-08-02 CRAN (R 3.4.1)
##  digest           0.6.12  2017-01-27 CRAN (R 3.4.0)
##  evaluate         0.10.1  2017-06-24 CRAN (R 3.4.0)
##  ggplot2          2.2.1   2016-12-30 CRAN (R 3.4.0)
##  graphics       * 3.4.1   2017-07-07 local
##  grDevices      * 3.4.1   2017-07-07 local
##  grid             3.4.1   2017-07-07 local
##  gtable           0.2.0   2016-02-26 CRAN (R 3.4.0)
##  htmltools        0.3.6   2017-04-28 CRAN (R 3.4.0)
##  knitr            1.17    2017-08-10 CRAN (R 3.4.1)
##  lattice          0.20-35 2017-03-25 CRAN (R 3.4.0)
##  lazyeval         0.2.0   2016-06-12 CRAN (R 3.4.0)
##  magrittr       * 1.5     2014-11-22 CRAN (R 3.4.0)
##  MASS             7.3-47  2017-02-26 CRAN (R 3.4.0)
##  Matrix           1.2-11  2017-08-16 CRAN (R 3.4.1)
##  memoise          1.1.0   2017-04-21 CRAN (R 3.4.0)
##  methods        * 3.4.1   2017-07-07 local
##  microbenchmark * 1.4-2.1 2015-11-25 CRAN (R 3.4.0)
##  multcomp         1.4-6   2016-07-14 CRAN (R 3.4.0)
##  munsell          0.4.3   2016-02-13 CRAN (R 3.4.0)
##  mvtnorm          1.0-6   2017-03-02 CRAN (R 3.4.0)
##  parallel         3.4.1   2017-07-07 local
##  plyr             1.8.4   2016-06-08 CRAN (R 3.4.0)
##  qtl            * 1.42-2  2017-09-13 local
##  qtl2geno       * 0.5-32  2017-09-15 local
##  Rcpp             0.12.12 2017-07-15 CRAN (R 3.4.1)
##  rlang            0.1.2   2017-08-09 CRAN (R 3.4.1)
##  rmarkdown        1.6     2017-06-15 CRAN (R 3.4.0)
##  rprojroot        1.2     2017-01-16 CRAN (R 3.4.0)
##  sandwich         2.4-0   2017-07-26 CRAN (R 3.4.1)
##  scales           0.5.0   2017-08-24 CRAN (R 3.4.1)
##  splines          3.4.1   2017-07-07 local
##  stats          * 3.4.1   2017-07-07 local
##  stringi          1.1.5   2017-04-07 CRAN (R 3.4.0)
##  stringr          1.2.0   2017-02-18 CRAN (R 3.4.0)
##  survival         2.41-3  2017-04-04 CRAN (R 3.4.0)
##  TH.data          1.0-8   2017-01-23 CRAN (R 3.4.0)
##  tibble           1.3.4   2017-08-22 CRAN (R 3.4.1)
##  tools            3.4.1   2017-07-07 local
##  utils          * 3.4.1   2017-07-07 local
##  withr            2.0.0   2017-07-28 CRAN (R 3.4.1)
##  yaml             2.1.14  2016-11-12 CRAN (R 3.4.0)
##  zoo              1.8-0   2017-04-12 CRAN (R 3.4.0)