R/qtl2 includes a reimplementation of the basic hidden Markov model code for the calculation of conditional genotype probabilities (given the available multipoint marker genotype data) and the estimation of genetic maps.
The new HMM code can potentially handle more general cases and uses C++ classes to simplify the extension to new cross types. I’ve also split out the forward and backward equations as separate functions, to reduce code duplication. And the basic calculations are separated by individual, which may allow for low-level parallel processing (i.e., concurrency). (I’m considered using RcppParallel for this.)
The new code for calculating conditional genotype probabilities is about the same as that in R/qtl. The new code for estimating the genetic map, on the other hand, is considerably slower than that in R/qtl. The present document provides detailed benchmarks on this point.
First let’s look at the calculation of conditional genotype probabilities (given multiple genetic marker genotypes). We’ll consider simulated data, first of a backcross with 500 individuals (250 males and 250 females) and with markers at a 1 cM spacing on 20 chromosomes (including the X chromosome), and 5% missing data. We use qtl::sim.map()
and qtl:sim.cross()
to simulate the data.
set.seed(8440561)
library(qtl)
data(map10)
chrL <- round(summary(map10)$length[1:20])
map <- sim.map(len=chrL, n.mar=chrL+1, include.x=TRUE, eq.spacing=TRUE)
n <- 500
bc <- sim.cross(map, n.ind=n, type="bc", missing.prob=0.05)
bc$pheno$sex <- rep(0:1, each=n/2)
There’s a total of 1685 markers.
We use qtl2::convert2cross2()
to convert the data to the format for R/qtl2. Also, with R/qtl2, I need to use insert_pseudomarkers
to create a map with pseudomarkers inserted, prior to running calc_genoprob
.
library(qtl2)
bc2 <- convert2cross2(bc)
bc2_pmarkmap <- qtl2::insert_pseudomarkers(bc2$gmap, step=0.5)
We use the microbenchmark package and run each of qtl::calc.genoprob()
and qtl2::calc_genoprob2()
25 times, with calculations at a 0.5 cM spacing. Even though the data were simulated without genotyping errors, I’m going to do the calculations assuming a 0.2% genotype error rate.
library(microbenchmark)
probbc <- microbenchmark(qtl = qtl::calc.genoprob(bc, step=0.5, error.prob=0.002),
qtl2 = qtl2::calc_genoprob(bc2, bc2_pmarkmap, error_prob=0.002),
times=25)
print(probbc, unit="s")
## Unit: seconds
## expr min lq mean median uq max neval cld
## qtl 1.1485780 1.1534480 1.1867754 1.1569168 1.1590954 1.721926 25 b
## qtl2 0.9386523 0.9415078 0.9612191 0.9451787 0.9483334 1.164079 25 a
Let’s do the same for an intercross, also including the X chromosome, and with crosses in both directions and with both sexes. Again, with R/qtl2, I need to use insert_pseudomarkers
to create a map with pseudomarkers inserted, prior to running calc_genoprob
.
f2 <- sim.cross(map, n.ind=n, type="f2", missing.prob=0.05)
f2$pheno$sex <- rep(0:1, each=n/2)
f2$pheno$pgm <- rep(0:1, n/2)
f2_2 <- convert2cross2(f2)
f2_2_pmarkmap <- qtl2::insert_pseudomarkers(f2_2$gmap, step=0.5)
Now for the benchmarks; we’ll do the calculations just 10 times.
probf2 <- microbenchmark(qtl = qtl::calc.genoprob(f2, step=0.5, error.prob=0.002),
qtl2 = qtl2::calc_genoprob(f2_2, f2_2_pmarkmap, error_prob=0.002),
times=10)
print(probf2, unit="s")
## Unit: seconds
## expr min lq mean median uq max neval cld
## qtl 2.630252 2.634652 2.772639 2.809436 2.881451 2.926254 10 b
## qtl2 1.907427 1.920029 1.999916 1.928127 2.099477 2.218559 10 a
Now let’s consider the estimation of inter-marker distances in the genetic map, with the functions qtl::est.map
and qtl2::est_map
.
First the backcross; to speed things up, we’ll look at just chromosomes 1 and X.
bc <- bc[c(1, "X"),]
bc2 <- convert2cross2(bc)
Now for the benchmarks. We’ll just do 5 replicates.
mapbc <- microbenchmark(qtl = qtl::est.map(bc, error.prob=0.002, tol=1e-6, maxit=10000),
qtl2 = qtl2::est_map(bc2, error_prob=0.002, tol=1e-6, maxit=10000),
times=5)
print(mapbc, unit="s")
## Unit: seconds
## expr min lq mean median uq max neval cld
## qtl 1.492345 1.507568 1.557910 1.508575 1.511898 1.769164 5 a
## qtl2 1.889971 1.900069 1.921923 1.916461 1.944371 1.958744 5 b
Now to the intercross. Again, we’ll look at just chromosomes 1 and X, and we’ll perform just 5 replicates.
f2 <- f2[c(1, "X"),]
f2_2 <- convert2cross2(f2)
mapf2 <- microbenchmark(qtl = qtl::est.map(f2, error.prob=0.002, tol=1e-6, maxit=10000),
qtl2 = qtl2::est_map(f2_2, error_prob=0.002, tol=1e-6, maxit=10000),
times=5)
print(mapf2, unit="s")
## Unit: seconds
## expr min lq mean median uq max neval cld
## qtl 4.199172 4.200006 4.201417 4.200639 4.202758 4.204511 5 b
## qtl2 3.434648 3.507602 3.576080 3.578263 3.671013 3.688875 5 a
For the backcross, the new code takes 1.3 times longer; for the intercross, the new code takes 0.9 times longer.
I made a number of changes to the HMM code between R/qtl and R/qtl2.
cross_info
data. Also, the estimation of recombination fractions is potentially cross-specific; this allows us to handle RIL with the same basic code as a backcross.We could just rely on parallel processing to speed things up. (I’ve implemented parallel calculations at the level of chromosomes, using the parallel package, but we could make more efficient use of available processors by splitting things out at a lower level.) Other alternatives are to have separate implementations for different groups of crosses, or to identify the set of observed cross_info
values and pre-calculate emission and transition probabilities for each.
It’s also possible that some basic aspects of my code could be improved, even keeping the overall design fixed. I’m new to C++ and not entirely sure that I’m doing things in the best way.
The following shows the R and package versions I was using.
devtools::session_info()
## Session info --------------------------------------------------------------------------------------
## setting value
## version R version 3.4.3 (2017-11-30)
## system x86_64, darwin15.6.0
## ui X11
## language en_US.UTF-8
## collate en_US.UTF-8
## tz America/Chicago
## date 2018-01-05
## Packages ------------------------------------------------------------------------------------------
## package * version date source
## assertthat 0.2.0 2017-04-11 CRAN (R 3.4.0)
## backports 1.1.2 2017-12-13 CRAN (R 3.4.3)
## base * 3.4.3 2017-12-07 local
## bit 1.1-12 2014-04-09 CRAN (R 3.4.0)
## bit64 0.9-7 2017-05-08 CRAN (R 3.4.0)
## blob 1.1.0 2017-06-17 CRAN (R 3.4.0)
## broman * 0.67-4 2017-12-09 local
## codetools 0.2-15 2016-10-05 CRAN (R 3.4.0)
## colorspace 1.3-2 2016-12-14 CRAN (R 3.4.0)
## compiler 3.4.3 2017-12-07 local
## data.table 1.10.4-3 2017-10-27 CRAN (R 3.4.2)
## datasets * 3.4.3 2017-12-07 local
## DBI 0.7 2017-06-18 CRAN (R 3.4.0)
## devtools 1.13.4 2017-11-09 CRAN (R 3.4.2)
## digest 0.6.13 2017-12-14 CRAN (R 3.4.3)
## evaluate 0.10.1 2017-06-24 CRAN (R 3.4.0)
## ggplot2 2.2.1 2016-12-30 CRAN (R 3.4.0)
## graphics * 3.4.3 2017-12-07 local
## grDevices * 3.4.3 2017-12-07 local
## grid 3.4.3 2017-12-07 local
## gtable 0.2.0 2016-02-26 CRAN (R 3.4.0)
## htmltools 0.3.6 2017-04-28 CRAN (R 3.4.0)
## knitr 1.18 2017-12-27 CRAN (R 3.4.3)
## lattice 0.20-35 2017-03-25 CRAN (R 3.4.0)
## lazyeval 0.2.1 2017-10-29 CRAN (R 3.4.2)
## magrittr * 1.5 2014-11-22 CRAN (R 3.4.0)
## MASS 7.3-48 2017-12-25 CRAN (R 3.4.3)
## Matrix 1.2-12 2017-11-15 CRAN (R 3.4.2)
## memoise 1.1.0 2017-04-21 CRAN (R 3.4.0)
## methods * 3.4.3 2017-12-07 local
## microbenchmark * 1.4-2.1 2015-11-25 CRAN (R 3.4.0)
## multcomp 1.4-8 2017-11-08 CRAN (R 3.4.2)
## munsell 0.4.3 2016-02-13 CRAN (R 3.4.0)
## mvtnorm 1.0-6 2017-03-02 CRAN (R 3.4.0)
## parallel 3.4.3 2017-12-07 local
## pillar 1.0.1 2017-11-27 CRAN (R 3.4.3)
## plyr 1.8.4 2016-06-08 CRAN (R 3.4.0)
## qtl * 1.42-2 2017-09-13 local
## qtl2 * 0.7-7 2018-01-05 local
## Rcpp 0.12.14 2017-11-23 CRAN (R 3.4.3)
## rlang 0.1.6 2017-12-21 CRAN (R 3.4.3)
## rmarkdown 1.8 2017-11-17 CRAN (R 3.4.2)
## rprojroot 1.3-2 2018-01-03 CRAN (R 3.4.3)
## RSQLite 2.0 2017-06-19 CRAN (R 3.4.0)
## sandwich 2.4-0 2017-07-26 CRAN (R 3.4.1)
## scales 0.5.0 2017-08-24 CRAN (R 3.4.1)
## splines 3.4.3 2017-12-07 local
## stats * 3.4.3 2017-12-07 local
## stringi 1.1.6 2017-11-17 CRAN (R 3.4.2)
## stringr 1.2.0 2017-02-18 CRAN (R 3.4.0)
## survival 2.41-3 2017-04-04 CRAN (R 3.4.0)
## TH.data 1.0-8 2017-01-23 CRAN (R 3.4.0)
## tibble 1.4.1 2017-12-25 CRAN (R 3.4.3)
## tools 3.4.3 2017-12-07 local
## utils * 3.4.3 2017-12-07 local
## withr 2.1.1 2017-12-19 CRAN (R 3.4.3)
## yaml 2.1.16 2017-12-12 CRAN (R 3.4.3)
## zoo 1.8-0 2017-04-12 CRAN (R 3.4.0)