My goal with R/lmmlite is a light-weight implementation of linear mixed model fitting for QTL analysis, to further my understanding of how it's done.
I benefited greatly from Artem Tarasov’s series of blog posts that work through the details of FaST-LMM and pylmm. And I’m closely following Nick Furlotte’s pylmm code, which is remarkably clear and easy to follow.
The sort of model I want to fit is \(y = X\beta + \epsilon\) where \(\epsilon\) is multivariate normal with mean 0 and variance \(\sigma^2_g K + \sigma^2_e I\), where \(K\) is a known kinship matrix and \(I\) is the identity. In pylmm, the focus is on \(\sigma^2 = \sigma^2_g + \sigma^2_e\) and the heritability \(h^2 = \sigma^2_g / \sigma^2\).
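To make the model concrete, here's a minimal sketch of how one might simulate a phenotype from it in R. (simulate_lmm() is a hypothetical helper for illustration; it's not part of lmmlite or pylmm.)

library(MASS)   # for mvrnorm

# hypothetical helper: simulate y = X beta + epsilon, with
# epsilon ~ MVN(0, sigmasq_g * K + sigmasq_e * I)
simulate_lmm <- function(X, K, beta, sigmasq_g, sigmasq_e) {
    n <- nrow(K)
    V <- sigmasq_g * K + sigmasq_e * diag(n)   # the variance matrix above
    X %*% beta + MASS::mvrnorm(1, mu=rep(0, n), Sigma=V)
}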
To make sure that my code is correct, here I’ll compare the lmmlite results to those of the R package regress (which, as we’ll see, gives rather different results) and pylmm (and fortunately pylmm and lmmlite results do generally correspond).
The source for this vignette is at GitHub.
I'll focus on the recla dataset that I include with the package. This is data on a set of diversity outcross mice, from Recla et al. (2014) and Logan et al. (2013). The data are originally from the QTL Archive/Mouse Phenotype Database, and I've also placed them online in the format used by R/qtl2. I used R/qtl2 to calculate genotype probabilities from the MUGA SNP array data. The data include 261 mice and form a list with three components: the estimated kinship matrix, a matrix with 26 phenotypes, and a covariate matrix that contains an intercept (all 1's) and sex (0 for female, 1 for male).
library(lmmlite)
data(recla)
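A few quick sanity checks on the structure just described (these stopifnot() conditions are my own additions, to verify the dimensions and covariate coding):

# 261 mice, 26 phenotypes, intercept column of 1's, sex coded 0/1
stopifnot(nrow(recla$kinship) == 261,
          ncol(recla$pheno) == 26,
          all(recla$covar[,1] == 1),
          all(recla$covar[,2] %in% c(0, 1)))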
Let’s first run the lmmlite code and fit the LMM for each phenotype, by REML and maximum likelihood (ML). Currently, lmmlite doesn’t deal with missing data at all, so we need to do a bit of extra work: for each phenotype, we drop any individuals with missing phenotypes.
I’ll first create an object to contain the results.
n_phe <- ncol(recla$pheno)
lmmlite <- data.frame(index=rep(NA, 2*n_phe),
                      method=rep("", 2*n_phe),
                      hsq=rep(NA, 2*n_phe),
                      intercept=rep(NA, 2*n_phe),
                      sex=rep(NA, 2*n_phe),
                      sigmasq=rep(NA, 2*n_phe),
                      loglik=rep(NA, 2*n_phe), stringsAsFactors=FALSE)
Now the actual fitting. The first batch of code omits individuals with a missing phenotype. I then call eigen_rotation(), which does an eigen decomposition of the kinship matrix and "rotates" the phenotypes and covariates by the transpose of the matrix of eigenvectors. This turns what would be a generalized least squares problem into a simpler weighted least squares problem. Then I fit the model, first by REML and then by ML, and grab the key results.
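To make the idea concrete, here's a minimal sketch of the rotation in base R, using the first phenotype. (This mirrors the idea, not necessarily the package internals.)

# eigen decomposition K = U D U' of the kinship matrix
keep <- !is.na(recla$pheno[,1])
decomp <- eigen(recla$kinship[keep,keep], symmetric=TRUE)
# rotate the phenotype and covariates by U'
ystar <- crossprod(decomp$vectors, recla$pheno[keep,1])
Xstar <- crossprod(decomp$vectors, recla$covar[keep,])
# the rotated residuals have variance sigma^2 [hsq D + (1-hsq) I],
# which is diagonal, so generalized least squares becomes weighted
# least squares with weights 1/(hsq*d + 1-hsq) for eigenvalues d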
all_e <- vector("list", n_phe)
for(i in 1:n_phe) {
    keep <- !is.na(recla$pheno[,i])
    y <- recla$pheno[keep,i,drop=FALSE]
    x <- recla$covar[keep,]
    k <- recla$kinship[keep,keep]
    all_e[[i]] <- e <- eigen_rotation(k, y, x)

    res <- fitLMM(e$Kva, e$y, e$X, reml=TRUE)
    lmmlite[i*2-1,1] <- i
    lmmlite[i*2-1,2] <- "reml"
    lmmlite[i*2-1,-(1:2)] <- c(res$hsq, res$beta, res$sigmasq, res$loglik)

    res <- fitLMM(e$Kva, e$y, e$X, reml=FALSE)
    lmmlite[i*2,1] <- i
    lmmlite[i*2,2] <- "ml"
    lmmlite[i*2,-(1:2)] <- c(res$hsq, res$beta, res$sigmasq, res$loglik)
}
I’ll now run pylmm in the same way. First, I need to save the data to a set of CSV files.
for(i in c("kinship", "pheno", "covar"))
    write.table(recla[[i]], paste0(i, ".csv"), sep=",", quote=FALSE,
                row.names=FALSE, col.names=FALSE)
Now I can run pylmm, writing the output to "pylmm_results.csv".
# this chunk is python code
import lmm
import numpy

# load the datasets
kinship = numpy.genfromtxt('kinship.csv', delimiter=',')
pheno = numpy.genfromtxt('pheno.csv', delimiter=',')
covar = numpy.genfromtxt('covar.csv', delimiter=',')

# open file for writing
f = open('pylmm_results.csv', 'w')

# function to write one row of results to the file
def print_result(result, index, method):
    result_arr = [result[0], result[1][0,0], result[1][1,0], result[2][0,0], result[3]]
    for j in range(len(result_arr)):
        result_arr[j] = str(result_arr[j])
    f.write(','.join([str(index), method] + result_arr) + '\n')

# write header row
f.write('index,method,hsq,intercept,sex,sigmasq,loglik\n')

# loop over the columns in pheno;
# fit the LMM by REML and then by ML and write the results to the file
for i in range(pheno.shape[1]):
    result = lmm.LMM(pheno[:,[i]], kinship, X0=covar).fit(REML=True)
    print_result(result, i+1, 'reml')
    result = lmm.LMM(pheno[:,[i]], kinship, X0=covar).fit(REML=False)
    print_result(result, i+1, 'ml')
Now I’ll read the data (into R).
pylmm <- read.csv("pylmm_results.csv", as.is=TRUE)
The log likelihood values provided by lmmlite are shifted relative to the pylmm ones, because I dropped a few terms that depend only on the sample size.
For both REML and ML, the difference should be \(n[\log(2 \pi) + 1 - \log n]/2\). Let’s make the adjustment and plot the differences. (I’m going to suppress all of the code to make plots; you can see it in the source for this vignette.)
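The adjustment itself amounts to the following (a sketch; it relies on the pylmm rows lining up with the lmmlite ones, which they do, since both alternate REML/ML phenotype by phenotype):

# sample size for each phenotype, repeated for the REML and ML rows
n <- rep(sapply(all_e, function(e) length(e$Kva)), each=2)
loglik_diff <- (lmmlite$loglik - n*(log(2*pi) + 1 - log(n))/2) - pylmm$loglik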
The biggest difference in log likelihood, after the adjustment, is 0.08. For the larger discrepancies, lmmlite is giving a larger likelihood value.
The points come in pairs (blue for REML, and then pink for ML for the same phenotype). We can see that there are 8 phenotypes with non-negligible differences, and in each case the REML difference is larger than the ML one, and lmmlite is giving larger likelihood values than pylmm.
For completeness, let's look at the log likelihoods from lmmlite calculated at the heritability estimates from pylmm. I use the function calcLL, which takes a specific value (or vector of values) for the heritability.
rev_loglik <- rep(NA, 2*n_phe)
for(i in 1:n_phe) {
    e <- all_e[[i]]
    rev_loglik[i*2-1] <- calcLL(pylmm$hsq[i*2-1], e$Kva, e$y, e$X, reml=TRUE)
    rev_loglik[i*2] <- calcLL(pylmm$hsq[i*2], e$Kva, e$y, e$X, reml=FALSE)
}
Actually the results basically don’t change. It’s not even worth making a plot. The largest absolute change in log likelihood was 0.08, while the differences between lmmlite and pylmm log likelihoods are still as large as 5.6.
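For the record, the comparison above is just this:

# change in lmmlite's log likelihood when evaluated at pylmm's estimates
max(abs(rev_loglik - lmmlite$loglik))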
Let’s turn to the parameter estimates. First, the heritability. Here the results are quite similar (maximum absolute difference 0.01).
There seem to be very few differences except at the high end, but actually there are 18 points there, where the two heritability estimates differ by about 0.01. These are all cases where pylmm gives estimated heritability exactly 0.99. In 17 cases, lmmlite gives an estimate that is very close to 1; in the other case, lmmlite gives an estimate around 0.98. So I think this just has to do with the optimization near the boundary — that pylmm is not seeking to optimize the likelihood further when the initial heritability estimate is 0.99.
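Here's a small check of those boundary cases:

# phenotype/method pairs where pylmm's estimate sits at the 0.99 boundary
boundary <- (pylmm$hsq == 0.99)
sum(boundary)                 # 18, per the discussion above
range(lmmlite$hsq[boundary])  # the corresponding lmmlite estimates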
Here's a plot of the differences by index, followed by a repeat of the plot of the differences in log likelihood. You can see that the same set of 8 phenotypes shows both the small differences in estimated heritability and the differences in log likelihood, plus there's another phenotype with a small difference in estimated heritability but a negligible difference in log likelihood.
The other parameters show similar largely negligible differences. Here’s a plot of the differences of the intercept and sex coefficients, against the estimate from pylmm.
Finally, let’s look at the estimates of \(\sigma^2 = \sigma^2_g + \sigma^2_e\). The phenotypes have huge differences in scale, so we’ll look at the differences as a percent of the estimate from pylmm, and we’ll put the x-axis on the log scale.
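The quantity plotted is just the following (the plot code itself is suppressed, as before):

# percent difference in the sigma^2 estimates, relative to pylmm
pct_diff <- (lmmlite$sigmasq - pylmm$sigmasq)/pylmm$sigmasq * 100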
The biggest difference is 0.4%. And the larger differences are for exactly the phenotypes we'd expect. Here's the percent difference in \(\hat{\sigma}^2\) by index, with the differences in estimated heritabilities for reference.
Originally, I saw bigger differences between the two programs in the estimates of \(\sigma^2\) with ML, but that was because I was using \(1/n\) while pylmm uses \(1/(n-p)\), where \(p\) is the number of covariates. I've revised the code to match pylmm.
It seems that pylmm and lmmlite are giving the same results, with the only difference being the optimization of the heritability. To check this, I’ll calculate the log likelihood for \(h^2\) at a fine grid.
First, the calculations with lmmlite. I’ll include the log likelihood adjusted to match pylmm in the results.
hsq_v <- seq(0, 1, by=0.01)
lmmlite_ll <- NULL
for(i in 1:n_phe) {
    e <- all_e[[i]]
    n <- length(e$Kva)

    res <- calcLL(hsq_v, e$Kva, e$y, e$X, reml=TRUE)
    lmmlite_ll <- rbind(lmmlite_ll,
                        data.frame(index=i, method="reml",
                                   hsq=hsq_v,
                                   loglik=res,
                                   loglik_adj=res - n*(log(2*pi) + 1 - log(n))/2,
                                   stringsAsFactors=FALSE))

    res <- calcLL(hsq_v, e$Kva, e$y, e$X, reml=FALSE)
    lmmlite_ll <- rbind(lmmlite_ll,
                        data.frame(index=i, method="ml",
                                   hsq=hsq_v,
                                   loglik=res,
                                   loglik_adj=res - n*(log(2*pi) + 1 - log(n))/2,
                                   stringsAsFactors=FALSE))
}
Now I’ll do the same calculations with pylmm:
# this chunk is python code
import lmm
import numpy

# load the datasets
kinship = numpy.genfromtxt('kinship.csv', delimiter=',')
pheno = numpy.genfromtxt('pheno.csv', delimiter=',')
covar = numpy.genfromtxt('covar.csv', delimiter=',')

# open file for writing
f = open('pylmm_llvals.csv', 'w')

# values 0, 0.01, 0.02, ..., 1.0
# (note the 1.0, to force floating-point division under Python 2)
n_val = 101
hsq_vec = [x/(n_val-1.0) for x in range(n_val)]

# write header
f.write('index,method,hsq,loglik\n')

# for each phenotype, calculate the log likelihood at each value in
# hsq_vec, by REML and then by ML, and write the results to the file
for i in range(pheno.shape[1]):
    obj = lmm.LMM(pheno[:,[i]], kinship, X0=covar)
    ll = [obj.LL(hsq, REML=True)[0] for hsq in hsq_vec]
    for j in range(n_val):
        f.write(",".join([str(i+1), "reml", str(hsq_vec[j]), str(ll[j])]) + '\n')
    ll = [obj.LL(hsq, REML=False)[0] for hsq in hsq_vec]
    for j in range(n_val):
        f.write(",".join([str(i+1), "ml", str(hsq_vec[j]), str(ll[j])]) + '\n')
Now read those results into R:
pylmm_ll <- read.csv("pylmm_llvals.csv")
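The comparison is then just this (the rows of the two data frames line up, since both were built phenotype by phenotype, REML grid then ML grid):

# maximum absolute difference across the whole grid, after adjusting
# the lmmlite values to match pylmm
max(abs(lmmlite_ll$loglik_adj - pylmm_ll$loglik))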
The results are equivalent; the maximum difference in log likelihoods is \(10^{-8}\).
Here are plots of a few of them, with lmmlite in blue and pylmm in red (dashed):
Conclusion: lmmlite and pylmm are giving the same results, except for small differences when the estimated heritability is near 1 and pylmm gives an estimate of exactly 0.99. This difference appears to be due to pylmm’s scheme for the one-dimensional optimization of the heritability parameter.
The R package regress is another option for fitting this sort of model, and it's super easy to use, so it deserves a comparison. The package has a single function exposed to the user, regress(). It takes two formulas: the first for the phenotype mean and the second for the variance. In the second formula, the inclusion of the identity matrix is assumed. (An advantage of the regress package is that you can use multiple kinship-type matrices.)
The default is to use REML; to use ML, give kernel=0. At least that's what the documentation seems to say. The argument pos can be used to constrain the variance components to be positive, which is what we want. (Thanks to Petr Simecek for pointing this out; initially I was doing this unconstrained and having a lot of problems, including errors.)
library(regress)
regr <- data.frame(index=rep(NA, 2*n_phe),
                   method=rep("", 2*n_phe),
                   hsq=rep(NA, 2*n_phe),
                   intercept=rep(NA, 2*n_phe),
                   sex=rep(NA, 2*n_phe),
                   sigmasq=rep(NA, 2*n_phe),
                   loglik=rep(NA, 2*n_phe), stringsAsFactors=FALSE)
recla$pheno <- scale(recla$pheno)
for(i in 1:ncol(recla$pheno)) {
    out <- regress(recla$pheno[,i] ~ recla$covar, ~ recla$kinship,
                   pos=c(TRUE, TRUE), tol=1e-8)
    regr[i*2-1,1] <- i
    regr[i*2-1,2] <- "reml"
    sigmasq <- sum(out$sigma)
    regr[i*2-1,-(1:2)] <- c(out$sigma[1]/sigmasq, out$beta[,1], sigmasq, out$llik)

    out <- regress(recla$pheno[,i] ~ recla$covar, ~ recla$kinship, kernel=0,
                   pos=c(TRUE, TRUE), tol=1e-8)
    regr[i*2,1] <- i
    regr[i*2,2] <- "ml"
    sigmasq <- sum(out$sigma)
    regr[i*2,-(1:2)] <- c(out$sigma[1]/sigmasq, out$beta[,1], sigmasq, out$llik)
}
I had a lot of problems with errors and found that it helped to standardize the phenotypes to have mean 0 and SD 1. But even after that, one of the phenotypes gives an error with regress(), so I just skipped it.
Because I’ve scaled the phenotypes, I need to re-run lmmlite. Fortunately it’s a lot faster.
all_e <- vector("list", n_phe)
for(i in 1:ncol(recla$pheno)) {
    keep <- !is.na(recla$pheno[,i])
    y <- recla$pheno[keep,i,drop=FALSE]
    x <- recla$covar[keep,]
    k <- recla$kinship[keep,keep]
    all_e[[i]] <- e <- eigen_rotation(k, y, x)

    res <- fitLMM(e$Kva, e$y, e$X, reml=TRUE)
    lmmlite[i*2-1,1] <- i
    lmmlite[i*2-1,2] <- "reml"
    lmmlite[i*2-1,-(1:2)] <- c(res$hsq, res$beta, res$sigmasq, res$loglik)

    res <- fitLMM(e$Kva, e$y, e$X, reml=FALSE)
    lmmlite[i*2,1] <- i
    lmmlite[i*2,2] <- "ml"
    lmmlite[i*2,-(1:2)] <- c(res$hsq, res$beta, res$sigmasq, res$loglik)
}
The results are not exactly the same, but they are pretty close. Here's a quick numerical summary, followed by some plots.
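(This is a sketch; the na.rm=TRUE assumes the phenotype that regress() couldn't fit was simply left as NA in regr.)

# largest discrepancies between regress and lmmlite, on the scaled phenotypes
max(abs(regr$hsq - lmmlite$hsq), na.rm=TRUE)
max(abs(regr$sigmasq - lmmlite$sigmasq), na.rm=TRUE)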
devtools::session_info()
## Session info --------------------------------------------------------------
## setting value
## version R version 3.2.2 (2015-08-14)
## system x86_64, darwin13.4.0
## ui X11
## language en_US.UTF-8
## collate en_US.UTF-8
## tz America/Chicago
## date 2015-11-24
## Packages ------------------------------------------------------------------
## package * version date source
## assertthat 0.1 2013-12-06 CRAN (R 3.2.0)
## broman * 0.61-3 2015-11-22 local
## devtools 1.9.1 2015-09-11 CRAN (R 3.2.0)
## digest 0.6.8 2014-12-31 CRAN (R 3.2.0)
## evaluate 0.8 2015-09-18 CRAN (R 3.2.0)
## formatR 1.2.1 2015-09-18 CRAN (R 3.2.0)
## htmltools 0.2.6 2014-09-08 CRAN (R 3.2.0)
## knitr 1.11.16 2015-11-16 Github (yihui/knitr@6e8ce0c)
## lmmlite * 0.1-9 2015-11-24 local
## magrittr 1.5 2014-11-22 CRAN (R 3.2.0)
## memoise 0.2.1 2014-04-22 CRAN (R 3.2.0)
## Rcpp 0.12.2 2015-11-15 CRAN (R 3.2.2)
## rmarkdown 0.8.1 2015-10-10 CRAN (R 3.2.2)
## stringi 1.0-1 2015-10-22 CRAN (R 3.2.0)
## stringr 1.0.0 2015-04-30 CRAN (R 3.2.0)
## yaml 2.1.13 2014-06-12 CRAN (R 3.2.0)