Linear mixed models (LMMs) have become widely used for dealing with population structure in human GWAS, and they’re becoming increasingly important for QTL mapping in model organisms, particularly for the analysis of advanced intercross lines (AIL), which often exhibit variation in the relationships among individuals.
In my efforts on R/qtl2, a reimplementation of R/qtl to better handle high-dimensional data and more complex cross designs, it was clear that I’d need to figure out LMMs. But while papers explaining how to fit LMMs seem quite explicit and clear, I’d never quite turned the corner to actually seeing how I’d implement it. In both reading papers and studying code (e.g., lme4), I’d be going along fine and then get completely lost part-way through.
But I now finally understand LMMs, or at least a particular, simple LMM, and I’ve been able to write an implementation: the R package lmmlite.
It seemed worthwhile to write down some of the details.
The model I want to fit is y = X b + e, where var(e) = sK + tI, where K is a known kinship matrix and I is the identity matrix. Think of y as a vector of phenotypes and X as a matrix of covariates. Let v = s+t be the residual variance, and let h = s/(s+t) = s/v be the heritability.
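To make the notation concrete, here’s a small R sketch that simulates data from this model. The kinship-like matrix K here is made up purely for illustration, and none of this is lmmlite code.

```r
# Simulate from y = Xb + e with var(e) = sK + tI  (illustration only)
library(MASS)   # for mvrnorm

n <- 100
X <- cbind(1, rnorm(n))   # intercept plus one covariate
b <- c(5, 0.5)            # fixed effects
s <- 2; t <- 1            # genetic and residual variance components

# a made-up kinship-like matrix: positive definite, 1's on the diagonal
Z <- matrix(rnorm(n*n), n, n)
K <- cov2cor(tcrossprod(Z)/n)

e <- mvrnorm(1, mu=rep(0, n), Sigma = s*K + t*diag(n))
y <- drop(X %*% b) + e
```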
First, a shout-out to Artem Tarasov, who wrote a series of blog posts walking through and explaining the source code for FaST-LMM and pylmm, and to Nick Furlotte, whose pylmm code is especially clear and easy to read. Only by reading their work did I come to understand these LMMs.
Back to the model fit:
- For a fixed value of the heritability, h, we have var(e) = v[hK + (1-h)I] = vV where V is known. And so we end up with a general least squares problem, which we can fit in order to estimate b and v.
- And actually, if you take the eigen decomposition of K, say K = UDU', it turns out that you can write hK + (1-h)I = hUDU' + (1-h)UU' = U[hD + (1-h)I]U'. That is, the eigenvectors of K are the same as the eigenvectors of hK + (1-h)I. And so if you pre-multiply y and X by U', you end up with a weighted least squares problem, which is way faster to fit than a general least squares problem.
- Having fit the weighted least squares problem to estimate b and v, you can then calculate the corresponding log likelihood (or, better, the restricted log likelihood, if you want to do REML).
- You’re then left with a one-dimensional optimization problem (optimizing the log likelihood over h), which you can solve by Brent’s method. (A sketch of the full calculation follows this list.)
- That’s it!
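Here’s a rough sketch of the whole procedure in plain R, just to be concrete. This is not the lmmlite code (the function and variable names are my own), it does plain maximum likelihood rather than REML, and it ignores edge cases like h near 1 with a singular kinship matrix.

```r
# Sketch: fit y = Xb + e with var(e) = v[hK + (1-h)I] by ML
fitLMM_sketch <- function(y, X, K)
{
    n <- length(y)
    e <- eigen(K, symmetric=TRUE)   # K = U D U'
    U <- e$vectors
    d <- e$values

    # pre-multiply y and X by U' (the "rotation")
    ys <- crossprod(U, y)
    Xs <- crossprod(U, X)

    # profile log likelihood for a fixed heritability h
    loglik <- function(h) {
        w <- 1/(h*d + (1-h))        # weights = 1/eigenvalues of hK + (1-h)I
        # weighted least squares estimate of b
        b <- solve(crossprod(Xs*w, Xs), crossprod(Xs*w, ys))
        rss <- sum(w * (ys - Xs %*% b)^2)
        v <- rss/n                  # MLE of the total variance
        -0.5*(n*log(2*pi) + n*log(v) + sum(log(h*d + (1-h))) + n)
    }

    # one-dimensional optimization over h (optimize() uses a Brent-type method)
    out <- optimize(loglik, interval=c(0, 1), maximum=TRUE)
    h <- out$maximum

    # re-derive b and v at the optimum
    w <- 1/(h*d + (1-h))
    b <- solve(crossprod(Xs*w, Xs), crossprod(Xs*w, ys))
    v <- sum(w * (ys - Xs %*% b)^2)/n

    list(hsq=h, beta=b, sigmasq=v, loglik=out$objective)
}
```

For REML, you’d add the usual correction terms to the log likelihood and use the residual degrees of freedom in place of n when estimating v.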
It seems quite obvious in retrospect. It’s a bit embarrassing that it’s taken me so long to come to this understanding.
In lmmlite, I implemented this algorithm (closely following the code in pylmm) twice: in plain R, and then in C++ (using RcppEigen, which is an interface to the Eigen C++ linear algebra library). The plain R code is a bit slower than pylmm; the C++ code is a bit faster. In the C++ code, almost all of the computation time is devoted to the eigen decomposition of the kinship matrix. Once that’s done, the rest is super quick.