R/lmmlite is an R package for fitting linear mixed models in the context of genome-wide association studies (GWAS) and quantitative trait locus (QTL) mapping. At present, it is not intended for “production” use. I wrote this to try to get a better understanding of how to fit these models, and I’ve not yet put any effort into checking arguments and omitting cases with missing values.
The code closely follows Nick Furlotte’s pylmm.
The sort of model I want to fit is \(y = X\beta + \epsilon\) where \(\epsilon\) is multivariate normal with mean 0 and variance \(\sigma^2_g K + \sigma^2_e I\), where \(K\) is a known kinship matrix and \(I\) is the identity. In pylmm, the focus is on \(\sigma^2 = \sigma^2_g + \sigma^2_e\) and the heritability \(h^2 = \sigma^2_g / \sigma^2\).
We’ll focus here on the recla dataset that I include
with the package. This is data on a set of diversity outcross mice, from
Recla et
al. (2014) and Logan et
al. (2013). The data are originally from the QTL Archive/Mouse
Phenotype Database and I’ve also placed
them online in the format used by R/qtl2. I used qtl2geno to calculate
genotype probabilities from the MUGA SNP array data. The data include
261 mice, and form a list with three components: the estimated kinship
matrix, a matrix with 26 phenotypes, and a covariate matrix that
contains an intercept (all 1’s) and sex (0 for female and 1 for
male).
The package includes two implementations of the fit of linear mixed models (LMMs): plain R, and C++ (using the Rcpp and RcppEigen packages, the latter being an interface to the Eigen C++ linear algebra library).
R/lmmlite includes just four functions:
eigen_rotation(), getMLsoln(),
calcLL(), and fitLMM(). In each case, use
use_cpp=FALSE to use the plain R implementation and
use_cpp=TRUE (the default) to use the C++
implementation.
The first function, eigen_rotation(), calculates the
eigen decomposition of the kinship matrix and “rotates” the phenotypes
and covariates by pre-multiplying them with the tranpose of the matrix
of eigenvectors. This rotation greatly simplifies later calculations by
turning a general least squares problem into a weighted least squares
problem.
Care must be taken with missing values in the phenotypes and covariates. In practice, one must work with batches of phenotypes that have a common missing data pattern, and first remove individuals with missing values. Here, we’ll focus on the first phenotype.
The output is a list containing four components: Kva is
the vector of eigenvalues of the kinship matrix, Kva_t is
the transposed matrix of eigenvectors, y is the rotated
phenotype matrix, and X is the rotated covariate
matrix.
The next function getMLsoln() will estimates the
coefficients (\(\beta\)) and residual
variance (\(\sigma^2 = \sigma^2_g +
\sigma^2_e\)) for a given value of the heritability (\(h^2 = \sigma^2_g / \sigma^2\)). It takes,
as input, a single value for the heritability, and then
Kva, y, and X, as returned from
eigen_rotation(). A further argument reml,
indicates whether REML will later be used, in which case the log
determinant of the matrix \(X' W
X\), where \(W\) is a diagonal
matrix of weights, is calculated.
The result is a list containing beta and
sigmasq, with the residual sum of squares (RSS) and log
det(\(X' W X\)) included as
attributes.
The function calcLL() calculates the log likelihood for
a given value of the heritability. (It can also take a vector of
heritabilities.) It takes the same arguments as
getMLsoln(), and in practice one would probably not call
getMLsoln() directly.
We can plot the results as follows.
If you call calcLL() with a single heritability value,
the results include the estimates of beta and
sigmasq as attributes.
## [1] 1413.98691 -28.17931
## [1] 276158.5
The final function, fitLMM(), optimizes the log
likelihood to estimate the heritability. The result is a list containing
beta, sigmasq, hsq,
sigmasq_g, sigmasq_e, and
loglik.
## $beta
## [1] 1417.98330 -23.24218
##
## $sigmasq
## [1] 295340.5
##
## $hsq
## [1] 0.7642841
##
## $sigmasq_g
## [1] 225724.1
##
## $sigmasq_e
## [1] 69616.46
##
## $loglik
## [1] -2332.84
The fitLMM() function includes an argument
check_boundary; if TRUE (the default), the 0
and 1 boundaries are checked explicitly, in seeking to estimate the
heritability.