Introduction

Sample mix-ups, mislabelings, or contamination are hard to avoid, and they interfere with our ability to identify genotype:phenotype associations. However with eQTL data, and other high-throughput genomic data, there is often the opportunity to identify and correct such mix-ups. For example, see Westra et al. (2011) and Broman et al. (2015).

Here, we seek to illustrate our approach for identifying sample mix-ups in Diversity Outbred (DO) mouse data. We will consider data from Chick et al. (2016), on a set of DO mice with genotype data from the MegaMUGA array, liver gene expression data by RNA-seq, and liver proteomics data from mass spectrometry. The data are available via the Churchill lab qtlviewer site; see the Svenson DO High Fat Diet viewer, or download the data directly (as a .RData file) from https://churchilllab.jax.org/qtlviewer/svenson/DOHFD/rdata.

Raw RNA-seq data are also available from GEO GSE72759, including processed gene expression data with genotypes. Proteomics data is available at Proteomexchange, PXD002801, though not in a form that is easy to deal with. Genotype and clinical phenotype data are also available at DODB.

The basic approach we take is similar to Figure 1 of Broman et al. (2015).

Consider two large datasets whose rows are supposed to correspond (panel A). We look at the association between column pairs to find pairs that are highly associated (panel B). We then reduce our focus to just those columns (panel C) and measure the similarity between rows, creating a similarity matrix (panel D).

The two datasets could be genotypes and gene expression, or they could be gene expression in two different tissues, or they could be gene expression and protein levels. The key thing is that they contain strong association signals that allows us to distinguish rows that belong to the same individual from rows that belong to different individuals. And our basic approach is to first pull out the columns with strong association and toss the rest away, and then look at similarity of rows in that subset of columns.

Preparations

We will first load a set of R packages. We use R/qtl2 for the QTL analysis, R/lineup2 for the sample mix-up analysis, R/broman for some general utilities and graphs, and the DT package for dynamic tables.

library(qtl2)
library(lineup2)
library(broman)
library(DT)

Next we will download the data file from https://churchilllab.jax.org/qtlviewer/svenson/DOHFD/rdata. It’s a .RData file that is about 3.1 GB, so this will take a while.

url <- "https://churchilllab.jax.org/qtlviewer/svenson/DOHFD/rdata"
file <- "mixups_cache/Svenson_DO850_for_eQTL_viewer_v9.RData"
if(!file.exists(file)) {
    download.file(url, file)
}

We can load the data with load(). This will also take a while.

load(file)

The file contains a number of objects, including the following:

  • dataset.mrna - a list containing the mRNA expression data (478 mice)
  • dataset.protein - a list containing the proteomics data (192 mice)
  • genoprobs - genotype probabilities across the genome from calc_genoprob(), reduced to 8 allele dosages (835 mice)

The mRNA and protein datasets are lists containing a number of objects. Here are the key pieces of dataset.mrna:

  • annot.mrna - metadata about the mRNA traits (gene.id, symbol, chr, start, end, strand, middle, and nearest.marker.id)
  • covar.matrix - 478 × 6 matrix (including sexM, a 0/1 indicator with 1 = male)
  • data - a list with “norm”, “raw”, “rz”, each a 478 × 11,770 matrix
  • lod.peaks - a list with “additive”, “diet”, “sex_int”, each a data frame with gene.id, marker.id, and lod

The dataset.protein object is similar. Here are the key pieces:

  • annot.protein - metadata about the proteins (protein.id, gene.id, Ensembl.Transcript.ID, symbol, chr, start, end, Gene.Biotype, and Transcript.Biotype)
  • covar.matrix - 192 × 8 matrix (including sexM, a 0/1 indicator with 1 = male)
  • data - a matrix of size 192 × 6,740
  • lod.peaks - a list with “additive”, “diet”, “sex_int”, each a data frame with protein.id, marker.id, and lod

Expression vs. protein

We’ll start by comparing the gene expression data to the proteomics data.

Let’s pull out the protein and mRNA data as matrices. The 192 mice in the proteomics data are all among the 478 mice in the mRNA data.

protein <- dataset.protein$data
mrna <- dataset.mrna$data$norm
protein_id <- rownames(protein)
mrna_id <- rownames(mrna)

# all mice in protein data are also in mrna data
stopifnot(all(protein_id %in% mrna_id))

We want to verify that the rows in the proteomics data match their corresponding row in the mRNA data, but the variables (the columns) are different. There are two solutions to that: we can look at all pairs of variables (protein/mRNA), or we can find the corresponding pairs, where the protein and transcript are for the same gene. The latter approach will be much faster, so let’s start by looking at that.

We first find the corresponding pairs. The protein annotation data includes a gene.id column that corresponds to the column names in the mRNA data. We also subset the mRNA data to the mice for which we have proteomics data.

stopifnot(all( colnames(protein) == dataset.protein$annot.protein$protein.id))

# find the gene ids for each protein
protein_gene <- dataset.protein$annot.protein$gene.id

# subset protein and mrna to sets in common
protein_sub <- protein[, protein_gene %in% colnames(mrna)]
mrna_sub <- mrna[rownames(protein), protein_gene[protein_gene %in% colnames(mrna)]]

Of the 6740 proteins measured, 6509 have a corresponding gene in the mRNA data.

But we can’t just look at the correlations between the measurements across all genes, because it will be a sloppy mess. In any given tissue, most genes won’t be expressed and the proteins will be absent, and so the observed values will be uncorrelated noise. So we first seek to identify a subset of highly correlated protein/gene pairs.

Again, we’ll start by just looking at the protein/transcript pairs that correspond to the same gene, because we would expect these to be among the most-correlated, and the calculation is fast. We use the function corr_betw_matrices() in the R/lineup2 package. The third argument indicates that the columns of the two matrices are paired, and we just want the correlations between the paired columns.

cor_protein_mrna_paired <- corr_betw_matrices(protein_sub, mrna_sub, "paired", cores=0)

Here is a histogram of the correlations.

par(mar=c(5.1, 0.6, 0.6, 0.6))
hist(cor_protein_mrna_paired, breaks=seq(-0.6, 1, len=201), main="", yaxt="n", ylab="",
     xlab="Correlation between protein and mRNA (paired by gene)")

Alternatively, we could, for each protein, find the most-correlated gene, looking across all genes. We subset the mRNA data to the mice that have proteomics data, and then we use corr_betw_matrices() with the argument "bestright", which returns a data frame that gives, for each column in the first dataset, the most-correlated column in the second dataset, and the value of the maximum correlation.

mrna_subind <- mrna[rownames(protein),]
cor_protein_mrna <- corr_betw_matrices(protein, mrna_subind, "bestright", cores=0)

Here is a histogram of these correlations:

par(mar=c(5.1, 0.6, 0.6, 0.6))
hist(cor_protein_mrna$corr, breaks=seq(-0.6, 1, len=201), main="", yaxt="n", ylab="",
     xlab="Correlation between protein and mRNA (max for each protein)")

It turns out, though, that for these particular data, a lot of the correlated protein/transcript pairs will be ones with a large sex difference. This leads to gene/protein pairs that are highly correlated but not especially useful for identifying samples.

For our problem, of identifying sample mix-ups, it turns out to be better to focus on protein/transcript pairs that are highly correlated even after accounting for sex.

So we’ll choose gene/protein pairs by looking at the residual correlation after accounting for sex. To do so, we first “regress out” sex and then look at the correlations among the residuals.

protein_sex <- dataset.protein$covar.matrix[,"sexM"]
protein_adj <- protein
for(i in 1:ncol(protein)) {
    r <- lm(protein[,i] ~ protein_sex)$resid
    protein_adj[names(r), i] <- r
}

mrna_sex <- dataset.mrna$covar.matrix[,"sexM"]
mrna_adj <- mrna
for(i in 1:ncol(mrna)) {
    r <- lm(mrna[,i] ~ mrna_sex)$resid
    mrna_adj[names(r), i] <- r
}
mrna_adj_subind <- mrna_adj[rownames(mrna_subind),]

Now we calculate the correlations between these adjusted protein and mRNA measurements. Again, for each protein we find the most-correlated gene.

cor_protein_mrna_adj <- corr_betw_matrices(protein_adj, mrna_adj_subind, "bestright", cores=0)

Now we will use these to identify the top 100 protein/transcript pairs.

o <- order(cor_protein_mrna_adj[,1], decreasing=TRUE)[1:100]
protein_select <- rownames(cor_protein_mrna_adj)[o]
gene_select <- cor_protein_mrna_adj[o,"ycol"]

And then we will use these 100 selected pairs to calculate a similarity matrix for the samples. We’ll return to the measurements that haven’t been adjusted for the sex difference, and calculate the correlation between each row in the protein data and each row in the mRNA data, across these 100 pairs of proteins and transcripts. We use corr_betw_matrices() with the option "all" to get the correlation of each column in one matrix with each column in the other. We transpose each of the protein and mRNA datasets so that the columns correspond to the samples.

x <- t(protein[,protein_select])
y <- t(mrna[,gene_select])
sim_pve <- corr_betw_matrices(x, y, "all", align_rows=FALSE, cores=0)

The result is a 192 × 478 matrix of correlations. Here is a grayscale image with light being low (or negative) correlation and dark being high positive correlation. I’ll use image() with the transpose of the similarity matrix, so the rows are the protein samples and the columns are the mRNA samples.

par(mar=c(5.1, 4.1, 0.6, 0.6), las=1)
image(1:ncol(sim_pve), 1:nrow(sim_pve), t(sim_pve), col=revgray(),
      xlab="mRNA samples", ylab="Protein samples")

The black pixels, for the most part forming two diagonal lines, are the matching samples. There’s a rather complex pattern of associations in the background.

Here are histograms of the similarity values, broken into the protein/mRNA sample pairs with the same labels and those with different labels.

hist_self_nonself(sim_pve, xlabel="similarity")

You can see that the correct samples have protein/mRNA correlation around 0.6 – 0.8, while protein/mRNA correlation for two different mice are mostly -0.5 – 0.5. There are clearly some problems in the upper histogram, with mice having low correlation between their protein and mRNA samples, and the lower histogram shows some well-correlated sample pairs with different labels.

For each protein sample (rows in the similarity matrix), we can pull the mRNA sample with the highest correlation. It should be the one with the corresponding label, but it may not be. Let’s make a scatterplot of the maximum similarity vs the self similarity. We use the R/lineup2 function get_self() to pull out the values where the row and column labels are the same, and get_best() to find the optimal value in each row.

self <- get_self(sim_pve)
best <- get_best(sim_pve, get_min=FALSE)
par(mar=c(5.1,4.1,0.6,0.6))
grayplot(self, best, xlab="Self similarity", ylab="Maximum similarity")

# add labels
hilit <- names(self)[!is.na(self) & self < best]
right <- setNames(rep(TRUE, length(hilit)), hilit)
right[c("M404", "F401", "F423")] <- FALSE
for(ind in hilit) {
    text(self[ind]+0.01*(right[ind]*2-1), best[ind], ind,
         adj=c(1-right[ind], 0.5))
}

The samples along the diagonal line look to be correctly labeled: for these protein samples, the mRNA sample with the corresponding label is the one with the highest correlation. There are 21 samples that are off the diagonal: there is some other mRNA sample that is more-correlated that the corresponding one. M404 is near the diagonal so may just be noisy. M371 has low self similarity and so looks like there was a labeling mistake, but there is no mRNA sample that is very strongly correlated with its protein sample. The other 19 points look to be incorrectly labeled (low self-similarity) but there is another mRNA sample that is quite similar.

We can use the function get_problems() to pull out the problem samples. This shows, for each problem protein sample (where self similarity is less than optimal) which mRNA is the most closely matched sample, and also which is the 2nd-most optimal sample. We use the datatable() function in the DT package to create this dynamic table. Also note that get_problems() can be used to look by column rather than by row.

problems <- get_problems(sim_pve, get_min=FALSE)
formatRound(datatable( problems, rownames=FALSE, options=list(pageLength=10)),
            columns=c("self", "best", "next_best"),
            digits=2)

If you look through this carefully, you find that there are nine pairs of apparent sample swaps, plus two mislabeled samples and one more (M404) that is unclear.

For example, the first row shows that while the protein and mRNA samples labeled F371 are not at all similar (correlation -0.14), the protein sample F371 is very similar to the mRNA sample M349 (correlation 0.74). The next-most similar mRNA sample has considerably lower correlation. In the second row above, we see that the protein sample M349 is also not similar to the correspondingly labeled mRNA sample but is similar to the F371 mRNA sample, so this looks like a simple sample swap, though we can’t tell at this point whether it is the protein samples that were swapped or the mRNA samples.

The last row in the table shows the protein sample M404, for which the corresponding mRNA sample is only slightly less-correlated than the optimal one, and so it may not actually be mislabeled.

In the second-to-last row, the protein sample M371 is not well correlated with the corresponding mRNA sample, but there is no mRNA sample that is particularly well-correlated.

To further investigate the samples, we can plot the full row of correlations: a given protein sample compared to each of the mRNA samples. Let’s look at six samples:

par(mfrow=c(3,2), mar=c(5.1, 4.1, 2.1, 0.6))
for(sample in c("F374", "M372", "F371", "M349", "M371", "M404")) {
    y <- sim_pve[sample,]
    grayplot(y, xlab="mRNA sample", ylim=range(sim_pve),
             ylab=paste("Correlation with protein sample", sample),
             main=paste("Protein", sample))

    xd <- 4

    wh_max <- which.max(y)
    points(wh_max, y[wh_max], pch=21, col="violetred")
    text(wh_max-xd, y[wh_max], names(y)[wh_max], adj=c(1, 0.5))

    wh_self <- which(names(y) == sample)
    points(wh_self, y[wh_self], pch=21, col="green3")
    text(wh_self-xd, y[wh_self], sample, adj=c(1, 0.5))
}

The top two panels are for protein samples F374 and M372, which both seem to correspond to mRNA M372. The middle two panels are for the sample swap F371 and M349. The bottom-left panel, for protein sample M371, doesn’t seem to match any of the mRNA samples. The bottom-right panel, for protein sample M404, is also not well-correlated with any of the mRNA samples, though the mRNA sample with the corresponding label is near the top.

In summary, in this relation of protein samples to mRNA samples, we see:

  • nine sample swaps; not yet clear if in protein samples or mRNA samples
    • M348 ↔︎ M410
    • M349 ↔︎ F371
    • F349 ↔︎ F423
    • F353 ↔︎ F366
    • F354 ↔︎ M366
    • F379 ↔︎ F410
    • M379 ↔︎ F381
    • F386 ↔︎ F401
    • M393 ↔︎ M403
  • protein F374 looks like it should be labeled M372; M372 was in duplicate
  • protein M371 looks mislabelled but it’s not clear what it should be
  • protein M404 is questionable but probably okay

A number of the sample swaps are between sexes; the presence of proteins and genes with strong sex differences should help us to see these problems and potentially to determine if the problem was in the protein or the mRNA label. But we won’t pursue this further here.

Expression vs. genotypes

Let’s now compare the gene expression data to the DNA genotypes. We again want to select pairs of gene expression traits and genetic markers that are highly correlated, and then use this subset to investigate the association between mRNA samples and DNA samples.

So generally, the first task would be to run a genome scan for each expression trait, or to just focus on cis-eQTL by finding the nearest marker to each gene. But this dataset includes the set of eQTL with large LOD scores, so this can be pretty quick: the genes with strong eQTL have already been identified. We’ll focus on the top 100 eQTL.

n_traits <- 100
top_peaks_mrna <- dataset.mrna$lod.peaks$additive
top_peaks_mrna <- top_peaks_mrna[order(top_peaks_mrna$lod, decreasing=TRUE)[1:n_traits],]

The simplest way to relate the gene expression data to the corresponding eQTL genotypes is to compare the observed gene expression to the predicted expression values, given the observed genotypes (or really the genotype probabilities, reduced to allele dosages). This was the approach taken in Westra et al. (2011).

We first pull out the (normalized) gene expression data for these genes.

obs_mrna <- dataset.mrna$data$norm[,top_peaks_mrna$gene.id]

Now we want to get corresponding predicted expression, based on the genotype probabilities. At each eQTL, we’ll fit the single-QTL model and get estimated coefficients, and then we’ll use matrix multiplication to get the predicted expression.

We use pull_genoprobpos() to pull out a matrix of genotype probabilities (as allele dosages). Then we use the R/qtl2 function fit1(), with zerosum=FALSE so that the coefficients don’t get re-scaled to sum to 0. We could just as easily have used the base R function lm(); the main advantage of fit1() here, is that it aligns the samples using the rownames. We finally use matrix multiplication (%*%) to get fitted values. Note that while the observed mRNA expression data is for 478 mice, we have predicted expression for 835 mice, as there was genotype data on additional mice that were not assayed for expression.

exp_mrna <- NULL
for(i in 1:nrow(top_peaks_mrna)) {
    pr <- pull_genoprobpos(genoprobs, marker=top_peaks_mrna$marker.id[i])
    out_fit1 <- fit1(pr, obs_mrna[,i,drop=FALSE], zerosum=FALSE)
    fitted <- pr %*% out_fit1$coef
    if(is.null(exp_mrna)) {
        exp_mrna <- matrix(nrow=nrow(fitted), ncol=ncol(obs_mrna))
        dimnames(exp_mrna) <- list(rownames(pr), colnames(obs_mrna))
    }
    exp_mrna[rownames(fitted), i] <- fitted
}

While we could look at the correlation between the observed and predicted expression values, even better is to look at the RMS differences between the two, as a measure of distance. We use the R/lineup2 function dist_betw_matrices() which by default calculates these RMS differences; it can also calculate mean absolute differences.

d_evg <- dist_betw_matrices(obs_mrna, exp_mrna)

Here’s a plot of the self distance versus the minimum distance for each of the 478 mRNA samples. As it turns out, there are no cases where an mRNA sample does not seem to match the corresponding DNA sample. This suggests the sample mix-ups in the protein/mRNA comparison were entirely in the protein samples. (There may have been mix-ups in the mRNA samples and DNA samples that were previously corrected.)

par(mar=c(5.1,4.1,0.6,0.6))
grayplot(get_self(d_evg), get_best(d_evg), xlab="Self distance", ylab="Minimum distance")

Protein vs. genotypes

We will go through the same set of procedures to compare the proteomics data to the DNA genotype data. Again, the dataset includes the set of LOD peaks, so we don’t need to re-run any genome scans. We’ll focus on the top 100 pQTL.

n_traits <- 100
top_peaks_protein <- dataset.protein$lod.peaks$additive
top_peaks_protein <- top_peaks_protein[order(top_peaks_protein$lod, decreasing=TRUE)[1:n_traits],]

We pull out the observed protein levels for these traits.

obs_protein <- dataset.protein$data[,top_peaks_protein$protein.id]

Next we get corresponding predicted protein levels, based on the genotype probabilities. As with the mRNA expression, we consider each pQTL, fit a single-QTL model, get estimated coefficients, and then use matrix multiplication to get the predicted protein levels.

exp_protein <- NULL
for(i in 1:nrow(top_peaks_protein)) {
    pr <- pull_genoprobpos(genoprobs, marker=top_peaks_protein$marker.id[i])
    out_fit1 <- fit1(pr, obs_protein[,i,drop=FALSE], zerosum=FALSE)
    fitted <- pr %*% out_fit1$coef
    if(is.null(exp_protein)) {
        exp_protein <- matrix(nrow=nrow(fitted), ncol=ncol(obs_protein))
        dimnames(exp_protein) <- list(rownames(pr), colnames(obs_protein))
    }
    exp_protein[rownames(fitted), i] <- fitted
}

We’ll again use the RMS differences between the observed and predicted protein levels as a measure of distance.

d_pvg <- dist_betw_matrices(obs_protein, exp_protein)

Here is a plot of the self distance vs the minimum distance for each protein sample.

par(mar=c(5.1,4.1,0.6,0.6))
self_pvg <- get_self(d_pvg)
best_pvg <- get_best(d_pvg)
grayplot(self_pvg, best_pvg, xlab="Self distance", ylab="Minimum distance")

# add labels
hilit <- names(self_pvg)[!is.na(self_pvg) & self_pvg > best_pvg]
right <- setNames(rep(FALSE, length(hilit)), hilit)
right[c("F353", "M348", "F401", "F379")] <- TRUE
for(ind in hilit) {
    text(self_pvg[ind]+0.004*(right[ind]*2-1), best_pvg[ind], ind,
         adj=c(1-right[ind], 0.5))
}

There are 20 protein samples that are off the diagonal line: the RMS difference between the observed and predicted protein levels is large for the DNA sample with the same label as the protein sample, but there is another DNA sample that is close.

We can again use get_problems() to get a summary of the potential problems. These show exactly the same problems that we saw above when comparing mRNA and protein samples, with two exceptions: M404 is now along the diagonal (its DNA sample is the closest), and we can now identify the correct label for protein sample M371: it corresponds to DNA sample M338 (which wasn’t intended to be assayed for protein or expression).

problems <- get_problems(d_pvg)
formatRound(datatable( problems, rownames=FALSE, options=list(pageLength=10)),
            columns=c("self", "best", "next_best"),
            digits=2)

A key thing we learn here: the problems were all in the protein samples, not the mRNA samples.

Sample M404 looks okay, but it looks unusually like F330. It’s possible that it’s a mixture.

par(mar=c(5.1,4.1,2.1,0.6))
sample <- "M404"
alt <- "F330"
y <- d_pvg[sample,]
grayplot(y, xlab="DNA sample", ylim=c(0.7, 1.3),
         ylab=paste("Distance from protein sample", sample),
         main=paste("Protein sample", sample))

xd <- 4

wh_alt <- which(names(y) == alt)
points(wh_alt, y[wh_alt], pch=21, col="violetred")
text(wh_alt-xd, y[wh_alt], alt, adj=c(1, 0.5))

wh_self <- which(names(y) == sample)
points(wh_self, y[wh_self], pch=21, col="green3")
text(wh_self-xd, y[wh_self], sample, adj=c(1, 0.5))

Re-calculate LOD scores

Let’s look at how correcting the sample mix-ups changes the evidence for QTL. Rather than take the time to re-run the full genome scans, we will focus just on the LOD peaks that were previously observed. We expect that when correcting the labels of 20/192 protein samples, the LOD scores will generally increase.

Let’s first spot-check the results: pick five random rows from the lod.peaks results for the protein data, run fit1() at those positions, and check to see that we get the same LOD scores.

lod_peaks <- dataset.protein$lod.peaks$additive
test_peaks <- cbind(lod_peaks[sample(nrow(lod_peaks), 5),], lod_retest=NA)
for(row in 1:nrow(test_peaks)) {
    y <- dataset.protein$data[,test_peaks[row,1]]
    pr <- pull_genoprobpos(genoprobs, marker=test_peaks[row,2])
    chr <- find_markerpos(map, test_peaks[row,2])$chr
    test_peaks[row, "lod_retest"] <- fit1(pr, y, addcovar=dataset.protein$covar.matrix, kinship=K[[chr]])$lod
}
test_peaks
##              protein.id   marker.id      lod lod_retest
## 6306 ENSMUSP00000049056 12_84232755 6.151918   6.151918
## 6312 ENSMUSP00000041070  2_49650224 6.124543   6.124543
## 4170 ENSMUSP00000005600 12_54870415 6.037857   6.037857
## 788  ENSMUSP00000088365 2_145821112 6.828309   6.828309
## 8935 ENSMUSP00000108595 11_52377446 6.057183   6.057183

Having verified that we know what we’re doing, let’s create a revised version of the proteomics data with the row labels corrected. We will omit one mouse and change the labels on 19 others. This can be a bit tricky.

# make vector of IDs to swap and individuals to omit
problems <- get_problems(d_pvg)
swap_name <- setNames(problems$ind, problems$which_best)
omit <- "F374"
swap_name <- swap_name[!(swap_name %in% omit)]

# this is a really contorted way to correct the names
protein_rev <- dataset.protein$data
protein_rev <- protein_rev[!(rownames(protein_rev) %in% omit),]
id <- rownames(protein_rev)
names(id) <- id
id[swap_name] <- names(swap_name)
names(id) <- NULL
rownames(protein_rev) <- id

Let’s now go back and re-run the fit1 analyses for all of the LOD peaks, to get revised LOD scores when the protein samples have been relabeled.

lod_peaks <- cbind(dataset.protein$lod.peaks$additive, lod_rev=NA)
for(row in 1:nrow(lod_peaks)) {
    if(row==round(row,-2)) cat(row,"\n")
    y <- protein_rev[,lod_peaks[row,1]]
    pr <- pull_genoprobpos(genoprobs, marker=lod_peaks[row,2])
    chr <- find_markerpos(map, lod_peaks[row,2])$chr
    lod_peaks[row,"lod_rev"] <- fit1(pr, y, addcovar=dataset.protein$covar.matrix, kinship=K[[chr]])$lod
}

The LOD scores actually went down in 67% of cases. But it mostly went down for the LOD scores < 10. For LOD scores ≥ 10, it went down just 10% of the time.

Here’s a plot of the change in LOD (revised vs original) versus the original LOD score.

par(mar=c(5.1, 4.1, 0.6, 0.6))
grayplot(lod_peaks$lod, lod_peaks$lod_rev - lod_peaks$lod, ylim=c(-10, 30),
         xlab="Original LOD score", ylab="Change in LOD (revised - original)",
         hlines=c(-10, 0, 10, 20, 30), hlines.col=c("white", "violetred", "white", "white", "white"))

The bulk of the smaller LOD scores went down, but the larger LOD scores generally went up, and many went up by a lot. The biggest change was for ENSMUSP00000134570, at 17_35404132, where the LOD score increased from 26.4 to 56.7.

Summary

Sample mix-ups can weaken QTL studies by reducing genotype:phenotype correlations. And with traditional phenotypes, we are generally blind to such mix-ups. But with high-throughput phenotype data with strong genetic effects, such as genome-wide gene expression and proteomics data, we may be able to detect and even correct sample mix-ups.

We have illustrated our approach to detecting sample mix-ups in Diversity Outbred mouse data using the genotype, transcript, and protein data from Chick et al. (2016). There were 835 genotyped mice, including 478 mice with mRNA expression data and 192 mice with proteomics data. We identified a set of 20 sample mix-ups, including nine simple label swaps and two other mislabeled samples. All of the mix-ups were in the proteomics data.

Our basic strategy is to work with a pair of datasets at a time, identify pairs of highly correlated columns, and then look at the correlation or RMS distance between rows, among the selected columns. With gene expression and protein data, we look at the correlation. With gene expression and genotype or with protein and genotype, we first get predicted gene expression or protein levels and then calculate the RMS difference between the observed and predicted values. The R/lineup2 package provides utilities for calculating and summarizing these correlations and distances.


Source document: do_mixups.Rmd