Cleaning genotype data for diversity outbred mice

class: center, middle, inverse, title-slide

# Cleaning genotype data for <br/>diversity outbred mice
### <a href="https://kbroman.org">Karl Broman</a>
### Biostatistics & Medical Informatics<br/>University of Wisconsin–Madison<br/><br/><a href="https://kbroman.org"><code>kbroman.org</code></a><br/><a href="https://github.com/kbroman"><code>github.com/kbroman</code></a><br/><a href="https://twitter.com/kwbroman"><code><span class="citation">@kwbroman</span></code></a><br/>Slides: <a href="https://bit.ly/2018ctc"><code>bit.ly/2018ctc</code></a>

---

# Heterogeneous stock

![](index_files/figure-html/dofig-1.svg)

---

# Diversity outbred mouse data

- 500 DO mice

- GigaMUGA SNP arrays (114k SNPs)

- RNA-seq data on pancreatic islets

- Microbiome data (16S and shotgun sequencing)

- Protein and lipid measurements by mass spec

- Collaboration with Alan Attie, Gary Churchill, Brian Yandell,
  Josh Coon, Federico Rey, and many others

---

# Principles

What might have gone wrong?

How could it be revealed?

Also, just make a bunch of graphs.

If you see something weird, try to figure it out.

---

# Possible problems

- Sample duplicates

- Sample mix-ups

- Bad samples

- Bad markers

- Genotyping errors in founders

---
class: inverse, middle, center

# What to look at first?

---

# Missing data per sample

![](index_files/figure-html/plot_missing-1.svg)

---

# Missing data per sample

![](index_files/figure-html/missing_genotypes_labeled-1.svg)

---
class: inverse, middle, center

# Swapped sex labels

---

# Average SNP intensity on X and Y chr

![](index_files/figure-html/yint_vs_xint_all_probes-1.svg)

---

# Average SNP intensity on X and Y chr

![](index_files/figure-html/yint_vs_xint_selected_probes-1.svg)

---

# Average SNP intensity on X and Y chr

![](index_files/figure-html/yint_vs_xint_selected_probes_labeled-1.svg)

---

# Heterozygosity vs SNP intensity on X chr

![](index_files/figure-html/het_vs_xint-1.svg)

---

# Heterozygosity vs SNP intensity on X chr

![](index_files/figure-html/het_vs_xint_labeled-1.svg)

---
class: inverse, middle, center

# Sample duplicates

---

# Percent matching genotypes between pairs

![](index_files/figure-html/prop_match-1.png)

---

# Percent matching genotypes between pairs

![](index_files/figure-html/prop_match_labeled-1.png)
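
---

# Percent matching genotypes between pairs

A minimal sketch of how the proportion of matching genotypes might be computed for every pair of samples. The sample names, the genotype coding, and the helper names `prop_match` and `all_pairs` are illustrative assumptions, not the code behind the figures.

```python
from itertools import combinations

def prop_match(g1, g2):
    """Proportion of markers with identical calls, skipping markers
    where either sample has a missing call (None)."""
    both = [(a, b) for a, b in zip(g1, g2) if a is not None and b is not None]
    if not both:
        return float("nan")
    return sum(a == b for a, b in both) / len(both)

def all_pairs(genotypes):
    """Map each unordered pair of sample names to its proportion matching."""
    return {(s1, s2): prop_match(genotypes[s1], genotypes[s2])
            for s1, s2 in combinations(sorted(genotypes), 2)}

# Toy data (hypothetical): three samples, four markers, None = no call
genotypes = {
    "m1": ["AA", "AB", "BB", None],
    "m2": ["AA", "AB", "BB", "AA"],
    "m3": ["AB", "AB", "AA", "BB"],
}
matches = all_pairs(genotypes)
```

Pairs with a proportion near 1 (here `m1` and `m2`) are candidate duplicates.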

---
class: inverse, middle, center

# Sample mix-ups

---

# Sample mix-ups

![](index_files/figure-html/mixup_illustration_1-1.svg)

---

# Sample mix-ups

![](index_files/figure-html/mixup_illustration_2-1.svg)

---

# Sample mix-ups

![](index_files/figure-html/mixup_illustration_3-1.svg)

---

# Sample mix-ups

![](index_files/figure-html/mixup_illustration_4-1.svg)

---

# RNA-seq sample mix-ups: distance matrix

![](index_files/figure-html/gve_dist_image-1.png)

---

# RNA-seq sample mix-ups: min vs self distance

![](index_files/figure-html/gve_best_vs_self_unlabeled-1.svg)

---

# RNA-seq sample mix-ups: min vs self distance

![](index_files/figure-html/gve_best_vs_self_labeled-1.svg)
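
---

# RNA-seq sample mix-ups: min vs self distance

A rough sketch of the min-versus-self comparison, given any matrix of distances between samples in two data sets. The nested-dict layout, the sample labels, and the function name `best_vs_self` are assumptions made for illustration.

```python
def best_vs_self(dist):
    """dist[a][b] = distance between sample a in one data set and
    sample b in the other. Returns, for each a, a tuple of
    (self distance, minimum distance, label of the closest sample)."""
    out = {}
    for a, row in dist.items():
        best = min(row, key=row.get)
        out[a] = (row.get(a), row[best], best)
    return out

# Toy distances (hypothetical): s2 and s3 look swapped
dist = {
    "s1": {"s1": 0.10, "s2": 0.90, "s3": 0.80},
    "s2": {"s1": 0.85, "s2": 0.90, "s3": 0.15},
    "s3": {"s1": 0.80, "s2": 0.20, "s3": 0.95},
}
result = best_vs_self(dist)
```

A sample whose self distance is large while its minimum distance is small, with the minimum at a different label, is a likely mix-up.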

---

# RNA-seq sample mix-ups: detail

![](index_files/figure-html/rnaseq_problems_detail-1.svg)

---

# Microbiome data

![](index_files/figure-html/microbiome_illustration-1.svg)

---

# Sample mix-ups: Microbiome data

- Impute genotypes at all SNPs in DNA samples

- Map microbiome reads to mouse genome; find reads overlapping a SNP

- For each pair of samples (DNA + microbiome):

    - Focus on reads that overlap a SNP where that DNA sample is
      homozygous

    - Distance = proportion of reads where SNP allele doesn't match DNA
      sample's genotype
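
---

# Sample mix-ups: Microbiome data

The distance described on the previous slide can be sketched as follows for one (DNA, microbiome) pair. The `Read` record, the two-letter genotype coding, and the function name `mismatch_distance` are hypothetical; read mapping and genotype imputation are assumed to have been done already.

```python
from collections import namedtuple

# Hypothetical record: one microbiome read overlapping one SNP
Read = namedtuple("Read", ["snp", "allele"])

def mismatch_distance(reads, dna_genotypes):
    """Proportion of reads whose allele disagrees with the DNA sample's
    genotype, using only SNPs where the DNA sample is homozygous."""
    n_used = 0
    n_mismatch = 0
    for read in reads:
        geno = dna_genotypes.get(read.snp)
        if geno is None or geno[0] != geno[1]:
            continue  # SNP not genotyped, or DNA sample heterozygous
        n_used += 1
        if read.allele != geno[0]:
            n_mismatch += 1
    return n_mismatch / n_used if n_used else float("nan")

# Toy data (hypothetical): imputed genotypes for one DNA sample
dna = {"snp1": "AA", "snp2": "AB", "snp3": "BB"}
reads = [Read("snp1", "A"), Read("snp1", "B"), Read("snp2", "A"),
         Read("snp3", "B"), Read("snp4", "A")]
d = mismatch_distance(reads, dna)
```

In the toy data, three reads fall at homozygous SNPs and one of them mismatches, so the distance is 1/3.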

---

# Microbiome DO361 vs DNA DO361

---

# Microbiome DO360 vs DNA DO360

---

# Microbiome DO360 vs DNA DO370

---

# Microbiome mix-ups: min vs self distance

![](index_files/figure-html/microbiome_best_vs_self-1.svg)

---

# Microbiome mix-ups: min vs self distance

![](index_files/figure-html/microbiome_best_vs_self_labeled-1.svg)

---
class: inverse, middle, center

# Sample quality

---

# Missing data per sample

![](index_files/figure-html/missing_genotypes_labeled-1.svg)

---

# Array intensities

![](index_files/figure-html/array_intensities_densities-1.png)

---

# Allele frequencies, by individual

![](index_files/figure-html/genotype_freq-1.svg)

---

# Allele frequencies, by individual

![](index_files/figure-html/allele_freq_subset-1.svg)

---

# Allele frequencies, by individual

![](index_files/figure-html/allele_freq_subset_wdensity-1.svg)

---

# Genotype frequencies, by individual

![](index_files/figure-html/geno_freq_byind-1.svg)

---

# Genotype frequencies, by individual

![](index_files/figure-html/geno_freq_byind_wlabels-1.svg)

---

# Heterozygosities, by individual

![](index_files/figure-html/het_byind-1.svg)

---

# Genotype probabilities (one mouse on one chr)

![](index_files/figure-html/plot_genoprob-1.png)

---

# Genome reconstruction (one mouse)

![](index_files/figure-html/plot_onegeno-1.svg)

---

# Percent missing vs number of crossovers

![](index_files/figure-html/missing_v_nxo-1.svg)

---

# Percent missing vs number of crossovers

![](index_files/figure-html/missing_v_nxo_labeled-1.svg)

---

# Percent missing vs number of crossovers

![](index_files/figure-html/missing_v_nxo_subset-1.svg)

---

# Percent missing vs number of crossovers

![](index_files/figure-html/missing_v_nxo_subset_labeled-1.svg)

---

# Number of crossovers by generation

![](index_files/figure-html/nxo_by_wave-1.svg)

---

# Estimated percent of genotyping errors

![](index_files/figure-html/genotype_errors-1.svg)

---

# Estimated percent of genotyping errors

![](index_files/figure-html/genotype_errors_subset-1.svg)

---
class: inverse, middle, center

# Marker quality

---

# Proportion missing data

![](index_files/figure-html/hist_missing_bymar-1.png)

---

# Allele frequencies, by marker

![](index_files/figure-html/allele_freq_bymar-1.svg)

---

# Allele frequencies, by marker

![](index_files/figure-html/allele_freq_bymar_wdensity-1.svg)

---

# Genotype frequencies, by marker

![](index_files/figure-html/geno_freq_bymar-1.png)

---

# Heterozygosities, by marker

![](index_files/figure-html/het_bymar-1.svg)

---

# Genotyping error rates

![](index_files/figure-html/genotype_errors_bymar-1.png)

---

# Genotyping error rate vs percent missing

![](index_files/figure-html/error_vs_missing_bymar-1.png)

---

# Genotyping error rate vs percent missing

![](index_files/figure-html/error_vs_missing_bymar_logscale-1.png)

---

# Nice markers

![](index_files/figure-html/snpint_nice-1.svg)

---

# Crap markers

![](index_files/figure-html/snpint_crap-1.svg)

---

# More crap markers

![](index_files/figure-html/snpint_more_crap-1.svg)

---

# One bad blob

![](index_files/figure-html/snpint_onebadblob-1.svg)

---

# Wrong genomic coordinates

![](index_files/figure-html/snpint_wrong_location-1.svg)

---

# Puzzling no calls

![](index_files/figure-html/snpint_nocalls-1.svg)

---
class: inverse, middle, center

# Founder genotyping errors

---

# One founder missing

![](index_files/figure-html/snpint_foundermissing-1.svg)

---

# Another case with one founder missing

![](index_files/figure-html/snpint_foundermissing_again-1.svg)

---
class: compressed_bullets

# Summary

- Quality of results depends on quality of data

- Think about what might have gone wrong, and how it might be revealed

- Pulling out the bad samples is the most important thing

- Sex swaps: look at array intensities

- Look for sample duplicates, and if possible sample mix-ups

- Samples: missing data, array intensities, crossovers, errors

- Markers: lots of reasons for the bad ones

---
class: indent

# Acknowledgments

Alan Attie<br/>
Gary Churchill<br/>
Dan Gatti<br/>
Alexandra Lobo<br/>
Federico Rey<br/>
&#346;aunak Sen<br/>
Lindsay Traeger<br/>
Brian Yandell<br/><br/>
NIH/NIGMS, NIH/NIDDK

---
class: middle, indent

# &nbsp;

Slides: [`bit.ly/2018ctc`](https://bit.ly/2018ctc)
.cc0-badge[ [![CC0](cc-zero.svg)](https://creativecommons.org/publicdomain/zero/1.0) ]

[`kbroman.org`](https://kbroman.org)

[`github.com/kbroman`](https://github.com/kbroman)

[`@kwbroman`](https://twitter.com/kwbroman)