class: center, middle, inverse, title-slide

# Cleaning genotype data for diversity outbred mice

### Karl Broman
### Biostatistics & Medical Informatics
University of Wisconsin–Madison
kbroman.org
github.com/kbroman
@kwbroman
Slides:
bit.ly/2018ctc
---

# Heterogeneous stock

---

# Diversity outbred mouse data

- 500 DO mice
- GigaMUGA SNP arrays (114k SNPs)
- RNA-seq data on pancreatic islets
- Microbiome data (16S and shotgun sequencing)
- protein and lipid measurements by mass spec
- Collaboration with Alan Attie, Gary Churchill, Brian Yandell, Josh Coon, Federico Rey, and many others

---

# Principles

What might have gone wrong?

How could it be revealed?

--

Also, just make a bunch of graphs.

--

If you see something weird, try to figure it out.

---

# Possible problems

- Sample duplicates
- Sample mix-ups
- Bad samples
- Bad markers
- Genotyping errors in founders

---
class: inverse, middle, center

# What to look at first?

---

# Missing data per sample

---

# Missing data per sample

---
class: inverse, middle, center

# Swapped sex labels

---

# Average SNP intensity on X and Y chr

---

# Average SNP intensity on X and Y chr

---

# Average SNP intensity on X and Y chr

---

# Heterozygosity vs SNP intensity on X chr

---

# Heterozygosity vs SNP intensity on X chr

---
class: inverse, middle, center

# Sample duplicates

---

# Percent matching genotypes between pairs

---

# Percent matching genotypes between pairs

---
class: inverse, middle, center

# Sample mix-ups

---

# Sample mix-ups

---

# Sample mix-ups

---

# Sample mix-ups

---

# Sample mix-ups

---

# RNA-seq sample mix-ups: distance matrix

---

# RNA-seq sample mix-ups: min vs self distance

---

# RNA-seq sample mix-ups: min vs self distance

---

# RNA-seq sample mix-ups: detail

---

# Microbiome data

---

# Sample mix-ups: Microbiome data

- Impute genotypes at all SNPs in DNA samples
- Map microbiome reads to mouse genome; find reads overlapping a SNP
- For each pair of samples (DNA + microbiome):
  - Focus on reads that overlap a SNP where that DNA sample is homozygous
  - Distance = proportion of reads where SNP allele doesn't match DNA sample's genotype

---

# Microbiome DO361 vs DNA DO361

<table class="table table-striped" style="font-size: 48px; margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:left;"> </th>
   <th style="text-align:right;"> AA </th>
   <th style="text-align:right;"> BB </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> A </td>
   <td style="text-align:right;"> 939,918 </td>
   <td style="text-align:right;"> 1,044 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> B </td>
   <td style="text-align:right;"> 2,998 </td>
   <td style="text-align:right;"> 125,962 </td>
  </tr>
</tbody>
</table>

---

# Microbiome DO360 vs DNA DO360

<table class="table table-striped" style="font-size: 48px; margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:left;"> </th>
   <th style="text-align:right;"> AA </th>
   <th style="text-align:right;"> BB </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> A </td>
   <td style="text-align:right;"> 2,661,645 </td>
   <td style="text-align:right;"> 190,188 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> B </td>
   <td style="text-align:right;"> 427,685 </td>
   <td style="text-align:right;"> 202,335 </td>
  </tr>
</tbody>
</table>

---

# Microbiome DO360 vs DNA DO370

<table class="table table-striped" style="font-size: 48px; margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:left;"> </th>
   <th style="text-align:right;"> AA </th>
   <th style="text-align:right;"> BB </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> A </td>
   <td style="text-align:right;"> 3,137,751 </td>
   <td style="text-align:right;"> 1,475 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> B </td>
   <td style="text-align:right;"> 7,461 </td>
   <td style="text-align:right;"> 310,369 </td>
  </tr>
</tbody>
</table>

---

# Microbiome mix-ups: min vs self distance

---

# Microbiome mix-ups: min vs self distance

---
class: inverse, middle, center

# Sample quality

---

# Missing data per sample

---

# Array intensities

---

# Allele frequencies, by individual

---

# Allele frequencies, by individual

---

# Allele frequencies, by individual

---

# Genotype frequencies, by individual

---

# Genotype frequencies, by individual

---

# Heterozygosities, by individual

---

# Genotype probabilities (one mouse on one chr)

---

# Genome reconstruction (one mouse)

---

# Percent missing vs number of crossovers

---

# Percent missing vs number of crossovers

---

# Percent missing vs number of crossovers

---

# Percent missing vs number of crossovers

---

# No. crossovers by generation

---

# Estimated percent of genotyping errors

---

# Estimated percent of genotyping errors

---
class: inverse, middle, center

# Marker quality

---

# Proportion missing data

---

# Allele frequencies, by marker

---

# Allele frequencies, by marker

---

# Genotype frequencies, by marker

---

# Heterozygosities, by marker

---

# Genotyping error rates

---

# Genotyping error rate vs percent missing

---

# Genotyping error rate vs percent missing

---

# Nice markers

---

# Crap markers

---

# More crap markers

---

# One bad blob

---

# Wrong genomic coordinates

---

# Puzzling no calls

---
class: inverse, middle, center

# Founder genotyping errors

---

# One founder missing

---

# Another case with one founder missing

---
class: compressed_bullets

# Summary

- Quality of results depends on quality of data
- Think about what might have gone wrong, and how it might be revealed
- Pulling out the bad samples is the most important thing
- Sex swaps: look at array intensities
- Look for sample duplicates, and if possible sample mix-ups
- Samples: missing data, array intensities, crossovers, errors
- Markers: lots of reasons for the bad ones

---
class: indent

# Acknowledgments

Alan Attie<br/>
Gary Churchill<br/>
Dan Gatti<br/>
Alexandra Lobo<br/>
Federico Rey<br/>
Śaunak Sen<br/>
Lindsay Traeger<br/>
Brian Yandell<br/><br/>
NIH/NIGMS, NIH/NIDDK

---
class: middle, indent

# Slides: [`bit.ly/2018ctc`](https://bit.ly/2018ctc) .cc0-badge[ [](https://creativecommons.org/publicdomain/zero/1.0) ]

[`kbroman.org`](https://kbroman.org)

[`github.com/kbroman`](https://github.com/kbroman)

[`@kwbroman`](https://twitter.com/kwbroman)
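
---

# Sketch: microbiome mismatch distance

The distance defined on the "Sample mix-ups: Microbiome data" slide can be worked through for the three count tables in this deck. This is a minimal Python sketch, not the code used for the analysis. The function name `mismatch_distance` and the dictionary layout are hypothetical, and the sketch assumes that table rows are read alleles (A, B) and table columns are the DNA sample's homozygous genotype (AA, BB).

```python
from typing import Dict, Tuple

# Read counts keyed by (read allele, DNA sample's homozygous genotype).
# Assumption: table rows are read alleles, table columns are DNA genotypes.
Counts = Dict[Tuple[str, str], int]


def mismatch_distance(counts: Counts) -> float:
    """Proportion of reads whose allele disagrees with the DNA genotype."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no reads overlap a SNP where the DNA sample is homozygous")
    # A read matches when genotype "AA" goes with allele "A", "BB" with "B".
    mismatched = sum(
        n for (allele, genotype), n in counts.items() if genotype != allele * 2
    )
    return mismatched / total


# Counts copied from the three tables; keys are (microbiome sample, DNA sample).
tables: Dict[Tuple[str, str], Counts] = {
    ("DO361", "DO361"): {
        ("A", "AA"): 939_918, ("A", "BB"): 1_044,
        ("B", "AA"): 2_998, ("B", "BB"): 125_962,
    },
    ("DO360", "DO360"): {
        ("A", "AA"): 2_661_645, ("A", "BB"): 190_188,
        ("B", "AA"): 427_685, ("B", "BB"): 202_335,
    },
    ("DO360", "DO370"): {
        ("A", "AA"): 3_137_751, ("A", "BB"): 1_475,
        ("B", "AA"): 7_461, ("B", "BB"): 310_369,
    },
}

for (microbiome, dna), counts in tables.items():
    d = mismatch_distance(counts)
    print(f"microbiome {microbiome} vs DNA {dna}: {100 * d:.2f}% mismatched reads")
```

Under those assumptions, microbiome DO360 disagrees with its own DNA sample at roughly 18% of reads but with DNA DO370 at under 0.3%, while DO361 disagrees with its own DNA sample at about 0.4%. That contrast is what the min vs self distance plots are meant to expose.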