class: center, middle, inverse, title-slide

# Cleaning genotype data for diversity outbred mice

### Karl Broman
### Biostatistics & Medical Informatics
University of Wisconsin–Madison
kbroman.org
github.com/kbroman
@kwbroman
Slides:
bit.ly/2018ctc
---

# Heterogeneous stock

[Figure]

---

# Diversity outbred mouse data

- 500 DO mice
- GigaMUGA SNP arrays (114k SNPs)
- RNA-seq data on pancreatic islets
- Microbiome data (16S and shotgun sequencing)
- Protein and lipid measurements by mass spec
- Collaboration with Alan Attie, Gary Churchill, Brian Yandell, Josh Coon, Federico Rey, and many others

---

# Principles

What might have gone wrong?

How could it be revealed?

--

Also, just make a bunch of graphs.

--

If you see something weird, try to figure it out.

---

# Possible problems

- Sample duplicates
- Sample mix-ups
- Bad samples
- Bad markers
- Genotyping errors in founders

---
class: inverse, middle, center

# What to look at first?

---

# Missing data per sample

[Figure]

---
class: inverse, middle, center

# Swapped sex labels

---

# Average SNP intensity on X and Y chr

[Figure]

---

# Heterozygosity vs SNP intensity on X chr

[Figure]

---
class: inverse, middle, center

# Sample duplicates

---

# Percent matching genotypes between pairs

[Figure]
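---

# Percent matching genotypes between pairs: a sketch

One way to compute the quantity in the slide title, as a minimal sketch rather than the analysis behind the figure. The genotype coding (`"AA"`, `"AB"`, `"BB"`, with `None` for a no-call), the sample IDs, the toy data, and the 0.95 threshold are all assumptions made for illustration.

```python
from itertools import combinations


def prop_matching(g1, g2):
    """Proportion of markers with identical calls, among markers called in both samples.

    Genotypes are strings such as "AA", "AB", "BB"; None marks a no-call.
    Returns None when no marker is called in both samples.
    """
    both_called = [(a, b) for a, b in zip(g1, g2) if a is not None and b is not None]
    if not both_called:
        return None
    return sum(a == b for a, b in both_called) / len(both_called)


def likely_duplicates(genotypes, threshold=0.95):
    """Return (sample1, sample2, proportion) for pairs matching at or above threshold.

    The threshold is an arbitrary illustrative choice, not a recommended cutoff.
    """
    flagged = []
    for s1, s2 in combinations(sorted(genotypes), 2):
        p = prop_matching(genotypes[s1], genotypes[s2])
        if p is not None and p >= threshold:
            flagged.append((s1, s2, p))
    return flagged


# Toy data (made up): three samples typed at six markers
toy = {
    "mouse1": ["AA", "AB", "BB", "AA", None, "AB"],
    "mouse2": ["AA", "AB", "BB", "AA", "BB", "AB"],  # agrees with mouse1 wherever both are called
    "mouse3": ["BB", "AB", "AA", "AB", "BB", "AA"],
}

print(likely_duplicates(toy))
```

Markers with a no-call in either sample are dropped from both numerator and denominator, so a sample with lots of missing data is compared on fewer markers.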
---
class: inverse, middle, center

# Sample mix-ups

---

# Sample mix-ups

[Figure]

---

# RNA-seq sample mix-ups: distance matrix

[Figure]

---

# RNA-seq sample mix-ups: min vs self distance

[Figure]

---

# RNA-seq sample mix-ups: detail

[Figure]

---

# Microbiome data

[Figure]

---

# Sample mix-ups: Microbiome data

- Impute genotypes at all SNPs in DNA samples
- Map microbiome reads to mouse genome; find reads overlapping a SNP
- For each pair of samples (DNA + microbiome):
  - Focus on reads that overlap a SNP where that DNA sample is homozygous
  - Distance = proportion of reads where SNP allele doesn't match DNA sample's genotype

---

# Microbiome DO361 vs DNA DO361

<table class="table table-striped" style="font-size: 48px; margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:left;"> </th>
   <th style="text-align:right;"> AA </th>
   <th style="text-align:right;"> BB </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> A </td>
   <td style="text-align:right;"> 939,918 </td>
   <td style="text-align:right;"> 1,044 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> B </td>
   <td style="text-align:right;"> 2,998 </td>
   <td style="text-align:right;"> 125,962 </td>
  </tr>
</tbody>
</table>

---

# Microbiome DO360 vs DNA DO360

<table class="table table-striped" style="font-size: 48px; margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:left;"> </th>
   <th style="text-align:right;"> AA </th>
   <th style="text-align:right;"> BB </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> A </td>
   <td style="text-align:right;"> 2,661,645 </td>
   <td style="text-align:right;"> 190,188 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> B </td>
   <td style="text-align:right;"> 427,685 </td>
   <td style="text-align:right;"> 202,335 </td>
  </tr>
</tbody>
</table>

---

# Microbiome DO360 vs DNA DO370

<table class="table table-striped" style="font-size: 48px; margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:left;"> </th>
   <th style="text-align:right;"> AA </th>
   <th style="text-align:right;"> BB </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> A </td>
   <td style="text-align:right;"> 3,137,751 </td>
   <td style="text-align:right;"> 1,475 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> B </td>
   <td style="text-align:right;"> 7,461 </td>
   <td style="text-align:right;"> 310,369 </td>
  </tr>
</tbody>
</table>

---

# Microbiome mix-ups: min vs self distance

[Figure]

---
class: inverse, middle, center

# Sample quality

---

# Missing data per sample

[Figure]

---

# Array intensities
[Figure]

---

# Allele frequencies, by individual

[Figure]

---

# Genotype frequencies, by individual

[Figure]

---

# Heterozygosities, by individual

[Figure]

---

# Genotype probabilities (one mouse on one chr)

[Figure]

---

# Genome reconstruction (one mouse)

[Figure]

---

# Percent missing vs number of crossovers

[Figure]

---

# No. crossovers by generation

[Figure]

---

# Estimated percent of genotyping errors

[Figure]

---
class: inverse, middle, center

# Marker quality

---

# Proportion missing data

[Figure]

---

# Allele frequencies, by marker

[Figure]

---

# Genotype frequencies, by marker

[Figure]

---

# Heterozygosities, by marker

[Figure]

---

# Genotyping error rates

[Figure]

---

# Genotyping error rate vs percent missing

[Figure]

---

# Nice markers

[Figure]

---

# Crap markers

[Figure]

---

# More crap markers

[Figure]

---

# One bad blob

[Figure]
---

# Wrong genomic coordinates

[Figure]

---

# Puzzling no calls

[Figure]

---
class: inverse, middle, center

# Founder genotyping errors

---

# One founder missing

[Figure]

---

# Another case with one founder missing

[Figure]

---
class: compressed_bullets

# Summary

- Quality of results depends on quality of data
- Think about what might have gone wrong, and how it might be revealed
- Pulling out the bad samples is the most important thing
- Sex swaps: look at array intensities
- Look for sample duplicates, and if possible sample mix-ups
- Samples: missing data, array intensities, crossovers, errors
- Markers: lots of reasons for the bad ones

---
class: indent

# Acknowledgments

Alan Attie<br/>
Gary Churchill<br/>
Dan Gatti<br/>
Alexandra Lobo<br/>
Federico Rey<br/>
Śaunak Sen<br/>
Lindsay Traeger<br/>
Brian Yandell<br/><br/>
NIH/NIGMS, NIH/NIDDK

---
class: middle, indent

# Slides: [`bit.ly/2018ctc`](https://bit.ly/2018ctc) .cc0-badge[ [CC0](https://creativecommons.org/publicdomain/zero/1.0) ]

[`kbroman.org`](https://kbroman.org)

[`github.com/kbroman`](https://github.com/kbroman)

[`@kwbroman`](https://twitter.com/kwbroman)
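---

# Appendix: microbiome distance, a sketch

The distance defined on the "Sample mix-ups: Microbiome data" slide can be worked through with the three read-count tables shown earlier. This is a minimal sketch, not the code used for the talk. It assumes the table rows (A, B) are the allele seen in the microbiome read and the columns (AA, BB) are the DNA sample's homozygous genotype; the function name and the dictionary layout are made up for illustration.

```python
def mismatch_distance(table):
    """Proportion of reads whose allele disagrees with the DNA sample's homozygous genotype.

    table[read_allele][genotype] holds read counts, with read_allele in {"A", "B"}
    and genotype in {"AA", "BB"} (an assumed layout, chosen for illustration).
    """
    mismatches = table["A"]["BB"] + table["B"]["AA"]
    total = sum(sum(row.values()) for row in table.values())
    return mismatches / total


# Read counts copied from the three tables in the slides,
# keyed by (microbiome sample, DNA sample)
tables = {
    ("DO361", "DO361"): {"A": {"AA": 939_918, "BB": 1_044},
                         "B": {"AA": 2_998, "BB": 125_962}},
    ("DO360", "DO360"): {"A": {"AA": 2_661_645, "BB": 190_188},
                         "B": {"AA": 427_685, "BB": 202_335}},
    ("DO360", "DO370"): {"A": {"AA": 3_137_751, "BB": 1_475},
                         "B": {"AA": 7_461, "BB": 310_369}},
}

for (microbiome_id, dna_id), table in tables.items():
    d = mismatch_distance(table)
    print(f"microbiome {microbiome_id} vs DNA {dna_id}: distance = {d:.4f}")
```

With these counts the distances come out to roughly 0.004, 0.18, and 0.003. Microbiome DO360 is therefore much closer to DNA DO370 than to DNA DO360, which suggests a mix-up involving that sample.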