class: center, middle, inverse, title-slide

# Cleaning genotype data for diversity outbred mice

### Karl Broman

### Biostatistics & Medical Informatics
University of Wisconsin–Madison

kbroman.org
github.com/kbroman
@kwbroman

Slides: bit.ly/2018ctc
---

# Heterogeneous stock

![](index_files/figure-html/dofig-1.svg)<!-- -->

---

# Diversity outbred mouse data

- 500 DO mice
- GigaMUGA SNP arrays (114k SNPs)
- RNA-seq data on pancreatic islets
- Microbiome data (16S and shotgun sequencing)
- protein and lipid measurements by mass spec
- Collaboration with Alan Attie, Gary Churchill, Brian Yandell, Josh Coon, Federico Rey, and many others

---

# Principles

What might have gone wrong?

How could it be revealed?

--

Also, just make a bunch of graphs.

--

If you see something weird, try to figure it out.

---

# Possible problems

- Sample duplicates
- Sample mix-ups
- Bad samples
- Bad markers
- Genotyping errors in founders

---
class: inverse, middle, center

# What to look at first?

---

# Missing data per sample

![](index_files/figure-html/plot_missing-1.svg)<!-- -->

---

# Missing data per sample

![](index_files/figure-html/missing_genotypes_labeled-1.svg)<!-- -->

---
class: inverse, middle, center

# Swapped sex labels

---

# Average SNP intensity on X and Y chr

![](index_files/figure-html/yint_vs_xint_all_probes-1.svg)<!-- -->

---

# Average SNP intensity on X and Y chr

![](index_files/figure-html/yint_vs_xint_selected_probes-1.svg)<!-- -->

---

# Average SNP intensity on X and Y chr

![](index_files/figure-html/yint_vs_xint_selected_probes_labeled-1.svg)<!-- -->

---

# Heterozygosity vs SNP intensity on X chr

![](index_files/figure-html/het_vs_xint-1.svg)<!-- -->

---

# Heterozygosity vs SNP intensity on X chr

![](index_files/figure-html/het_vs_xint_labeled-1.svg)<!-- -->

---
class: inverse, middle, center

# Sample duplicates

---

# Percent matching genotypes between pairs

![](index_files/figure-html/prop_match-1.png)<!-- -->

---

# Percent matching genotypes between pairs

![](index_files/figure-html/prop_match_labeled-1.png)<!-- -->

---
class: inverse, middle, center

# Sample mix-ups

---

# Sample mix-ups

![](index_files/figure-html/mixup_illustration_1-1.svg)<!-- -->

---

# Sample mix-ups

![](index_files/figure-html/mixup_illustration_2-1.svg)<!-- -->

---

# Sample mix-ups

![](index_files/figure-html/mixup_illustration_3-1.svg)<!-- -->

---

# Sample mix-ups

![](index_files/figure-html/mixup_illustration_4-1.svg)<!-- -->

---

# RNA-seq sample mix-ups: distance matrix

![](index_files/figure-html/gve_dist_image-1.png)<!-- -->

---

# RNA-seq sample mix-ups: min vs self distance

![](index_files/figure-html/gve_best_vs_self_unlabeled-1.svg)<!-- -->

---

# RNA-seq sample mix-ups: min vs self distance

![](index_files/figure-html/gve_best_vs_self_labeled-1.svg)<!-- -->

---

# RNA-seq sample mix-ups: detail

![](index_files/figure-html/rnaseq_problems_detail-1.svg)<!-- -->

---

# Microbiome data

![](index_files/figure-html/microbiome_illustration-1.svg)<!-- -->

---

# Sample mix-ups: Microbiome data

- Impute genotypes at all SNPs in DNA samples
- Map microbiome reads to mouse genome; find reads overlapping a SNP
- For each pair of samples (DNA + microbiome):
    - Focus on reads that overlap a SNP where that DNA sample is homozygous
    - Distance = proportion of reads where SNP allele doesn't match DNA sample's genotype

---

# Microbiome DO361 vs DNA DO361

<table class="table table-striped" style="font-size: 48px; margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:left;"> </th>
   <th style="text-align:right;"> AA </th>
   <th style="text-align:right;"> BB </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> A </td>
   <td style="text-align:right;"> 939,918 </td>
   <td style="text-align:right;"> 1,044 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> B </td>
   <td style="text-align:right;"> 2,998 </td>
   <td style="text-align:right;"> 125,962 </td>
  </tr>
</tbody>
</table>

---

# Microbiome DO360 vs DNA DO360

<table class="table table-striped" style="font-size: 48px; margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:left;"> </th>
   <th style="text-align:right;"> AA </th>
   <th style="text-align:right;"> BB </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> A </td>
   <td style="text-align:right;"> 2,661,645 </td>
   <td style="text-align:right;"> 190,188 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> B </td>
   <td style="text-align:right;"> 427,685 </td>
   <td style="text-align:right;"> 202,335 </td>
  </tr>
</tbody>
</table>

---

# Microbiome DO360 vs DNA DO370

<table class="table table-striped" style="font-size: 48px; margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:left;"> </th>
   <th style="text-align:right;"> AA </th>
   <th style="text-align:right;"> BB </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> A </td>
   <td style="text-align:right;"> 3,137,751 </td>
   <td style="text-align:right;"> 1,475 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> B </td>
   <td style="text-align:right;"> 7,461 </td>
   <td style="text-align:right;"> 310,369 </td>
  </tr>
</tbody>
</table>

---

# Microbiome mix-ups: min vs self distance

![](index_files/figure-html/microbiome_best_vs_self-1.svg)<!-- -->

---

# Microbiome mix-ups: min vs self distance

![](index_files/figure-html/microbiome_best_vs_self_labeled-1.svg)<!-- -->

---
class: inverse, middle, center

# Sample quality

---

# Missing data per sample

![](index_files/figure-html/missing_genotypes_labeled-1.svg)

---

# Array intensities

![](index_files/figure-html/array_intensities_densities-1.png)<!-- -->

---

# Allele frequencies, by individual

![](index_files/figure-html/genotype_freq-1.svg)<!-- -->

---

# Allele frequencies, by individual

![](index_files/figure-html/allele_freq_subset-1.svg)<!-- -->

---

# Allele frequencies, by individual

![](index_files/figure-html/allele_freq_subset_wdensity-1.svg)<!-- -->

---

# Genotype frequencies, by individual

![](index_files/figure-html/geno_freq_byind-1.svg)<!-- -->

---

# Genotype frequencies, by individual

![](index_files/figure-html/geno_freq_byind_wlabels-1.svg)<!-- -->

---

# Heterozygosities, by individual

![](index_files/figure-html/het_byind-1.svg)<!-- -->

---

# Genotype probabilities (one mouse on one chr)

![](index_files/figure-html/plot_genoprob-1.png)<!-- -->

---

# Genome reconstruction (one mouse)

![](index_files/figure-html/plot_onegeno-1.svg)<!-- -->

---

# Percent missing vs number of crossovers

![](index_files/figure-html/missing_v_nxo-1.svg)<!-- -->

---

# Percent missing vs number of crossovers

![](index_files/figure-html/missing_v_nxo_labeled-1.svg)<!-- -->

---

# Percent missing vs number of crossovers

![](index_files/figure-html/missing_v_nxo_subset-1.svg)<!-- -->

---

# Percent missing vs number of crossovers

![](index_files/figure-html/missing_v_nxo_subset_labeled-1.svg)<!-- -->

---

# No. crossovers by generation

![](index_files/figure-html/nxo_by_wave-1.svg)<!-- -->

---

# Estimated percent of genotyping errors

![](index_files/figure-html/genotype_errors-1.svg)<!-- -->

---

# Estimated percent of genotyping errors

![](index_files/figure-html/genotype_errors_subset-1.svg)<!-- -->

---
class: inverse, middle, center

# Marker quality

---

# Proportion missing data

![](index_files/figure-html/hist_missing_bymar-1.png)<!-- -->

---

# Allele frequencies, by marker

![](index_files/figure-html/allele_freq_bymar-1.svg)<!-- -->

---

# Allele frequencies, by marker

![](index_files/figure-html/allele_freq_bymar_wdensity-1.svg)<!-- -->

---

# Genotype frequencies, by marker

![](index_files/figure-html/geno_freq_bymar-1.png)<!-- -->

---

# Heterozygosities, by marker

![](index_files/figure-html/het_bymar-1.svg)<!-- -->

---

# Genotyping error rates

![](index_files/figure-html/genotype_errors_bymar-1.png)<!-- -->

---

# Genotyping error rate vs percent missing

![](index_files/figure-html/error_vs_missing_bymar-1.png)<!-- -->

---

# Genotyping error rate vs percent missing

![](index_files/figure-html/error_vs_missing_bymar_logscale-1.png)<!-- -->

---

# Nice markers

![](index_files/figure-html/snpint_nice-1.svg)<!-- -->

---

# Crap markers

![](index_files/figure-html/snpint_crap-1.svg)<!-- -->

---

# More crap markers

![](index_files/figure-html/snpint_more_crap-1.svg)<!-- -->

---

# One bad blob

![](index_files/figure-html/snpint_onebadblob-1.svg)<!-- -->

---

# Wrong genomic coordinates

![](index_files/figure-html/snpint_wrong_location-1.svg)<!-- -->

---

# Puzzling no calls

![](index_files/figure-html/snpint_nocalls-1.svg)<!-- -->

---
class: inverse, middle, center

# Founder genotyping errors

---

# One founder missing

![](index_files/figure-html/snpint_foundermissing-1.svg)<!-- -->

---

# Another case with one founder missing

![](index_files/figure-html/snpint_foundermissing_again-1.svg)<!-- -->

---
class: compressed_bullets

# Summary

- Quality of results depends on quality of data
- Think about what might have gone wrong, and how it might be revealed
- Pulling out the bad samples is the most important thing
- Sex swaps: look at array intensities
- Look for sample duplicates, and if possible sample mix-ups
- Samples: missing data, array intensities, crossovers, errors
- Markers: lots of reasons for the bad ones

---
class: indent

# Acknowledgments

Alan Attie<br/>
Gary Churchill<br/>
Dan Gatti<br/>
Alexandra Lobo<br/>
Federico Rey<br/>
Śaunak Sen<br/>
Lindsay Traeger<br/>
Brian Yandell<br/><br/>
NIH/NIGMS, NIH/NIDDK

---
class: middle, indent

# Slides: [`bit.ly/2018ctc`](https://bit.ly/2018ctc) .cc0-badge[ [![CC0](cc-zero.svg)](https://creativecommons.org/publicdomain/zero/1.0) ]

[`kbroman.org`](https://kbroman.org)

[`github.com/kbroman`](https://github.com/kbroman)

[`@kwbroman`](https://twitter.com/kwbroman)
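
---

# Appendix: microbiome distance, worked sketch

The microbiome slides define the distance as the proportion of reads whose SNP allele doesn't match the DNA sample's homozygous genotype. A minimal Python sketch of that arithmetic, using the read counts from the three tables in this deck, might look like the following. The function name `mismatch_distance`, its argument names, and the `tables` layout are illustrative assumptions, not the code used for the actual analysis.

```python
def mismatch_distance(a_vs_aa, a_vs_bb, b_vs_aa, b_vs_bb):
    """Proportion of reads whose allele disagrees with the DNA genotype.

    Arguments are read counts: read allele (A or B) against the DNA
    sample's homozygous genotype (AA or BB). Hypothetical helper.
    """
    mismatched = a_vs_bb + b_vs_aa  # A read at a BB site, or B read at an AA site
    total = a_vs_aa + a_vs_bb + b_vs_aa + b_vs_bb
    return mismatched / total


# Counts copied from the three tables in the slides.
# Key: (microbiome sample, DNA sample); value: (A|AA, A|BB, B|AA, B|BB).
tables = {
    ("DO361", "DO361"): (939_918, 1_044, 2_998, 125_962),
    ("DO360", "DO360"): (2_661_645, 190_188, 427_685, 202_335),
    ("DO360", "DO370"): (3_137_751, 1_475, 7_461, 310_369),
}

distances = {pair: mismatch_distance(*counts) for pair, counts in tables.items()}

for (microbiome, dna), d in distances.items():
    print(f"microbiome {microbiome} vs DNA {dna}: distance = {d:.4f}")
```

With these counts, microbiome DO360 is far from DNA DO360 but close to DNA DO370, the pattern one would expect if that microbiome sample were mislabeled.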