Introduction

While preparing a talk and paper on genotype diagnostics in Diversity Outbred mice, I developed some concerns about the UNC annotation file for the GigaMUGA array. In particular, I felt that some of the columns may have been scrambled:

  • In blasting the GigaMUGA probe sequences against the mouse genome, I found the same number of non-unique markers (3,114), but they were not entirely the same markers.

  • Of the markers in common between the MegaMUGA and GigaMUGA arrays, the chromosome assignments have a lot of differences, even among those indicated to be unique.

In case that the probe sequences in the UNC annotation files might be in error, I obtained the array probe sequences directly from GeneSeek. These proved to be the same as those in the UNC files.

My goal here is to construct new annotation files for the GigaMUGA and MegaMUGA arrays with updated information about which markers map uniquely, which map to multiple locations, and which do not map. I’ll include genomic coordinates only for the markers that map uniquely.

A summary of my findings:

  • For markers mapping uniquely (by my determination), the genomic coordinates in the UNC GigaMUGA annotation file were correct (except for a set of 394 markers where the UNC position is off by one basepair, because the probe sequences include the SNP and this wasn’t taken into account).

  • The UNC MegaMUGA annotation file had coordinates in a previous mouse genome build (mm9). For markers mapping uniquely, there are just a small number with differences in chromosome assignments; there are lots of differences in positions, with the new build (mm10).

  • The unique column in the GigaMUGA annotation file from UNC is messed up.

  • On both arrays, there are a number of pairs of markers with common probe sequences. Many of these are intended duplicates, but some are not and have inconsistent information

  • Of the markers in common between the two arrays, some have different probe sequences on the two arrays, and there are cases where the sequence on one array maps uniquely but the other does not.

The final annotation files and all source materials are at https://github.com/kbroman/MUGAarrays.

Procedures

The key thing here was to run blastn to map all array probe sequences to the mouse genome (build mm10). There was some trickiness in trimming SNP alleles off some probes, and in identifying the SNP position from the blast results.

Separately, I studied the probe sequences in some detail. Most probes do not include the SNP, which is the next base off the 3’ end. But for probes where AlleleB_ProbeSeq is provided (in addition to AlleleA_ProbeSeq, which is always provided), the sequences include the SNP allele as the last base. In these cases, I trimmed off the SNP allele and used the 49-bp sequence.

I performed blast allowing no insertions/deletions. In cases where the match didn’t include the full probe sequence, I took all non-aligned bases to be mismatches.

More precisely, I installed blastn on my linux laptop with

sudo apt install ncbi-blast+

I downloaded the mm10 mouse genome files from http://hgdownload-test.cse.ucsc.edu/goldenPath/mm10/bigZips/

I created index files with

makeblastdb -in MouseFasta/chr19.fa -dbtype nuc

I ran blastn with a command like this:

blastn -db MouseFasta/chr19.fa -ungapped -out output_c19.txt -query gigamuga.fa -outfmt 6

For some reason, there were a small number of duplicate alignments (same query, same chromosome, same start and end) in the results; I omitted all but one of these duplicates.

I’ve also focused on full-length, perfect hits.

Preliminaries

I’ll first load some R packages.

library(data.table)
library(broman)
library(devtools)

I’ll now load various data sets: the GeneSeek and UNC annotation files, and the blast results. For the blast results, I only keep the full length, perfect hits.

# GeneSeek files
gm_gs <- data.table::fread("../GeneSeek/gigamuga_geneseek.csv",
                           skip=7, data.table=FALSE)
wh <- which(gm_gs[,1]=="[Controls]")
gm_gs <- gm_gs[1:(wh-1),]
rownames(gm_gs) <- gm_gs$Name

mm_gs <- data.table::fread("../GeneSeek/megamuga_geneseek.csv",
                           skip=7, data.table=FALSE)
wh <- which(mm_gs[,1]=="[Controls]")
mm_gs <- mm_gs[1:(wh-1),]
rownames(mm_gs) <- mm_gs$Name

# UNC files
load("../UNC/snps.gigamuga.Rdata")
gm_unc <- snps
# omit "chr" from the chromosome IDs
gm_unc$chr <- sub("chr", "", gm_unc$chr)

load("../UNC/snps.megamuga.Rdata")
mm_unc <- snps
# omit "chr" from the chromosome IDs
mm_unc$chr <- sub("chr", "", mm_unc$chr)

# Blast results; keep just the perfect matches(?)
mm_blast <- readRDS("../Blast/results_mm/mm_blastn_results.rds")
mm_blast <- mm_blast[mm_blast$tot_mismatch==0,]
gm_blast <- readRDS("../Blast/results_gm/gm_blastn_results.rds")
gm_blast <- gm_blast[gm_blast$tot_mismatch==0,]

# blast results for untrimmed sequences
gm_untrimmed_blast <- readRDS("../Blast/results_gm_untrimmed/gm_untrimmed_blastn_results.rds")
gm_untrimmed_blast <- gm_untrimmed_blast[gm_untrimmed_blast$tot_mismatch==0,]

For reasons that will become apparent later, for a set of 469 markers, I also ran blast with an “untrimmed” probe sequence that included the SNP allele at the end.

Summarize blast hits

I’ll start with a bunch of code to summarize the blast hits. For each SNP, I want to count the number of perfect blast hits and the number of chromosomes hit. For markers with a unique hit, I’ll record the chromosome, position, and strand. For markers that hit multiple locations but all on the same chromosome, I’ll record the range of the positions hit (in basepairs).

Here’s the code for the MegaMUGA array:

# no. blast hits
mm_tab <- table(mm_blast$query)
mm_nchr <- mm_nhits <- setNames(rep(0, nrow(mm_gs)), rownames(mm_gs))
mm_nhits[names(mm_tab)] <- mm_tab

# no. chromosomes hit
mm_tab_chr <- table(mm_blast$query, mm_blast$chr)
mm_nchr[rownames(mm_tab_chr)] <- rowSums(mm_tab_chr > 0)

# chr,pos,strand for the unique ones
mm_blast_uniq <- mm_blast[mm_blast$query %in% names(mm_nhits)[mm_nhits==1],]
mm_blast_chr <- mm_blast_pos <- mm_blast_strand <- setNames(rep(NA, nrow(mm_gs)), rownames(mm_gs))
mm_blast_chr[mm_blast_uniq$query] <- mm_blast_uniq$chr
mm_blast_pos[mm_blast_uniq$query] <- mm_blast_uniq$snp_pos
mm_blast_strand[mm_blast_uniq$query] <- mm_blast_uniq$strand

# probe sequences
mm_blast_probe <- setNames(mm_gs$AlleleA_ProbeSeq, mm_gs$Name)
mm_trim <- (mm_gs$AlleleB_ProbeSeq != "")
mm_blast_probe[mm_trim] <- substr(mm_blast_probe[mm_trim], 1, nchar(mm_blast_probe[mm_trim])-1)

# SNP alleles
mm_blast_snp <- setNames(sub("]", "", mm_gs$SNP, fixed=TRUE), mm_gs$Name)
mm_blast_snp <- sub("[", "", mm_blast_snp, fixed=TRUE)
mm_blast_snp <- sub("/", "", mm_blast_snp, fixed=TRUE)

# put all of this stuff into a data frame
mm_uwisc <- data.frame(marker=names(mm_nhits),
                       n_blast_hits=mm_nhits,
                       n_blast_chr=mm_nchr,
                       unique=(mm_nhits==1),
                       multi=(mm_nhits>1),
                       unmapped=(mm_nhits==0),
                       chr=mm_blast_chr,
                       pos=mm_blast_pos,
                       strand=mm_blast_strand,
                       snp=mm_blast_snp[names(mm_nhits)],
                       probe=mm_blast_probe[names(mm_nhits)],
                       stringsAsFactors=FALSE)
rownames(mm_uwisc) <- mm_uwisc$marker

# probes that hit multiply but all to one chromosome
mm_mult_but1chr <- names(mm_nhits)[mm_nhits > 1 & mm_nchr==1]
mm_mult_but1chr_range <- sapply(mm_mult_but1chr, function(mar)
    diff(range(mm_blast$snp_pos[mm_blast$query==mar])))

# probes that hit uniquely on each of X and Y and to no other chromosome
mm_hitXY <- rownames(mm_tab_chr)[rowSums(mm_tab_chr[,1:20])==0 & mm_tab_chr[,21]>0 & mm_tab_chr[,22]>0]

Now we repeat that for the GigaMUGA array:

# no. blast hits
gm_tab <- table(gm_blast$query)
gm_nchr <- gm_nhits <- setNames(rep(0, nrow(gm_gs)), rownames(gm_gs))
gm_nhits[names(gm_tab)] <- gm_tab

# no. chromosomes hit
gm_tab_chr <- table(gm_blast$query, gm_blast$chr)
gm_nchr[rownames(gm_tab_chr)] <- rowSums(gm_tab_chr > 0)

# chr,pos,strand for the unique ones
gm_blast_uniq <- gm_blast[gm_blast$query %in% names(gm_nhits)[gm_nhits==1],]
gm_blast_chr <- gm_blast_pos <- gm_blast_strand <- setNames(rep(NA, nrow(gm_gs)), rownames(gm_gs))
gm_blast_chr[gm_blast_uniq$query] <- gm_blast_uniq$chr
gm_blast_pos[gm_blast_uniq$query] <- gm_blast_uniq$snp_pos
gm_blast_strand[gm_blast_uniq$query] <- gm_blast_uniq$strand

# probe sequences
gm_blast_probe <- setNames(gm_gs$AlleleA_ProbeSeq, gm_gs$Name)
gm_trim <- (gm_gs$AlleleB_ProbeSeq != "")
gm_blast_probe[gm_trim] <- substr(gm_blast_probe[gm_trim], 1, nchar(gm_blast_probe[mm_trim])-1)

# SNP alleles
gm_blast_snp <- setNames(sub("]", "", gm_gs$SNP, fixed=TRUE), gm_gs$Name)
gm_blast_snp <- sub("[", "", gm_blast_snp, fixed=TRUE)
gm_blast_snp <- sub("/", "", gm_blast_snp, fixed=TRUE)

# put all of this stuff into a data frame
gm_uwisc <- data.frame(marker=names(gm_nhits),
                       n_blast_hits=gm_nhits,
                       n_blast_chr=gm_nchr,
                       unique=(gm_nhits==1),
                       multi=(gm_nhits>1),
                       unmapped=(gm_nhits==0),
                       chr=gm_blast_chr,
                       pos=gm_blast_pos,
                       strand=gm_blast_strand,
                       snp=gm_blast_snp[names(gm_nhits)],
                       probe=gm_blast_probe[names(gm_nhits)],
                       stringsAsFactors=FALSE)
rownames(gm_uwisc) <- gm_uwisc$marker

# probes that hit multiply but all to one chromosome
gm_mult_but1chr <- names(gm_nhits)[gm_nhits > 1 & gm_nchr==1]
gm_mult_but1chr_range <- sapply(gm_mult_but1chr, function(mar)
    diff(range(gm_blast$snp_pos[gm_blast$query==mar])))

# probes that hit uniquely on each of X and Y and to no other chromosome
gm_hitXY <- rownames(gm_tab_chr)[rowSums(gm_tab_chr[,1:20])==0 & gm_tab_chr[,21]>0 & gm_tab_chr[,22]>0]

MegaMUGA

Basic blast results

I’ll start by focusing on the MegaMUGA array, which contains 77,808 markers. In the UNC annotation file, 75,135 are on autosomes, 2,499 are on the X chromosome, 45 are on the Y chromosome, 46 are on the mitochondrial genome, and 83 have chromosome P, which indicates they are transgene-related markers.

In the blast results, I find that 75,061 markers have a single, unique hit in the mouse genome (build mm10), while 1,236 hit to multiple locations (median 4 locations and maximum 18,855), and 1,511 have no perfect match in the mm10 build of the mouse genome.

Unique markers

If we look at the inferred chromosome assignments of the markers with a unique hit to the genome, all but 6 match the chromosome in the UNC annotation file. Here are the markers with discrepancies in chromosome assignment:

# reorder mm_unc
mm_unc <- mm_unc[mm_uwisc$marker,]
wh <- mm_uwisc$unique & mm_uwisc$chr != mm_unc$chr
mm_diff_chr <- data.frame(marker=mm_uwisc$marker,
                          unc_chr=mm_unc$chr,
                          blast_chr=mm_uwisc$chr,
                          stringsAsFactors=FALSE)[wh,]
mm_diff_chr <- mm_diff_chr[order(mm_diff_chr$marker),]
rownames(mm_diff_chr) <- 1:nrow(mm_diff_chr)
mm_diff_chr
##            marker unc_chr blast_chr
## 1   MiniMuscle001      16        11
## 2          Mit024       M         1
## 3          NNT001       P        13
## 4          NNT002       P        13
## 5 SAbGeoEUCOMM001       P         5
## 6 SAbGeoEUCOMM002       P         5

Four of these were chr P (transgene-related) in the UNC annotations but map to autosomes, one was mitochondrial (with a name to match) but the sequence maps to chr 1, and one (MiniMuscle001) was assigned to chr 16 but maps to chr 11.

Five of these six markers are found on the GigaMUGA array (all but SAbGeoEUCOMM002), and the chromosome assignments in the UNC annotations of the GigaMUGA array all match what we see in these blast results, rather than the chromosome in the UNC annotations of the MegaMUGA array.

I won’t make any comparisons of basepair positions, as the UNC annotation file for the MegaMUGA array uses coordinates from build mm9 of the mouse genome, rather than build mm10, which I’m using.

Multiple hits to one chromosome

There are 364 markers on the MegaMUGA array that have multiple perfect hits in the mouse genome, but all on the same chromosome. The range in SNP positions, for these multiple hits, varied from 35 bp to 147 Mbp, with a median of 218 kbp.

There are 29 markers with multiple hits but all on a single chromosome and all within a 1000 bp interval. We could maybe keep these, and assign each to the median of the observed positions. But for now we’ll leave them with missing genomic coordinates.

Chromosome “P”

The UNC annotations for the MegaMUGA array has 83 markers with chromosome P, which are transgene-related markers, for example Cre001 and LacZ001. Only four have hits to the mm10 mouse genome build, and these are the four mentioned above, that actually map to autosomes. The other 79 have no perfect hits in the blast results.

Pseudoautosomal markers

There are 17 markers with names like PAR10, which I take to mean they’re from the pseudoautosomal region. All of these were placed on chr X in the UNC annotation file. In the blast results, four of these have unique hits in the mouse genome, all to chromosome X. The remainder have multiple hits. PAR4 has a single hit on each of the X and Y chromosomes. PAR6 has a single hit on the X chromosome and two hits on the Y chromosome. The other 11 PAR markers have multiple hits but just to the X chromosome.

There are two additional markers with hits to both the X and the Y chromosome, but with no hits to autosomes: UNC31596388 and UNC31596457. UNC31596388 has a single hit on each of the X and Y chromosomes. UNC31596457 has a single hit on the X and two hits on the Y. So maybe these are also pseudoautosomal markers.

Markers with the same probes

One last thing on the MegaMUGA array: there are 50 pairs of markers with the same probe sequence, and one set of four markers that all have the same sequence.

For most of these replicate probes, it seems like they were intended, as they have names like JAX00333860 and repJAX00333860, or CEAJAX00159275 and CEBJAX00159275.

The group of four probes with matching sequences are JAX00725045, JAX00725054, repJAX00725045, and repJAX00725054, all on the Y chromosome. So two pairs of intended replicates, but it’s not obvious that JAX00725045 and JAX00725054 should be the same.

Of the other 50 pairs, 35 are obvious from their names. Here are the others:

mm_probe_tab <- table(mm_uwisc$probe)
dup <- names(mm_probe_tab)[mm_probe_tab==2]
dup_pairs <- t(sapply(dup, function(a) mm_uwisc$marker[mm_uwisc$probe==a]))
dup_pairs <- dup_pairs[dup_pairs[,2] != paste0("rep", dup_pairs[,1]) &
                       dup_pairs[,1] != paste0("rep", dup_pairs[,2]) &
                       dup_pairs[,2] != paste0("CEA", dup_pairs[,2]) &
                       dup_pairs[,1] != paste0("CEA", dup_pairs[,2]) &
                       sub("^CEA", "CEB", dup_pairs[,2]) != dup_pairs[,1] &
                       sub("^CEA", "CEB", dup_pairs[,1]) != dup_pairs[,2],]
dup_pairs <- t(apply(dup_pairs, 1, sort))
rownames(dup_pairs) <- 1:nrow(dup_pairs)
colnames(dup_pairs) <- c("snp1", "snp2")
dup_pairs
##    snp1             snp2         
## 1  "Prdm9001"       "Prdm9005"   
## 2  "Mit020"         "Mit031"     
## 3  "UNC2512583"     "UNC88694"   
## 4  "Mit001"         "Mit002"     
## 5  "NNT001"         "NNT002"     
## 6  "UNC23833145"    "UNC4368900" 
## 7  "UNC20251282"    "UNC5746160" 
## 8  "UNC27690624"    "UNC30009421"
## 9  "UNC10237349"    "UNC9964748" 
## 10 "Mx1002"         "Mx1004"     
## 11 "Mx1001"         "Mx1003"     
## 12 "UNC19598208"    "UNC24941679"
## 13 "NRAMP1001"      "NRAMP1002"  
## 14 "Mit017"         "Mit018"     
## 15 "CEAJAX00012397" "UNC2165985"

This begs the question: are there other marker pairs whose names indicate that they should be duplicates, but where they’re not really duplicates?

mn <- mm_uwisc$marker
mnrep <- paste0("rep", mn)
mncea <- paste0("CEA", mn)
mnceb <- paste0("CEB", substr(mn, 4, nchar(mn)))
mnceb[grepl("^CEB", mn)] <- "" # drop those that already start with CEB

more_dup <- rbind(cbind(mn, mnrep)[mnrep %in% mn,],
                  cbind(mn, mncea)[mncea %in% mn,],
                  cbind(mn, mnceb)[mnceb %in% mn,])

more_dup_probes <- cbind(mm_uwisc[more_dup[,1],"probe"],
                         mm_uwisc[more_dup[,2],"probe"])

stopifnot(more_dup_probes[,1] == more_dup_probes[,2])

If I look for the patterns that I saw in the marker pairs with common sequences, I see a total of 37 pairs, but these are exactly the ones found based on their sequences matching (including the two pairs that make up the set of four markers all with the same sequence).

GigaMUGA

I’ll now turn to the GigaMUGA array, and do the same stuff plus a bit more.

Basic blast results

The GigaMUGA array contains 143,259 markers. In the UNC annotation file, 137,475 are on autosomes, 5,589 are on the X chromosome, 83 are on the Y chromosome, and 32 are on the mitochondrial genome.

In the UNC annotation file, none of the GigaMUGA markers are assigned to chromosome P, but there are 80 markers where the chromosome ID is missing. Of these, 72 were on the MegaMUGA array and assigned to chromosome P, 3 were on MegaMUGA array but assigned to chr 18, and 5 were not on the MegaMUGA array.

(Of the 83 markers on the MegaMUGA array with chr ID P, 8 are not on the GigaMUGA array, 72 have missing chr ID in the UNC annotations on the GigaMUGA array, 2 are on chr 13 in the UNC annotation of the GigaMUGA array, and 1 is on chr 5 in the UNC annotation of the GigaMUGA array.)

I’m going to take these NA chromosome IDs to be P.

gm_unc$chr[is.na(gm_unc$chr)] <- "P"

The GeneSeek annotation file for the GigaMUGA array includes 143,446 markers: the 143,259 markers in the UNC annotation file for the array, plus another 187 markers. These extra markers all have names like CTRL-22 and so seem to be control probes. None of them have matches in the mouse genome, and so we will omit them from further consideration.

# make sure control probes have no hits to mouse genome
ctrl <- gm_gs$Name[!(gm_gs$Name %in% gm_unc$marker)]
stopifnot( !any(ctrl %in% gm_blast$query) )

# omit these control probes from the annotations
gm_gs <- gm_gs[gm_gs$Name %in% gm_unc$marker,]
gm_uwisc <- gm_uwisc[gm_uwisc$marker %in% gm_unc$marker,]

# omit them from other pieces I'm using
gm_nhits <- gm_nhits[gm_uwisc$marker]

In the blast results, I find that 137,359 markers have a single, unique hit in the mouse genome (build mm10), while 3,150 hit to multiple locations (median 3 locations and maximum 18,855), and 2,750 have no perfect match in the mm10 build of the mouse genome.

Unique markers

If we look at the inferred chromosome assignments of the 137,359 markers with a unique hit to the genome, all of them match the chromosome in the UNC annotation file.

# all of the unique ones match
stopifnot( all(gm_unc[gm_uwisc$marker[gm_uwisc$unique], "chr"] == gm_uwisc$chr[gm_uwisc$unique]) )

# reorder gm_unc and gm_gs
gm_unc <- gm_unc[gm_uwisc$marker,]

The SNP positions match for 136,965 markers, but are off by ±1 for the other 394 markers. These off-by-1 markers are exactly the ones where the probe sequence in the GeneSeek annotation file included the SNP, which I trimmed off. Moreover, the ones that I have at +1 relative to the UNC annotation are the ones on the minus strand, and the ones I have at -1 relative to the UNC annotation are the ones on the plus strand. And so these off-by-one values are due to the handling of the cases where the probe sequences included the SNP. And I am confident that my positions are the correct values.

# verify the statements above
# markers where position is off by one are the cases where AlleleB is provided
off_by_plus1 <- gm_uwisc$marker[gm_uwisc$pos == gm_unc$pos+1 & gm_uwisc$unique]
off_by_minus1 <- gm_uwisc$marker[gm_uwisc$pos == gm_unc$pos-1 & gm_uwisc$unique]
off_by_pm1 <- c(off_by_plus1, off_by_minus1)
stopifnot( sort(off_by_pm1) == sort(gm_gs$Name[gm_gs$AlleleB_ProbeSeq != "" & gm_uwisc$unique]) )
stopifnot( sort(off_by_plus1) == sort(gm_gs$Name[gm_gs$AlleleB_ProbeSeq != "" & gm_uwisc$unique & gm_uwisc$strand=="minus"]) )
stopifnot( sort(off_by_minus1) == sort(gm_gs$Name[gm_gs$AlleleB_ProbeSeq != "" & gm_uwisc$unique & gm_uwisc$strand=="plus"]) )

Here’s an example of this, for marker UNCrs220914250. I show the basepair positions assigned by me and UNC, as well as the two probes and the source sequence.

##                chr pos_uwisc   pos_unc
## UNCrs220914250   1 130451926 130451927
## probe A:   CACCTGAGTGATGGACCCTGTGCTCTGGGTGGTCTTTTCCCTTAAAGCTC
## probe B:   CACCTGAGTGATGGACCCTGTGCTCTGGGTGGTCTTTTCCCTTAAAGCTG
## source:  NCCACCTGAGTGATGGACCCTGTGCTCTGGGTGGTCTTTTCCCTTAAAGCT[C/G]TGCTA ...

The blast result for the trimmed sequence has it matching to chr 1 from 130,451,877 to 130,451,925, putting the SNP at 130,451,926.

The blast results for the untrimmed sequence has it matching to chr 1 from 130,451,877 to 130,451,926, but since the probe includes the SNP, that would place the SNP at the end position, 130,451,927.

Multiple hits to one chromosome

There are 1,840 markers on the GigaMUGA array that have multiple perfect hits in the mouse genome, but all on the same chromosome. The range in SNP positions, for these multiple hits, varied from 33 bp to 147 Mbp, with a median of 596 kbp.

There are 55 markers with multiple hits but all on a single chromosome and all within a 1000 bp interval. We could maybe keep these, and assign each to the median of the observed positions. But for now we’ll leave them with missing genomic coordinates.

Chromosome “P”

Regarding the chromosome “P” markers: there were 80 GigaMUGA markers where the chromosome ID was missing in the UNC annotation file, and which we’re interpreting here as P (transgene-related). None of them have a hit to the mm10 mouse genome build.

Pseudoautosomal region

There are 60 markers with names like PAR10 or PARv2S06. 29 of these have unique hits in the mouse genome, all to the X chromosome. One PAR marker (PARv2S58) has no hits to the mouse genome. The other 30 have multiple hits

There are 42 markers that have hits on both the X and Y chromosomes and not to any autosomes. These include 16 of these PAR markers, but also another 26 markers with names starting JAX or UNC. Seven of these markers have a single hit on each of the X and Y chromosomes, which sounds like what we’d want for a pseudoautosomal marker: PAR4, PARv2S15, PARv2S18, PARv2S44, UNC31596388, UNCHS049772, and UNCJPD007636.

Markers with the same probes

As with the MegaMUGA array, there are a number of sets of probes with the same sequence: 2,620 pairs of markers with the same probe sequence, 49 trios, and 2 groups of four.

gm_probe_tab <- table(gm_uwisc$probe)
gm_pairs <- t(sapply(names(gm_probe_tab)[gm_probe_tab==2],
                     function(probe) gm_uwisc$marker[gm_uwisc$probe==probe]))
gm_trios <- t(sapply(names(gm_probe_tab)[gm_probe_tab==3],
                     function(probe) gm_uwisc$marker[gm_uwisc$probe==probe]))
gm_quart <- t(sapply(names(gm_probe_tab)[gm_probe_tab==4],
                     function(probe) gm_uwisc$marker[gm_uwisc$probe==probe]))

One of the quartets is for markers UNCrs52300311, complement008, complement016, and complement018, with the absurd probe sequence AGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAG, which doesn’t have a blast hit to the mouse genome.

The other quartet is for markers JAX00076336, JAX00076336q, UNC27974559, and UNCHS044416, whose sequence has a unique hit in the mouse genome, on chr 17. Of these four markers, only UNC27974559 was on the MegaMUGA array. In the UNC annotation file for the GigaMUGA array, these were all placed at the same location on chr 17, and at the same location as the blast hit.

I spent a bunch of time studying the marker names for those sharing a common probe sequence. There are a bunch of pairs that are maybe obvious small changes from one name to another:

  • add q, r, s, t, or u at the end of the name
  • replace a q,r,s,t,u at the end with a different one of these letters
  • add “backup” at the beginning
  • add “CEA” at the beginning
  • Replace “CEA” at the beginning with “CEB”

These can explain about half of the pairs of markers with common sequences. But when I start looking for pairs of markers that satisfy these rules, I find a ton of marker pairs that are related in these ways but have completely different sequences. I finally decided the whole thing was fruitless.

unique” column in the GM array

The unique column in the UNC annotations for the GM array is quite different from what I’m finding with blast.

The UNC annotation files indicates that 3114 of the GigaMUGA markers are non-unique.

When I first ran blast, I found exactly this many of the GigaMUGA markers had multiple hits to the mouse genome. I’ve since changed my analysis slightly, by trimming off the SNP from a set of 469 probes that contained it. As a result, I’m now getting 3150 markers with multiple hits, but the difference (36) is due to this trimming of some of the probe sequences.

(I went back and checked, and indeed, of the markers that I’d trimmed, there were 62 that showed multiple hits with the trimmed sequence, but only 26 that showed multiple hits with the untrimmed sequence, and so using the untrimmed sequences, I’d get 3150 - 62 + 26 = 3114 markers with multiple hits.)

But anyway, the key thing is that these indicators of markers with multiple hits in the mouse genome, while not completely independent, are quite different. The two lists have just 1469 markers in common. There are 1681 markers that my blast results say have multiple hits but that the UNC file says are unique in the mouse genome, and there are 1645 markers that my blast results say have 0 or 1 hits but that the UNC file says are non-unique.

We can spot check a few of these.

  • UNC22328648 is said to be unique by the UNC annotations, but I find it has 36 hits in the mouse genome, on 17 different chromosomes. Here’s the probe:

    TGCTGGGATTTGAACTCAGAACCTTCGGAAGAGCAGTCAGTGCTCTTAAC

  • UNC7052740 is said to be non-unique by the UNC annotations, but I find it has a single, unique hit on chr 4 at 35,637,450 bp, which is the position given in the UNC annotations. Here’s the probe:

    TAATGTGATAGATCCCCAGGTGGGGCAGTCTCTGGATGGTCCATCCTTCC

  • UNC15059592 is said to be unique by the UNC annotations, but this is the one that I find has the most hits to the mouse genome: 18,855 on 21 chromosomes (all but the mtDNA). Here’s the probe:

    GCTTCTGGCTATTATAAATAAGGCTGCTATGAACATAGTGGAGCATGTGT

Markers on both arrays

There are 66,992 markers that are in common between the MegaMUGA and GigaMUGA arrays, if we just match the names.

If we instead use the probe sequences, we get a larger number of markers that are in common: 68,572 of the probe sequences present in the GigaMUGA array are also found in the MegaMUGA array.

But somehow only 67,122 of the sequences on the MegaMUGA array are also on the GigaMUGA array. This difference is due to the many sequences that are in duplicate on each array.

The MegaMUGA array has 77,755 unique probes across 77,808 markers, and 67,087 of them are also on the GigaMUGA array.

The GigaMUGA array has 140,535 unique probes across 143,259 markers, and again 67,087 of them are also on the MegaMUGA array.

Now the GigaMUGA annotation file has a column is.MM that presumably indicates which markers are also on the MegaMUGA array. It points to 67,645 markers, which includes the 66,992 markers with matching names, but also includes another 653 markers.

# markers where "is.MM" is true but the exact marker name isn't seen on MM array
odd_mm <- as.character(gm_unc$marker)[gm_unc$is.MM & !(gm_unc$marker %in% mm_unc$marker)]
odd_mm_no_r <- sub("r$", "", odd_mm)

# the ones that changed probes
probe_changed <- odd_mm_no_r[gm_uwisc[odd_mm,"probe"] != mm_uwisc[odd_mm_no_r, "probe"]]
probe_changed_r <- paste0(probe_changed, "r")

These 653 markers all have names like “JAX00017210r”, where “JAX00017210” is a marker on the MegaMUGA array (and 617 of the markers without the terminal r are also on the GigaMUGA array). Importantly, this indicates that the is.MM column is as intended. So while I’ve concluded that the unique column is at least partially scrambled, there’s no evidence for other columns being similarly scrambled.

636 of the probes are unchanged between the MegaMUGA and GigaMUGA arrays, despite the “r” that was added to the name.

That leaves another 17 markers for which the probe was changed. For one of these markers, the probe on the MegaMUGA had multiple hits but the probe on the GigaMUGA has no hits. For two others, the probe on the MegaMUGA array has multiple genome hits but the probe on the GigaMUGA array is unique. Twelve of the markers have unique sequences on both arrays, and both are for the same SNP position, but with one on the plus strand and one on the minus strand. But then there are two markers where the SNP was switched from the plus to the minus strand but also shifted somewhat: JAX00712617/JAX00712617r and JAX00181213/JAX00181213r. Both pairs are on the X chromosome, and the difference in SNP position is 72 bp in the first case and 34 bp in the second.

I’m coming to decide that maybe we should just focus on SNP positions and ignore the marker names themselves: cases with different names but really the same probe, cases with names suggesting re-designed probes that actually lead to different SNPs being assayed. Moreover, of the 66,992 markers that are on both arrays (by matching their names), 29 actually have different probes on the two arrays.

mar_in_both <- mm_uwisc$marker[mm_uwisc$marker %in% gm_uwisc$marker]
diff_probes <- mar_in_both[mm_uwisc[mar_in_both,"probe"] != gm_uwisc[mar_in_both,"probe"]]
diff_probes_pos <- cbind(mm_uwisc[diff_probes, c("chr", "pos", "strand")], gm_uwisc[diff_probes, c("chr", "pos", "strand")])

21 of these were a switch from the minus to the plus strand, or vice versa. But for two of these the GigaMUGA probe doesn’t have a perfect match with the mm10 assembly, and for five of these the MegaMUGA probe didn’t have a perfect match with the mm10 assembly, and then for 1 the MegaMUGA probe has multiple hits to the mouse genome.

So I’m currently thinking we should just go by SNP position, and create a file that indicates groups of markers across the two arrays that are assaying the same SNP.

gm_pos <- apply(gm_uwisc[gm_uwisc$unique,c("chr", "pos")], 1, function(a) gsub(" ", "", paste(a, collapse=":")))
mm_pos <- apply(mm_uwisc[mm_uwisc$unique,c("chr", "pos")], 1, function(a) gsub(" ", "", paste(a, collapse=":")))

With a focus on SNP positions, the MegaMUGA has 75,021 unique SNPs, the GigaMUGA has 134,602 unique SNPs, and the two arrays have 64,887 SNPs in common.

Marker tier information

I was going to check the tier column in the UNC annotation file, with quality tiers 1-4, but I can’t find the data from Morgan et al. (2016). It’s reported to be at http://csbio.unc.edu/MUGA, but they no longer seem to be there (just the annotation files and information on SNP intensity clusters).

New annotation files

I now want to save my findings as new annotations files for the MegaMUGA and GigaMUGA arrays. I’ll also create a metadata file (“data dictionary”) that explains the columns. Finally, I’ll create a file that indicates which markers are assaying a common SNP, between and within the two arrays.

The final annotation files and all source materials are at https://github.com/kbroman/MUGAarrays.

Version 0

We’ll call this version 0. In the MegaMUGA and GigaMUGA files, I’ll include marker, chromosome, mm10 bp position, strand, snp, unique, multi, unmapped, n_blast_hits, n_blast_chr, and the probe sequence. So just what I’ve already assembled, but with the columns reordered a bit.

# order of columns
cols <- c("marker", "chr", "pos", "strand", "snp", "unique", "multi",
          "unmapped", "n_blast_hits", "n_blast_chr", "probe")
# revised names
cols_new <- c("marker", "chr", "bp_mm10", "strand", "snp", "unique", "multi",
              "unmapped", "n_blast_hits", "n_blast_chr", "probe")

# MegaMUGA file
mm_file <- "../UWisc/mm_uwisc_v0.csv"

# reorder and rename columns
mm_uwisc <- mm_uwisc[,cols]
colnames(mm_uwisc) <- cols_new

# reorder rows
mm_uwisc <- mm_uwisc[order(factor(mm_uwisc$chr, levels=c(1:19,"X","Y","M")),
                           mm_uwisc$bp_mm10),]

# write to CSV file
write.table(mm_uwisc, mm_file, sep=",", quote=FALSE,
            row.names=FALSE, col.names=TRUE)

# GigaMUGA file
gm_file <- sub("mm", "gm", mm_file)

# reorder and rename columns
gm_uwisc <- gm_uwisc[,cols]
colnames(gm_uwisc) <- cols_new

# reorder rows
gm_uwisc <- gm_uwisc[order(factor(gm_uwisc$chr, levels=c(1:19,"X","Y","M")),
                           gm_uwisc$bp_mm10),]

# write to CSV file
write.table(gm_uwisc, gm_file, sep=",", quote=FALSE,
            row.names=FALSE, col.names=TRUE)

I’ll also create a dictionary for each file, which explains what the columns are.

descriptions <- c("Name of SNP marker",
                  "Chromosome",
                  "Physical positions in basepairs for mm10 mouse genome build",
                  "Strand (plus/minus) from which the probe sequence was taken",
                  "SNP alleles as a two-character string",
                  "TRUE indicates that the probe sequence appears exactly once in mm10 mouse genome build",
                  "TRUE indicates that the probe sequence appears >1 time in mm10 mouse genome build",
                  "TRUE indicates that the probe sequence has no perfect match in mm10 mouse genome build",
                  "Number of perfect matches in the mm10 mouse genome build",
                  "Number of chromosomes (out of 22) with at least one perfect match",
                  "Probe sequence (49 or 50 bases); the SNP occurs immediately after")

mm_dict_file <- "../UWisc/mm_uwisc_dict_v0.csv"
output <- data.frame(column=cols_new,
                     description=descriptions,
                     stringsAsFactors=FALSE)
write.table(output, mm_dict_file, sep=",", quote=FALSE,
            row.names=FALSE, col.names=TRUE)
gm_dict_file <- sub("mm", "gm", mm_dict_file)
write.table(output, gm_dict_file, sep=",", quote=FALSE,
            row.names=FALSE, col.names=TRUE)

Finally, I want to make a file that indicates the common markers, within and between the two arrays, using the SNP positions to determine which ones are identical.

# unique marker positions between the arrays
mm_pos <- setNames(paste(mm_uwisc$chr, mm_uwisc$bp_mm10, sep=":"), mm_uwisc$marker)[mm_uwisc$unique]
gm_pos <- setNames(paste(gm_uwisc$chr, gm_uwisc$bp_mm10, sep=":"), gm_uwisc$marker)[gm_uwisc$unique]
all_pos <- unique(c(mm_pos, gm_pos))

# find the positions that are in duplicate within each array
tab <- table(mm_pos)
dup_mm <- names(tab)[tab > 1]
tab <- table(gm_pos)
dup_gm <- names(tab)[tab > 1]

# find the positions that are in both arrays
in_both <- all_pos[all_pos %in% mm_pos & all_pos %in% gm_pos]

# find the markers that have pos in both arrays
all_pos_mm <- setNames(rep("", length(all_pos)), all_pos)
all_pos_mm[in_both] <- names(mm_pos)[match(in_both, mm_pos)]
all_pos_gm <- setNames(rep("", length(all_pos)), all_pos)
all_pos_gm[in_both] <- names(mm_pos)[match(in_both, mm_pos)]

# find the markers that at the duplicate positions within each array
dup_mm <- sapply(dup_mm, function(a) paste(names(mm_pos)[mm_pos == a], collapse=";"))
all_pos_mm[names(dup_mm)] <- dup_mm
dup_gm <- sapply(dup_gm, function(a) paste(names(gm_pos)[gm_pos == a], collapse=";"))
all_pos_gm[names(dup_gm)] <- dup_gm

# subset to the positions that are duplicated within or between
keep <- (all_pos_mm != "" | all_pos_gm != "")
all_pos <- all_pos[keep]
all_pos_mm <- all_pos_mm[keep]
all_pos_gm <- all_pos_gm[keep]

# split the positions back into chr and pos
all_pos_spl <- strsplit(all_pos, ":")

# create data frame with the results
common <- data.frame(chr=sapply(all_pos_spl, "[", 1),
                     bp_mm10=sapply(all_pos_spl, "[", 2),
                     megamuga=all_pos_mm,
                     gigamuga=all_pos_gm,
                     stringsAsFactors=FALSE)
# reorder by genomic position
common <- common[order(factor(common$chr, c(1:19,"X","Y","M")), common$bp_mm10),]

# write to a CSV file
write.table(common, "../UWisc/mm_gm_commonmark_uwisc_v1.csv",
            sep=",", quote=FALSE, row.names=FALSE, col.names=TRUE)

# data dictionary
common_cols <- colnames(common)
common_descriptions <- c("Chromosome ID",
                         "Physical positions in basepairs for mm10 mouse genome build",
                         "MegaMUGA markers at that position (separated by semi-colons); blank means no markers",
                         "GigaMUGA markers at that position (separated by semi-colons); blank means no markers")
common_dict <- data.frame(column=common_cols,
                          description=common_descriptions,
                          stringsAsFactors=FALSE)
# write to file
write.table(common_dict, "../UWisc/mm_gm_commonmark_uwisc_dict_v1.csv",
            sep=",", quote=FALSE, row.names=FALSE, col.names=TRUE)
# write just chr, bp to files, for use with mouse map converter
# (want to get interpolated cM positions from the Cox and G2F1 maps)
write.table(mm_uwisc[!is.na(mm_uwisc$chr) & mm_uwisc$chr %in% c(1:19,"X"), c("chr", "bp_mm10")],
            "../GenMaps/mm_bp.txt", sep=" ", quote=FALSE,
            row.names=FALSE, col.names=FALSE)

write.table(gm_uwisc[!is.na(gm_uwisc$chr) & gm_uwisc$chr %in% c(1:19,"X"), c("chr", "bp_mm10")],
            "../GenMaps/gm_bp.txt", sep=" ", quote=FALSE,
            row.names=FALSE, col.names=FALSE)

Version 1, with genetic maps

I used the mouse map converter to convert the mm10 basepair positions of the autosome and X chromosome markers to sex-averaged cM from the Cox et al. and Liu et al. (aka G2F1) genetic maps.

gm_cox <- data.table::fread("../GenMaps/gm_cox.txt", header=FALSE, data.table=FALSE)
gm_g2f1 <- data.table::fread("../GenMaps/gm_g2f1.txt", header=FALSE, data.table=FALSE)
mm_cox <- data.table::fread("../GenMaps/mm_cox.txt", header=FALSE, data.table=FALSE)
mm_g2f1 <- data.table::fread("../GenMaps/mm_g2f1.txt", header=FALSE, data.table=FALSE)

# verify stuff
stopifnot( all(gm_cox[,1] == gm_uwisc$chr[1:nrow(gm_cox)]) )
stopifnot( all(gm_cox[,2] == gm_uwisc$bp_mm10[1:nrow(gm_cox)]) )
stopifnot( all(gm_g2f1[,1] == gm_uwisc$chr[1:nrow(gm_g2f1)]) )
stopifnot( all(gm_g2f1[,2] == gm_uwisc$bp_mm10[1:nrow(gm_g2f1)]) )
stopifnot( all(mm_cox[,1] == mm_uwisc$chr[1:nrow(mm_cox)]) )
stopifnot( all(mm_cox[,2] == mm_uwisc$bp_mm10[1:nrow(mm_cox)]) )
stopifnot( all(mm_g2f1[,1] == mm_uwisc$chr[1:nrow(mm_g2f1)]) )
stopifnot( all(mm_g2f1[,2] == mm_uwisc$bp_mm10[1:nrow(mm_g2f1)]) )

The G2F1 genetic maps include positions < 0 cM. Seems perfectly okay, but I’m going to shift the maps, using the overall cM:Mbp ratio on each chromosome, to make 3 Mbp (the conventional start position for the mouse genome builds) equal to 0 cM.

shifts <- read.csv("../GenMaps/g2f1_shift.csv",
                   stringsAsFactors=FALSE)
shifts <- setNames(shifts[,2], shifts[,1])

for(chr in c(1:19,"X")) {
    gm_g2f1[gm_g2f1[,1]==chr,5] <- gm_g2f1[gm_g2f1[,1]==chr,5] + shifts[chr]
    mm_g2f1[mm_g2f1[,1]==chr,5] <- mm_g2f1[mm_g2f1[,1]==chr,5] + shifts[chr]
}
nas <- rep(NA, nrow(gm_uwisc)-nrow(gm_cox))
gm_uwisc <- cbind(gm_uwisc,
                  cM_cox=c(gm_cox[,5], nas),
                  cM_g2f1=c(gm_g2f1[,5], nas))


nas <- rep(NA, nrow(mm_uwisc)-nrow(mm_cox))
mm_uwisc <- cbind(mm_uwisc,
                  cM_cox=c(mm_cox[,5], nas),
                  cM_g2f1=c(mm_g2f1[,5], nas))

Now I can write the new annotation files.

cols_new <- c(cols_new[1:3], "cM_cox", "cM_g2f1", cols_new[-(1:3)])
mm_uwisc <- mm_uwisc[,cols_new]
gm_uwisc <- gm_uwisc[,cols_new]

# write MegaMUGA file
mm_file <- "../UWisc/mm_uwisc_v1.csv"
write.table(mm_uwisc, mm_file, sep=",", quote=FALSE,
            row.names=FALSE, col.names=TRUE)

# GigaMUGA file
gm_file <- sub("mm", "gm", mm_file)
write.table(gm_uwisc, gm_file, sep=",", quote=FALSE,
            row.names=FALSE, col.names=TRUE)

And finally, the new data dictionary files.

descriptions <- c(descriptions[1:3],
                  "Sex-averaged cM positions from Cox et al. https://doi.org/10.1534/genetics.109.105486",
                  paste("Sex-averaged cM positions from Liu et al.",
                        "https://doi.org/10.1534/genetics.114.161653",
                        "(shifted to avoid cM positions < 0)"),
                  descriptions[-(1:3)])

mm_dict_file <- "../UWisc/mm_uwisc_dict_v1.csv"
output <- data.frame(column=cols_new,
                     description=descriptions,
                     stringsAsFactors=FALSE)
write.table(output, mm_dict_file, sep=",", quote=FALSE,
            row.names=FALSE, col.names=TRUE)
gm_dict_file <- sub("mm", "gm", mm_dict_file)
write.table(output, gm_dict_file, sep=",", quote=FALSE,
            row.names=FALSE, col.names=TRUE)

Session info

Here are the versions of R and R packages that I am using.

devtools::session_info()
## ─ Session info ─────────────────────────────────────────────────────────────────────────────────────────────
##  setting  value                       
##  version  R version 3.5.1 (2018-07-02)
##  os       Pop!_OS 18.04 LTS           
##  system   x86_64, linux-gnu           
##  ui       X11                         
##  language en_US:en                    
##  collate  en_US.UTF-8                 
##  ctype    en_US.UTF-8                 
##  tz       America/Chicago             
##  date     2018-11-13                  
## 
## ─ Packages ─────────────────────────────────────────────────────────────────────────────────────────────────
##  package     * version date       lib source        
##  assertthat    0.2.0   2017-04-11 [1] CRAN (R 3.5.0)
##  backports     1.1.2   2017-12-13 [1] CRAN (R 3.5.0)
##  base64enc     0.1-3   2015-07-28 [1] CRAN (R 3.5.0)
##  broman      * 0.68-4  2018-08-17 [1] local         
##  callr         3.0.0   2018-08-24 [1] CRAN (R 3.5.1)
##  cli           1.0.1   2018-09-25 [1] CRAN (R 3.5.1)
##  crayon        1.3.4   2017-09-16 [1] CRAN (R 3.5.0)
##  data.table  * 1.11.8  2018-09-30 [1] CRAN (R 3.5.1)
##  debugme       1.1.0   2017-10-22 [1] CRAN (R 3.5.0)
##  desc          1.2.0   2018-05-01 [1] CRAN (R 3.5.0)
##  devtools    * 2.0.1   2018-10-26 [1] CRAN (R 3.5.1)
##  digest        0.6.18  2018-10-10 [1] CRAN (R 3.5.1)
##  evaluate      0.12    2018-10-09 [1] CRAN (R 3.5.1)
##  fs            1.2.6   2018-08-23 [1] CRAN (R 3.5.1)
##  glue          1.3.0   2018-07-17 [1] CRAN (R 3.5.1)
##  htmltools     0.3.6   2017-04-28 [1] CRAN (R 3.5.0)
##  knitr         1.20    2018-02-20 [1] CRAN (R 3.5.0)
##  magrittr      1.5     2014-11-22 [1] CRAN (R 3.5.0)
##  memoise       1.1.0   2017-04-21 [1] CRAN (R 3.5.0)
##  pkgbuild      1.0.2   2018-10-16 [1] CRAN (R 3.5.1)
##  pkgload       1.0.2   2018-10-29 [1] CRAN (R 3.5.1)
##  prettyunits   1.0.2   2015-07-13 [1] CRAN (R 3.5.0)
##  processx      3.2.0   2018-08-16 [1] CRAN (R 3.5.1)
##  ps            1.2.1   2018-11-06 [1] CRAN (R 3.5.1)
##  R6            2.3.0   2018-10-04 [1] CRAN (R 3.5.1)
##  Rcpp          1.0.0   2018-11-07 [1] CRAN (R 3.5.1)
##  remotes       2.0.2   2018-10-30 [1] CRAN (R 3.5.1)
##  rlang         0.3.0.1 2018-10-25 [1] CRAN (R 3.5.1)
##  rmarkdown     1.10    2018-06-11 [1] CRAN (R 3.5.0)
##  rprojroot     1.3-2   2018-01-03 [1] CRAN (R 3.5.0)
##  sessioninfo   1.1.1   2018-11-05 [1] CRAN (R 3.5.1)
##  stringi       1.2.4   2018-07-20 [1] CRAN (R 3.5.1)
##  stringr       1.3.1   2018-05-10 [1] CRAN (R 3.5.0)
##  testthat      2.0.1   2018-10-13 [1] CRAN (R 3.5.1)
##  usethis     * 1.4.0   2018-08-14 [1] CRAN (R 3.5.1)
##  withr         2.1.2   2018-03-15 [1] CRAN (R 3.5.0)
##  yaml          2.2.0   2018-07-25 [1] CRAN (R 3.5.1)
## 
## [1] /home/kbroman/Rlibs
## [2] /usr/local/lib/R/site-library
## [3] /usr/lib/R/site-library
## [4] /usr/lib/R/library