Revisiting the MiniMUGA annotations

Introduction

The annotation file I created for the miniMUGA array (2018-11-08) is at https://github.com/kbroman/MUGAarrays/blob/main/UWisc/mini_uwisc_v1.csv. We’ll call this the UWisc annotation file.

The miniMUGA paper has now been published. Initially published at bioRxiv on 2020-03-14, it provides official annotations with the Supplemental material, as Table S2. We will call this the UNC annotation file.

My goal here is to compare the two sets of annotations. The UNC annotation file includes many additional markers which have been added to the miniMUGA array since my initial analysis. And so I’ll first do a basic comparison of the two files, and then I’ll go about building a revised version of UWisc annotations that incorporates all of the markers.

Key findings

file <- "mini_revisited_summary.rds"
if(file.exists(file)) {
    exec_summ <- readRDS(file)
} else {
    message <- "**RE RUN FILE**"
    exec_summ <- list(new_markers=message,
                      n_new_markers=message,
                      no_hit=message,
                      n_no_hit=message,
                      n_transgene=message,
                      diff_chr=message,
                      n_diff_chr=message,
                      n_matching_chr=message,
                      pos_mismatch=message,
                      n_pos_mismatch=message,
                      pos_mismatch_old=message,
                      n_pos_mismatch_old=message,
                      pos_mismatch_new=message,
                      n_pos_mismatch_new=message)
}

The new UNC annotation file for the miniMUGA array has an additional 954 markers that weren’t present in the original annotation file I’d studied.
There are 97 markers with no perfect blast hit in the mouse genome (assembly mm10, GRCm38). This is on top of the 306 transgene markers.
There are 74 markers where the UNC annotation file has them on a different chromosome than is seen in my blast results. All of these were among the 954 markers added to the array.
The remaining 10,645 markers have matching chromosome IDs between the UNC annotation file and the blast results, but 18 of these have discrepancies in their position. Of these, 7 were on the previous version of the array and are on the mitochondria, while the other 11 are new.

Preliminaries

Let’s load the two annotation files, plus a few packages.

library(data.table)
library(broman)
library(devtools)
library(parallel)

uw <- data.table::fread("../UWisc/mini_uwisc_v1.csv", data.table=FALSE)

unc_url <- "https://gsajournals.figshare.com/ndownloader/files/25117973"
unc_file <- "../UNC/miniMUGA_tableS2.csv"
if(!file.exists(unc_file)) download.file(unc_url, unc_file)
unc <- data.table::fread(unc_file, data.table=FALSE)

The UWisc annotations had 10,171 markers, while the UNC annotation file has 11,125 markers.

All the markers in the UWisc annotations are in the UNC annotation file, but then there are 954 additional markers in the UNC file, and these are distributed across most chromosomes.

Here’s the chromosome distribution of markers in the UWisc annotations. I’ll replace the missing chromosomes with 0 and replace M with MT for the mitochondria, as in the new UNC annotations.

uw$chr[is.na(uw$chr)] <- 0
uw$chr[uw$chr=="M"] <- "MT"
uw$chr_unc[uw$chr_unc=="M"] <- "MT"
table(factor(uw$chr, levels=c(0:19,"X","Y","PAR", "MT")))

##
##   0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19   X   Y PAR  MT
## 351 665 591 574 485 517 456 695 463 394 582 447 443 423 397 391 365 379 389 361 638  74   3  88

And here’s the chromosome distribution of markers that are only in the new annotations.

table(factor(unc$chr[!(unc$"Marker name" %in% uw$marker)],
             levels=c(0:19,"X","Y","PAR","MT")))

##
##   0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19   X   Y PAR  MT
##  49  39  41  45  41  48  47  43  44  45  48  43  37  50  49  48  45  43  50  46  53   0   0   0

The additional markers are sprinkled throughout the autosomes and X chromosome, plus a number on “chr 0”, but no new markers on Y, PAR, or mitochondria.

Comparison of common markers

Let’s focus on the markers in the UWisc annotations, and check that the genomic positions are the same. I’ll first grab the subset of markers in common and get them in the same order

rownames(unc) <- unc$"Marker name"
uncsub <- unc[uw$marker,]

There are 94 markers where the chromosomes are not the same between the two annotation files. These are exactly the ones where the UNC file says the marker maps uniquely (the column bowtie_unique is 1) but the UWisc found as not unique (the column unique is FALSE).

stopifnot(
    all( (uncsub$bowtie_unique == 1 & uw$unique==FALSE) ==
         (uncsub$chr != uw$chr) )
         )

Let’s reduce to the 9,820 markers that the UWisc annotations says map uniquely in the mouse genome.

uncsub2 <- uncsub[uw$unique,]
uwsub <- uw[uw$unique,]

For these markers, the chromosome IDs are the same between the two files, but the basepair positions differ in 7 cases.

cbind(uncsub2[,c("chromosome", "position", "strand")],
      uwsub[c("marker", "chr", "bp_mm10", "strand")])[uncsub2$position != uwsub$bp_mm10,]

##            chromosome position strand     marker chr bp_mm10 strand
## mMit001            MT       55      -    mMit001  MT      54  minus
## UNC2202344         MT     2344      - UNC2202344  MT    2394   plus
## UNC2203635         MT     3635      - UNC2203635  MT    3634  minus
## UNC2212395         MT    12395      - UNC2212395  MT   12394  minus
## UNC2212695         MT    12695      - UNC2212695  MT   12694  minus
## UNC2212752         MT    12752      - UNC2212752  MT   12751  minus
## UNC2213612         MT    13612      - UNC2213612  MT   13611  minus

They are all mitochondrial markers, and they differ by one base except for one case that is off by 50 and says the opposite strand (marker UNC2202344). This is the only marker where the strands are different.

Probe sequences

For the markers in common, are the probe sequences the same? For 287 markers, the probe sequence contained the SNP and I had trimmed that off.

For the 9,884 markers where there’s no seqB and the probe sequence doesn’t contain the SNP, the probes are identical.

stopifnot( all(uncsub$seqA[uncsub$seqB==""] == uw$probe[uncsub$seqB==""]) )

For the other 287 markers, the probe sequence in the UWisc file deletes the last base, and the first 49 bases are the same in all cases.

seqA <- uncsub$seqA[uncsub$seqB != ""]
probe <- uw$probe[uncsub$seqB != ""]
stopifnot( all(substr(seqA, 1, 49) == probe) )

Alleles

Finally, let’s verify that the info about SNP alleles are the same. (Indeed, they are.)

uncsub$snp <- paste0(uncsub$"reference allele", uncsub$"alternate allele")
stopifnot( all(uncsub$snp ==  uw$snp) )

Revised UWisc annotations

I’m going to go back and re-load the UNC annotations and do all of the various things to create revised UWisc annotations. I’ll follow my previous report closely.

# UNC file
mini_unc <- unc # from above
rownames(mini_unc) <- mini_unc$"Marker name"

# change a few things:
# Marker name -> marker
# chromosome -> chr
# position -> pos
# MT -> M
colnames(mini_unc)[1:3] <- c("marker", "chr", "pos")
mini_unc$chr[mini_unc$chr=="MT"] <- "M"

# Blast results; keep just the perfect matches(?)
mini_blast <- readRDS("../Blast/results_mini2/mini2_blastn_results.rds")
mini_blast <- mini_blast[mini_blast$tot_mismatch==0,]

Summarize blast hits

I’ll start with some code to summarize the blast hits. For each SNP, I want to count the number of perfect blast hits. As it turns out, the probes have either 1 or 0 perfect hits, and so we don’t have to do what we did with the MegaMUGA and GigaMUGA arrays, regarding studying the number of distinct chromosomes hit.

For markers with a unique hit, I’ll record the chromosome, position, and strand.

# no. blast hits
mini_tab <- table(mini_blast$query)
mini_nchr <- mini_nhits <- setNames(rep(0, nrow(mini_unc)), rownames(mini_unc))
mini_nhits[names(mini_tab)] <- mini_tab

# chr,pos,strand for the unique ones
mini_blast_uniq <- mini_blast[mini_blast$query %in% names(mini_nhits)[mini_nhits==1],]
mini_blast_chr <- mini_blast_pos <- mini_blast_strand <- setNames(rep(NA, nrow(mini_unc)), rownames(mini_unc))
mini_blast_chr[mini_blast_uniq$query] <- mini_blast_uniq$chr
mini_blast_pos[mini_blast_uniq$query] <- mini_blast_uniq$snp_pos
mini_blast_strand[mini_blast_uniq$query] <- mini_blast_uniq$strand

# probe sequences
mini_blast_probe <- setNames(mini_unc$seqA, mini_unc$Marker)
mini_trim <- (mini_unc$seqB != "")
mini_blast_probe[mini_trim] <- substr(mini_blast_probe[mini_trim], 1, nchar(mini_blast_probe[mini_trim])-1)

# SNP alleles
mini_blast_snp <- paste0(mini_unc$"reference allele", mini_unc$"alternate allele")

# put all of this stuff into a data frame
mini_uwisc <- data.frame(marker=names(mini_nhits),
                         n_blast_hits=mini_nhits,
                         unique=(mini_nhits==1),
                         unmapped=(mini_nhits==0),
                         chr=mini_blast_chr,
                         pos=mini_blast_pos,
                         strand=mini_blast_strand,
                         snp=mini_blast_snp,
                         probe=mini_blast_probe,
                         stringsAsFactors=FALSE)
rownames(mini_uwisc) <- mini_uwisc$marker

Basic blast results

The miniMUGA array contains 11,125 markers, of which (according to the UNC annotations), 9,958 are on autosomes, 695 are on the X chromosome, 75 are on the Y chromosome, 88 are on the mitochondrial genome, 3 are in the pseudoautosomal region, and 306 have chromosome 0 which are transgene-related markers.

In the blast results, I find that 10,722 markers have a single, unique hit in the mouse genome (build mm10) and 403 have no perfect hit. There are no markers with multiple hits.

The markers with no perfect hit include all of the transgene-related markers, plus another 97 markers that are spread across all autosomes and the X and Y chromosomes.

Chromosome assignments

If we look at the inferred chromosome assignments of the markers with a unique hit to the genome, there are 77 differences.

This is many more discrepancies than we saw in the first version of the UNC annotation file. Previously, we just saw the three markers annotated to be in the pseudoautosomal region (PAR); the blast results match these to the X chromosome. We still see these, and we’ll force them back to PAR.

mini_uwisc$chr[mini_unc$chr=="PAR"] <- "PAR"

But in addition there are another 74 markers that map to different chromosomes than seen in the UNC annotation file.

diff_chr <- mini_uwisc$marker[!is.na(mini_uwisc$chr) & mini_unc$chr != mini_uwisc$chr]

In the UNC annotations, these span most autosomes (all except chr 2) plus the X chromosome, while in the blast results, they are on chr 1, 2, 3, 4, 5, 7, 14, 15, 17, and 19. And none of these markers were in the original version of the miniMUGA array. They are all among the 954 new markers.

For example, consider marker DX1033615090, with probe sequence TGACCTAAGCTGGAGAAGAACATCCCTGGGAGAGCTGTAAGCATATAAGC. In the UNC annotations, it is on chr 11 at 99,960,673 bp, but in the blast results, it shows up on chr 3 at 90,377,274 bp.

As another example, consider marker DX1150407521, with probe sequence AGTTTTGGAAGTTGGAATAGCATAACTTCCAAATGGCCAGGCTGTCTTAA. In the UNC annotations, it is on chr 18 at 65,182,949 bp, but in the blast results, it shows up on chr 15 at 10,188,038 bp.

If we paste these probe sequences into the NCBI blast website, using the GRCm38 assembly, it confirms my blast results. Using the GRCm39 assembly, we get the same chromosome assignments but slightly different positions.

Positions

For the remaining 10,645 markers, UNC chromosome assignment and the blast hit do match, but 18 of them have some discrepancy in their position.

chr_match <- mini_unc$marker[!is.na(mini_uwisc$chr) & mini_uwisc$chr != "PAR" &
                             mini_unc$chr == mini_uwisc$chr]
stopifnot( all( mini_unc[chr_match, "chr"] == mini_uwisc[chr_match, "chr"] ))
pd <- mini_unc[chr_match, "pos"] - mini_uwisc[chr_match, "pos"]
diff_pos <- chr_match[pd != 0]
result <- cbind(mini_unc[chr_match, c("chr", "pos")], mini_uwisc[chr_match, c("chr", "pos")], pd)[(pd != 0),]
colnames(result) <- c("unc_chr", "unc_pos", "blast_chr", "blast_pos", "pos_diff")
result

##               unc_chr   unc_pos blast_chr blast_pos  pos_diff
## DT2015871561        1 172074890         1 146789049  25285841
## DX1023823158        2  18624897         2  95578972 -76954075
## DX1024130647        2 109813167         2 103266184   6546983
## B6JB6CCPLHD01       2 157065076         2 157065111       -35
## BCC046203725        4 155093131         4 155093173       -42
## BCC072620402        7  65510056         7  65510091       -35
## B6JB6CCPLHD06      11  50673556        11  50673577       -21
## B6JB6CCPLHD07      13  39551176        13  39551219       -43
## DX1143424828       14  82805125        14  85620712  -2815587
## BCC180983620       18  24590521        18  24590555       -34
## mMit001             M        55         M        54         1
## UNC2202344          M      2344         M      2394       -50
## UNC2203635          M      3635         M      3634         1
## UNC2212395          M     12395         M     12394         1
## UNC2212695          M     12695         M     12694         1
## UNC2212752          M     12752         M     12751         1
## UNC2213612          M     13612         M     13611         1
## BCC202519908        X  62997706         X  62997715        -9

The 7 mitochondrial markers were all on the original version of the array and are off by mostly 1 bp (one marker is off by 50). The other 11 markers are new, and are off by quite varying amounts.

The strand assignments are the same for all 10,645 of the markers with matching chromosome assignments, except for the one mitochondrial marker that is off by 50 bp, UNC2202344.

strand_unc <- setNames(mini_unc[chr_match,"strand"], chr_match)
strand_uwisc <- setNames(mini_uwisc[chr_match,"strand"], chr_match)
strand_uwisc[strand_uwisc=="plus"] <- "+"
strand_uwisc[strand_uwisc=="minus"] <- "-"
stopifnot( sum(strand_unc != strand_uwisc) == 1)
mt_marker <- rownames(result)[result$blast_chr=="M" & result$pos_diff== -50]
stopifnot( strand_unc[mt_marker] != strand_uwisc[mt_marker] )

Markers with the same probes

Not all of the probes are distinct.

tab_probe <- table(mini_uwisc$probe)
pairs <- names(tab_probe)[tab_probe=="2"]
trios <- names(tab_probe)[tab_probe=="3"]

# names of the markers in the dup pairs
pair_names <- t(sapply(pairs, function(pr) rownames(mini_uwisc)[mini_uwisc$probe == pr]))
rownames(pair_names) <- 1:nrow(pair_names)
pair_chr <- cbind(mini_unc[pair_names[,1],"chr"], mini_unc[pair_names[,2],"chr"])

# pairs from these that are not on chr 0
pairs_not0 <- pair_names[rowSums(pair_chr=="0")<2,]

# verify name mutations
stopifnot( all(pairs_not0[,2] == paste0(pairs_not0[,1], "b") |
    pairs_not0[,2] == paste0("g", pairs_not0[,1]) |
    pairs_not0[,2] == paste0("m", pairs_not0[,1]) |
    pairs_not0[,1] == paste0(pairs_not0[,2], "b") |
    pairs_not0[,1] == paste0("g", pairs_not0[,2]) |
    pairs_not0[,1] == paste0("m", pairs_not0[,2]) |
    pairs_not0[,1] == sub("^D", "S", pairs_not0[,2]) ))

There are 115 pairs and 4 trios of markers with identical probe sequences.

The trios of markers with identical probe sequences are all on chr 0 in the UNC annotation file, for transgene-related markers.

Of the 115 pairs, 37 are both on chromosome 0.

The other 156 pairs of markers with identical probe sequences appear to be intended duplicates: they all are such the one name is the same as the other but with g or m added to the beginning, or with b added to the end, or in one case ST is replaced by DT but the rest of the marker name is the same (markers ST2023174930 and DT2023174930).

That last example, though, is the one pair where the UNC annotation file has the markers placed on different chromosomes.

pairs_not0 <- cbind(pairs_not0, chr1=mini_unc[pairs_not0[,1], "chr"],
                    chr2=mini_unc[pairs_not0[,2], "chr"],
                    pos1=mini_unc[pairs_not0[,1], "pos"],
                    pos2=mini_unc[pairs_not0[,2], "pos"])
stopifnot( sum(pairs_not0[,"chr1"] != pairs_not0[,"chr2"] | pairs_not0[,"pos1"] != pairs_not0[,"pos2"]) == 1)

wh <- pairs_not0[pairs_not0[,"chr1"] != pairs_not0[,"chr2"] | pairs_not0[,"pos1"] != pairs_not0[,"pos2"],1:2]
mini_unc[wh, c("chr", "pos", "strand", "seqA")]

##              chr      pos strand                                               seqA
## ST2023174930   2 79373270      + AGCTCCCCTCCTGGGAGGCAAACACTGAGAAGGCTCTGGGAATCCTTCCA
## DT2023174930   5 45958101      + AGCTCCCCTCCTGGGAGGCAAACACTGAGAAGGCTCTGGGAATCCTTCCA

The position for ST2023174930 (chr 2 at 79,373,270 bp) looks to be the correct position.

Comparison to GigaMUGA

For the 97 markers that do not map to the mm10 mouse genome assembly, and for the 92 markers that map to a different chromosome or position than is provided in the UNC annotation file, how many are on the GigaMUGA array, and how do the positions compare?

gm <- read.csv("../UWisc/gm_uwisc_v1.csv")
on_gm <- no_hit %win% gm$marker
no_hit_probes <- setNames(mini_uwisc[mini_uwisc$marker %in% no_hit, "probe"], no_hit)
probe_on_gm <- names(no_hit_probes)[no_hit_probes %in% gm$probe]

For the 97 markers with no hit to the mm10 assembly, 5 appear in the GigaMUGA array by name, and 42 of the probes appear on the GigaMUGA array (38 with different names).

In contrast, for the 92 markers that map uniquely to the mm10 assembly but to a different chromosome or position than is in the UNC annotation file, 0 are on the GigaMUGA array by name, and 0 of the probes show up on the GigaMUGA array.

Let’s look more closely at the unmapped markers that are found on the GigaMUGA array.

We’ve got 38 probes that are on the GigaMUGA array but with a different marker name. They are all also not mapped to the mm10 genome assembly in my GigaMUGA annotations, and the marker names on the miniMUGA are the same as those on the GigaMUGA, but with a prefix g, m, or gb.

diff_name <- probe_on_gm %wnin% on_gm
probe <- no_hit_probes[diff_name]
result <- cbind(mini_unc[diff_name, c("marker", "chr", "pos")],
                gm[match(probe, gm$probe), c("marker", "chr", "bp_mm10")])
colnames(result) <- c("mini_marker", "unc_chr", "unc_pos",
                      "giga_marker", "giga_chr", "giga_pos")
stopifnot(all(is.na(result$giga_chr) & is.na(result$giga_pos)))
stopifnot(all(paste0("g", result$giga_marker) == result$mini_marker |
              paste0("m", result$giga_marker) == result$mini_marker |
              paste0("gb", result$giga_marker) == result$mini_marker))

There were 4 markers on the GM array with the same probe and same marker name. These are all unmapped in my GigaMUGA array annotations.

on_gm_same_probe <- on_gm[no_hit_probes[on_gm] == gm[match(on_gm, gm$marker), "probe"]]
result <- cbind(mini_unc[on_gm_same_probe, c("marker", "chr", "pos")],
                gm[match(on_gm_same_probe, gm$marker), c("marker", "chr", "bp_mm10")])
colnames(result) <- c("mini_marker", "unc_chr", "unc_pos",
                      "giga_marker", "giga_chr", "giga_pos")
stopifnot(all( is.na(result$giga_chr) & is.na(result$giga_pos) ))

Finally, there is one marker that is on the GigaMUGA array but with a different probe.

on_gm_diff_probe <- on_gm[no_hit_probes[on_gm] != gm[match(on_gm, gm$marker), "probe"]]
result <- cbind(mini_unc[on_gm_diff_probe, c("marker", "chr", "pos", "seqA")],
                gm[match(on_gm_diff_probe, gm$marker), c("marker", "chr", "bp_mm10", "probe")])
colnames(result) <- c("mini_marker", "unc_chr", "unc_pos", "mini_probe",
                      "giga_marker", "giga_chr", "giga_pos", "giga_probe")
stopifnot(result$unc_chr == result$giga_chr,
          result$unc_pos == result$giga_pos,
          result$mini_marker == result$giga_marker)

This is marker UNC21896811. On both arrays it is on chr 12 at 111,651,522 bp. But on the miniMUGA array it has probe GTTCTGTTTGCTTGGTTGGGTTTTGTGGGCGGTTGGGGTTTTGTTTTTGT while on the GigaMUGA array it has probe GAGCAGCTAGGCTATACAGAGAAACCCTGTCTTGAAAACAAAAACAAACA.

Let’s go ahead and look at all of the miniMUGA markers.

name_on_both <- mini_unc$marker %win% gm$marker
name_on_both_same_probe <- name_on_both[mini_unc[name_on_both, "seqA"]==
                                        gm[match(name_on_both, gm$marker), "probe"]]
name_on_both_diff_probe <- name_on_both %wnin% name_on_both_same_probe
cf_diff_probe <- cbind(mini_uwisc[match(name_on_both_diff_probe, mini_uwisc$marker), c("chr", "pos")],
                       gm[match(name_on_both_diff_probe,
gm$marker),c("chr", "bp_mm10","marker")])
colnames(cf_diff_probe) <- c("mini_chr", "mini_pos", "gm_chr", "gm_pos", "marker")
diff_probe_unmapped_gm <- cf_diff_probe$marker[!is.na(cf_diff_probe$mini_chr) &
                                               is.na(cf_diff_probe$gm_chr)]
diff_probe_unmapped_mini <- cf_diff_probe$marker[is.na(cf_diff_probe$mini_chr) &
                                               !is.na(cf_diff_probe$gm_chr)]
diff_probe_same_pos <- cf_diff_probe$marker[!is.na(cf_diff_probe$mini_chr) &
                                            !is.na(cf_diff_probe$gm_chr) &
                                            cf_diff_probe$mini_chr == cf_diff_probe$gm_chr &
                                            cf_diff_probe$mini_pos == cf_diff_probe$gm_pos]
unmapped_mini_uw <- mini_uwisc[match(diff_probe_unmapped_gm, mini_uwisc$marker),]
unmapped_mini_unc <- mini_unc[match(diff_probe_unmapped_gm, mini_unc$marker),]
stopifnot(all(unmapped_mini_uw$chr == unmapped_mini_unc$chr &
              unmapped_mini_uw$pos == unmapped_mini_unc$pos))

# same name and same probe
z1 <- mini_uwisc[match(name_on_both_same_probe, mini_uwisc$marker),]
z2 <- gm[match(name_on_both_same_probe, gm$marker),]
# mapped/unmapped the same
stopifnot( all(is.na(z1$chr) == is.na(z2$chr)) )
# chr and pos the same
stopifnot( all(is.na(z1$chr) | (z1$chr==z2$chr & z1$pos==z2$pos)) )

# same probe but different names
probes_on_both_mini <- setNames(mini_uwisc$probe, mini_uwisc$marker)[mini_uwisc$probe %in% gm$probe]
probe_match <- mclapply(seq_along(probes_on_both_mini), function(i) gm$marker[gm$probe==probes_on_both_mini[i]], mc.cores=8)
names(probe_match) <- names(probes_on_both_mini)
cf_names <- t(sapply(seq_along(probe_match), function(i) {
          n <- names(probe_match)[i]
          p <- probe_match[[i]]
          c(plain=any(p==n),
            g_b=any(paste0("g", p, "b")==n),
            m=any(paste0("m", p)==n),
            g=any(paste0("g", p)==n),
            mit=any(sub("^Mit", "gMit00", p)==n),
            mit0=any(sub("^Mit", "gMit000", p)==n),
            gb=any(paste0("gb", p)==n)) }))

There are 188 miniMUGA markers that are on GigaMUGA array with the same name but with a different probe. One of these (UNC21896811) we saw above, where the new probe doesn’t have a perfect match in the mm10 assembly. Another 179 look like redesigns, switching strands but targetting the same position. The other 8 markers have probes on the miniMUGA array that map uniquely, but the probe on the GigaMUGA array does not; these markers all have the same position in the UNC annotation as found in the blast search.

There are 559 markers that are on the GigaMUGA array with the same name and the same probe sequence four of these do not map uniquely to the mm10 assembly; the other 555 are placed at the same genomic position.

Of the 11,125 markers on the miniMUGA, 3,881 have a probe that is also on the GigaMUGA array. These will be placed in the same positions, since I used the same process to identify unique mappings and the corresponding positions, but some of the names may be different. Most of these having matching names (559 match exactly, and another 3281 have names that were changed slightly, say by prepending a g or m). Another 41 miniMUGA markers have a probe that matches a GigaMUGA marker, but with a totally different name.

New annotation file

I now want to save my findings as a new annotation file for the miniMUGA array. I’ll also create a metadata file (“data dictionary”) that explains the columns.

The final annotation files and all source materials are at https://github.com/kbroman/MUGAarrays.

Version pre_2

We’ll call this version “pre 2”. It’ll be like the MegaMUGA and GigaMUGA annotation files I made, but without multi, n_blast_hits, or n_blast_chr. And I’ll include columns chr_unc and bp_unc, the chromosome and position in the UNC annotation file. (This is basically a preliminary file before I go and get the genetic map positions from the mouse map converter site.)

# order of columns
cols <- c("marker", "chr", "pos", "strand", "snp", "unique",
          "unmapped", "probe", "chr_unc", "pos_unc")
# revised names
cols_new <- c("marker", "chr", "bp_mm10", "strand", "snp", "unique",
              "unmapped", "probe", "chr_unc", "bp_unc")

# MegaMUGA file
mini_file <- "../UWisc/mini_uwisc_pre_v2.csv"

# reorder and rename columns
mini_uwisc <- cbind(mini_uwisc, chr_unc=mini_unc$chr, pos_unc=mini_unc$pos)
mini_uwisc <- mini_uwisc[,cols]
colnames(mini_uwisc) <- cols_new

# reorder rows
mini_uwisc <- mini_uwisc[order(factor(mini_uwisc$chr, levels=c(1:19,"X","Y","PAR","M")),
                               mini_uwisc$bp_mm10,
                               factor(mini_uwisc$chr_unc, levels=c(0:19, "X", "Y", "PAR","M"))),]

# write to CSV file
write.table(mini_uwisc, mini_file, sep=",", quote=FALSE,
            row.names=FALSE, col.names=TRUE)

I’ll also create a dictionary for the file, which explains what the columns are.

descriptions <- c("Name of SNP marker",
                  "Chromosome",
                  "Physical position in basepairs for mm10 mouse genome build",
                  "Strand (plus/minus) from which the probe sequence was taken",
                  "SNP alleles as a two-character string",
                  "TRUE indicates that the probe sequence appears exactly once in mm10 mouse genome build",
                  "TRUE indicates that the probe sequence has no perfect match in mm10 mouse genome build",
                  "Probe sequence (49 or 50 bases); the SNP occurs immediately after",
                  "Chromosome in UNC annotation file",
                  "Physical positions in basepairs in the UNC annotation file")

mini_dict_file <- "../UWisc/mini_uwisc_dict_pre_v2.csv"
output <- data.frame(column=cols_new,
                     description=descriptions,
                     stringsAsFactors=FALSE)
write.table(output, mini_dict_file, sep=",", quote=FALSE,
            row.names=FALSE, col.names=TRUE)

Finally, I want to make a file that indicates the common markers, using the SNP positions to determine which ones are identical. There are a bunch of them, and they all have related names, like the pairs with identical probe sequences.

# unique marker positions between the arrays
mini_pos <- setNames(paste(mini_uwisc$chr, mini_uwisc$bp_mm10, sep=":"), mini_uwisc$marker)[mini_uwisc$unique]

# find the positions that are in duplicate within the array
tab <- table(mini_pos)
dup <- names(tab)[tab > 1]

# find the corresponding markers
dup_names <- t(sapply(dup, function(d) names(mini_pos)[mini_pos==d]))

pos_spl <- strsplit(rownames(dup_names), ":")

# create data frame with the results
common <- data.frame(chr=sapply(pos_spl, "[", 1),
                     bp_mm10=sapply(pos_spl, "[", 2),
                     marker1=dup_names[,1],
                     marker2=dup_names[,2],
                     stringsAsFactors=FALSE)
# reorder by genomic position
common <- common[order(factor(common$chr, c(1:19,"X","Y","M")), common$bp_mm10),]

# write to a CSV file
write.table(common, "../UWisc/mini_commonmark_uwisc_v2.csv",
            sep=",", quote=FALSE, row.names=FALSE, col.names=TRUE)

# data dictionary
common_cols <- colnames(common)
common_descriptions <- c("Chromosome ID",
                         "Physical position in basepairs for mm10 mouse genome build",
                         "A miniMUGA marker at that position",
                         "A second miniMUGA marker at that position")
common_dict <- data.frame(column=common_cols,
                          description=common_descriptions,
                          stringsAsFactors=FALSE)
# write to file
write.table(common_dict, "../UWisc/mini_commonmark_uwisc_dict_v2.csv",
            sep=",", quote=FALSE, row.names=FALSE, col.names=TRUE)

# write just chr, bp to files, for use with mouse map converter
# (want to get interpolated cM positions from the Cox and G2F1 maps)
write.table(mini_uwisc[!is.na(mini_uwisc$chr) & mini_uwisc$chr %in% c(1:19,"X"), c("chr", "bp_mm10")],
            "../GenMaps/mini2_bp.txt", sep=" ", quote=FALSE,
            row.names=FALSE, col.names=FALSE)

Version 2, with genetic maps

I used the mouse map converter to convert the mm10 basepair positions of the autosome and X chromosome markers to sex-averaged cM from the Cox et al. and Liu et al. (aka G2F1) genetic maps.

mini_cox <- data.table::fread("../GenMaps/mini2_cox.txt", header=FALSE, data.table=FALSE)
mini_g2f1 <- data.table::fread("../GenMaps/mini2_g2f1.txt", header=FALSE, data.table=FALSE)

# verify stuff
stopifnot( all(mini_cox[,1] == mini_uwisc$chr[1:nrow(mini_cox)]) )
stopifnot( all(mini_cox[,2] == mini_uwisc$bp_mm10[1:nrow(mini_cox)]) )
stopifnot( all(mini_g2f1[,1] == mini_uwisc$chr[1:nrow(mini_g2f1)]) )
stopifnot( all(mini_g2f1[,2] == mini_uwisc$bp_mm10[1:nrow(mini_g2f1)]) )

The G2F1 genetic maps include positions < 0 cM. Seems perfectly okay, but I’m going to shift the maps, using the overall cM:Mbp ratio on each chromosome, to make 3 Mbp (the conventional start position for the mouse genome builds) equal to 0 cM.

shifts <- read.csv("../GenMaps/g2f1_shift.csv",
                   stringsAsFactors=FALSE)
shifts <- setNames(shifts[,2], shifts[,1])

for(chr in c(1:19,"X")) {
    mini_g2f1[mini_g2f1[,1]==chr,5] <- mini_g2f1[mini_g2f1[,1]==chr,5] + shifts[chr]
}

nas <- rep(NA, nrow(mini_uwisc)-nrow(mini_cox))
mini_uwisc <- cbind(mini_uwisc,
                  cM_cox=c(mini_cox[,5], nas),
                  cM_g2f1=c(mini_g2f1[,5], nas))

Now I can write the new annotation files.

cols_new <- c(cols_new[1:3], "cM_cox", "cM_g2f1", cols_new[-(1:3)])
mini_uwisc <- mini_uwisc[,cols_new]

# write miniMUGA file
mini_file <- "../UWisc/mini_uwisc_v2.csv"
write.table(mini_uwisc, mini_file, sep=",", quote=FALSE,
            row.names=FALSE, col.names=TRUE)

And finally, the new data dictionary files.

descriptions <- c(descriptions[1:3],
                  "Sex-averaged cM positions from Cox et al. https://doi.org/10.1534/genetics.109.105486",
                  paste("Sex-averaged cM positions from Liu et al.",
                        "https://doi.org/10.1534/genetics.114.161653",
                        "(shifted to avoid cM positions < 0)"),
                  descriptions[-(1:3)])

mini_dict_file <- "../UWisc/mini_uwisc_dict_v2.csv"
output <- data.frame(column=cols_new,
                     description=descriptions,
                     stringsAsFactors=FALSE)
write.table(output, mini_dict_file, sep=",", quote=FALSE,
            row.names=FALSE, col.names=TRUE)

Session info

file <- "mini_revisited_summary.rds"
exec_summ <- list(new_markers=rownames(mini_unc) %wnin% uw$marker,
                  no_hit=mini_uwisc$marker[!mini_uwisc$unique & mini_uwisc$chr_unc != 0],
                  transgene=mini_uwisc$marker[mini_unc$chr==0],
                  diff_chr=diff_chr,
                  matching_chr=mini_uwisc$marker[!is.na(mini_uwisc$chr) &
                                                 mini_uwisc$chr != "PAR" &
                                                 mini_uwisc$chr == mini_uwisc$chr_unc],
                  pos_mismatch=diff_pos,
                  pos_mismatch_old=diff_pos %win% uw$marker,
                  pos_mismatch_new=diff_pos %wnin% uw$marker)
exec_summ$n_new_markers <- length(exec_summ$new_markers)
exec_summ$n_no_hit <- length(exec_summ$no_hit)
exec_summ$n_transgene <- length(exec_summ$transgene)
exec_summ$n_diff_chr <- length(exec_summ$diff_chr)
exec_summ$n_matching_chr <- length(exec_summ$matching_chr)
exec_summ$n_pos_mismatch <- length(exec_summ$pos_mismatch)
exec_summ$n_pos_mismatch_old <- length(exec_summ$pos_mismatch_old)
exec_summ$n_pos_mismatch_new <- length(exec_summ$pos_mismatch_new)
saveRDS(exec_summ, file)

Here are the versions of R and R packages that I am using.

devtools::session_info()

## ─ Session info ─────────────────────────────────────────────────────────────────────────────────────────────
##  setting  value
##  version  R version 4.0.3 (2020-10-10)
##  os       Pop!_OS 20.10
##  system   x86_64, linux-gnu
##  ui       X11
##  language en_US:en
##  collate  en_US.UTF-8
##  ctype    en_US.UTF-8
##  tz       America/Chicago
##  date     2020-12-19
##
## ─ Packages ─────────────────────────────────────────────────────────────────────────────────────────────────
##  package     * version date       lib source
##  assertthat    0.2.1   2019-03-21 [1] CRAN (R 4.0.0)
##  broman      * 0.72-1  2020-12-10 [1] local
##  callr         3.5.1   2020-10-13 [1] CRAN (R 4.0.3)
##  cli           2.2.0   2020-11-20 [1] CRAN (R 4.0.3)
##  crayon        1.3.4   2017-09-16 [1] CRAN (R 4.0.0)
##  data.table  * 1.13.4  2020-12-08 [1] CRAN (R 4.0.3)
##  desc          1.2.0   2018-05-01 [1] CRAN (R 4.0.0)
##  devtools    * 2.3.2   2020-09-18 [1] CRAN (R 4.0.3)
##  digest        0.6.27  2020-10-24 [1] CRAN (R 4.0.3)
##  ellipsis      0.3.1   2020-05-15 [1] CRAN (R 4.0.0)
##  evaluate      0.14    2019-05-28 [1] CRAN (R 4.0.0)
##  fansi         0.4.1   2020-01-08 [1] CRAN (R 4.0.0)
##  fs            1.5.0   2020-07-31 [1] CRAN (R 4.0.2)
##  glue          1.4.2   2020-08-27 [1] CRAN (R 4.0.2)
##  htmltools     0.5.0   2020-06-16 [1] CRAN (R 4.0.1)
##  knitr         1.30    2020-09-22 [1] CRAN (R 4.0.2)
##  lifecycle     0.2.0   2020-03-06 [1] CRAN (R 4.0.0)
##  magrittr      2.0.1   2020-11-17 [1] CRAN (R 4.0.3)
##  memoise       1.1.0   2017-04-21 [1] CRAN (R 4.0.0)
##  pkgbuild      1.2.0   2020-12-15 [1] CRAN (R 4.0.3)
##  pkgload       1.1.0   2020-05-29 [1] CRAN (R 4.0.0)
##  prettyunits   1.1.1   2020-01-24 [1] CRAN (R 4.0.0)
##  processx      3.4.5   2020-11-30 [1] CRAN (R 4.0.3)
##  ps            1.5.0   2020-12-05 [1] CRAN (R 4.0.3)
##  purrr         0.3.4   2020-04-17 [1] CRAN (R 4.0.0)
##  R6            2.5.0   2020-10-28 [1] CRAN (R 4.0.3)
##  remotes       2.2.0   2020-07-21 [1] CRAN (R 4.0.3)
##  rlang         0.4.9   2020-11-26 [1] CRAN (R 4.0.3)
##  rmarkdown     2.6     2020-12-14 [1] CRAN (R 4.0.3)
##  rprojroot     2.0.2   2020-11-15 [1] CRAN (R 4.0.3)
##  sessioninfo   1.1.1   2018-11-05 [1] CRAN (R 4.0.0)
##  stringi       1.5.3   2020-09-09 [1] CRAN (R 4.0.2)
##  stringr       1.4.0   2019-02-10 [1] CRAN (R 4.0.0)
##  testthat      3.0.1   2020-12-17 [1] CRAN (R 4.0.3)
##  usethis     * 2.0.0   2020-12-10 [1] CRAN (R 4.0.3)
##  withr         2.3.0   2020-09-22 [1] CRAN (R 4.0.2)
##  xfun          0.19    2020-10-30 [1] CRAN (R 4.0.3)
##  yaml          2.2.1   2020-02-01 [1] CRAN (R 4.0.0)
##
## [1] /home/kbroman/Rlibs
## [2] /usr/local/lib/R/site-library
## [3] /usr/lib/R/site-library
## [4] /usr/lib/R/library