Annotations for the original MUGA array

Introduction

Mandy Chen asked me to repeat my analysis of the MegaMUGA/GigaMUGA annotation files with the original MUGA array. The probe sequences are in an RData file at the UNC site, http://csbio.unc.edu/MUGA/snps.muga.Rdata.

This contains information for 7,854 SNPs. All are on autosomes 1-19, X, Y, or M. As with the other arrays, there are columns for cM, bp, the two alleles, and probe sequence. For the A/C and T/G markers (vs the A/G and T/C ones), there’s a second probe sequence. (There are 1,831 of these.)

The probes are again 50 bp long. Whereas before, in cases with a second probe sequence, the probe sequences contained the SNP allele, here that doesn’t seem to be the case, but I’m not totally sure. I spot-checked a handful of them, and in all cases the SNP allele was not contained in either probe, and the two probes mapped with a one basepair gap. But I probably should blast the second sequence in all cases, and verify that they are all okay.

The annotation files I created are at https://github.com/kbroman/MUGAarrays/blob/main/UWisc.

Preliminaries

I’ll first load some R packages.

library(data.table)
library(broman)
library(devtools)

I’ll now load MUGA annotation file from UNC, and the blast results. For the blast results, I only keep the full length, perfect hits.

# UNC file already loaded above, as muga_unc
load("../UNC/snps.muga.Rdata")
muga_unc <- snps
muga_unc$chr <- sub("^chr", "", muga_unc$chr)

# Blast results; keep just the perfect matches(?)
muga_blast <- readRDS("../Blast/results_muga/muga_blastn_results.rds")
muga_blast <- muga_blast[muga_blast$tot_mismatch==0,]

Summarize blast hits

I’ll start with some code to summarize the blast hits. For each SNP, I want to count the number of perfect blast hits. For markers with a unique hit, I’ll record the chromosome, position, and strand.

# no. blast hits
muga_tab <- table(muga_blast$query)
muga_nchr <- muga_nhits <- setNames(rep(0, nrow(muga_unc)), rownames(muga_unc))
muga_nhits[names(muga_tab)] <- muga_tab

# no. chromosomes hit
muga_tab_chr <- table(muga_blast$query, muga_blast$chr)
muga_nchr[rownames(muga_tab_chr)] <- rowSums(muga_tab_chr > 0)

# chr,pos,strand for the unique ones
muga_blast_uniq <- muga_blast[muga_blast$query %in% names(muga_nhits)[muga_nhits==1],]
muga_blast_chr <- muga_blast_pos <- muga_blast_strand <- setNames(rep(NA, nrow(muga_unc)), rownames(muga_unc))
muga_blast_chr[muga_blast_uniq$query] <- muga_blast_uniq$chr
muga_blast_pos[muga_blast_uniq$query] <- muga_blast_uniq$snp_pos
muga_blast_strand[muga_blast_uniq$query] <- muga_blast_uniq$strand

# probe sequences
muga_blast_probe <- setNames(muga_unc$seq.A, muga_unc$Marker)

# SNP alleles
muga_blast_snp <- paste0(muga_unc$A1, muga_unc$A2)

# put all of this stuff into a data frame
muga_uwisc <- data.frame(marker=names(muga_nhits),
                         n_blast_hits=muga_nhits,
                         unique=(muga_nhits==1),
                         unmapped=(muga_nhits==0),
                         chr=muga_blast_chr,
                         pos=muga_blast_pos,
                         strand=muga_blast_strand,
                         snp=muga_blast_snp,
                         probe=muga_blast_probe,
                         stringsAsFactors=FALSE)
rownames(muga_uwisc) <- muga_uwisc$marker

Basic blast results

The original MUGA array contains 7,854 markers, of which (according to the UNC annotations), 7,351 are on autosomes, 495 are on the X chromosome, 5 are on the Y chromosome, and 3 are on the mitochondrial genome.

In the blast results, I find that 6,919 markers have a single, unique hit in the mouse genome (build mm10) and 697 have no perfect hit. There are 238 markers with multiple hits.

Unique markers

If we look at the inferred chromosome assignments of the markers with a unique hit to the genome, there are 0 differences.

There were 697 markers that didn’t have a perfect match in the mm10 mouse genome assembly. These markers are on all the autosomes and the X chromosome, but not on Y or M.

Positions

Of the 6,919 markers with a unique hit to the mouse genome, most show different positions from the UNC annotation file, which I presume must be in mm9 coordinates.

Markers with the same probes

Not all of the probes are distinct.

tab_probe <- table(muga_uwisc$probe)
pairs <- names(tab_probe)[tab_probe=="2"]
trios <- names(tab_probe)[tab_probe=="3"]
four <- names(tab_probe)[tab_probe=="4"]
eight <- names(tab_probe)[tab_probe=="8"]

# names of the markers in the dup pairs
pair_names <- t(sapply(pairs, function(pr) rownames(muga_uwisc)[muga_uwisc$probe == pr]))
rownames(pair_names) <- 1:nrow(pair_names)
pair_chr <- cbind(muga_unc[pair_names[,1],"chr"], muga_unc[pair_names[,2],"chr"])

trio_names <- t(sapply(trios, function(pr) rownames(muga_uwisc)[muga_uwisc$probe == pr]))
rownames(trio_names) <- 1:nrow(trio_names)

four_names <- rownames(muga_uwisc)[muga_uwisc$probe==four]
eight_names <- rownames(muga_uwisc)[muga_uwisc$probe==eight]

There are 10 pairs and 2 trios of markers with identical probe sequences. Further, there’s a set of four markers with the same probe sequence, and another set of eight markers with the same probe sequence.

Four of the pairs of markers with identical probe sequences look to be intended duplicates: one has the other’s name with “backup” prepended. But the other six have unrelated names, and the trios, four-some, and eight-some all have unrelated names.

Three of the pairs with identical probe sequences map uniquely to the mouse genome. These are all among the cases that look to be intended duplicates. The other seven pairs either map multiply (one) or have no perfect match. (six).

One of the trios is unmapped and the other maps multiply. The four-some maps multiply while the eight-some is unmapped.

New annotation file

I now want to save my findings as a new annotation file for the original MUGA array. I’ll also create a metadata file (“data dictionary”) that explains the columns.

The final annotation files and all source materials are at https://github.com/kbroman/MUGAarrays.

Version 0

We’ll call this version 0. It’ll be like the MegaMUGA and GigaMUGA annotation files I made, but without multi, n_blast_hits, or n_blast_chr. And I’ll include columns chr_unc and bp_unc, the chromosome and position in the UNC annotation file.

# order of columns
cols <- c("marker", "chr", "pos", "strand", "snp", "unique",
          "unmapped", "probe", "chr_unc", "pos_unc")
# revised names
cols_new <- c("marker", "chr", "bp_mm10", "strand", "snp", "unique",
              "unmapped", "probe", "chr_unc", "bp_unc")

# MegaMUGA file
muga_file <- "../UWisc/muga_uwisc_v0.csv"

# reorder and rename columns
muga_uwisc <- cbind(muga_uwisc, chr_unc=muga_unc$chr, pos_unc=muga_unc$pos)
muga_uwisc <- muga_uwisc[,cols]
colnames(muga_uwisc) <- cols_new

# reorder rows
muga_uwisc <- muga_uwisc[order(factor(muga_uwisc$chr, levels=c(1:19,"X","Y","PAR","M")),
                               muga_uwisc$bp_mm10,
                               factor(muga_uwisc$chr_unc, levels=c(0:19, "X", "Y", "PAR","M"))),]

# write to CSV file
write.table(muga_uwisc, muga_file, sep=",", quote=FALSE,
            row.names=FALSE, col.names=TRUE)

I’ll also create a dictionary for the file, which explains what the columns are.

descriptions <- c("Name of SNP marker",
                  "Chromosome",
                  "Physical position in basepairs for mm10 mouse genome build",
                  "Strand (plus/minus) from which the probe sequence was taken",
                  "SNP alleles as a two-character string",
                  "TRUE indicates that the probe sequence appears exactly once in mm10 mouse genome build",
                  "TRUE indicates that the probe sequence has no perfect match in mm10 mouse genome build",
                  "Probe sequence (49 or 50 bases); the SNP occurs immediately after",
                  "Chromosome in UNC annotation file",
                  "Physical positions in basepairs in the UNC annotation file")

muga_dict_file <- "../UWisc/muga_uwisc_dict_v0.csv"
output <- data.frame(column=cols_new,
                     description=descriptions,
                     stringsAsFactors=FALSE)
write.table(output, muga_dict_file, sep=",", quote=FALSE,
            row.names=FALSE, col.names=TRUE)

Finally, I want to make a file that indicates the common markers, using the SNP positions to determine which ones are identical. There are a bunch of them, and they all have related names, like the pairs with identical probe sequences.

# unique marker positions between the arrays
muga_pos <- setNames(paste(muga_uwisc$chr, muga_uwisc$bp_mm10, sep=":"), muga_uwisc$marker)[muga_uwisc$unique]

# find the positions that are in duplicate within the array
tab <- table(muga_pos)
dup <- names(tab)[tab > 1]

# find the corresponding markers
dup_names <- t(sapply(dup, function(d) names(muga_pos)[muga_pos==d]))

pos_spl <- strsplit(rownames(dup_names), ":")

# create data frame with the results
common <- data.frame(chr=sapply(pos_spl, "[", 1),
                     bp_mm10=sapply(pos_spl, "[", 2),
                     marker1=dup_names[,1],
                     marker2=dup_names[,2],
                     stringsAsFactors=FALSE)
# reorder by genomic position
common <- common[order(factor(common$chr, c(1:19,"X","Y","M")), common$bp_mm10),]

# write to a CSV file
write.table(common, "../UWisc/muga_commonmark_uwisc_v1.csv",
            sep=",", quote=FALSE, row.names=FALSE, col.names=TRUE)

# data dictionary
common_cols <- colnames(common)
common_descriptions <- c("Chromosome ID",
                         "Physical position in basepairs for mm10 mouse genome build",
                         "A MUGA marker at that position",
                         "A second MUGA marker at that position")
common_dict <- data.frame(column=common_cols,
                          description=common_descriptions,
                          stringsAsFactors=FALSE)
# write to file
write.table(common_dict, "../UWisc/muga_commonmark_uwisc_dict_v1.csv",
            sep=",", quote=FALSE, row.names=FALSE, col.names=TRUE)

# write just chr, bp to files, for use with mouse map converter
# (want to get interpolated cM positions from the Cox and G2F1 maps)
write.table(muga_uwisc[!is.na(muga_uwisc$chr) & muga_uwisc$chr %in% c(1:19,"X"), c("chr", "bp_mm10")],
            "../GenMaps/muga_bp.txt", sep=" ", quote=FALSE,
            row.names=FALSE, col.names=FALSE)

Version 1, with genetic maps

I used the mouse map converter to convert the mm10 basepair positions of the autosome and X chromosome markers to sex-averaged cM from the Cox et al. and Liu et al. (aka G2F1) genetic maps.

muga_cox <- data.table::fread("../GenMaps/muga_cox.txt", header=FALSE, data.table=FALSE)
muga_g2f1 <- data.table::fread("../GenMaps/muga_g2f1.txt", header=FALSE, data.table=FALSE)

# verify stuff
stopifnot( all(muga_cox[,1] == muga_uwisc$chr[1:nrow(muga_cox)]) )
stopifnot( all(muga_cox[,2] == muga_uwisc$bp_mm10[1:nrow(muga_cox)]) )
stopifnot( all(muga_g2f1[,1] == muga_uwisc$chr[1:nrow(muga_g2f1)]) )
stopifnot( all(muga_g2f1[,2] == muga_uwisc$bp_mm10[1:nrow(muga_g2f1)]) )

The G2F1 genetic maps include positions < 0 cM. Seems perfectly okay, but I’m going to shift the maps, using the overall cM:Mbp ratio on each chromosome, to make 3 Mbp (the conventional start position for the mouse genome builds) equal to 0 cM.

shifts <- read.csv("../GenMaps/g2f1_shift.csv",
                   stringsAsFactors=FALSE)
shifts <- setNames(shifts[,2], shifts[,1])

for(chr in c(1:19,"X")) {
    muga_g2f1[muga_g2f1[,1]==chr,5] <- muga_g2f1[muga_g2f1[,1]==chr,5] + shifts[chr]
}

nas <- rep(NA, nrow(muga_uwisc)-nrow(muga_cox))
muga_uwisc <- cbind(muga_uwisc,
                  cM_cox=c(muga_cox[,5], nas),
                  cM_g2f1=c(muga_g2f1[,5], nas))

Now I can write the new annotation files.

cols_new <- c(cols_new[1:3], "cM_cox", "cM_g2f1", cols_new[-(1:3)])
muga_uwisc <- muga_uwisc[,cols_new]

# write MUGA file
muga_file <- "../UWisc/muga_uwisc_v1.csv"
write.table(muga_uwisc, muga_file, sep=",", quote=FALSE,
            row.names=FALSE, col.names=TRUE)

And finally, the new data dictionary files.

descriptions <- c(descriptions[1:3],
                  "Sex-averaged cM positions from Cox et al. https://doi.org/10.1534/genetics.109.105486",
                  paste("Sex-averaged cM positions from Liu et al.",
                        "https://doi.org/10.1534/genetics.114.161653",
                        "(shifted to avoid cM positions < 0)"),
                  descriptions[-(1:3)])

muga_dict_file <- "../UWisc/muga_uwisc_dict_v1.csv"
output <- data.frame(column=cols_new,
                     description=descriptions,
                     stringsAsFactors=FALSE)
write.table(output, muga_dict_file, sep=",", quote=FALSE,
            row.names=FALSE, col.names=TRUE)

Session info

Here are the versions of R and R packages that I am using.

devtools::session_info()

## ─ Session info ─────────────────────────────────────────────────────────────────────────────────────────────
##  setting  value
##  version  R version 3.6.0 (2019-04-26)
##  os       Pop!_OS 18.04 LTS
##  system   x86_64, linux-gnu
##  ui       X11
##  language en_US:en
##  collate  en_US.UTF-8
##  ctype    en_US.UTF-8
##  tz       America/Chicago
##  date     2019-05-10
##
## ─ Packages ─────────────────────────────────────────────────────────────────────────────────────────────────
##  package     * version date       lib source
##  assertthat    0.2.1   2019-03-21 [1] CRAN (R 3.6.0)
##  backports     1.1.4   2019-04-10 [1] CRAN (R 3.6.0)
##  broman      * 0.69-5  2019-04-11 [1] CRAN (R 3.6.0)
##  callr         3.2.0   2019-03-15 [1] CRAN (R 3.6.0)
##  cli           1.1.0   2019-03-19 [1] CRAN (R 3.6.0)
##  crayon        1.3.4   2017-09-16 [1] CRAN (R 3.6.0)
##  data.table  * 1.12.2  2019-04-07 [1] CRAN (R 3.6.0)
##  desc          1.2.0   2018-05-01 [1] CRAN (R 3.6.0)
##  devtools    * 2.0.2   2019-04-08 [1] CRAN (R 3.6.0)
##  digest        0.6.18  2018-10-10 [1] CRAN (R 3.6.0)
##  evaluate      0.13    2019-02-12 [1] CRAN (R 3.6.0)
##  fs            1.3.1   2019-05-06 [1] CRAN (R 3.6.0)
##  glue          1.3.1   2019-03-12 [1] CRAN (R 3.6.0)
##  htmltools     0.3.6   2017-04-28 [1] CRAN (R 3.6.0)
##  knitr         1.22    2019-03-08 [1] CRAN (R 3.6.0)
##  magrittr      1.5     2014-11-22 [1] CRAN (R 3.6.0)
##  memoise       1.1.0   2017-04-21 [1] CRAN (R 3.6.0)
##  pkgbuild      1.0.3   2019-03-20 [1] CRAN (R 3.6.0)
##  pkgload       1.0.2   2018-10-29 [1] CRAN (R 3.6.0)
##  prettyunits   1.0.2   2015-07-13 [1] CRAN (R 3.6.0)
##  processx      3.3.1   2019-05-08 [1] CRAN (R 3.6.0)
##  ps            1.3.0   2018-12-21 [1] CRAN (R 3.6.0)
##  R6            2.4.0   2019-02-14 [1] CRAN (R 3.6.0)
##  Rcpp          1.0.1   2019-03-17 [1] CRAN (R 3.6.0)
##  remotes       2.0.4   2019-04-10 [1] CRAN (R 3.6.0)
##  rlang         0.3.4   2019-04-07 [1] CRAN (R 3.6.0)
##  rmarkdown     1.12    2019-03-14 [1] CRAN (R 3.6.0)
##  rprojroot     1.3-2   2018-01-03 [1] CRAN (R 3.6.0)
##  sessioninfo   1.1.1   2018-11-05 [1] CRAN (R 3.6.0)
##  stringi       1.4.3   2019-03-12 [1] CRAN (R 3.6.0)
##  stringr       1.4.0   2019-02-10 [1] CRAN (R 3.6.0)
##  testthat      2.1.1   2019-04-23 [1] CRAN (R 3.6.0)
##  usethis     * 1.5.0   2019-04-07 [1] CRAN (R 3.6.0)
##  withr         2.1.2   2018-03-15 [1] CRAN (R 3.6.0)
##  xfun          0.6     2019-04-02 [1] CRAN (R 3.6.0)
##  yaml          2.2.0   2018-07-25 [1] CRAN (R 3.6.0)
##
## [1] /home/kbroman/Rlibs
## [2] /usr/local/lib/R/site-library
## [3] /usr/lib/R/site-library
## [4] /usr/lib/R/library