MiniMUGA annotations

Introduction

Vivek Kumar asked me to repeat my analysis of the MegaMUGA/GigaMUGA annotation files with the MiniMUGA array. He sent me a CSV file that he received from Fernando Pardo Manuel de Villena, containing information for the ~10k markers: marker, chromosome, position (bp38), strand, probe sequence (and for some markers, a second probe sequence), and reference and alternate alleles.

The probes are again 50 bp long, and only a portion of them have the second probe sequence, and in those cases it seems again (like the MegaMUGA and GigaMUGA arrays) that the two sequences contain the SNP. So I trimmed the 50th base off of those sequences before running blastn.

The annotation files I created are at https://github.com/kbroman/MUGAarrays/blob/main/UWisc.

Note: The miniMUGA paper has now been published, with some additions to the array. Initially published at bioRxiv on 2020-03-14, it provides official annotations with the Supplemental material, as Table S2. I have revised my annotations; see this report. My original annotations are in files labeled v1. My annotations of the revised array are in files labeled v2.

Preliminaries

I’ll first load some R packages.

library(data.table)
library(broman)
library(devtools)

I’ll now load miniMUGA annotation file from Fernando via Vivek (hereafter, I’ll call this the UNC annotation file), and the blast results. For the blast results, I only keep the full length, perfect hits.

# UNC file
mini_unc <- data.table::fread("../UNC/miniMUGA-Marker-Annotations.csv",
                              data.table=FALSE)
rownames(mini_unc) <- mini_unc$Marker

# change a few things:
# Marker -> marker
# Chromosome -> chromosome
# Position (b38) -> pos
# MT -> M
colnames(mini_unc)[1:3] <- c("marker", "chr", "pos")
mini_unc$chr[mini_unc$chr=="MT"] <- "M"

# Blast results; keep just the perfect matches(?)
mini_blast <- readRDS("../Blast/results_mini/mini_blastn_results.rds")
mini_blast <- mini_blast[mini_blast$tot_mismatch==0,]

Summarize blast hits

I’ll start with some code to summarize the blast hits. For each SNP, I want to count the number of perfect blast hits. As it turns out, the probes have either 1 or 0 perfect hits, and so we don’t have to do what we did with the MegaMUGA and GigaMUGA arrays, regarding studying the number of distinct chromosomes hit.

For markers with a unique hit, I’ll record the chromosome, position, and strand.

# no. blast hits
mini_tab <- table(mini_blast$query)
mini_nchr <- mini_nhits <- setNames(rep(0, nrow(mini_unc)), rownames(mini_unc))
mini_nhits[names(mini_tab)] <- mini_tab

# chr,pos,strand for the unique ones
mini_blast_uniq <- mini_blast[mini_blast$query %in% names(mini_nhits)[mini_nhits==1],]
mini_blast_chr <- mini_blast_pos <- mini_blast_strand <- setNames(rep(NA, nrow(mini_unc)), rownames(mini_unc))
mini_blast_chr[mini_blast_uniq$query] <- mini_blast_uniq$chr
mini_blast_pos[mini_blast_uniq$query] <- mini_blast_uniq$snp_pos
mini_blast_strand[mini_blast_uniq$query] <- mini_blast_uniq$strand

# probe sequences
mini_blast_probe <- setNames(mini_unc$seqA, mini_unc$Marker)
mini_trim <- (mini_unc$seqB != "")
mini_blast_probe[mini_trim] <- substr(mini_blast_probe[mini_trim], 1, nchar(mini_blast_probe[mini_trim])-1)

# SNP alleles
mini_blast_snp <- paste0(mini_unc$reference, mini_unc$alternate)

# put all of this stuff into a data frame
mini_uwisc <- data.frame(marker=names(mini_nhits),
                         n_blast_hits=mini_nhits,
                         unique=(mini_nhits==1),
                         unmapped=(mini_nhits==0),
                         chr=mini_blast_chr,
                         pos=mini_blast_pos,
                         strand=mini_blast_strand,
                         snp=mini_blast_snp,
                         probe=mini_blast_probe,
                         stringsAsFactors=FALSE)
rownames(mini_uwisc) <- mini_uwisc$marker

Basic blast results

The miniMUGA array contains 10,171 markers, of which (according to the UNC annotations), 9,106 are on autosomes, 642 are on the X chromosome, 75 are on the Y chromosome, 88 are on the mitochondrial genome, 3 are in the pseudoautosomal region, and 257 have chromosome 0 which are transgene-related markers

In the blast results, I find that 9,820 markers have a single, unique hit in the mouse genome (build mm10) and 351 have no perfect hit. There are no markers with multiple hits.

Unique markers

If we look at the inferred chromosome assignments of the markers with a unique hit to the genome, there are just 3 differences, and these are the cases where the UNC annotation file says chromosome PAR (for pseudoautosomal), while the blast results match to the X chromosome. I’m going to force these back to PAR in my annotations.

mini_uwisc$chr[mini_unc$chr=="PAR"] <- "PAR"

There were 351 markers that didn’t have a perfect match in the mm10 mouse genome assembly. This included the 257 markers that the UNC annotation file has as chromosome 0 (transgene-related markers), and then a smattering of markers on all chromosomes except the mitochondria.

Positions

Of the 9,820 markers with a unique hit to the mouse genome, for all but 7 I get the exact same position as in the UNC annotation file. And this includes the 287 markers for which I’d trimmed off the last base of the probe before using blast. (So the off-by-1 problem in the GigaMUGA marker positions was not an issue for UNC annotations of the miniMUGA array.)

The seven markers where the positions are different are all mitochondrial. Of these, six are off by -1 (UNC2203635, UNC2212395, UNC2212695, UNC2212752, UNC2213612, and mMit001), while the other (UNC2202344) is off by 50.

Markers with the same probes

Not all of the probes are distinct.

tab_probe <- table(mini_uwisc$probe)
pairs <- names(tab_probe)[tab_probe=="2"]
trios <- names(tab_probe)[tab_probe=="3"]

# names of the markers in the dup pairs
pair_names <- t(sapply(pairs, function(pr) rownames(mini_uwisc)[mini_uwisc$probe == pr]))
rownames(pair_names) <- 1:nrow(pair_names)
pair_chr <- cbind(mini_unc[pair_names[,1],"chr"], mini_unc[pair_names[,2],"chr"])

# pairs from these that are not on chr 0
pairs_not0 <- pair_names[rowSums(pair_chr=="0")<2,]

# verify name mutations
stopifnot( all(pairs_not0[,2] == paste0(pairs_not0[,1], "b") |
    pairs_not0[,2] == paste0("g", pairs_not0[,1]) |
    pairs_not0[,2] == paste0("m", pairs_not0[,1]) |
    pairs_not0[,1] == paste0(pairs_not0[,2], "b") |
    pairs_not0[,1] == paste0("g", pairs_not0[,2]) |
    pairs_not0[,1] == paste0("m", pairs_not0[,2])) )

There are 113 pairs and 4 trios of markers with identical probe sequences.

The trios of markers with identical probe sequences are all on chr 0 in the UNC annotation file, for transgene-related markers.

Of the 113 pairs, 36 are both on chromosome 0.

The other 154 pairs of markers with identical probe sequences appear to be intended duplicates: they all are such the one name is the same as the other but with g or m added to the beginning, or with b added to the end.

New annotation file

I now want to save my findings as a new annotation file for the miniMUGA array. I’ll also create a metadata file (“data dictionary”) that explains the columns.

The final annotation files and all source materials are at https://github.com/kbroman/MUGAarrays.

Version 0

We’ll call this version 0. It’ll be like the MegaMUGA and GigaMUGA annotation files I made, but without multi, n_blast_hits, or n_blast_chr. And I’ll include columns chr_unc and bp_unc, the chromosome and position in the UNC annotation file.

# order of columns
cols <- c("marker", "chr", "pos", "strand", "snp", "unique",
          "unmapped", "probe", "chr_unc", "pos_unc")
# revised names
cols_new <- c("marker", "chr", "bp_mm10", "strand", "snp", "unique",
              "unmapped", "probe", "chr_unc", "bp_unc")

# MegaMUGA file
mini_file <- "../UWisc/mini_uwisc_v0.csv"

# reorder and rename columns
mini_uwisc <- cbind(mini_uwisc, chr_unc=mini_unc$chr, pos_unc=mini_unc$pos)
mini_uwisc <- mini_uwisc[,cols]
colnames(mini_uwisc) <- cols_new

# reorder rows
mini_uwisc <- mini_uwisc[order(factor(mini_uwisc$chr, levels=c(1:19,"X","Y","PAR","M")),
                               mini_uwisc$bp_mm10,
                               factor(mini_uwisc$chr_unc, levels=c(0:19, "X", "Y", "PAR","M"))),]

# write to CSV file
write.table(mini_uwisc, mini_file, sep=",", quote=FALSE,
            row.names=FALSE, col.names=TRUE)

I’ll also create a dictionary for the file, which explains what the columns are.

descriptions <- c("Name of SNP marker",
                  "Chromosome",
                  "Physical position in basepairs for mm10 mouse genome build",
                  "Strand (plus/minus) from which the probe sequence was taken",
                  "SNP alleles as a two-character string",
                  "TRUE indicates that the probe sequence appears exactly once in mm10 mouse genome build",
                  "TRUE indicates that the probe sequence has no perfect match in mm10 mouse genome build",
                  "Probe sequence (49 or 50 bases); the SNP occurs immediately after",
                  "Chromosome in UNC annotation file",
                  "Physical positions in basepairs in the UNC annotation file")

mini_dict_file <- "../UWisc/mini_uwisc_dict_v0.csv"
output <- data.frame(column=cols_new,
                     description=descriptions,
                     stringsAsFactors=FALSE)
write.table(output, mini_dict_file, sep=",", quote=FALSE,
            row.names=FALSE, col.names=TRUE)

Finally, I want to make a file that indicates the common markers, using the SNP positions to determine which ones are identical. There are a bunch of them, and they all have related names, like the pairs with identical probe sequences.

# unique marker positions between the arrays
mini_pos <- setNames(paste(mini_uwisc$chr, mini_uwisc$bp_mm10, sep=":"), mini_uwisc$marker)[mini_uwisc$unique]

# find the positions that are in duplicate within the array
tab <- table(mini_pos)
dup <- names(tab)[tab > 1]

# find the corresponding markers
dup_names <- t(sapply(dup, function(d) names(mini_pos)[mini_pos==d]))

pos_spl <- strsplit(rownames(dup_names), ":")

# create data frame with the results
common <- data.frame(chr=sapply(pos_spl, "[", 1),
                     bp_mm10=sapply(pos_spl, "[", 2),
                     marker1=dup_names[,1],
                     marker2=dup_names[,2],
                     stringsAsFactors=FALSE)
# reorder by genomic position
common <- common[order(factor(common$chr, c(1:19,"X","Y","M")), common$bp_mm10),]

# write to a CSV file
write.table(common, "../UWisc/mini_commonmark_uwisc_v1.csv",
            sep=",", quote=FALSE, row.names=FALSE, col.names=TRUE)

# data dictionary
common_cols <- colnames(common)
common_descriptions <- c("Chromosome ID",
                         "Physical position in basepairs for mm10 mouse genome build",
                         "A miniMUGA marker at that position",
                         "A second miniMUGA marker at that position")
common_dict <- data.frame(column=common_cols,
                          description=common_descriptions,
                          stringsAsFactors=FALSE)
# write to file
write.table(common_dict, "../UWisc/mini_commonmark_uwisc_dict_v1.csv",
            sep=",", quote=FALSE, row.names=FALSE, col.names=TRUE)

# write just chr, bp to files, for use with mouse map converter
# (want to get interpolated cM positions from the Cox and G2F1 maps)
write.table(mini_uwisc[!is.na(mini_uwisc$chr) & mini_uwisc$chr %in% c(1:19,"X"), c("chr", "bp_mm10")],
            "../GenMaps/mini_bp.txt", sep=" ", quote=FALSE,
            row.names=FALSE, col.names=FALSE)

Version 1, with genetic maps

I used the mouse map converter to convert the mm10 basepair positions of the autosome and X chromosome markers to sex-averaged cM from the Cox et al. and Liu et al. (aka G2F1) genetic maps.

mini_cox <- data.table::fread("../GenMaps/mini_cox.txt", header=FALSE, data.table=FALSE)
mini_g2f1 <- data.table::fread("../GenMaps/mini_g2f1.txt", header=FALSE, data.table=FALSE)

# verify stuff
stopifnot( all(mini_cox[,1] == mini_uwisc$chr[1:nrow(mini_cox)]) )
stopifnot( all(mini_cox[,2] == mini_uwisc$bp_mm10[1:nrow(mini_cox)]) )
stopifnot( all(mini_g2f1[,1] == mini_uwisc$chr[1:nrow(mini_g2f1)]) )
stopifnot( all(mini_g2f1[,2] == mini_uwisc$bp_mm10[1:nrow(mini_g2f1)]) )

The G2F1 genetic maps include positions < 0 cM. Seems perfectly okay, but I’m going to shift the maps, using the overall cM:Mbp ratio on each chromosome, to make 3 Mbp (the conventional start position for the mouse genome builds) equal to 0 cM.

shifts <- read.csv("../GenMaps/g2f1_shift.csv",
                   stringsAsFactors=FALSE)
shifts <- setNames(shifts[,2], shifts[,1])

for(chr in c(1:19,"X")) {
    mini_g2f1[mini_g2f1[,1]==chr,5] <- mini_g2f1[mini_g2f1[,1]==chr,5] + shifts[chr]
}

nas <- rep(NA, nrow(mini_uwisc)-nrow(mini_cox))
mini_uwisc <- cbind(mini_uwisc,
                  cM_cox=c(mini_cox[,5], nas),
                  cM_g2f1=c(mini_g2f1[,5], nas))

Now I can write the new annotation files.

cols_new <- c(cols_new[1:3], "cM_cox", "cM_g2f1", cols_new[-(1:3)])
mini_uwisc <- mini_uwisc[,cols_new]

# write miniMUGA file
mini_file <- "../UWisc/mini_uwisc_v1.csv"
write.table(mini_uwisc, mini_file, sep=",", quote=FALSE,
            row.names=FALSE, col.names=TRUE)

And finally, the new data dictionary files.

descriptions <- c(descriptions[1:3],
                  "Sex-averaged cM positions from Cox et al. https://doi.org/10.1534/genetics.109.105486",
                  paste("Sex-averaged cM positions from Liu et al.",
                        "https://doi.org/10.1534/genetics.114.161653",
                        "(shifted to avoid cM positions < 0)"),
                  descriptions[-(1:3)])

mini_dict_file <- "../UWisc/mini_uwisc_dict_v1.csv"
output <- data.frame(column=cols_new,
                     description=descriptions,
                     stringsAsFactors=FALSE)
write.table(output, mini_dict_file, sep=",", quote=FALSE,
            row.names=FALSE, col.names=TRUE)

Session info

Here are the versions of R and R packages that I am using.

devtools::session_info()

## ─ Session info ─────────────────────────────────────────────────────────────────────────────────────────────
##  setting  value
##  version  R version 4.0.3 (2020-10-10)
##  os       Pop!_OS 20.10
##  system   x86_64, linux-gnu
##  ui       X11
##  language en_US:en
##  collate  en_US.UTF-8
##  ctype    en_US.UTF-8
##  tz       America/Chicago
##  date     2020-12-18
##
## ─ Packages ─────────────────────────────────────────────────────────────────────────────────────────────────
##  package     * version date       lib source
##  assertthat    0.2.1   2019-03-21 [1] CRAN (R 4.0.0)
##  broman      * 0.72-1  2020-11-28 [1] local
##  callr         3.5.1   2020-10-13 [1] CRAN (R 4.0.3)
##  cli           2.2.0   2020-11-20 [1] CRAN (R 4.0.3)
##  crayon        1.3.4   2017-09-16 [1] CRAN (R 4.0.0)
##  data.table  * 1.13.4  2020-12-08 [1] CRAN (R 4.0.3)
##  desc          1.2.0   2018-05-01 [1] CRAN (R 4.0.0)
##  devtools    * 2.3.2   2020-09-18 [1] CRAN (R 4.0.2)
##  digest        0.6.27  2020-10-24 [1] CRAN (R 4.0.3)
##  ellipsis      0.3.1   2020-05-15 [1] CRAN (R 4.0.0)
##  evaluate      0.14    2019-05-28 [1] CRAN (R 4.0.0)
##  fansi         0.4.1   2020-01-08 [1] CRAN (R 4.0.0)
##  fs            1.5.0   2020-07-31 [1] CRAN (R 4.0.2)
##  glue          1.4.2   2020-08-27 [1] CRAN (R 4.0.2)
##  htmltools     0.5.0   2020-06-16 [1] CRAN (R 4.0.1)
##  knitr         1.30    2020-09-22 [1] CRAN (R 4.0.2)
##  lifecycle     0.2.0   2020-03-06 [1] CRAN (R 4.0.0)
##  magrittr      2.0.1   2020-11-17 [1] CRAN (R 4.0.3)
##  memoise       1.1.0   2017-04-21 [1] CRAN (R 4.0.0)
##  pkgbuild      1.2.0   2020-12-15 [1] CRAN (R 4.0.3)
##  pkgload       1.1.0   2020-05-29 [1] CRAN (R 4.0.0)
##  prettyunits   1.1.1   2020-01-24 [1] CRAN (R 4.0.0)
##  processx      3.4.5   2020-11-30 [1] CRAN (R 4.0.3)
##  ps            1.5.0   2020-12-05 [1] CRAN (R 4.0.3)
##  purrr         0.3.4   2020-04-17 [1] CRAN (R 4.0.0)
##  R6            2.5.0   2020-10-28 [1] CRAN (R 4.0.3)
##  remotes       2.2.0   2020-07-21 [1] CRAN (R 4.0.3)
##  rlang         0.4.9   2020-11-26 [1] CRAN (R 4.0.3)
##  rmarkdown     2.6     2020-12-14 [1] CRAN (R 4.0.3)
##  rprojroot     2.0.2   2020-11-15 [1] CRAN (R 4.0.3)
##  sessioninfo   1.1.1   2018-11-05 [1] CRAN (R 4.0.0)
##  stringi       1.5.3   2020-09-09 [1] CRAN (R 4.0.2)
##  stringr       1.4.0   2019-02-10 [1] CRAN (R 4.0.0)
##  testthat      3.0.1   2020-12-17 [1] CRAN (R 4.0.3)
##  usethis     * 2.0.0   2020-12-10 [1] CRAN (R 4.0.3)
##  withr         2.3.0   2020-09-22 [1] CRAN (R 4.0.2)
##  xfun          0.19    2020-10-30 [1] CRAN (R 4.0.3)
##  yaml          2.2.1   2020-02-01 [1] CRAN (R 4.0.0)
##
## [1] /home/kbroman/Rlibs
## [2] /usr/local/lib/R/site-library
## [3] /usr/lib/R/site-library
## [4] /usr/lib/R/library