This is a reproduction of the results in Lamichhane et al. (2003) A post-genomic method for predicting essential genes at subsaturation levels of mutagenesis: application to Mycobacterium tuberculosis. Proc Natl Acad Sci USA 100:7213-7218 doi:10.1073/pnas.1231432100 The full repository for the reproduction work is https://github.com/kbroman/Paper_ReScience2020.

The original project directory, with data and code, is on github. A subset of the materials is in ../original/, and I will be drawing from that. The code to create Figure 1b in the paper was not in that project directory but rather with materials for a talk I gave in 2002-2003. The materials for that talk is on github. A subset is in ../talk/.

I should say right at the start that I am not able to reproduce Figure 3 in the paper. I have the code to make the Figure based on simulation results, but I can’t find the code that was used to carry out the simulation.

I will first load the required libraries for this analysis.

library(negenes)
library(R.utils)  # for gunzip()
library(devtools) # for session_info()
library(gt)       # for tables
library(xtable)   # for latex tables
library(readxl)   # for reading excel file
library(broman)   # for grayplot()

Move stuff over

The first task is to copy over the primary data and scripts that we need to reproduce the results. I will first create a set of subdirectories to contain them.

dir <- c("Rawdata", "Rawdata/TIGR", "Data", "R", "Figs", "Tabs")

for(d in dir) {
    if(!dir.exists(d)) dir.create(d)
}

Next, I will copy over the primary data files.

raw_data_files <- c("Rawdata/TIGR/GMT.1con.gz",
                    "Rawdata/TBCDC1551_rev.csv",
                    "Rawdata/MTCoords_rev.csv",
                    "Rawdata/MT-RvConversion_rev.csv",
                    "Rawdata/MtbGeneClassification.csv",
                    "Sept02/Rawdata/GenomicData-final4_rev.csv",
                    "Sept02/Rawdata/Phase2-FinalData_rev.csv")

for(file in raw_data_files) {
    orig_file <- file.path("..", "original", file)
    new_file <- sub("Sept02/", "", file, fixed=TRUE)

    if(!file.exists(new_file)) {
        cat(orig_file, " -> ", new_file, "\n")
        file.copy(orig_file, new_file)
    }
}

I need to use gunzip() to unzip the TIGR genomic data file.

gzipped_file <- "Rawdata/TIGR/GMT.1con.gz"
gunzipped_file <- sub(".gz", "", gzipped_file, fixed=TRUE)
if(!file.exists(gunzipped_file)) {
    gunzip(gzipped_file)
}

I will now copy over the perl script that pulls out the transposon insertion sites from the M. tuberculosis genome.

file <- "findTA.pl"
if(!file.exists(file)) {
    file.copy(file.path("..", "original", file), file)
}

Now I will copy over the R scripts that do the MCMC analysis. The ones in Sept02/R create results that appear in Table 2. The ones in Nov02/R create all of the rest of the results.

files <- c("prepareData.R", "analysis.R")

# 2002-09 versions
for(file in files) {
    orig_file <- file.path("..", "original", "Sept02", "R", file)
    new_file <- file.path("R", paste0("sept02_", file))

    if(!file.exists(new_file)) {
        file.copy(orig_file, new_file)
    }
}


# 2002-11 versions
for(file in files) {
    orig_file <- file.path("..", "original", "Nov02", "R", file)
    new_file <- file.path("R", paste0("nov02_", file))

    if(!file.exists(new_file)) {
        file.copy(orig_file, new_file)
    }
}

Finally, I will copy over the R script that create the figures.

file <- "figs4paper.R"
orig_file <- file.path("..", "original", "Nov02", "R", file)
new_file <- file.path("R", file)
if(!file.exists(new_file)) {
    file.copy(orig_file, new_file)
}

Oh, yeah. There is also the script that makes Figure 1b, which was not in the original project directory but rather with materials for a talk I gave at the time.

file <- "circlefig.R"
orig_file <- file.path("..", "talk", "R", file)
new_file <- file.path("R", file)
if(!file.exists(new_file)) {
    file.copy(orig_file, new_file)
}

Run perl script

I will first run the Perl script findTA.pl which uses data from Rawdata/ and creates a set of files in Data/. More specifically, it uses the genomic sequence for M. tuberculosis and the genes’ start and end positions to identify the locations of each transposon insertion site within each gene.

system("perl findTA.pl")

Project structure

The reproduction of this work included a number of challenges: lack of documentation (couldn’t you give me even one README file, 2002 Karl?), multiple directories of scripts with portions of results coming from different ones, not saving random number generator seeds, not loading the required library in the scripts, not saving intermediate results to files (but just leaving them in the R environment, in .RData), and not explaining which R objects were needed in a script.

To reproduce the work, I am going to run a particular script, save the key results to a file, and then clean the R environment for the next step. Some of the calculations take a bit of time, and those I will cache using a crude caching system that will mostly be hidden.

Because of the lack of documentation, the first challenge was to figure out which results I was trying to reproduce. To do that, I re-read the paper. The key results are in Figures 1, 2, and 3, and Tables 2, 3, and 4. Table 3 also has an extended version, as Supplementary Table 6. There are a number of other summary statistics about the data, for example in Table 1 and in the right-hand column of the second page of the paper. But I am going to focus on the key analysis results, Figures 1-3 and Tables 2-4 and Supplemental Table 6.

Let me first explain some key elements of the project directory in ../original/. The R subdirectory contains my initial analysis of the data (performed in like July, 2002), which did not end up in the paper. Sept02/R contains a revised analysis based on a slightly larger set of data and evaluating multiple versions of a rule for part of the data handling (see Table 2 of the published paper). Nov02/R contains an analysis with our final version of the data handling rule and is the basis for one row of Table 2 and all of the rest of the analyses.

Run MCMC

The analysis proceeded in two steps: run a prepareData.R script to create the key data objects, and then run analysis.R which performs the Markov chain Monte Carlo (MCMC) analysis and summarizes the results. We first need to do the Sept02 versions and then the Nov02 versions.

So we will start with Sept02/R/prepareData.R, which we have saved as R/sept02_prepareData.R. It prints a bunch of crap, so we will hide the output.

There are two small wrinkles here. First, the ngeneperfam object has class "table" which leads to an error in the next step, in the analysis.R script. This is due to some change in R since 2002. So we need to convert the ngeneperfam object to a simple numeric vector.

Second, the script reads the file geneinfo.csv assuming that the gene classes column will be converted to a factor, which is no longer true in R version 4. So we will insert a line stringsAsFactors to be TRUE.

rm(list=ls())
setwd("R")

# fix small problem in the sept02 script
lines <- readLines("sept02_prepareData.R")
g <- grep("geneinfo.csv", lines)
lines[g] <- sub(")", ", stringsAsFactors=TRUE)", lines[g])
cat(lines, file="sept02_prepareData.R", sep="\n")

# run the script
source("sept02_prepareData.R")

# need to turn this "table" into numeric
ngeneperfam <- setNames(as.numeric(ngeneperfam), names(ngeneperfam))

# save the data objects to a file
save(data100nostop, data100stop, data90, data70, data60, data80,
     ngeneperfam, numTAs, geneclasses,
     file="sept02_data.RData")

setwd("..")

I will now run the MCMC. This takes about 9 min on my laptop. (I cache the results and only run this if needed.) Again, we copied over the Sept02/R/analysis.R script as R/sept02_analysis.R.

The only slight change here is to rename one of the objects to avoid conflict with the results obtained using the Nov02 script, below.

rm(list=ls())
setwd("R")

# load the data created above
load("sept02_data.RData")

set.seed(74250112)
source("sept02_analysis.R")

# A different finalres object is created by the Nov02 script
# so change the name of this thing to res80
# (but note that res80 is in different format from the others)
res80 <- finalres
save(res100nostop, res90, res80, res70, res60,
     file="sept02_results.RData")

setwd("..")

We now repeat the same things with the Nov02 scripts. First, Nov02/R/prepareData.R, which we have saved as R/nov02_prepareData.R. Again, it prints a bunch of crap, so we will hide the output.

Again, we need to convert the ngeneperfam object to be a numeric vector.

rm(list=ls())
setwd("R")

# fix small problem in the nov02 script
lines <- readLines("nov02_prepareData.R")
g <- grep("geneinfo.csv", lines)
lines[g] <- sub(")", ", stringsAsFactors=TRUE)", lines[g])
cat(lines, file="nov02_prepareData.R", sep="\n")

# run the script
source("nov02_prepareData.R")

# need to turn this "table" into numeric
ngeneperfam <- setNames(as.numeric(ngeneperfam), names(ngeneperfam))

# save the data objects to a file
save(mydata,
     ngeneperfam, numTAs, geneclasses,
     taloc, classes,
     file="nov02_data.RData")

setwd("..")

And again we run the MCMC. This one is faster, because we are doing just one version of the analysis rather than 6. (It takes about 1.5 min on my laptop). The R/nov02_analysis.R script was copied over from Nov02/R/analysis.R.

The only modification here is to change the name of one of the results objects, from wh (which is not very meaningful) to genes.

rm(list=ls())
setwd("R")

# load the data created above
load("nov02_data.RData")

set.seed(27150742)
source("nov02_analysis.R")

genes <- wh # rename "wh" (as more meaningful)
save(finalres, fams, genes, famprob,
     file="nov02_results.RData")

setwd("..")

Tables

Let us now reconstruct the Tables 2-4 and Supplemental Table 6.

Table 2

First, let us look at Table 2, which shows the estimated percent essential genes (with an interval estimate) using various rules for defining the part of the gene where viable transposon insertion would indicate that the gene is not essential.

rm(list=ls())
load("R/sept02_results.RData")
load("R/nov02_results.RData")

tab2 <- matrix(nrow=6, ncol=3)
dimnames(tab2) <- list(c("100%", "90%", "5'80%-3'100bp", "80%", "70%", "60%"),
                       c("estimate (%)", "lo 95% CI", "hi 95% CI"))

for(i in 1:6) {
    # grab the appropriate result object
    obj_nam <- c("res100nostop", "res90", "finalres",
                 "res80", "res70", "res60")[i]
    obj <- get(obj_nam)[[1]] # first component is number of essential genes

    # pull out the summaries
    # divide by the number of genes (4204) and multiple by 100 (to get percent)
    tab2[i,] <- c(mean(obj), quantile(obj, c(0.025, 0.975))) / 4204 * 100
}

# add rule as a first column
tab2 <- cbind(data.frame(rule=rownames(tab2)), tab2)

# use gt package to make a table
gt(tab2) %>%
    fmt_number(columns=2:4, decimals=0)
rule estimate (%) lo 95% CI hi 95% CI
100% 34 27 39
90% 36 29 42
5'80%-3'100bp 35 28 41
80% 40 33 46
70% 42 35 49
60% 42 33 49

Comparing the reconstructed table to the published version, we see just one small difference: the upper confidence limit in the bottom row is 49% here, whereas it was 50% in the published paper. This difference can be ascribed to MCMC sampling variation.

The following creates a LaTeX table comparing these results, for the final paper.

tab2_new <-
    cbind(tab2[,1,drop=FALSE], estimate_orig=c(34,36,35,40,42,42),
          round(tab2[,2,drop=FALSE]),
          ci_lo_orig=c(27,29,28,33,35,33),
          ci_hi_orig=c(39,42,41,46,49,50),
          round(tab2[,3:4]), stringsAsFactors=FALSE)

# highlight differences in bold [there is just one]
for(i in 1:nrow(tab2_new)) {
  if(tab2_new[i,2] != tab2_new[i,3])
      tab2_new[i,3] <- paste0("\\textbf{\\color{red} ", tab2_new[i,3], "}")
  if(tab2_new[i,4] != tab2_new[i,6])
      tab2_new[i,6] <- paste0("\\textbf{\\color{red} ", tab2_new[i,6], "}")
  if(tab2_new[i,5] != tab2_new[i,7])
      tab2_new[i,7] <- paste0("\\textbf{\\color{red} ", tab2_new[i,7], "}")
}

xtab2 <- xtable(tab2_new, digits=0,
               align=c(rep("c", 4), rep(c("r@{--}", "l"), 2)))

addtorow <- list()
addtorow$pos <- list(0,0)
addtorow$command <- c("& \\multicolumn{2}{c}{Estimate (\\%)} & \\multicolumn{4}{c}{95\\% credible interval}\\\\",
                      "Rule & original & reproduction & \\multicolumn{2}{c}{original} & \\multicolumn{2}{c}{reproduction}\\\\")

print(xtab2, file="Tabs/tab2.tex",
      include.rownames=FALSE,
      include.colnames=FALSE,
      sanitize.text.function=function(x) gsub("%", "\\%", x, fixed=TRUE),
      add.to.row=addtorow,
      floating=FALSE)

Table 3 and Supplemental Table 6

Let us now turn to Table 3 and Supplemental Table 6, on the genes with probability > 75% of being essential. (Supplemental Table 6 is an extended version of Table 3.)

In order to reproduce these tables exactly, we need to make one small change to the code in the Nov02/R/analysis.R. While the original script pulled out genes with probability ≥ 0.749, we need to change that cutoff to 0.745. This difference is within rounding error and can be ascribed to MCMC sampling error.

We can make the change by reading in the script, pulling out the appropriate lines, substituting the one number, and then re-running those lines.

rm(list=ls())
load("R/nov02_data.RData")    # numTAs
load("R/nov02_results.RData") # finalres

# read script; subset to lines 50-53; change 0.749 to 0.745
script <- readLines("R/nov02_analysis.R")
lines <- sub("0.749", "0.745", script[50:53], fixed=TRUE)

# run those lines
eval(parse(text=lines))

# add gene number as a column; change row order; change name of object
wh <- cbind(data.frame(MTnum=rownames(wh)), wh)
wh <- wh[rev(order(wh[,3], -as.numeric(rownames(wh)))), ]
wh[14:15,] <- wh[15:14,]
wh[19:20,] <- wh[20:19,]
genes <- wh
genes[,3] <- genes[,3] * 100 # convert to percent

Here is the reconstruction of Table 3. The results match the published version. (If we had left the cutoff at 0.749, we would have missed the last two genes on the list, for which the estimated probability of being essential is now 74.8%. This difference is again likely due to MCMC sampling error.)

# genes to exclude from Table 3 (just listed in Supplemental Table 6)
tab3_exclude <- c("418",  "1218", "3003", "1701",
                  "2448", "3002", "1702")

genes[!(genes$MTnum %in% tab3_exclude), c("MTnum", "Prob.essential")] %>%
    gt() %>%
    fmt_number(columns="Prob.essential", decimals=0)
MTnum Prob.essential
417 83
2062 83
3285 83
2082 82
3974 82
1587 81
1198 79
47 79
2600 78
70 77
1678 76
2551 75
1796 75
116 75
3045 75

Seven genes were excluded from Table 3 (because there was other evidence to indicate that they were non-essential), but the full list was shown in Supplemental Table 6. Here is the reconstruction of the full table, which matches the published version.

genes[, c("MTnum", "Prob.essential")] %>%
    gt() %>%
    fmt_number(columns="Prob.essential", decimals=0)
MTnum Prob.essential
418 93
1218 88
3003 84
417 83
1701 83
2062 83
3285 83
2448 83
2082 82
3974 82
1587 81
1198 79
47 79
3002 78
2600 78
70 77
1678 76
1702 76
2551 75
1796 75
116 75
3045 75

The following creates a LaTeX table comparing these results, for the final paper.

# download original table 6
url <- "https://www.pnas.org/highwire/filestream/586425/field_highwire_adjunct_files/1/1432Table6.xls"
local_file <- "Tabs/1432Table6.xls"
if(!file.exists(local_file)) {
    download.file(url, local_file)
}

# read original table 6
tab6_orig <- readxl::read_excel(local_file, sheet=1, skip=2)
tab6_orig <- as.data.frame(tab6_orig)

# add gene MT number
genes$MTnum <- sprintf("%04d", as.numeric(as.character(genes$MTnum)))

stopifnot(all(tab6_orig[,1] == genes$MTnum))

tab6_new <- cbind(tab6_orig, round(genes$Prob.essential))

# highlight differences in bold [there aren't any]
for(i in 1:nrow(tab6_new)) {
    if(tab6_new[i,4] != tab6_new[i,5])
        tab6_new[i,5] <- paste0("\\textbf{\\color{red} ", tab6_new[i,5], "}")
}

# remove the references at the end of the gene descriptions
tab6_new[,3] <- sub("\\([123]\\)$", "", tab6_new[,3])

xtab6 <- xtable(tab6_new, digits=0, align=rep("c", 6))

addtorow <- list()
addtorow$pos <- list(0,0)
addtorow$command <- c("& & & \\multicolumn{2}{c}{Probability (\\%)}\\\\",
                      "MT \\# & Rv \\# & Gene description & original & reproduction\\\\")

print(xtab6, file="Tabs/tab6.tex",
      include.rownames=FALSE,
      include.colnames=FALSE,
      sanitize.text.function=function(x) gsub("%", "\\%", x, fixed=TRUE),
      add.to.row=addtorow,
      floating=FALSE)

Table 4

Table 4 contains a list of gene families that appear to be enriched with essential genes, or to have a deficit of essential genes. The contents of the table are created in Nov02/R/analysis.R, but one small change needs to be made, replacing the 75% cutoff with 74.5%, as otherwise one of the gene families gets left off. (The new estimate of the enrichment probability for one family has dipped slightly below 75%.)

load("R/nov02_data.RData")
load("R/nov02_results.RData")

# read script
script <- readLines("R/nov02_analysis.R")

# change cutoff from 75 to 74.5
lines <- sub(">= 75", ">= 74.5", script[56:64], fixed=TRUE)

# run the lines
eval(parse(text=lines))

# paste in family labels
fams <- cbind(data.frame(family=classes[fams[,1]]),
              fams[,-1])

# create the table
gt(fams[,c("prob.enriched", "family", "percent.essential", "2.5%", "97.5%")])
prob.enriched family percent.essential 2.5% 97.5%
97 MMM: Aminoacyl tRNA synthases and their modification 54 32 72
93 Other: PE family: PGRS subfamily 45 30 60
82 SMM: Purine ribonucleotide biosynthesis 46 21 68
81 SMM: Polyketide and non-ribosomal peptide synthesis 40 28 52
78 SMM: Biosyn of fatty and mycolic acids 42 23 62
75 SMM: Ser/Thr protein kinases and phosphoprotein phosphatases 43 21 64
75 SMM: Biosyn of Molybdopterin 42 20 65
5 SMM: Intermediate Metabolism-of sulphur 20 7 40
4 Unknowns 32 25 39
4 Other: PPE family of proteins 27 17 38
0 MMM: Conserved membrane proteins 10 0 24

There are a few small changes from the published version. First, the enrichment probability has changed ±1% for three of the families. Second, the upper confidence limits have changed for two families: in the first row the limit changed from 76 to 72%, and in the second-to-last row it changed from 36 to 38%. I think these differences can be ascribed to MCMC sampling error.

The following creates a LaTeX table comparing these results, for the final paper.

# original table 4
tab4_orig <- data.frame(prob_enriched=c(97,94,82,80,78,75,75,4,4,4,0),
                        percent_essential=c(54,45,46,40,42,43,42,32,20,27,10),
                        ci_essential_lo=c(32,30,21,28,23,21,20,25,7,17,0),
                        ci_essential_hi=c(76,60,68,52,62,64,65,39,40,36,24),
                        group=c("Aminoacyl tRNA synthases and their modification",
                                "PE family: PGRS subfamily",
                                "Purine ribonucleotide biosynthesis",
                                "Polyketide and nonribosomal peptide synthesis",
                                "Synthesis of fatty and mycolic acids",
                                "Ser/Thr protein kinases and phosphoprotein phosphatases",
                                "Biosynthesis of molybdopterin",
                                "Unknown proteins",
                                "Metabolism of sulphur",
                                "PPE family",
                                "Conserved membrane proteins"),
                        stringsAsFactors=FALSE)

# clean up the new names
fams[,1] <- as.character(fams[,1])
fams[,1] <- sub("^[A-Za-z]+: ", "", fams[,1])
fams[fams[,1]=="Unknowns",1] <- "Unknown proteins"
fams[fams[,1]=="Intermediate Metabolism-of sulphur",1] <- "Metabolism of sulphur"
fams[fams[,1]=="PPE family of proteins",1] <- "PPE family"
fams[fams[,1]=="Biosyn of Molybdopterin",1] <- "Biosynthesis of molybdopterin"
fams[,1] <- sub("non-ribosomal", "nonribosomal", fams[,1], fixed=FALSE)
fams[,1] <- sub("Biosyn ", "Synthesis ", fams[,1], fixed=FALSE)

# reorder as in original table
m <- match(tab4_orig$group, fams[,1])
stopifnot( !any(is.na(m)) )
fams <- fams[m, ]
stopifnot( all( fams[,1] == tab4_orig$group ) )

tab4_new <- cbind(tab4_orig[,5,drop=FALSE],
                  tab4_orig[,1,drop=FALSE], fams[,2,drop=FALSE],
                  tab4_orig[,2,drop=FALSE], fams[,3,drop=FALSE],
                  tab4_orig[,3:4,drop=FALSE], fams[,4:5,drop=FALSE],
                  stringsAsFactors=FALSE)

# highlight differences in bold
for(i in 1:nrow(tab4_new)) {
    if(tab4_new[i,2] != tab4_new[i,3])
        tab4_new[i,3] <- paste0("\\textbf{\\color{red} ", tab4_new[i,3], "}")
    if(tab4_new[i,4] != tab4_new[i,5])
        tab4_new[i,5] <- paste0("\\textbf{\\color{red} ", tab4_new[i,5], "}")
    if(tab4_new[i,6] != tab4_new[i,8])
        tab4_new[i,8] <- paste0("\\textbf{\\color{red} ", tab4_new[i,8], "}")
    if(tab4_new[i,7] != tab4_new[i,9])
        tab4_new[i,9] <- paste0("\\textbf{\\color{red} ", tab4_new[i,9], "}")
}

# reduce group names
tab4_new[1,1] <- paste0(substr(tab4_new[1,1], 1, 24), "...")
tab4_new[4,1] <- paste0(substr(tab4_new[4,1], 1, 27), "...")
tab4_new[6,1] <- paste0(substr(tab4_new[6,1], 1, 27), "...")

# rearrange columns
tab4_new <- tab4_new[,c(1,2,3,4,6,7,5,8,9)]

tab4_new[,6] <- paste0(tab4_new[,6], ")")
tab4_new[,9] <- paste0(tab4_new[,9], ")")

xtab4 <- xtable(tab4_new, digits=0,
               align=c("c", "l", rep("c", 2), rep(c("c", "@{ (}r@{--}", "l"), 2)))

addtorow <- list()
addtorow$pos <- list(0,0)
addtorow$command <- c(paste("& \\multicolumn{2}{c}{Probability enriched (\\%)} &",
                            "\\multicolumn{6}{c}{Est. \\% essential}\\\\"),
                      paste("Functional group & original & reproduction &",
                            "\\multicolumn{3}{c}{original} & \\multicolumn{3}{c}{reproduction}\\\\"))

print(xtab4, file="Tabs/tab4.tex",
      include.rownames=FALSE,
      include.colnames=FALSE,
      sanitize.text.function=function(x) gsub("%", "\\%", x, fixed=TRUE),
      add.to.row=addtorow,
      floating=FALSE)

Figures

We now turn to reconstruction of the Figures.

Figure 1b

Let us first consider Figure 1b, which shows the transposon insertion sites on the circular genome of M. tuberculosis.

The code was not contained within the project directory, but I found it in a separate directory of files for a talk I gave on the work in 2002-2003.

rm(list=ls())
setwd("R")

mygeneloc <- read.csv("../Data/mygeneloc.csv")
load("nov02_data.RData")

source("circlefig.R")

setwd("..")

The code creates a file Figs/circlefig.ps. To generate the figure within the current document, we can read in the code, comment out the postscript() and dev.off() lines, and then execute them, as follows.

mygeneloc <- read.csv("Data/mygeneloc.csv")
load("R/nov02_data.RData")

# read circlefig code
circlefig <- readLines("R/circlefig.R")
# comment out postscript() and dev.off()
circlefig <- sub("postscript(", "#postscript(", circlefig, fixed=TRUE)
circlefig <- sub("dev.off(", "#dev.off(", circlefig, fixed=TRUE)

# run the code here
eval(parse(text=circlefig))

The reconstructed figure appears to be the same as the published figure.

Figures 1a and 2

The code to create Figure 1a (with the transposon insertion locations) and Figure 2 (probability essential vs number of insertion sites, for each gene) were together in a single file, Nov02/R/figs4paper.R.

It is a simple matter to run the script, which generates postscript files Figs/fig1.ps and Figs/fig2.ps. Points in Figure 2 with probability 0 are jittered vertically, so we set the random number seed here.

rm(list=ls())
load("R/sept02_data.RData")
load("R/nov02_data.RData")
load("R/nov02_results.RData")

set.seed(33359189)
source("R/figs4paper.R")

We can again use the trick of reading in the code, commenting out the postscript() and dev.off(), and running it, to have the figures appear within the present document. Here is Figure 1a.

rm(list=ls())
load("R/sept02_data.RData")
load("R/nov02_data.RData")
load("R/nov02_results.RData")

figs <- readLines("R/figs4paper.R")

# comment out postscript() and dev.off()
figs <- sub("postscript(", "#postscript(", figs, fixed=TRUE)
figs <- sub("dev.off(", "#dev.off(", figs, fixed=TRUE)

# run the code here, just for figure 1a
eval(parse(text=figs[1:13]))

This looks the same as the published figure.

And here is Figure 2.

set.seed(33359189)
eval(parse(text=figs[-(1:13)]))

Again, it looks the same as the published figure.

Verify Figure 2 results

We can verify the detailed results in Figure 2 by comparing the present results to those that were saved in the original project repository, included in the present repository as Nov02/R/finalresults.rds.

Here is a plot of the differences (original - reconstructed) in the estimated probabilities of genes being essential.

load("R/nov02_results.RData")
finalres_orig <- readRDS("../original/Nov02/R/finalresults.rds")

par(las=1, mar=c(4.1, 4.1, 0.6, 0.6))
grayplot(numTAs,
         (finalres_orig$bygene - finalres$bygene)*100,
         xlab="Number of TAs in proximal portion of gene",
         ylab="Difference in % probability (original - reproduction)")

The median difference is 0.07. So the reconstructed probabilities in Figure 2 have changed < 0.1%, which is immaterial and likely due to MCMC sampling error.

Let me clarify: the positive values in Figure 2 are like 0.35–0.95, and we have reproduced them to within about 0.001. As percentages, they are like 35-95%, and we have got them nailed down to about 0.1%.

I’ll also save this figure as a PDF for the paper.

pdf("Figs/fig2_diff.pdf", height=4.5, width=6.5, pointsize=10)
load("R/nov02_results.RData")
finalres_orig <- readRDS("../original/Nov02/R/finalresults.rds")

par(las=1, mar=c(4.1, 4.1, 0.6, 0.6))
grayplot(numTAs,
         (finalres_orig$bygene - finalres$bygene)*100,
         xlab="Number of TAs in proximal portion of gene",
         ylab="Difference in % probability (original - reproduction)")
dev.off()
## png 
##   2

Summary

The main difficulties in reproducing the results in the paper concerned poor documentation and organization of the project data and code.

  • I should have included a ReadMe file that described the structure of the project.

  • I should have revised the scripts to give just the final published analyses, rather than create multiple mutated copies with some results pulled from one set of scripts and other results pulled from other scripts.

  • I should have saved the random number generator seeds.

  • I should have saved the key intermediate results to files and then loaded the files at the top of scripts that needed them, rather than rely on particular objects being present in the R environment.

  • I should have indicated (and loaded) the required R libraries.

Nevertheless, I was able to reconstruct the results to within a small and immaterial level of MCMC sampling error. I needed to make just a few small changes to the code.

  • Cutoffs for determining which genes and gene families to display in Tables 3 and 4 needed to be modified slightly.

  • One vector that was derived using the R function table() needed to be converted from class "table" to a simple numeric vector. This is due to some change in R between version 1.5.1 (July 2002) and 3.6.2 (current).

I was not able to reproduce Figure 3, however. The simulation results and the code to create the figure are available, but I could not find the code that was used to perform the simulations.

Session info

Here is information about the version of R and R packages that I used.

devtools::session_info()
## ─ Session info ───────────────────────────────────────────────────────────────
##  setting  value                       
##  version  R version 4.0.2 (2020-06-22)
##  os       Pop!_OS 20.04 LTS           
##  system   x86_64, linux-gnu           
##  ui       X11                         
##  language en_US:en                    
##  collate  en_US.UTF-8                 
##  ctype    en_US.UTF-8                 
##  tz       America/Chicago             
##  date     2020-07-24                  
## 
## ─ Packages ───────────────────────────────────────────────────────────────────
##  package     * version date       lib source        
##  assertthat    0.2.1   2019-03-21 [1] CRAN (R 4.0.0)
##  backports     1.1.8   2020-06-17 [1] CRAN (R 4.0.1)
##  broman      * 0.70-4  2020-05-22 [1] CRAN (R 4.0.0)
##  callr         3.4.3   2020-03-28 [1] CRAN (R 4.0.0)
##  cellranger    1.1.0   2016-07-27 [1] CRAN (R 4.0.0)
##  checkmate     2.0.0   2020-02-06 [1] CRAN (R 4.0.0)
##  cli           2.0.2   2020-02-28 [1] CRAN (R 4.0.0)
##  colorspace    1.4-1   2019-03-18 [1] CRAN (R 4.0.0)
##  crayon        1.3.4   2017-09-16 [1] CRAN (R 4.0.0)
##  desc          1.2.0   2018-05-01 [1] CRAN (R 4.0.0)
##  devtools    * 2.3.1   2020-07-21 [1] CRAN (R 4.0.2)
##  digest        0.6.25  2020-02-23 [1] CRAN (R 4.0.0)
##  dplyr         1.0.0   2020-05-29 [1] CRAN (R 4.0.0)
##  ellipsis      0.3.1   2020-05-15 [1] CRAN (R 4.0.0)
##  evaluate      0.14    2019-05-28 [1] CRAN (R 4.0.0)
##  fansi         0.4.1   2020-01-08 [1] CRAN (R 4.0.0)
##  fs            1.4.2   2020-06-30 [1] CRAN (R 4.0.2)
##  generics      0.0.2   2018-11-29 [1] CRAN (R 4.0.0)
##  ggplot2       3.3.2   2020-06-19 [1] CRAN (R 4.0.1)
##  glue          1.4.1   2020-05-13 [1] CRAN (R 4.0.0)
##  gt          * 0.2.1   2020-05-23 [1] CRAN (R 4.0.0)
##  gtable        0.3.0   2019-03-25 [1] CRAN (R 4.0.0)
##  htmltools     0.5.0   2020-06-16 [1] CRAN (R 4.0.1)
##  knitr         1.29    2020-06-23 [1] CRAN (R 4.0.2)
##  lifecycle     0.2.0   2020-03-06 [1] CRAN (R 4.0.0)
##  magrittr      1.5     2014-11-22 [1] CRAN (R 4.0.0)
##  memoise       1.1.0   2017-04-21 [1] CRAN (R 4.0.0)
##  munsell       0.5.0   2018-06-12 [1] CRAN (R 4.0.0)
##  negenes     * 1.0-12  2019-08-05 [1] CRAN (R 4.0.0)
##  pillar        1.4.6   2020-07-10 [1] CRAN (R 4.0.2)
##  pkgbuild      1.1.0   2020-07-13 [1] CRAN (R 4.0.2)
##  pkgconfig     2.0.3   2019-09-22 [1] CRAN (R 4.0.0)
##  pkgload       1.1.0   2020-05-29 [1] CRAN (R 4.0.0)
##  prettyunits   1.1.1   2020-01-24 [1] CRAN (R 4.0.0)
##  processx      3.4.3   2020-07-05 [1] CRAN (R 4.0.2)
##  ps            1.3.3   2020-05-08 [1] CRAN (R 4.0.0)
##  purrr         0.3.4   2020-04-17 [1] CRAN (R 4.0.0)
##  R.methodsS3 * 1.8.0   2020-02-14 [1] CRAN (R 4.0.0)
##  R.oo        * 1.23.0  2019-11-03 [1] CRAN (R 4.0.0)
##  R.utils     * 2.9.2   2019-12-08 [1] CRAN (R 4.0.0)
##  R6            2.4.1   2019-11-12 [1] CRAN (R 4.0.0)
##  Rcpp          1.0.5   2020-07-06 [1] CRAN (R 4.0.2)
##  readxl      * 1.3.1   2019-03-13 [1] CRAN (R 4.0.0)
##  remotes       2.2.0   2020-07-21 [1] CRAN (R 4.0.2)
##  rlang         0.4.7   2020-07-09 [1] CRAN (R 4.0.2)
##  rmarkdown     2.3     2020-06-18 [1] CRAN (R 4.0.1)
##  rprojroot     1.3-2   2018-01-03 [1] CRAN (R 4.0.0)
##  sass          0.2.0   2020-03-18 [1] CRAN (R 4.0.0)
##  scales        1.1.1   2020-05-11 [1] CRAN (R 4.0.0)
##  sessioninfo   1.1.1   2018-11-05 [1] CRAN (R 4.0.0)
##  stringi       1.4.6   2020-02-17 [1] CRAN (R 4.0.0)
##  stringr       1.4.0   2019-02-10 [1] CRAN (R 4.0.0)
##  testthat      2.3.2   2020-03-02 [1] CRAN (R 4.0.0)
##  tibble        3.0.3   2020-07-10 [1] CRAN (R 4.0.2)
##  tidyselect    1.1.0   2020-05-11 [1] CRAN (R 4.0.0)
##  usethis     * 1.6.1   2020-04-29 [1] CRAN (R 4.0.0)
##  vctrs         0.3.2   2020-07-15 [1] CRAN (R 4.0.2)
##  withr         2.2.0   2020-04-20 [1] CRAN (R 4.0.0)
##  xfun          0.16    2020-07-24 [1] CRAN (R 4.0.2)
##  xtable      * 1.8-4   2019-04-21 [1] CRAN (R 4.0.0)
##  yaml          2.2.1   2020-02-01 [1] CRAN (R 4.0.0)
## 
## [1] /home/kbroman/Rlibs
## [2] /usr/local/lib/R/site-library
## [3] /usr/lib/R/site-library
## [4] /usr/lib/R/library