Automate the process
You have the full process encoded in scripts, and the data and code are nicely documented and organized. But there’ll be a lot of scripts. Your ReadMe file may explain how to run them, and in what order, but it’ll be a pain to do all of that. (And your ReadMe file might be a tad out of date!)
Ideally, the reproduction of your results is a one-button operation. And this is valuable not just for others, but also for yourself (or your future self). For example, if the primary data should change (and it often does), wouldn’t it be nice to have one command that re-runs everything?
You could do this with a single all-encompassing script. Even better, though, is to use GNU Make. I would argue that Make is the single most important tool for reproducible research.
GNU Make was written for compiling programs from their source code, but it can be used to coordinate any command-line driven process, such as the various scripts that underlie a data analysis project, or the construction of the figures and tables for a paper.
The beauty of Make is that it both automates a process and documents the dependencies in the project: this file is turned into that file with this line of code, and this graph is produced from those files with that script.
You create a file called Makefile and then type make on the command line. Here’s part of an example (for this paper):
rqtlexper.pdf: rqtlexper.tex rqtlexper.bib fig1.eps fig2.eps
	pdflatex rqtlexper
	bibtex rqtlexper
	pdflatex rqtlexper
	pdflatex rqtlexper

rqtlexper.tex: rqtlexper.Rnw Data/lines_code_detail.txt
	R -e 'library(knitr);knit("rqtlexper.Rnw")'

Data/lines_code_detail.txt: Data/lines_code_by_version.csv Python/countStuff.py
	Python/countStuff.py > Data/lines_code_detail.txt

fig1.eps: R/lodcurve_fig1.R
	cd R;R CMD BATCH lodcurve_fig1.R

fig2.eps: R/colors.R Data/lines_code_by_version.csv R/rqtl_lines_code.R
	cd R;R CMD BATCH rqtl_lines_code.R

Data/lines_code_by_version.csv: Perl/grab_lines_code.pl Data/versions.txt
	cd Perl;grab_lines_code.pl
Each batch of lines is like this:

targetfile: dependencies
	[code to make the target from the dependencies]

The code lines must begin with a tab character, not spaces.
In the example, there are bits of code for reformatting some data files, for creating the two figures for the paper, for converting a .Rnw file (LaTeX with R code chunks) into a .tex file, and for creating the final PDF for the paper.
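To see how the rules in the example document the project's dependencies, here is a rough Python sketch that copies the target and dependency lists into a dictionary and asks for an order in which the targets could be built. It is only an illustration: Make does not work through Python, and the variable names (rules, build_order) are made up for this sketch.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each key is a target; its value lists the files it is built from,
# copied from the rules in the example Makefile.
rules = {
    "rqtlexper.pdf": ["rqtlexper.tex", "rqtlexper.bib", "fig1.eps", "fig2.eps"],
    "rqtlexper.tex": ["rqtlexper.Rnw", "Data/lines_code_detail.txt"],
    "Data/lines_code_detail.txt": ["Data/lines_code_by_version.csv",
                                   "Python/countStuff.py"],
    "fig1.eps": ["R/lodcurve_fig1.R"],
    "fig2.eps": ["R/colors.R", "Data/lines_code_by_version.csv",
                 "R/rqtl_lines_code.R"],
    "Data/lines_code_by_version.csv": ["Perl/grab_lines_code.pl",
                                       "Data/versions.txt"],
}

# static_order() lists every file so that dependencies come before the
# things built from them.
order = list(TopologicalSorter(rules).static_order())

# Hand-written source files have no rule of their own; keep only the targets.
build_order = [name for name in order if name in rules]

for step, target in enumerate(build_order, start=1):
    print(step, target)
```

Whatever order comes out, Data/lines_code_by_version.csv appears before the two files built from it, and rqtlexper.pdf comes last, because every other target feeds into it directly or indirectly.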
An advantage of GNU Make is that it will just re-run the necessary bits, based on what files have changed. So if I’ve edited the text of the paper (in rqtlexper.Rnw) but haven’t changed the figures at all, the first two bits in the example will be re-run, but the figures won’t be reconstructed. If I just edit part of the bibliography (in rqtlexper.bib), only rqtlexper.pdf will be reconstructed; rqtlexper.tex won’t be.
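The rule behind this can be sketched in a few lines of Python: a target needs rebuilding if it is missing or if any of its dependencies has a newer modification time. This is a simplified illustration, not Make's actual implementation. The helper names (needs_rebuild, make_file) are invented here, the dependency lists are trimmed, and real Make also brings each dependency up to date before it checks the target.

```python
import os
import tempfile


def needs_rebuild(target, dependencies):
    """Return True if target is missing or older than any dependency."""
    if not os.path.exists(target):
        return True
    target_time = os.path.getmtime(target)
    return any(os.path.getmtime(dep) > target_time for dep in dependencies)


def make_file(directory, name, mtime):
    """Create an empty file with a fixed modification time (in seconds)."""
    path = os.path.join(directory, name)
    with open(path, "w"):
        pass
    os.utime(path, (mtime, mtime))
    return path


with tempfile.TemporaryDirectory() as workdir:
    # Empty stand-ins for files from the example; only timestamps matter.
    rnw = make_file(workdir, "rqtlexper.Rnw", 1000)
    bib = make_file(workdir, "rqtlexper.bib", 1000)
    tex = make_file(workdir, "rqtlexper.tex", 2000)
    pdf = make_file(workdir, "rqtlexper.pdf", 3000)

    # Each target is newer than its dependencies, so nothing needs doing.
    up_to_date = (not needs_rebuild(tex, [rnw])
                  and not needs_rebuild(pdf, [tex, bib]))

    # Simulate editing the bibliography after the PDF was built.
    os.utime(bib, (4000, 4000))
    pdf_stale = needs_rebuild(pdf, [tex, bib])  # the PDF rule would re-run
    tex_stale = needs_rebuild(tex, [rnw])       # the .tex rule would not

print(up_to_date, pdf_stale, tex_stale)
```

This matches the bibliography case described above: after rqtlexper.bib changes, only the check for rqtlexper.pdf reports that a rebuild is needed.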
To learn more about GNU Make, see my minimal make tutorial and the resources listed there. It’s quirky but hugely valuable.
Now go to the page about turning scripts into reproducible reports.