initial steps toward reproducible research

A minimal standard for data analysis and other scientific computations is that they be reproducible: the code and data are assembled so that another group can re-create all of the results (e.g., the figures in a paper). Adopting a workflow that makes your results reproducible will ultimately make your life easier; if a problem (or question) arises somewhere down the line, it will be much easier to correct (or explain).

Organizing analyses so that they are reproducible is not easy. It requires diligence and a considerable investment of time: to learn new computational tools, and to organize and document analyses as you go.

But partially reproducible is better than not at all reproducible. Just try to make your next paper or project better organized than the last.

There are many paths toward reproducible research, and you shouldn’t try to change every aspect of your current practice at once. Identify one weakness, adopt an improved approach, refine that a bit, and then move on to the next thing.

Inspired by Christine Bahlai’s “Baby steps for the open-curious,” I thought I’d write some suggestions for the initial steps to take toward making one’s work reproducible.

I’m focusing primarily on R, because that’s what I know, but I’ll try to at least sketch what a Python person might do.

Again, you shouldn’t try to do these things all at once; start with one, or part of one. Then in your next project, do that plus another thing.

The source for this minimal tutorial is on GitHub. If you have suggestions for changes or improvements, please submit an issue.

Also see my tutorials on git/GitHub, GNU make, knitr, R packages, data organization, and making a web site with GitHub Pages.