A course in statistical programming - the stupidest thing...

Graduate students in statistics often take (or at least have the opportunity to take) a statistical computing course, but such courses tend to focus on methods (like numerical linear algebra, the EM algorithm, and MCMC) rather than on actual coding.

For example, here’s a course in “advanced statistical computing” that I taught at Johns Hopkins back in 2001.

Many (perhaps most) good programmers learned to code outside of formal courses. But many statisticians are terrible programmers and would benefit from a formal course.

Moreover, applied statisticians spend the vast majority of their time interacting with a computer and would likely benefit from more formal presentations of how to do it well. And I think this sort of training is particularly important for ensuring that research is reproducible.

One really learns to code in private, struggling over problems, but I benefited enormously from a statistical computing course I took from Phil Spector at Berkeley.

Brian Caffo, Ingo Ruczinski, Roger Peng, Rafael Irizarry, and I developed a statistical programming course at Hopkins that (I think) really did the job.

I would like to develop a similar course at Wisconsin: on statistical programming, in the most general sense.

I have in mind several basic principles:

be self-sufficient
get the right answer
document what you did (so that you will understand what you did 6 months later)
if primary data change, be able to re-run the analysis without a lot of work
make your simulation results reproducible
reuse code (others' and your own) rather than starting from scratch every time
make methods accessible to (and used by) others
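
To make the point about reproducible simulations concrete, here is a minimal sketch in Python (my choice for illustration; the same idea applies in R with set.seed). The function names and the "base seed plus replicate number" scheme are assumptions made up for this example, not a recommended standard. Each replicate gets its own explicitly seeded generator, and the seed is stored next to the result so that any single replicate can be re-run later.

```python
import random
import statistics


def simulate_mean(n_obs, seed):
    """Return the mean of n_obs standard-normal draws from a seeded generator."""
    rng = random.Random(seed)  # private generator, so no hidden global state
    draws = [rng.gauss(0.0, 1.0) for _ in range(n_obs)]
    return statistics.fmean(draws)


def run_simulation(n_reps, n_obs, base_seed):
    """Run n_reps replicates and return a list of (seed, result) pairs."""
    results = []
    for rep in range(n_reps):
        # Hypothetical seeding scheme for illustration: one seed per replicate.
        seed = base_seed + rep
        results.append((seed, simulate_mean(n_obs, seed)))
    return results


if __name__ == "__main__":
    for seed, mean in run_simulation(n_reps=3, n_obs=1000, base_seed=20240101):
        print(f"seed={seed} mean={mean:.4f}")
```

Running the script twice gives identical output, and a surprising replicate can be investigated on its own by calling simulate_mean with the recorded seed.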

Here are my current thoughts about the topics to include in such a course. The key aim would be to make students aware of the basic principles and issues: to give them a good base from which to learn on their own. Homework would include interesting and realistic programming assignments, plus creating a Sweave-type document and an R package.

Basic Unix tools (find; df; top; ps ux; grep); Unix on Mac and Windows
Emacs/vim/other editors (RStudio/Eclipse)
LaTeX (for papers; for presentations)
slides for talks; posters; figures/tables
Advanced R (fancy data structures; functions; object-oriented stuff)
Advanced R graphics
R packages
Sweave/asciidoc/knitr
minimal Perl (or Python or Ruby); example of data manipulation
Minimal C (or C++); examples of speed-up
version control (e.g., git or mercurial); backups
reproducible research ideas
data management
managing projects: data, analyses, results, papers
programming style (readable, modular); general but not too general
debugging/profiling/testing
high-throughput computing; parallel computing; managing big jobs
finding answers to questions: man pages; documentation; web
more on visualization; dynamic graphics
making a web page; HTML & CSS; simple CGI-type web forms?
writing and managing email
managing references to journal articles
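
As an illustration of the "example of data manipulation" item in the list above, here is the sort of small task I have in mind, sketched in Python using only the standard library. The data, the column names, and the function name are all made up for this example; the point is just to read a small delimited file, skip missing values, and summarize by group.

```python
import csv
import io
from collections import defaultdict
from statistics import fmean

# Made-up example data, standing in for a raw file from a collaborator.
RAW = """id,group,weight
1,control,20.5
2,treated,23.1
3,control,NA
4,treated,24.9
5,control,21.5
"""


def group_means(text, group_col, value_col, missing="NA"):
    """Return {group: mean of value_col}, skipping rows with a missing value."""
    values = defaultdict(list)
    for row in csv.DictReader(io.StringIO(text)):
        if row[value_col] == missing:
            continue  # drop missing observations rather than guessing a value
        values[row[group_col]].append(float(row[value_col]))
    return {group: fmean(vals) for group, vals in values.items()}


if __name__ == "__main__":
    for group, mean in sorted(group_means(RAW, "group", "weight").items()):
        print(f"{group}\t{mean:.2f}")
```

With a real file, the io.StringIO wrapper would be replaced by an open file handle; everything else stays the same.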