I was an early adopter of R, having first learned S (yay!) and then S-plus (yuck!). But at times my knowledge of R seems stuck in 2001. I keep finding out about “new” R functions (like replicate, which was new in 2003).

This is a tutorial for people like me, or people who were taught by people like me.

Switch to knitr

If you use Sweave, it’s time you switched to knitr. You’ll find that the transition is easy.

A number of Sweave annoyances have been eliminated, but most importantly you can use knitr with R Markdown or with AsciiDoc for writing simple reports. The markup is much simpler than LaTeX, and you don’t have to worry about page breaks.

Learn Hadley Wickham’s packages

Start with dplyr, tidyr, purrr, and ggplot2.

These are the main packages for what’s now called the “tidyverse”, which has grown beyond Hadley. Also check out

  • lubridate for handling dates
  • stringr for handling strings
  • forcats for handling factors
  • readr for reading csv/tsv files
  • readxl for reading Excel files
  • broom for tidying statistical analysis objects

For R package development, check out devtools, roxygen2, testthat, and assertthat.

Also, read his books: Advanced R, R Packages, R for Data Science, and ggplot2 (2nd edition).

Adopt the pipe operator

When you adopt Hadley’s dplyr and tidyr tools, you’ll want to also adopt the pipe operator %>%, from magrittr.

You’re old school, so you’re used to writing stuff like this:

x <- c(0.109, 0.359, 0.63, 0.996, 0.515, 0.142, 0.017, 0.829, 0.907)
round(exp(diff(log(x))), 1)

Seems perfectly fine, but note how it’s read from the inside out. With the pipe operator, you can do the same series of steps, written in the order that they’re actually performed.

library(magrittr)
x %>% log %>%
    diff %>%
    exp %>%
    round(1)

The pipe operator does some magic that makes the bit on the left be the first argument of the function call on the right.

If you need the bit on the left of the pipe to be somewhere other than the first argument, you can use a period. For example, here’s a wacky way to get the log (base 2) of 5.

2 %>% log(5, base=.)

Note: Jenny Bryan suggests that we use the parentheses on the functions even when they’re not formally required, like this:

library(magrittr)
x %>% log() %>%
    diff() %>%
    exp() %>%
    round(1)
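
As a quick check on the claims above, here is a minimal sketch that compares the nested and piped forms and verifies the period placeholder. The variable names (nested, piped, log2_of_5) are just for illustration, and it assumes magrittr is installed.

```r
library(magrittr)

x <- c(0.109, 0.359, 0.63, 0.996, 0.515, 0.142, 0.017, 0.829, 0.907)

# Old-school nested calls, read from the inside out
nested <- round(exp(diff(log(x))), 1)

# The same steps, written in the order they are performed
piped <- x %>%
    log() %>%
    diff() %>%
    exp() %>%
    round(1)

# Both forms should give exactly the same vector
identical(nested, piped)

# With a period, the left-hand side goes where the period is
# rather than into the first argument
log2_of_5 <- 2 %>% log(5, base = .)
all.equal(log2_of_5, log(5, base = 2))
```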

Consider RStudio

If you’re still using the R GUI (for Windows or Mac), you should switch to RStudio. Everything about it is better.

Personally, I stick with Emacs + ESS, because I’m writing code in multiple languages (not just R). (Another IDE option for R that many recommend: Eclipse with StatET.)

But I use RStudio for teaching, both for my own demonstrations and for the students’ work; it’s the best environment for learning R.

RStudio also makes it easy to use knitr with Markdown and to develop R packages, and it has some nice debugging features, like the ability to set breakpoints.

RStudio, the company, produces a number of other great tools, like shiny.

CRAN is huge, and there’s also GitHub

CRAN has over 8000 packages, with lots of great stuff like data.table, magrittr, RSQLite, XML, animation, and slidify.

And there are even more packages that live on GitHub (solely, or in addition to CRAN). With the install_github() function in the devtools package, you can skip CRAN and install packages straight from GitHub. devtools also has install_bitbucket() for installing from Bitbucket.

I’d better mention Bioconductor; oodles of bioinformatics/genomics-related packages live there rather than CRAN.

And while I’m talking packages, I should mention ROpenSci, an effort to create packages to access all kinds of data repositories from R. Take a look at their list.

You can put underscores in names

It used to be that _ was a shortcut for <-. (That was always a bad idea. And it led me to use dots in function names, like calc.genoprob, which has been problematic due to the S3 class system.)

Then they started allowing = in place of <-.

And then they got rid of _ as a shortcut for <-. Good idea, and now we can have functions named like calc_genoprob.
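
To see why dots in function names can clash with the S3 class system, here is a small sketch. The function names mean.square and mean_square and the class "square" are made up for illustration; the point is only that a name of the form generic.class can get picked up as a method.

```r
# Intended as an ordinary helper: the mean of the squared values.
# But the name looks like a method for the generic mean() and class "square".
mean.square <- function(x, ...) sum(unclass(x)^2) / length(x)

# An unrelated object that happens to have class "square"
obj <- structure(c(1, 2, 3), class = "square")

# mean() dispatches on the class and calls mean.square(), not the usual mean
dispatched <- mean(obj)
plain <- mean(unclass(obj))

# With an underscore, the name cannot be mistaken for an S3 method
mean_square <- function(x) sum(unclass(x)^2) / length(x)
```

Here dispatched and plain differ, even though the author of mean.square never meant to define a method.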

Read about new features

Read about new features in R here.

Also look at what was new in older versions, even older versions, and yet older versions.

New apply-type functions

You probably know about apply, lapply, sapply, and tapply. But did you know about vapply and mapply? And how about replicate?
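
In case these three are unfamiliar, here is a minimal base-R sketch of each. The toy inputs and variable names are just for illustration.

```r
# vapply: like sapply, but you declare the type and length of each result,
# so you get an error rather than a surprise if a result has the wrong shape
squares <- vapply(1:5, function(i) i^2, numeric(1))

# mapply: apply a function over several vectors in parallel,
# here pairing up 1 with 4, 2 with 5, and 3 with 6
sums <- mapply(function(a, b) a + b, 1:3, 4:6)

# replicate: evaluate an expression repeatedly, typically for simulations
set.seed(20030101)
sim_means <- replicate(100, mean(rnorm(10)))
```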

For truly modern functional programming in R, check out the purrr package (part of the tidyverse).

Parallel and Rcpp

Look at the parallel package, and perhaps read the Parallel R book.
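
For a first taste of the parallel package, here is a minimal sketch using a small local cluster. It assumes your machine can start two worker processes; the toy function and variable names are just for illustration.

```r
library(parallel)

# Start two local worker processes (an assumption; adjust to your machine)
cl <- makeCluster(2)

# parSapply is the cluster analogue of sapply
par_result <- parSapply(cl, 1:8, function(i) i^2)

# Always shut the workers down when you are done
stopCluster(cl)

# The serial version, for comparison
serial_result <- sapply(1:8, function(i) i^2)
identical(par_result, serial_result)
```

For a toy task like this one, the overhead of starting the workers will outweigh any gain; the approach pays off when each call does real work.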

Also look at Rcpp, a simpler way to call C/C++ functions from R. Read the Rcpp book.
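
To give a sense of how little ceremony is involved, here is a minimal sketch using cppFunction() from Rcpp. It assumes Rcpp is installed and that you have a working C++ compiler; the function name sum_squares is made up for illustration.

```r
library(Rcpp)

# Compile a small C++ function and make it callable from R
cppFunction('
double sum_squares(NumericVector x) {
  int n = x.size();
  double total = 0.0;
  for (int i = 0; i < n; ++i) {
    total += x[i] * x[i];
  }
  return total;
}')

# Call it like any other R function
result <- sum_squares(c(1, 2, 3))
```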

Various

I searched through the [NEWS](https://cran.r-project.org/src/base/NEWS) files (mentioned above) and wrote down some of the functions that were new since 2002.

(Note that I have little experience with many of these, and some are not entirely recommended. For example, rickyars noted that inner_join and left_join in dplyr can be 10× faster than merge. Ben Bolker recommends the plyr::r*ply functions over replicate, as you get to define the return structure.)

Vectorize

which.min, which.max

stopifnot

strwrap

unsplit

rowSums, colSums, rowMeans, colMeans

rowsum

slice.index

runmed

addmargins

head, tail

arrayInd

droplevels

saveRDS, readRDS

paste0

anyNA

aggregate, by

merge

with

stack, reshape, relist
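
Here is a small sketch that exercises a handful of the functions in this list, using only base R. The toy data and variable names are just for illustration.

```r
# rowSums, colMeans: fast row and column summaries of a matrix
m <- matrix(1:6, nrow = 2)
row_totals <- rowSums(m)
col_means <- colMeans(m)

# which.max: the position of the largest value (keeps the name, if any)
scores <- c(a = 3, b = 9, c = 5)
best <- names(which.max(scores))

# paste0: paste with no separator
labels <- paste0("chr", 1:3)

# anyNA: a quicker way to write any(is.na(x))
has_missing <- anyNA(c(1, NA, 3))

# droplevels: remove factor levels that are no longer used
f <- factor(c("x", "y", "z"))
f_sub <- droplevels(f[f != "z"])

# saveRDS, readRDS: save a single object and read it back under any name
tmp <- tempfile(fileext = ".rds")
saveRDS(scores, tmp)
scores_again <- readRDS(tmp)
unlink(tmp)

# stopifnot: a compact way to assert that things are as expected
stopifnot(identical(scores, scores_again))
```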


The source for this tutorial is on GitHub.

I would be glad for suggestions, corrections, or additions.

Also see my tutorials on git/GitHub, knitr, make, R packages, data organization, and initial steps towards reproducible research.