hipsteR: re-educating people who learned R before it was cool
I was an early adopter of R, having first learned S (yay!) and then S-plus (yuck!). But at times my knowledge of R seems stuck in 2001. I keep finding out about “new” R functions (like replicate, which was new in 2003).
This is a tutorial for people like me, or people who were taught by people like me.
Switch to knitr
If you use Sweave, it’s time you switched to knitr. You’ll find that the transition is easy.
A number of Sweave annoyances have been eliminated, but most importantly you can use knitr with R Markdown or with AsciiDoc for writing simple reports. The markup is much simpler than LaTeX, and you don’t have to worry about page breaks.
Learn Hadley Wickham’s packages
Start with dplyr, tidyr, purrr, and ggplot2.
These are the main packages for what’s now called the “tidyverse”, which has grown beyond Hadley. Also check out
- lubridate for handling dates
- stringr for handling strings
- forcats for handling factors
- readr for reading csv/tsv files
- readxl for reading Excel files
- broom for tidying statistical analysis objects
For R package development, check out devtools, roxygen2, testthat, and assertthat.
Also, read his books: Advanced R, R packages, R for Data Science, and ggplot2 (2nd edition).
Adopt the pipe operator
When you adopt Hadley’s dplyr and tidyr tools, you’ll want to also adopt the pipe operator %>%, from magrittr.
You’re old school, so you’re used to writing stuff like this:
x <- c(0.109, 0.359, 0.63, 0.996, 0.515, 0.142, 0.017, 0.829, 0.907)
round(exp(diff(log(x))), 1)
Seems perfectly fine, but note how it’s read from the inside out. With the pipe operator, you can do the same series of steps, written in the order that they’re actually performed.
library(magrittr)
x %>% log %>%
diff %>%
exp %>%
round(1)
The pipe operator does some magic that makes the bit on the left be the first argument of the function call on the right.
If you need the bit on the left of the pipe to be somewhere other than the first argument, you can use a period. For example, here’s a wacky way to get the log (base 2) of 5.
2 %>% log(5, base=.)
Note: Jenny Bryan suggests that we use the parentheses on the functions even when they’re not formally required, like this:
library(magrittr)
x %>% log() %>%
diff() %>%
exp() %>%
round(1)
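As a quick check, here is a minimal sketch (assuming the magrittr package is installed) showing that the nested and piped forms give the same result, and that the period really does put the left-hand side into the named argument. The variable names nested, piped, and log2_of_5 are just for illustration.

```r
library(magrittr)

x <- c(0.109, 0.359, 0.63, 0.996, 0.515, 0.142, 0.017, 0.829, 0.907)

# nested calls, read from the inside out
nested <- round(exp(diff(log(x))), 1)

# the same steps, piped in the order they are performed
piped <- x %>% log() %>% diff() %>% exp() %>% round(1)

# the two forms should agree exactly
identical(nested, piped)

# the period sends the left-hand side to the base argument,
# so this is log(5, base = 2)
log2_of_5 <- 2 %>% log(5, base = .)
all.equal(log2_of_5, log2(5))
```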
Consider RStudio
If you’re still using the R GUI (for Windows or Mac), you should switch to RStudio. Everything about it is better.
Personally, I stick with Emacs + ESS, because I’m writing code in multiple languages (not just R). (Another IDE option for R that many recommend: Eclipse with StatET.)
But I use RStudio for teaching: for demonstrations, and I have the students use it; it’s the best environment for learning R.
And note that RStudio makes it easy to use knitr with Markdown, and to develop R packages. And RStudio also has some nice debugging features, like the ability to set breakpoints.
RStudio, the company, produces a number of other great tools, like shiny.
CRAN is huge, and there’s also GitHub
CRAN has over 8000 packages, with lots of great stuff like data.table, magrittr, RSQLite, XML, animation, and slidify.
And there are even more packages that live on GitHub (solely, or in addition to CRAN), and with the install_github() function in the devtools package, you can skip CRAN and install packages straight from GitHub. devtools also has an install_bitbucket() function for installing from Bitbucket.
I’d better mention Bioconductor; oodles of bioinformatics/genomics-related packages live there rather than CRAN.
And while I’m talking packages, I should mention rOpenSci, an effort to create packages to access all kinds of data repositories from R. Take a look at their list.
You can put underscores in names
It used to be that _ was a shortcut for <-. (That was always a bad idea. And it led me to use dots in function names, like calc.genoprob, which has been problematic due to the S3 class system.) Then they started allowing = in place of <-. And then they got rid of _ as a shortcut for <-. Good idea, and now we can have functions named like calc_genoprob.
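To see why dots in function names can clash with S3, here is a small sketch. S3 dispatch looks for functions named generic.class, so an ordinary helper with a dotted name can get picked up as a method by accident. The generic describe, the class "cross", and the helper names are all made up for illustration.

```r
# a toy S3 generic with a default method
describe <- function(x, ...) UseMethod("describe")
describe.default <- function(x, ...) "default method"

# meant as an ordinary helper function, but the dotted name makes it
# look like the describe method for objects of class "cross"
describe.cross <- function(x, ...) "helper picked up as a method"

obj <- structure(list(), class = "cross")

# S3 dispatch finds describe.cross rather than describe.default
result <- describe(obj)

# with an underscore there is no such ambiguity
describe_cross <- function(x, ...) "just a helper"
```

Here describe(obj) ends up calling describe.cross, whether or not that was the intent; describe_cross can never be mistaken for a method.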
Read about new features
Read about new features in R here.
Also look at what was new in older versions and even older versions and yet older versions.
New apply-type functions
You probably know about apply, lapply, sapply, and tapply. But did you know about vapply and mapply? And how about replicate?
For truly modern functional programming in R, check out the purrr package (part of the tidyverse).
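In case vapply, mapply, and replicate are new to you, here is a minimal base R sketch of each. The toy data and variable names are mine, chosen only for illustration.

```r
# vapply: like sapply, but you declare the type and length of each result,
# so you get an error rather than a surprise if something is off
words <- c("a", "bb", "ccc")
n_chars <- vapply(words, nchar, integer(1), USE.NAMES = FALSE)

# mapply: apply a function over several vectors in parallel,
# here adding the elements of 1:3 and 4:6 pairwise
sums <- mapply(function(a, b) a + b, 1:3, 4:6)

# replicate: evaluate an expression repeatedly (handy for simulations)
set.seed(1)
sim_means <- replicate(5, mean(rnorm(10)))
```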
Parallel and Rcpp
Look at the parallel package, and perhaps read the Parallel R book.
Also look at Rcpp, a simpler way to call C/C++ functions from R. Read the Rcpp book.
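To give a flavor of the parallel package, here is a minimal sketch using a small socket cluster. The choice of two workers and the toy squaring function are arbitrary assumptions for illustration; real work would use something slower than i^2.

```r
library(parallel)

# start a small cluster of two worker processes
cl <- makeCluster(2)

# parLapply is the parallel counterpart of lapply
squares <- parLapply(cl, 1:4, function(i) i^2)

# shut the workers down when finished
stopCluster(cl)

# parLapply returns a list, just as lapply does
squares <- unlist(squares)
```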
Various
I searched through the [NEWS](https://cran.r-project.org/src/base/NEWS) files (mentioned above) and wrote down some of the functions that were new since 2002. (Note that I have little experience with many of these, and some are not entirely recommended. For example, rickyars noted that inner_join and left_join in dplyr can be 10× faster than merge. Ben Bolker recommends the plyr::r*ply functions over replicate, as you get to define the return structure.)
rowSums, colSums, rowMeans, colMeans
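These four do the same job as the corresponding apply() calls, but are faster and easier to read. A small sketch with a toy matrix (the names m, rs, and cm are mine):

```r
# a 2 x 3 matrix, filled column by column
m <- matrix(1:6, nrow = 2)

# row sums, the old way and the newer way
rs_apply <- apply(m, 1, sum)
rs <- rowSums(m)

# column means, the old way and the newer way
cm_apply <- apply(m, 2, mean)
cm <- colMeans(m)

# the results agree with the apply() versions
all.equal(rs, rs_apply)
all.equal(cm, cm_apply)
```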
The source for this tutorial is on GitHub.
I would be glad for suggestions, corrections, or additions.
Also see my tutorials on git/GitHub, knitr, make, R packages, data organization, and initial steps towards reproducible research.