reading/writing biggish data, revisited

Matt Dowle encouraged me to follow up on my post about sqlite, feather, and fst. One thing to emphasize is that saveRDS, by default, uses compression. If you use compress=FALSE you can skip that and it goes much faster. See, for example, his post on “Fast csv writing for R”. Also see his slides from a recent presentation on parallel fread.

I’ll first generate the same data that I was using before. And note, as @shabbychef mentioned on twitter, my iid simulations mean that compression isn’t likely to be useful, as we saw in my previous post. So don’t assume that these results apply generally; compression is useful much of the time.

n_ind <- 500
n_snps <- 1e5
ind_names <- paste0("ind", 1:n_ind)
snp_names <- paste0("snp", 1:n_snps)
sigX <- matrix(rnorm(n_ind*n_snps), nrow=n_ind)
sigY <- matrix(rnorm(n_ind*n_snps), nrow=n_ind)
dimnames(sigX) <- list(ind_names, paste0(snp_names, ".X"))
dimnames(sigY) <- list(ind_names, paste0(snp_names, ".Y"))
db <- cbind(data.frame(id=ind_names, stringsAsFactors=FALSE),
            sigX, sigY)

Now, let’s look at the time to write an RDS file, when compressed and when not. I’m again going to cache my results and just tell you what happened.

rds_file <- "db.rds"
saveRDS(db, rds_file, compress=FALSE)
rds_comp_file <- "db_comp.rds"
saveRDS(db, rds_comp_file)
db_copy1 <- readRDS(rds_file)
db_copy2 <- readRDS(rds_comp_file)

Writing the data to an RDS file took 5.5 sec when uncompressed and 51.4 sec when compressed. Reading them back in took 2.4 sec for the uncompressed file and 11.0 sec for the compressed file. The uncompressed RDS file was 805 MB, while the compressed one was 769 MB.

So, holy crap reading and writing the RDS files is fast when you use compress=FALSE. Don’t tell your system administrator I said this, but if you’re working on a server with loads of disk space, for sure go with compress=FALSE with your RDS files. On your laptop where uncompressed RDS files might get in the way of your music and movie libraries, you might want to use the compression.

How about CSV?

Dirk Eddelbuettel suggested that I might just use a plain CSV file, since data.table::fread and data.table::fwrite are so fast. How fast?

To make use of the multi-threaded version of data.table’s fread, I need version 1.10.5 which is on GitHub. The version on CRAN (1.10.4) has multi-threaded fwrite but only single-threaded fread.

But the GitHub version needs to be compiled with OpenMP, and after a lot of screwing around to do that, I ended up getting segfaults from fwrite, so I just dumped this plan.

So we’ll look at multi-threaded fwrite but only single-threaded fread. But we can all look forward to the multi-threaded fread in the near future.

For fwrite, the number of threads is controlled by the argument nThread. The default is to call data.table::getDTthreads() which detects the maximum number of cores. On my Mac desktop at work, that’s 24. I’m going to hard-code it in.

csv_file <- "db.csv"
library(data.table)
fwrite(db, csv_file, quote=FALSE)
db_copy3 <- data.table::fread(csv_file)

That took 41.6 sec to write and 55.0 sec to read, and the file size is 1818 MB.

How about if I set nThread=1 with fwrite?

fwrite(db, csv_file, quote=FALSE, nThread=1)

Single-threaded, fwrite took 69.1 sec.

But the data set is 500 rows by 200k columns. How about if I used the transpose?

t_db <- cbind(data.frame(snp=rep(snp_names, 2),
                         signal=rep(c("X", "Y"), each=n_snps),
                         stringsAsFactors=FALSE),
              rbind(t(sigX), t(sigY)))

Now to write and read this.

csv_t_file <- "db_t.csv"
fwrite(t_db, csv_t_file, quote=FALSE, nThread=24)
t_db_copy <- fread(csv_t_file)

That took 8.3 sec to write and 26.6 sec to read, and the file size is 1818 MB.

And how about if I do fwrite single-threaded?

fwrite(t_db, csv_t_file, quote=FALSE, nThread=1)

Single-threaded, the transposed data took 30.2 sec to write.

(I’m not even going to try read.csv and write.csv. I’ll leave that to the reader.)

Here’s a summary of the times:

function	method	data size	time (s)
saveRDS	not compressed	500 × 200k	5.5
saveRDS	compressed	500 × 200k	51.4
fwrite	24 threads	500 × 200k	41.6
fwrite	1 thread	500 × 200k	69.1
fwrite	24 threads	200k × 500	8.3
fwrite	1 thread	200k × 500	30.2
readRDS	not compressed	500 × 200k	2.4
readRDS	compressed	200k × 500	11.0
fread	1 thread	500 × 200k	55.0
fread	1 thread	200k × 500	26.6

For sure, fread and fwrite are impressive. And I’d never have thought you could get advantage from parallel reads and writes.

I’m going to stick with RDS (making use of compress=FALSE when I don’t care much about disk space) when I want to read/write whole files from R. And I’ll go with SQLite, feather, or fst when I want super fast access to a single row or column. But I also do a lot of reading and writing of CSV files, and I’ve enjoyed data.table::fread and will now be using data.table::fwrite, too.