Matt Dowle encouraged me to follow up on my post about sqlite, feather, and fst. One thing to emphasize is that `saveRDS`, by default, uses compression. If you use `compress=FALSE`, you can skip that and it goes much faster. See, for example, his post on “Fast csv writing for R”. Also see his slides from a recent presentation on parallel `fread`.
I’ll first generate the same data that I was using before. And note, as @shabbychef mentioned on twitter, my iid simulations mean that compression isn’t likely to be useful, as we saw in my previous post. So don’t assume that these results apply generally; compression is useful much of the time.
# simulate 500 individuals x 200k columns of iid N(0,1) values
n_ind <- 500
n_snps <- 1e5
ind_names <- paste0("ind", 1:n_ind)
snp_names <- paste0("snp", 1:n_snps)
sigX <- matrix(rnorm(n_ind*n_snps), nrow=n_ind)
sigY <- matrix(rnorm(n_ind*n_snps), nrow=n_ind)
dimnames(sigX) <- list(ind_names, paste0(snp_names, ".X"))
dimnames(sigY) <- list(ind_names, paste0(snp_names, ".Y"))
db <- cbind(data.frame(id=ind_names, stringsAsFactors=FALSE),
            sigX, sigY)
Now, let’s look at the time to write an RDS file, when compressed and when not. I’m again going to cache my results and just tell you what happened.
rds_file <- "db.rds"
saveRDS(db, rds_file, compress=FALSE)   # uncompressed
rds_comp_file <- "db_comp.rds"
saveRDS(db, rds_comp_file)              # compressed (the default)
db_copy1 <- readRDS(rds_file)
db_copy2 <- readRDS(rds_comp_file)
Writing the data to an RDS file took 5.5 sec when uncompressed and 51.4 sec when compressed. Reading them back in took 2.4 sec for the uncompressed file and 11.0 sec for the compressed file. The uncompressed RDS file was 805 MB, while the compressed one was 769 MB.
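If you want to generate timings like these yourself rather than trust my cached results, a minimal sketch with base R’s `system.time()` and `file.size()` looks like this (the exact numbers will of course depend on your hardware and disk):

```r
# time each write/read; elapsed times will vary by machine
system.time(saveRDS(db, rds_file, compress=FALSE))  # write, uncompressed
system.time(saveRDS(db, rds_comp_file))             # write, compressed
system.time(readRDS(rds_file))                      # read, uncompressed
system.time(readRDS(rds_comp_file))                 # read, compressed
file.size(rds_file) / 1024^2                        # file sizes in MB
file.size(rds_comp_file) / 1024^2
```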
So, holy crap, reading and writing RDS files is fast when you use `compress=FALSE`. Don’t tell your system administrator I said this, but if you’re working on a server with loads of disk space, for sure go with `compress=FALSE` for your RDS files. On your laptop, where uncompressed RDS files might get in the way of your music and movie libraries, you might want to use the compression.
How about CSV?
Dirk Eddelbuettel suggested that I might just use a plain CSV file, since `data.table::fread` and `data.table::fwrite` are so fast. How fast?
To make use of the multi-threaded version of data.table’s `fread`, I need version 1.10.5, which is on GitHub. The version on CRAN (1.10.4) has multi-threaded `fwrite` but only single-threaded `fread`. But the GitHub version needs to be compiled with OpenMP, and after a lot of screwing around to do that, I ended up getting segfaults from `fwrite`, so I just dumped this plan.
So we’ll look at multi-threaded `fwrite` but only single-threaded `fread`. But we can all look forward to the multi-threaded `fread` in the near future.
For `fwrite`, the number of threads is controlled by the argument `nThread`. The default is to call `data.table::getDTthreads()`, which detects the maximum number of cores. On my Mac desktop at work, that’s 24. I’m going to hard-code it in.
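If you’d rather check or pin the thread count globally instead of passing `nThread` on every call, here’s a quick sketch using data.table’s `getDTthreads()` and `setDTthreads()` (behavior may differ a bit across versions):

```r
library(data.table)
getDTthreads()    # how many threads data.table will use by default
setDTthreads(24)  # pin it globally, instead of nThread= on each call
```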
csv_file <- "db.csv"
library(data.table)
fwrite(db, csv_file, quote=FALSE, nThread=24)
db_copy3 <- data.table::fread(csv_file)
That took 41.6 sec to write and 55.0 sec to read, and the file size is 1818 MB.
How about if I set `nThread=1` with `fwrite`?
fwrite(db, csv_file, quote=FALSE, nThread=1)
Single-threaded, `fwrite` took 69.1 sec.
But the data set is 500 rows by 200k columns. How about if I used the transpose?
# transposed version: 200k rows x 500 columns
t_db <- cbind(data.frame(snp=rep(snp_names, 2),
                         signal=rep(c("X", "Y"), each=n_snps),
                         stringsAsFactors=FALSE),
              rbind(t(sigX), t(sigY)))
Now to write and read this.
csv_t_file <- "db_t.csv"
fwrite(t_db, csv_t_file, quote=FALSE, nThread=24)
t_db_copy <- fread(csv_t_file)
That took 8.3 sec to write and 26.6 sec to read, and the file size is 1818 MB.
And how about if I do `fwrite` single-threaded?
fwrite(t_db, csv_t_file, quote=FALSE, nThread=1)
Single-threaded, the transposed data took 30.2 sec to write.
(I’m not even going to try `read.csv` and `write.csv`. I’ll leave that to the reader.)
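For the curious reader, the base-R versions would look something like the following (a sketch only; expect them to be much slower, though `read.csv` is helped considerably by a `colClasses` hint so it doesn’t have to guess column types):

```r
# base-R equivalents, for comparison; expect much longer times
write.csv(db, "db_base.csv", quote=FALSE, row.names=FALSE)
db_copy4 <- read.csv("db_base.csv", stringsAsFactors=FALSE,
                     colClasses=c("character", rep("numeric", 2*n_snps)))
```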
Here’s a summary of the times:
| function | method         | data size  | time (s) |
|----------|----------------|------------|----------|
| saveRDS  | not compressed | 500 × 200k | 5.5      |
| saveRDS  | compressed     | 500 × 200k | 51.4     |
| fwrite   | 24 threads     | 500 × 200k | 41.6     |
| fwrite   | 1 thread       | 500 × 200k | 69.1     |
| fwrite   | 24 threads     | 200k × 500 | 8.3      |
| fwrite   | 1 thread       | 200k × 500 | 30.2     |
| readRDS  | not compressed | 500 × 200k | 2.4      |
| readRDS  | compressed     | 500 × 200k | 11.0     |
| fread    | 1 thread       | 500 × 200k | 55.0     |
| fread    | 1 thread       | 200k × 500 | 26.6     |
For sure, `fread` and `fwrite` are impressive. And I’d never have thought you could get an advantage from parallel reads and writes.
I’m going to stick with RDS (making use of `compress=FALSE` when I don’t care much about disk space) when I want to read/write whole files from R. And I’ll go with SQLite, feather, or fst when I want super fast access to a single row or column. But I also do a lot of reading and writing of CSV files, and I’ve enjoyed `data.table::fread` and will now be using `data.table::fwrite`, too.
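As an illustration of that last point, here’s roughly what single-column access looks like with the fst package (a sketch; `write_fst()` and `read_fst()` are the function names in recent fst releases, and the column name comes from the simulated data above):

```r
library(fst)
write_fst(db, "db.fst")
# pull a single column without reading the whole file
one_col <- read_fst("db.fst", columns="snp1.X")
```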