Matt Dowle encouraged me to follow up on my post about SQLite, feather, and fst. One thing to emphasize is that saveRDS, by default, uses compression. If you use compress=FALSE you can skip that and it goes much faster. See, for example, his post on “Fast csv writing for R”. Also see his slides from a recent presentation on parallel fread.
I’ll first generate the same data that I was using before. And note, as @shabbychef mentioned on twitter, my iid simulations mean that compression isn’t likely to be useful, as we saw in my previous post. So don’t assume that these results apply generally; compression is useful much of the time.
n_ind <- 500
n_snps <- 1e5
ind_names <- paste0("ind", 1:n_ind)
snp_names <- paste0("snp", 1:n_snps)
sigX <- matrix(rnorm(n_ind*n_snps), nrow=n_ind)
sigY <- matrix(rnorm(n_ind*n_snps), nrow=n_ind)
dimnames(sigX) <- list(ind_names, paste0(snp_names, ".X"))
dimnames(sigY) <- list(ind_names, paste0(snp_names, ".Y"))
db <- cbind(data.frame(id=ind_names, stringsAsFactors=FALSE), sigX, sigY)
Now, let’s look at the time to write an RDS file, when compressed and when not. I’m again going to cache my results and just tell you what happened.
rds_file <- "db.rds"
saveRDS(db, rds_file, compress=FALSE)
rds_comp_file <- "db_comp.rds"
saveRDS(db, rds_comp_file)
db_copy1 <- readRDS(rds_file)
db_copy2 <- readRDS(rds_comp_file)
Writing the data to an RDS file took 5.5 sec when uncompressed and 51.4 sec when compressed. Reading them back in took 2.4 sec for the uncompressed file and 11.0 sec for the compressed file. The uncompressed RDS file was 805 MB, while the compressed one was 769 MB.
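The timing code itself isn't shown above, so here is a minimal sketch of how such measurements could be taken, using system.time() and file.size() from base R. The helper time_rds() and the much smaller 200 × 2,000 data frame are assumptions for illustration, not the code behind the numbers reported here.

```r
# Small stand-in for the real data (assumption: 200 x 2,000, not 500 x 200k)
set.seed(1)
small <- as.data.frame(matrix(rnorm(200 * 2000), nrow = 200))

# Hypothetical helper: time one write/read round trip and record the file size
time_rds <- function(x, compress) {
  file <- tempfile(fileext = ".rds")
  on.exit(unlink(file))  # remove the temporary file when the function exits
  write_sec <- system.time(saveRDS(x, file, compress = compress))[["elapsed"]]
  read_sec <- system.time(y <- readRDS(file))[["elapsed"]]
  stopifnot(isTRUE(all.equal(x, y)))  # the round trip should preserve the data
  data.frame(compress = compress,
             write_sec = write_sec,
             read_sec = read_sec,
             size_MB = file.size(file) / 1e6)
}

rds_results <- rbind(time_rds(small, FALSE), time_rds(small, TRUE))
print(rds_results)
```

On data this small the times will be tiny and noisy, so scale up the dimensions before drawing conclusions.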
So, holy crap, reading and writing the RDS files is fast when you use compress=FALSE. Don’t tell your system administrator I said this, but if you’re working on a server with loads of disk space, for sure go with compress=FALSE with your RDS files. On your laptop where uncompressed RDS files might get in the way of your music and movie libraries, you might want to use the compression.
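To see why the iid simulations make compression look worse than it usually is, here is a rough sketch that compares the compressed-to-uncompressed size ratio for random normal values against a vector with only three distinct values. The helper names rds_size() and size_ratio() and the 0/1/2 coding are made up for the illustration.

```r
set.seed(2)
n <- 1e5
iid_values <- rnorm(n)                                      # high-entropy, like the simulated data
repetitive_values <- sample(c(0, 1, 2), n, replace = TRUE)  # hypothetical low-entropy data

# Hypothetical helper: size in bytes of the RDS file for x
rds_size <- function(x, compress) {
  file <- tempfile(fileext = ".rds")
  on.exit(unlink(file))
  saveRDS(x, file, compress = compress)
  file.size(file)
}

# Ratio of compressed to uncompressed size (smaller means compression helps more)
size_ratio <- function(x) rds_size(x, TRUE) / rds_size(x, FALSE)

ratio_iid <- size_ratio(iid_values)
ratio_rep <- size_ratio(repetitive_values)
print(c(iid = ratio_iid, repetitive = ratio_rep))
```

The random normal values should shrink only slightly, while the repetitive vector should shrink a great deal, which is the situation where paying the time cost of compression is more likely to be worthwhile.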
How about CSV?
Dirk Eddelbuettel suggested that I might just use a plain CSV file, since data.table::fread and data.table::fwrite are so fast. How fast?
I’d hoped to also try the multi-threaded fread that’s under development. But the GitHub version needs to be compiled with OpenMP, and after a lot of screwing around to do that, I ended up getting segfaults from fwrite, so I just dumped this plan.
So we’ll look at multi-threaded fwrite but only single-threaded fread. But we can all look forward to the multi-threaded fread in the near future.
With fwrite, the number of threads is controlled by the argument nThread. The default is to call data.table::getDTthreads() which detects the maximum number of cores. On my Mac desktop at work, that’s 24. I’m going to hard-code it in.
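As a small sketch of how the thread count plugs in, the following checks what getDTthreads() reports on the current machine and passes an explicit nThread to fwrite. The toy table and the temporary file name are assumptions, and the value returned by getDTthreads() may differ across data.table versions and settings.

```r
library(data.table)

# Number of threads data.table will use by default on this machine
n_threads <- getDTthreads()
print(n_threads)

# Toy table standing in for the real data (assumption)
dt <- data.table(id = paste0("ind", 1:1000), x = rnorm(1000), y = rnorm(1000))
tmp_csv <- tempfile(fileext = ".csv")

# Explicitly multi-threaded write, then an explicitly single-threaded one
fwrite(dt, tmp_csv, quote = FALSE, nThread = n_threads)
fwrite(dt, tmp_csv, quote = FALSE, nThread = 1)

# Read the file back to confirm it is intact
dt_copy <- fread(tmp_csv)
```

Hard-coding a number, as I do here with 24, makes the results reproducible on one machine but won't transfer to a machine with fewer cores.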
csv_file <- "db.csv"
library(data.table)
fwrite(db, csv_file, quote=FALSE, nThread=24)
db_copy3 <- data.table::fread(csv_file)
That took 41.6 sec to write and 55.0 sec to read, and the file size is 1818 MB.
How about if I set nThread=1?
fwrite(db, csv_file, quote=FALSE, nThread=1)
Single-threaded, fwrite took 69.1 sec.
But the data set is 500 rows by 200k columns. How about if I used the transpose?
t_db <- cbind(data.frame(snp=rep(snp_names, 2),
                         signal=rep(c("X", "Y"), each=n_snps),
                         stringsAsFactors=FALSE),
              rbind(t(sigX), t(sigY)))
Now to write and read this.
csv_t_file <- "db_t.csv"
fwrite(t_db, csv_t_file, quote=FALSE, nThread=24)
t_db_copy <- fread(csv_t_file)
That took 8.3 sec to write and 26.6 sec to read, and the file size is 1818 MB.
And how about if I do it single-threaded?
fwrite(t_db, csv_t_file, quote=FALSE, nThread=1)
Single-threaded, the transposed data took 30.2 sec to write.
(I’m not even going to try write.csv. I’ll leave that to the reader.)
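If you want to check the effect of data shape and thread count on your own machine, something like the following sketch might do. The helper time_fwrite() and the small 50 × 5,000 table are assumptions, and timings at this size will be noisy.

```r
library(data.table)

set.seed(3)
n_row <- 50
n_col <- 5000
wide <- as.data.table(matrix(rnorm(n_row * n_col), nrow = n_row))  # few rows, many columns
tall <- as.data.table(t(as.matrix(wide)))                          # the transpose

# Hypothetical helper: elapsed seconds for one fwrite call
time_fwrite <- function(x, n_thread) {
  file <- tempfile(fileext = ".csv")
  on.exit(unlink(file))
  system.time(fwrite(x, file, quote = FALSE, nThread = n_thread))[["elapsed"]]
}

max_threads <- getDTthreads()
shape_results <- data.frame(
  shape = c("wide", "wide", "tall", "tall"),
  threads = c(1, max_threads, 1, max_threads),
  write_sec = c(time_fwrite(wide, 1), time_fwrite(wide, max_threads),
                time_fwrite(tall, 1), time_fwrite(tall, max_threads))
)
print(shape_results)
```

Increase n_row and n_col toward the real dimensions to see differences that stand out from the noise.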
Here’s a summary of the times:
|function|method|data size|time (s)|
|---|---|---|---|
|saveRDS|not compressed|500 × 200k|5.5|
|saveRDS|compressed|500 × 200k|51.4|
|fwrite|24 threads|500 × 200k|41.6|
|fwrite|1 thread|500 × 200k|69.1|
|fwrite|24 threads|200k × 500|8.3|
|fwrite|1 thread|200k × 500|30.2|
|readRDS|not compressed|500 × 200k|2.4|
|readRDS|compressed|500 × 200k|11.0|
|fread|1 thread|500 × 200k|55.0|
|fread|1 thread|200k × 500|26.6|
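To put those times in relative terms, here is a quick calculation of the ratios between paired entries. The numbers are copied from the times reported above; the comparison labels and column names are just my own.

```r
# Times in seconds, taken from the summary of results above
times <- data.frame(
  comparison = c("saveRDS: compressed vs not",
                 "readRDS: compressed vs not",
                 "fwrite wide: 1 vs 24 threads",
                 "fwrite tall: 1 vs 24 threads",
                 "fwrite 24 threads: wide vs tall",
                 "fread: wide vs tall"),
  slow_sec = c(51.4, 11.0, 69.1, 30.2, 41.6, 55.0),
  fast_sec = c(5.5, 2.4, 41.6, 8.3, 8.3, 26.6)
)

# How many times longer the slower option took
times$ratio <- round(times$slow_sec / times$fast_sec, 1)
print(times)
```

So skipping compression made saveRDS about 9 times faster, while going from 1 to 24 threads sped up fwrite by a factor of about 1.7 for the wide data and 3.6 for the transposed data.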
The speeds of fread and fwrite are impressive. And I’d never have thought you could get advantage from parallel reads and writes.
I’m going to stick with RDS (making use of compress=FALSE when I don’t care much about disk space) when I want to read/write whole files from R. And I’ll go with SQLite, feather, or fst when I want super fast access to a single row or column. But I also do a lot of reading and writing of CSV files, and I’ve enjoyed data.table::fread and will now be using