I don’t think I’m unusual among statisticians in having avoided working directly with databases for much of my career. The data for my projects have been reasonably small. (In fact, basically all of the data for my 20 years of projects are on my laptop’s drive.) Flat files (such as CSV files) were sufficient.
But I’ve finally entered the modern era of biggish data. (Why do they call it big data? That doesn’t leave us much room for gradations of size. In the ’90s, statisticians talked about massive data.) And particularly for visualization of large-scale data, I don’t want to load everything in advance, and I want rapid access to slices of data.
So I’ve been playing with SQLite and MongoDB, and more recently feather and fst. And I thought I’d show a few examples. I’m interested mostly in quick access, from R, to small portions of a large file.
data, rds
Let me start by simulating some data. I’m mostly thinking about the case of 500 100k-SNP arrays. So the data are pairs of intensity measures for the two alleles at each of 100k SNPs in 500 samples. And typically I want to grab the 500 pairs of intensities for a given SNP. I’m going to just simulate IID noise, because for these illustrations I don’t really care about the contents so much as the storage size and I/O speed.
(Note: this stuff takes a long time to run, so the actual code behind the scenes is more complicated, with me having cached the timings and skipped the actual runs.)
n_ind <- 500
n_snps <- 1e5
ind_names <- paste0("ind", 1:n_ind)
snp_names <- paste0("snp", 1:n_snps)
sigX <- matrix(rnorm(n_ind*n_snps), nrow=n_ind)
sigY <- matrix(rnorm(n_ind*n_snps), nrow=n_ind)
dimnames(sigX) <- list(ind_names, paste0(snp_names, ".X"))
dimnames(sigY) <- list(ind_names, paste0(snp_names, ".Y"))
db <- cbind(data.frame(id=ind_names, stringsAsFactors=FALSE),
            sigX, sigY)
My typical approach with data like this is to save it in an RDS file and just read the whole thing into memory if I want to work with it. But it’s rather slow to write and read such a big data set.
rds_file <- "db.rds"
saveRDS(db, rds_file)
db_copy <- readRDS(rds_file)
It was like 49 sec to write the RDS file, and 7.0 sec to read it. The file itself is 769 MB.
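(The timings and file sizes quoted throughout were cached behind the scenes; here’s a minimal sketch, not the exact code used for this post, of how such numbers might be captured with system.time() and file.size().)
# a sketch, not the exact code behind the cached numbers above:
# capture elapsed write/read times and the file size in MB
write_time <- system.time(saveRDS(db, rds_file))["elapsed"]
read_time  <- system.time(db_copy <- readRDS(rds_file))["elapsed"]
size_mb    <- file.size(rds_file) / 1024^2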
sqlite
Mongo is cool, and I think ultimately it will be useful to me, but SQLite has the advantage of being a single file that you can hand to others. And installation is easy: you just need install.packages("RSQLite").
SQLite won’t take more than 2000 columns (or maybe 32,767 if you change a compile-time parameter), so we need to take the transpose of our data frame.
t_db <- cbind(data.frame(snp=rep(snp_names, 2),
                         signal=rep(c("X", "Y"), each=n_snps),
                         stringsAsFactors=FALSE),
              rbind(t(sigX), t(sigY)))
Let’s rearrange it so that the two rows for a given SNP are next to each other.
db_rows <- as.numeric(matrix(1:nrow(t_db), byrow=TRUE, nrow=2))
t_db <- t_db[db_rows,]
To write to a SQLite file, we use dbConnect to create a database connection, and then dbWriteTable. We can use dbDisconnect to disconnect afterwards, if we’re done.
library(RSQLite)
sqlite_file <- "t_db.sqlite"
sqldb <- dbConnect(SQLite(), dbname=sqlite_file)
dbWriteTable(sqldb, "snps", t_db, row.names=FALSE, overwrite=TRUE,
             append=FALSE, field.types=NULL)
dbDisconnect(sqldb)
The writing took 35 sec, and the resulting file is 923 MB.
A key advantage of SQLite is being able to quickly access a portion of the data, for example to grab the two rows for a particular SNP. You’d need to know the SNP names first, which you can get by grabbing that column (or field) with dbGetQuery. A data frame is returned, so we select the first column.
sqldb <- dbConnect(SQLite(), dbname=sqlite_file)
snp_names <- dbGetQuery(sqldb, 'select snp from snps')[,1]
We can call dbGetQuery again to get the two rows of data for a given SNP.
random_snp <- sample(snp_names, 1)
query <- paste0('select * from snps where snp == "', random_snp, '"')
system.time(z <- dbGetQuery(sqldb, query))
## user system elapsed
## 0.353 0.485 0.838
Such queries are faster if we first add an index on the SNP names.
dbGetQuery(sqldb, "CREATE INDEX snp ON snps(snp)")
The file is basically the same size, 926 MB, and queries are now all but instantaneous.
random_snp <- sample(snp_names, 1)
query <- paste0('select * from snps where snp == "', random_snp, '"')
system.time(z <- dbGetQuery(sqldb, query))
## user system elapsed
## 0.004 0.000 0.004
feather
I’d understood feather to be a quick way of transferring data between Python and R; Petr Simacek convinced me of its broader uses, such as taking the place of a single-table database.
Writing a feather file is surprisingly fast, and reading it back in is even faster.
library(feather)
feather_file <- "t_db.feather"
write_feather(t_db, feather_file)
t_db_clone <- read_feather(feather_file)
That took about 3 sec to write, and 1.0 sec to read, and the file is about 803 MB.
But queries of particular columns or rows are fast, too. So you can basically use feather like a database.
db_f <- feather(feather_file)
snp_names <- unlist(db_f[,"snp"])
random_snp <- sample(snp_names, 1)
system.time(z <- db_f[snp_names==random_snp,])
## user system elapsed
## 0.381 0.353 0.735
In this particular case, it’s actually quite a bit faster to work with feather the other way around; that is, in the original format of 500 arrays x 100k SNPs.
feather_file_2 <- "db.feather"
write_feather(db, feather_file_2)
db_clone <- read_feather(feather_file_2)
That took about 3 sec to write, and 1.8 sec to read, and the file is about 815 MB.
Accessing particular rows is just as easy. First, a bit of code to get the SNP names: grab the column names (dropping the initial id column), get rid of the ".X" or ".Y" bits at the end, and then take the first half.
db_f_2 <- feather(feather_file_2)
snp_names <- sub("\\.[XY]$", "", colnames(db_f_2)[-1])  # drop the "id" column
snp_names <- snp_names[1:(length(snp_names)/2)]
Now, we grab the data for a random SNP by pasting ".X" and ".Y" back onto the SNP name.
random_snp <- sample(snp_names, 1)
system.time(z <- db_f_2[,c("id", paste0(random_snp, c(".X", ".Y")))])
## user system elapsed
## 0.045 0.000 0.045
Note that you can also use dplyr with feather as if you’ve got an in-memory data frame.
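For example (a minimal sketch; dplyr isn’t otherwise used in this post, and here I just read the whole file into a tibble first rather than doing anything feather-specific):
library(dplyr)
# treat the feather contents like an ordinary in-memory data frame
read_feather(feather_file) %>%
    filter(snp == random_snp)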
fst
After tweeting about feather, Dirk Eddelbuettel suggested that I look at the fst package. It’s not quite as slick to take data slices, but it’s potentially faster and you can write a compressed file to save disk space.
Like SQLite, it’s best not to have too many columns, so we’ll work with the transposed version of the data frame, with SNPs as rows. Writing and reading are fast.
library(fst)
fst_file <- "db.fst"
write.fst(t_db, fst_file)
t_db_clone <- read.fst(fst_file)
That took 3.0 sec to write, 1.1 sec to read, and the file is about 803 MB.
Writing a compressed file is quite a bit slower. Here I use 80% compression.
fstcomp_file <- "db_comp.fst"
write.fst(t_db, fstcomp_file, 80)
t_db_clone <- read.fst(fstcomp_file)
That took 6.7 sec to write, 2.7 sec to read, and the file is about 781 MB.
Doing queries on an fst file is not quite as slick as for feather, but it’s fast. The read.fst function has a columns argument to grab particular columns, and from and to arguments to grab a slice of rows.
We’ll first grab the snp column to get the SNP names. And let’s just work with the compressed version of the file. Since read.fst will return a one-column data frame, we grab the first column to make it a vector.
snp_names <- read.fst(fstcomp_file, "snp")[,1]
Now we can choose a random SNP, find the corresponding rows, and then use from and to to grab those two rows. You can see it’s useful to have the pairs of rows for each SNP be contiguous.
random_snp <- sample(snp_names, 1)
wh_rows <- which(random_snp == snp_names)
system.time(z <- read.fst(fstcomp_file, from=wh_rows[1], to=wh_rows[2]))
## user system elapsed
## 0.078 0.011 0.089
timings
Let’s use the microbenchmark package to compare timings for grabbing a random SNP. First a bit of set-up.
random_snp <- sample(snp_names, 1)
library(microbenchmark)
sqlite_file <- "t_db.sqlite"
sqldb <- dbConnect(SQLite(), dbname=sqlite_file)
query <- paste0('select * from snps where snp == "', random_snp, '"')
db_f <- feather(feather_file)
db_f_2 <- feather(feather_file_2)
Now the timings.
microbenchmark(sqlite=dbGetQuery(sqldb, query),
               feather=db_f[snp_names==random_snp,],
               feather_t=db_f_2[,c("id", paste0(random_snp, c(".X",".Y")))],
               fst={wh_rows <- which(random_snp == snp_names)
                    read.fst(fst_file, from=wh_rows[1], to=wh_rows[2])},
               fstcomp={wh_rows <- which(random_snp == snp_names)
                        read.fst(fstcomp_file, from=wh_rows[1], to=wh_rows[2])},
               times=100)
## Unit: milliseconds
##       expr min  lq mean median  uq  max neval cld
##     sqlite   3   4    4      4   4    4   100  a
##    feather 618 641  717    660 766 1215   100  e
##  feather_t  43  47   48     48  49   54   100  b
##        fst  61  63   80     64  65  341   100  c
##    fstcomp  91  95  117     95  97  691   100  d
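By the way, the results object passed to knitr::kable() below isn’t built in the code shown; here’s a hypothetical sketch of how it could be assembled (the numbers are just the cached values reported above and repeated in the table that follows).
# hypothetical assembly of the summary table; values are the cached results above
results <- data.frame(write_s   = c(49.3, 34.7,   3.0,  3.3,  3.0,   6.7),
                      read_s    = c( 7.0,   NA,   1.0,  1.8,  1.1,   2.7),
                      size_mb   = c( 769,  923,   803,  815,  803,   781),
                      access_ms = c(  NA,  3.7, 716.5, 48.3, 79.9, 116.7),
                      row.names = c("rds", "sqlite", "feather", "feather (tr)",
                                    "fst", "fst (compr)"))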
Here’s a summary of all of the results:
knitr::kable(results, digits=c(1,1,0,1),
             col.names=c("write time (s)", "read time (s)",
                         "file size (MB)", "access time (ms)"))
|              | write time (s) | read time (s) | file size (MB) | access time (ms) |
|--------------|---------------:|--------------:|---------------:|-----------------:|
| rds          | 49.3           | 7.0           | 769            | NA               |
| sqlite       | 34.7           | NA            | 923            | 3.7              |
| feather      | 3.0            | 1.0           | 803            | 716.5            |
| feather (tr) | 3.3            | 1.8           | 815            | 48.3             |
| fst          | 3.0            | 1.1           | 803            | 79.9             |
| fst (compr)  | 6.7            | 2.7           | 781            | 116.7            |
I don’t think we can draw general conclusions about the relative speed and file size of the three approaches from these results. And I think they’re all really useful and interesting.
But in this particular case, the file compression didn’t really help with fst and slowed things down. When accessing the data, feather was considerably faster than fst when the data were organized with the SNPs as columns, but was considerably slower when the data were in the opposite orientation. SQLite has much faster access times, but with a larger file size that takes longer to write.