
sqlite, feather, and fst

I don’t think I’m unusual among statisticians in having avoided working directly with databases for much of my career. The data for my projects have been reasonably small. (In fact, basically all of the data for my 20 years of projects are on my laptop’s drive.) Flat files (such as CSV files) were sufficient.

But I’ve finally entered the modern era of biggish data. (Why do they call it big data? That doesn’t leave us much room for gradations of size. In the 90’s, statisticians talked about massive data.) And particularly for visualization of large-scale data, I don’t want to load everything in advance, and I want rapid access to slices of data.

So I’ve been playing with SQLite and MongoDB, and more recently feather and fst. And I thought I’d show a few examples. I’m interested mostly in quick access, from R, to small portions of a large file.

data, rds

Let me start by simulating some data. I’m mostly thinking about the case of 500 100k-SNP arrays. So the data are pairs of intensity measures for the two alleles at each of 100k SNPs in 500 samples. And typically I want to grab the 500 pairs of intensities for a given SNP. I’m going to just simulate IID noise, because for these illustrations I don’t really care about the contents so much as the storage size and I/O speed.

(Note: this stuff takes a long time to run, so the actual code behind the scenes is more complicated, with me having cached the timings and skipped the actual runs.)

n_ind <- 500
n_snps <- 1e5
ind_names <- paste0("ind", 1:n_ind)
snp_names <- paste0("snp", 1:n_snps)
sigX <- matrix(rnorm(n_ind*n_snps), nrow=n_ind)
sigY <- matrix(rnorm(n_ind*n_snps), nrow=n_ind)
dimnames(sigX) <- list(ind_names, paste0(snp_names, ".X"))
dimnames(sigY) <- list(ind_names, paste0(snp_names, ".Y"))
db <- cbind(data.frame(id=ind_names, stringsAsFactors=FALSE),
            sigX, sigY)

My typical approach with data like this is to save it in an RDS file and just read the whole thing into memory if I want to work with it. But it’s rather slow to write and read such a big data set.

rds_file <- "db.rds"
saveRDS(db, rds_file)
db_copy <- readRDS(rds_file)

It was like 49 sec to write the RDS file, and 7.0 sec to read it. The file itself is 769 MB.
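(For reference, here's a minimal sketch of how one might capture such timings and file sizes, with system.time and file.size; as noted above, the real code behind this post caches the results rather than re-running everything.)

rds_write <- system.time(saveRDS(db, rds_file))["elapsed"]
rds_read  <- system.time(db_copy <- readRDS(rds_file))["elapsed"]
rds_size  <- file.size(rds_file) / 1024^2   # approximate size in MB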

sqlite

Mongo is cool, and I think ultimately it will be useful to me, but SQLite has the advantage of being a single file that you can hand to others. And installation is easy: you just need install.packages("RSQLite").

SQLite won’t take more than 2000 columns (or maybe 32,767 if you change a compile-time parameter), so we need to take the transpose of our data frame.

t_db <- cbind(data.frame(snp=rep(snp_names, 2),
                         signal=rep(c("X", "Y"), each=n_snps),
                         stringsAsFactors=FALSE),
              rbind(t(sigX), t(sigY)))

Let’s rearrange it so that the two rows for a given SNP are next to each other.

db_rows <- as.numeric(matrix(1:nrow(t_db), byrow=TRUE, nrow=2))
t_db <- t_db[db_rows,]

To write to a SQLite file, we use dbConnect to create a database connection, and then dbWriteTable. We can use dbDisconnect to disconnect afterwards, if we’re done.

library(RSQLite)
sqlite_file <- "t_db.sqlite"
sqldb <- dbConnect(SQLite(), dbname=sqlite_file)
dbWriteTable(sqldb, "snps", t_db, row.names=FALSE, overwrite=TRUE,
             append=FALSE, field.types=NULL)
dbDisconnect(sqldb)

The writing took 35 sec, and the resulting file is 923 MB.

A key advantage of SQLite is being able to quickly access a portion of the data, for example to grab the two rows for a particular SNP. You'd need to know the SNP names first, which you can get by grabbing that column (or field) with dbGetQuery. A data frame is returned, so we select the first column.

sqldb <- dbConnect(SQLite(), dbname=sqlite_file)
snp_names <- dbGetQuery(sqldb, 'select snp from snps')[,1]

We can call dbGetQuery again to get the two rows of data for a given SNP.

random_snp <- sample(snp_names, 1)
query <- paste0('select * from snps where snp == "', random_snp, '"')
system.time(z <- dbGetQuery(sqldb, query))
##    user  system elapsed 
##   0.353   0.485   0.838

Such queries are faster if we first add an index on the SNP names.

dbGetQuery(sqldb, "CREATE INDEX snp ON snps(snp)")

The file is basically the same size, 926 MB, and queries are now all but instantaneous.

random_snp <- sample(snp_names, 1)
query <- paste0('select * from snps where snp == "', random_snp, '"')
system.time(z <- dbGetQuery(sqldb, query))
##    user  system elapsed 
##   0.004   0.000   0.004

feather

I’d understood feather to be a quick way of transferring data between Python and R; Petr Simacek convinced me of its broader uses, such as taking the place of a single-table database.

Writing a feather file is surprisingly fast, and reading it back in is even faster.

library(feather)
feather_file <- "t_db.feather"
write_feather(t_db, feather_file)
t_db_clone <- read_feather(feather_file)

That took about 3 sec to write, and 1.0 sec to read, and the file is about 803 MB.

But queries of particular columns or rows are fast, too. So you can basically use feather like a database.

db_f <- feather(feather_file)
snp_names <- unlist(db_f[,"snp"])
random_snp <- sample(snp_names, 1)
system.time(z <- db_f[snp_names==random_snp,])
##    user  system elapsed 
##   0.381   0.353   0.735

In this particular case, it’s actually quite a bit faster to work with feather the other way around; that is, in the original format of 500 arrays x 100k SNPs.

feather_file_2 <- "db.feather"
write_feather(db, feather_file_2)
db_clone <- read_feather(feather_file_2)

That took about 3 sec to write, and 1.8 sec to read, and the file is about 815 MB.

Accessing the data for a particular SNP (now a pair of columns) is just as easy. First a bit of code to grab the SNP names: take the column names (dropping the initial id column), strip off the ".X" or ".Y" bits at the end, and then keep the first half.

db_f_2 <- feather(feather_file_2)
snp_names <- sub("\\.[XY]$", "", colnames(db_f_2)[-1])  # drop "id", strip .X/.Y
snp_names <- snp_names[1:(length(snp_names)/2)]

Now, we grab the data for a random SNP by pasting ".X" and ".Y" back onto the SNP name.

random_snp <- sample(snp_names, 1)
system.time(z <- db_f_2[,c("id", paste0(random_snp, c(".X", ".Y")))])
##    user  system elapsed 
##   0.045   0.000   0.045

Note that you can also use dplyr with feather as if you’ve got an in-memory data frame.
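For example, a quick sketch: read the file back in with read_feather and pipe the resulting tibble through dplyr, filtering to a single SNP and keeping a few of the individual columns.

library(dplyr)
read_feather(feather_file) %>%
    filter(snp == random_snp) %>%
    select(snp, signal, ind1:ind5)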

fst

After I tweeted about feather, Dirk Eddelbuettel suggested that I look at the fst package. It’s not quite as slick for taking slices of the data, but it’s potentially faster, and you can write a compressed file to save disk space.

As with SQLite, it’s best not to have too many columns, so we’ll work with the transposed version of the data frame, with SNPs as rows. Writing and reading are fast.

library(fst)
fst_file <- "db.fst"
write.fst(t_db, fst_file)
t_db_clone <- read.fst(fst_file)

That took 3.0 sec to write, 1.1 sec to read, and the file is about 803 MB.

Writing a compressed file is quite a bit slower. Here I’m using 80% compression.

fstcomp_file <- "db_comp.fst"
write.fst(t_db, fstcomp_file, 80)
t_db_clone <- read.fst(fstcomp_file)

That took 6.7 sec to write, 2.7 sec to read, and the file is about 781 MB.

Doing queries on an fst file is not quite as slick as for feather, but it’s fast. The read.fst function has a columns argument to grab particular columns, and from and to arguments to grab a slice of rows.

We’ll first grab the snp column to get the SNP names. And let’s just work with the compressed version of the file. Since read.fst will return a one-column data frame, we grab the first column to make it a vector.

snp_names <- read.fst(fstcomp_file, "snp")[,1]

Now we can choose a random SNP, find the corresponding rows, and then use from and to to grab those two rows. You can see it’s useful to have the pairs of rows for each SNP be contiguous.

random_snp <- sample(snp_names, 1)
wh_rows <- which(random_snp == snp_names)
system.time(z <- read.fst(fstcomp_file, from=wh_rows[1], to=wh_rows[2]))
##    user  system elapsed 
##   0.078   0.011   0.089

timings

Let’s use the microbenchmark package to compare timings for grabbing a random SNP. First a bit of set-up.

random_snp <- sample(snp_names, 1)
library(microbenchmark)
sqlite_file <- "t_db.sqlite"
sqldb <- dbConnect(SQLite(), dbname=sqlite_file)
query <- paste0('select * from snps where snp == "', random_snp, '"')
db_f <- feather(feather_file)
db_f_2 <- feather(feather_file_2)

Now the timings.

microbenchmark(sqlite=dbGetQuery(sqldb, query),
               feather=db_f[snp_names==random_snp,],
               feather_t=db_f_2[,c("id", paste0(random_snp, c(".X",".Y")))],
               fst={wh_rows <- which(random_snp == snp_names)
                    read.fst(fst_file, from=wh_rows[1], to=wh_rows[2])},
               fstcomp={wh_rows <- which(random_snp == snp_names)
                    read.fst(fstcomp_file, from=wh_rows[1], to=wh_rows[2])},
               times=100)
## Unit: milliseconds
##       expr min  lq mean median  uq  max neval   cld
##     sqlite   3   4    4      4   4    4   100 a    
##    feather 618 641  717    660 766 1215   100     e
##  feather_t  43  47   48     48  49   54   100  b   
##        fst  61  63   80     64  65  341   100   c  
##    fstcomp  91  95  117     95  97  691   100    d

Here’s a summary of all of the results:
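(I didn’t show the code that assembled the results object; a rough reconstruction from the numbers in the table below would be something like this.)

results <- data.frame(write_time  = c(49.3, 34.7, 3.0, 3.3, 3.0, 6.7),
                      read_time   = c(7.0, NA, 1.0, 1.8, 1.1, 2.7),
                      size        = c(769, 923, 803, 815, 803, 781),
                      access_time = c(NA, 3.7, 716.5, 48.3, 79.9, 116.7),
                      row.names   = c("rds", "sqlite", "feather",
                                      "feather (tr)", "fst", "fst (compr)"))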

knitr::kable(results, digits=c(1,1,0,1),
             col.names=c("write time (s)", "read time (s)",
                         "file size (MB)", "access time (ms)"))
              write time (s)   read time (s)   file size (MB)   access time (ms)
rds                     49.3             7.0              769                 NA
sqlite                  34.7              NA              923                3.7
feather                  3.0             1.0              803              716.5
feather (tr)             3.3             1.8              815               48.3
fst                      3.0             1.1              803               79.9
fst (compr)              6.7             2.7              781              116.7

I don’t think we can draw general conclusions about the relative speed and file size of the three approaches from these results. And I think they’re all really useful and interesting.

But in this particular case, the file compression didn’t really help with fst and slowed things down. When accessing the data, feather was considerably faster than fst when the data were organized with the SNPs as columns, but was considerably slower when the data were in the opposite orientation. SQLite has much faster access times, but with a larger file size that takes longer to write.