Bootstrap example

This is a short example to illustrate the bootstrap, in R code. We’ll use the example from lecture, of estimating the 95th percentile of a population.

I’ll first simulate some data from a scaled chi-squared distribution. I simulate data from a chi-square with 5 degrees of freedom and multiply it by 6. I’ll simulate a sample of size n=800.

n <- 800
df <- 5
mult <- 6
dat <- rchisq(n, df=df)*mult

Here’s a histogram of the data.

par(mar=c(3.1, 1.1, 1.1, 1.1))
hist(dat, breaks=2*sqrt(n), xlab="", ylab="", yaxt="n", main="")

The 95th percentile is 65.0. (Note that the true percentile for the underlying population is 66.4.)

Non-parametric bootstrap

Let’s start with the more generic non-parametric bootstrap. We’ll perform 1000 bootstrap replicates.

I’ll use sample() to sample with replacement from the data. I can do all 1000 bootstrap replicates at the same time, arrange them to form a matrix with 1000 columns, and then take the 95th percentile of each column.

n_boot <- 1000
boot_data <- matrix( sample(dat, n_boot*n, replace=TRUE), ncol=n_boot )
boot_results <- apply(boot_data, 2, quantile, 0.95)

Here’s a histogram of the bootstrap results. The tick marks underneath the histogram are the locations of my actual data points. Gaps between data points are what leads to the gaps in the histogram.

par(mar=c(4.1, 1.1, 1.1, 1.1))
hist(boot_results, breaks=2*sqrt(n_boot), main="", ylab="", yaxt="n",
     xlab="estimated 95th percentile")
rug(dat)

The SD of these results is 1.63, so that’s my estimate of the standard error of my estimate.

Parametric bootstrap

Suppose I happened to know that a scaled chi-square distribution would be a good model for the underlying population. How could I use the parametric bootstrap in that case?

Well I first need to estimate the two parameters to get a fitted model: the degrees of freedom and the multiplier. I could use the method of moments for this.

The variance of the scaled chi-squared is (2 m² df) and the mean is (m df). So I could estimate m by var/mean/2 = 5.43 and then estimate df by mean/m̂ = 5.70.

m_hat <- var(dat)/mean(dat)/2
df_hat <- mean(dat)/m_hat

I then simulate from the fitted model with those estimates as the parameters, and then take the quantile of each simulated data set.

pboot_data <- matrix( rchisq(n_boot*n, df_hat)*m_hat, ncol=n_boot )
pboot_results <- apply(pboot_data, 2, quantile, 0.95)

Here’s a histogram of the bootstrap results. The tick marks underneath the histogram are the locations of my actual data points, but note that we don’t see the odd gaps that we’d seen before.

par(mar=c(4.1, 1.1, 1.1, 1.1))
hist(pboot_results, breaks=2*sqrt(n_boot), main="", ylab="", yaxt="n",
     xlab="estimated 95th percentile")
rug(dat)

The SD of these results is 2.30, so that’s my estimate of the standard error of my estimate, by the parametric bootstrap.

Session details

Below are the versions of R and R packages I used.

The R Markdown source for this document is on GitHub.

devtools::session_info()

## ─ Session info ─────────────────────────────────────────────────────────────────────────────────────────────
##  setting  value                       
##  version  R version 3.6.2 (2019-12-12)
##  os       Pop!_OS 19.10               
##  system   x86_64, linux-gnu           
##  ui       X11                         
##  language en_US:en                    
##  collate  en_US.UTF-8                 
##  ctype    en_US.UTF-8                 
##  tz       America/Chicago             
##  date     2020-03-24                  
## 
## ─ Packages ─────────────────────────────────────────────────────────────────────────────────────────────────
##  package     * version date       lib source        
##  assertthat    0.2.1   2019-03-21 [1] CRAN (R 3.6.0)
##  backports     1.1.5   2019-10-02 [1] CRAN (R 3.6.1)
##  broman      * 0.69-5  2019-04-11 [1] CRAN (R 3.6.2)
##  callr         3.4.2   2020-02-12 [1] CRAN (R 3.6.1)
##  cli           2.0.2   2020-02-28 [1] CRAN (R 3.6.2)
##  crayon        1.3.4   2017-09-16 [1] CRAN (R 3.6.0)
##  desc          1.2.0   2018-05-01 [1] CRAN (R 3.6.0)
##  devtools      2.2.2   2020-02-17 [1] CRAN (R 3.6.1)
##  digest        0.6.25  2020-02-23 [1] CRAN (R 3.6.2)
##  ellipsis      0.3.0   2019-09-20 [1] CRAN (R 3.6.1)
##  evaluate      0.14    2019-05-28 [1] CRAN (R 3.6.0)
##  fansi         0.4.1   2020-01-08 [1] CRAN (R 3.6.1)
##  fs            1.3.2   2020-03-05 [1] CRAN (R 3.6.2)
##  glue          1.3.2   2020-03-12 [1] CRAN (R 3.6.2)
##  htmltools     0.4.0   2019-10-04 [1] CRAN (R 3.6.1)
##  knitr         1.28    2020-02-06 [1] CRAN (R 3.6.1)
##  magrittr      1.5     2014-11-22 [1] CRAN (R 3.6.0)
##  memoise       1.1.0   2017-04-21 [1] CRAN (R 3.6.0)
##  pkgbuild      1.0.6   2019-10-09 [1] CRAN (R 3.6.1)
##  pkgload       1.0.2   2018-10-29 [1] CRAN (R 3.6.0)
##  prettyunits   1.1.1   2020-01-24 [1] CRAN (R 3.6.1)
##  processx      3.4.2   2020-02-09 [1] CRAN (R 3.6.1)
##  ps            1.3.2   2020-02-13 [1] CRAN (R 3.6.1)
##  R6            2.4.1   2019-11-12 [1] CRAN (R 3.6.1)
##  Rcpp          1.0.4   2020-03-17 [1] CRAN (R 3.6.2)
##  remotes       2.1.1   2020-02-15 [1] CRAN (R 3.6.1)
##  rlang         0.4.5   2020-03-01 [1] CRAN (R 3.6.2)
##  rmarkdown     2.1     2020-01-20 [1] CRAN (R 3.6.1)
##  rprojroot     1.3-2   2018-01-03 [1] CRAN (R 3.6.0)
##  sessioninfo   1.1.1   2018-11-05 [1] CRAN (R 3.6.0)
##  stringi       1.4.6   2020-02-17 [1] CRAN (R 3.6.1)
##  stringr       1.4.0   2019-02-10 [1] CRAN (R 3.6.0)
##  testthat      2.3.2   2020-03-02 [1] CRAN (R 3.6.2)
##  usethis       1.5.1   2019-07-04 [1] CRAN (R 3.6.0)
##  withr         2.1.2   2018-03-15 [1] CRAN (R 3.6.0)
##  xfun          0.12    2020-01-13 [1] CRAN (R 3.6.1)
##  yaml          2.2.1   2020-02-01 [1] CRAN (R 3.6.1)
## 
## [1] /home/kbroman/Rlibs
## [2] /usr/local/lib/R/site-library
## [3] /usr/lib/R/site-library
## [4] /usr/lib/R/library