Introduction

In the document describing my analysis of the fruit snacks data, I focused on high-level results and suppressed discussion of the details of my analyses. In the present document, I’ll describe some of the tricks I used.

With the RCurl package, we can load the data directly from the web. (We need RCurl, because GitHub uses https rather than just http.)

library(RCurl)
fsdat <- getURL("https://raw.githubusercontent.com/kbroman/FruitSnacks/master/Data/fruit_snacks.csv")
fs <- read.csv(text=fsdat)

I’ll drop the first column, which just contains identifiers (1–81) for the fruit snack packages.

fs <- fs[,-1]

Paired permutation tests

To evaluate differences in the frequencies of different colors of snacks, I consider a pair of colors and then used a paired permutation test (with the t-statistic). This was accomplished with the paired.perm.test() function in the R/broman package. That function can do an exhaustive permutation test (for small samples), or a simulation-based permutation test, as used here.

Here’s a simplified version of the function, just for the simulation-based permutation test:

paired.perm.test <-
function(d, n.perm=10000)
{
    n <- length(d)
    tobs <- t.test(d)$statistic

    allt <- 1:n.perm
    for(i in 1:n.perm) {
        permd <- d*sample(c(-1,1), n, replace=TRUE)
        allt[i] <- t.test(permd)$statistic
    }

    mean(abs(allt) >= abs(tobs))
}

The input is a set of differences, d. I calculate the t-statistic with the R function t.test(), and the central permutation test part involves a for loop. I use sample() to apply a random sign (positive or negative) to each difference, and then t.test() again to calculate the t-statistic. The returned p-value is the proportion of t-statistics from the permutations that are ≥ the observed one, in absolute value.

Calculation of SDs for each color

The data are contained in a matrix fs, with 81 rows (the observed packages) and 5 columns (the different colors). To calculate the observed SD for the number of snacks of each color, I use apply, as follows:

obs_sd <- apply(fs, 2, sd)

To calculate the SD of the proportions, I first calculate the number of snacks per package, using rowSums. I then convert the counts to proportions by dividing by those sums. (This makes use the fact that the data are stored by column, and the column of row sums will be repeated. I then use apply again.

n_per_package <- rowSums(fs)
props <- fs / n_per_package
obs_sd_prop <- apply(props, 2, sd)

Expected SD under binomial mixture model

In the analysis of clustering, the null model is that the colors were randomly assigned to packages (but at color-specific frequencies). Thus the number of snacks of a particular color, given the total number of snacks in a package, follows a binomial distribution. The distribution of the counts of a particular color across packages then follows a mixture of binomial distributions.

I wrote a function to calculate the SD for such a binomial mixture. This is maybe a bit messy.

calc_sd_binommixture <-
    function(n=rowSums(fs), p=sum(fs[,1])/sum(fs))
{
    maxn <- max(n)
    tabn <- table(n)
    prop_n <- tabn/length(n)
    n <- as.numeric(names(tabn))
    probs <- t(vapply(n, function(number) dbinom(0:maxn, number, prob=p), rep(0, maxn+1)))

    probs <- colSums(probs * rep(prop_n, ncol(probs)))

    xmean <- sum(probs*(0:maxn))

    sqrt(sum(probs * ((0:maxn) - xmean)^2))
}

The input is a vector of numbers of snacks per package (n) and the frequency of a particular color (p). I determine the unique values in n, and then use dbinom() and vapply() to get the binomial probabilities. I then use colSums() to get the probabilities for the binomial mixture. Finally, I calculate the mean and then the SD of that mixture distribution.

To actually calculate the set of expected SDs, I use apply().

exp_sd <- apply(fs, 2, function(a, b) calc_sd_binommixture(rowSums(b), sum(a)/sum(b)), fs)

There’s a similar function for calculation the SD of the proportions, for this sort of binomial mixture. The only difference is that I need to use values that are proportions rather than numbers. I probably should have merged these two functions into one, to not have all of the repeated code.

calc_sdprop_binommixture <-
    function(n=rowSums(fs), p=sum(fs[,1])/sum(fs))
{
    maxn <- max(n)
    tabn <- table(n)
    prop_n <- tabn/length(n)
    n <- as.numeric(names(tabn))
    probs <- t(vapply(n, function(number) dbinom(0:maxn, number, prob=p), rep(0, maxn+1)))

    probs <- probs * rep(prop_n, ncol(probs))
    vals <- t(vapply(n, function(number) (0:maxn)/number, rep(0, maxn+1)))

    xmean <- sum(probs*vals)

    sqrt(sum(probs * (vals - xmean)^2))
}

Permuting the main data set

To evaluate clustering, I could have compared the observed SD for a color to the distribution obtained by simulating data under the binomial mixture. I choose instead to do a permutation test: take the 1029 snacks and randomly assign them to packages, keeping the number of snacks of each color constrained. I don’t think it matters so much; I think the permutation test is a bit more cute.

To perform the permutation test, I first create a “tidy table”, with each row being a particular snack and with two columns: a numeric index (1–81) for the package it was in, and a numeric index (1–5) for its color.

I create this table using a bit of apply(), rep(), and a for loop.

tidytab <- data.frame(package=rep(0, sum(fs)), color=rep(0, sum(fs)))
tidytab$package <- unlist(apply(fs, 2, function(a) rep(1:nrow(fs), a)))
cur <- 0
for(i in 1:ncol(fs)) {
    tidytab$color[cur + 1:sum(fs[,i])] <- rep(i, sum(fs[,i]))
    cur <- cur + sum(fs[,i])
}

The first part of the table looks like this:

head(tidytab)

To get this back to the table of packages × colors, use table(). It’s good to check that you get the same result.

tab <- table(tidytab$package, tidytab$color)
all(tab == fs)

In the permutation test, I just need to shuffle one column relative to the other (using sample) and then use table() to get the counts.

permtab <- table(tidytab$package, sample(tidytab$color))

Permutation test with SD as statistic

With the code from the previous section, the permutation test is then pretty simple: I use a for loop to repeated permute the data, in each case using apply to calculate the column SDs.

n.perm <- 10000
permsd <- matrix(ncol=ncol(fs), nrow=n.perm)
for(i in 1:n.perm) {
    permdat <- table(tidytab$package, sample(tidytab$color))
    permsd[i,] <- apply(permdat, 2, sd)
}

To calculate p-values, I look at the proportion of permutation replicates that gave an SD that was farther away from the expected SD, in absolute value, than what was actually observed.

pval <- rep(0, ncol(fs))
names(pval) <- colnames(fs)
for(i in seq(along=pval))
    pval[i] <- mean(abs(permsd[,i] - exp_sd[i]) >= abs(obs_sd[i] - exp_sd[i]))

The permutation test using the SD of the proportions rather than the SD of the counts is basically the same, I just need to convert to proportions before calculating the SDs.

n.perm <- 10000
permsdprop <- matrix(ncol=ncol(fs), nrow=n.perm)
for(i in 1:n.perm) {
    permdat <- table(tidytab$package, sample(tidytab$color))
    permsdprop[i,] <- apply(permdat/rowSums(permdat), 2, sd)
}

Source on GitHub

CC0