In the document describing my analysis of the fruit snacks data, I focused on high-level results and suppressed discussion of the details of my analyses. In the present document, I’ll describe some of the tricks I used.

With the RCurl package, we can load the data directly from the web. (We need RCurl, because GitHub uses https rather than just http.)

```
library(RCurl)
fsdat <- getURL("https://raw.githubusercontent.com/kbroman/FruitSnacks/master/Data/fruit_snacks.csv")
fs <- read.csv(text=fsdat)
```

I’ll drop the first column, which just contains identifiers (1–81) for the fruit snack packages.

`fs <- fs[,-1]`

To evaluate differences in the frequencies of different colors of snacks, I consider a pair of colors and then used a paired permutation test (with the t-statistic). This was accomplished with the `paired.perm.test()`

function in the R/broman package. That function can do an *exhaustive* permutation test (for small samples), or a simulation-based permutation test, as used here.

Here’s a simplified version of the function, just for the simulation-based permutation test:

```
paired.perm.test <-
function(d, n.perm=10000)
{
n <- length(d)
tobs <- t.test(d)$statistic
allt <- 1:n.perm
for(i in 1:n.perm) {
permd <- d*sample(c(-1,1), n, replace=TRUE)
allt[i] <- t.test(permd)$statistic
}
mean(abs(allt) >= abs(tobs))
}
```

The input is a set of differences, `d`

. I calculate the t-statistic with the R function `t.test()`

, and the central permutation test part involves a `for`

loop. I use `sample()`

to apply a random sign (positive or negative) to each difference, and then `t.test()`

again to calculate the t-statistic. The returned p-value is the proportion of t-statistics from the permutations that are ≥ the observed one, in absolute value.

The data are contained in a matrix `fs`

, with 81 rows (the observed packages) and 5 columns (the different colors). To calculate the observed SD for the number of snacks of each color, I use `apply`

, as follows:

`obs_sd <- apply(fs, 2, sd)`

To calculate the SD of the *proportions*, I first calculate the number of snacks per package, using `rowSums`

. I then convert the counts to proportions by dividing by those sums. (This makes use the fact that the data are stored by column, and the column of row sums will be repeated. I then use `apply`

again.

```
n_per_package <- rowSums(fs)
props <- fs / n_per_package
obs_sd_prop <- apply(props, 2, sd)
```

In the analysis of clustering, the *null* model is that the colors were randomly assigned to packages (but at color-specific frequencies). Thus the number of snacks of a particular color, given the total number of snacks in a package, follows a binomial distribution. The distribution of the counts of a particular color across packages then follows a mixture of binomial distributions.

I wrote a function to calculate the SD for such a binomial mixture. This is maybe a bit messy.

```
calc_sd_binommixture <-
function(n=rowSums(fs), p=sum(fs[,1])/sum(fs))
{
maxn <- max(n)
tabn <- table(n)
prop_n <- tabn/length(n)
n <- as.numeric(names(tabn))
probs <- t(vapply(n, function(number) dbinom(0:maxn, number, prob=p), rep(0, maxn+1)))
probs <- colSums(probs * rep(prop_n, ncol(probs)))
xmean <- sum(probs*(0:maxn))
sqrt(sum(probs * ((0:maxn) - xmean)^2))
}
```

The input is a vector of numbers of snacks per package (`n`

) and the frequency of a particular color (`p`

). I determine the unique values in `n`

, and then use `dbinom()`

and `vapply()`

to get the binomial probabilities. I then use `colSums()`

to get the probabilities for the binomial mixture. Finally, I calculate the mean and then the SD of that mixture distribution.

To actually calculate the set of expected SDs, I use `apply()`

.

`exp_sd <- apply(fs, 2, function(a, b) calc_sd_binommixture(rowSums(b), sum(a)/sum(b)), fs)`

There’s a similar function for calculation the SD of the proportions, for this sort of binomial mixture. The only difference is that I need to use values that are proportions rather than numbers. I probably should have merged these two functions into one, to not have all of the repeated code.

```
calc_sdprop_binommixture <-
function(n=rowSums(fs), p=sum(fs[,1])/sum(fs))
{
maxn <- max(n)
tabn <- table(n)
prop_n <- tabn/length(n)
n <- as.numeric(names(tabn))
probs <- t(vapply(n, function(number) dbinom(0:maxn, number, prob=p), rep(0, maxn+1)))
probs <- probs * rep(prop_n, ncol(probs))
vals <- t(vapply(n, function(number) (0:maxn)/number, rep(0, maxn+1)))
xmean <- sum(probs*vals)
sqrt(sum(probs * (vals - xmean)^2))
}
```

To evaluate clustering, I could have compared the observed SD for a color to the distribution obtained by simulating data under the binomial mixture. I choose instead to do a permutation test: take the 1029 snacks and randomly assign them to packages, keeping the number of snacks of each color constrained. I don’t think it matters so much; I think the permutation test is a bit more cute.

To perform the permutation test, I first create a “tidy table”, with each row being a particular snack and with two columns: a numeric index (1–81) for the package it was in, and a numeric index (1–5) for its color.

I create this table using a bit of `apply()`

, `rep()`

, and a `for`

loop.

```
tidytab <- data.frame(package=rep(0, sum(fs)), color=rep(0, sum(fs)))
tidytab$package <- unlist(apply(fs, 2, function(a) rep(1:nrow(fs), a)))
cur <- 0
for(i in 1:ncol(fs)) {
tidytab$color[cur + 1:sum(fs[,i])] <- rep(i, sum(fs[,i]))
cur <- cur + sum(fs[,i])
}
```

The first part of the table looks like this:

`head(tidytab)`

To get this back to the table of packages × colors, use `table()`

. It’s good to check that you get the same result.

```
tab <- table(tidytab$package, tidytab$color)
all(tab == fs)
```

In the permutation test, I just need to shuffle one column relative to the other (using `sample`

) and then use `table()`

to get the counts.

`permtab <- table(tidytab$package, sample(tidytab$color))`

With the code from the previous section, the permutation test is then pretty simple: I use a `for`

loop to repeated permute the data, in each case using `apply`

to calculate the column SDs.

```
n.perm <- 10000
permsd <- matrix(ncol=ncol(fs), nrow=n.perm)
for(i in 1:n.perm) {
permdat <- table(tidytab$package, sample(tidytab$color))
permsd[i,] <- apply(permdat, 2, sd)
}
```

To calculate p-values, I look at the proportion of permutation replicates that gave an SD that was farther away from the expected SD, in absolute value, than what was actually observed.

```
pval <- rep(0, ncol(fs))
names(pval) <- colnames(fs)
for(i in seq(along=pval))
pval[i] <- mean(abs(permsd[,i] - exp_sd[i]) >= abs(obs_sd[i] - exp_sd[i]))
```

The permutation test using the SD of the proportions rather than the SD of the counts is basically the same, I just need to convert to proportions before calculating the SDs.

```
n.perm <- 10000
permsdprop <- matrix(ncol=ncol(fs), nrow=n.perm)
for(i in 1:n.perm) {
permdat <- table(tidytab$package, sample(tidytab$color))
permsdprop[i,] <- apply(permdat/rowSums(permdat), 2, sd)
}
```