## Homework 2: Pest management trial

*due 12 March 2020*

A randomized clinical trial in Baltimore and Boston sought to assess whether integrated pest management (IPM) could reduce asthma symptoms in exposed children. Households were randomized to two groups: control (education about management of mice) and treatment (education plus professional pest management). Subjects were followed for one year.

Read the published study, Matsui et al. (2017). Also poke through the supplements: study protocol and eMethods/tables/figure. Further, maybe take a look at the summary videos at the study website.

Download the data:

- published dataset
- codebook as csv file or xlsx file

a. Reproduce the first row in Table 2 of Matsui et al. (2017), on the primary outcome, “Maximal symptom days/2 wk”.

The analysis does not seem to be described in much detail. The most detail I could find was on page 72 of the study protocol supplement (labeled page 56). But the analysis doesn’t seem to use site (and site doesn’t seem to be included in the data file), even though the protocol said it would.

You can use the gee package. The outcome variable is `sxsmaxday`, and I believe you’ll omit `VisitNum` 0 and 1 and then fit a log-linear model something like this:

```
library(gee)  # dat (a placeholder name): visits 0 and 1 dropped, rows sorted by ID
out <- gee(sxsmaxday ~ group, id=ID, data=dat, family=poisson(link=log), corstr="exchangeable")
```
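
Here is a slightly fuller sketch of that workflow, run on simulated data so that it is self-contained. The data frame `toy` and all of its values are made up; only the column names `ID`, `VisitNum`, `group`, and `sxsmaxday` come from the assignment, and the 0/1 coding of `group` is an assumption (check the codebook). Note that `gee()` assumes the rows for each subject are contiguous, so the data are sorted by `ID` first.

```r
set.seed(1)
# Simulated stand-in for the trial data (values are made up)
n <- 40
toy <- expand.grid(ID = 1:n, VisitNum = 0:4)
toy$group <- ifelse(toy$ID <= n / 2, 0, 1)   # assumed coding: 0 = control, 1 = treatment
toy$sxsmaxday <- rpois(nrow(toy), lambda = 3)

# Keep the follow-up visits (drop VisitNum 0 and 1) and sort by subject
fu <- subset(toy, VisitNum >= 2)
fu <- fu[order(fu$ID, fu$VisitNum), ]

# Fit only if the gee package is installed
if (requireNamespace("gee", quietly = TRUE)) {
  out <- gee::gee(sxsmaxday ~ group, id = ID, data = fu,
                  family = poisson(link = log), corstr = "exchangeable")
  est <- summary(out)$coefficients["group", ]
  # Exponentiated coefficient with a 95% interval from the robust standard error
  rr <- exp(est["Estimate"] + c(0, -1, 1) * qnorm(0.975) * est["Robust S.E."])
  names(rr) <- c("ratio", "lower", "upper")
  print(rr)
}
```

With a log link, the exponentiated `group` coefficient is a ratio of expected symptom days (treatment over control). Whether that is the scale reported in Table 2 is something to check against the paper.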

Note that the first and second columns in that row, with the median, 25th, and 75th percentiles, are also tricky. It seems like for each subject they took the average outcome across the last three visits, and that the Table 2 values are the median, 25th, and 75th percentiles of those averages.
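
The two-step summary described above can be sketched as follows, on a tiny made-up data set (four subjects, three visits each). The values and the 0/1 coding of `group` are assumptions for illustration, and this is my guess at the calculation rather than a confirmed description of what the authors did.

```r
# Tiny made-up example: 4 subjects, visits 2-4
toy <- data.frame(
  ID        = rep(1:4, each = 3),
  VisitNum  = rep(2:4, times = 4),
  group     = rep(c(0, 0, 1, 1), each = 3),   # assumed coding: 0 = control, 1 = treatment
  sxsmaxday = c(0, 3, 6,  2, 2, 2,  1, 0, 2,  4, 5, 9)
)

# Step 1: one average per subject over the follow-up visits
subj_avg <- aggregate(sxsmaxday ~ ID + group, data = toy, FUN = mean)

# Step 2: quartiles of those per-subject averages within each group
quartiles <- tapply(subj_avg$sxsmaxday, subj_avg$group,
                    quantile, probs = c(0.25, 0.5, 0.75))
print(quartiles)
```

With real data you will need to decide how to treat subjects who are missing some of the three visits; the formula interface of `aggregate()` drops rows with missing values by default, so a subject's average would then be over the visits that were observed.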

b. Instead of using GEE, combine the outcomes for visits 2-4 in a subject by taking the sum, here focusing on the subjects that had complete data for the three visits. Then fit a log-linear model with `glm()`, something like the following.

```
# sums (a placeholder name): one row per subject, sxsmaxday_sum = total over visits 2-4
out <- glm(sxsmaxday_sum ~ group, data=sums, family=quasipoisson(link=log))
```

By using `quasipoisson()` for the family rather than `poisson()`, you’re allowing for overdispersion in the variance.

With this approach, what is the confidence interval for the treatment effect?
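
A minimal sketch of the whole of part b, again on simulated data so that it runs on its own. The data frame `toy`, the two artificially missing visits, and the 0/1 coding of `group` are all assumptions for illustration. The interval here is a Wald interval computed on the log scale and then exponentiated; other constructions (for example `confint(out)`) are possible.

```r
set.seed(2)
# Simulated stand-in for the follow-up data (values are made up)
n <- 60
toy <- expand.grid(ID = 1:n, VisitNum = 2:4)
toy$group <- ifelse(toy$ID <= n / 2, 0, 1)   # assumed coding: 0 = control, 1 = treatment
toy$sxsmaxday <- rpois(nrow(toy), lambda = 3)
toy$sxsmaxday[c(5, 70)] <- NA                # pretend two visits are missing

# Keep subjects with all three visits observed, then sum within subject
obs <- toy[!is.na(toy$sxsmaxday), ]
n_visits <- table(obs$ID)
complete_ids <- as.integer(names(n_visits)[n_visits == 3])
sums <- aggregate(sxsmaxday ~ ID + group,
                  data = obs[obs$ID %in% complete_ids, ], FUN = sum)
names(sums)[names(sums) == "sxsmaxday"] <- "sxsmaxday_sum"

out <- glm(sxsmaxday_sum ~ group, data = sums, family = quasipoisson(link = log))

# Wald interval on the log scale, then exponentiate to get a ratio
est <- coef(summary(out))["group", ]
ci <- exp(est["Estimate"] + c(-1, 1) * qt(0.975, df.residual(out)) * est["Std. Error"])
print(ci)
```

The standard error reported by `summary()` for a `quasipoisson()` fit already includes the estimated dispersion, which is why the interval is wider than the one from a plain `poisson()` fit when the data are overdispersed.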

c. Create a data visualization of the `sxsmaxday` outcome, to try to reveal both the longitudinal nature of the data and the treatment effect. Discuss your design choices.
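
As a starting point only, here is one possible design in base R graphics: faint lines for individual subjects, with thick lines for the group means at each visit. The data are simulated, and the group labels and colors are arbitrary choices of mine; many other designs would be just as defensible.

```r
set.seed(3)
# Simulated stand-in for the trial data (values are made up)
n <- 30
toy <- expand.grid(ID = 1:n, VisitNum = 0:4)
toy$group <- ifelse(toy$ID <= n / 2, "control", "treatment")   # hypothetical labels
toy$sxsmaxday <- pmin(rpois(nrow(toy), lambda = 3), 14)

# Mean outcome by visit and group (rows = visits, columns = groups)
means <- tapply(toy$sxsmaxday, list(toy$VisitNum, toy$group), mean)

# Individual trajectories (rows = visits, columns = subjects)
wide <- tapply(toy$sxsmaxday, list(toy$VisitNum, toy$ID), mean)
subj_group <- toy$group[match(colnames(wide), toy$ID)]

cols <- c(control = "grey40", treatment = "steelblue")
visits <- as.numeric(rownames(wide))

# Thin, semi-transparent lines: one per subject
matplot(visits, wide, type = "l", lty = 1,
        col = adjustcolor(cols[subj_group], alpha.f = 0.25),
        xlab = "VisitNum", ylab = "sxsmaxday")
# Thick lines: group means at each visit
matlines(visits, means, lty = 1, lwd = 3, col = cols[colnames(means)])
legend("topright", legend = names(cols), col = cols, lwd = 3, bty = "n")
```

One trade-off to consider: because the outcome is a count, many individual lines overlap exactly, so the subject-level lines show the spread of the data better than they show any one child's trajectory.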

d. Turn the first eight rows of Table 2 of Matsui et al. (2017) into a graph (or graphs), showing the treatment effects on the primary and secondary outcomes. Discuss your design choices.
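
One common way to turn a table of estimates into a graph is a dot-and-interval plot, sketched below. The numbers and outcome names are placeholders, not the values from Table 2. The log axis and the reference line at 1 assume that the effects are ratios; for differences you would use a linear axis and a reference line at 0.

```r
# Placeholder numbers only: replace with the estimates and intervals from Table 2
tab <- data.frame(
  outcome  = c("Outcome A", "Outcome B", "Outcome C"),
  estimate = c(0.90, 1.10, 0.75),
  lower    = c(0.70, 0.85, 0.50),
  upper    = c(1.15, 1.40, 1.10)
)

y <- rev(seq_len(nrow(tab)))            # first table row at the top
op <- par(mar = c(4, 8, 1, 1))          # widen the left margin for the labels
plot(tab$estimate, y, log = "x", pch = 19,
     xlim = range(tab$lower, tab$upper), ylim = c(0.5, nrow(tab) + 0.5),
     yaxt = "n", xlab = "Treatment effect (ratio, log scale)", ylab = "")
segments(tab$lower, y, tab$upper, y)    # interval for each outcome
abline(v = 1, lty = 2)                  # reference line: no effect
axis(2, at = y, labels = tab$outcome, las = 1)
par(op)
```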

*Where I say, “Discuss your design choices,” I am referring to your
choices of plot type, layout, colors, and so forth. I want you to
explain: why did you choose this particular arrangement? Every graph
has trade-offs (for example, some comparisons are made easier, others
are made more difficult); why did you choose this particular graph?*

### Resources

- Lesson on generalized estimating equations
- Using the gee package: vignette from the package for *A Handbook of Statistical Analyses Using R*
- Gelman et al. (2002) Let’s practice what we preach: Turning tables into graphs. Am Stat 56:121-130

This assignment is derived from a data science homework designed by Stephanie Hicks and Roger Peng at Johns Hopkins.