due 30 April 2020

The goal of this project is to build a model for predicting diabetes. We will consider data from the NHANES survey.

I extracted data for a set of 16 predictor variables from the first three waves of the survey (1999-2003). I defined diabetes as either having been diagnosed with diabetes or having a fasting glucose level >= 126 mg/dL. I excluded subjects who were under 20 years old or pregnant.
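
To make the outcome definition and the exclusion rules concrete, here is a minimal sketch in Python. The function and argument names are hypothetical and are not the column names in the data file. Treating a missing glucose value as "criterion not met" is my assumption for this sketch, not a statement about how the data were actually prepared.

```python
from typing import Optional

# Fasting glucose cutoff in mg/dL, taken from the definition above.
FASTING_GLUCOSE_CUTOFF = 126.0


def has_diabetes(diagnosed: bool, fasting_glucose: Optional[float]) -> bool:
    """Return True if the subject was diagnosed or has fasting glucose >= cutoff.

    Assumption: a missing glucose value (None) does not meet the glucose criterion.
    """
    if diagnosed:
        return True
    return fasting_glucose is not None and fasting_glucose >= FASTING_GLUCOSE_CUTOFF


def is_eligible(age_years: float, pregnant: bool) -> bool:
    """Return True for subjects kept in the data: at least 20 years old and not pregnant."""
    return age_years >= 20 and not pregnant
```

Note that the cutoff is inclusive: a subject with fasting glucose of exactly 126 mg/dL counts as having diabetes.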

The zip file hw4_data.zip contains a data file and a data dictionary, both in CSV format.

Note that some of the variables have unusual categories, for example age_smoking and household_income; check the data dictionary for how they are coded.

  1. Build a model to predict diabetes from the other variables, for example using logistic regression, a nearest neighbor classifier, or other means. Discuss your choices.

  2. Discuss your handling of missing data.

  3. Assess the performance of your predictions. What are the sensitivity and specificity? Use separate training and test sets, or k-fold cross-validation.

  4. Assess the relative importance of the variables. Which variables are most important for predicting diabetes?
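
As a reminder of the quantities asked for in item 3, the following is a minimal sketch in plain Python, not a required approach. It assumes outcomes are coded 1 (diabetes) and 0 (no diabetes). The helper names `sensitivity_specificity` and `k_fold_indices` are hypothetical. Sensitivity is the fraction of true cases predicted as cases, and specificity is the fraction of true non-cases predicted as non-cases.

```python
import random
from typing import List, Sequence, Tuple


def sensitivity_specificity(
    y_true: Sequence[int], y_pred: Sequence[int]
) -> Tuple[float, float]:
    """Compute (sensitivity, specificity) for 0/1 labels and 0/1 predictions."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    sensitivity = tp / (tp + fn) if (tp + fn) > 0 else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) > 0 else float("nan")
    return sensitivity, specificity


def k_fold_indices(
    n: int, k: int, seed: int = 0
) -> List[Tuple[List[int], List[int]]]:
    """Split indices 0..n-1 into k folds; return (train, test) index lists per fold."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # shuffle once, reproducibly
    folds = [indices[i::k] for i in range(k)]  # every index lands in exactly one fold
    splits = []
    for i in range(k):
        test = folds[i]
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        splits.append((train, test))
    return splits
```

With k-fold cross-validation, each subject is predicted exactly once by a model that was fit without that subject, so the predictions from all folds can be pooled before computing sensitivity and specificity.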

  • Yu et al. (2010) Application of support vector machine modeling for prediction of common diseases: the case of diabetes and pre-diabetes. doi:10.1186/1472-6947-10-16

  • Semerdjian and Frank (2017) An ensemble classifier for predicting the onset of Type II diabetes. arXiv:1708.07480