Homework 4: Model building for prediction
due 30 April 2020
The goal of this project is to build a model for predicting diabetes. We will consider data from the NHANES survey.
I extracted data for a set of 16 predictor variables for the first three waves of the survey (1999-2003). I defined diabetes as having been diagnosed with diabetes or having fasting glucose >= 126 mg/dL. I excluded subjects who were under 20 years old or pregnant.
In the zip file hw4_data.zip, there is a data file plus a data dictionary, both as CSV files.
Note that some of the variables have some strange categories, for example age_smoking and household_income.
- Build a model to predict diabetes from the other variables, for example using logistic regression, a nearest neighbor classifier, or other means. Discuss your choices.
- Discuss your handling of missing data.
- Assess the performance of your predictions. What are the specificity and sensitivity? Use separate training and test sets, or k-fold cross validation.
- Assess the relative importance of the variables. Which variables are most important for predicting diabetes?
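As a reminder for the performance assessment, sensitivity is the fraction of subjects with diabetes that the model predicts as diabetic, and specificity is the fraction of subjects without diabetes that it predicts as non-diabetic. The following is a minimal Python sketch, not a required approach. It assumes labels are coded 1 for diabetes and 0 otherwise; the function names and the toy labels are made up for illustration and do not come from the assignment data.

```python
import random


def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for labels coded 1 = diabetes, 0 = no diabetes."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


def kfold_indices(n, k=5, seed=0):
    """Split the row indices 0..n-1 into k shuffled, disjoint test folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


# Toy labels for illustration only (not NHANES data).
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 1, 0]

sens, spec = sensitivity_specificity(y_true, y_pred)
print(f"sensitivity = {sens:.3f}, specificity = {spec:.3f}")

# Each fold serves once as the test set; the remaining rows form the training set.
folds = kfold_indices(len(y_true), k=4)
```

With k-fold cross validation, one would fit the model on the rows outside each fold, predict the rows inside it, and compute sensitivity and specificity from the pooled out-of-fold predictions.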
Related work
- Yu et al. (2010) Application of support vector machine modeling for prediction of common diseases: the case of diabetes and pre-diabetes. doi:10.1186/1472-6947-10-16
- Semerdjian and Frank (2017) An ensemble classifier for predicting the onset of Type II diabetes. arXiv:1708.07480