EXPACT NHANES dataset

Citation Author(s):
Wei Qiu
Submitted by:
Wei Qiu
Last updated:
Thu, 07/30/2020 - 01:37
DOI:
10.21227/69wh-ya88

Abstract 

The dataset is used in "EXPACT: Explainable complex machine learning prediction of all-cause mortality in the U.S." 

The National Health and Nutrition Examination Survey (NHANES), run by the National Center for Health Statistics (NCHS) (http://www.cdc.gov/nchs/nhanes.htm), conducts interviews and physical examinations to assess the health and nutritional status of people of all ages in the United States. The interviews include demographic, socioeconomic, dietary, and health-related questions. The examinations include medical, dental, and physiological measurements, as well as laboratory tests administered by highly trained medical personnel. Since 1999, data have been collected and released in 2-year cycles. Each year NHANES examines a nationally representative sample of roughly 5,000 individuals across the United States. In this study, we include NHANES data sampled between 1999 and 2014. All-cause mortality is ascertained from a linked NHANES mortality file that provides follow-up mortality data from the date of survey participation through December 31, 2015. Our study includes samples with known mortality status who participated in NHANES 1999-2014 (n = 47,261). We include all demographic, laboratory, examination, and questionnaire features that could be automatically matched across different NHANES cycles. 

Instructions: 

We exclude variables that are missing for more than 50% of the participants, and we drop highly correlated features (correlation greater than 0.98); after filtering, 133 features remain. We impute missing data using MissForest, a nonparametric imputation method for mixed-type data based on a random forest model, run for seven iterations. We predict all-cause mortality in two broad settings: (1) follow-up times of 1, 3, and 5 years, and (2) age groups of <40, 40-65, 65-80, and ≥80 years old. For the different follow-up times, we remove samples with unconfirmed mortality status. For the different age groups, we predict 5-year mortality.
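
The filtering and labeling steps above can be sketched in Python with pandas. This is a minimal illustration under stated assumptions, not the code used in the study: the function names, the use of absolute Pearson correlation, the rule of dropping the later column of each correlated pair, and the reading of "unconfirmed mortality status" as "alive but followed for less than k years" are all assumptions made for the example.

```python
import numpy as np
import pandas as pd


def filter_features(X: pd.DataFrame, max_missing: float = 0.5,
                    max_corr: float = 0.98) -> pd.DataFrame:
    """Drop sparsely observed columns, then one column of each highly correlated pair.

    Assumption: "correlation" means absolute Pearson correlation between
    numeric columns, and the later column of each pair is the one dropped.
    """
    # Keep columns whose fraction of missing values is at most max_missing.
    X = X.loc[:, X.isna().mean() <= max_missing]
    # Absolute pairwise correlations over numeric columns only.
    corr = X.select_dtypes(include="number").corr().abs()
    # Keep the strict upper triangle so each pair is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > max_corr).any()]
    return X.drop(columns=to_drop)


def k_year_label(died: pd.Series, followup_years: pd.Series, k: float) -> pd.Series:
    """Return 1.0 (died within k years), 0.0 (followed for at least k years), or NaN.

    Assumption: NaN marks participants who were alive at last contact but
    followed for less than k years; these rows would be removed.
    """
    label = pd.Series(np.nan, index=died.index, dtype=float)
    # Followed for at least k years: alive at the k-year mark unless overridden below.
    label.loc[followup_years >= k] = 0.0
    # Death recorded at or before k years of follow-up.
    label.loc[(died == 1) & (followup_years <= k)] = 1.0
    return label
```

For a 5-year target, one would call `k_year_label(died, followup_years, k=5)` and drop the rows where the result is NaN before training. The imputation step is not shown here; MissForest is distributed as the R package `missForest`.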