EXPACT NHANES dataset
- Submitted by: Wei Qiu
- Last updated: Thu, 07/30/2020 - 01:37
- DOI: 10.21227/69wh-ya88
Abstract
The dataset is used in "EXPACT: Explainable complex machine learning prediction of all-cause mortality in the U.S."
The National Health and Nutrition Examination Survey (NHANES), run by the National Center for Health Statistics (NCHS; http://www.cdc.gov/nchs/nhanes.htm), conducts interviews and physical examinations to assess the health and nutritional status of people of all ages in the United States. The interviews include demographic, socioeconomic, dietary, and health-related questions. The examinations include medical, dental, and physiological measurements, as well as laboratory tests administered by highly trained medical personnel. Since 1999, data have been collected and released in 2-year cycles. Each year, NHANES examines a nationally representative sample of roughly 5,000 individuals across the United States. In this study, we include NHANES data collected between 1999 and 2014. All-cause mortality is ascertained from a linked NHANES mortality file that provides follow-up mortality data from the date of survey participation through December 31, 2015.
Our study includes samples with known mortality status who participated in NHANES 1999-2014 (n = 47,261). We include all demographic, laboratory, examination, and questionnaire features that could be automatically matched across different NHANES cycles. We exclude variables that are missing for more than 50% of the participants, and we remove highly correlated features (correlation greater than 0.98); after filtering, 133 features remain. We impute missing data using MissForest, a nonparametric multiple imputation method for mixed-type data based on a random forest model, run for seven iterations. We predict all-cause mortality in two broad settings: (1) follow-up times of 1, 3, and 5 years, and (2) age groups of <40, 40-65, 65-80, and ≥80 years old. For each follow-up time, we remove samples whose mortality status is unconfirmed. For the age groups, we predict 5-year mortality.
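The filtering and labeling steps described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code used to build the dataset. The function names `filter_features` and `mortality_label` are hypothetical. The sketch also assumes that correlation means absolute Pearson correlation, that the later column of each highly correlated pair is the one dropped, and that "unconfirmed" means a participant not recorded as deceased whose follow-up ends before the prediction horizon.

```python
import numpy as np
import pandas as pd


def filter_features(df, max_missing=0.5, max_corr=0.98):
    """Drop sparsely observed columns, then one column of each highly correlated pair.

    Assumption: absolute Pearson correlation on numeric columns; the later
    column of a correlated pair is dropped.
    """
    # Keep columns whose fraction of missing values does not exceed the threshold.
    keep = df.columns[df.isna().mean() <= max_missing]
    df = df[keep]

    # Absolute pairwise correlations among the numeric columns.
    corr = df.select_dtypes(include="number").corr().abs()

    # Look only at the strict upper triangle so each pair is considered once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > max_corr).any()]
    return df.drop(columns=to_drop)


def mortality_label(followup_years, died, horizon_years):
    """Return 1.0 (died within horizon), 0.0 (followed past horizon), NaN otherwise.

    Assumption: NaN marks an unconfirmed status, i.e. not recorded as deceased
    and followed for no longer than the horizon.
    """
    followup_years = pd.Series(followup_years, dtype=float)
    died = pd.Series(died).astype(bool)

    label = pd.Series(np.nan, index=followup_years.index)
    label[died & (followup_years <= horizon_years)] = 1.0
    label[followup_years > horizon_years] = 0.0
    return label
```

Rows labeled NaN would be removed before training a model for that horizon, for example with `labels.dropna()`. The same label function with `horizon_years=5` would apply within each age group.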