electronic health records
We present the SynSUM benchmark, a synthetic dataset linking unstructured clinical notes to structured background variables. The dataset consists of 10,000 artificial patient records, each containing tabular variables (such as symptoms, diagnoses and underlying conditions) and an associated clinical note describing a fictional patient encounter in the domain of respiratory diseases. The tabular portion of the data is generated through a Bayesian network, in which both the causal structure between the variables and the conditional probabilities are proposed by an expert based on domain knowledge.
- Categories:
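For readers unfamiliar with how tabular records can be drawn from a Bayesian network, the following is a minimal sketch of ancestral sampling, in which each parent variable is sampled before its children. The three variables, the edges between them, and every probability are hypothetical and chosen only for illustration; they are not the SynSUM structure or its expert-defined parameters.

```python
import random

# Hypothetical toy network (NOT the SynSUM structure or probabilities):
#   pneumonia -> fever, pneumonia -> cough
P_PNEUMONIA = 0.05                          # P(pneumonia)
P_FEVER_GIVEN = {True: 0.80, False: 0.10}   # P(fever | pneumonia)
P_COUGH_GIVEN = {True: 0.90, False: 0.20}   # P(cough | pneumonia)


def sample_record(rng):
    """Ancestral sampling: draw the parent first, then each child given it."""
    pneumonia = rng.random() < P_PNEUMONIA
    fever = rng.random() < P_FEVER_GIVEN[pneumonia]
    cough = rng.random() < P_COUGH_GIVEN[pneumonia]
    return {"pneumonia": pneumonia, "fever": fever, "cough": cough}


def sample_records(n, seed=0):
    """Draw n independent records with a seeded generator for reproducibility."""
    rng = random.Random(seed)
    return [sample_record(rng) for _ in range(n)]


# Same record count as the dataset described above.
records = sample_records(10_000)
```

Because the generator is seeded, repeated calls with the same `seed` return identical records, which is convenient when a synthetic table has to be regenerated exactly.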
A refined dataset derived from the supplementary materials of "Predicting 30-days mortality for MIMIC-III patients with sepsis-3: a machine learning approach using XGboost". Rows with invalid age values were removed, the feature columns were selected, and the data type of each column was adjusted.
- Categories:
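The three preprocessing steps described above (dropping rows with invalid ages, selecting feature columns, and adjusting column types) could look roughly like the sketch below. The column names, the age bounds, and the target types are all assumptions made for illustration; the description does not specify which columns were kept or what counted as an invalid age.

```python
# Hypothetical feature columns and target types; the source does not list them.
FEATURE_TYPES = {"age": int, "lactate": float, "died_within_30d": int}

# Assumed plausibility bounds for age; the source does not define "invalid".
MIN_AGE, MAX_AGE = 0, 120


def clean_rows(raw_rows):
    """Drop rows with invalid age, keep selected columns, cast each to its type."""
    cleaned = []
    for row in raw_rows:
        try:
            age = float(row["age"])
        except (KeyError, TypeError, ValueError):
            continue  # missing or non-numeric age
        if not (MIN_AGE <= age <= MAX_AGE):
            continue  # out-of-range age (also drops NaN)
        # Keep only the selected features, parsing as float before the final cast.
        cleaned.append(
            {col: cast(float(row[col])) for col, cast in FEATURE_TYPES.items()}
        )
    return cleaned


# Made-up rows as they might arrive from a CSV reader (all values are strings).
raw = [
    {"subject_id": "1", "age": "67", "lactate": "2.1", "died_within_30d": "0"},
    {"subject_id": "2", "age": "300", "lactate": "4.0", "died_within_30d": "1"},
    {"subject_id": "3", "age": "", "lactate": "1.5", "died_within_30d": "0"},
]
cleaned = clean_rows(raw)
```

In this sketch only the age column is validated; a missing or malformed value in any other selected column raises an error rather than being dropped silently, which makes unexpected gaps visible early.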