We present the SynSUM benchmark, a synthetic dataset linking unstructured clinical notes to structured background variables. The dataset consists of 10,000 artificial patient records, each containing tabular variables (such as symptoms, diagnoses, and underlying conditions) and an associated clinical note describing a fictional patient encounter in the domain of respiratory diseases. The tabular portion of the data is generated through a Bayesian network, in which both the causal structure between the variables and the conditional probabilities are proposed by an expert based on domain knowledge.
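To make the idea of generating tabular records from a Bayesian network concrete, the following is a minimal sketch of ancestral sampling in Python. It is not the SynSUM generator: the three variables, the network structure (`pneumonia` as the parent of `fever` and `cough`), and every probability below are hypothetical values chosen purely for illustration, as are the function names `sample_record` and `sample_records`.

```python
import random

# Hypothetical toy network, NOT the SynSUM structure or its probabilities:
#   pneumonia -> fever
#   pneumonia -> cough
P_PNEUMONIA = 0.05                          # assumed prior P(pneumonia)
P_FEVER_GIVEN = {True: 0.80, False: 0.10}   # assumed P(fever | pneumonia)
P_COUGH_GIVEN = {True: 0.90, False: 0.20}   # assumed P(cough | pneumonia)


def sample_record(rng):
    """Ancestral sampling: draw each parent before its children."""
    pneumonia = rng.random() < P_PNEUMONIA
    fever = rng.random() < P_FEVER_GIVEN[pneumonia]
    cough = rng.random() < P_COUGH_GIVEN[pneumonia]
    return {"pneumonia": pneumonia, "fever": fever, "cough": cough}


def sample_records(n, seed=0):
    """Draw n independent records with a seeded generator for repeatability."""
    rng = random.Random(seed)
    return [sample_record(rng) for _ in range(n)]


# Same record count as the dataset described above; the contents are toy data.
records = sample_records(10_000)
```

Because each variable is drawn conditionally on its parents, the sampled table reflects the dependencies encoded in the network; in a real setting the structure and conditional probability tables would come from the domain expert rather than from constants like these.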


[1] Paloma Rabaey, Stefan Heytens, Thomas Demeester, "SynSUM – Synthetic Benchmark with Structured and Unstructured Medical Records", IEEE Dataport, 2025. [Online]. Available: http://dx.doi.org/10.21227/3sk0-2015. Accessed: Mar. 18, 2025.