Standard Dataset
MedCD: A Medical Clinical Dataset

- Submitted by: Ye Chen
- Last updated: Mon, 02/10/2025 - 00:05
- DOI: 10.21227/kh7n-8n28
Abstract
We curate and release MedCD, a real-world medical clinical dataset for building generative artificial intelligence (AI) applications in clinical settings. MedCD is one outcome of our longitudinal applied AI research and deployment in a tertiary care hospital in China. First, the dataset is real and comprehensive: it was sourced from real-world electronic health records (EHRs), clinical notes, lab examination reports, and more. Second, the dataset is large: it contains 1.7 million EHR examples involving more than 250K patients, collected from 30 clinical departments over the first quarter of 2024. The scale is comparable to that of MIMIC-IV. The data were de-identified and organized into a format similar to MIMIC-IV free-text clinical notes. Moreover, the objective of this dataset is to accelerate generative AI research and development in healthcare. MedCD not only contains more than a million EHR examples, but also features supervised data for a variety of real, fundamental clinical tasks, backed by months of annotation effort by clinicians. Following the general paradigm of generative AI application development, the MedCD dataset consists of: (1) unsupervised pretraining data, where each patient's data is organized as a medical document; (2) supervised fine-tuning data for a wide spectrum of clinical applications, including named entity recognition (NER), retrieval, and summarization; and (3) benchmark data for evaluating fundamental clinical tasks such as patient triage and clinical note generation. Further, we describe a spectrum of deployed clinical applications that make use of this data, as reference implementations and baselines. We believe that MedCD is, to date, the most comprehensive and largest-scale clinical dataset in Chinese, and the first designed for generative AI research and development in healthcare.
A Medical Clinical Dataset for Building Generative AI in Healthcare.