MedCD: A Medical Clinical Dataset

Citation Author(s):
Ye Chen
Tiger Research
Submitted by:
Ye Chen
Last updated:
Mon, 02/10/2025 - 00:05
DOI:
10.21227/kh7n-8n28

Abstract 

We curate and release a real-world medical clinical dataset, MedCD, for building generative artificial intelligence (AI) applications in the clinical setting. The MedCD dataset is one outcome of our longitudinal applied AI research and deployment in a tertiary care hospital in China. First, the dataset is real and comprehensive: it was sourced from real-world electronic health records (EHRs), clinical notes, lab examination reports, and more. Second, the dataset is large: it contains 1.7 million EHR examples involving more than 250K patients, collected from 30 clinical departments over the first quarter of 2024. The scale is comparable to that of MIMIC-IV. The data was de-identified and organized into a format similar to MIMIC-IV free-text clinical notes. Moreover, the objective of this dataset is to accelerate generative AI research and development in healthcare. MedCD not only contains more than a million EHR examples, but also features supervised data for a variety of real, fundamental clinical tasks, backed by months of annotation effort by clinicians. Following the general paradigm of generative AI application development, the MedCD dataset consists of: (1) unsupervised pretraining data, where each patient's data is organized as a medical document; (2) supervised fine-tuning data for a wide spectrum of clinical applications, including named entity recognition (NER), retrieval, and summarization; and (3) benchmark data for evaluating fundamental clinical tasks such as patient triage and notes generation. Further, we describe a spectrum of deployed clinical applications that make use of this data, as reference implementations and baselines. We believe that MedCD is, to date, the most comprehensive and largest-scale clinical dataset in Chinese, and the first designed for generative AI research and development in healthcare.

Instructions: 

A Medical Clinical Dataset for Building Generative AI in Healthcare.
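
The abstract states that the pretraining portion organizes each patient's data as a single medical document. The listing does not give the file layout, so the following is only a minimal sketch of that idea, assuming the notes arrive as records with the hypothetical fields `patient_id`, `note_type`, `timestamp`, and `text`. These names, and the sample records, are placeholders for illustration and are not taken from MedCD.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def build_patient_documents(notes: Iterable[dict]) -> Dict[str, str]:
    """Group free-text notes by patient and join them into one document each.

    The keys "patient_id", "note_type", "timestamp", and "text" are
    hypothetical; the actual MedCD field names are not given in this listing.
    """
    grouped: Dict[str, List[dict]] = defaultdict(list)
    for note in notes:
        grouped[note["patient_id"]].append(note)

    documents: Dict[str, str] = {}
    for patient_id, patient_notes in grouped.items():
        # Order notes chronologically; ISO-8601 strings sort correctly as text.
        patient_notes.sort(key=lambda n: n["timestamp"])
        sections = [f"[{n['note_type']}] {n['text']}" for n in patient_notes]
        documents[patient_id] = "\n\n".join(sections)
    return documents


# Made-up placeholder records, not real MedCD content.
example_notes = [
    {"patient_id": "p001", "note_type": "lab_report",
     "timestamp": "2024-01-05T09:30:00", "text": "placeholder lab text"},
    {"patient_id": "p001", "note_type": "admission_note",
     "timestamp": "2024-01-03T14:00:00", "text": "placeholder admission text"},
    {"patient_id": "p002", "note_type": "discharge_summary",
     "timestamp": "2024-02-11T10:15:00", "text": "placeholder discharge text"},
]

docs = build_patient_documents(example_notes)
```

Grouping by patient before concatenating keeps one document per patient, which matches the description in the abstract; how the released files actually delimit note types or order them may differ.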
