Datasets
Standard Dataset
NCBI; BC5CDR; i2b2 2010; HPRD50; AIMed; MedNLI
- Submitted by:
- chen peng
- Last updated:
- Tue, 04/02/2024 - 01:16
- DOI:
- 10.21227/ardx-5f55
Abstract
NCBI: The NCBI dataset is a biomedical corpus of 793 PubMed abstracts, each manually annotated with disease mentions and their corresponding concepts, providing a high-quality gold standard for disease name recognition and normalization research.
BC5CDR-disease: The BioCreative V Chemical-Disease Relation (BC5CDR) corpus is annotated for the biomedical named entity recognition and relation extraction tasks. It consists of 1,500 PubMed articles annotated with disease and chemical entities and their interactions. In this paper, we consider only the disease entities for the named entity recognition task.
i2b2 2010: The i2b2 2010 dataset was sourced from three distinct medical institutions and annotated by medical professionals with eight types of relations among medical problems, treatments, and tests: TrIP, TrWP, TrCP, TrAP, TrNAP, PIP, TeRP, and TeCP.
HPRD50: The HPRD50 dataset is sourced from the HPRD database and is used to study human protein-protein interactions (PPI). The corpus consists of 43 documents annotated with true and false PPI relations.
AIMed: The AIMed dataset was developed to evaluate protein name recognition and protein-protein interaction (PPI) extraction. The corpus consists of 225 documents annotated with true and false PPI relations.
MedNLI: MedNLI is collected from MIMIC-III in the form of premise-hypothesis pairs. Annotated by radiologists, each pair is labeled as entailment, contradiction, or neutral according to whether the premise entails, contradicts, or is neutral toward the hypothesis.
To start the program, specify the file path and run the run.sh script. The code has been uploaded to GitHub.
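The eight i2b2 2010 relation labels are given above only as abbreviations, so a small lookup table may help when reading the annotations. The sketch below spells them out as they are commonly described for the i2b2 2010 relations task. The names `I2B2_2010_RELATIONS` and `relation_group` are hypothetical and do not come from the dataset or its accompanying code.

```python
# Illustrative lookup table: expansions of the eight i2b2 2010 relation labels.
# The constant and function names are assumptions made for this sketch.
I2B2_2010_RELATIONS = {
    "TrIP": "treatment improves medical problem",
    "TrWP": "treatment worsens medical problem",
    "TrCP": "treatment causes medical problem",
    "TrAP": "treatment is administered for medical problem",
    "TrNAP": "treatment is not administered because of medical problem",
    "PIP": "medical problem indicates medical problem",
    "TeRP": "test reveals medical problem",
    "TeCP": "test conducted to investigate medical problem",
}


def relation_group(label: str) -> str:
    """Return the entity pair that a relation label connects."""
    if label not in I2B2_2010_RELATIONS:
        raise ValueError(f"unknown i2b2 2010 relation label: {label}")
    if label.startswith("Tr"):
        return "treatment-problem"
    if label.startswith("Te"):
        return "test-problem"
    return "problem-problem"  # only PIP remains


if __name__ == "__main__":
    for name, meaning in I2B2_2010_RELATIONS.items():
        print(f"{name:6s} {relation_group(name):18s} {meaning}")
```

Grouping by prefix shows that the labels cover three entity pairs: treatment-problem, test-problem, and problem-problem.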