NCBI; BC5CDR; i2b2 2010; HPRD50; AIMed; MedNLI

Name: NCBI; BC5CDR; i2b2 2010; HPRD50; AIMed; MedNLI
Creator: chen peng
License: https://creativecommons.org/licenses/by/4.0/
Keywords: Artificial Intelligence

Citation Author(s): Rezarta Islamaj Dogan; Jiao Li; Uzuner Özlem; Katrin Fundel; Razvan C. Bunescu; Alexey Romanov; Soumya Sanyal
Submitted by: chen peng
Last updated: Tue, 04/02/2024 - 05:16
DOI: 10.21227/ardx-5f55
Data Format: *.JSON (ZIP)

Categories: Artificial Intelligence

Keywords: artificial intelligence; machine learning; natural language processing; named entity recognition; relation extraction; text entailment

Abstract

NCBI: The NCBI dataset is a biomedical corpus of 793 PubMed abstracts, each manually annotated with disease mentions and their corresponding concepts. It provides a high-quality gold standard for research on disease name recognition and normalization.

BC5CDR-disease: The BioCreative V Chemical-Disease Relation (BC5CDR) corpus is annotated for the biomedical named entity recognition and relation extraction tasks. It consists of 1500 PubMed articles annotated with disease and chemical entities and their interactions. In this paper, we consider only the disease entities for the named entity recognition task.

i2b2 2010: The i2b2 2010 dataset was sourced from three distinct medical institutions and annotated by medical professionals with eight types of relations among medical problems, treatments, and tests: TrIP, TrWP, TrCP, TrAP, TrNAP, PIP, TeRP, and TeCP.
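
The eight relation labels can be kept in a small lookup table when preparing the data. The sketch below is illustrative only: the label expansions follow the commonly cited i2b2 2010 relation challenge definitions rather than anything stated on this page, and the names `I2B2_2010_RELATIONS` and `relation_group` are hypothetical helpers, not part of the dataset.

```python
# Assumed expansions of the eight i2b2 2010 relation labels
# (taken from the challenge's usual definitions, not from this page).
I2B2_2010_RELATIONS = {
    "TrIP": "treatment improves medical problem",
    "TrWP": "treatment worsens medical problem",
    "TrCP": "treatment causes medical problem",
    "TrAP": "treatment is administered for medical problem",
    "TrNAP": "treatment is not administered because of medical problem",
    "PIP": "medical problem indicates medical problem",
    "TeRP": "test reveals medical problem",
    "TeCP": "test conducted to investigate medical problem",
}


def relation_group(label: str) -> str:
    """Return the entity-pair group a relation label belongs to."""
    if label not in I2B2_2010_RELATIONS:
        raise KeyError(f"unknown i2b2 2010 relation label: {label!r}")
    if label.startswith("Tr"):
        return "treatment-problem"
    if label.startswith("Te"):
        return "test-problem"
    return "problem-problem"  # only PIP remains
```

Grouping by entity pair makes it easy to check that candidate pairs are scored only against the labels that can apply to them.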

HPRD50: The HPRD50 dataset is sourced from the HPRD database and is used for studying human protein-protein interactions (PPI). The HPRD50 corpus consists of 43 documents annotated with true and false PPI relations.

AIMed: The AIMed dataset was developed to evaluate protein name recognition and protein-protein interaction (PPI) extraction. The AIMed corpus consists of 225 documents annotated with true and false PPI relations.

MedNLI: MedNLI is collected from MIMIC-III in the form of premise-hypothesis pairs. Annotated by radiologists, each pair is labeled as entailment, contradiction, or neutral according to whether the premise entails the hypothesis.
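
Since the data is distributed as JSON files inside a ZIP archive, a reader along the following lines may be a convenient starting point. This is a minimal sketch under stated assumptions: the archive layout, file names, and record fields are not documented on this page, so the demo builds a placeholder archive in memory, and `iter_json_members` is a hypothetical helper name. It also assumes each file is a single JSON document rather than JSON Lines.

```python
import io
import json
import zipfile


def iter_json_members(archive):
    """Yield (member_name, parsed_json) for every .json file in a ZIP archive.

    `archive` may be a filesystem path or a binary file-like object.
    """
    with zipfile.ZipFile(archive) as zf:
        for name in zf.namelist():
            if name.lower().endswith(".json"):
                with zf.open(name) as fh:
                    yield name, json.load(fh)


# Demo with an in-memory archive; member names and fields are placeholders,
# not the real layout of the downloadable dataset.
demo = io.BytesIO()
with zipfile.ZipFile(demo, "w") as zf:
    zf.writestr(
        "example/train.json",
        json.dumps([{"text": "placeholder", "label": "placeholder"}]),
    )
    zf.writestr("example/README.txt", "not JSON, skipped")
demo.seek(0)

records = dict(iter_json_members(demo))
```

For the real download, pass the path of the ZIP file instead of the in-memory buffer and inspect the yielded member names first to learn the actual layout.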