Dataset used in NDKG
- Submitted by: Peifu Han
- Last updated: Fri, 06/28/2024 - 03:22
- DOI: 10.21227/hzkm-gx73
Abstract
Data resources are essential for developing a trustworthy knowledge graph (KG). The large amount of accessible medical data makes it possible to build a large-scale medical knowledge graph (MKG) with rich and reliable medical entities and relationships. Based on our survey of related work, we divide the data sources used into two groups: structured medical knowledge databases and unstructured scientific publications. The second group covers medical knowledge about a small number of natural products together with a large body of unstructured scientific publications, such as those indexed in PubMed, as shown in Table 1.
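Table 1. Data sources of MKGs: structured medical knowledge databases and unstructured scientific publications

| Source type | Name | Related research |
|---|---|---|
| Structured medical knowledge database | KEGG | [20] |
| Structured medical knowledge database | SIDER | [21] |
| Structured medical knowledge database | ICD-10 | [22] |
| Structured medical knowledge database | InterBioScreen | [23] |
| Structured medical knowledge database | DrugBank | [24] |
| Unstructured scientific publications | Literature | [25,26] |
| Unstructured scientific publications | Textbooks | [27] |
| Unstructured scientific publications | Online resources | [28] |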
Structured medical knowledge databases are open, freely available collections of disease knowledge created by researchers, such as RepoDB [29], KEGG [20], and SemMedDB [30]. Malas et al. [31] used the semantic information between drugs and diseases in the existing knowledge graph RepoDB [29], a standard drug repositioning database. Korn et al. [32] established COVID-KOP, a new knowledge base that combines the ROBOKOP [33] biomedical knowledge graph with information from the contemporary biomedical literature on COVID-19.
Unstructured scientific publications, such as literature, textbooks, guides, and other scientific publications, are published by authoritative institutions, publishers, and researchers. Because these data sources are significantly more reliable and widely available, they have been used to build large MKGs and domain-specific MKGs. For example, Zhang et al. [34] collected Parkinson's disease-related links from the medical literature and built a knowledge graph from them.
[20] Kanehisa, Minoru. "The KEGG database." ‘In silico’ Simulation of Biological Processes: Novartis Foundation Symposium 247. Vol. 247. Chichester, UK: John Wiley & Sons, Ltd, 2002.
[21] Kuhn, Michael, et al. "The SIDER database of drugs and side effects." Nucleic Acids Research 44.D1 (2016): D1075-D1079.
[22] World Health Organization. The ICD-10 classification of mental and behavioural disorders: clinical descriptions and diagnostic guidelines. Vol. 1. World Health Organization, 1992.
[23] InterBioScreen Databases. Retrieved from https://www.ibscreen.com/, 2022.
[24] Wishart, David S., et al. "DrugBank 5.0: a major update to the DrugBank database for 2018." Nucleic Acids Research 46.D1 (2018): D1074-D1082.
[25] White, Jacob. "PubMed 2.0." Medical Reference Services Quarterly 39.4 (2020): 382-387.
[26] Kim, Sunghwan, et al. "PubChem 2023 update." Nucleic Acids Research 51.D1 (2023): D1373-D1380.
[27] Sun, Haixia, et al. "Medical knowledge graph to enhance fraud, waste, and abuse detection on claim data: model development and performance evaluation." JMIR Medical Informatics 8.7 (2020): e17653.
[28] Bao, Qiming, Lin Ni, and Jiamou Liu. "HHH: an online medical chatbot system based on knowledge graph and hierarchical bi-directional attention." Proceedings of the Australasian Computer Science Week Multiconference. 2020.
[29] Brown, Adam S., and Chirag J. Patel. "A standard database for drug repositioning." Scientific Data 4.1 (2017): 1-7.
[30] Kilicoglu, Halil, et al. "SemMedDB: a PubMed-scale repository of biomedical semantic predications." Bioinformatics 28.23 (2012): 3158-3160.
[31] Malas, Tareq B., et al. "Drug prioritization using the semantic properties of a knowledge graph." Scientific Reports 9.1 (2019): 6281.
[32] Korn, Daniel, et al. "COVID-KOP: integrating emerging COVID-19 data with the ROBOKOP database." Bioinformatics 37.4 (2021): 586-587.
[33] Morton, Kenneth, et al. "ROBOKOP: an abstraction layer and user interface for knowledge graphs to support question answering." Bioinformatics 35.24 (2019): 5382-5384.
[34] Zhang, Xiaolin, and Chao Che. "Drug repurposing for Parkinson’s disease by integrating knowledge graph completion model and knowledge fusion of medical literature." Future Internet 13.1 (2021): 14.
Dataset Files
- KEGG.rar (246.73 kB)
- meddra_all_indications.tsv.gz (336.61 kB)
- meddra_all_label_indications.tsv.gz (5.64 MB)
- meddra_all_label_se.tsv.gz (40.56 MB)
- meddra_all_se.tsv.gz (2.27 MB)
- meddra_freq.tsv.gz (1.96 MB)
- 数据库_preprocessed.rar (87.24 kB)
- SIDER_meddra_all_label_se.tsv.gz (40.56 MB)
- DECAGON_bio-decagon-combo.tar.gz (34.04 MB)
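Several of the files listed above are gzip-compressed, tab-separated tables. The following is a minimal sketch of how such a file might be inspected with the Python standard library. It assumes only what the `.tsv.gz` extension suggests (UTF-8 text, one record per line, tab-delimited fields); the helper name `read_tsv_gz` is hypothetical, and because this page does not document the column layout, rows are returned as plain lists of strings rather than named fields.

```python
import csv
import gzip
from itertools import islice
from typing import Iterator, List, Optional


def read_tsv_gz(path: str, limit: Optional[int] = None) -> Iterator[List[str]]:
    """Yield rows of a gzip-compressed, tab-separated file as lists of strings.

    `limit` caps the number of rows read; None reads the whole file.
    Assumption: the file is UTF-8 text and uses no CSV-style quoting.
    """
    with gzip.open(path, mode="rt", encoding="utf-8", newline="") as handle:
        # QUOTE_NONE keeps quote characters inside fields as literal text.
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        yield from islice(reader, limit)


# Example use (assumes the file has been downloaded to the working directory):
# for row in read_tsv_gz("meddra_all_se.tsv.gz", limit=5):
#     print(len(row), row)
```

Reading a few rows first is a cheap way to check the number of columns and whether a header line is present before loading one of the larger files (for example the 40.56 MB `meddra_all_label_se.tsv.gz`) in full.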