Datasets
Standard Dataset
Heterogeneous and Similarity Network Data
- Submitted by: wen wang
- Last updated: Wed, 11/13/2024 - 10:38
- DOI: 10.21227/33k6-ga05
Abstract
Precise prediction of potential drug-disease associations (DDAs) is essential for improving treatment strategies and accelerating drug development. However, current methods often rely on single-modal data and fail to integrate multimodal information effectively when representing node attributes. Furthermore, many feature extraction processes do not combine node attribute features with topological features.
To address these limitations, we propose MedPathEx, a method that integrates multimodal data fusion with metapath-based feature extraction. First, we constructed a biomedical heterogeneous network comprising three entity types (drugs, genes, and diseases) along with their interrelationships. By incorporating multimodal data, we generated similarity networks for the nodes within this heterogeneous network. We then used a graph convolutional network to extract node attribute features from these similarity networks. Simultaneously, metapaths enhanced with a multi-head attention mechanism captured local topological features from the heterogeneous network, whereas a global attention mechanism refined global topological features, enabling the fusion of local and global features. Finally, MedPathEx combined these node attribute features with the network structural features to create a comprehensive feature representation, which was used to calculate the association probabilities between drugs and diseases.
The experimental results indicate that MedPathEx surpasses current methods on key metrics, including AUC, AP, and F1 score. In case studies of coronary artery disease and hypertension, MedPathEx identified novel candidate drugs, underscoring its potential for practical applications.
The data for this study were obtained from three public biomedical databases: Stanford Biomedical Network Dataset Collection (BIOSNAP) [23], Comparative Toxicogenomics Database (CTD) [24], and Pharmacogenomics Knowledgebase (PharmGKB) [25]. These databases provide extensive association data on drugs, genes, and diseases.
Specifically, from BIOSNAP, we acquired 9,761 drug-gene interaction records and 104,327 drug-disease interaction records, covering 4,349 drugs, 2,085 genes, and 565 diseases. From CTD, we obtained 36,321 disease-gene interaction records involving 3,969 genes and 561 diseases. From PharmGKB, we retrieved 3,422 disease-gene interaction records and 6,472 drug-gene interaction records, encompassing 1,614 drugs, 2,053 genes, and 730 diseases.
During data processing, we standardized the names of drugs, genes, and diseases and removed redundant and irrelevant entries, so that each association appears only once. After filtering, we obtained a heterogeneous network with 12,661 nodes and 120,587 edges, including 1,148 diseases, 7,591 genes, 4,050 drugs, 69,034 disease-drug associations, 35,998 disease-gene associations, and 15,555 drug-gene associations.
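The name standardization and duplicate removal described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline used to build the dataset: the `normalize` rule (lower-casing and whitespace collapsing), the four-field record layout, and the function name `build_network` are all hypothetical, and the toy records stand in for the BIOSNAP, CTD, and PharmGKB exports.

```python
from collections import defaultdict


def normalize(name: str) -> str:
    """Hypothetical standardization rule: lower-case and collapse whitespace."""
    return " ".join(name.lower().split())


def build_network(records):
    """Assemble a deduplicated heterogeneous network.

    records: iterable of (source_type, source_name, target_type, target_name).
    Returns (nodes, edges), where nodes maps an entity type to a set of names
    and edges maps a (type, type) relation to a set of (name, name) pairs.
    """
    nodes = defaultdict(set)
    edges = defaultdict(set)
    for s_type, s_name, t_type, t_name in records:
        s, t = normalize(s_name), normalize(t_name)
        if not s or not t:
            continue  # drop entries with a missing endpoint
        # Sort endpoints by type so drug-gene and gene-drug map to one relation.
        (a_type, a), (b_type, b) = sorted([(s_type, s), (t_type, t)])
        nodes[a_type].add(a)
        nodes[b_type].add(b)
        edges[(a_type, b_type)].add((a, b))  # set membership removes duplicates
    return nodes, edges


# Toy records (illustrative only, not taken from the dataset).
records = [
    ("drug", "Aspirin", "disease", "Hypertension"),
    ("disease", "hypertension", "drug", "aspirin "),  # duplicate after normalization
    ("drug", "Aspirin", "gene", "PTGS2"),
    ("disease", "Coronary Artery Disease", "gene", "PTGS2"),
    ("drug", "", "gene", "TP53"),  # dropped: empty drug name
]

nodes, edges = build_network(records)
n_nodes = sum(len(names) for names in nodes.values())
n_edges = sum(len(pairs) for pairs in edges.values())
print(f"{n_nodes} nodes, {n_edges} edges")  # prints "4 nodes, 3 edges"
```

Summing the per-type node sets and the per-relation edge sets in this way gives a quick consistency check of reported totals against the per-type counts.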