Features Vectors of IDriveGenes: Cancer Driver Genes Dataset
- Submitted by: Muhammad Tahir
- Last updated: Wed, 03/22/2023 - 05:49
- DOI: 10.21227/c1d0-2856
Abstract
The development of high-throughput sequencing technologies, i.e., Next Generation Sequencing (NGS), is revolutionizing the exploration of cancer. Sequence datasets are highly complex, and mutations can occur randomly in DNA or RNA sequences, making cells sicker or less fit. Unusual growth and behavior of genes in cells cause cancer, and cells carrying cancer driver genes grow when mutations occur. Identifying cancer driver genes is a critical and challenging problem for researchers. In the proposed work, robust features are first extracted from the sequence dataset through a Position Relative Incidence Matrix (PRIM) integrated with Accumulative Absolute Position Incidence Vector (AAPIV) generation. PRIM and AAPIV convert the one-dimensional sequence data into two-dimensional numeric data. Support Vector Machine (SVM), Artificial Neural Network (ANN), and Random Forest (RF) classifiers are then trained on these features. The proposed model is validated with several methods, i.e., independent testing, k-fold cross-validation, self-consistency, and jackknife testing. The proposed model predicts whether a given primary structure corresponds to a cancer driver gene. The results show accuracies of 95%, 92%, and 69% for RF, ANN, and SVM, respectively. A comparative analysis with existing state-of-the-art models, i.e., 20/20+ and Multimodal Deep Neural Network by integrating Multi-dimensional Data (MDNNMD), shows that the proposed model outperforms the existing techniques.
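To illustrate the two feature extractors named in the abstract, the following is a minimal sketch and not the authors' implementation. It assumes the definitions commonly used for these descriptors: AAPIV sums the 1-based positions at which each symbol occurs, and PRIM entry (i, j) sums the positions of symbol j measured relative to the first occurrence of symbol i. The function names `aapiv` and `prim`, the 20-letter amino-acid alphabet, and the choice to count only positions after the first occurrence are assumptions made for illustration; the article may define these details differently.

```python
from typing import Dict, List

# Assumption: sequences use the 20 standard amino-acid letters.
# Pass alphabet="ACGT" (or similar) for nucleotide sequences.
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def aapiv(sequence: str, alphabet: str = ALPHABET) -> List[int]:
    """Return, for each alphabet symbol, the sum of its 1-based positions."""
    index: Dict[str, int] = {symbol: i for i, symbol in enumerate(alphabet)}
    vector = [0] * len(alphabet)
    for position, symbol in enumerate(sequence.upper(), start=1):
        if symbol in index:  # symbols outside the alphabet are skipped
            vector[index[symbol]] += position
    return vector


def prim(sequence: str, alphabet: str = ALPHABET) -> List[List[int]]:
    """Return a len(alphabet) x len(alphabet) matrix of relative positions.

    Entry [i][j] is the sum of (position of symbol j) - (first position of
    symbol i), taken over occurrences of j that come after that first i.
    """
    index: Dict[str, int] = {symbol: i for i, symbol in enumerate(alphabet)}
    seq = sequence.upper()

    # Record the first 1-based position of every symbol that appears.
    first: Dict[str, int] = {}
    for position, symbol in enumerate(seq, start=1):
        if symbol in index and symbol not in first:
            first[symbol] = position

    matrix = [[0] * len(alphabet) for _ in alphabet]
    for anchor, anchor_pos in first.items():
        row = matrix[index[anchor]]
        for position, symbol in enumerate(seq, start=1):
            if symbol in index and position > anchor_pos:
                row[index[symbol]] += position - anchor_pos
    return matrix


if __name__ == "__main__":
    toy = "ACCA"  # toy sequence over a two-letter alphabet
    print(aapiv(toy, alphabet="AC"))
    print(prim(toy, alphabet="AC"))
```

Both outputs have a fixed size that depends only on the alphabet, so sequences of different lengths map to feature arrays of the same shape, which is what a classifier such as RF or SVM requires.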
The supplementary file contains the feature vectors generated after preprocessing.
The details are discussed in the article.
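Two of the validation schemes named in the abstract, k-fold cross-validation and jackknife testing, differ only in how the feature vectors are partitioned. The sketch below shows that partitioning in plain Python under stated assumptions: the helper names `k_fold_indices` and `jackknife_indices` are hypothetical, folds are contiguous and unshuffled, and nothing here reflects the exact protocol used in the article.

```python
from typing import Iterator, List, Tuple

Split = Tuple[List[int], List[int]]  # (train indices, test indices)


def k_fold_indices(n_samples: int, k: int) -> Iterator[Split]:
    """Yield (train, test) index lists for k contiguous, near-equal folds.

    No shuffling is done here; shuffle the samples beforehand if their
    order is not random.
    """
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        # The first `extra` folds take one additional sample each.
        size = base + (1 if fold < extra else 0)
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


def jackknife_indices(n_samples: int) -> Iterator[Split]:
    """Jackknife (leave-one-out): each sample is the test set exactly once."""
    return k_fold_indices(n_samples, n_samples)


if __name__ == "__main__":
    for train, test in k_fold_indices(5, 2):
        print("train:", train, "test:", test)
```

In this framing, self-consistency testing corresponds to training and testing on the same full set of indices, while independent testing holds out one fixed subset that is never used for training.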
Documentation
Attachment | Size |
---|---|
Read Me.txt | 124 bytes |