Features Vectors of IDriveGenes: Cancer Driver Genes Dataset

Citation Author(s): Yasir Ali (Department of Computer Science, Sir Syed Case Institute of Technology, Islamabad, Pakistan)

Muhammad Sardaraz (Department of Computer Science, COMSATS University Islamabad, Attock Campus, Attock, Pakistan)

Muhammad Tahir (Department of Computer Science, COMSATS University Islamabad, Attock Campus, Attock, Pakistan)

Hela Elmannai (Department of Information Technology, College of Computer and Information Sciences, Princess Nourah bint Abdulrahman University, P.O. Box 84428, Riyadh, Saudi Arabia)

Monia Hamdi (Department of Information Technology, College of Computer and Information Sciences, Princess Nourah bint Abdulrahman University, P.O. Box 84428, Riyadh, Saudi Arabia)

Amel Ksibi (Department of Information Systems, College of Computer and Information Sciences, Princess Nourah bint Abdulrahman University, P.O. Box 84428, Riyadh, Saudi Arabia)
Submitted by: Muhammad Tahir
Last updated: Wed, 03/22/2023 - 09:49
DOI: 10.21227/c1d0-2856
Data Format: *.csv
Research Article Link: IDriveGenes: Cancer Driver Genes Prediction using Machine Learning

Keywords:

Cancer; Driver Genes; Machine Learning

Abstract

High-throughput sequencing technologies, such as Next Generation Sequencing (NGS), are revolutionizing cancer research. Sequence datasets are highly complex, and mutations can occur randomly in DNA or RNA sequences, making cells sicker or less fit. Abnormal growth and behavior of genes in cells cause cancer, and cells carrying cancer driver genes grow when such mutations occur. Identifying cancer driver genes is a critical and challenging problem for researchers. In the proposed work, robust features are first extracted from the sequence dataset using the Position Relative Incidence Matrix (PRIM) combined with the Accumulative Absolute Position Incidence Vector (AAPIV). PRIM and AAPIV convert the one-dimensional sequence data into two-dimensional numeric data. Support Vector Machine (SVM), Artificial Neural Network (ANN), and Random Forest (RF) classifiers are trained on these features. The proposed model is validated with several methods: independent testing, k-fold cross-validation, self-consistency testing, and jackknife testing. It predicts whether a given primary structure corresponds to a cancer driver gene. The results show accuracies of 95%, 92%, and 69% for RF, ANN, and SVM, respectively. A comparative analysis against existing state-of-the-art models, namely 20/20+ and the Multimodal Deep Neural Network by integrating Multi-dimensional Data (MDNNMD), shows that the proposed model outperforms them.
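
To make the feature extraction step more concrete, the following is a minimal sketch of how PRIM and AAPIV features might be computed from a single sequence. It is not the authors' implementation. The abstract does not give the formulas, so the definitions used here are assumptions based on how these descriptors are commonly described: AAPIV sums the 1-based positions at which each symbol occurs, and PRIM entry (i, j) sums the offsets of symbol j measured from the first occurrence of symbol i. The 20-letter amino acid alphabet and the helper names `aapiv`, `prim`, and `feature_vector` are likewise hypothetical.

```python
from typing import Dict, List

# Assumption: sequences are protein primary structures over the 20 standard
# amino acids. The abstract does not state the alphabet explicitly.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def _symbol_index(alphabet: str) -> Dict[str, int]:
    """Map each symbol of the alphabet to its column index."""
    return {symbol: i for i, symbol in enumerate(alphabet)}


def aapiv(sequence: str, alphabet: str = AMINO_ACIDS) -> List[int]:
    """Assumed AAPIV: for each symbol, the sum of its 1-based positions."""
    index = _symbol_index(alphabet)
    vector = [0] * len(alphabet)
    for position, symbol in enumerate(sequence, start=1):
        if symbol in index:  # symbols outside the alphabet are skipped
            vector[index[symbol]] += position
    return vector


def prim(sequence: str, alphabet: str = AMINO_ACIDS) -> List[List[int]]:
    """Assumed PRIM: entry [i][j] sums the offsets of symbol j measured
    from the first occurrence of symbol i (later positions only)."""
    index = _symbol_index(alphabet)
    size = len(alphabet)

    # Record the first 1-based position of every symbol that appears.
    first_position: Dict[str, int] = {}
    for position, symbol in enumerate(sequence, start=1):
        if symbol in index and symbol not in first_position:
            first_position[symbol] = position

    matrix = [[0] * size for _ in range(size)]
    for anchor, anchor_position in first_position.items():
        row = matrix[index[anchor]]
        for position, symbol in enumerate(sequence, start=1):
            if symbol in index and position > anchor_position:
                row[index[symbol]] += position - anchor_position
    return matrix


def feature_vector(sequence: str, alphabet: str = AMINO_ACIDS) -> List[int]:
    """Flatten PRIM row by row and append AAPIV (hypothetical layout)."""
    flat_prim = [value for row in prim(sequence, alphabet) for value in row]
    return flat_prim + aapiv(sequence, alphabet)
```

For the toy sequence `ACA` over the two-letter alphabet `AC`, `aapiv` returns `[4, 2]` and `prim` returns `[[2, 1], [1, 0]]`. With the default alphabet, `feature_vector` yields 400 + 20 = 420 values per sequence under these assumptions; the column layout of the published CSV files may differ.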