Features Vectors of IDriveGenes: Cancer Driver Genes Dataset

Citation Author(s):
Yasir Ali
Department of Computer Science, Sir Syed Case Institute of Technology, Islamabad, Pakistan
Muhammad Sardaraz
Department of Computer Science, COMSATS University Islamabad, Attock Campus, Attock, Pakistan
Muhammad Tahir
Department of Computer Science, COMSATS University Islamabad, Attock Campus, Attock, Pakistan
Hela Elmannai
Department of Information Technology, College of Computer and Information Sciences, Princess Nourah bint Abdulrahman University, P.O. Box 84428, Riyadh, Saudi Arabia
Monia Hamdi
Department of Information Technology, College of Computer and Information Sciences, Princess Nourah bint Abdulrahman University, P.O. Box 84428, Riyadh, Saudi Arabia
Amel Ksibi
Department of Information Systems, College of Computer and Information Sciences, Princess Nourah bint Abdulrahman University, P.O. Box 84428, Riyadh, Saudi Arabia
Submitted by:
Muhammad Tahir
Last updated:
Wed, 03/22/2023 - 05:49
DOI:
10.21227/c1d0-2856

Abstract 

The development of high-throughput sequencing technologies, i.e., Next Generation Sequencing (NGS), is revolutionizing the exploration of cancer. Sequence datasets are highly complex, and mutations can occur randomly in DNA or RNA sequences, making cells sicker or less fit. The unusual growth and behavior of genes in cells cause cancer, and cells carrying cancer driver genes grow when mutations occur. Identification of cancer driver genes is a critical and challenging issue for researchers. In the proposed work, robust features are first extracted from the sequence dataset through the Position Relative Incidence Matrix (PRIM) integrated with Accumulative Absolute Position Incidence Vector (AAPIV) generation. PRIM and AAPIV convert the one-dimensional sequence data into two-dimensional numeric data. Support Vector Machine (SVM), Artificial Neural Network (ANN), and Random Forest (RF) classifiers are used to train the model. The proposed model is validated with several methods, namely independent testing, k-fold cross-validation, self-consistency, and jackknife testing. The proposed model predicts whether a given primary structure corresponds to a cancer driver gene. The results show 95%, 92%, and 69% accuracy for RF, ANN, and SVM, respectively. A comparative analysis with existing state-of-the-art models, namely 20/20+ and the Multimodal Deep Neural Network by integrating Multi-dimensional Data (MDNNMD), shows that the proposed model outperforms the existing techniques.
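
As an illustration of the feature extraction described in the abstract, the following is a minimal sketch of how AAPIV and a PRIM-like matrix could be computed from a raw sequence. It is not the authors' implementation. The four-letter nucleotide alphabet, the function names `aapiv` and `prim`, and the exact PRIM convention (positions counted from the first occurrence of the row symbol, which itself counts as position 1) are assumptions made for this example; the article should be consulted for the definitions actually used.

```python
from typing import List

# Assumption: a nucleotide alphabet. The article may use a different one,
# for example the 20 amino acids of a protein primary structure.
ALPHABET = "ACGT"


def aapiv(sequence: str, alphabet: str = ALPHABET) -> List[int]:
    """Sum the 1-based positions at which each alphabet symbol occurs."""
    index = {symbol: i for i, symbol in enumerate(alphabet)}
    vector = [0] * len(alphabet)
    for position, symbol in enumerate(sequence.upper(), start=1):
        if symbol in index:  # symbols outside the alphabet are skipped
            vector[index[symbol]] += position
    return vector


def prim(sequence: str, alphabet: str = ALPHABET) -> List[List[int]]:
    """Simplified PRIM (assumed convention): entry [i][j] sums the positions
    of symbol j counted from the first occurrence of symbol i."""
    seq = sequence.upper()
    n = len(alphabet)
    matrix = [[0] * n for _ in range(n)]
    for i, anchor in enumerate(alphabet):
        first = seq.find(anchor)
        if first == -1:
            continue  # anchor symbol never occurs; its row stays zero
        for offset, symbol in enumerate(seq[first:], start=1):
            j = alphabet.find(symbol)
            if j != -1:
                matrix[i][j] += offset
    return matrix


def feature_vector(sequence: str, alphabet: str = ALPHABET) -> List[int]:
    """Flatten the PRIM rows and append the AAPIV to get one numeric vector."""
    flat = [value for row in prim(sequence, alphabet) for value in row]
    return flat + aapiv(sequence, alphabet)
```

For the toy sequence `"ACGTA"`, `aapiv` returns `[6, 2, 3, 4]`, because `A` occurs at positions 1 and 5 while `C`, `G`, and `T` occur once each at positions 2, 3, and 4. With a four-symbol alphabet, `feature_vector` yields 20 values (16 from the matrix plus 4 from the vector), a fixed-length numeric input of the kind a classifier such as RF, ANN, or SVM expects.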

Instructions: 

The supplementary file contains the feature vectors generated after preprocessing.

The details are discussed in the article.

 

Funding Agency: 
Princess Nourah bint Abdulrahman University, Riyadh, Saudi Arabia
Grant Number: 
PNURSP2023R125