Name: DNA sequence alignment datasets based on NW algorithm
Creator: Amr Rashed
License: https://creativecommons.org/licenses/by/4.0/
Keywords: Machine Learning, Biomedical and Health Sciences

Abstract

This study presents six datasets for DNA/RNA sequence alignment based on one of the most common alignment algorithms, the Needleman–Wunsch (NW) algorithm. It proposes a fast, parallel implementation of the NW algorithm using machine learning techniques, and it extends and improves on our previous work. The current implementation achieves 99.7% accuracy using a multilayer perceptron with the Adam optimizer, and up to 2912 giga cell updates per second (GCUPS) on two real DNA sequences with a length of 4.1 M nucleotides. By using a divide-and-conquer strategy, the implementation remains valid for extremely long sequences.
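
For readers unfamiliar with the algorithm the datasets are built on, the following is a minimal sketch of the classic (sequential) NW dynamic-programming recurrence, together with a helper for the "giga cell updates per second" metric. It is not the machine-learning implementation described above. The function names and the scoring values (match = 1, mismatch = -1, gap = -1) are assumptions chosen for illustration and may differ from those used to generate the datasets.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Return the optimal global alignment score of sequences a and b.

    The scoring parameters are illustrative assumptions, not values
    taken from the datasets.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against an empty sequence costs one gap per symbol.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    # Each cell depends on its diagonal, upper, and left neighbours.
    for i in range(1, rows):
        for j in range(1, cols):
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + substitution,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,               # gap in b
                score[i][j - 1] + gap,               # gap in a
            )
    return score[-1][-1]


def gcups(len_a, len_b, seconds):
    """Giga cell updates per second: matrix cells filled per second / 1e9."""
    return len_a * len_b / seconds / 1e9
```

Each call to `needleman_wunsch_score` fills `len(a) * len(b)` interior cells, which is the quantity that `gcups` divides by the elapsed time. Because every cell depends on three neighbours, the sequential version scales quadratically with sequence length, which is the cost a parallel implementation aims to reduce.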

Instructions:

These datasets are described in a manuscript submitted to IEEE OPEN ACCESS entitled “Parallel Implementation of the Needleman–Wunsch Algorithm Using Machine Learning Algorithms”.

Comments

Good

Submitted by Muhammad Khan on Fri, 05/21/2021 - 06:44

Thank you

Submitted by Amr Rashed on Tue, 09/07/2021 - 05:37

Dataset Files

ALLdataset.zip (7.73 MB): Dataset 1 is named csvlist.txt, and so on; Dataset 3T is named csv3testdata.csv and Dataset 6T is named csv6testdata.csv.
RESULTS.zip (1.23 MB): selected results and Orange reports.
matrix5.mat (75.42 kB): the original dataset.
indexxz.m (770 bytes): creates indexed output from matrix5.mat, suited for ML use.
MLPTEST.py (4.65 kB): tests our best model (MLP network).
phdMLPTEST2 (2).py (4.39 kB): another test of our best model (MLP network).
phdnew.py (7.37 kB): checks all ML models (train/test).
phdnewcv.py (7.46 kB): checks all ML models (cross-validation).
phdnewcv2.py (7.97 kB): updated check of all ML models (cross-validation).
