Standard Dataset
DNA sequence alignment datasets based on NW algorithm
- Submitted by: Amr Rashed
- Last updated: Tue, 05/17/2022 - 22:18
- DOI: 10.21227/45dr-8p86
Abstract
This study presents six datasets for DNA/RNA sequence alignment based on one of the most common alignment algorithms, the Needleman–Wunsch (NW) algorithm. It proposes a fast, parallel implementation of the NW algorithm that uses machine learning techniques, and it extends and improves on our previous work. The current implementation achieves 99.7% accuracy using a multilayer perceptron with the ADAM optimizer, and up to 2912 giga cell updates per second on two real DNA sequences with a length of 4.1 M nucleotides. By using a divide-and-conquer strategy, the implementation remains valid for extremely long sequences.
These datasets are described in a manuscript submitted to IEEE OPEN ACCESS, entitled “Parallel Implementation of the Needleman–Wunsch Algorithm Using Machine Learning Algorithms”.
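For readers unfamiliar with the algorithm the datasets are built around, the following is a minimal sketch of the classic sequential Needleman–Wunsch dynamic program in Python. It is not the parallel, machine-learning-based implementation described in the abstract. The function name `needleman_wunsch` and the scoring values (match = 1, mismatch = -1, gap = -1) are assumptions chosen for illustration and are not taken from the dataset.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align strings a and b; return (score, aligned_a, aligned_b).

    Scoring values are illustrative assumptions, not the dataset's settings.
    """
    rows, cols = len(a) + 1, len(b) + 1

    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap          # a[:i] aligned against gaps only
    for j in range(1, cols):
        score[0][j] = j * gap          # b[:j] aligned against gaps only

    # Fill the matrix: each cell depends on its diagonal, upper, and left neighbours.
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,   # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,       # gap in b
                score[i][j - 1] + gap,       # gap in a
            )

    # Trace back from the bottom-right cell to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                out_a.append(a[i - 1])
                out_b.append(b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1

    return score[len(a)][len(b)], "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    s, x, y = needleman_wunsch("GATTACA", "GCATGCU")
    print(s)
    print(x)
    print(y)
```

Each inner-loop iteration fills one matrix cell, which is the unit counted by the "cell updates per second" figure in the abstract. This sketch takes time and memory proportional to the product of the two sequence lengths, so it is only practical for short sequences.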
Dataset Files
- ALLdataset.zip (7.73 MB): all datasets. Dataset 1 is titled csvlist.txt, and so on; Dataset 3T is called csv3testdata.csv and Dataset 6T is called csv6testdata.csv.
- RESULTS.zip (1.23 MB): some results and Orange reports.
- matrix5.mat (75.42 kB): original dataset.
- indexxz.m (770 bytes): creates indexed output from matrix5.mat, suited for ML use.
- MLPTEST.py (4.65 kB): tests our best model (MLP network).
- phdMLPTEST2 (2).py (4.39 kB): another test of our best model (MLP network).
- phdnew.py (7.37 kB): checks all ML models (train/test).
- phdnewcv.py (7.46 kB): checks all ML models (cross validation).
- phdnewcv2.py (7.97 kB): updated check of all ML models (cross validation).