Name: Bacteria-Species-Classification-ML-Framework
Creator: Li Wen Yow
License: https://creativecommons.org/licenses/by/4.0/
Keywords: Machine Learning

Abstract

This repository contains the code and documentation for a computational framework that leverages machine learning techniques to enable accurate classification of bacterial species, even closely related strains.

The framework integrates genomic analysis methods, such as motif screening and single nucleotide polymorphism (SNP) extraction, to derive informative features from bacterial genomes. These genomic insights are then fed into machine learning models, which are trained to reliably differentiate between bacterial species based on their distinctive patterns and characteristics.

Instructions:

Jupyter Notebooks (.ipynb)

Feature Preprocessing
Feature Engineering - Feature Transformation
Machine Learning - Embedded Method

Feature Preprocessing

The `Feature Preprocessing.ipynb` notebook focuses on the following tasks:

Appending all motifs features into an Excel file
Aligning the motifs (gap filling)
Removing features with high missing values
Eliminating non-informative features

Feature Engineering - Feature Transformation

The `Feature Engineering - Feature transformation.ipynb` notebook covers the following steps:

Tokenizing the sequences and encoding the label classes
Extracting the single nucleotide polymorphism (SNP) information for all feature columns
Creating a new dataframe that contains the strain label, class label, and SNPs information for machine learning training

Machine Learning - Embedded Method

The `Machine Learning_Embedded Method - Top 20.ipynb` notebook includes:

Feature scaling
Feature selection using an embedded method
Training the selected top 20 features set with Random Forest and SVM (both tuned and untuned)
Model evaluation and prediction
K-fold validation, learning curve analysis, t-SNE visualization of selected featurees, feature importance analysis

Data

The final_snp.xlsx file is the input data file that was generated during the feature engineering step. It contains the SNP motifs information that is used for the machine learning model training.

Usage

You can open and run the notebooks using Anaconda Jupyter Notebook. Follow these steps:

Open the notebooks in the following order:

Feature Preprocessing.ipynb
Feature Engineering - Feature transformation.ipynb
Machine Learning_Embedded Method - Top 20.ipynb

The final_snp.xlsx file is used as the input for the machine learning notebook.

Alternatively, you can use Google Colab to run the notebooks directly in the cloud. Simply upload the notebook files to your Google Drive and open them in Google Colab.

Dataset Files

Species Classification Code.zip (9.88 MB)
final_snp.xlsx (7.31 MB)

Datasets

Standard Dataset

Bacteria-Species-Classification-ML-Framework

Abstract

Jupyter Notebooks (.ipynb)

Data

Usage

Dataset Files

QUESTIONS?