Bacteria-Species-Classification-ML-Framework

Citation Author(s):
Li Wen Yow
Submitted by:
Li Wen Yow
Last updated:
Tue, 03/11/2025 - 02:30
DOI:
10.21227/qyx9-gr39

Abstract 

This repository contains the code and documentation for a computational framework that uses machine learning to accurately classify bacterial species, including closely related strains.

The framework integrates genomic analysis methods, such as motif screening and single nucleotide polymorphism (SNP) extraction, to derive informative features from bacterial genomes. These genomic insights are then fed into machine learning models, which are trained to reliably differentiate between bacterial species based on their distinctive patterns and characteristics.

Instructions: 

 

Jupyter Notebooks (.ipynb)

  1. Feature Preprocessing
  2. Feature Engineering - Feature Transformation
  3. Machine Learning - Embedded Method

 

Feature Preprocessing  

The `Feature Preprocessing.ipynb` notebook focuses on the following tasks:  

    • Appending all motif features into an Excel file  
    • Aligning the motifs (gap filling)  
    • Removing features with a high proportion of missing values  
    • Eliminating non-informative features 
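
The last two filtering steps can be sketched as follows. This is a minimal illustration with pandas, not the notebook's actual code: the function name `drop_sparse_and_constant`, the 50% missing-value threshold, the column names, and the rule that a "non-informative" feature is one with a single distinct value are all assumptions made for the example.

```python
import pandas as pd

def drop_sparse_and_constant(df, max_missing=0.5):
    """Drop columns with too many missing values, then columns that never vary."""
    # Fraction of missing entries in each column
    missing_fraction = df.isna().mean()
    kept = df.loc[:, missing_fraction <= max_missing]
    # A column with at most one distinct non-missing value cannot separate classes
    informative = kept.nunique(dropna=True) > 1
    return kept.loc[:, informative]

# Toy motif table: rows are strains, columns are motif sequences (hypothetical names)
motifs = pd.DataFrame({
    "motif_a": ["ACGT", "ACGA", "ACGT", None],
    "motif_b": [None, None, None, "TTGA"],         # 75% missing, dropped
    "motif_c": ["GGCC", "GGCC", "GGCC", "GGCC"],   # constant, dropped
})
cleaned = drop_sparse_and_constant(motifs, max_missing=0.5)
print(list(cleaned.columns))  # ['motif_a']
```

The threshold is a tuning choice; a stricter value keeps fewer but more complete motif columns.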

 

Feature Engineering - Feature Transformation  

The `Feature Engineering - Feature transformation.ipynb` notebook covers the following steps:  

    • Tokenizing the sequences and encoding the label classes  
    • Extracting single nucleotide polymorphism (SNP) information for all feature columns  
    • Creating a new dataframe that contains the strain label, class label, and SNP information for machine learning training  
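
A rough sketch of these three steps is shown below. It assumes that the sequences in each feature column are already aligned to equal length, that a SNP is any alignment position where strains differ, and that characters are tokenized with the hypothetical integer map `TOKEN`. The column names and the helper `extract_snps` are illustrative and are not taken from the notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical integer codes for tokenized characters; "-" marks an alignment gap
TOKEN = {"-": 0, "A": 1, "C": 2, "G": 3, "T": 4}

def extract_snps(df, feature_cols):
    """Return one integer-coded column per alignment position that varies across strains."""
    snps = {}
    for col in feature_cols:
        seqs = df[col].tolist()
        for pos in range(len(seqs[0])):      # assumes equal-length aligned sequences
            bases = [s[pos] for s in seqs]
            if len(set(bases)) > 1:          # variable position, treated as a SNP
                snps[f"{col}_pos{pos}"] = [TOKEN[b] for b in bases]
    return pd.DataFrame(snps, index=df.index)

# Toy aligned input (hypothetical strains and classes)
aligned = pd.DataFrame({
    "strain": ["s1", "s2", "s3"],
    "species": ["species_x", "species_x", "species_y"],
    "motif_a": ["ACGT", "ACGA", "ATGA"],
})

final = extract_snps(aligned, ["motif_a"])
# Encode class names as integers and place the label columns first
final.insert(0, "class_label", LabelEncoder().fit_transform(aligned["species"]))
final.insert(0, "strain", aligned["strain"])
print(list(final.columns))
# ['strain', 'class_label', 'motif_a_pos1', 'motif_a_pos3']
```

Only positions 1 and 3 of `motif_a` differ between the toy strains, so only those two become SNP columns.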

 

Machine Learning - Embedded Method  

The `Machine Learning_Embedded Method - Top 20.ipynb` notebook includes:

    • Feature scaling
    • Feature selection using an embedded method  
    • Training Random Forest and SVM models (both tuned and untuned) on the selected top 20 feature set  
    • Model evaluation and prediction  
    • K-fold cross-validation, learning curve analysis, t-SNE visualization of the selected features, and feature importance analysis
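
One way to wire scaling, embedded selection, and k-fold evaluation together with scikit-learn is sketched below. It runs on synthetic data from `make_classification` rather than the real SNP table, uses Random Forest importances as the embedded selector (an assumption; the notebook may use a different estimator), and trains untuned models only. The helper name `top20_pipeline` is hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the SNP feature matrix and class labels
X, y = make_classification(n_samples=200, n_features=60, n_informative=10,
                           n_classes=3, random_state=0)

def top20_pipeline(classifier):
    """Scale features, keep the 20 with the highest forest importance, then classify."""
    selector = SelectFromModel(
        RandomForestClassifier(n_estimators=100, random_state=0),
        max_features=20,
        threshold=-np.inf,  # select purely by rank, so exactly 20 features are kept
    )
    return make_pipeline(StandardScaler(), selector, classifier)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
models = {"random_forest": RandomForestClassifier(random_state=0), "svm": SVC()}
scores = {}
for name, clf in models.items():
    # Selection is refit inside each fold, so held-out data never informs it
    scores[name] = cross_val_score(top20_pipeline(clf), X, y, cv=cv)
    print(f"{name}: mean accuracy {scores[name].mean():.3f}")
```

Keeping the selector inside the pipeline avoids leaking information from the validation folds into the feature ranking. Tuned variants could wrap the same pipeline in `GridSearchCV`.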

 

Data

The `final_snp.xlsx` file is the input data file generated during the feature engineering step. It contains the SNP information for the motifs, which is used to train the machine learning models.
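
Loading this file and separating labels from features might look like the sketch below. The column names `strain` and `class_label` are hypothetical and should be replaced with the headers actually present in the spreadsheet. The demonstration runs on a small in-memory table so that it works without the file.

```python
import pandas as pd

def load_snp_table(path="final_snp.xlsx"):
    """Read the SNP table; pandas needs the openpyxl package to read .xlsx files."""
    return pd.read_excel(path)

def split_snp_table(table, strain_col="strain", class_col="class_label"):
    """Separate the identifier and label columns from the SNP feature matrix."""
    features = table.drop(columns=[strain_col, class_col])
    labels = table[class_col]
    return features, labels

# Toy stand-in with the assumed layout: strain label, class label, then SNP columns
toy = pd.DataFrame({
    "strain": ["s1", "s2", "s3"],
    "class_label": [0, 0, 1],
    "motif_a_pos1": [2, 2, 4],
    "motif_a_pos3": [4, 1, 1],
})
X, y = split_snp_table(toy)
print(X.shape, y.tolist())  # (3, 2) [0, 0, 1]
```

With the real file, `split_snp_table(load_snp_table())` would return the feature matrix and labels, provided the column names match.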

 

Usage

You can open and run the notebooks with Jupyter Notebook (for example, through Anaconda). Run them in the following order:

  1. Feature Preprocessing.ipynb
  2. Feature Engineering - Feature transformation.ipynb
  3. Machine Learning_Embedded Method - Top 20.ipynb

The final_snp.xlsx file is used as the input for the machine learning notebook.

Alternatively, you can use Google Colab to run the notebooks directly in the cloud. Simply upload the notebook files to your Google Drive and open them in Google Colab.