IIST BCI Dataset-8 for Selected Common Telugu Words of Male and Female Speakers

Citation Author(s): Likhith Boddapu (Indian Institute of Space Science and Technology)

Chittaloori Likhitha (Chhattisgarh Swami Vivekanand Technical University)

Parvathy S S (A J College of Science and Technology)

Nancy Sunil (A J College of Science and Technology)

S Sumitra (Indian Institute of Space Science and Technology)

B.S. Manoj (Indian Institute of Space Science and Technology)
Submitted by: Likhith Boddapu
Last updated: Sat, 05/03/2025 - 11:08
DOI: 10.21227/1xfr-y802
Data Format: *.avi; *.csv; *.txt; *.zip

Keywords:

Brain Signals

brain-computer interfaces

EEG classification

OpenBCI

Abstract

Brain-Computer Interface (BCI) technology facilitates a direct connection between the brain and external devices by interpreting neural signals. Datasets in patients' native languages are critical for developing BCI-based solutions for neurological disorders. However, current BCI research lacks appropriate language-specific datasets, particularly for languages such as Telugu, which is spoken by more than 90 million people in India. We created an electroencephalography (EEG)-based BCI dataset containing EEG signal samples corresponding to widely spoken Telugu words for both female and male speakers. The dataset was developed using the OpenBCI Cyton device, which recorded EEG data from two Telugu-speaking participants. The dataset is divided into four parts:
1. Vocalized Telugu words.
2. English translations of Telugu words.
3. Subvocalization of Telugu words.
4. Subvocalization of English words.
The dataset includes 100 different words, each recorded over ten trials for one male and one female speaker. By training Machine Learning (ML) and Deep Learning (DL) models on this dataset, a BCI system can be created that translates EEG signals into both vocal and subvocal forms for the Telugu and English languages.

Instructions:

The dataset comprises EEG samples collected from both male and female volunteers. These samples are stored in text files as comma-separated values (CSV).
Each row in the dataset represents a separate EEG sample, with the following structure:
Column 1: Sample Index - This column contains a unique identifier for each sample.
Columns 2-9: EEG Records - These columns include data from eight distinct EEG channels, capturing electrical activity from different parts of the brain.
Columns 10-22 and 24: Additional Data - These columns may contain supplementary information, with varying degrees of importance depending on the specific use case.
Column 23: Unprocessed Time Data - This column generally contains time-related data in an unprocessed form, which may need to be formatted or adjusted for analysis.
Column 25: Timestamps - This column provides accurate temporal information for each sample, formatted as "Year-Month-Day Hour:Minute."
These timestamps are crucial for synchronizing the EEG data with external events or other measurements.
In addition to the EEG data, the text documents may include metadata or other supplemental information that can provide context or additional insights into the dataset. This metadata could include demographic information about the volunteers, such as gender, age, or other relevant details, which might be useful for further analysis or interpretation.
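The column layout described above can be read with a few lines of Python. The following is a minimal sketch rather than an official loader: it assumes each data row has at least 25 comma-separated fields, that metadata lines begin with "%" (as is common in OpenBCI recordings, but not confirmed for these files), and that any textual header row can be skipped because its first field is not numeric. The function name `parse_eeg_text` and the dictionary keys are illustrative only.

```python
import csv
import io
from typing import Dict, List

# 0-based positions derived from the 1-based column description above.
EEG_COLUMNS = slice(1, 9)   # columns 2-9: the eight EEG channels
RAW_TIME_COLUMN = 22        # column 23: unprocessed time data
TIMESTAMP_COLUMN = 24       # column 25: formatted timestamp


def parse_eeg_text(text: str) -> List[Dict[str, object]]:
    """Parse comma-separated EEG rows, skipping metadata and header lines."""
    samples: List[Dict[str, object]] = []
    for row in csv.reader(io.StringIO(text)):
        row = [field.strip() for field in row]
        # Assumption: metadata lines start with '%'; blank or short rows
        # are also skipped because they cannot hold all 25 columns.
        if len(row) < 25 or row[0].startswith("%"):
            continue
        try:
            index = float(row[0])
            channels = [float(value) for value in row[EEG_COLUMNS]]
        except ValueError:
            # Non-numeric fields: most likely a textual header row.
            continue
        samples.append({
            "index": index,
            "channels": channels,
            "raw_time": row[RAW_TIME_COLUMN],    # left as text, unprocessed
            "timestamp": row[TIMESTAMP_COLUMN],  # left as text
        })
    return samples
```

To use it on a recording, read the file contents with `open(path, encoding="utf-8").read()` and pass the resulting string to `parse_eeg_text`. The time columns are deliberately left as strings, since their exact format should be checked against the actual files before conversion.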