IIST BCI Dataset-2 for Selected Common Marathi Words

Citation Author(s): Shubham Tayade (Indian Institute of Space Science and Technology Thiruvananthapuram)

Parvathy S S (A J College of Science and Technology, Thonnakkal)

Nancy Sunil (A J College of Science and Technology, Thonnakkal)

Charu Chauhan (Indian Institute of Space Science and Technology Thiruvananthapuram)

Sumitra S (Indian Institute of Space Science and Technology Thiruvananthapuram)

Manoj B S (Indian Institute of Space Science and Technology Thiruvananthapuram)
Submitted by: Shubham Tayade
Last updated: Mon, 03/18/2024 - 13:26
DOI: 10.36227/techrxiv.171043118.80448751/v1
Data Format: *.avi; *.csv; *.txt; *.zip

Keywords: Brain Signals; brain-computer interfaces; EEG classification; OpenBCI

Abstract

Brain-Computer Interface (BCI) based solutions can help address the problems faced by patients with neurodegenerative disorders. Developing them requires datasets in the languages the patients speak. Marathi, for example, is a prominent language spoken by over 83 million people in India, yet it lacks BCI datasets for research. To fill this gap, we have created a dataset of electroencephalography (EEG) signal samples for selected common Marathi words. The EEG samples were captured with the OpenBCI Cyton device while volunteers pronounced commonly used Marathi words. The dataset supports building BCI solutions that use Machine Learning (ML) classifiers and Deep Learning methods to translate EEG signals into Marathi words.

Instructions:

The dataset includes files produced by the OpenBCI Cyton Biosensing board.

A. Raw Dataset
The raw dataset consists of text (.txt) files. Each EEG sample is stored as one file of comma-separated values arranged in rows and columns.
Column 1 - sample index
Columns 2 to 9 - EEG recordings from the eight selected channels
Columns 10 to 22 and 24 - data that is not used
Column 23 - time in a raw, unprocessed format
Column 25 - timestamp in Year-Month-Day Hour:Minute:Second format
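
As an illustration of the column layout above, here is a minimal Python sketch that reads raw rows into records. It is not the authors' script. The function name `parse_raw_eeg` and the demo values are made up for this example, and the sketch assumes that any line with fewer than 25 fields or with non-numeric values in columns 1 to 9 is a header line to be skipped.

```python
import csv
import io

def parse_raw_eeg(lines):
    """Parse raw rows into records, following the 25-column layout described above."""
    records = []
    for row in csv.reader(lines):
        row = [field.strip() for field in row]
        if len(row) < 25:
            continue  # assumed to be a header or incomplete line
        try:
            sample_index = float(row[0])              # column 1
            channels = [float(v) for v in row[1:9]]   # columns 2 to 9
        except ValueError:
            continue  # non-numeric row, e.g. a row of column titles
        records.append({
            "sample_index": sample_index,
            "channels": channels,
            "raw_time": row[22],    # column 23, kept as text
            "timestamp": row[24],   # column 25, kept as text
        })
    return records

# Synthetic demo input (made-up values, not taken from the dataset).
title_row = ["col%d" % i for i in range(1, 26)]
data_row = (["0"] + [str(float(i)) for i in range(1, 9)] + ["0"] * 13
            + ["1710000000.5", "0", "2024-03-01 10:00:00"])
sample_text = ("% made-up header line\n"
               + ",".join(title_row) + "\n"
               + ",".join(data_row) + "\n")

records = parse_raw_eeg(io.StringIO(sample_text))
print(len(records), records[0]["channels"], records[0]["timestamp"])
```

For a real file, an open file handle can be passed in place of the `io.StringIO` object.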

B. Processed Dataset
The processed dataset consists of .csv files that keep only the columns EEG Channel 0 to EEG Channel 7.
The header lines and unnecessary columns were removed using the provided Python script.
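
The provided script is the reference for this step. Purely as an illustration of the idea, the sketch below shows one way such a conversion could look. The function name `raw_to_processed_csv` is hypothetical, the output header names follow the "EEG Channel 0" to "EEG Channel 7" wording above, and header lines are assumed to be those with fewer than nine fields or with non-numeric values in columns 2 to 9.

```python
import csv
import io

# Output column names, following the wording used in this description.
CHANNEL_HEADERS = ["EEG Channel %d" % i for i in range(8)]

def raw_to_processed_csv(raw_lines, out_file):
    """Write only the eight EEG channel columns (raw columns 2 to 9) as CSV."""
    writer = csv.writer(out_file)
    writer.writerow(CHANNEL_HEADERS)
    kept = 0
    for row in csv.reader(raw_lines):
        if len(row) < 9:
            continue  # assumed header or incomplete line
        try:
            values = [float(field) for field in row[1:9]]
        except ValueError:
            continue  # non-numeric row, e.g. a row of column titles
        writer.writerow(values)
        kept += 1
    return kept

# Synthetic demo input (made-up values, not taken from the dataset).
raw_text = ("% made-up header line\n"
            + ",".join("col%d" % i for i in range(1, 26)) + "\n"
            + ",".join(["0"] + [str(float(i)) for i in range(1, 9)] + ["0"] * 16) + "\n")

out = io.StringIO()
kept = raw_to_processed_csv(io.StringIO(raw_text), out)
processed_lines = out.getvalue().splitlines()
print(kept, processed_lines)
```

When writing to a real file, open it with `newline=""` so that the `csv` module controls the line endings.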

Note:
In the folder, the first .txt file corresponds to the first Marathi word, the second .txt file to the second Marathi word, and so on. The list of Marathi words is provided.
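
The file-to-word pairing in the note can be sketched in Python as follows. This is an assumption-laden illustration: the file names and placeholder words are hypothetical, and the sketch assumes that "first", "second", and so on refer to natural (numeric-aware) ordering of the file names, which should be checked against the actual folder contents.

```python
import re

def natural_key(name):
    """Sort key that orders embedded numbers numerically, so '2.txt' precedes '10.txt'."""
    return [int(part) if part.isdigit() else part.lower()
            for part in re.split(r"(\d+)", name)]

def pair_files_with_words(file_names, words):
    """Pair the n-th file (in natural order) with the n-th word in the list."""
    ordered = sorted(file_names, key=natural_key)
    if len(ordered) != len(words):
        raise ValueError("number of files and number of words differ")
    return list(zip(ordered, words))

# Hypothetical file names and placeholder words, for illustration only.
files = ["10.txt", "1.txt", "2.txt"]
words = ["word_1", "word_2", "word_3"]

pairs = pair_files_with_words(files, words)
print(pairs)
```

A plain alphabetical sort would place "10.txt" before "2.txt", which is why the sketch uses a numeric-aware key.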