Synthetic EEG Dataset for CNN-LSTM Training: Clean and Artifact-Contaminated Signals

Citation Author(s):
Marcin Jurczak (Warsaw University of Technology)
Marcin Kołodziej (Warsaw University of Technology)
Andrzej Majkowski (Warsaw University of Technology)
Submitted by:
Marcin Kolodziej
DOI:
10.21227/c4k4-zd80

Abstract

This dataset consists of synthetically generated EEG and EMG signals designed for training Convolutional Neural Networks (CNNs) in artifact detection and removal. The dataset includes both clean EEG signals and EEG signals contaminated with simulated EMG artifacts from various sources.

This dataset is useful for training and evaluating machine learning models aimed at artifact correction, signal denoising, and EEG preprocessing.

Instructions:

The dataset consists of data used for training and testing methods for removing muscle artifacts from EEG signals. Details on how the database was created can be found in the following publication:

Kołodziej, M.; Jurczak, M.; Majkowski, A.; Rysz, A.; Świderski, B. A Hybrid CNN-LSTM Approach for Muscle Artifact Removal from EEG Using Additional EMG Signal Recording. Appl. Sci. 2025, 15, 4953. https://doi.org/10.3390/app15094953

Description

This dataset consists of synthetically generated EEG and EMG signals designed for training Convolutional Neural Networks (CNNs) in artifact detection and removal. The dataset includes both clean EEG signals and EEG signals contaminated with simulated EMG artifacts from various sources.

The signals are structured as 80,000 examples, each representing 1 second of data sampled at 256 Hz. The dataset is stored in two files:

  • X.mat – Contains EEG signals with artifacts and corresponding EMG artifact sources.
  • y.mat – Contains the clean EEG signals (artifact-free).
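
The two files can be read in Python roughly as follows. This is a minimal sketch, not the authors' loading code: the variable names stored inside the files (`x_key="X"`, `y_key="y"`) are assumptions, as are the helper names `check_shapes` and `load_dataset`. It also assumes the files are in a format that `scipy.io.loadmat` can read; if they were saved in MATLAB v7.3 format, `loadmat` will refuse them and an HDF5 reader such as h5py would be needed instead.

```python
from pathlib import Path

N_EXAMPLES = 80_000   # examples stated in the dataset description
N_SAMPLES = 256       # samples per example
N_CHANNELS = 6        # 1 contaminated EEG channel + 5 EMG artifact channels
FS_HZ = 256           # sampling rate, so each example lasts N_SAMPLES / FS_HZ = 1 s


def check_shapes(x_shape, y_shape):
    """Raise ValueError if the shapes do not match the documented layout."""
    if tuple(x_shape) != (N_EXAMPLES, N_SAMPLES, N_CHANNELS):
        raise ValueError(f"unexpected X shape: {tuple(x_shape)}")
    if tuple(y_shape) != (N_EXAMPLES, N_SAMPLES):
        raise ValueError(f"unexpected y shape: {tuple(y_shape)}")
    return True


def load_dataset(folder=".", x_key="X", y_key="y"):
    """Load X.mat and y.mat from `folder`.

    x_key and y_key are ASSUMED variable names inside the .mat files;
    inspect the dictionary returned by loadmat if they differ.
    """
    from scipy.io import loadmat  # imported lazily; requires SciPy

    folder = Path(folder)
    X = loadmat(str(folder / "X.mat"))[x_key]
    y = loadmat(str(folder / "y.mat"))[y_key]
    check_shapes(X.shape, y.shape)
    return X, y
```

At 80,000 examples of 1 second each, the full set covers 80,000 seconds of signal (about 22 hours), so loading `X` as 64-bit floats needs close to 1 GB of memory.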

Data Structure

X (dimensions: 80000 × 256 × 6)

  • Dimension 1 (80000) – Number of signal examples (training samples).
  • Dimension 2 (256) – Number of samples per signal, corresponding to 1 second of recording at 256 Hz.
  • Dimension 3 (6) – Number of signal channels:
    • Channel 1: EEG signal contaminated with artifacts.
    • Channel 2: Simulated EMG artifact from the Fp1 electrode.
    • Channel 3: Simulated EMG artifact from the HEOG electrode.
    • Channel 4: Simulated EMG artifact from the Nape electrode.
    • Channel 5: Simulated EMG artifact from the Cheek electrode.
    • Channel 6: Simulated EMG artifact from the Jaw electrode.
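
The channel layout above can be unpacked as in the sketch below. The labels in `CHANNEL_NAMES` and the helper `split_channels` are illustrative names, not part of the dataset. Note that the list above numbers channels from 1, while NumPy indexes from 0. A small random array stands in for the real `X` so the snippet runs without the data files.

```python
import numpy as np

# Zero-based positions along the third axis of X (the list above counts from 1).
CHANNEL_NAMES = (
    "EEG_contaminated",  # Channel 1
    "EMG_Fp1",           # Channel 2
    "EMG_HEOG",          # Channel 3
    "EMG_Nape",          # Channel 4
    "EMG_Cheek",         # Channel 5
    "EMG_Jaw",           # Channel 6
)


def split_channels(X):
    """Return (contaminated EEG, dict of the five EMG artifact channels)."""
    if X.ndim != 3 or X.shape[2] != len(CHANNEL_NAMES):
        raise ValueError(f"expected shape (n, samples, 6), got {X.shape}")
    eeg = X[:, :, 0]
    emg = {name: X[:, :, i] for i, name in enumerate(CHANNEL_NAMES) if i > 0}
    return eeg, emg


# Random stand-in with the documented layout (4 examples instead of 80,000).
rng = np.random.default_rng(0)
X_demo = rng.standard_normal((4, 256, 6))
eeg_demo, emg_demo = split_channels(X_demo)
print(eeg_demo.shape, sorted(emg_demo))
```

The slices are views into `X`, so splitting the full array this way does not copy the data.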

y (dimensions: 80000 × 256)

  • Dimension 1 (80000) – Number of signal examples, same as in X.
  • Dimension 2 (256) – Number of samples per signal.
  • y contains the corresponding clean EEG signal (artifact-free) for each of the 80,000 examples.
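
Because each row of y is the clean counterpart of the contaminated EEG in channel 1 of X, the difference between the two gives the artifact component of an example, and a per-example signal-to-noise ratio can be computed from it. The sketch below is one way to do this; the function name `snr_db` is hypothetical, and the SNR definition used here (clean power over residual power, in dB) is a common convention rather than something the dataset page specifies. Synthetic sine waves stand in for the real signals.

```python
import numpy as np


def snr_db(clean, noisy):
    """Per-example SNR in dB, treating (noisy - clean) as the artifact."""
    clean = np.asarray(clean, dtype=float)
    noisy = np.asarray(noisy, dtype=float)
    if clean.shape != noisy.shape:
        raise ValueError("clean and noisy must have the same shape")
    signal_power = np.sum(clean ** 2, axis=-1)
    noise_power = np.sum((noisy - clean) ** 2, axis=-1)
    return 10.0 * np.log10(signal_power / noise_power)


# Stand-in data: a 10 Hz sine as "clean EEG" plus a 60 Hz artifact
# with one tenth of its amplitude, sampled for 1 s at 256 Hz.
t = np.arange(256) / 256.0
y_demo = np.sin(2 * np.pi * 10 * t)[None, :]        # shape (1, 256)
x_demo = y_demo + 0.1 * np.sin(2 * np.pi * 60 * t)  # shape (1, 256)
snr_demo = snr_db(y_demo, x_demo)
print(snr_demo)  # approximately 20 dB (power ratio of 100)
```

With the real arrays, the equivalent call would be `snr_db(y, X[:, :, 0])`. An example with no contamination at all has zero residual power, in which case the result is infinite.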

Please cite: 

Kołodziej, M.; Jurczak, M.; Majkowski, A.; Rysz, A.; Świderski, B. A Hybrid CNN-LSTM Approach for Muscle Artifact Removal from EEG Using Additional EMG Signal Recording. Appl. Sci. 2025, 15, 4953. https://doi.org/10.3390/app15094953

Funding Agency
Warsaw University of Technology
Grant Number
Research was funded by Warsaw University of Technology within the Excellence Initiative: Research University (IDUB) program.