Synthetic EEG Dataset for CNN-LSTM Training: Clean and Artifact-Contaminated Signals

- Submitted by: Marcin Kolodziej
- DOI: 10.21227/c4k4-zd80
Abstract
This dataset consists of synthetically generated EEG and EMG signals designed for training Convolutional Neural Networks (CNNs) in artifact detection and removal. The dataset includes both clean EEG signals and EEG signals contaminated with simulated EMG artifacts from various sources.
This dataset is useful for training and evaluating machine learning models aimed at artifact correction, signal denoising, and EEG preprocessing.
Instructions:
The dataset consists of data used for training and testing methods for removing muscle artifacts from EEG signals. Details on how the database was created can be found in the following publication:
Kołodziej, M.; Jurczak, M.; Majkowski, A.; Rysz, A.; Świderski, B. A Hybrid CNN-LSTM Approach for Muscle Artifact Removal from EEG Using Additional EMG Signal Recording. Appl. Sci. 2025, 15, 4953. https://doi.org/10.3390/app15094953
Description
This dataset consists of synthetically generated EEG and EMG signals designed for training Convolutional Neural Networks (CNNs) in artifact detection and removal. The dataset includes both clean EEG signals and EEG signals contaminated with simulated EMG artifacts from various sources.
The signals are structured as 80,000 examples, each representing 1 second of data sampled at 256 Hz. The dataset is stored in two files:
- X.mat – Contains EEG signals with artifacts and corresponding EMG artifact sources.
- y.mat – Contains the clean EEG signals (artifact-free).
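One possible way to read the two files in Python is sketched below. It is not part of the dataset and makes several assumptions: that SciPy (and, for MATLAB v7.3 files, h5py) is installed, and that the variables stored inside the files are named "X" and "y", which the description does not state. The helper names load_mat_variable and check_shapes are hypothetical.

```python
import numpy as np


def load_mat_variable(path, name):
    """Load one variable from a .mat file as a NumPy array.

    scipy.io.loadmat reads MATLAB files up to v7.2. MATLAB v7.3 files are
    HDF5 containers, which loadmat rejects with NotImplementedError, so the
    function falls back to h5py in that case.
    """
    from scipy.io import loadmat  # assumed dependency, imported lazily

    try:
        return np.asarray(loadmat(path)[name])
    except NotImplementedError:
        import h5py  # assumed dependency, only needed for v7.3 files

        with h5py.File(path, "r") as f:
            # h5py exposes MATLAB arrays with the axis order reversed,
            # so transpose back to the MATLAB layout.
            return f[name][()].transpose()


def check_shapes(X, y, n_samples=256, n_channels=6):
    """Raise ValueError unless X is (N, 256, 6) and y is (N, 256)."""
    if X.ndim != 3 or X.shape[1:] != (n_samples, n_channels):
        raise ValueError(f"unexpected X shape: {X.shape}")
    if y.shape != X.shape[:2]:
        raise ValueError(f"y shape {y.shape} does not match X shape {X.shape}")


# Hypothetical usage; the variable names "X" and "y" are assumptions:
# X = load_mat_variable("X.mat", "X")
# y = load_mat_variable("y.mat", "y")
# check_shapes(X, y)
```

If the stored variable names differ, scipy.io.whosmat("X.mat") lists the names and shapes in a pre-v7.3 file without loading the data.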
Data Structure
X (dimensions: 80000 × 256 × 6)
- Dimension 1 (80000) – Number of signal examples (training samples).
- Dimension 2 (256) – Number of samples per signal, corresponding to 1 second of recording at 256 Hz.
- Dimension 3 (6) – Number of signal channels:
  - Channel 1: EEG signal contaminated with artifacts.
  - Channel 2: Simulated EMG artifact from the Fp1 electrode.
  - Channel 3: Simulated EMG artifact from the HEOG electrode.
  - Channel 4: Simulated EMG artifact from the Nape electrode.
  - Channel 5: Simulated EMG artifact from the Cheek electrode.
  - Channel 6: Simulated EMG artifact from the Jaw electrode.
y (dimensions: 80000 × 256)
- Dimension 1 (80000) – Number of signal examples, same as in X.
- Dimension 2 (256) – Number of samples per signal.
- y contains the corresponding clean EEG signal (artifact-free) for each of the 80,000 examples.
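The layout above can be illustrated with a short NumPy sketch that separates the contaminated EEG from the five EMG reference channels and builds the 1-second time axis. The channel numbers in the description are 1-based, so Channel 1 becomes index 0 in Python. The names EMG_SOURCES, split_channels and time_axis are hypothetical, and the demo uses a small zero-filled stand-in array rather than the real data.

```python
import numpy as np

# Channels 2-6 of X, in the order given in the description.
EMG_SOURCES = ("Fp1", "HEOG", "Nape", "Cheek", "Jaw")


def split_channels(X):
    """Split X of shape (N, 256, 6) into contaminated EEG and EMG references.

    Returns a tuple (contaminated_eeg, emg_by_source), where
    contaminated_eeg has shape (N, 256) and emg_by_source maps each
    electrode name to an array of shape (N, 256).
    """
    if X.ndim != 3 or X.shape[2] != 1 + len(EMG_SOURCES):
        raise ValueError(f"expected shape (N, samples, 6), got {X.shape}")
    contaminated_eeg = X[:, :, 0]  # Channel 1 in the 1-based description
    emg_by_source = {
        name: X[:, :, i + 1] for i, name in enumerate(EMG_SOURCES)
    }
    return contaminated_eeg, emg_by_source


def time_axis(n_samples=256, fs=256.0):
    """Return sample times in seconds for one example (0 to just under 1 s)."""
    return np.arange(n_samples) / fs


# Stand-in data with the documented layout (4 examples instead of 80,000).
X_demo = np.zeros((4, 256, 6))
eeg_demo, emg_demo = split_channels(X_demo)
print(eeg_demo.shape, sorted(emg_demo), time_axis()[-1])
```

Each row of y is the training target for the row of X with the same index, so a model would typically take X[i] (or a subset of its channels) as input and y[i] as the desired output.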
Please cite:
Kołodziej, M.; Jurczak, M.; Majkowski, A.; Rysz, A.; Świderski, B. A Hybrid CNN-LSTM Approach for Muscle Artifact Removal from EEG Using Additional EMG Signal Recording. Appl. Sci. 2025, 15, 4953. https://doi.org/10.3390/app15094953