Standard Dataset
DISC: a dataset for integrated sensing and communication in mmWave systems
- Submitted by: Jesus Lacruz
- Last updated: Mon, 01/27/2025 - 03:42
- DOI: 10.21227/2gm7-9z72
Abstract
This dataset provides Channel Impulse Response (CIR) measurements from standard-compliant IEEE 802.11ay packets to validate Integrated Sensing and Communication (ISAC) methods. The CIR sequences contain reflections of the transmitted packets on people moving in an indoor environment. They are collected with a 60 GHz software-defined radio experimentation platform based on the IEEE 802.11ay Wi-Fi standard which, by operating in full-duplex mode, is not affected by frequency offsets.
The dataset is divided into three parts:
- DISC-A (disc_a.tar.gz): Consists of almost 40 minutes of IEEE 802.11ay CIR sequences including signal reflections on 7 subjects performing 4 different activities. This part is characterized by uniform packet transmission times, with a granularity of over 3 CIR estimates per millisecond, yielding extremely high temporal resolution.
- DISC-B (disc_b.tar.gz): Contains over 40 minutes of IEEE 802.11ay CIR sequences collected with uniform Inter-Frame Spacing (IFS), in which we append 12 TRN fields to each packet and steer the beam pattern (BP) in each of them to scan the whole field of view of the antenna. This enables the estimation of the Angle of Arrival (AoA) of the reflections, and hence the localization and tracking of subjects in the environment. This part contains data from 1 to 7 subjects performing 5 different activities.
- DISC-C (disc_c.tar.gz): For this part, we use open-source data on Wi-Fi traffic patterns to tune the inter-packet duration and collect more realistic sparse CIR sequences. The resulting CIR measurements, for a total of 9 minutes, are collected with a single subject performing the same 4 activities included in the first part. As in DISC-B, we also use the directional transmission capabilities of our testbed to enable the estimation of the AoA of the reflections.
We envision our dataset being used by researchers to train and validate machine and deep learning algorithms for fine-grained sensing. Possible use cases include, but are not limited to, the extraction of the micro-Doppler signatures of human movement from the CIR, which enables deep learning-based human activity recognition, and person identification from individual gait features. In addition, new ISAC problems such as the sparse reconstruction of sensing parameters from irregularly sampled signal traces, domain adaptation from regularly sampled signals to sparse ones, and target tracking under missing measurements can also be tackled using the provided dataset.
The full code to process the data and train the models is available at https://git2.networks.imdea.org/wng/disc_dataset
Data collection workflow:
We design the Millimeter-Wave (mmWave) testbed to be operated through two Python scripts. configure_SYSTEM.py configures the mmWave front-end (Sivers) in full-duplex mode and the BPs to be used, and initializes and configures the baseband processor. After this one-time system configuration, the provided scripts capture_<n_bp>TRN.py (for n_bp = 1 or 12) perform the following operations:
- Enable full-duplex mode.
- Load the I/Q samples file into the Field Programmable Gate Array (FPGA), depending on the number of TRN fields in the packet.
- Load the IFS file into the BRAM memory on the FPGA, or enable fixed IFS.
- Trigger data capture: transmit and save a configurable number of packets.
- Disable full-duplex mode.
To process the binary files from the testbed, we provide a MATLAB function (Top_Decoder.m), which reads the binary file, separates it into the different packets of I/Q samples, and extracts the corresponding CIR for each TRN unit included in the packets. These CIR sequences are the raw data provided in the dataset.
Data format:
Raw CIR measurements are provided as multi-dimensional arrays in MAT-file format (.mat). Specifically, each CIR sequence is stored in a separate .mat file containing two fields named CIR and TIME (optional). CIR is a 3-dimensional array with shape (n_range_bins, n_packets, n_bp), whose dimensions represent the number of range bins according to the range resolution of the system (512 in our case), the number of packets transmitted in the sequence, and the number of BPs used in each packet, respectively. TIME, when present, contains the time instant, in seconds, at which the corresponding packet was transmitted, relative to the beginning of the measurement. It is omitted in the uniform sequences, as they have fixed IFS, but it is present in the sparse measurements to provide additional information for sparse reconstruction algorithms.
We also provide processed micro-Doppler signatures extracted from the CIR through spectral analysis. These are provided in Python's NumPy array format (.npy), with shape (n_frames, n_freq), representing the number of time frames and frequency samples, respectively. The micro-Doppler spectrograms are obtained with a windowed short-time Fourier transform with a time window of W = 64 samples, using a Hanning window; subsequent windows overlap by 32 samples. The obtained spectra are then transformed to the decibel (dB) domain and normalized to the range [0, 1] along the frequency axis.
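For reference, the snippet below is a minimal Python sketch of loading one CIR sequence and computing a micro-Doppler spectrogram with the parameters described above. It is not the reference pipeline from the linked repository: the file name is hypothetical, the reduction over range bins and BPs is an assumption (the description does not specify it), and the .mat files are assumed to be readable with scipy.io.loadmat (v7.3 files would require h5py instead).

```python
# Minimal sketch, not the authors' reference pipeline.
# Assumptions: hypothetical file name, scipy.io.loadmat-readable .mat file,
# and a guessed reduction over range bins/BPs.
import numpy as np
from scipy.io import loadmat
from scipy.signal import stft

data = loadmat("PERSON1/WALKING_1_0.mat")      # hypothetical DISC-A file
cir = data["CIR"]                              # shape: (n_range_bins, n_packets, n_bp)

# Reduce to a single slow-time signal: first BP, strongest range bin (assumption).
bp0 = cir[:, :, 0]
signal = bp0[np.argmax(np.abs(bp0).mean(axis=1)), :]

# Windowed STFT: W = 64 samples, Hanning window, 32-sample overlap.
f, t, Z = stft(signal, window="hann", nperseg=64, noverlap=32, return_onesided=False)
spec = np.fft.fftshift(np.abs(Z), axes=0)      # center zero Doppler

# dB scale and [0, 1] normalization along the frequency axis, as in the .npy files.
spec_db = 20.0 * np.log10(spec + 1e-12)
spec_min, spec_max = spec_db.min(axis=0), spec_db.max(axis=0)
micro_doppler = ((spec_db - spec_min) / (spec_max - spec_min + 1e-12)).T
print(micro_doppler.shape)                     # (n_frames, n_freq)
```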
Dataset composition
To reduce biases in the data, the subjects in the experiments are both male and female, with different heights and weights. The dataset is structured as follows:
DISC-A:
Uniform IFS, forward-looking, multiple subjects.
- /raw_data: The CIR sequences are divided into subfolders, one per subject, named PERSON<id>, where <id> ∈ {1, ..., 7}. Within each subject's folder, sequences are named after the activity performed in that measurement as <activity>_<id>_<index>.mat, where <activity> is one of WALKING (A0), RUNNING (A1), SITTING (A2), or HANDS (A3), and <index> is an incremental integer identifying a specific measurement among those of the same subject and activity (a filename-parsing sketch follows this list).
- /micro_doppler_stft: Processed micro-Doppler spectrograms whose file names match the raw CIR sequences from which they are extracted.
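As an illustration of the DISC-A naming convention, the following sketch enumerates the raw CIR files and maps each one to its subject and activity label. Only the folder and file name patterns come from the description above; the root path is hypothetical.

```python
# Minimal sketch of DISC-A filename parsing; the root path is hypothetical.
import re
from pathlib import Path

ACTIVITY_LABELS = {"WALKING": "A0", "RUNNING": "A1", "SITTING": "A2", "HANDS": "A3"}
FILE_RE = re.compile(r"(?P<activity>[A-Z]+)_(?P<subject>\d+)_(?P<index>\d+)\.mat$")

root = Path("disc_a/raw_data")                       # hypothetical extraction path
for mat_file in sorted(root.glob("PERSON*/*.mat")):  # PERSON1 ... PERSON7
    m = FILE_RE.match(mat_file.name)
    if m is None:
        continue
    print(mat_file.name,
          "subject:", m.group("subject"),
          "activity:", ACTIVITY_LABELS[m.group("activity")])
```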
DISC-B:
Uniform IFS, multiple BPs, multiple subjects moving freely and concurrently.
- /raw_data: The CIR sequences are collected across 7 different days. A first set of CIR sequences contains a single subject moving freely in the room and performing one of 5 possible activities: WALKING (A0), RUNNING (A1), HANDS (A2), SITTING/STANDING (A3), and STANDING STILL (A4). A second set of sequences contains multiple subjects concurrently present in the room (up to a maximum of 5 subjects), in different locations and performing different activities. This subdirectory contains the measurements, which are named according to the convention TEST<test_id>_US_<rep>_<dev>.mat, where <test_id> is the identifier of the test, <rep> is an index identifying different repetitions of the same experiment, and <dev> is the device used for the experiment (a filename-parsing sketch follows this list).
  - For <test_id> > 5, the samples contain information for multiple targets at the same time.
- /micro_doppler_stft: Processed micro-Doppler spectrograms whose file names match the raw CIR sequences from which they are extracted.
  - For <test_id> > 5, the samples contain information for multiple targets at the same time. The micro-Doppler files for these samples are named TEST<test_id>_US_<rep>_<dev>_<activity>_<subject>.npy, where <activity> is the activity performed by subject <subject> in the sample.
- experiments_discb.xlsx: Contains the associations between the test identifiers, the activities performed, the file paths, and the experiment's day and room for each test.
- We provide information from two different devices:
  - F1, placed at (0, 0) m
  - F2, placed at (-1.8, 0) m
- The ground-truth position labels are (coordinates in meters):
  - P1 → (0.6, 4.2)
  - P2 → (-1.2, 4.8)
  - P3 → (-2.4, 3)
  - P4 → (-1.2, 3)
  - P5 → (1.2, 2.4)
  - P6 → (-1.8, 1.8)
  - P7 → (1.2, 4.8)
  - P8 → (-0.6, 1.8)
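The sketch below shows one way to parse the DISC-B file names and look up the device and ground-truth coordinates listed above. The example file name is hypothetical; the coordinates (in meters) are the ones reported in the list.

```python
# Minimal sketch of DISC-B filename parsing; the example file name is hypothetical.
import re

DEVICE_POS = {"F1": (0.0, 0.0), "F2": (-1.8, 0.0)}   # device positions [m]
LABEL_POS = {                                        # ground-truth position labels [m]
    "P1": (0.6, 4.2), "P2": (-1.2, 4.8), "P3": (-2.4, 3.0), "P4": (-1.2, 3.0),
    "P5": (1.2, 2.4), "P6": (-1.8, 1.8), "P7": (1.2, 4.8), "P8": (-0.6, 1.8),
}
NAME_RE = re.compile(r"TEST(?P<test_id>\d+)_US_(?P<rep>\d+)_(?P<dev>F\d)\.mat$")

m = NAME_RE.match("TEST3_US_1_F1.mat")               # hypothetical example
test_id, dev = int(m.group("test_id")), m.group("dev")
multi_target = test_id > 5                           # multiple targets for <test_id> > 5
print("device", dev, "at", DEVICE_POS[dev], "- multi-target:", multi_target)
print("P1 ground truth:", LABEL_POS["P1"])
```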
DISC-C:
Non-uniform IFS, multiple BPs, a single subject moving freely.
- /info: This subdirectory contains information describing the different sparse traffic patterns used for the experiments.
- /sampling_patterns: Contains .txt files with the packet transmission instants, in seconds. Each file is named <M>_<pattern>.txt, where <M> is the maximum number of packets per window transmitted in that sequence and <pattern> is one of psu_cs, library, and ug. A complete description of the design of these sampling patterns can be found in reference [1].
- experiments_discc.xlsx: Contains the associations between the test identifiers, the file names of the inter-packet patterns, and the activities performed in each test.
- /raw_data: We provide 8.7 s long traces for each of 3 sequence types (psu_cs, library, and ug). Experiments are repeated for different values of the maximum number of packets per window, setting it to M = 4, 8, 16, 32, 64. This subdirectory contains the measurements, which are named according to the convention TEST<id>_<r>_<code>_F2.mat, where <id> is an incremental identifier for each measurement and <r> is an incremental integer identifying repetitions of the same sequence. <code> is a code of the form M<i>, with <i> being an integer mapping the measurement to its inter-packet times, i.e., to the value of M used for the injections and the traffic pattern used (a loading sketch follows this list).
- /micro_doppler_itft: Processed micro-Doppler spectrograms obtained from the sparse CIR sequences using the method proposed in [1], which is based on the Iterative Hard Thresholding algorithm; they are named according to the convention test_<id>.npy.
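To illustrate the sparse data layout, the following sketch loads a DISC-C CIR sequence together with its packet transmission instants and a sampling-pattern file. The file names and paths are hypothetical instantiations of the conventions above; the field names (CIR, TIME) follow the data format description, and scipy.io.loadmat is assumed to be able to read the .mat files (v7.3 files would need h5py).

```python
# Minimal sketch of loading a sparse DISC-C sequence; file names/paths are hypothetical.
import numpy as np
from scipy.io import loadmat

data = loadmat("disc_c/raw_data/TEST1_0_M1_F2.mat")   # hypothetical measurement
cir = data["CIR"]                                     # (n_range_bins, n_packets, n_bp)
t_packets = data["TIME"].squeeze()                    # packet transmission instants [s]

# Corresponding inter-packet pattern (see experiments_discc.xlsx for the exact mapping).
pattern = np.loadtxt("sampling_patterns/16_library.txt")  # hypothetical pattern file

print("packets:", cir.shape[1],
      "trace span [s]:", float(t_packets[-1] - t_packets[0]))
```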
Dataset Files
- uniform_7subj.zip (59.38 GB)
- sparse_1subj.zip (119.53 GB)
- disc_a.tar.gz (59.50 GB)
- disc_c.tar.gz (32.36 GB)
- disc_b.tar.gz (549.96 GB)
- scripts_DISC.zip (399.78 MB)