Emotional Crowd Sound

Citation Author(s):
Department of Computer, Control, and Management Engineering, Sapienza University of Rome, Italy
Department of Mathematics and Computer Science, University of Florence, Italy
Department of Mathematics and Computer Science, University of Perugia, Italy
Submitted by:
Valentina Franzoni
Last updated:
Thu, 02/25/2021 - 08:03


Crowds express emotions as a collective individual, as is evident from the sounds a crowd produces at particular events, e.g., collective booing, laughing, or cheering at sports matches, movies, theaters, concerts, political demonstrations, and riots. Crowd sounds can be characterized by frequency-amplitude features, using analysis techniques similar to those applied to individual voices, in which deep learning classification is applied to spectrogram images derived from sound transformations.
We present the first dataset for applying a technique based on generating sound spectrograms from fixed-length fragments extracted from original audio clips recorded at high-attendance events, where the crowd acts as a collective individual. Transfer learning techniques can then be used on a neural network, either novel or pre-trained on low-level features using extensive datasets of visual knowledge.
The original sound clips are filtered and normalized in amplitude to allow correct spectrogram generation; the spectrograms are then used to fine-tune the domain-specific features.
This dataset includes the complete data of the study, so that each step can be reproduced.

Files in the dataset:

step0 original files: 
Approval 39
Disapproval 14
Neutral 15

step1 normalization:
Approval 39
Disapproval 14
Neutral 15
We normalized the loudness of the dataset to −23 LUFS (Loudness Units relative to Full Scale), following the EBU R128 standard.
We filtered the sound to the 20–20,000 Hz range.
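
The gain arithmetic behind loudness normalization can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes the integrated loudness of each clip has already been measured with an EBU R128-compliant meter (the measurement itself is not shown), and the function names are hypothetical.

```python
TARGET_LUFS = -23.0  # target level stated in the dataset description


def gain_db(measured_lufs, target_lufs=TARGET_LUFS):
    """Gain in dB that moves a clip from its measured loudness to the target."""
    return target_lufs - measured_lufs


def gain_linear(measured_lufs, target_lufs=TARGET_LUFS):
    """The same gain expressed as a linear amplitude multiplier."""
    return 10.0 ** (gain_db(measured_lufs, target_lufs) / 20.0)


def apply_gain(samples, factor):
    """Scale every sample of a clip by the linear gain factor."""
    return [s * factor for s in samples]
```

For example, a clip measured at -18 LUFS needs a gain of -5 dB, i.e. its samples are multiplied by about 0.56.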

step2 sound blocks:
Approval 1787
Disapproval 388
Neutral 7340
We divided the sound files into blocks with the following characteristics:
1 s block length
0.25 s window shift
0.75 s overlap between consecutive blocks
We removed 37 silent blocks.
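
The block extraction described above can be sketched as a sliding window. This is a minimal sketch under stated assumptions: the sample rate is whatever the clip uses, and both the RMS-based silence test and its threshold `silence_rms` are hypothetical, since the description does not say how silent blocks were detected.

```python
import math


def block_bounds(n_samples, sample_rate, block_s=1.0, hop_s=0.25):
    """Return (start, end) sample indices of full-length overlapping blocks."""
    block = round(block_s * sample_rate)
    hop = round(hop_s * sample_rate)
    return [(start, start + block)
            for start in range(0, n_samples - block + 1, hop)]


def rms(samples):
    """Root-mean-square amplitude of a block."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def non_silent_blocks(samples, sample_rate, silence_rms=1e-4):
    """Keep only blocks whose RMS exceeds the (hypothetical) silence threshold."""
    return [(a, b) for a, b in block_bounds(len(samples), sample_rate)
            if rms(samples[a:b]) > silence_rms]
```

With a 1 s block and a 0.25 s shift, consecutive blocks share 0.75 s of audio, so a 3 s clip yields 9 blocks.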

step3 spectrogram images:
The blocks of the three emotional classes were transformed into spectrogram images on four frequency scales:
bark (0-3.5 kHz)
erb (2-4 kHz)
log (0.02-2 kHz)
mel (4-6 kHz)
For each scale:
Approval 1787
Disapproval 388
Neutral 7340
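
To make the four frequency scales concrete, the sketch below maps a frequency in Hz onto each of them. The formulas are common textbook variants (O'Shaughnessy for mel, Traunmüller for Bark, Glasberg and Moore for the ERB-rate); they are assumptions for illustration and may differ from the conversions used internally by the authors' spectrogram tool.

```python
import math


def hz_to_mel(f):
    """Mel scale (O'Shaughnessy formula)."""
    return 2595.0 * math.log10(1.0 + f / 700.0)


def hz_to_bark(f):
    """Bark scale (Traunmueller approximation)."""
    return 26.81 * f / (1960.0 + f) - 0.53


def hz_to_erb_rate(f):
    """ERB-rate scale (Glasberg and Moore approximation)."""
    return 21.4 * math.log10(1.0 + 0.00437 * f)


def hz_to_log(f):
    """Plain logarithmic frequency axis; f must be positive."""
    return math.log10(f)
```

All four mappings are monotonic and compress high frequencies relative to low ones, which is why the same sound block produces visibly different images on each scale.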
Spectrograms were generated using the spgrambw spectrogram-drawing function.
We used the Jet colormap with 64 colors and generated PNG images using a 400-sample Hamming window and a frame increment of 4.5 milliseconds.
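
The window and frame increment quoted above can be illustrated with a generic short-time framing sketch. It is not a reimplementation of spgrambw: the sample rate passed in is an assumption (the description does not state it), and the function names are hypothetical.

```python
import math


def hamming(n):
    """Symmetric n-point Hamming window."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * k / (n - 1))
            for k in range(n)]


def frame_starts(n_samples, sample_rate, win_len=400, hop_s=0.0045):
    """Start index of every full analysis frame, stepping by the frame increment."""
    hop = max(1, round(hop_s * sample_rate))  # increment converted to samples
    return list(range(0, n_samples - win_len + 1, hop))


def windowed_frame(samples, start, window):
    """Multiply one frame of the signal by the window, sample by sample."""
    return [s * w for s, w in zip(samples[start:start + len(window)], window)]
```

Each windowed frame would then be passed to a Fourier transform to obtain one column of the spectrogram image.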

step4 train and test spectrograms:
Train:
Approval 1429
Disapproval 310
Neutral 5872
Test:
Approval 358
Disapproval 78
Neutral 1468
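
As a consistency check, the two groups of counts in step4 add up to the per-class totals of step2, and the larger group matches an 80% share of each class rounded down. The sketch below verifies this; the 80/20 ratio is an observation drawn from the numbers, not a split procedure stated in this description, and the helper names are hypothetical.

```python
TOTAL = {"Approval": 1787, "Disapproval": 388, "Neutral": 7340}   # step2 counts
FIRST = {"Approval": 1429, "Disapproval": 310, "Neutral": 5872}   # larger group
SECOND = {"Approval": 358, "Disapproval": 78, "Neutral": 1468}    # smaller group


def check_partition(total, first, second):
    """True if the two groups together account for every block of every class."""
    return all(first[c] + second[c] == total[c] for c in total)


def floor_split(n, num=4, den=5):
    """Split n items into floor(n * num / den) and the remainder (integer math)."""
    larger = n * num // den
    return larger, n - larger
```

Calling `floor_split` on each step2 total returns exactly the pair of counts listed for that class.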

Extract the zip files locally and read the README file.
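
Extraction can also be scripted. The sketch below unpacks every zip archive found in a download folder into its own subfolder; the folder layout and the function name are hypothetical, as the archive names are not listed here.

```python
import zipfile
from pathlib import Path


def extract_archives(src_dir, dest_dir):
    """Extract every .zip in src_dir into dest_dir/<archive name without .zip>/."""
    dest = Path(dest_dir)
    extracted = []
    for archive in sorted(Path(src_dir).glob("*.zip")):
        target = dest / archive.stem
        target.mkdir(parents=True, exist_ok=True)
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(target)
        extracted.append(target)
    return extracted
```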

Instructions for dataset usage are included in the open access paper:
 Franzoni, V., Biondi, G., Milani, A., Emotional sounds of crowds: spectrogram-based analysis using deep learning (2020) Multimedia Tools and Applications, 79 (47-48), pp. 36063-36075. https://doi.org/10.1007/s11042-020-09428-x

Files are released under the Creative Commons Attribution-ShareAlike 4.0 International License.

Dataset Files


README.txt (1.5 KB)