HENLO: Human voice Natural Language from On-demand media

Citation Author(s): Agustinus Bimo Gumelar (Department of Electrical Engineering, Faculty of Intelligent Electrical and Information Technology (ELECTICS), Institut Teknologi Sepuluh Nopember, Surabaya, Indonesia)

Derry Pramono Adi (Department of Informatics, Faculty of Engineering, Widya Mandala Catholic University Surabaya, Surabaya, Indonesia)

Andreas Agung Kristanto (Department of Psychology, Faculty of Social and Political Sciences, Mulawarman University, Samarinda, East Kalimantan, Indonesia)
Submitted by: AGUSTINUS GUMELAR
Last updated: Wed, 11/20/2024 - 10:23
DOI: 10.21227/m0w3-nz08
Data Format: *.wav; *.mp3
Research Article Link: Speech Emotion Detection in On-Demand Media using SVM and LSTM (original in Ind…

Keywords:

Affective Computing; Emotion Recognition; Deep Learning; Emotional Speech; On-demand media

Abstract

The Human voice Natural Language from On-demand media (HENLO) dataset is a high-quality emotional speech dataset created to address the need for representative and realistic data in speech emotion recognition research. Unlike many existing datasets, which rely on simulated emotions performed by untrained speakers or directed participants, HENLO sources its data from professionally produced films and podcasts available on Media On-Demand (MOD). These audio samples feature trained actors employing the Stanislavski method, ensuring authentic emotional expressions that closely resemble real-life scenarios.

The dataset prioritizes realism and quality, leveraging audio from films and podcasts produced by top-tier entertainment companies. Each clip undergoes rigorous mastering and scoring processes to ensure minimal environmental noise, making the dataset ideal for machine learning models requiring clean acoustic signals. This high-quality data enables researchers to extract and analyze features such as pitch, intonation, and rhythm with greater accuracy. Additionally, MOD offers unlimited access to a diverse collection of media, further enriching the dataset with varied emotional contexts.

Instructions:

Contents

The dataset consists of 1,176 audio clips, categorized into four core emotional classes based on the theories of Robert Plutchik and Paul Ekman:

Angry: 337 clips
Sad: 293 clips
Happy: 279 clips
Fear: 273 clips
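
The class counts above are mildly imbalanced, which matters when training a classifier. The following is a minimal sketch, not part of the dataset's own tooling, that computes each class's share and a common inverse-frequency ("balanced") class weight from the listed counts. The weighting formula is an assumption chosen for illustration. Note that the per-class counts as listed sum to 1,182 rather than the stated 1,176; the sketch uses the per-class counts as given.

```python
# Per-class clip counts as listed in the dataset description.
counts = {"Angry": 337, "Sad": 293, "Happy": 279, "Fear": 273}

total = sum(counts.values())
n_classes = len(counts)

# Share of each class among all listed clips.
proportions = {label: n / total for label, n in counts.items()}

# Inverse-frequency weights (an illustrative choice, not the authors' method):
# weight = total / (n_classes * count), so rarer classes get larger weights.
class_weights = {label: total / (n_classes * n) for label, n in counts.items()}

for label, n in counts.items():
    print(f"{label}: {n} clips, {proportions[label]:.1%}, weight {class_weights[label]:.3f}")
```

Weights of this form can be passed to most training libraries as per-class loss weights, although with an imbalance this small, unweighted training may work equally well.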

All audio is in English and is available in both MP3 and WAV formats to accommodate diverse research and application needs.

File Details

Total Dataset Size: approximately 272 MB

Angry: 78 MB
Sad: 67 MB
Happy: 61 MB
Fear: 64 MB

Clip Duration: 5–20 seconds

File Formats: MP3 and WAV

Dataset Usage

Research Applications

This dataset is well-suited for:

Speech Emotion Recognition: Training and testing models to identify emotions from speech data.
Deep Learning Applications: Leveraging high-quality audio for advanced machine learning architectures such as CNNs and RNNs.
Human-Computer Interaction: Enhancing systems like virtual assistants and emotion-aware customer service bots with more responsive and realistic interactions.
Ethical and Clean Data Analysis: Utilizing audio free from environmental noise and ethical concerns, as the recordings come from publicly available on-demand media.

With its clean, high-quality audio and professional emotional expressions, HENLO stands out as an ideal resource for both academic research and practical application in modern speech emotion recognition.

Loading and Accessing the Data

Speech Data: Audio files can be opened in an audio editor such as Audacity, or loaded in Python with standard audio processing libraries such as librosa or pydub.
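
As a minimal sketch of the Python route, the snippet below indexes the clips by emotion label and loads one of them with librosa. The folder layout (one subfolder per emotion under a local `HENLO` directory) and the helper names `index_clips` and `load_clip` are assumptions made for illustration; adjust them to match how the downloaded archive is actually organized.

```python
from pathlib import Path

# Class names taken from the dataset description.
EMOTIONS = ("Angry", "Sad", "Happy", "Fear")


def index_clips(root):
    """Return (path, label) pairs, assuming a hypothetical <root>/<Emotion>/ layout."""
    root = Path(root)
    pairs = []
    for label in EMOTIONS:
        folder = root / label
        if not folder.is_dir():
            continue  # skip classes that are missing under this root
        for path in sorted(folder.iterdir()):
            if path.suffix.lower() in {".wav", ".mp3"}:
                pairs.append((path, label))
    return pairs


def load_clip(path):
    """Load one clip as a mono float array at the file's native sampling rate."""
    import librosa  # third-party; imported lazily so indexing works without it

    samples, sample_rate = librosa.load(path, sr=None, mono=True)
    return samples, sample_rate


if __name__ == "__main__":
    clips = index_clips("HENLO")  # hypothetical local path to the extracted dataset
    print(f"found {len(clips)} clips")
    if clips:
        first_path, first_label = clips[0]
        samples, sample_rate = load_clip(first_path)
        print(first_label, sample_rate, f"{len(samples) / sample_rate:.1f} s")
```

Passing `sr=None` keeps each file's original sampling rate instead of resampling to librosa's default, which makes it easier to check that loaded clips fall within the stated 5 to 20 second range.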

Preprocessing Notes