HENLO: Human voice Natural Language from On-demand media

Citation Author(s):
Agustinus Bimo Gumelar
Department of Electrical Engineering, Faculty of Intelligent Electrical and Information Technology (ELECTICS), Institut Teknologi Sepuluh Nopember, Surabaya, Indonesia
Derry Pramono Adi
Department of Informatics, Faculty of Engineering, Widya Mandala Catholic University Surabaya, Surabaya, Indonesia
Andreas Agung Kristanto
Department of Psychology, Faculty of Social and Political Sciences, Mulawarman University, Samarinda, East Kalimantan, Indonesia
Submitted by:
AGUSTINUS GUMELAR
Last updated:
Wed, 11/20/2024 - 05:23
DOI:
10.21227/m0w3-nz08

Abstract 

The Human voice Natural Language from On-demand media (HENLO) dataset is a high-quality emotional speech dataset created to address the need for representative and realistic data in speech emotion recognition research. Unlike many existing datasets, which rely on simulated emotions performed by untrained speakers or directed participants, HENLO sources its data from professionally produced films and podcasts available on Media On-Demand (MOD). These audio samples feature trained actors employing the Stanislavski method, ensuring authentic emotional expressions that closely resemble real-life scenarios.

The dataset prioritizes realism and quality, leveraging audio from films and podcasts produced by top-tier entertainment companies. Each clip undergoes rigorous mastering and scoring processes to ensure minimal environmental noise, making the dataset ideal for machine learning models requiring clean acoustic signals. This high-quality data enables researchers to extract and analyze features such as pitch, intonation, and rhythm with greater accuracy. Additionally, MOD offers unlimited access to a diverse collection of media, further enriching the dataset with varied emotional contexts.

Instructions: 

Contents

The dataset consists of 1,176 audio clips, categorized into four core emotional classes based on the theories of Robert Plutchik and Paul Ekman:

  • Angry: 337 clips
  • Sad: 293 clips
  • Happy: 279 clips
  • Fear: 273 clips

All audio is in English and is available in both MP3 and WAV formats to accommodate diverse research and application needs.
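
The four classes are not perfectly balanced, which can matter when training a classifier. As a minimal sketch, the snippet below computes each class's share of the clips and a set of inverse-frequency class weights from the per-class counts exactly as listed above (the listed counts add up to 1,182 clips). The weighting formula is a common convention chosen here for illustration, not something the dataset authors prescribe, and the function names are hypothetical.

```python
# Per-class clip counts as listed in the dataset description.
CLIP_COUNTS = {"Angry": 337, "Sad": 293, "Happy": 279, "Fear": 273}


def class_shares(counts):
    """Return each class's fraction of all listed clips."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def balanced_class_weights(counts):
    """Inverse-frequency weights: total / (num_classes * class_count).

    A class holding exactly 1/num_classes of the data gets weight 1.0;
    rarer classes get weights above 1.0, more frequent ones below 1.0.
    """
    total = sum(counts.values())
    num_classes = len(counts)
    return {label: total / (num_classes * n) for label, n in counts.items()}


if __name__ == "__main__":
    shares = class_shares(CLIP_COUNTS)
    weights = balanced_class_weights(CLIP_COUNTS)
    for label in CLIP_COUNTS:
        print(f"{label:<6} share={shares[label]:.3f} weight={weights[label]:.3f}")
```

Weights of this form can be passed to most loss functions that accept per-class weights, so that the smaller Fear and Happy classes are not underrepresented during training.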

 

File Details

Total Dataset Size: approximately 272 MB

  • Angry: 78 MB
  • Sad: 67 MB
  • Happy: 61 MB
  • Fear: 64 MB

Clip Duration: 5–20 seconds

File Formats: MP3 and WAV
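
After downloading, it can be useful to confirm that a WAV clip matches the details given on this page. The sketch below uses only Python's standard `wave` module to read a file's header and compare it with the stated 5–20 second duration range and the 48 kHz sampling rate mentioned in the preprocessing notes. The function names are hypothetical, and the check assumes plain PCM WAV files; other WAV encodings may not be readable by the `wave` module.

```python
import wave

EXPECTED_SAMPLE_RATE = 48_000          # Hz, per the preprocessing notes
MIN_SECONDS, MAX_SECONDS = 5.0, 20.0   # stated clip duration range


def describe_wav(path):
    """Read sample rate, channel count, and duration from a WAV header."""
    with wave.open(str(path), "rb") as wav:
        sample_rate = wav.getframerate()
        frames = wav.getnframes()
        return {
            "sample_rate": sample_rate,
            "channels": wav.getnchannels(),
            "seconds": frames / sample_rate,
        }


def matches_description(info):
    """True if the clip has the stated sampling rate and duration range."""
    return (
        info["sample_rate"] == EXPECTED_SAMPLE_RATE
        and MIN_SECONDS <= info["seconds"] <= MAX_SECONDS
    )
```

Running `describe_wav` over every WAV file and listing those for which `matches_description` returns `False` gives a quick integrity check before any training run.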

 

Dataset Usage

Research Applications

This dataset is well-suited for:

  • Speech Emotion Recognition: Training and testing models to identify emotions from speech data.
  • Deep Learning Applications: Leveraging high-quality audio for advanced machine learning architectures such as CNNs and RNNs.
  • Human-Computer Interaction: Enhancing systems like virtual assistants and emotion-aware customer service bots with more responsive and realistic interactions.
  • Ethical and Clean Data Analysis: Utilizing audio free from environmental noise and ethical concerns, as the recordings come from publicly available on-demand media.

With its clean, high-quality audio and professionally acted emotional expressions, HENLO is well suited to both academic research and practical applications in speech emotion recognition.

 

Loading and Accessing the Data

Speech Data: Audio files can be opened in an audio editor such as Audacity, or loaded in Python with standard audio processing libraries such as librosa or pydub.
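
As a minimal sketch of loading the data in Python, the snippet below first builds a list of (file, label) pairs and then decodes a single clip with librosa. It assumes a hypothetical folder layout with one sub-folder per emotion class (`Angry`, `Sad`, `Happy`, `Fear`); the actual archive layout is not described on this page, so the folder names and helper functions should be adapted as needed.

```python
from pathlib import Path

EMOTIONS = ("Angry", "Sad", "Happy", "Fear")
AUDIO_SUFFIXES = {".wav", ".mp3"}


def index_clips(root):
    """Return sorted (path, label) pairs for audio files under root/<Emotion>/.

    ASSUMPTION: one sub-folder per emotion class. Adjust to the real layout.
    """
    pairs = []
    for label in EMOTIONS:
        folder = Path(root) / label
        if not folder.is_dir():
            continue  # skip classes that are missing in this copy
        for path in sorted(folder.iterdir()):
            if path.suffix.lower() in AUDIO_SUFFIXES:
                pairs.append((path, label))
    return pairs


def load_clip(path, sample_rate=None):
    """Decode one clip to a mono float waveform with librosa.

    sample_rate=None keeps the file's native rate instead of resampling.
    """
    import librosa  # third-party; imported lazily so indexing works without it

    waveform, rate = librosa.load(str(path), sr=sample_rate, mono=True)
    return waveform, rate
```

A typical use would be `pairs = index_clips("HENLO")` followed by `waveform, rate = load_clip(pairs[0][0])`, where `"HENLO"` stands in for wherever the archive was extracted.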

Preprocessing Notes

  • Speech: The preprocessing steps included noise reduction, resampling to 48 kHz, and silence removal.
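
The notes above do not say which tools or settings were used for these steps. Purely as an illustration of what silence removal can look like, the dependency-free sketch below trims leading and trailing frames whose RMS level falls below a threshold. The -40 dB threshold, the 20 ms frame length, and the function names are assumptions made for this example, and the sketch does not remove pauses inside a clip.

```python
import math


def frame_rms(samples, frame_length):
    """Root-mean-square level of each non-overlapping frame."""
    levels = []
    for start in range(0, len(samples), frame_length):
        frame = samples[start:start + frame_length]
        levels.append(math.sqrt(sum(x * x for x in frame) / len(frame)))
    return levels


def trim_silence(samples, sample_rate, threshold_db=-40.0, frame_ms=20):
    """Drop leading and trailing frames quieter than threshold_db.

    samples: sequence of floats in [-1.0, 1.0].
    threshold_db is relative to full scale (amplitude 1.0). Both default
    values are assumptions for illustration, not the authors' settings.
    """
    frame_length = max(1, int(sample_rate * frame_ms / 1000))
    threshold = 10.0 ** (threshold_db / 20.0)  # dB to linear amplitude
    levels = frame_rms(samples, frame_length)
    loud = [i for i, level in enumerate(levels) if level > threshold]
    if not loud:
        return samples[:0]  # the whole clip is below the threshold
    start = loud[0] * frame_length
    end = min(len(samples), (loud[-1] + 1) * frame_length)
    return samples[start:end]
```

In practice, librosa offers comparable functionality: `librosa.load(path, sr=48000)` resamples on load, and `librosa.effects.trim` trims leading and trailing silence, although its threshold is measured relative to a reference level in the signal rather than to full scale.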

 

Documentation

Attachment: HENLO_Dataset_Overview-2024.docx (17.62 KB)