Indonesian Toxic Speech Dataset (IndoToxSpeech)

Citation Author(s):
Agustinus Bimo Gumelar
Department of Electrical Engineering, Faculty of Intelligent Electrical and Information Technology (ELECTICS), Institut Teknologi Sepuluh Nopember, Surabaya, Indonesia
Eko Mulyanto Yuniarno
Department of Computer Engineering, Faculty of Intelligent Electrical and Information Technology (ELECTICS), Institut Teknologi Sepuluh Nopember, Surabaya, Indonesia
Arif Nugroho
Department of Electrical Engineering, Faculty of Intelligent Electrical and Information Technology (ELECTICS), Institut Teknologi Sepuluh Nopember, Surabaya, Indonesia
Derry Pramono Adi
Department of Electrical Engineering, Faculty of Intelligent Electrical and Information Technology (ELECTICS), Institut Teknologi Sepuluh Nopember, Surabaya, Indonesia
Indar Sugiarto
Department of Electrical Engineering, Petra Christian University, Surabaya, Indonesia
Mauridhi Hery Purnomo
Department of Electrical Engineering, Department of Computer Engineering, Faculty of Intelligent Electrical and Information Technology (ELECTICS), Institut Teknologi Sepuluh Nopember, Surabaya, Indonesia
Submitted by:
AGUSTINUS GUMELAR
Last updated:
Thu, 09/05/2024 - 23:48
DOI:
10.21227/dbgb-j630

Abstract 

This dataset contains audio recordings and transcriptions of toxic speech drawn from Indonesian conversations in YouTube videos in which scammers are confronted. The dataset captures two separate interactions that escalate into toxic exchanges. Each interaction has been verified by native Indonesian speakers and labeled using two classes: toxic and non-toxic. The dataset includes both the original and preprocessed versions of the speech and text data. The original speech files total 136.9 MB, and the preprocessed speech files total 111.7 MB. Text transcriptions of the conversations are also included; the original and preprocessed text files are each 16 KB. This dataset can be used for research in toxic speech detection, natural language processing, and the development of machine learning models for audio and text classification.

Instructions: 

Dataset Overview

This dataset contains audio recordings and transcriptions of toxic speech from two Indonesian conversations recorded in YouTube videos in which scammers are confronted. Each conversation was verified by native Indonesian speakers and classified into toxic and non-toxic categories. The dataset is provided in both original and preprocessed versions of the speech and text data.

Contents

The dataset is organized into the following directories:

  • /audio_original/
    • Contains the original audio files in WAV format. These files capture the raw conversations as recorded from the YouTube videos.
  • /audio_preprocessed/
    • Contains the preprocessed audio files in WAV format. These files have undergone noise reduction, sampling rate adjustment, duration trimming, and silence removal to improve clarity for machine learning tasks.
  • /transcript_original/
    • Contains the original text transcriptions of the audio files in CSV format. The transcriptions have been verified for accuracy by native Indonesian speakers.
  • /transcript_preprocessed/
    • Contains the preprocessed text transcriptions in CSV format. These files have undergone case folding, number removal, stopword removal, stemming, and correction of typographical errors.

File Details

  • /audio_original/
    • Total size: 136.9 MB
    • Format: WAV
  • /audio_preprocessed/
    • Total size: 111.7 MB
    • Format: WAV
  • /transcript_original/
    • Total size: 16 KB
    • Format: CSV
  • /transcript_preprocessed/
    • Total size: 16 KB
    • Format: CSV

 

Dataset Usage

Research Applications

This dataset is suited to:

  • Toxic speech detection in audio and text.
  • Natural language processing (NLP) in Indonesian.
  • Speech-to-text conversion studies.
  • Machine learning model training for audio and text classification.

 

Loading and Accessing the Data

  1. Speech Data: Audio files can be loaded using standard audio processing libraries in Python, such as librosa or pydub.
  2. Text Data: The text files are in CSV format and can be read using pandas or any other text-processing library.
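
The two loading steps above can be sketched as follows. This is a minimal illustration, not the authors' tooling: it uses only the Python standard library (wave and csv) so that it runs without extra packages, and it assumes the files are uncompressed PCM WAV and UTF-8 CSV with a header row. The individual file names and CSV column names are not documented on this page, so the sketch simply reads whatever it finds under the directory names listed in Contents; the function names are hypothetical.

```python
import csv
import wave
from pathlib import Path


def read_wav_info(path):
    """Return basic properties of a PCM WAV file using the standard library."""
    with wave.open(str(path), "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_rate": rate,
            "duration_s": frames / rate,
        }


def read_transcript(path, encoding="utf-8"):
    """Read a transcript CSV into a list of dicts keyed by its header row."""
    # The column names are taken from the file itself; none are assumed here.
    with open(path, newline="", encoding=encoding) as handle:
        return list(csv.DictReader(handle))


def summarize(root):
    """Collect WAV properties and transcript rows found under `root`."""
    root = Path(root)
    audio = {
        wav_path.name: read_wav_info(wav_path)
        for wav_path in sorted((root / "audio_preprocessed").glob("*.wav"))
    }
    rows = []
    for csv_path in sorted((root / "transcript_preprocessed").glob("*.csv")):
        rows.extend(read_transcript(csv_path))
    return audio, rows
```

To work with the waveform itself, the same paths can be passed to librosa.load(path, sr=None), which returns the samples together with the file's native sampling rate, and the CSV files can be read with pandas.read_csv(path).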

 

Preprocessing Notes

  • Speech: Preprocessing included noise reduction, sampling rate adjustment, duration trimming, and silence removal.
  • Text: Preprocessing included text normalization, case folding, number removal, stopword removal, stemming, and correction of typographical errors.
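
As a rough illustration of the text steps listed above, the sketch below chains case folding, number removal, stopword removal, and optional stemming. It is an assumption-laden example, not the authors' pipeline: the stopword list is a tiny hypothetical sample, punctuation is dropped as a stand-in for text normalization, the stemmer is left as a caller-supplied function, and typo correction is omitted because the page does not say how it was done.

```python
import re

# Hypothetical, deliberately tiny stopword list for illustration only;
# the list actually used for this dataset is not documented.
STOPWORDS = {"yang", "dan", "di", "ke", "dari", "ini", "itu"}


def preprocess_text(text, stopwords=STOPWORDS, stem=None):
    """Apply case folding, number removal, stopword removal, and optional stemming."""
    text = text.casefold()                   # case folding
    text = re.sub(r"\d+", " ", text)         # number removal
    tokens = re.findall(r"[^\W\d_]+", text)  # keep alphabetic tokens, dropping punctuation
    tokens = [t for t in tokens if t not in stopwords]  # stopword removal
    if stem is not None:
        # `stem` is any callable mapping a token to its stem (assumed, not specified here).
        tokens = [stem(t) for t in tokens]
    return " ".join(tokens)
```

An Indonesian stemmer can be passed in through the stem argument; leaving it as None skips that step, which makes it easy to compare results with and without stemming.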

 

Dataset Files

    Files have not been uploaded for this dataset

    Documentation

    Attachment: Dataset Overview - IndoToxSpeech.docx (16.54 KB)