Indonesian Toxic Speech Dataset (IndoToxSpeech)

Citation Author(s): Agustinus Bimo Gumelar

Department of Electrical Engineering, Faculty of Intelligent Electrical and Information Technology (ELECTICS), Institut Teknologi Sepuluh Nopember, Surabaya, Indonesia

Eko Mulyanto Yuniarno

Department of Computer Engineering, Faculty of Intelligent Electrical and Information Technology (ELECTICS), Institut Teknologi Sepuluh Nopember, Surabaya, Indonesia

Arif Nugroho

Department of Electrical Engineering, Faculty of Intelligent Electrical and Information Technology (ELECTICS), Institut Teknologi Sepuluh Nopember, Surabaya, Indonesia

Derry Pramono Adi

Department of Electrical Engineering, Faculty of Intelligent Electrical and Information Technology (ELECTICS), Institut Teknologi Sepuluh Nopember, Surabaya, Indonesia

Indar Sugiarto

Department of Electrical Engineering, Petra Christian University, Surabaya, Indonesia

Andreas Agung Kristanto

Department of Psychology, Faculty of Social and Political Sciences, Mulawarman University, Samarinda, East Kalimantan, Indonesia

Mauridhi Hery Purnomo

Department of Electrical Engineering, Department of Computer Engineering, Faculty of Intelligent Electrical and Information Technology (ELECTICS), Institut Teknologi Sepuluh Nopember, Surabaya, Indonesia
Submitted by: Agustinus Gumelar
Last updated: Wed, 01/08/2025 - 05:14
DOI: 10.21227/dbgb-j630
Data Format: *.wav; *.csv
Research Article Link: An Improved Toxic Speech Detection on Multimodal Scam Confrontation Data Using LSTM-Based Deep Learning
License: Creative Commons Attribution

Categories: Artificial Intelligence, Communications, Computational Intelligence
Keywords: Toxic Speech, Indonesian Language, Scam Conversation, YouTube, Natural Language Processing, Speech-to-Text

Abstract

This dataset contains audio recordings and transcriptions of toxic speech taken from Indonesian-language YouTube videos in which scammers are confronted. It captures two separate interactions that escalate into toxic exchanges. Each interaction has been verified by native Indonesian speakers and labeled with one of two classes: toxic or non-toxic. The dataset includes both original and preprocessed versions of the speech and text data. The original speech files total 136.9 MB and the preprocessed speech files total 111.7 MB. Text transcriptions of the conversations are also included; the original and preprocessed text files are each 16 KB. The dataset can be used for research in toxic speech detection, natural language processing, and the development of machine learning models for audio and text classification.

Instructions:

Dataset Overview

This dataset contains audio recordings and transcriptions of toxic speech from two Indonesian conversations taken from YouTube videos in which scammers are confronted. Each conversation was verified by native Indonesian speakers and labeled as toxic or non-toxic. The speech and text data are each provided in an original and a preprocessed version.

Contents

The dataset is organized into the following directories:

/audio_original/

Contains the original audio files in WAV format. These files capture the raw conversations as recorded from the YouTube videos.

/audio_preprocessed/

Contains the preprocessed audio files in WAV format. These files have undergone noise reduction, sampling rate adjustment, duration trimming, and silence removal to improve clarity for machine learning tasks.

/transcript_original/

Contains the original text transcriptions of the audio files in CSV format. The transcriptions have been verified for accuracy by native Indonesian speakers.

/transcript_preprocessed/

Contains the preprocessed text transcriptions in CSV format. These files have undergone case folding, number removal, stopword removal, stemming, and correction of typographical errors.
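The text preprocessing steps listed above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the stopword list is a tiny hypothetical sample, punctuation stripping is an added assumption, and stemming and typo correction are left out because the page does not say which tools were used.

```python
import re

# Hypothetical, deliberately tiny stopword list for illustration only;
# the list actually used for this dataset is not documented on this page.
STOPWORDS = frozenset({"yang", "dan", "di", "ke", "itu", "ini"})


def preprocess_text(text, stopwords=STOPWORDS):
    """Apply case folding, number removal, and stopword removal."""
    text = text.lower()                    # case folding
    text = re.sub(r"\d+", " ", text)       # number removal
    text = re.sub(r"[^\w\s]", " ", text)   # assumption: punctuation is dropped too
    tokens = [tok for tok in text.split() if tok not in stopwords]
    return " ".join(tokens)
```

Applied to a raw transcript line, the function returns a lowercased, space-separated string with digits, punctuation, and the listed stopwords removed. A stemmer for Indonesian would be applied to the tokens afterwards.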

File Details

/audio_original/

Total size: 136.9 MB
Format: WAV

/audio_preprocessed/

Total size: 111.7 MB
Format: WAV

/transcript_original/

Total size: 16 KB
Format: CSV

/transcript_preprocessed/

Total size: 16 KB
Format: CSV

Dataset Usage

Research Applications

This dataset is suited to:

Toxic speech detection in audio and text.
Natural language processing (NLP) in Indonesian.
Speech-to-text conversion studies.
Machine learning model training for audio and text classification.

Loading and Accessing the Data

Speech Data: Audio files can be loaded using standard audio processing libraries in Python, such as librosa or pydub.
Text Data: The text files are in CSV format and can be read using pandas or any other text-processing library.
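As a rough sketch of the loading step, the helpers below use only the Python standard library (`wave` and `csv`) so that they run without extra dependencies; `librosa.load` and `pandas.read_csv` are the usual equivalents when those libraries are installed. The function names are hypothetical, and because the CSV column names are not documented on this page, the reader simply returns whatever header the file provides.

```python
import csv
import wave


def read_wav_info(path):
    """Return (sample_rate, n_channels, n_frames, duration_seconds) for a PCM WAV file."""
    with wave.open(str(path), "rb") as wf:
        rate = wf.getframerate()
        frames = wf.getnframes()
        return rate, wf.getnchannels(), frames, frames / rate


def read_transcript(path, encoding="utf-8"):
    """Read a transcript CSV into a list of dicts keyed by the file's header row."""
    # Assumption: the file is UTF-8 encoded and has a header row.
    with open(path, newline="", encoding=encoding) as f:
        return list(csv.DictReader(f))
```

Calling `read_wav_info` on a file under /audio_original/ gives a quick check of sampling rate and duration before training, and `read_transcript` on a file under /transcript_original/ shows which columns hold the text and the toxic or non-toxic label.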

Preprocessing Notes

Speech: Preprocessing included noise reduction, sampling rate adjustment, duration trimming, and silence removal.
Text: Preprocessing included text normalization, case folding, number removal, stopword removal, stemming, and correction of typographical errors.
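To make the silence-removal step concrete, here is a small sketch of one common approach: split the signal into fixed-length frames and drop frames whose peak amplitude falls below a threshold. The frame length and threshold are hypothetical values chosen for illustration; the parameters and method the authors actually used are not stated here.

```python
def remove_silence(samples, frame_len=400, threshold=500):
    """Drop frames whose peak absolute amplitude is below `threshold`.

    `samples` is a sequence of integer PCM sample values (for example 16-bit).
    `frame_len` and `threshold` are illustrative defaults, not dataset settings.
    """
    kept = []
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        if max(abs(s) for s in frame) >= threshold:
            kept.extend(frame)
    return kept
```

A peak-based rule is easy to reason about but sensitive to isolated clicks; an energy-based (RMS) criterion is a common alternative when recordings are noisy.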

Dataset Files

Files have not been uploaded for this dataset

Documentation

Attachment	Size
Dataset Overview - IndoToxSpeech.docx	16.54 KB
