Indonesian Toxic Speech Dataset (IndoToxSpeech)
- Submitted by: AGUSTINUS GUMELAR
- Last updated: Wed, 11/20/2024 - 05:22
- DOI: 10.21227/dbgb-j630
Abstract
This dataset contains audio recordings and transcriptions of toxic speech drawn from Indonesian-language conversations in YouTube videos in which scammers are confronted. It captures two separate interactions that escalate into toxic exchanges. Each interaction was verified by native Indonesian speakers and labeled as either toxic or non-toxic. The dataset includes both original and preprocessed versions of the speech and text data. The original speech files total 136.9 MB, and the preprocessed speech files total 111.7 MB. Text transcriptions of the conversations are also included; the original and preprocessed text files are each 16 KB. The dataset can be used for research in toxic speech detection, natural language processing, and the development of machine learning models for audio and text classification.
Dataset Overview
This dataset contains audio recordings and transcriptions of toxic speech from two Indonesian conversations in YouTube videos in which scammers are confronted. Each conversation was verified by native Indonesian speakers and labeled as toxic or non-toxic. The dataset is provided in both original and preprocessed versions of the speech and text data.
Contents
The dataset is organized into the following directories:
- /audio_original/
- Contains the original audio files in WAV format. These files capture the raw conversations as recorded from YouTube videos.
- /audio_preprocessed/
- Contains the preprocessed audio files in WAV format. These files have undergone noise reduction, sampling rate adjustment, duration trimming, and silence removal to improve clarity for machine learning tasks.
- /transcript_original/
- Contains the original text transcriptions of the audio files in CSV format. The transcriptions have been verified for accuracy by native Indonesian speakers.
- /transcript_preprocessed/
- Contains the preprocessed text transcriptions in CSV format. These files have undergone case folding, number removal, stopword removal, stemming, and correction of typographical errors.
File Details
- /audio_original/
- Total size: 136.9 MB
- Format: WAV
- /audio_preprocessed/
- Total size: 111.7 MB
- Format: WAV
- /transcript_original/
- Total size: 16 KB
- Format: CSV
- /transcript_preprocessed/
- Total size: 16 KB
- Format: CSV
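As a quick integrity check after download, the sizes listed above can be compared with what is actually on disk. The following is a minimal sketch, assuming the archive has been extracted so that the four directories sit under one root folder; the root name `IndoToxSpeech` is a hypothetical placeholder and is not stated on the dataset page.

```python
from pathlib import Path

# Directory names as given in the Contents section.
SUBDIRS = (
    "audio_original",
    "audio_preprocessed",
    "transcript_original",
    "transcript_preprocessed",
)


def directory_size_bytes(folder):
    """Sum the sizes of all regular files below `folder`."""
    return sum(p.stat().st_size for p in Path(folder).rglob("*") if p.is_file())


def summarize(root, subdirs=SUBDIRS):
    """Map each expected directory to its size in bytes, or None if it is missing."""
    root = Path(root)
    report = {}
    for name in subdirs:
        folder = root / name
        report[name] = directory_size_bytes(folder) if folder.is_dir() else None
    return report


if __name__ == "__main__":
    # "IndoToxSpeech" is an assumed extraction folder, not an official name.
    for name, size in summarize("IndoToxSpeech").items():
        if size is None:
            print(f"{name}: missing")
        else:
            print(f"{name}: {size / 1_000_000:.1f} MB")
```

Small differences from the listed totals are to be expected, since the page does not say whether MB means 10^6 or 2^20 bytes.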
Dataset Usage
Research Applications
This dataset is ideal for:
- Toxic speech detection in audio and text.
- Natural language processing (NLP) in Indonesian.
- Speech-to-text conversion studies.
- Machine learning model training for audio and text classification.
Loading and Accessing the Data
- Speech Data: Audio files can be loaded using standard audio processing libraries in Python, such as librosa or pydub.
- Text Data: The text files are in CSV format and can be read using pandas or any other text-processing library.
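The loading steps above might look like the following. This is a minimal sketch, assuming `librosa` and `pandas` are installed and that the dataset has been extracted into a folder named `IndoToxSpeech` (a hypothetical name). The helper functions are illustrative, and the CSV column names are not documented on this page, so the sketch only prints whatever columns it finds.

```python
from pathlib import Path


def list_files(root, subdir, suffix):
    """Return sorted paths in root/subdir with the given suffix, or [] if absent."""
    folder = Path(root) / subdir
    if not folder.is_dir():
        return []
    return sorted(p for p in folder.iterdir() if p.suffix.lower() == suffix)


def load_audio(path):
    """Load one WAV file with librosa, keeping its native sampling rate."""
    import librosa  # third-party; imported lazily so listing works without it

    samples, sample_rate = librosa.load(path, sr=None)
    return samples, sample_rate


def load_transcript(path):
    """Read one transcript CSV into a pandas DataFrame."""
    import pandas as pd  # third-party

    return pd.read_csv(path)


if __name__ == "__main__":
    root = "IndoToxSpeech"  # assumed extraction folder, not an official name
    for wav in list_files(root, "audio_original", ".wav"):
        samples, sample_rate = load_audio(wav)
        print(wav.name, sample_rate, f"{len(samples) / sample_rate:.1f} s")
    for csv_path in list_files(root, "transcript_original", ".csv"):
        frame = load_transcript(csv_path)
        print(csv_path.name, frame.shape, list(frame.columns))
```

Passing `sr=None` keeps each file's own sampling rate; by default `librosa.load` resamples to 22,050 Hz, which would hide the difference between the original and preprocessed audio.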
Preprocessing Notes
- Speech: Preprocessing included noise reduction, sampling rate adjustment, duration trimming, and silence removal.
- Text: Preprocessing included text normalization, case folding, number removal, stopword removal, stemming, and correction of typographical errors.
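To make a few of these steps concrete, here is a rough sketch using only the Python standard library. It is not the pipeline used to build the dataset: the frame size, amplitude threshold, tokenization rule, and the tiny stopword list are all assumptions chosen for illustration. Noise reduction, resampling, stemming, and typo correction are left out because the page does not say which tools were used.

```python
import re

# Tiny illustrative stopword list; NOT the list used for the dataset.
STOPWORDS = frozenset({"yang", "dan", "di", "ke", "dari", "ini", "itu"})


def remove_silence(samples, frame_size=400, threshold=0.01):
    """Drop every frame whose peak absolute amplitude is below `threshold`.

    `samples` is a sequence of floats in [-1.0, 1.0]; both defaults are
    assumed values, not parameters documented for this dataset.
    """
    kept = []
    for start in range(0, len(samples), frame_size):
        frame = samples[start:start + frame_size]
        if max(abs(s) for s in frame) >= threshold:
            kept.extend(frame)
    return kept


def preprocess_text(text, stopwords=STOPWORDS):
    """Apply case folding, number removal, and stopword removal."""
    text = text.lower()                      # case folding
    text = re.sub(r"\d+", " ", text)         # number removal
    tokens = re.findall(r"[^\W\d_]+", text)  # runs of letters (assumed tokenization)
    tokens = [t for t in tokens if t not in stopwords]  # stopword removal
    return " ".join(tokens)


if __name__ == "__main__":
    print(preprocess_text("Kamu PENIPU dan 123 ini!"))  # prints "kamu penipu"
    signal = [0.0] * 4 + [0.5, -0.5] * 2 + [0.0] * 4
    print(remove_silence(signal, frame_size=4))  # prints [0.5, -0.5, 0.5, -0.5]
```

Frame-based thresholding is only one simple way to remove silence; energy-based or decibel-based detection would be equally reasonable choices.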
Documentation
Attachment | Size |
---|---|
Dataset Overview - IndoToxSpeech.docx | 16.54 KB |