SADA Dataset

Name: SADA Dataset
Creator: Raghad Almutairi
License: https://creativecommons.org/licenses/by/4.0/

Citation Author(s): Raghad Almutairi
Submitted by: Raghad Almutairi
Last updated: Sat, 02/08/2025 - 14:59
DOI: 10.21227/revt-pa41
Data Format: *.wav

Abstract

This dataset contains audio recordings sourced from more than 57 TV shows provided by the Saudi Broadcasting Authority. In total, ~667 hours of recordings are published. The recordings are in Arabic: most are in Saudi dialects, and some are in other dialects. To make SADA easier to use, the dataset is split into training, validation, and testing sets. The validation and testing sets each contain around 10 hours of audio segments, while the training set contains 418 hours.

Instructions:

# SADA - Saudi Audio Dataset for Arabic - Version 1.0
The National Center for Artificial Intelligence at the Saudi Data and Artificial Intelligence Authority (SDAIA), in collaboration with the Saudi Broadcasting Authority (SBA), published the "SADA" dataset. The name stands for "Saudi Audio Dataset for Arabic".
## Audio Data
The audio files are divided into four batches (directories) that contain the full-length recordings for the training, testing, and validation sets. The files have the following properties:
- number of audio files: 4563 (average duration 10 min)
- audio format: .wav
- audio channels: mono
- audio sampling rate: 16 kHz
- audio codec: pcm_s16le (PCM signed 16-bit little-endian)
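
The properties listed above can be checked on a downloaded file before training. The following is a minimal sketch using Python's standard `wave` module; the function names `wav_properties` and `matches_expected` are illustrative and not part of the dataset.

```python
import wave

# Properties stated in the list above: mono, 16 kHz, 16-bit PCM (2 bytes per sample).
EXPECTED = {"channels": 1, "sample_rate": 16000, "sample_width_bytes": 2}


def wav_properties(path):
    """Read channel count, sampling rate, sample width, and duration from a WAV file."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_rate": rate,
            "sample_width_bytes": wav.getsampwidth(),
            "duration_seconds": wav.getnframes() / rate,
        }


def matches_expected(props):
    """Return True when a file has the channel count, rate, and sample width listed above."""
    return all(props[key] == value for key, value in EXPECTED.items())
```

A file that fails `matches_expected` may have been resampled or re-encoded after download.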
## CSV Files
There are three CSV files: train.csv, test.csv, and valid.csv. All of them are encoded in UTF-8. Each file contains the transcription of every segment together with its annotations. In total there are 13 columns. The column headings are listed in the first line of each CSV file and are explained below:
- **FileName**: `batch_folder/audio_file`.
- **ShowName**: TV show name.
- **FullFileLength**: duration of the audio file in seconds.
- **SegmentID**: unique ID for each segment.
- **SegmentLength**: the segment's duration in seconds.
- **SegmentStart**: start of the segment, as an offset in seconds from the beginning of the audio file.
- **SegmentEnd**: end of the segment, as an offset in seconds from the beginning of the audio file.
- **SpeakerAge**: the age group of the speaker (Adult, Child, Young Adult, Elderly, More than 1 speaker, or Unknown).
- **SpeakerGender**: the gender of the speaker (Male, Female, More than 1 speaker, or Unknown).
- **SpeakerDialect**: the dialect of the speaker (Najdi, Hijazi, Janubi, Shamali, Khaliji, ModernStandardArabic, Levantine, Egyptian, Iraqi, Yemeni, Maghrebi, More than 1 speaker, Unknown, or Notapplicable).
- **Environment**: the surrounding environment of the segment (Clean, Car, Music, or Noisy).
- **Speaker**: speaker ID, unique within each audio file but not across files.
- **GroundTruthText**: the actual uttered text of the segment.
- **ProcessedText**: the pre-processed version of GroundTruthText.
- **Category**: the category of the show (كوميدي,درامي,مسابقات,اطفال,طبخ,اجتماعي,توعوي ارشادي,سياحي,وثائقي,ترفيهي,تاريخي).
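
Because each row gives a file path plus start and end offsets, a segment can be cut out of its full recording. The sketch below uses only the standard library; the helper names are hypothetical, and it assumes that the offsets parse as floats. Whether `FileName` already includes the `.wav` extension is not stated here, so path handling is left to the caller.

```python
import csv
import wave


def read_segments(csv_path):
    """Yield one dict per segment row, keyed by the CSV column headings."""
    # "utf-8-sig" reads plain UTF-8 and also tolerates a leading byte-order mark.
    with open(csv_path, newline="", encoding="utf-8-sig") as handle:
        yield from csv.DictReader(handle)


def extract_segment(audio_path, start_seconds, end_seconds, out_path):
    """Copy the audio between two offsets (in seconds) into a new WAV file.

    Returns the duration in seconds of the audio actually written.
    """
    with wave.open(audio_path, "rb") as src:
        rate = src.getframerate()
        bytes_per_frame = src.getsampwidth() * src.getnchannels()
        # Clamp the start so an offset past the end yields an empty segment.
        first = min(int(round(start_seconds * rate)), src.getnframes())
        count = max(0, int(round(end_seconds * rate)) - first)
        src.setpos(first)
        frames = src.readframes(count)
        params = src.getparams()
    with wave.open(out_path, "wb") as dst:
        dst.setparams(params)  # the frame count in the header is corrected on close
        dst.writeframes(frames)
    return len(frames) // bytes_per_frame / rate
```

For example, for a `row` from `read_segments("train.csv")`, calling `extract_segment(path, float(row["SegmentStart"]), float(row["SegmentEnd"]), "segment.wav")` writes that row's audio, where `path` is the location of the recording named by `row["FileName"]`.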

**Note**: Text processing includes normalizing Arabic letters to unified forms (for example, آ, أ, and إ to ا) and removing punctuation, emojis, diacritics, and any special characters. Utterances with empty text, English words, or digits are discarded.
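
The processing described in the note can be approximated as follows. This is a sketch under assumptions, not the pipeline used to produce ProcessedText: the function name `process_text` is hypothetical, "special characters" is interpreted as anything that is neither a letter nor whitespace, and "digits" is taken to include Arabic-Indic digits.

```python
import unicodedata

# Alef variants mapped to bare alef, as in the example given in the note.
ALEF_VARIANTS = str.maketrans({"آ": "ا", "أ": "ا", "إ": "ا"})


def process_text(text):
    """Return normalized text, or None if the utterance would be discarded."""
    # Assumption: any ASCII letter or any decimal digit discards the utterance.
    if any((ch.isascii() and ch.isalpha()) or ch.isdigit() for ch in text):
        return None
    text = text.translate(ALEF_VARIANTS)
    kept = []
    for ch in text:
        # Keep letters (Unicode categories L*) and whitespace; this drops
        # diacritics (Mn), punctuation (P*), and symbols such as emojis (S*).
        if ch.isspace() or unicodedata.category(ch).startswith("L"):
            kept.append(ch)
    cleaned = " ".join("".join(kept).split())  # collapse runs of whitespace
    return cleaned or None  # empty text is discarded
```

Comparing the output of such a function against the ProcessedText column on a few rows is a quick way to see where the assumptions above differ from the published processing.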
## Datasets Distribution
The following tables give a distribution overview of each set.
### **Training set**
| Age | Percentage |
| --- | :---: |
| Adult | 45.17% |
| More than one speaker | 44.73% |
| Unknown | 7.80% |
| Other | 2.30% |

| Gender | Percentage |
| --- | :---: |
| More than one speaker | 44.73% |
| Male | 34.65% |
| Female | 12.75% |
| Unknown | 7.87% |

| Dialect | Percentage |
| --- | :---: |
| More than one speaker | 44.73% |
| Najdi | 28.01% |
| Hijazi | 9.63% |
| Unknown | 7.87% |
| Khaliji | 7.01% |
| Other | 2.75% |

| Environment | Percentage |
| --- | :---: |
| Music | 38.14% |
| Noisy | 33.94% |
| Clean | 27.82% |
| Car | 0.10% |

### **Validation set**
| Age | Percentage |
| --- | :---: |
| Adult | 51.14% |
| More than one speaker | 41.67% |
| Other | 7.19% |

| Gender | Percentage |
| --- | :---: |
| More than one speaker | 41.67% |
| Male | 35.13% |
| Female | 17.91% |
| Unknown | 5.29% |

| Dialect | Percentage |
| --- | :---: |
| More than one speaker | 41.67% |
| Najdi | 36.18% |
| Hijazi | 7.01% |
| Khaliji | 6.89% |
| Other | 8.25% |

| Environment | Percentage |
| --- | :---: |
| Music | 45.04% |
| Noisy | 24.65% |
| Clean | 30.27% |
| Car | 0.04% |

### **Testing set**
| Age | Percentage |
| --- | :---: |
| Adult | 46.01% |
| More than one speaker | 44.69% |
| Other | 9.3% |

| Gender | Percentage |
| --- | :---: |
| More than one speaker | 44.69% |
| Male | 41.05% |
| Unknown | 7.75% |
| Female | 6.51% |

| Dialect | Percentage |
| --- | :---: |
| More than one speaker | 44.69% |
| Najdi | 19.27% |
| Khaliji | 10.51% |
| Hijazi | 10.42% |
| Other | 15.11% |

| Environment | Percentage |
| --- | :---: |
| Music | 29.52% |
| Noisy | 35.75% |
| Clean | 34.69% |
| Car | 0.04% |

## Licenses
This work is licensed under a CC BY-NC-SA 4.0 license.
## Citation
If you use the SADA dataset, please cite it as follows:
```
@inproceedings{SADA2023,
  Title = {SADA - SBA & SDAIA Audio Dataset for Arabic},
  Author = {Areeb Alowisheq, Abdullah Alrajeh, Sadeen Alharbi, Abdulmajeed Alrowithi, Aljawharah Bin Tamran, Asma Ibrahim, Raghad Aloraini, Raneem Alnajim, Ranya Alkahtani, Renad Almuasaad, Sara Alrasheed, Shaykhah Alsubaie, Yaser Alonaizan},
  Booktitle = {To be published},
  affiliation = {NCAI-SDAIA},
  Year = {2023}
}
```

This data is from SDAIA
