MASC: Massive Arabic Speech Corpus

Citation Author(s): Mohammad Al-Fetyani, Muhammad Al-Barham, Gheith Abandah, Adham Alsharkawi, Maha Dawas
Submitted by: Mohammad Apfetyani
Last updated: Mon, 04/28/2025 - 20:53
DOI: 10.21227/e1qb-jv46
Data Format: *.wav; *.csv; *.vtt; *.txt
Links: DeepSpeech model trained on the MASC training sets

KenLM 3-gram LM trained on the MASC Twitter dataset

Categories:

Artificial Intelligence

Keywords:

Arabic speech recognition

Abstract

This paper describes the creation of the Massive Arabic Speech Corpus (MASC). MASC is a dataset that contains 1,000 hours of speech sampled at 16 kHz and crawled from over 700 YouTube channels. The dataset is multi-regional, multi-genre, and multi-dialect, and is intended to advance research and development in Arabic speech technology, with a special emphasis on Arabic speech recognition. In addition to MASC, a pre-trained 3-gram language model and a pre-trained automatic speech recognition model are also developed and made available to interested researchers. To enhance the language model, a new and inclusive Arabic speech corpus is required, and thus a dataset of 12 M unique Arabic words, originally crawled from Twitter, is also created and released.

Instructions:

The structure of MASC:

masc
¦ channels_quality.csv -------> Contains channel_url, channel_name, channel_id, quality
¦ lm_twitter.txt -------> Each line represents a filtered tweet
¦
+---audios -------> Contains all audios as wav files
¦ --audio_id1.wav
¦ --audio_id2.wav
¦ --audio_id3.wav
¦ --audio_id4.wav
¦ --audio_id5.wav
¦ --audio_id6.wav
¦ --audio_id7.wav
¦ --....
¦
+---subsets -------> Contains the training and evaluation sets with metadata
¦ clean_dev.csv -------> Structure: video_id, start, end, duration, text
¦ clean_dev_meta.csv -------> Structure: video_id, category, video_duration, channel_id, country, dialect, gender, transcript_duration
¦ clean_test.csv
¦ clean_test_meta.csv
¦ clean_train.csv (meta)
¦ noisy_dev.csv
¦ noisy_dev_meta.csv
¦ noisy_test.csv
¦ noisy_test_meta.csv
¦ noisy_train.csv (meta)
¦
+---subtitles -------> Contains the raw subtitles extracted from YouTube for each video
¦ --audio_id1.ar.vtt
¦ --audio_id2.ar.vtt
¦ --audio_id3.ar.vtt
¦ --audio_id4.ar.vtt
¦ --audio_id5.ar.vtt
¦ --audio_id6.ar.vtt
¦ --audio_id7.ar.vtt
¦ --....
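
As a rough illustration of how the subset files might be consumed, the sketch below groups the rows of a segment CSV by video_id using only the Python standard library. It assumes the CSV is comma-delimited, has a header row with the column names listed in the tree (video_id, start, end, duration, text), and stores start and end as seconds; none of this is confirmed by the dataset description, so check a real file first. The function name `load_segments` and the sample rows are made up for the example.

```python
import csv
import io
from collections import defaultdict


def load_segments(csv_file):
    """Group the rows of a subset CSV by video_id.

    Assumption: a header row with video_id, start, end, duration, text,
    and start/end values that parse as floats (seconds).
    """
    segments = defaultdict(list)
    for row in csv.DictReader(csv_file):
        segments[row["video_id"]].append(
            {
                "start": float(row["start"]),
                "end": float(row["end"]),
                "text": row["text"],
            }
        )
    return dict(segments)


# In-memory stand-in for subsets/clean_dev.csv (invented rows, not real data).
sample = io.StringIO(
    "video_id,start,end,duration,text\n"
    "vid_a,0.0,2.5,2.5,first segment\n"
    "vid_a,2.5,4.0,1.5,second segment\n"
    "vid_b,1.0,3.0,2.0,another video\n"
)
by_video = load_segments(sample)
print({video_id: len(rows) for video_id, rows in by_video.items()})
```

For a real file, pass `open("masc/subsets/clean_dev.csv", newline="", encoding="utf-8")` instead of the in-memory sample.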

Steps to train a speech recognition model:
1. Loop through the IDs in the clean/noisy train subset.
2. For each audio file, read the corresponding subtitle file (the subtitle file name contains the audio ID).
3. For each segment in the subtitle, cut the corresponding audio segment and record it in a CSV file (audio_segment_path, duration, transcription).
4. Preprocess the transcriptions as you like (remember that this is raw training data). We recommend using Maha for preprocessing (https://github.com/TRoboto/Maha).
5. Select your model and start training (depending on your hardware, training on one GPU may take 2-7 days).
6. Use the dev subsets for hyperparameter tuning.
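
Steps 2 and 3 can be sketched with the Python standard library alone. This is a minimal illustration, not the authors' pipeline: it assumes the subtitles use standard WebVTT cue timings and the audio is uncompressed PCM WAV. The helper names (`parse_vtt`, `cut_wav`, `build_manifest`) and the segment file naming are hypothetical, and the manifest is written without a header row.

```python
import csv
import re
import wave
from pathlib import Path

# WebVTT timestamp, with optional hours: [hh:]mm:ss.ttt
TS = r"(?:(\d+):)?(\d{2}):(\d{2})\.(\d{3})"
CUE = re.compile(TS + r"\s+-->\s+" + TS)


def to_seconds(hours, minutes, seconds, millis):
    return int(hours or 0) * 3600 + int(minutes) * 60 + int(seconds) + int(millis) / 1000


def parse_vtt(text):
    """Return (start_sec, end_sec, caption) for each cue in a WebVTT string."""
    cues = []
    lines = text.splitlines()
    i = 0
    while i < len(lines):
        match = CUE.search(lines[i])
        i += 1
        if not match:
            continue  # header, cue identifier, or blank line
        groups = match.groups()
        caption = []
        while i < len(lines) and lines[i].strip():
            caption.append(lines[i].strip())
            i += 1
        cues.append((to_seconds(*groups[:4]), to_seconds(*groups[4:]), " ".join(caption)))
    return cues


def cut_wav(src_path, dst_path, start, end):
    """Copy the [start, end) span (in seconds) of a PCM WAV file to dst_path."""
    with wave.open(src_path, "rb") as src:
        rate = src.getframerate()
        src.setpos(int(start * rate))
        frames = src.readframes(int((end - start) * rate))
        with wave.open(dst_path, "wb") as dst:
            dst.setparams(src.getparams())
            dst.writeframes(frames)  # frame count in the header is fixed on close


def build_manifest(audio_path, vtt_path, out_dir, manifest_path):
    """Chunk one audio file by its subtitle cues and append rows to a CSV
    with the columns audio_segment_path, duration, transcription."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    cues = parse_vtt(Path(vtt_path).read_text(encoding="utf-8"))
    with open(manifest_path, "a", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        for n, (start, end, caption) in enumerate(cues):
            seg_path = out_dir / f"{Path(audio_path).stem}_{n:04d}.wav"
            cut_wav(str(audio_path), str(seg_path), start, end)
            writer.writerow([str(seg_path), round(end - start, 3), caption])


# Invented subtitle content, used only to show the parser's output shape.
sample_vtt = """WEBVTT

00:00:01.000 --> 00:00:03.500
first caption

00:00:03.500 --> 00:00:06.000
second caption
"""
print(parse_vtt(sample_vtt))
```

Raw YouTube subtitles may carry inline markup or overlapping cues that this parser passes through unchanged, so the captions would still need the preprocessing described in step 4.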

Steps to evaluate the model:
1. Loop through the segments in the clean/noisy test subset.
2. For each segment, read the audio, pass it to the model and record the output.
3. Once all segments have been processed, compute the word error rate (WER) between the manually annotated transcriptions and the model output.
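
The WER in step 3 can be computed as the word-level edit distance between each reference and hypothesis, summed over all segments and divided by the total number of reference words. The sketch below is one common way to do this and is not necessarily how the authors scored their models; it assumes whitespace tokenization and that any text normalization has already been applied to both sides. The function names are illustrative.

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(
                min(
                    prev[j] + 1,         # deletion
                    curr[j - 1] + 1,     # insertion
                    prev[j - 1] + cost,  # substitution or match
                )
            )
        prev = curr
    return prev[-1]


def corpus_wer(references, hypotheses):
    """Total word errors divided by total reference words."""
    errors = 0
    words = 0
    for ref, hyp in zip(references, hypotheses):
        ref_tokens = ref.split()
        errors += edit_distance(ref_tokens, hyp.split())
        words += len(ref_tokens)
    return errors / words if words else 0.0


# Toy example: one deletion plus one insertion over 8 reference words.
refs = ["the cat sat on the mat", "hello world"]
hyps = ["the cat sat on mat", "hello there world"]
print(corpus_wer(refs, hyps))  # 0.25
```

Pooling errors over the whole test set, rather than averaging per-segment WER, keeps short segments from dominating the score.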
