MASC: Massive Arabic Speech Corpus
This paper describes the creation of the Massive Arabic Speech Corpus (MASC). MASC is a dataset containing 1,000 hours of speech sampled at 16 kHz and crawled from over 700 YouTube channels. The dataset is multi-regional, multi-genre, and multi-dialect, and is intended to advance the research and development of Arabic speech technology, with a special emphasis on Arabic speech recognition. In addition to MASC, a pre-trained 3-gram language model and a pre-trained automatic speech recognition model are also developed and made available to interested researchers. To enhance the language model, a new and inclusive Arabic text corpus is required; thus, a dataset of 12 M unique Arabic words, originally crawled from Twitter, is also created and released.
The structure of MASC:
|   channels_quality.csv -------> Contains channel_url, channel_name, channel_id, quality
|   lm_twitter.txt -------> Each line is a filtered tweet
+---audios -------> Contains all audio as WAV files
+---subsets -------> Contains the training and evaluation sets with metadata
|   clean_dev.csv -------> Structure: video_id, start, end, duration, text
|   clean_dev_meta.csv -------> Structure: video_id, category, video_duration, channel_id, country, dialect, gender, transcript_duration
|   clean_train.csv (with meta) -------> Same structure as the dev files
|   noisy_train.csv (with meta) -------> Same structure as the dev files
+---subtitles -------> Contains the raw subtitles extracted from YouTube for each video
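As a sketch of how the subset files fit together, the per-segment CSV can be joined to the per-video metadata on video_id. The column names follow the structure listed above; the sample rows below are hypothetical, for illustration only (the real files live under subsets/):

```python
import csv
import io

# Hypothetical example rows mirroring the documented columns of
# clean_dev.csv and clean_dev_meta.csv.
SEGMENTS_CSV = """video_id,start,end,duration,text
vid001,0.0,3.2,3.2,first segment
vid001,3.2,7.5,4.3,second segment
"""

META_CSV = """video_id,category,video_duration,channel_id,country,dialect,gender,transcript_duration
vid001,news,600.0,ch42,EG,Egyptian,male,7.5
"""

def load_segments_with_meta(segments_file, meta_file):
    """Join each segment row with its video-level metadata on video_id."""
    meta = {row["video_id"]: row for row in csv.DictReader(meta_file)}
    joined = []
    for row in csv.DictReader(segments_file):
        row.update(meta.get(row["video_id"], {}))
        joined.append(row)
    return joined

rows = load_segments_with_meta(io.StringIO(SEGMENTS_CSV), io.StringIO(META_CSV))
```

With the real files, replace the StringIO objects with open file handles for the CSVs in subsets/.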
Steps to train a speech recognition model:
1. Loop through the IDs in the clean/noisy train subset.
2. For each audio file, read the corresponding subtitle (the subtitle filename contains the audio ID).
3. For each segment in the subtitle, chunk the corresponding audio span and save the result in a CSV file (audio_segment_path, duration, transcription).
4. Preprocess the transcription as needed (remember, this is raw training data). We recommend using Maha for preprocessing (https://github.com/TRoboto/Maha).
5. Select your model and start training (depending on your hardware; on a single GPU, training may take 2-7 days).
6. Use the dev subsets for hyper-parameter tuning.
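Steps 2-3 above can be sketched as follows. The "HH:MM:SS.mmm" timestamp format and the seconds-based chunking helper are assumptions for illustration; check the actual start/end format in the subtitle and subset files before using this:

```python
def subtitle_time_to_seconds(ts):
    """Parse an 'HH:MM:SS.mmm' subtitle timestamp into seconds (assumed format)."""
    hours, minutes, seconds = ts.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)

def chunk_samples(samples, start_s, end_s, sample_rate=16000):
    """Slice the span [start_s, end_s) out of a 16 kHz audio sample sequence."""
    begin = int(round(start_s * sample_rate))
    stop = int(round(end_s * sample_rate))
    return samples[begin:stop]

# Illustration on synthetic audio: cut the second of three seconds of samples.
samples = list(range(16000 * 3))
chunk = chunk_samples(samples, 1.0, 2.0)
```

In a real pipeline, `samples` would come from reading the WAV file (e.g. with the `wave` or `soundfile` modules), and each chunk would be written out and logged to the training CSV.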
Steps to evaluate the model:
1. Loop through the segments in the clean/noisy test subset.
2. For each segment, read the audio, pass it to the model, and record the output.
3. Once all segments are processed, calculate the word error rate (WER) between the manually annotated transcriptions and the model output.
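The WER computation in step 3 can be sketched with a plain word-level edit distance. This is a minimal reference implementation; in practice a library such as jiwer is commonly used:

```python
def wer(reference, hypothesis):
    """Word error rate: word-level Levenshtein distance / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    # Dynamic-programming edit distance over words, one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # substitution or match
        prev = cur
    return prev[-1] / len(ref)
```

For the corpus-level score, either average per-segment WER or accumulate total word errors over total reference words across all segments; the two differ, so report which one you use.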