MASC: Massive Arabic Speech Corpus

Citation Author(s):
Mohammad Al-Fetyani
Muhammad Al-Barham
Gheith Abandah
Adham Alsharkawi
Maha Dawas
Submitted by:
Mohammad Al-Fetyani
Last updated:
DOI:
10.21227/e1qb-jv46
Data Format:
Links:
License:

Abstract 

This paper describes the creation of the Massive Arabic Speech Corpus (MASC). MASC is a dataset that contains 1,000 hours of speech sampled at 16 kHz and crawled from over 700 YouTube channels. The dataset is multi-regional, multi-genre, and multi-dialect, and is intended to advance the research and development of Arabic speech technology, with a special emphasis on Arabic speech recognition. In addition to MASC, a pre-trained 3-gram language model and a pre-trained automatic speech recognition model are also developed and made available to interested researchers. To enhance the language model, a new and inclusive Arabic speech corpus is required; thus, a dataset of 12 M unique Arabic words, originally crawled from Twitter, is also created and released. On the newly introduced evaluation sets, the best word error rate achieved by the speech recognition model is 19.8% for the clean development set and 21.8% for the clean test set.
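
For readers unfamiliar with the metric reported above, word error rate is conventionally defined as the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and a hypothesis, divided by the number of reference words. The following is a minimal sketch of that standard definition. It is not the authors' evaluation script, which this page does not provide. The function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration, and any Arabic-specific text normalization is omitted.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Hypothetical helper for illustration; tokens are split on whitespace only.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion of a reference word
                curr[j - 1] + 1,                  # insertion of a hypothesis word
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)
```

Under this definition, a score of 0.198 corresponds to the 19.8% figure format used in the abstract. Because insertions are counted, the value can exceed 1.0 when the hypothesis is much longer than the reference.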

Instructions: 

Will be available after the paper is accepted.

Comments

thanks

Submitted by Yurii Bushta on Tue, 10/05/2021 - 09:19

Arabic dialect bias detection

Submitted by Diego Saenz on Sat, 03/12/2022 - 10:31
