This paper describes the creation of the Massive Arabic Speech Corpus (MASC). MASC is a dataset that contains 1,000 hours of speech sampled at 16 kHz and crawled from over 700 YouTube channels. The dataset is multi-regional, multi-genre, and multi-dialect, and is intended to advance research and development in Arabic speech technology, with a special emphasis on Arabic speech recognition. In addition to MASC, a pre-trained 3-gram language model and a pre-trained automatic speech recognition model were developed and are available to interested researchers.
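To give a rough sense of scale, the figures above (1,000 hours at 16 kHz) can be turned into a sample count and an approximate storage footprint. The sketch below assumes uncompressed 16-bit mono PCM, which the description does not state; the function name `pcm_footprint` and its parameters are illustrative only, and the actual download size may differ considerably if the files are compressed.

```python
def pcm_footprint(hours, sample_rate_hz, bytes_per_sample=2, channels=1):
    """Return (total_samples, total_bytes) for uncompressed PCM audio.

    bytes_per_sample=2 (16-bit) and channels=1 (mono) are assumptions,
    not properties stated in the MASC description.
    """
    seconds = hours * 3600
    total_samples = int(seconds * sample_rate_hz * channels)
    total_bytes = total_samples * bytes_per_sample
    return total_samples, total_bytes


# Figures taken from the description: 1,000 hours sampled at 16 kHz.
samples, size_bytes = pcm_footprint(hours=1000, sample_rate_hz=16_000)
print(f"{samples:,} samples, about {size_bytes / 1e9:.1f} GB as 16-bit mono PCM")
```

Under these assumptions the corpus comes to 57.6 billion samples, or roughly 115 GB of raw audio, which is a useful planning figure for storage and preprocessing.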


[1] Mohammad Al-Fetyani, Muhammad Al-Barham, Gheith Abandah, Adham Alsharkawi, Maha Dawas, "MASC: Massive Arabic Speech Corpus", IEEE Dataport, 2021. [Online]. Available: http://dx.doi.org/10.21227/e1qb-jv46. Accessed: Jan. 14, 2025.
@data{e1qb-jv46-21,
doi = {10.21227/e1qb-jv46},
url = {http://dx.doi.org/10.21227/e1qb-jv46},
author = {Mohammad Al-Fetyani and Muhammad Al-Barham and Gheith Abandah and Adham Alsharkawi and Maha Dawas},
publisher = {IEEE Dataport},
title = {MASC: Massive Arabic Speech Corpus},
year = {2021} }