Arabic speech recognition

The dataset covers the whole Quran: 114 suras (6,236 ayahs) recited by 35 reciters, for approximately 218,000 audio files. The audio files were downloaded from this website in MP3 format, and all of them follow the Hafs from A’asim narration. The dataset figure shows the names of the reciters who participated in this dataset.
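
As a quick sanity check on the stated size, the file count can be estimated from the number of ayahs and reciters. The sketch below assumes one audio file per ayah per reciter, which the description implies but does not state outright; the function name is illustrative only.

```python
# Figures taken from the dataset description.
NUM_AYAHS = 6236
NUM_RECITERS = 35


def expected_file_count(num_ayahs: int, num_reciters: int) -> int:
    """Upper bound on audio files, assuming one file per ayah per reciter."""
    return num_ayahs * num_reciters


total = expected_file_count(NUM_AYAHS, NUM_RECITERS)
print(total)  # 218260
```

The product, 218,260, is consistent with the reported figure of approximately 218,000 files; any small shortfall in practice would come from recordings missing for some reciters.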



This paper describes the creation of the Massive Arabic Speech Corpus (MASC). MASC is a dataset that contains 1,000 hours of speech sampled at 16 kHz and crawled from over 700 YouTube channels. The dataset is multi-regional, multi-genre, and multi-dialect, and it is intended to advance the research and development of Arabic speech technology, with a special emphasis on Arabic speech recognition. In addition to MASC, a pre-trained 3-gram language model and a pre-trained automatic speech recognition model have also been developed and made available to interested researchers.
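
To give a feel for the scale of 1,000 hours at 16 kHz, the sketch below works out the number of samples and a rough uncompressed size. The 16-bit mono PCM encoding is an assumption made here for illustration; the description does not state the bit depth, channel count, or storage format, and the function names are hypothetical.

```python
# Figures taken from the corpus description.
HOURS = 1000
SAMPLE_RATE_HZ = 16_000

# Assumption (not stated in the description): 16-bit mono PCM.
BYTES_PER_SAMPLE = 2


def total_samples(hours: int, sample_rate_hz: int) -> int:
    """Number of audio samples in `hours` of single-channel audio."""
    return hours * 3600 * sample_rate_hz


def pcm_size_gb(hours: int, sample_rate_hz: int, bytes_per_sample: int) -> float:
    """Uncompressed size in decimal gigabytes (1 GB = 10**9 bytes)."""
    return total_samples(hours, sample_rate_hz) * bytes_per_sample / 1e9


print(total_samples(HOURS, SAMPLE_RATE_HZ))                  # 57600000000
print(pcm_size_gb(HOURS, SAMPLE_RATE_HZ, BYTES_PER_SAMPLE))  # 115.2
```

Under that assumption the corpus amounts to about 57.6 billion samples, or roughly 115 GB of raw audio; a compressed distribution format would be smaller.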