This paper releases and describes the creation of the Massive Arabic Speech Corpus (MASC). The corpus contains 1,000 hours of speech sampled at 16 kHz and crawled from over 700 YouTube channels. MASC is a multi-regional, multi-genre, and multi-dialect dataset intended to advance research and development in Arabic speech technology, with a special emphasis on Arabic speech recognition. In addition to MASC, a pre-trained 3-gram language model and a pre-trained automatic speech recognition model are also developed and made available to interested researchers.

Instructions: 

Will be available after the paper is accepted.
