LUS: Mizo Monolingual Corpus

Citation Author(s):
Candy Lalrempuii, National Institute of Technology Silchar, India
Badal Soni, National Institute of Technology Silchar, India
Submitted by:
Candy Lalrempuii
Last updated:
Tue, 04/04/2023 - 06:51
DOI:
10.21227/4kx5-wc43
License:
0

Abstract 

Mizo, also known as Lushai, is the official language of Mizoram, a state in north-eastern India. It is an under-resourced, highly tonal language belonging to the Tibeto-Burman language family.

The LUS dataset comprises a monolingual corpus crawled from Mizo news websites such as Zalen (https://zalen.in/) and Times of Mizoram (https://www.timesofmizoram.com/). The dataset consists of a total of 101,827 Mizo sentences and is intended for research and academic purposes.

Instructions: 

The monolingual.zip file contains a single folder, monolingual_mizo_data, which holds the raw data.
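
A minimal sketch of how the archive might be unpacked and its sentences counted is given below. The record does not describe the files inside monolingual_mizo_data, so the sketch assumes they are UTF-8 plain-text files with one sentence per line; the function names (extract_corpus, iter_sentences, count_sentences) and the output directory are illustrative and not part of the dataset.

```python
import zipfile
from pathlib import Path


def extract_corpus(zip_path, out_dir):
    """Unpack the archive and return the directory it was extracted into."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(out)
    return out


def iter_sentences(root):
    """Yield non-empty lines from every regular file under root.

    Assumption: files are UTF-8 plain text with one sentence per line.
    The dataset record does not confirm this layout.
    """
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            # errors="replace" keeps the scan going if a file is not valid UTF-8
            with path.open(encoding="utf-8", errors="replace") as handle:
                for line in handle:
                    line = line.strip()
                    if line:
                        yield line


def count_sentences(root):
    """Count the sentences yielded by iter_sentences."""
    return sum(1 for _ in iter_sentences(root))
```

For example, count_sentences(extract_corpus("monolingual.zip", "data") / "monolingual_mizo_data") would return the number of non-empty lines found. If the one-sentence-per-line assumption holds, that figure can be compared against the sentence total stated in the abstract.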

Academics and researchers who use this dataset for research purposes must cite the following papers:

Lalrempuii, C., Soni, B. (2020). Attention-Based English to Mizo Neural Machine Translation. In: Bhattacharjee, A., Borgohain, S., Soni, B., Verma, G., Gao, XZ. (eds) Machine Learning, Image Processing, Network Security and Data Sciences. MIND 2020. Communications in Computer and Information Science, vol 1241. Springer, Singapore. https://doi.org/10.1007/978-981-15-6318-8_17

Candy Lalrempuii, Badal Soni, and Partha Pakray. 2021. An Improved English-to-Mizo Neural Machine Translation. ACM Trans. Asian Low-Resour. Lang. Inf. Process. 20, 4, Article 61 (July 2021), 21 pages. https://doi.org/10.1145/3445974

Comments

Nice dataset.

Submitted by Partha Pakray on Sat, 08/26/2023 - 01:02