LUS: Mizo Monolingual Corpus

Citation Author(s):
Submitted by:
Candy Lalrempuii
Last updated:
Tue, 04/04/2023 - 03:59
Data Format:


Mizo, also known as Lushai, is the official language of Mizoram, a state in north-eastern India. It is an under-resourced, highly tonal language belonging to the Tibeto-Burman language family.

The LUS dataset comprises a monolingual corpus crawled from Mizo news websites such as Zalen and Times of Mizoram. It contains a total of 101,827 Mizo-language sentences, made available for research and academic purposes.


The file contains a folder, monolingual_mizo_data, which holds the raw monolingual data.
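
As a rough illustration of how the raw data might be read, the following is a minimal Python sketch. It assumes the folder holds UTF-8 plain-text files with a .txt extension and one sentence per line; the dataset description does not specify the file layout, so the extension, the encoding, and the helper name iter_sentences are all assumptions made for this example.

```python
from pathlib import Path
from typing import Iterator


def iter_sentences(data_dir: str) -> Iterator[str]:
    """Yield non-empty, stripped lines from every .txt file under data_dir.

    Assumption: the raw data is UTF-8 plain text with one sentence per line.
    The dataset description does not state the actual file layout.
    """
    for path in sorted(Path(data_dir).rglob("*.txt")):
        with path.open(encoding="utf-8") as handle:
            for line in handle:
                sentence = line.strip()
                if sentence:  # skip blank lines
                    yield sentence


if __name__ == "__main__":
    # Folder name taken from the dataset description.
    total = sum(1 for _ in iter_sentences("monolingual_mizo_data"))
    print(f"Loaded {total} sentences")
```

If the assumptions hold, the reported count can be compared against the 101,827 sentences stated in the description as a quick sanity check on the download.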

Academics and researchers who want to use this data should cite the following publications:

Lalrempuii, C., Soni, B. (2020). Attention-Based English to Mizo Neural Machine Translation. In: Bhattacharjee, A., Borgohain, S., Soni, B., Verma, G., Gao, XZ. (eds) Machine Learning, Image Processing, Network Security and Data Sciences. MIND 2020. Communications in Computer and Information Science, vol 1241. Springer, Singapore.

Candy Lalrempuii, Badal Soni, and Partha Pakray. 2021. An Improved English-to-Mizo Neural Machine Translation. ACM Trans. Asian Low-Resour. Lang. Inf. Process. 20, 4, Article 61 (July 2021), 21 pages.


Dataset Files

    Files have not been uploaded for this dataset