LUS: Mizo Monolingual Corpus

Citation Author(s):
National Institute of Technology Silchar, India
Submitted by:
Candy Lalrempuii
Last updated:
Tue, 04/04/2023 - 06:51


Mizo, also known as Lushai, is the official language of Mizoram, a state in north-eastern India. It is an under-resourced, highly tonal language belonging to the Tibeto-Burman language family.

The LUS dataset comprises a monolingual corpus crawled from Mizo news websites such as Zalen and Times of Mizoram. It consists of a total of 101,827 Mizo-language sentences, made available for research and academic purposes.


The file contains a folder, monolingual_mizo_data, which holds the raw monolingual data.
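
As a minimal sketch of how the raw data might be read, the following Python snippet walks the monolingual_mizo_data folder and yields one sentence per line. The internal layout of the folder is not documented here, so the "*.txt" file pattern, the UTF-8 encoding, and the one-sentence-per-line format are all assumptions; the function names iter_sentences and count_sentences are hypothetical helpers introduced only for illustration.

```python
from pathlib import Path
from typing import Iterator


def iter_sentences(data_dir: str, pattern: str = "*.txt") -> Iterator[str]:
    """Yield non-empty, stripped lines from every matching file under data_dir.

    Assumptions (not confirmed by the dataset description): files match
    `pattern`, are UTF-8 encoded, and store one sentence per line.
    """
    for path in sorted(Path(data_dir).rglob(pattern)):
        with path.open(encoding="utf-8") as handle:
            for line in handle:
                sentence = line.strip()
                if sentence:  # skip blank lines
                    yield sentence


def count_sentences(data_dir: str, pattern: str = "*.txt") -> int:
    """Count the sentences found under data_dir without loading them all."""
    return sum(1 for _ in iter_sentences(data_dir, pattern))
```

If these assumptions hold, calling count_sentences("monolingual_mizo_data") gives a quick check against the sentence total reported in the description; a mismatch would suggest a different file extension or sentence layout.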

Academics and researchers who use this dataset for research purposes must cite the following papers:


Lalrempuii, C., Soni, B. (2020). Attention-Based English to Mizo Neural Machine Translation. In: Bhattacharjee, A., Borgohain, S., Soni, B., Verma, G., Gao, XZ. (eds) Machine Learning, Image Processing, Network Security and Data Sciences. MIND 2020. Communications in Computer and Information Science, vol 1241. Springer, Singapore.

Lalrempuii, C., Soni, B., Pakray, P. (2021). An Improved English-to-Mizo Neural Machine Translation. ACM Trans. Asian Low-Resour. Lang. Inf. Process. 20, 4, Article 61 (July 2021), 21 pages.


Nice dataset.

Submitted by Partha Pakray on Sat, 08/26/2023 - 01:02