Bangla Corpus

Citation Author(s):
Aysha Akther, Khulna University
MD. SHYMON ISLAM, Khulna University
HAFSA SULTANA, Khulna University
A.K.Z RASEL RAHMAN
SUJANA SAHA
KAZI MASUDUL ALAM
Submitted by: Aysha Akther
Last updated: Mon, 02/07/2022 - 10:54
DOI: 10.21227/3bhm-my48

Abstract 

Bangla is one of the most widely spoken languages in the world, but Bangla NLP research is still at an early stage of development because of the lack of high-quality public corpora. In this article, we describe in detail the compilation methodology of a comprehensive monolingual Bangla corpus, KUMono. This corpus consists of more than 353 million word tokens in total and more than one million unique tokens, drawn from 18 major text categories of online Bangla websites.

Instructions: 

The files are in CSV format (.csv).

The complete corpus will be made available after the manuscript is accepted.
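
Since the corpus files are not yet published, their column layout is unknown. The following is only a minimal sketch of how total and unique token counts could be computed from a CSV file once one is available. The column names `category` and `text` are hypothetical, and plain whitespace splitting is an assumption made for illustration, not necessarily the tokenization used to produce the figures in the abstract.

```python
import csv
import io
from collections import Counter


def count_tokens(csv_file, text_column="text"):
    """Return (total_tokens, unique_tokens) for one text column of a CSV stream.

    `text_column` is a hypothetical column name; adjust it to the real header.
    Tokens are obtained by whitespace splitting (an assumption).
    """
    counts = Counter()
    for row in csv.DictReader(csv_file):
        counts.update(row[text_column].split())
    return sum(counts.values()), len(counts)


# Tiny in-memory sample standing in for a corpus file (made-up rows).
sample = io.StringIO(
    "category,text\n"
    "news,আমি ভাত খাই\n"
    "sports,আমি খেলা দেখি\n"
)
total, unique = count_tokens(sample)
print(total, unique)
```

For a file on disk, the same function could be called on a handle opened with `open(path, encoding="utf-8", newline="")`, so that Bangla text is decoded correctly and the `csv` module handles line endings itself.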

Dataset Files

    Files have not been uploaded for this dataset

Documentation

    Attachment: KUMono Bangla corpus documentation.pdf
    Size: 202.77 KB