Bangla is one of the most widely spoken languages in the world, but Bangla NLP research is still at an early stage of development because of the lack of high-quality public corpora. In this article, we describe in detail the methodology used to compile a comprehensive monolingual Bangla corpus, KUMono. The corpus contains more than 353 million word tokens in total and more than one million unique tokens, drawn from 18 major text categories of online Bangla websites.
The file format is CSV.
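Because the corpus is distributed as CSV, total and unique token counts of the kind reported above can be recomputed with a few lines of Python. The following is only a minimal sketch: the column names `category` and `text` are assumptions (the actual column layout is not specified here), and plain whitespace splitting stands in for whatever tokenization was actually used to produce the reported figures.

```python
import csv
import io
from collections import Counter


def count_tokens(csv_file, text_column="text"):
    """Return (total_tokens, unique_tokens) for one CSV file object.

    `text_column` is a hypothetical column name; adjust it to the
    real header of the corpus files. Tokens are split on whitespace,
    which is an assumption, not the corpus's documented tokenizer.
    """
    counts = Counter()
    for row in csv.DictReader(csv_file):
        counts.update(row[text_column].split())
    return sum(counts.values()), len(counts)


# Tiny in-memory sample standing in for a real corpus file.
sample = io.StringIO(
    "category,text\n"
    "sports,আমি ভাত খাই\n"
    "news,আমি বই পড়ি\n"
)
total, unique = count_tokens(sample)
print(total, unique)
```

For a file on disk, the same function can be given a handle opened with `open(path, encoding="utf-8", newline="")`, where `path` is a placeholder for the downloaded file. The rows are streamed one at a time, so the whole corpus does not need to fit in memory, although the counter of unique tokens does.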
The complete corpus will be made available once the manuscript is accepted.