Bangla is one of the most widely spoken languages in the world, but Bangla NLP research is still at an early stage of development because of the lack of high-quality public corpora. In this article, we describe in detail the compilation methodology of a comprehensive monolingual Bangla corpus, KUMono. This corpus consists of more than 353 million word tokens in total and more than one million unique tokens, drawn from 18 major text categories of online Bangla websites.


Tweets related to 10 different types of disasters were monitored from 28 September 2021 to 6 October 2021. A total of 67,528 rows containing 16 fields were extracted using Microsoft's Artificial Intelligence and Natural Language Processing services.
