Our Nepali news dataset contains 17 categories: Art, Bank, Blog, Business, Diaspora, Entertainment, Filmy, Health, Hollywood-bollywood, Koseli, Literature, Music, National, Opinion, Society, Sports, and World.

If you use this dataset, please cite our paper.

Sitaula C, Basnet A, Aryal S. 2021. Vector representation based on a supervised codebook for Nepali documents classification. PeerJ Computer Science 7:e412 https://doi.org/10.7717/peerj-cs.412


Most text-simplification systems require an indicator of word complexity. The prevalent approaches to word difficulty prediction rely on manual feature engineering. Deep-learning-based models remain largely unexplored for this task because of their comparatively poor performance. We explore the use of one such model for predicting the difficulty of words, treating the problem as binary classification. We train traditional machine learning models and evaluate their performance on the task.
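
To make the binary-classification framing concrete, here is a minimal sketch of a traditional model with hand-engineered features. It is not the method evaluated in this work: the toy word list, its easy/difficult labels, the two features (word length and vowel count), and the names `features`, `train`, and `predict` are all assumptions chosen for illustration, and the classifier is a plain logistic regression written with the standard library only.

```python
import math

# Hypothetical toy data: (word, label), where 1 = difficult and 0 = easy.
# These words and labels are illustrative assumptions, not real annotations.
TRAIN = [
    ("cat", 0), ("dog", 0), ("run", 0), ("house", 0), ("tree", 0), ("book", 0),
    ("ubiquitous", 1), ("ephemeral", 1), ("perspicacious", 1),
    ("obfuscate", 1), ("idiosyncratic", 1), ("magnanimous", 1),
]

VOWELS = set("aeiou")


def features(word):
    """Hand-engineered features: bias term, scaled length, scaled vowel count."""
    w = word.lower()
    return [1.0, len(w) / 10.0, sum(c in VOWELS for c in w) / 5.0]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def score(weights, word):
    """Estimated probability that the word is difficult."""
    x = features(word)
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)))


def train(data, epochs=2000, lr=0.5):
    """Fit logistic regression with full-batch gradient descent."""
    weights = [0.0, 0.0, 0.0]
    for _ in range(epochs):
        grad = [0.0, 0.0, 0.0]
        for word, label in data:
            x = features(word)
            error = score(weights, word) - label
            for i in range(len(weights)):
                grad[i] += error * x[i]
        weights = [w - lr * g / len(data) for w, g in zip(weights, grad)]
    return weights


def predict(weights, word):
    """Return 1 (difficult) or 0 (easy) using a 0.5 probability threshold."""
    return 1 if score(weights, word) >= 0.5 else 0


weights = train(TRAIN)
for candidate in ["sun", "incomprehensible"]:
    print(candidate, predict(weights, candidate))
```

In practice the features would be richer (for example corpus frequency or syllable counts) and the labels would come from an annotated lexicon; the sketch only shows the shape of the pipeline: featurize each word, fit a binary classifier, and threshold its output.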