Open Access

300-Dimensional Word Embeddings for Nepali Language

Citation Author(s):
Rabindra Lamsal (Artificial Intelligence & Data Science Lab, SC&SS, JNU)
Submitted by:
Rabindra Lamsal
DOI:
10.21227/dz6s-my90

Abstract

This pre-trained Word2Vec model provides 300-dimensional vectors for more than 0.5 million Nepali words and phrases. A separate Nepali-language text corpus was created for this purpose from news content freely available in the public domain; the corpus contains more than 90 million running words. The "Nepali Text Corpus" is freely available at http://dx.doi.org/10.21227/jxrd-d245.

Word2Vec model details: Embedding dimension: 300; Architecture: Continuous Bag-of-Words (CBOW); Training algorithm: negative sampling with 15 negative samples; Context (window) size: 10; Minimum token count: 2; Encoding: UTF-8.
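As a rough illustration of how the settings listed above map onto gensim, the following sketch collects them as keyword arguments for `gensim.models.Word2Vec`. It is not the author's training script: the helper name `train_embeddings` is hypothetical, the keyword names assume gensim 4.x (gensim 3.x used `size` rather than `vector_size`), and `hs=0` is an assumption, since the page only states that negative sampling was used.

```python
# Hyperparameters taken from the model details above.
# Keyword names follow gensim 4.x Word2Vec (an assumption about the version).
W2V_PARAMS = {
    "vector_size": 300,  # embedding dimension
    "sg": 0,             # 0 selects CBOW; 1 would select skip-gram
    "hs": 0,             # assumption: hierarchical softmax off, negative sampling only
    "negative": 15,      # number of negative samples
    "window": 10,        # context window size
    "min_count": 2,      # ignore tokens that occur fewer than 2 times
}


def train_embeddings(sentences):
    """Train a Word2Vec model on an iterable of token lists (hypothetical helper)."""
    from gensim.models import Word2Vec  # imported lazily; requires gensim

    return Word2Vec(sentences=sentences, **W2V_PARAMS)
```

A model trained this way could then be exported with `model.wv.save_word2vec_format(path, binary=False)` to produce a plain-text vector file like the one distributed here.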

Instructions:

from gensim.models import KeyedVectors

# Load the vectors (plain-text word2vec format)
model = KeyedVectors.load_word2vec_format('.../path/to/nepali_embeddings_word2vec.txt', binary=False)

# Find the similarity between two words
model.similarity('फेसबूक', 'इन्स्टाग्राम')

# Find the most similar words
model.most_similar('ठमेल')

# Try some vector arithmetic with Nepali words
# (replace the empty strings with Nepali words of your choice)
model.most_similar(positive=['', ''], negative=[''], topn=1)
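For readers who want to see what the loading and similarity steps above amount to, here is a minimal pure-Python sketch. It assumes the standard plain-text word2vec layout (a header line with vocabulary size and dimension, then one token and its vector per line) and computes cosine similarity, which is the measure gensim's `similarity` reports. The function names and the toy tokens `w1`, `w2`, `w3` are hypothetical, and the vectors are 3-dimensional stand-ins rather than values from the dataset.

```python
import io
import math


def read_word2vec_text(handle):
    """Parse plain-text word2vec data into a dict of token -> list of floats."""
    header = handle.readline().split()
    dim = int(header[1])  # header is "<vocab_size> <dimension>"
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors, dim


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy in-memory file with hypothetical tokens; a real file would be opened
# with open(path, encoding="utf-8") because the vectors are UTF-8 encoded.
toy_file = io.StringIO("3 3\nw1 1.0 0.0 0.0\nw2 0.8 0.6 0.0\nw3 0.0 0.0 1.0\n")
vectors, dim = read_word2vec_text(toy_file)

sim_w1_w2 = cosine(vectors["w1"], vectors["w2"])  # vectors pointing in similar directions
sim_w1_w3 = cosine(vectors["w1"], vectors["w3"])  # orthogonal vectors
```

Loading 0.5 million 300-dimensional vectors this way would be slow and memory-hungry; the sketch is only meant to show the file layout and the similarity measure, and gensim's loader remains the practical choice.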
