300-Dimensional Word Embeddings for Nepali Language
This pre-trained Word2Vec model has 300-dimensional vectors for more than 0.5 million Nepali words and phrases. A separate Nepali language text corpus was created using the news contents freely available in the public domain. The text corpus contained more than 90 million running words. The "Nepali Text Corpus" can be accessed freely from http://dx.doi.org/10.21227/jxrd-d245.
Word2Vec model details: Embeddings Dimension: 300, Architecture: Continuous - BOW, Training algorithm: Negative sampling = 15, Context (window) size: 10, Token minimum count: 2, Encoded in UTF-8.
from gensim.models import KeyedVectors
# Load vectors
model = KeyedVectors.load_word2vec_format(''.../path/to/nepali_embeddings_word2vec.txt', binary=False)
# find similarity between words
#most similar words
#try some linear algebra maths with Nepali words
model.most_similar(positive=['', ''], negative=[''], topn=1)
The design of the Nepali text corpus and the training of the Word2Vec model was done at Database Systems and Artificial Intelligence Lab, School of Computer and System Sciences, Jawaharlal Nehru University, New Delhi.