Standard Dataset
Vectors from LLM
- Citation Author(s):
- Submitted by: Maksim Pokrovskiy
- Last updated: Mon, 10/14/2024 - 02:31
- DOI: 10.21227/t44r-9011
- License:
- Categories:
- Keywords:
Abstract
This dataset contains sentence vector embeddings for about 10,000,000 sentences from Russian books, parsed from a single literature site. The embeddings were produced with the Mistral open API.
The embeddings were then reduced from 1024 to 256 dimensions using the PCA implementation in Python's scikit-learn library.
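The reduction step can be sketched roughly as follows. This is a minimal illustration, not the script used to build the dataset: the random input array stands in for real embeddings, and the sample count, the seed, and all PCA settings other than the 1024 and 256 dimensions are assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA

# Placeholder data (assumption): 512 random vectors with 1024 dimensions,
# standing in for embeddings returned by the embedding API.
rng = np.random.default_rng(0)
embeddings = rng.standard_normal((512, 1024)).astype(np.float32)

# Fit PCA and project every vector onto the first 256 principal components.
pca = PCA(n_components=256)
reduced = pca.fit_transform(embeddings)

print(reduced.shape)  # (512, 256)
# Fraction of the original variance kept by the 256 components.
print(float(pca.explained_variance_ratio_.sum()))
```

The same fitted `pca` object has to be reused (via `pca.transform`) for any new vectors that should be compared with the reduced ones, because the projection depends on the data it was fitted on.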
Word embeddings are a way of representing words as vectors in a multi-dimensional space, where the distance and direction between vectors reflect the similarity and relationships among the corresponding words.
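To illustrate how direction between vectors can reflect similarity, here is a small sketch using cosine similarity. The three-dimensional vectors and their labels are made up for the example and do not come from the dataset; real embeddings here have 256 dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors (hypothetical values, chosen only for illustration).
v_cat = [0.9, 0.1, 0.0]
v_kitten = [0.8, 0.2, 0.1]
v_car = [0.0, 0.1, 0.9]

sim_related = cosine_similarity(v_cat, v_kitten)
sim_unrelated = cosine_similarity(v_cat, v_car)

# Vectors pointing in similar directions score closer to 1.
print(sim_related > sim_unrelated)  # True
```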
Mistral AI is a French company specializing in artificial intelligence (AI) products. Founded in April 2023 by former employees of Meta Platforms and Google DeepMind,[1] the company has quickly risen to prominence in the AI sector.
Instructions:
All information is in "ReadMe.txt".
Dataset Files
- vec_0-5.zip (6.67 GB)
- vec_6-17.zip (12.46 GB)
- datas.zip (209.16 MB)
- read_fvec.py (1.25 kB)
Documentation
Attachment | Size |
---|---|
ReadMe.txt | 389 bytes |