Standard Dataset
Vectors from llm
- Submitted by: Maksim Pokrovskiy
- Last updated: Thu, 10/17/2024 - 10:53
- DOI: 10.21227/t44r-9011
Abstract
This dataset contains sentence vector embeddings for about 10,000,000 sentences from Russian books, parsed from the literature site https://avidreaders.ru. The embeddings were generated with the Mistral open API.
The embeddings were then reduced from 1024 to 256 dimensions using the PCA implementation in the Python scikit-learn library.
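The dimensionality reduction described above can be sketched roughly as follows. This is a minimal illustration, not the author's actual script: the random input array, the sample count of 1,000 and the variable names are assumptions made for the example, and the real PCA settings used for the dataset are not stated here.

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in for real sentence embeddings (assumption): 1,000 random
# 1024-dimensional vectors instead of vectors returned by an embedding API.
rng = np.random.default_rng(0)
embeddings = rng.standard_normal((1000, 1024)).astype(np.float32)

# Fit PCA on the data and project it onto the first 256 principal components.
pca = PCA(n_components=256)
reduced = pca.fit_transform(embeddings)

print(reduced.shape)
# Fraction of the original variance kept by the 256 components.
print(float(pca.explained_variance_ratio_.sum()))
```

For a corpus of millions of sentences, fitting on a sample or using scikit-learn's `IncrementalPCA` may be more practical than a single in-memory fit.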
Word embeddings are a way of representing words as vectors in a multi-dimensional space, where the distance and direction between vectors reflect the similarity and relationships among the corresponding words.
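One common way to measure the similarity between two embedding vectors is cosine similarity. The sketch below uses small hand-made vectors for illustration only; the function name `cosine_similarity` and the example vectors are not part of the dataset.

```python
import numpy as np

def cosine_similarity(a, b):
    """Return the cosine of the angle between vectors a and b."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors (assumption): u and v point in the same direction,
# w is orthogonal to both.
u = np.array([1.0, 0.0, 1.0])
v = np.array([2.0, 0.0, 2.0])
w = np.array([0.0, 3.0, 0.0])

sim_uv = cosine_similarity(u, v)  # close to 1: same direction
sim_uw = cosine_similarity(u, w)  # 0: orthogonal vectors
print(sim_uv, sim_uw)
```

Values near 1 indicate vectors pointing in nearly the same direction, values near 0 indicate unrelated directions.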
Mistral AI is a French company specializing in artificial intelligence (AI) products. Founded in April 2023 by former employees of Meta Platforms and Google DeepMind,[1] the company has quickly risen to prominence in the AI sector.
Instructions:
All information is in "ReadMe.txt".
Dataset Files
- vec_0-5.zip (6.67 GB)
- vec_6-17.zip (12.46 GB)
- datas.zip (209.16 MB)
- read_fvec.py (1.25 kB)
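The file name read_fvec.py suggests that the vectors may be stored in the common `.fvecs` layout, in which each record is a little-endian 32-bit integer dimension followed by that many 32-bit floats. That is an assumption, not something stated on this page; ReadMe.txt and the provided read_fvec.py remain the authoritative description. Under that assumption, a reader could look like this sketch (the function name `read_fvecs` and the round-trip demo file are hypothetical):

```python
import os
import tempfile
import numpy as np

def read_fvecs(path):
    """Read an .fvecs file (assumed layout: int32 dim, then dim float32 values per record)."""
    raw = np.fromfile(path, dtype="<i4")
    if raw.size == 0:
        return np.empty((0, 0), dtype="<f4")
    dim = int(raw[0])
    records = raw.reshape(-1, dim + 1)
    if not np.all(records[:, 0] == dim):
        raise ValueError("inconsistent vector dimensions in file")
    # Drop the leading dimension column and reinterpret the bytes as float32.
    return records[:, 1:].copy().view("<f4")

# Round-trip demo on a tiny synthetic file (not a file from the dataset).
sample = np.arange(12, dtype="<f4").reshape(3, 4)
with tempfile.NamedTemporaryFile(suffix=".fvecs", delete=False) as f:
    for row in sample:
        f.write(np.array([row.size], dtype="<i4").tobytes())
        f.write(row.tobytes())
    tmp_path = f.name

loaded = read_fvecs(tmp_path)
os.remove(tmp_path)
print(loaded.shape)
```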
Documentation
| Attachment | Size |
|---|---|
| ReadMe.txt | 389 bytes |