Standard Dataset
SynEL: A Synthetic Benchmark for Entity Linking
- Citation Author(s):
- Submitted by: Ilia Karpov
- Last updated: Thu, 11/14/2024 - 17:06
- DOI: 10.21227/25m4-h372
- Data Format:
- License:
Abstract
This dataset accompanies the paper "SynEL: A Synthetic Benchmark for Entity Linking". It integrates structured information from two primary sources: DBpedia for English, representing a high-resource language environment, and the Russian Public Company Register, a challenging low-resource source. Each dataset includes extensive annotations and structured entity links, which keeps it relevant to real-world applications across industries. The dataset supports training and evaluating graph neural network (GNN) and large language model (LLM) techniques across varied linguistic contexts. Experimental results indicate that models trained on this dataset achieve significant gains in entity linking precision and recall, especially in specialized domains such as finance and regulatory compliance.
To use the data, unzip each archive listed under Dataset Files.
Dataset Files
- egrul-based-with-mentions.json.zip (10.66 MB)
- egrul-based-3.json.zip (2.18 MB)
- egrul-based-2.json.zip (5.03 MB)
- egrul-based-1.json.zip (9.08 MB)
- deanonymization-based-with-mentions.json.zip (144.65 kB)
- dbpedia-based.json.zip (16.34 MB)
- dbpedia-based-with-mentions.json.zip (989.53 kB)
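The archives above can also be read without unpacking them to disk. The following is a minimal Python sketch, assuming that each `.json.zip` archive contains one or more `.json` members and that each member is a single JSON document (not JSON Lines). The page does not document the internal schema, so the code makes no assumption about field names. The helper name `load_json_archive` is hypothetical.

```python
import json
import zipfile
from pathlib import Path


def load_json_archive(zip_path):
    """Parse every .json member of a ZIP archive.

    Returns a dict mapping member name -> parsed JSON value.
    Assumption: each member is one complete JSON document.
    """
    parsed = {}
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            # Skip non-JSON members and macOS resource-fork entries.
            if not name.endswith(".json") or name.startswith("__MACOSX/"):
                continue
            with archive.open(name) as member:
                parsed[name] = json.load(member)
    return parsed


if __name__ == "__main__":
    # File name taken from the Dataset Files list; adjust the path as needed.
    path = Path("dbpedia-based-with-mentions.json.zip")
    if path.exists():
        for name, data in load_json_archive(path).items():
            # Only the top-level type is printed, since the schema is unknown.
            print(name, type(data).__name__)
```

Printing only the top-level type is a cautious first step: it shows whether a file holds a list of records or a single object before any code is written against specific fields.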