FedE4RAG_Dataset

- Citation Author(s): Qianren Mao, Qili Zhang, Hanwen Hao, Zhentao Han, Runhua Xu, Weifeng Jiang, Jianxin Li
- Submitted by: Qili Zhang
- DOI: 10.21227/pzpq-eb25
Instructions:
# FedE4RAG_Dataset
This is the dataset for the paper ***Privacy-Preserving Federated Embedding Learning for Localized Retrieval-Augmented Generation***.
[FedE4RAG](https://github.com/DocAILab/FedE4RAG) addresses data scarcity and privacy challenges in private RAG systems. It uses federated learning (FL) to collaboratively train client-side RAG retrieval models, keeping raw data localized. The framework employs knowledge distillation for effective server-client communication and homomorphic encryption to enhance parameter privacy. FedE4RAG aims to boost the performance of localized RAG retrievers by leveraging diverse client insights securely, balancing data utility and confidentiality, particularly demonstrated in sensitive domains like finance.
## Dataset structure
```
FedE4RAG_Dataset
|-FEDE4FIN
|  |-train_corpus.json             # Corpus used to generate the training data
|  |-train_data
|     |-data_1000_random.json      # 1,000 synthetic samples in random order
|     |-data_2000_random.json
|     |-data_5000_random.json
|     |-data_10000_random.json
|     |-data_20000_random.json
|     |-data_50000_random.json
|-RAG4FIN
   |-test_corpus.json              # Corpus used for downstream question answering
   |-test_qa
   |  |-data_100.json              # Question-answer pairs used for testing
   |-val_qa
      |-data_50.json               # Question-answer pairs used for validation
```
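As a minimal sketch of how the layout above maps to file paths, the following Python helpers build the path to one training split and load a JSON file. They assume that `train_data` sits inside `FEDE4FIN`, that the files are UTF-8 encoded JSON, and that the archive has been extracted to a local folder. The names `train_file` and `load_json` are hypothetical and not part of the dataset.

```python
import json
from pathlib import Path

# Split sizes taken from the file names listed in the dataset structure.
TRAIN_SIZES = (1000, 2000, 5000, 10000, 20000, 50000)


def train_file(root, size):
    """Return the path of the training split with `size` samples.

    Assumes the layout <root>/FEDE4FIN/train_data/data_<size>_random.json.
    """
    if size not in TRAIN_SIZES:
        raise ValueError(f"unknown split size {size}; expected one of {TRAIN_SIZES}")
    return Path(root) / "FEDE4FIN" / "train_data" / f"data_{size}_random.json"


def load_json(path):
    """Load one dataset file; assumes UTF-8 encoded JSON."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


# Example: build the path for the smallest split (no file access happens here).
path_1000 = train_file("FedE4RAG_Dataset", 1000)
print(path_1000.as_posix())
```

Calling `load_json(path_1000)` would then read the split, provided the dataset has been downloaded to that location.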
## Data structure
Fields in the corpus files:
```
page_content # The text content of this corpus entry.
index # The index of this corpus entry.
```
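A common first step with a corpus like this is to build a lookup from `index` to `page_content`. The sketch below assumes that a corpus file parses to a list of objects carrying these two fields and that `index` uniquely identifies an entry; neither is confirmed by the field list above. The function name `build_corpus_lookup` and the sample entries are hypothetical placeholders.

```python
def build_corpus_lookup(corpus):
    """Map each entry's `index` to its `page_content`.

    Assumes `corpus` is a list of dicts with `index` and `page_content` keys.
    """
    lookup = {}
    for entry in corpus:
        lookup[entry["index"]] = entry["page_content"]
    return lookup


# Placeholder entries for illustration only; not taken from the dataset.
sample_corpus = [
    {"index": 0, "page_content": "Placeholder passage A."},
    {"index": 1, "page_content": "Placeholder passage B."},
]

lookup = build_corpus_lookup(sample_corpus)
print(lookup[1])
```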
Fields in the training data:
```
company # The company this record belongs to
page # The contexts of this record
index # The index of this record
reference # The reference passage for the question
question # The question generated from the context
```
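The fields above lend themselves to two simple preparations: pairing each `question` with its `reference` for retriever training, and grouping records by `company`. The sketch below assumes a training file parses to a list of objects with these fields. Grouping by `company` is only one possible way to simulate separate clients and is not confirmed as the paper's partitioning scheme. The function names and sample records are hypothetical.

```python
from collections import defaultdict


def to_training_pairs(records):
    """Return (question, reference) tuples, one per record."""
    return [(r["question"], r["reference"]) for r in records]


def group_by_company(records):
    """Group records by their `company` field.

    Illustrative client split only; an assumption, not the paper's method.
    """
    groups = defaultdict(list)
    for r in records:
        groups[r["company"]].append(r)
    return dict(groups)


# Placeholder records for illustration only; not taken from the dataset.
sample_records = [
    {"company": "CompanyA", "page": "ctx 1", "index": 0,
     "reference": "ref 1", "question": "q 1"},
    {"company": "CompanyB", "page": "ctx 2", "index": 1,
     "reference": "ref 2", "question": "q 2"},
    {"company": "CompanyA", "page": "ctx 3", "index": 2,
     "reference": "ref 3", "question": "q 3"},
]

pairs = to_training_pairs(sample_records)
clients = group_by_company(sample_records)
print(len(pairs), sorted(clients))
```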
Fields in the test and validation data:
```
key_content
  reference                   # The reference
  reference_idx               # The reference index
  question                    # The question
  answer                      # The answer
other_info
  doc_name                    # The name of the document containing the reference
  company                     # The company that owns this document
  question_type               # The question type
  question_reasoning          # The question reasoning type
  question
  answer
  evidence
    evidence_text
    doc_name
    evidence_page_num         # The page number of the reference in the document
    evidence_text_full_page   # The context of the reference
```
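To illustrate how a test or validation record might be consumed, the sketch below pulls the four `key_content` fields into a flat dictionary and checks whether a retriever's top-k results contain the reference. It assumes each record is an object with a `key_content` mapping and that `reference_idx` is a single value comparable to a corpus `index`; if it is a list in the actual files, the check needs adapting. `flatten_qa` and `hit_at_k` are hypothetical helpers, and the hit check is a generic retrieval measure, not necessarily the paper's evaluation protocol.

```python
def flatten_qa(record):
    """Extract the core QA fields from a record's `key_content`."""
    key = record["key_content"]
    return {
        "question": key["question"],
        "answer": key["answer"],
        "reference": key["reference"],
        "reference_idx": key["reference_idx"],
    }


def hit_at_k(retrieved_indices, reference_idx, k):
    """Return True if `reference_idx` is among the first k retrieved indices."""
    return reference_idx in retrieved_indices[:k]


# Placeholder record for illustration only; not taken from the dataset.
sample_record = {
    "key_content": {
        "reference": "ref text",
        "reference_idx": 7,
        "question": "q text",
        "answer": "a text",
    },
    "other_info": {},
}

qa = flatten_qa(sample_record)
# Hypothetical ranked corpus indices returned by some retriever.
ranked = [3, 7, 12, 5]
print(hit_at_k(ranked, qa["reference_idx"], k=1))
print(hit_at_k(ranked, qa["reference_idx"], k=2))
```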
## Acknowledgements
Part of the corpus used to synthesize the training data is derived from an open-source dataset: [PatronusAI/financebench · Datasets at Hugging Face](https://huggingface.co/datasets/PatronusAI/financebench). We are grateful to the FinanceBench development team, whose contributions and insights have been instrumental in advancing this project in the federated learning domain.