
FedE4RAG_Dataset

Citation Author(s):
Qianren Mao
Qili Zhang
Hanwen Hao
Zhentao Han
Runhua Xu
Weifeng Jiang
Jianxin Li
Submitted by:
Qili Zhang
DOI:
10.21227/pzpq-eb25

Abstract

This is the dataset for the paper Privacy-Preserving Federated Embedding Learning for Localized Retrieval-Augmented Generation.

FedE4RAG addresses data scarcity and privacy challenges in private RAG systems. It uses federated learning (FL) to collaboratively train client-side RAG retrieval models while keeping raw data localized. The framework employs knowledge distillation for effective server-client communication and homomorphic encryption to protect parameter privacy. FedE4RAG aims to boost the performance of localized RAG retrievers by securely leveraging diverse client insights, balancing data utility and confidentiality, as demonstrated in sensitive domains such as finance.

Instructions:

# FedE4RAG_Dataset

This is the dataset for the paper ***Privacy-Preserving Federated Embedding Learning for Localized Retrieval-Augmented Generation*** (see the abstract above). Code for the framework is available in the [FedE4RAG](https://github.com/DocAILab/FedE4RAG) repository.

## Dataset structure

```
FedE4RAG_Dataset
|-FEDE4FIN
  |-train_corpus.json               #   Corpus used to generate the training data
  |-train_data
    |-data_1000_random.json         #   1,000 synthetic examples in random order
    |-data_2000_random.json
    |-data_5000_random.json
    |-data_10000_random.json
    |-data_20000_random.json
    |-data_50000_random.json
|-RAG4FIN
  |-test_corpus.json                #   Corpus used for downstream question answering
  |-test_qa
    |-data_100.json                 #   Question-answer pairs used for testing
  |-val_qa
    |-data_50.json                  #   Question-answer pairs used for validation
```
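
As a quick sanity check of the layout above, the files can be loaded with the Python standard library. A minimal sketch, assuming the archive was extracted into the working directory and every file is plain JSON:

```python
import json
from pathlib import Path

root = Path("FedE4RAG_Dataset")  # adjust to wherever the archive was extracted

# Training side: corpus plus the shuffled synthetic splits of increasing size.
train_corpus = json.loads((root / "FEDE4FIN" / "train_corpus.json").read_text(encoding="utf-8"))
train_splits = sorted((root / "FEDE4FIN" / "train_data").glob("data_*_random.json"))

# Evaluation side: test corpus and the QA splits.
test_corpus = json.loads((root / "RAG4FIN" / "test_corpus.json").read_text(encoding="utf-8"))
test_qa = json.loads((root / "RAG4FIN" / "test_qa" / "data_100.json").read_text(encoding="utf-8"))
val_qa = json.loads((root / "RAG4FIN" / "val_qa" / "data_50.json").read_text(encoding="utf-8"))

print(len(train_corpus), "training corpus entries;", len(train_splits), "training splits")
```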

## Data structure

Meaning of the data fields in the corpus:

```
page_content        #   The context of this corpus entry
index               #   The index of this corpus entry
```
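
For illustration, here is how a corpus entry might be inspected; a sketch assuming the file deserializes to a list of objects with exactly the two fields above:

```python
import json
from pathlib import Path

# Hypothetical path; adjust to your extraction location.
corpus = json.loads(Path("FedE4RAG_Dataset/FEDE4FIN/train_corpus.json").read_text(encoding="utf-8"))

# Show the index and the first 80 characters of each context.
for entry in corpus[:3]:
    print(entry["index"], "->", entry["page_content"][:80])
```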

Meaning of the data fields in the training data:

```
company             #   The company that owns this data
page                #   The context of this example
index               #   The index of this example
reference           #   The reference for the question
question            #   The question generated from the context
```
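
Since FedE4RAG trains retrieval embedding models, a natural use of these fields is as (question, context) training pairs, optionally partitioned per company. A minimal sketch (the per-company client split is an assumption for illustration, not necessarily the paper's protocol):

```python
import json
from collections import defaultdict
from pathlib import Path

path = Path("FedE4RAG_Dataset/FEDE4FIN/train_data/data_1000_random.json")  # adjust as needed
examples = json.loads(path.read_text(encoding="utf-8"))

# (query, positive passage) pairs for contrastive retriever training.
pairs = [(ex["question"], ex["page"]) for ex in examples]

# Group examples by company, e.g. to simulate one federated client per company
# (an assumption; the paper's actual client partitioning may differ).
by_company = defaultdict(list)
for ex in examples:
    by_company[ex["company"]].append(ex)
print(len(pairs), "pairs across", len(by_company), "companies")
```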

Meaning of the data fields in the test & validation data:

```
key_content
    reference                       #   The reference context
    reference_idx                   #   The index of the reference
    question                        #   The question
    answer                          #   The answer
other_info
    doc_name                        #   The name of the document containing the reference
    company                         #   The company that owns this document
    question_type                   #   The question type
    question_reasoning              #   The question reasoning type
    question                        #   The question
    answer                          #   The answer
    evidence                        #   Supporting evidence for the answer
        evidence_text               #   The evidence text
        doc_name                    #   The name of the source document
        evidence_page_num           #   The page number of the reference in the document
        evidence_text_full_page     #   The full page of context containing the reference
```
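
For evaluation, the nested records above flatten naturally into (question, answer, gold reference) triples. A sketch assuming each record is an object with the `key_content` and `other_info` keys listed:

```python
import json
from pathlib import Path

# Hypothetical path; adjust to your extraction location.
qa = json.loads(Path("FedE4RAG_Dataset/RAG4FIN/test_qa/data_100.json").read_text(encoding="utf-8"))

# Flatten to (question, answer, reference index) for retriever/RAG evaluation.
triples = [
    (rec["key_content"]["question"],
     rec["key_content"]["answer"],
     rec["key_content"]["reference_idx"])
    for rec in qa
]
print(len(triples), "QA items; first question:", triples[0][0])
```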

## Acknowledgements 

Part of the corpus used to synthesize the training data is derived from the open-source dataset [PatronusAI/financebench](https://huggingface.co/datasets/PatronusAI/financebench). We are grateful for the contributions and insights of the FinanceBench development team, which have been instrumental in advancing our project in the federated learning domain.