
FedE4RAG_Dataset

Citation Author(s):
Qianren Mao
Qili Zhang
Hanwen Hao
Zhentao Han
Runhua Xu
Weifeng Jiang
Jianxin Li
Submitted by:
Qili Zhang
DOI:
10.21227/pzpq-eb25

Abstract

This is the dataset for the paper Privacy-Preserving Federated Embedding Learning for Localized Retrieval-Augmented Generation.

FedE4RAG addresses data scarcity and privacy challenges in private RAG systems. It uses federated learning (FL) to collaboratively train client-side RAG retrieval models while keeping raw data localized. The framework employs knowledge distillation for effective server-client communication and homomorphic encryption to protect parameter privacy. FedE4RAG aims to boost the performance of localized RAG retrievers by securely leveraging diverse client insights, balancing data utility and confidentiality, as demonstrated in sensitive domains such as finance.

Instructions:

# FedE4RAG_Dataset

This is the dataset for the paper ***Privacy-Preserving Federated Embedding Learning for Localized Retrieval-Augmented Generation*** (see the abstract above). Code for the framework is available in the [FedE4RAG](https://github.com/DocAILab/FedE4RAG) repository.

## Dataset structure

```
FedE4RAG_Dataset
|-FEDE4FIN
  |-train_corpus.json               #   Corpus used to generate the training data
  |-train_data
    |-data_1000_random.json         #   1,000 synthetic examples in random order
    |-data_2000_random.json
    |-data_5000_random.json
    |-data_10000_random.json
    |-data_20000_random.json
    |-data_50000_random.json
|-RAG4FIN
  |-test_corpus.json                #   Corpus used for downstream question answering
  |-test_qa
    |-data_100.json                 #   Question-answer pairs used for testing
  |-val_qa
    |-data_50.json                  #   Question-answer pairs used for validation
```
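
As a quick sanity check of the layout above, the files can be loaded with the Python standard library. A minimal sketch, assuming the archive was extracted into the working directory and every file is plain JSON:

```python
import json
from pathlib import Path

root = Path("FedE4RAG_Dataset")  # adjust to wherever the archive was extracted

# Training side: corpus plus the shuffled synthetic splits of increasing size.
train_corpus = json.loads((root / "FEDE4FIN" / "train_corpus.json").read_text(encoding="utf-8"))
train_splits = sorted((root / "FEDE4FIN" / "train_data").glob("data_*_random.json"))

# Evaluation side: test corpus and the QA splits.
test_corpus = json.loads((root / "RAG4FIN" / "test_corpus.json").read_text(encoding="utf-8"))
test_qa = json.loads((root / "RAG4FIN" / "test_qa" / "data_100.json").read_text(encoding="utf-8"))
val_qa = json.loads((root / "RAG4FIN" / "val_qa" / "data_50.json").read_text(encoding="utf-8"))

print(len(train_corpus), "training corpus entries;", len(train_splits), "training splits")
```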

## Data structure

Meaning of the data fields in the corpus:

```
page_content        #   The context of this corpus entry
index               #   The index of this corpus entry
```
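
For illustration, here is how a corpus entry might be inspected; a sketch assuming the file deserializes to a list of objects with exactly the two fields above:

```python
import json
from pathlib import Path

# Hypothetical path; adjust to your extraction location.
corpus = json.loads(Path("FedE4RAG_Dataset/FEDE4FIN/train_corpus.json").read_text(encoding="utf-8"))

# Show the index and the first 80 characters of each context.
for entry in corpus[:3]:
    print(entry["index"], "->", entry["page_content"][:80])
```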

Meaning of the data fields in the training data:

```
company             #   The company that owns this data
page                #   The context of this example
index               #   The index of this example
reference           #   The reference for the question
question            #   The question generated from the context
```
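
Since FedE4RAG trains retrieval embedding models, a natural use of these fields is as (question, context) training pairs, optionally partitioned per company. A minimal sketch (the per-company client split is an assumption for illustration, not necessarily the paper's protocol):

```python
import json
from collections import defaultdict
from pathlib import Path

path = Path("FedE4RAG_Dataset/FEDE4FIN/train_data/data_1000_random.json")  # adjust as needed
examples = json.loads(path.read_text(encoding="utf-8"))

# (query, positive passage) pairs for contrastive retriever training.
pairs = [(ex["question"], ex["page"]) for ex in examples]

# Group examples by company, e.g. to simulate one federated client per company
# (an assumption; the paper's actual client partitioning may differ).
by_company = defaultdict(list)
for ex in examples:
    by_company[ex["company"]].append(ex)
print(len(pairs), "pairs across", len(by_company), "companies")
```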

Meaning of the data fields in the test & validation data:

```
key_content
    reference                       #   The reference context
    reference_idx                   #   The index of the reference
    question                        #   The question
    answer                          #   The answer
other_info
    doc_name                        #   The name of the document containing the reference
    company                         #   The company that owns this document
    question_type                   #   The question type
    question_reasoning              #   The question reasoning type
    question                        #   The question
    answer                          #   The answer
    evidence                        #   Supporting evidence for the answer
        evidence_text               #   The evidence text
        doc_name                    #   The name of the source document
        evidence_page_num           #   The page number of the reference in the document
        evidence_text_full_page     #   The full page of context containing the reference
```
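
For evaluation, the nested records above flatten naturally into (question, answer, gold reference) triples. A sketch assuming each record is an object with the `key_content` and `other_info` keys listed:

```python
import json
from pathlib import Path

# Hypothetical path; adjust to your extraction location.
qa = json.loads(Path("FedE4RAG_Dataset/RAG4FIN/test_qa/data_100.json").read_text(encoding="utf-8"))

# Flatten to (question, answer, reference index) for retriever/RAG evaluation.
triples = [
    (rec["key_content"]["question"],
     rec["key_content"]["answer"],
     rec["key_content"]["reference_idx"])
    for rec in qa
]
print(len(triples), "QA items; first question:", triples[0][0])
```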

## Acknowledgements 

Part of the corpus used to synthesize the training data is derived from the open-source dataset [PatronusAI/financebench](https://huggingface.co/datasets/PatronusAI/financebench). We are grateful for the contributions and insights of the FinanceBench development team, which have been instrumental in advancing our project in the federated learning domain.