REGen_data (Retrieval Generation Chat dataset)


Abstract 

The dataset and source code used in the paper "Pick the Better and Leave the Rest: Leveraging Multiple Retrieved Results to Guide Response Generation".

We conduct experiments on the Retrieval Generation Chat dataset, which contains about five million query-response pairs and provides 3 to 10 retrieved references for each query. Note that in some samples the retrieved query is exactly the user’s utterance; such retrieved candidates are taken as the ground truth and removed from the remaining candidates. After that, we also remove queries with more than ten responses, since each reference is probably irrelevant to most of the replies. Moreover, following the setting of previous studies, only the samples in which at least 20% of the corresponding retrieved references satisfy Jaccard(r; r(i)) > 0.3 are used for training, where Jaccard stands for the Jaccard distance. Finally, since each query corresponds to multiple replies, we split the filtered corpus into training (1,179,374), validation (21,462), and test (20,896) sets based on the query.
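The training filter described above can be illustrated with a minimal sketch. It is not the released preprocessing code. The function names `jaccard_overlap` and `keep_for_training` are hypothetical, and the sketch assumes that r is the tokenized response, that each r(i) is a tokenized retrieved reference, and that the score is the token-set overlap |A ∩ B| / |A ∪ B|. If the intended measure is instead 1 minus that overlap, the scoring function should be swapped accordingly.

```python
from typing import List


def jaccard_overlap(a: List[str], b: List[str]) -> float:
    """Token-set overlap |A & B| / |A | B| (assumed reading of the Jaccard score)."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        # Both token lists are empty: treat the overlap as zero.
        return 0.0
    return len(set_a & set_b) / len(union)


def keep_for_training(
    response: List[str],
    references: List[List[str]],
    threshold: float = 0.3,     # score threshold quoted in the description
    min_fraction: float = 0.2,  # "at least 20%" of the references
) -> bool:
    """Return True if enough references score above the threshold."""
    if not references:
        return False
    hits = sum(
        1 for ref in references if jaccard_overlap(response, ref) > threshold
    )
    return hits / len(references) >= min_fraction
```

For example, a sample with five references, exactly one of which scores above 0.3 against the response, sits at the 20% boundary and would be kept under this reading.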