Sentence embeddings for document sets in DUC 2002 summarization task

Citation Author(s):
Sandra J. Gutiérrez-Hinojosa, Hiram Calvo, Marco A. Moreno-Armendáriz, Carlos A. Duchanoy
Submitted by:
Hiram Calvo
Last updated:
Thu, 11/08/2018 - 10:34
DOI:
10.21227/qq4m-er38

Abstract 

DUC 2002 dataset (https://www-nlpir.nist.gov/projects/duc/guidelines/2002.html) processed through doc2vec (https://github.com/jhlau/doc2vec).
This dataset includes embeddings of the full DUC 2002 dataset in the following configurations:

  • Sentence embeddings
  • Document embeddings
  • Document Set embeddings

It also includes the results of the research presented in "Central embeddings for extractive summarization based on similarity".

To obtain the original DUC 2002 dataset, please consult the official site.

Instructions: 

 

vectoresoracionessets.npy: List of sentence embeddings for each document set in DUC 2002.

n_oracionessets.npy: List containing the number of sentences in each document set.

Example: [280, 152, 82, ...]

In this example, the first document set contains 280 sentences and the second contains 152 sentences. With respect to the vectoresoracionessets.npy file, this means that the first 280 sentences (indexes 0-279) belong to the first document set, the following 152 sentences (indexes 280-431) belong to the second document set, and so on.
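The index arithmetic described above can be sketched in Python as follows. This is a minimal illustration, not code shipped with the dataset: the helper names `set_ranges` and `load_sentences_by_set` are hypothetical, and the loader assumes that vectoresoracionessets.npy holds one flat sequence of sentence embeddings and that n_oracionessets.npy holds the per-set counts, as the description suggests.

```python
from itertools import accumulate


def set_ranges(counts):
    """Return half-open (start, stop) index ranges, one per document set."""
    stops = list(accumulate(counts))   # running totals: 280, 432, 514, ...
    starts = [0] + stops[:-1]          # each set starts where the previous one stops
    return list(zip(starts, stops))


def load_sentences_by_set(vectors_path="vectoresoracionessets.npy",
                          counts_path="n_oracionessets.npy"):
    """Hypothetical loader: split the flat embedding sequence into one block per set.

    Assumes the files load as plain arrays. If they were saved as object
    arrays, np.load would additionally need allow_pickle=True.
    """
    import numpy as np  # imported lazily so set_ranges works without NumPy

    vectors = np.load(vectors_path)
    counts = [int(c) for c in np.load(counts_path)]
    return [vectors[start:stop] for start, stop in set_ranges(counts)]


# The first three counts from the example above.
example_ranges = set_ranges([280, 152, 82])
print(example_ranges)  # [(0, 280), (280, 432), (432, 514)]
```

The ranges are half-open, so (280, 432) covers indexes 280 through 431, matching the interpretation given for the second document set.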

vectoresdocumentossets.npy: List of document embeddings for each document set in DUC 2002.

n_documentossets.npy: List containing the number of documents in each document set.

Example: [6, 8, 10, ...]

In this example, the first document set contains 6 documents and the second contains 8 documents. With respect to the vectoresdocumentossets.npy file, this means that the first 6 documents (indexes 0-5) belong to the first document set, the following 8 documents (indexes 6-13) belong to the second document set, and so on.
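Going the other way, from a flat index back to its document set, can be sketched with a cumulative sum and a binary search. The function name `set_of_index` is hypothetical, and the counts are only the first three values from the example; with them, flat index 13 is the last document of the second set and index 14 is the first document of the third.

```python
from bisect import bisect_right
from itertools import accumulate


def set_of_index(flat_index, counts):
    """Return the 0-based document set that contains a flat document index."""
    stops = list(accumulate(counts))   # exclusive end index of each set: 6, 14, 24
    if not 0 <= flat_index < stops[-1]:
        raise IndexError(f"index {flat_index} is outside 0..{stops[-1] - 1}")
    # bisect_right counts how many sets end at or before flat_index.
    return bisect_right(stops, flat_index)


counts = [6, 8, 10]
boundary_sets = [set_of_index(i, counts) for i in (0, 5, 6, 13, 14)]
print(boundary_sets)  # [0, 0, 1, 1, 2]
```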

vectorescentroidessets.npy: List of embeddings for each document set in DUC 2002.

oracionesGSA.npy: List of indexes of the sentences selected in the gold-standard summaries A in DUC 2002.

oracionesGSB.npy: List of indexes of the sentences selected in the gold-standard summaries B in DUC 2002.

CE_D, CE_S, CE_Set: Folders containing, for each proposed method, the central embedding and the list of indexes of the sentences in the generated summary.
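Because both the gold-standard files and the generated summaries are given as lists of sentence indexes, a quick sanity check is to measure how many indexes they share. The sketch below is only an illustration and not the evaluation used in the article: the function name `index_overlap` is hypothetical, the index values are made up, and it assumes both lists refer to the same document set and use the same indexing scheme.

```python
def index_overlap(selected, gold):
    """Precision and recall of selected sentence indexes against gold indexes."""
    selected_set, gold_set = set(selected), set(gold)
    hits = len(selected_set & gold_set)  # indexes present in both lists
    precision = hits / len(selected_set) if selected_set else 0.0
    recall = hits / len(gold_set) if gold_set else 0.0
    return precision, recall


# Made-up indexes for illustration only; they are not taken from the dataset.
precision, recall = index_overlap([3, 17, 42, 58], [3, 42, 99])
print(precision, recall)
```

Here two of the four selected indexes appear in the gold list, so precision is 0.5 and recall is two thirds.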

Ejemplo_set0: Folder containing the example data plotted in the results of the linked article.

 

Documentation

Attachment: ReadMe.pdf (59.15 KB)