Sentence embeddings for document sets in DUC 2002 summarization task

Citation Author(s):
Submitted by: Hiram Calvo
Last updated: Thu, 11/08/2018 - 15:34
DOI: 10.21227/qq4m-er38
Data Format: Numpy


Categories:

Computational Intelligence

Keywords:

Word embeddings

central embeddings

concept similarity


Abstract

DUC 2002 dataset (https://www-nlpir.nist.gov/projects/duc/guidelines/2002.html) processed through doc2vec (https://github.com/jhlau/doc2vec).
This dataset includes embeddings of the full DUC 2002 corpus in the following configurations:

Sentence embeddings
Document embeddings
Document Set embeddings

It also includes the results of the research presented in "Central embeddings for extractive summarization based on similarity".

To obtain the original DUC 2002 dataset, please consult the official site.

Instructions:

vectoresoracionessets.npy List of sentence embeddings for each document set in the DUC 2002

n_oracionessets.npy List containing the number of sentences in each document set

Example: [280, 152, 82, ...]

In the example, the first document set contains 280 sentences and the second contains 152 sentences. With respect to the vectoresoracionessets.npy file, this means that the first 280 sentences (indices 0-279) belong to the first document set, the following 152 sentences (indices 280-431) belong to the second document set, and so on.
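The index bookkeeping described above can be sketched in Python with NumPy. This is a minimal sketch that assumes the sentence embeddings are stored as one flat array of shape (number of sentences, dimension); the helper name split_by_counts, the synthetic data, and the dimension of 4 are illustrative assumptions and not part of the dataset.

```python
import numpy as np

def split_by_counts(vectors, counts):
    """Split a flat (n_items, dim) array into one block per document set.

    Hypothetical helper: assumes `vectors` is flat and ordered by document set.
    """
    counts = np.asarray(counts, dtype=int)
    if counts.sum() != len(vectors):
        raise ValueError("counts do not add up to the number of vectors")
    # Cumulative sums give the boundary after each set; the last one is
    # dropped because np.split expects only the interior cut points.
    boundaries = np.cumsum(counts)[:-1]
    return np.split(vectors, boundaries)

# Synthetic stand-in for the real files (dimension 4 is an arbitrary choice).
counts = [280, 152, 82]
vectors = np.arange(sum(counts) * 4, dtype=float).reshape(sum(counts), 4)

sets = split_by_counts(vectors, counts)
print([s.shape for s in sets])
```

With the real files, the two inputs would presumably come from np.load("vectoresoracionessets.npy", allow_pickle=True) and np.load("n_oracionessets.npy", allow_pickle=True). Whether the stored array is actually flat should be checked first, since the description does not state the on-disk layout.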

vectoresdocumentossets.npy List of document embeddings for each document set in the DUC 2002

n_documentossets.npy List containing the number of documents in each document set

Example: [6, 8, 10, ...]

In the example, the first document set contains 6 documents and the second contains 8 documents. With respect to the vectoresdocumentossets.npy file, this means that the first 6 documents (indices 0-5) belong to the first document set, the following 8 documents (indices 6-13) belong to the second document set, and so on.
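The inclusive index range of each document set follows directly from the running total of the counts. The short sketch below illustrates that arithmetic; the function name index_ranges is a hypothetical helper, not something shipped with the dataset.

```python
from itertools import accumulate

def index_ranges(counts):
    """Return inclusive (first, last) index pairs, one per document set."""
    ends = list(accumulate(counts))   # exclusive end of each set
    starts = [0] + ends[:-1]          # each set starts where the previous ended
    return [(s, e - 1) for s, e in zip(starts, ends)]

ranges = index_ranges([6, 8, 10])
print(ranges)  # [(0, 5), (6, 13), (14, 23)]
```

The same helper applies to the sentence counts: for [280, 152] it yields (0, 279) and (280, 431).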

vectorescentroidessets.npy List of embeddings for each document set in the DUC 2002

oracionesGSA.npy List of indices of the sentences selected in the gold-standard summaries A in the DUC 2002

oracionesGSB.npy List of indices of the sentences selected in the gold-standard summaries B in the DUC 2002

CE_D, CE_S, CE_Set Folders containing, for each proposed method, the central embedding and the list of indices of the sentences in the generated summary.

Ejemplo_set0 Folder containing the example data plotted in the results of the linked article