Standard Dataset
Sentence embeddings for document sets in DUC 2002 summarization task
- Citation Author(s):
- Sandra J. Gutiérrez-Hinojosa, Hiram Calvo, Marco A. Moreno-Armendáriz, Carlos A. Duchanoy
- Submitted by:
- Hiram Calvo
- Last updated:
- Thu, 11/08/2018 - 10:34
- DOI:
- 10.21227/qq4m-er38
Abstract
DUC 2002 dataset (https://www-nlpir.nist.gov/projects/duc/guidelines/2002.html) processed through doc2vec (https://github.com/jhlau/doc2vec).
This dataset includes embeddings of the full DUC 2002 dataset in the following configurations:
- Sentence embeddings
- Document embeddings
- Document Set embeddings
It also includes the results of the research presented in "Central embeddings for extractive summarization based on similarity".
To obtain the original DUC 2002 dataset, please consult the official site.
vectoresoracionessets.npy List of sentence embeddings for each document set in DUC 2002
n_oracionessets.npy List containing the number of sentences in each document set
Example [280, 152, 82, ...]
In the example, the first document set contains 280 sentences and the second contains 152. With respect to the vectoresoracionessets.npy file, this means that the first 280 sentences (indexes [0-279]) belong to the first document set, the following 152 sentences (indexes [280-431]) belong to the second document set, and so on.
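The partition described above can be reproduced with cumulative sums. The following is a minimal sketch, assuming that vectoresoracionessets.npy loads as a single 2-D array with one row per sentence and that n_oracionessets.npy loads as a flat list of counts; neither layout is confirmed by this page. The function name `split_by_counts` and the synthetic data are illustrative only.

```python
import numpy as np

def split_by_counts(vectors, counts):
    """Split a flat array of embeddings into one array per document set."""
    counts = np.asarray(counts, dtype=int)
    if counts.sum() != len(vectors):
        raise ValueError("counts must sum to the number of vectors")
    # Cumulative sums give the start index of every set after the first.
    boundaries = np.cumsum(counts)[:-1]
    return np.split(vectors, boundaries)

# Synthetic stand-in data, NOT the real files: 514 sentences of dimension 4.
# With the real files one would presumably use np.load(...) instead
# (possibly with allow_pickle=True if they were saved as object arrays).
counts = [280, 152, 82]
vectors = np.arange(514 * 4, dtype=float).reshape(514, 4)

sets = split_by_counts(vectors, counts)
print([s.shape for s in sets])
```

With these counts, the second element of `sets` holds rows 280 through 431 of the flat array, matching the index ranges given in the example.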
vectoresdocumentossets.npy List of document embeddings for each document set in DUC 2002
n_documentossets.npy List containing the number of documents in each document set
Example [6, 8, 10, ...]
In the example, the first document set contains 6 documents and the second contains 8. With respect to the vectoresdocumentossets.npy file, this means that the first 6 documents (indexes [0-5]) belong to the first document set, the following 8 documents (indexes [6-13]) belong to the second document set, and so on.
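To go the other way, from a global document index to its document set, the same counts can be turned into inclusive index ranges and searched. This is a sketch under the assumption that the counts are a flat list as in the example; `index_ranges` and `set_of` are hypothetical helper names, not part of the dataset.

```python
import numpy as np

def index_ranges(counts):
    """Return (start, end) pairs, end inclusive, one per document set."""
    ends = np.cumsum(counts)
    starts = ends - np.asarray(counts)
    return [(int(s), int(e) - 1) for s, e in zip(starts, ends)]

def set_of(index, counts):
    """Return the zero-based document set that contains a global index."""
    ends = np.cumsum(counts)
    if not 0 <= index < ends[-1]:
        raise IndexError("index out of range")
    # side="right" so that an index equal to ends[k] falls into set k + 1.
    return int(np.searchsorted(ends, index, side="right"))

counts = [6, 8, 10]
ranges = index_ranges(counts)
print(ranges)              # [(0, 5), (6, 13), (14, 23)]
print(set_of(13, counts))  # 1, i.e. the second document set
```

Because the ranges are inclusive, a set with 8 documents that starts at index 6 ends at index 13.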
vectorescentroidessets.npy List of embeddings for each document set in DUC 2002
oracionesGSA.npy List of the indices of the sentences selected in the gold-standard summaries A in DUC 2002
oracionesGSB.npy List of the indices of the sentences selected in the gold-standard summaries B in DUC 2002
CE_D, CE_S, CE_Set Folders containing the central embedding and the list of sentence indices in the generated summary for each proposed method
Ejemplo_set0 Folder containing the example data plotted as a result in the linked article
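As a rough illustration of how sentence embeddings can be compared against a per-set embedding, the sketch below ranks sentences by cosine similarity to a reference vector. It is a generic example and is not claimed to be the method of the linked article; the function name `rank_by_cosine` and the toy vectors are hypothetical, and this page does not state whether the stored sentence indices are global or relative to each document set.

```python
import numpy as np

def rank_by_cosine(sentences, reference):
    """Return sentence indices sorted from most to least similar, plus the scores."""
    sentences = np.asarray(sentences, dtype=float)
    reference = np.asarray(reference, dtype=float)
    norms = np.linalg.norm(sentences, axis=1) * np.linalg.norm(reference)
    # Guard against zero vectors to avoid division by zero.
    sims = (sentences @ reference) / np.where(norms == 0, 1.0, norms)
    # A stable sort keeps the original order among tied sentences.
    return np.argsort(-sims, kind="stable"), sims

# Toy data, NOT taken from the dataset.
toy_sentences = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
toy_reference = np.array([1.0, 1.0])

order, sims = rank_by_cosine(toy_sentences, toy_reference)
print(order.tolist())
```

In the toy data, the third sentence points in the same direction as the reference vector, so it is ranked first.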
Dataset Files
- Document embeddings of the DUC 2002 dataset document-embeddings-of-the-duc-2002-dataset.rar.zip (29.58 MB)
Documentation
Attachment | Size |
---|---|
ReadMe.pdf | 59.15 KB |