Open Access

SciBank: A Large Dataset of Annotated Scientific Paper Regions for Document Layout Analysis

Citation Author(s):
Felipe Grijalva (Escuela Politécnica Nacional (Quito, Ecuador) and Faculty of Engineering and Applied Sciences (FICA), Telecommunications Engineering, Universidad de Las Américas (UDLA), Quito 170125, Ecuador)
Carla Parra (NuCom, Nuevas Comunicaciones Iberia S.A., Barcelona, 08172, Spain)
Marco Gallardo (Escuela Politécnica Nacional (Quito, Ecuador))
Erick Santos (Escuela Politécnica Nacional (Quito, Ecuador))
Byron Acuña (University of Campinas UNICAMP (Campinas, Brazil))
Juan Carlos Rodríguez (Interface and Isolation Products Group, Analog Devices Incorporated, Wilmington, MA 01887 USA)
Julio Larco (Departamento de Eléctrica, Electrónica y Telecomunicaciones, Universidad de las Fuerzas Armadas ESPE, Sangolquí - Ecuador)
Submitted by:
Byron Acuna Acurio
DOI:
https://doi.org/10.1109/ACCESS.2021.3125913

Abstract

Document layout analysis (DLA) plays an important role in identifying and classifying the different regions of digital documents in the context of Document Understanding tasks. In light of this, SciBank seeks to provide a considerable amount of data from text (abstract, text blocks, caption, keywords, reference, section, subsection, title), tables, figures and equations (isolated equations and inline equations) drawn from 74435 pages of scientific articles. Human curators validated that these 12 region types were properly labeled. Moreover, SciBank offers relevant information extracted from unstructured data, which might later be used by machine learning models. We aim to complement the currently available DLA datasets such as PubLayNet, TableBank, and DocBank. Unlike these publicly available datasets, our main contribution is the inclusion of annotated inline equation regions.

 

Instructions:

  1. Datasheet_for_SciBank_Dataset.pdf. The Datasheet for this Dataset includes all the relevant details of the composition, collection, preprocessing, cleaning and labeling process used to construct SciBank.
  2. METADATA_FINAL.csv. Each row holds the metadata for one region, with the following fields:
    1. Folder: the name of the folder within the main folder PAPER_TAR
    2. Page: png filename of the image where the region is located
    3. Height_Page, Width_Page: dimensions in pixels of the png image page
    4. CoodX, CoodY, Width, Height: position (CoodX, CoodY) and size (Width, Height) of the region in pixels
    5. Class: region label
    6. Page_in_pdf: page number within the PDF containing the page of the region
  3. PAPER_TAR. This folder contains the PNG images of all paper pages and the PDF papers, organized in hierarchical subdirectories; both are referenced by METADATA_FINAL.csv.
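
As an illustration of how the fields listed above fit together, the following is a minimal Python sketch that reads rows in the METADATA_FINAL.csv layout and groups the regions by page image. The column names are taken from the field list; everything else is an assumption rather than something stated by the dataset: that (CoodX, CoodY) is the top-left corner of the region, that the image path is PAPER_TAR/<Folder>/<Page>, and the helper names `load_regions` and `group_by_page`. The sample rows are hypothetical values used only to exercise the code. Check the datasheet before relying on any of these conventions.

```python
import csv
import io
from collections import defaultdict
from pathlib import Path


def load_regions(csv_file):
    """Parse region rows from an open file object in the METADATA_FINAL.csv layout."""
    regions = []
    for row in csv.DictReader(csv_file):
        x, y = float(row["CoodX"]), float(row["CoodY"])
        w, h = float(row["Width"]), float(row["Height"])
        page_w, page_h = float(row["Width_Page"]), float(row["Height_Page"])
        regions.append({
            # Assumed directory layout: PAPER_TAR/<Folder>/<Page>
            "image_path": (Path("PAPER_TAR") / row["Folder"] / row["Page"]).as_posix(),
            "label": row["Class"],
            # Assumption: (CoodX, CoodY) is the top-left corner of the region.
            "box_px": (x, y, x + w, y + h),
            # Same box scaled to [0, 1] using the page dimensions.
            "box_norm": (x / page_w, y / page_h, (x + w) / page_w, (y + h) / page_h),
            # Kept as the raw string; its exact formatting is not documented here.
            "page_in_pdf": row["Page_in_pdf"],
        })
    return regions


def group_by_page(regions):
    """Collect all regions that belong to the same page image."""
    pages = defaultdict(list)
    for region in regions:
        pages[region["image_path"]].append(region)
    return dict(pages)


if __name__ == "__main__":
    # Hypothetical rows, only to show the expected column layout.
    sample = io.StringIO(
        "Folder,Page,Height_Page,Width_Page,CoodX,CoodY,Width,Height,Class,Page_in_pdf\n"
        "paper_a,page_1.png,1000,500,50,100,200,300,title,1\n"
        "paper_a,page_1.png,1000,500,0,0,500,1000,figure,1\n"
    )
    for path, items in group_by_page(load_regions(sample)).items():
        print(path, [(r["label"], r["box_px"]) for r in items])
```

With the real file, the same functions could be called as `load_regions(open("METADATA_FINAL.csv", newline=""))`. The normalized boxes are included because many layout-analysis training pipelines expect coordinates relative to the page size, though which convention a given model expects should be confirmed separately.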

 

 
Funding Agency
FAPESP
Grant Number
#2022/16881-5, #2020/03069-5, #2021/11380-5
