SciBank: A Large Dataset of Annotated Scientific Paper Regions for Document Layout Analysis

Citation Author(s):
Felipe Grijalva, Escuela Politécnica Nacional (Quito, Ecuador) and Faculty of Engineering and Applied Sciences (FICA), Telecommunications Engineering, Universidad de Las Américas (UDLA), Quito 170125, Ecuador
Carla Parra, NuCom, Nuevas Comunicaciones Iberia S.A., Barcelona, 08172, Spain
Marco Gallardo, Escuela Politécnica Nacional (Quito, Ecuador)
Erick Santos, Escuela Politécnica Nacional (Quito, Ecuador)
Byron Acuña, University of Campinas UNICAMP (Campinas, Brazil)
Juan Carlos Rodríguez, Interface and Isolation Products Group, Analog Devices Incorporated, Wilmington, MA 01887 USA
Julio Larco, Departamento de Eléctrica, Electrónica y Telecomunicaciones, Universidad de las Fuerzas Armadas ESPE, Sangolquí - Ecuador
Submitted by: Byron Acuna Acurio
Last updated: Mon, 03/14/2022 - 14:53
DOI: 10.21227/2yex-bt23

Abstract 

Document layout analysis (DLA) plays an important role in identifying and classifying the different regions of digital documents in the context of document understanding tasks. To this end, SciBank provides a considerable amount of annotated data from 74,435 scientific article pages, covering text (abstract, text blocks, caption, keywords, reference, section, subsection, title), tables, figures, and equations (isolated equations and inline equations). Human curators validated that these 12 region classes were properly labeled. SciBank thus offers relevant information extracted from unstructured data, which machine learning models can later use. We aim to complement the currently available DLA datasets, such as PubLayNet, TableBank, and DocBank. Unlike these publicly available datasets, our main contribution is the inclusion of annotated inline equation regions.

Instructions: 
  1. Datasheet_for_SciBank_Dataset.pdf. The Datasheet for this Dataset includes all the relevant details of the composition, collection, preprocessing, cleaning and labeling process used to construct SciBank.
  2. METADATA_FINAL.csv. Each row represents the metadata of one region, with the following fields:
    1. Folder: the name of the folder within the main folder PAPER_TAR
    2. Page: PNG filename of the page image where the region is located
    3. Height_Page, Width_Page: dimensions in pixels of the PNG page image
    4. CoodX, CoodY, Width, Height: coordinates of the region in pixels 
    5. Class: region label
    6. Page_in_pdf: page number within the PDF containing the page of the region
  3. PAPER_TAR. This folder contains the PNG images of all paper pages and the PDF papers, organized in hierarchical subdirectories; both are referenced by METADATA_FINAL.csv.
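
As an illustration of how the metadata fields listed above might be consumed, the following minimal Python sketch reads region rows and converts each one to a corner-based bounding box. It rests on several assumptions not stated here: that the CSV header uses exactly the field names listed above, that CoodX and CoodY give the top-left corner of the region, and that the numeric fields parse as numbers. The sample rows, folder name, and class label strings are invented for the example and are not taken from the dataset; the Datasheet should be consulted for the actual conventions.

```python
import csv
import io
from collections import Counter

# Numeric columns, named as in the field list of METADATA_FINAL.csv.
NUMERIC_FIELDS = ("Height_Page", "Width_Page", "CoodX", "CoodY", "Width", "Height")


def read_regions(csv_file):
    """Yield one dict per region row, converting the numeric fields to int."""
    for row in csv.DictReader(csv_file):
        for key in NUMERIC_FIELDS:
            row[key] = int(float(row[key]))
        yield row


def to_box(region):
    """Return (x_min, y_min, x_max, y_max) in pixels.

    Assumption: CoodX/CoodY is the top-left corner of the region.
    """
    x, y = region["CoodX"], region["CoodY"]
    return (x, y, x + region["Width"], y + region["Height"])


def class_counts(regions):
    """Count how many regions carry each Class label."""
    return Counter(r["Class"] for r in regions)


# Invented sample rows standing in for the real file (hypothetical values).
SAMPLE = """Folder,Page,Height_Page,Width_Page,CoodX,CoodY,Width,Height,Class,Page_in_pdf
paper_0001,page_1.png,2200,1700,100,200,300,40,title,1
paper_0001,page_1.png,2200,1700,150,900,80,20,inline_equation,1
"""

regions = list(read_regions(io.StringIO(SAMPLE)))
boxes = [to_box(r) for r in regions]
counts = class_counts(regions)
```

To run the same functions on the real file, the in-memory sample could be replaced with `open("METADATA_FINAL.csv", newline="")`, assuming the file sits in the working directory.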