SciBank: A Large Dataset of Annotated Scientific Paper Regions for Document Layout Analysis

Name: SciBank: A Large Dataset of Annotated Scientific Paper Regions for Document Layout Analysis
Creator: Byron Acuna Acurio
License: https://creativecommons.org/licenses/by/4.0/

Citation Author(s): Felipe Grijalva (Escuela Politécnica Nacional (Quito, Ecuador) and Faculty of Engineering and Applied Sciences (FICA), Telecommunications Engineering, Universidad de Las Américas (UDLA), Quito 170125, Ecuador)

Carla Parra (NuCom, Nuevas Comunicaciones Iberia S.A., Barcelona, 08172, Spain)

Marco Gallardo (Escuela Politécnica Nacional (Quito, Ecuador))

Erick Santos (Escuela Politécnica Nacional (Quito, Ecuador))

Byron Acuña (University of Campinas UNICAMP (Campinas, Brazil))

Juan Carlos Rodríguez (Interface and Isolation Products Group, Analog Devices Incorporated, Wilmington, MA 01887 USA)

Julio Larco (Departamento de Eléctrica, Electrónica y Telecomunicaciones, Universidad de las Fuerzas Armadas ESPE, Sangolquí - Ecuador)
Submitted by: Byron Acuna Acurio
Last updated: Thu, 07/11/2024 - 19:46
DOI: https://doi.org/10.1109/ACCESS.2021.3125913
Data Format: PNG, PDF, CSV
Research Article Link: Deep Learning in Time-Frequency Domain for Document Layout Analysis
Links: Deep Learning in Time-Frequency Domain for Document Layout Analysis

Implementation of "Deep Learning in Time-Frequency Domain for Document Layout A…

Keywords:

Document Layout Analysis

Document Understanding

Abstract

Document layout analysis (DLA) plays an important role in identifying and classifying the different regions of digital documents in the context of Document Understanding tasks. In light of this, SciBank provides a considerable amount of annotated data covering text (abstract, text blocks, caption, keywords, reference, section, subsection, title), tables, figures, and equations (isolated equations and inline equations) from 74,435 scientific article pages. Human curators validated that these 12 region types were properly labeled. Moreover, SciBank offers relevant information extracted from unstructured data, which can later be used by machine learning models. We aim to complement currently available DLA datasets such as PubLayNet, TableBank, and DocBank. Unlike these publicly available datasets, our main contribution is the inclusion of annotated inline equation regions.

Instructions:

Datasheet_for_SciBank_Dataset.pdf. The datasheet for this dataset includes all relevant details of the composition, collection, preprocessing, cleaning, and labeling process used to construct SciBank.
METADATA_FINAL.csv. Each row holds the metadata for one region, with the following fields:

Folder: the name of the folder within the main folder PAPER_TAR
Page: PNG filename of the page image where the region is located
Height_Page, Width_Page: dimensions, in pixels, of the PNG page image
CoodX, CoodY, Width, Height: position (CoodX, CoodY) and size (Width, Height) of the region, in pixels
Class: region label
Page_in_pdf: number, within the source PDF, of the page that contains the region
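
As an illustration of how these fields might be consumed, the following is a minimal Python sketch, not part of the SciBank distribution. It assumes the file is comma-separated with a header row that uses exactly the field names above, and that (CoodX, CoodY) is the top-left corner of the region; neither is confirmed here, so check both against the datasheet. The sample rows, folder names, and the Region helper are made up for the example.

```python
import csv
import io
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """One row of METADATA_FINAL.csv (hypothetical helper, not shipped with SciBank)."""
    folder: str
    page: str
    page_height: float
    page_width: float
    x: float
    y: float
    width: float
    height: float
    label: str
    page_in_pdf: int

    def box(self):
        # ASSUMPTION: (CoodX, CoodY) is the top-left corner of the region.
        return (self.x, self.y, self.x + self.width, self.y + self.height)

    def normalized_box(self):
        # Scale the box to [0, 1] using the page dimensions from the same row.
        left, top, right, bottom = self.box()
        return (left / self.page_width, top / self.page_height,
                right / self.page_width, bottom / self.page_height)


def read_regions(handle):
    """Parse rows keyed by the column names listed above."""
    return [
        Region(
            folder=row["Folder"],
            page=row["Page"],
            page_height=float(row["Height_Page"]),
            page_width=float(row["Width_Page"]),
            x=float(row["CoodX"]),
            y=float(row["CoodY"]),
            width=float(row["Width"]),
            height=float(row["Height"]),
            label=row["Class"],
            page_in_pdf=int(float(row["Page_in_pdf"])),
        )
        for row in csv.DictReader(handle)
    ]


# Made-up sample rows; real folder names, file names, and labels may differ.
SAMPLE = """Folder,Page,Height_Page,Width_Page,CoodX,CoodY,Width,Height,Class,Page_in_pdf
paper_0001,page_1.png,1000,800,100,50,600,40,title,1
paper_0001,page_1.png,1000,800,100,120,600,200,abstract,1
paper_0001,page_2.png,1000,800,80,300,640,30,title,2
"""

regions = read_regions(io.StringIO(SAMPLE))
counts = Counter(region.label for region in regions)
```

To run this against the real file, pass an open file object instead of the sample, for example `with open("METADATA_FINAL.csv", newline="") as f: regions = read_regions(f)`. Looking up columns by name rather than by position keeps the sketch independent of the column order in the file.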

The PAPER_TAR folder contains the PNG images of all paper pages and the PDF papers, organized in hierarchical subdirectories; both are referenced by METADATA_FINAL.csv.
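
Because the exact nesting of the subdirectories is not described here, one cautious way to resolve a metadata row to its image is to index every PNG under PAPER_TAR once and look pages up by folder and file name. The Python sketch below does this under two assumptions: the Page field includes the .png extension, and each page image sits directly inside a directory named after the Folder field. The function names and the demo directory layout are hypothetical.

```python
import tempfile
from pathlib import Path


def index_pages(root):
    """Map (parent directory name, file name) to the path of every PNG under root."""
    index = {}
    for path in Path(root).rglob("*.png"):
        # ASSUMPTION: the immediate parent directory matches the Folder field.
        index[(path.parent.name, path.name)] = path
    return index


def locate_page(index, folder, page):
    """Return the indexed path for a (Folder, Page) pair, or None if absent."""
    return index.get((folder, page))


# Demo on a throwaway tree; the real PAPER_TAR layout may be nested differently.
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp) / "PAPER_TAR" / "batch_a" / "paper_0001"
    demo_dir.mkdir(parents=True)
    (demo_dir / "page_1.png").write_bytes(b"")

    index = index_pages(Path(tmp) / "PAPER_TAR")
    found = locate_page(index, "paper_0001", "page_1.png")
    missing = locate_page(index, "paper_0001", "page_9.png")
    found_relative = found.relative_to(tmp).as_posix()
```

Building the index once avoids walking the directory tree for every row of the CSV. If two directories at different depths share a name, later entries overwrite earlier ones in this sketch, so a stricter key may be needed for the real data.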