SciBank: A Large Dataset of Annotated Scientific Paper Regions for Document Layout Analysis
Document layout analysis (DLA) plays an important role in identifying and classifying the different regions of digital documents in the context of document understanding tasks. To this end, SciBank provides a large set of annotated regions from 74435 scientific article pages: text (abstract, text blocks, caption, keywords, reference, section, subsection, title), tables, figures, and equations (isolated equations and inline equations). Human curators validated that these 12 region classes were properly labeled. Moreover, SciBank offers relevant information extracted from unstructured data, which might later be used by machine learning models. We aim to complement currently available DLA datasets such as PubLayNet, TableBank, and DocBank. Unlike these publicly available datasets, our main contribution is the inclusion of annotated inline equation regions.
- Datasheet_for_SciBank_Dataset.pdf. The Datasheet for this Dataset includes all the relevant details of the composition, collection, preprocessing, cleaning and labeling process used to construct SciBank.
- METADATA_FINAL.csv. Each row holds the metadata of one region, with the following fields:
  - Folder: the name of the folder within the main folder PAPER_TAR
  - Page: png filename of the image where the region is located
  - Height_Page, Width_Page: dimensions in pixels of the png image page
  - CoodX, CoodY, Width, Height: coordinates of the region in pixels
  - Class: region label
  - Page_in_pdf: page number within the PDF containing the page of the region
- PAPER_TAR. This folder contains the PNG images of all paper pages and the PDF papers, organized in hierarchical subdirectories; both are referenced by METADATA_FINAL.csv.
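
The field descriptions above can be turned into a small loading routine. The following is a minimal sketch, not the authors' tooling, and it rests on several assumptions: that METADATA_FINAL.csv has a header row using exactly the field names listed above, that CoodX and CoodY give the top-left corner of the region, and that a page image can be found at PAPER_TAR/<Folder>/<Page>. The sample rows, folder names, and class label spellings are made up for illustration only.

```python
import csv
import io
from collections import Counter
from pathlib import PurePosixPath


def read_regions(handle):
    """Parse metadata rows into dicts with an image path, a label, and a pixel box."""
    regions = []
    for row in csv.DictReader(handle):
        # Assumption: CoodX/CoodY are the top-left corner of the region.
        x, y = int(float(row["CoodX"])), int(float(row["CoodY"]))
        w, h = int(float(row["Width"])), int(float(row["Height"]))
        regions.append({
            # Assumption: images live directly under PAPER_TAR/<Folder>/.
            "image": str(PurePosixPath("PAPER_TAR") / row["Folder"] / row["Page"]),
            "label": row["Class"],
            # (left, top, right, bottom) in pixels.
            "box": (x, y, x + w, y + h),
        })
    return regions


def count_by_class(regions):
    """Count how many regions carry each label."""
    return Counter(region["label"] for region in regions)


# Hypothetical sample rows; real values come from METADATA_FINAL.csv.
SAMPLE = """Folder,Page,Height_Page,Width_Page,CoodX,CoodY,Width,Height,Class,Page_in_pdf
paper_0001,page_1.png,2200,1700,150,200,1400,90,title,1
paper_0001,page_1.png,2200,1700,150,320,1400,400,abstract,1
paper_0001,page_2.png,2200,1700,300,500,120,30,inline_equation,2
"""

regions = read_regions(io.StringIO(SAMPLE))
print(regions[0]["image"], regions[0]["box"])
print(count_by_class(regions))
```

To run this on the real file, replace `io.StringIO(SAMPLE)` with `open("METADATA_FINAL.csv", newline="")`. The `box` tuple follows the (left, upper, right, lower) order that image libraries such as Pillow expect for cropping, so a region could be cut out of its page image once the path assumption has been checked against the actual PAPER_TAR layout.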