Skip to main content

Datasets

Open Access

Three Benchmark Datasets for Scholarly Article Layout Analysis

Citation Author(s):
Meng Ling (The Ohio State University)
Jian Chen (The Ohio State University)
Torsten Möller (University of Vienna)
Petra Isenberg (INRIA)
Tobias Isenberg (INRIA)
Michael Sedlmair (University of Stuttgart)
Robert S. Laramee (University of Nottingham)
Han-Wei Shen (The Ohio State University)
Jian Wu (Old Dominion University)
C. Lee Giles (The Pennsylvania State University)
Submitted by:
Jian Chen
Last updated:
DOI:
10.21227/326q-bf39
Data Format:
Links:
No Ratings Yet

Abstract

This dataset contains three benchmark datasets as part of the scholarly output of an ICDAR 2021 paper: 

Meng Ling, Jian Chen, Torsten Möller, Petra Isenberg, Tobias Isenberg, Michael Sedlmair, Robert S. Laramee, Han-Wei Shen, Jian Wu, and C. Lee Giles, Document Domain Randomization for Deep Learning Document Layout Extraction, 16th International Conference on Document Analysis and Recognition (ICDAR) 2021. September 5-10, Lausanne, Switzerland. 

This dataset contains nine class lables: abstract, algorithm, author, body text, caption, equation, figure, table, and title.

* Dataset 1: CS-150x, an extension of the classical benchmark dataset CS-150 from three classes (figure, table, and caption) to nine classes, 1176 pages, Clark, C., Divvala, S.: Looking beyond text: Extracting figures, tables and captions from com- puter science papers. In: Workshops at the 29th AAAI Conference on Artificial Intelligence (2015), https://aaai.org/ocs/index.php/WS/AAAIW15/paper/view/10092.

* Dataset 2: ACL300, 300 randomly sampled articles (or 2508 pages) from the 55,759 papers scraped from the ACL anthology website; https://www.aclweb.org/anthology/.

* Dataset 3: VIS300, about 10% (or 2619 pages) of the document pages in randomly partitioned articles from 26,350 VIS paper pages published in  Chen, J., Ling, M., Li, R., Isenberg, P., Isenberg, T., Sedlmair, M., Möller, T., Laramee, R.S., Shen, H.W., Wünsche, K., Wang, Q.: VIS30K: A collection of figures and tables from IEEE visualization conference publications. IEEE Trans. Vis. Comput. Graph. 27 (2021), to appear doi: 10.1109/TVCG.2021.3054916.

This dataset is also available online at https://web.cse.ohio-state.edu/~chen.8028/ICDAR2021Benchmark/.

Instructions:

Image files are in png formats and the metafiles are in plain text. 

Dataset Files

LOGIN TO ACCESS DATASET FILES
Open Access dataset files are accessible to all logged in users. Don't have a login? Create a free IEEE account. IEEE Membership is not required.