Open Access
Three Benchmark Datasets for Scholarly Article Layout Analysis
- Submitted by: Jian Chen
- Last updated: Thu, 05/20/2021 - 08:42
- DOI: 10.21227/326q-bf39
Abstract
This collection contains three benchmark datasets released as part of the scholarly output of an ICDAR 2021 paper:
Meng Ling, Jian Chen, Torsten Möller, Petra Isenberg, Tobias Isenberg, Michael Sedlmair, Robert S. Laramee, Han-Wei Shen, Jian Wu, and C. Lee Giles, Document Domain Randomization for Deep Learning Document Layout Extraction, 16th International Conference on Document Analysis and Recognition (ICDAR) 2021. September 5-10, Lausanne, Switzerland.
The datasets use nine class labels: abstract, algorithm, author, body text, caption, equation, figure, table, and title.
* Dataset 1: CS-150x (1176 pages), an extension of the classical benchmark dataset CS-150 from three classes (figure, table, and caption) to nine classes. CS-150 is described in Clark, C., Divvala, S.: Looking beyond text: Extracting figures, tables and captions from computer science papers. In: Workshops at the 29th AAAI Conference on Artificial Intelligence (2015), https://aaai.org/ocs/index.php/WS/AAAIW15/paper/view/10092.
* Dataset 2: ACL300, 300 randomly sampled articles (2508 pages) from the 55,759 papers scraped from the ACL Anthology website, https://www.aclweb.org/anthology/.
* Dataset 3: VIS300, about 10% (2619 pages) of the 26,350 VIS paper pages, taken from randomly partitioned articles. The source pages were published in Chen, J., Ling, M., Li, R., Isenberg, P., Isenberg, T., Sedlmair, M., Möller, T., Laramee, R.S., Shen, H.W., Wünsche, K., Wang, Q.: VIS30K: A collection of figures and tables from IEEE visualization conference publications. IEEE Trans. Vis. Comput. Graph. 27 (2021), to appear, doi: 10.1109/TVCG.2021.3054916.
This dataset is also available online at https://web.cse.ohio-state.edu/~chen.8028/ICDAR2021Benchmark/.
Image files are in PNG format, and the metadata files are plain text.
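Since each archive combines PNG page images with plain-text metadata files, a first step after unzipping is usually to match every image to its metadata file. The following is a minimal sketch of that step, not the authors' tooling. It assumes that an image and its metadata file sit in the same directory and share a base name, and that the metadata files end in `.txt`; neither is stated on this page, so inspect the extracted archive and adjust. The function name `pair_images_with_metadata` and the `meta_suffix` parameter are hypothetical.

```python
from pathlib import Path


def pair_images_with_metadata(root, meta_suffix=".txt"):
    """Return (image_path, metadata_path_or_None) for every PNG under root.

    Assumption (not confirmed by the dataset page): a page image and its
    metadata file share a directory and base name and differ only in
    their extension.
    """
    pairs = []
    # Walk the extracted archive recursively, in a stable sorted order.
    for image_path in sorted(Path(root).rglob("*.png")):
        meta_path = image_path.with_suffix(meta_suffix)
        # Record None when no matching metadata file exists.
        pairs.append((image_path, meta_path if meta_path.is_file() else None))
    return pairs
```

Calling `pair_images_with_metadata` on the directory where an archive was extracted returns one pair per PNG file. Pairs whose second element is `None` indicate that the naming assumption does not hold for that file.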
Dataset Files
- CS-150x nine classes Test.CS150AndMeta.zip (884.43 MB)
- ACL300 nine classes Test.ACL300AndMeta.zip (293.55 MB)
- VIS300 nine classes Test.VIS300AndMeta.zip (3.06 GB)