Revisiting Table Detection Datasets for Visually Rich Documents

Citation Author(s): Bin Xiao (University of Ottawa), Murat Simsek (University of Ottawa), Burak Kantarci (University of Ottawa), Ala Abu Alkheir (Lytica Inc.)
Submitted by: Bin Xiao
Last updated: Mon, 04/08/2024 - 15:28
DOI: 10.21227/sh17-hr68
Research Article Link: Revisiting Table Detection Datasets for Visually Rich Documents

Categories: Artificial Intelligence

Keywords: table detection

Abstract

Table detection is a fundamental task in visually rich document understanding. However, the popular public datasets widely used in related studies have inherent limitations, including noisy and inconsistent samples, limited training samples, and limited data sources. These limitations make the datasets unreliable for evaluating model performance, so results on them do not reflect the actual capacity of models. Therefore, this study revisits several open-source datasets with high-quality annotations, identifies and cleans the noise, and aligns their annotation definitions to merge them into a larger dataset, Open-Tables. Moreover, to enrich the data sources, we propose a new dataset, ICT-TD, built from the PDF files of Information and Communication Technologies (ICT) commodities, a different domain containing unique samples that rarely appear in open-source datasets. To ensure label quality, we annotated the dataset manually under the guidance of a domain expert. The proposed dataset is challenging and representative of actual cases in a business context. We built strong baselines using various state-of-the-art object detection models. Experimental results show that the domain differences among existing open-source datasets are minor despite their different data sources. Open-Tables and ICT-TD provide a more reliable evaluation of models because of their high-quality, consistent annotations, and they are also more suitable for cross-domain settings. In the cross-domain setting, benchmark models trained on the cleaned Open-Tables dataset can achieve a 0.6%-2.6% higher weighted average F1 than the corresponding models trained on the noisy version of Open-Tables.
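The weighted average F1 mentioned above is not defined on this page. As a rough illustration only, the sketch below assumes the IoU-weighted definition used in the ICDAR 2019 cTDaR competition, where the F1 score at each IoU threshold is weighted by the threshold itself. The function name, the thresholds, and the sample scores are hypothetical and are not results from the paper.

```python
def weighted_average_f1(f1_by_iou):
    """Return the IoU-weighted mean of F1 scores.

    f1_by_iou maps an IoU threshold (e.g. 0.6) to the F1 score measured
    at that threshold. Each F1 is weighted by its threshold, so stricter
    thresholds count more (assumed ICDAR 2019 cTDaR-style definition).
    """
    total_weight = sum(f1_by_iou)  # summing a dict sums its keys (the thresholds)
    if total_weight == 0:
        raise ValueError("at least one non-zero IoU threshold is required")
    weighted_sum = sum(iou * f1 for iou, f1 in f1_by_iou.items())
    return weighted_sum / total_weight


if __name__ == "__main__":
    # Hypothetical F1 scores at four IoU thresholds, for illustration only.
    scores = {0.6: 0.95, 0.7: 0.93, 0.8: 0.90, 0.9: 0.80}
    print(f"weighted average F1: {weighted_average_f1(scores):.3f}")
```

Under this assumed definition, a drop in F1 at a strict threshold such as 0.9 lowers the score more than the same drop at 0.6, which rewards tightly localized table boxes.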

Instructions:

This is the dataset proposed in the paper Revisiting Table Detection Datasets for Visually Rich Documents (https://arxiv.org/abs/2305.04833). Please cite the following paper if you find this dataset helpful.

@article{xiao2023revisiting,
  title={Revisiting table detection datasets for visually rich documents},
  author={Xiao, Bin and Simsek, Murat and Kantarci, Burak and Alkheir, Ala Abu},
  journal={arXiv preprint arXiv:2305.04833},
  year={2023}
}