SinOCR and SinFUND - Sinhala OCR and Form Understanding Datasets

Citation Author(s):
Kavishka Gunathilaka, Department of Computer Science and Engineering, University of Moratuwa
Danusha Hewagama, Department of Computer Science and Engineering, University of Moratuwa
Supul Pushpakumara, Department of Computer Science and Engineering, University of Moratuwa
Thanuja Ambegoda, Department of Computer Science and Engineering, University of Moratuwa
Submitted by: Thanuja Ambegoda
Last updated: Thu, 06/20/2024 - 07:35
DOI: 10.21227/hhez-0r18

Abstract 

We present SinOCR and SinFUND, two datasets designed to advance Optical Character Recognition (OCR) and form understanding for the Sinhala language. SinOCR, the first publicly available and, to date, the most extensive dataset for Sinhala OCR, includes 100,000 images of printed text in 200 different Sinhala fonts and 1,135 images of handwritten text covering a wide range of writing styles. SinFUND, the first fully annotated dataset of its kind, comprises 100 diverse, manually filled Sinhala forms and provides a foundation for developing template-free form understanding models. Together, the datasets address the challenges that paper-based documentation poses for low-resource languages and are intended to improve the accuracy and efficiency of digital document processing. Both datasets aim to stimulate further research by serving as benchmarks for the OCR and form understanding communities. Access to them should facilitate the development of more capable models, supporting digital transformation and improved administrative processes in Sri Lanka and potentially in other regions with similar linguistic challenges. The benchmarks will be published in a research article with the same title.

Comments

Hi, is it possible to get this dataset?

Submitted by Dejan Pecevski on Thu, 09/19/2024 - 05:24