SinOCR and SinFUND - Sinhala OCR and Form Understanding Datasets

Citation Author(s):
Kavishka Gunathilaka (Department of Computer Science and Engineering, University of Moratuwa)
Danusha Hewagama (Department of Computer Science and Engineering, University of Moratuwa)
Supul Pushpakumara (Department of Computer Science and Engineering, University of Moratuwa)
Thanuja Ambegoda (Department of Computer Science and Engineering, University of Moratuwa)
Submitted by:
Thanuja Ambegoda
DOI:
10.21227/hhez-0r18

Abstract

We present SinOCR and SinFUND, two datasets designed to advance Optical Character Recognition (OCR) and form understanding for the Sinhala language. SinOCR, the first publicly available and most extensive Sinhala OCR dataset to date, contains 100,000 images of printed text in 200 different Sinhala fonts and 1,135 images of handwritten text covering a wide range of writing styles. SinFUND, the first fully annotated dataset of its kind, comprises 100 diverse, manually filled Sinhala forms and provides a foundation for developing template-free form understanding models. These datasets address the challenges posed by paper-based documentation in low-resource languages and are intended to improve the accuracy and efficiency of digital document processing. Both datasets aim to stimulate further research by providing benchmarks for the OCR and form understanding communities. Access to them should facilitate the development of more capable models, supporting digital transformation and improved administrative processes in Sri Lanka and potentially in other regions with similar linguistic challenges. The benchmarks will be published in a research article with the same title.

Instructions:

The dataset contains the following three subfolders:

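1. SinFUND: 100 manually filled, fully annotated Sinhala forms

2. SinOCR-handwritten: 1,135 images of handwritten Sinhala text

3. SinOCR-printed: 100,000 images of printed Sinhala text in 200 fonts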
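
After extracting the archive, a quick sanity check is to count the image files in each of the three subfolders. The sketch below is only an illustration under stated assumptions: the extraction directory name `sinhala_datasets` is hypothetical, and the image extensions are a guess, since this page does not specify the file formats. Annotation files, if present, are not counted.

```python
from pathlib import Path

# Subfolder names as listed in the instructions above.
SUBFOLDERS = ("SinFUND", "SinOCR-handwritten", "SinOCR-printed")

# Assumption: images use one of these extensions; adjust after inspecting the archive.
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg"}


def count_images(root):
    """Return {subfolder: image count}; a missing subfolder maps to None."""
    root = Path(root)
    counts = {}
    for name in SUBFOLDERS:
        folder = root / name
        if not folder.is_dir():
            counts[name] = None
            continue
        # Search recursively, since the internal layout of each subfolder is not documented here.
        counts[name] = sum(
            1
            for path in folder.rglob("*")
            if path.is_file() and path.suffix.lower() in IMAGE_EXTENSIONS
        )
    return counts


if __name__ == "__main__":
    # "sinhala_datasets" is a hypothetical extraction directory.
    for name, count in count_images("sinhala_datasets").items():
        print(name, "missing" if count is None else count)
```

If the counts differ noticeably from the figures given in the abstract, the extension list is the first thing to check.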