Document Analysis

We present the SinOCR and SinFUND datasets, two comprehensive resources designed to advance Optical Character Recognition (OCR) and form understanding for the Sinhala language. SinOCR, the first publicly available and most extensive Sinhala OCR dataset to date, includes 100,000 images of printed text in 200 different Sinhala fonts and 1,135 images of handwritten text, capturing a wide spectrum of writing styles.


A wide variety of scripts is used to write the world's languages. In a multi-script, multilingual environment, the script used in each part of a document must be known before the appropriate document analysis algorithm can be applied. Consequently, several approaches to automatic script identification have been proposed in the literature. They fall broadly into two categories: techniques based on structure and visual appearance, and techniques based on deep learning.
