OCR (Optical Character Recognition); Pattern Recognition; Handwritten Recognition; Public Data

Handwriting Sanskrit Character Recognition

The "Sanskrit Character Dataset" includes 44 classes of handwritten Sanskrit characters, designed to support research in optical character recognition (OCR) and machine learning for ancient languages. Each class represents a unique Sanskrit letter and contains 50 to 80 images. The letters were written in various handwriting styles to ensure diversity, robustness, and real-world applicability.

Categories:

Computer Vision

IITBBS-OCR-Dataset

Odia is a classical and popular language of the Indian subcontinent, used by more than 50 million people. Despite its rich history, popularity, and usefulness, little research effort has been devoted to achieving high accuracy in Odia OCR. To address the paucity of benchmark Odia datasets, our research group at IITBBS has created new handwritten alphanumeric character and numeral datasets for Odia, which are reported here.

Categories:

Image Processing

MANUU: Handwritten Urdu OCR Dataset

The "MANUU: Handwritten Urdu OCR Dataset" is an extensive, carefully curated collection intended to advance optical character recognition (OCR) for handwritten Urdu letters, digits, and words. The dataset was compiled methodically so that it covers a wide variety of handwritten instances. This collection supports the construction and assessment of robust OCR models designed for the complexities of the Urdu script. Ensuring public accessibility of this resource is of utmost importance.

Categories:

Urdu Handwritten Ligature Dataset

The Urdu Handwritten Ligature Dataset (UHLD) is the first unconstrained handwritten Urdu dataset developed for various handwritten Urdu recognition tasks and OCR research problems. The UHLD was written without constraints on paper color, paper type (blank or ruled), ink color, or pen type. It consists of around six thousand handwritten Urdu text lines written by 200 different writers. The UHLD covers six- and seven-character ligatures, whereas previous datasets such as UNHD covered ligatures of only up to five characters.

Categories:

Other

DEVANAGARI CAPTCHA DATASET OF 1 Million Images : A challenge Test

CAPTCHA stands for Completely Automated Public Turing test to tell Computers and Humans Apart. The test is designed so that humans can complete it successfully while current computer systems cannot. It is used in many applications to distinguish human users from machines. Text-based CAPTCHAs are the most common type used on websites. Because most of the letters in these CAPTCHA scripts are in English, the test is challenging for rural residents who speak only their native languages.

Categories:

SHIBR—The Swedish Historical Birth Records: a semi-annotated dataset

This paper presents a digital image dataset of historical handwritten birth records stored in the archives of several parishes across Sweden, together with the corresponding metadata that supports the evaluation of document analysis algorithms’

Categories: