OCR (Optical Character Recognition); Pattern Recognition; Handwritten Recognition; Public Data

Handwriting Sanskrit Character Recognition

The "Sanskrit Character Dataset" includes 44 classes of handwritten Sanskrit characters, designed to support research in optical character recognition (OCR) and machine learning for ancient languages. Each class represents a unique Sanskrit letter and contains 50 to 80 images. To ensure diversity, robustness, and real-world applicability, the letters were written in various handwriting styles.

Categories: Computer Vision

310 Views

IITBBS-OCR-Dataset

Odia is a classical and popular language of the Indian subcontinent, used by more than 50 million people. Despite its rich history, popularity, and usefulness, little research effort has gone into achieving high accuracy in Odia OCR. To address the paucity of benchmark Odia datasets, our research group at IITBBS has created new handwritten alphanumeric character and numeral datasets for Odia, which are reported here.

Categories: Image Processing

354 Views

MANUU: Handwritten Urdu OCR Dataset

The "MANUU: Handwritten Urdu OCR Dataset" is an extensive, meticulously curated collection intended to advance optical character recognition (OCR) for handwritten Urdu letters, digits, and words. The dataset was compiled methodically so that it encompasses a wide variety of handwritten instances. It enables the construction and assessment of robust OCR models designed for the complexities of the Urdu script.

Categories: Artificial Intelligence, Image Processing

744 Views

Urdu Handwritten Ligature Dataset

The Urdu Handwritten Ligature Dataset (UHLD) is the first unconstrained handwritten Urdu dataset developed for various handwritten Urdu recognition tasks and OCR research problems. Its samples were written without constraints on paper color, paper type (blank or ruled), ink color, or pen type. The UHLD consists of around six thousand handwritten Urdu text lines written by 200 different writers. It covers six- and seven-character ligatures, whereas previous datasets such as UNHD covered ligatures of only up to five characters.

Categories: Other

10 Views

DEVANAGARI CAPTCHA DATASET OF 1 Million Images : A challenge Test

A CAPTCHA (Completely Automated Public Turing test to tell Computers and Humans Apart) is a test that only humans can complete successfully; current computer systems cannot. It is used in many applications to distinguish human users from machines. Text-based CAPTCHAs are the most common type on websites. Because most of the letters in these CAPTCHAs are English, the test is challenging for rural residents who speak only their native languages.

Categories: Artificial Intelligence, Education and Learning Technologies, Machine Learning, Image Processing

1742 Views

SHIBR—The Swedish Historical Birth Records: a semi-annotated dataset

This paper presents a digital image dataset of historical handwritten birth records stored in the archives of several parishes across Sweden, together with the corresponding metadata that supports the evaluation of document analysis algorithms’

Categories: Machine Learning, Image Processing, Computer Vision

124 Views