Abstract

The "MANUU: Handwritten Urdu OCR Dataset" is an extensive, carefully curated collection intended to advance optical character recognition (OCR) for handwritten Urdu letters, digits, and words. It was compiled methodically to cover a wide variety of handwritten instances, which supports building and evaluating robust OCR models designed for the complexities of the Urdu script. The dataset is publicly accessible. More than 600 writers contributed to it, and this varied group of contributors provides a wide range of handwriting styles, variations, and subtleties, so the dataset closely resembles the diversity found in real handwritten Urdu text.

The proposed dataset has the potential to contribute significantly to the progress of OCR technology. It aims to narrow the gap between manual transcription and automated processing of handwritten Urdu text. Its size, detailed labeling, and broad coverage make it a useful resource for developing OCR solutions for handwritten Urdu material, and it opens new possibilities for extracting information from handwritten Urdu text accurately and efficiently in many applications.

The dataset was collected at Maulana Azad National Urdu University in Hyderabad, Telangana. Before collection, each participant was trained and then instructed to write the given text in a natural manner. The data come from 649 native writers, both male and female, from school and college backgrounds, including left-handed and right-handed writers. Each participant was asked to write six pages of eight lines each, and all participants were given identical source pages. The dataset totals 2596 pages and 172,634 character images, including digits and the various forms of each character: isolated, initial, medial, and final. It also contains 11,682 special-character images and 61,006 word images. This variety of real-world handwriting makes the dataset an important resource for training and testing OCR systems.

Instructions:

Load the dataset directly from the data.npz file.
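As a minimal sketch of this loading step, the Python snippet below opens an .npz archive with NumPy and reports the name, shape, and dtype of each array it contains. It assumes the downloaded archive has already been extracted so that data.npz sits in the working directory. The array names inside the file are not documented on this page, so the helper (a hypothetical name, `summarize_npz`) discovers them instead of assuming any particular key.

```python
import os

import numpy as np


def summarize_npz(path):
    """Return {array_name: (shape, dtype)} for every array in an .npz archive.

    `summarize_npz` is an illustrative helper, not part of the dataset's tooling.
    """
    summary = {}
    # np.load on an .npz file returns a lazy NpzFile; each array is read on access.
    with np.load(path) as archive:
        for name in archive.files:
            arr = archive[name]
            summary[name] = (arr.shape, str(arr.dtype))
    return summary


if __name__ == "__main__":
    # Assumption: data.npz has been extracted into the current directory.
    path = "data.npz"
    if os.path.exists(path):
        for name, (shape, dtype) in summarize_npz(path).items():
            print(name, shape, dtype)
    else:
        print(f"{path} not found; extract the dataset archive first.")
```

Listing the stored arrays first makes it possible to identify which entries hold images and which hold labels before writing any training code against them.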

Funding Agency:

Not applicable

Dataset Files

datasetnpz.rar (140.56 MB)
npz-20241215T152212Z-001.zip (26.83 MB)
model-20241215T153008Z-001.zip (641.32 MB)

Documentation

Attachment	Size
dataset description.docx	1.1 MB
