Handwritten Devanagari Characters Dataset (Vowels, Consonants and Numerals) of 44,000 Images for Devanagari CAPTCHA Generation and Recognition

Citation Author(s):
SANJAY PATE, Nanasaheb Y. N. Chavan Arts, Science and Commerce College, Chalisgaon
Prof. Dr. Rakesh Ramteke, School of Computer Sciences, Kavayitri Bahinabai Chaudhari North Maharashtra University, Jalgaon
Submitted by:
SANJAY PATE
Last updated:
Mon, 07/08/2024 - 15:59
DOI:
10.21227/9zpv-3194

Abstract 

Devanagari is a phonetic script that originated from ancient Brahmi. It is the foundation of various Indian languages. According to 2022 data, Hindi, which is written in the Devanagari script, is spoken by over 342 million people worldwide and ranks third among the top 45 languages. The Devanagari script has approximately 11 vowels, 33 consonants, and 10 numerals. It has no upper- or lower-case letters and is written from left to right.

The dataset is built from 44 distinct handwritten Devanagari characters: 4 vowels, 30 consonants, and 10 numerals. These were selected from a set of 63 Devanagari characters; the other 19 characters were excluded to avoid confusing human readers and to maintain usability.

Numerals (10): ० १ २ ३ ४ ५ ६ ७ ८ ९

Vowels (4): अ इ उ ए

Consonants (30): क ख ग घ च छ ज झ ट ड ढ ण त थ द ध न प ब भ म य र ल व श ष स ह ळ
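For recognition experiments, the 44 classes listed above can be written down as a label set. The following is a minimal sketch only: the names `DEVANAGARI_CLASSES`, `ALL_LABELS`, and `LABEL_TO_INDEX` are illustrative assumptions, and the index order is arbitrary and not taken from the dataset's own files.

```python
# Character classes copied from the listing above.
# The dictionary name and its keys are illustrative, not part of the dataset.
DEVANAGARI_CLASSES = {
    "numerals": list("०१२३४५६७८९"),
    "vowels": list("अइउए"),
    "consonants": list("कखगघचछजझटडढणतथदधनपबभमयरलवशषसहळ"),
}

# Flat list of all labels and an integer index per character.
# The ordering here is an assumption; the dataset's CSV files may differ.
ALL_LABELS = [ch for group in DEVANAGARI_CLASSES.values() for ch in group]
LABEL_TO_INDEX = {ch: i for i, ch in enumerate(ALL_LABELS)}

for name, chars in DEVANAGARI_CLASSES.items():
    print(name, len(chars))
print("total", len(ALL_LABELS))
```

Each character above is a single Unicode code point, so `list()` splits the strings into one entry per class.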

The data was gathered on a canvas application written in Python. The canvas code was distributed to more than one hundred (100+) native users of the Devanagari script of all ages, including both left- and right-handed computer users. Each user wrote 440 characters (44 characters, 10 samples each) on the canvas and saved them on their own computer. All user data was then compiled. Characters are drawn in black on a white background. One benefit of using a canvas is that the images contain no noise. In total, 44,000 (forty-four thousand) character images were collected.

The data was then pre-processed, rescaled, and made publicly available. After occluded images and scribbles were removed, the final dataset contains 44,000 digitized images: 10,000 numerals (10 numerals × 1,000 each), 4,000 vowels (4 vowels × 1,000 each), and 30,000 consonants (30 consonants × 1,000 each). Each image is a grayscale .jpeg file with a resolution of 65 by 65 pixels and requires 1.5 KB of storage. As a result, although the data itself requires just 50 MB of storage, it occupies 162 MB of disk space.
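The image counts stated above can be checked with a few lines of arithmetic. This is only a sketch of the bookkeeping; the variable names are illustrative, and the number of writers it derives is the minimum implied by the totals, not a figure reported by the authors beyond "100+".

```python
# Figures taken from the description above; names are illustrative.
CLASSES_PER_GROUP = {"numerals": 10, "vowels": 4, "consonants": 30}
IMAGES_PER_CLASS = 1000
SAMPLES_PER_WRITER_PER_CLASS = 10

# Images per group: number of classes times images per class.
images_per_group = {
    group: n_classes * IMAGES_PER_CLASS
    for group, n_classes in CLASSES_PER_GROUP.items()
}
total_classes = sum(CLASSES_PER_GROUP.values())
total_images = sum(images_per_group.values())

# Each writer contributes 10 samples of every class.
images_per_writer = total_classes * SAMPLES_PER_WRITER_PER_CLASS

# Minimum number of writers needed to reach the stated total.
min_writers = total_images // images_per_writer

print(images_per_group)
print(total_classes, total_images, images_per_writer, min_writers)
```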

 

The data was manually organized into the appropriate folders. In addition, CSV (comma-separated values) files with training sets (70%) and testing sets (30%) are available for the dataset. A .zip file containing the entire set of images is also available.
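The description does not say how the 70/30 split was produced. As one possible approach, a per-class (stratified) split can be sketched with the Python standard library. The function name `stratified_split`, the seed, and the file names below are hypothetical and do not reflect the actual layout of the archive or its CSV files.

```python
import random


def stratified_split(samples, train_fraction=0.7, seed=0):
    """Split (path, label) pairs into train and test lists, class by class."""
    by_label = {}
    for path, label in samples:
        by_label.setdefault(label, []).append(path)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, test = [], []
    for label in sorted(by_label):
        paths = sorted(by_label[label])
        rng.shuffle(paths)
        cut = round(len(paths) * train_fraction)
        train += [(p, label) for p in paths[:cut]]
        test += [(p, label) for p in paths[cut:]]
    return train, test


# Hypothetical file names: one folder per numeral, 1,000 images each.
demo_labels = list("०१२३४५६७८९")
demo_samples = [
    (f"{label}/{i:04d}.jpeg", label)
    for label in demo_labels
    for i in range(1000)
]
demo_train, demo_test = stratified_split(demo_samples)
print(len(demo_train), len(demo_test))
```

Splitting within each class keeps the 70/30 ratio identical for every character, which suits a dataset like this one where all classes have the same number of images.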

Comments

This Devanagari character dataset will be helpful for research scholars working on Devanagari OCR, Devanagari character recognition, and, most importantly, handwritten CAPTCHA recognition and design.

Submitted by SANJAY PATE on Thu, 10/06/2022 - 11:45

The beauty of this Devanagari character dataset is that it is noiseless.

Submitted by SANJAY PATE on Thu, 10/06/2022 - 11:51