Standard Dataset
Handwritten Devanagari Characters Dataset (Vowels, Consonants and Numerals) of 44,000 Images for Devanagari CAPTCHA Generation and Recognition
- Submitted by: SANJAY PATE
- Last updated: Mon, 07/08/2024 - 15:59
- DOI: 10.21227/9zpv-3194
Abstract
Devanagari is a phonetic script that originated from ancient Brahmi and is the foundation of the writing systems of various Indian languages. According to data from 2022, Hindi, which is written in the Devanagari script, is spoken by over 342 million people worldwide and ranks third among the top 45 languages. The Devanagari script has approximately 11 vowels, 33 consonants, and 10 numerals. It has no upper- or lower-case letters and is written from left to right.
The dataset covers 44 distinct handwritten Devanagari characters (4 vowels, 30 consonants, and 10 numerals) selected from a set of 63 Devanagari characters. The other 19 characters were excluded to avoid confusing human readers and to maintain usability.
Category | Count | Characters |
---|---|---|
Numerals | 10 | ० १ २ ३ ४ ५ ६ ७ ८ ९ |
Vowels | 4 | अ इ उ ए |
Consonants | 30 | क ख ग घ च छ ज झ ट ड ढ ण त थ द ध न प ब भ म य र ल व श ष स ह ळ |
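For recognition experiments, the 44 character classes listed above can be written down as a label map. The following is a minimal sketch: the characters are copied from the list above, while the variable names and the integer label order are illustrative assumptions and are not taken from the dataset files.

```python
# Character classes copied from the list above. The names and the
# integer label order below are illustrative assumptions.
NUMERALS = "० १ २ ३ ४ ५ ६ ७ ८ ९".split()
VOWELS = "अ इ उ ए".split()
CONSONANTS = (
    "क ख ग घ च छ ज झ ट ड ढ ण त थ द ध न प ब भ म य र ल व श ष स ह ळ".split()
)

CLASSES = NUMERALS + VOWELS + CONSONANTS

# Hypothetical label map: character -> integer class index.
LABEL_OF = {char: index for index, char in enumerate(CLASSES)}

IMAGES_PER_CLASS = 1000  # stated in the dataset description
TOTAL_IMAGES = len(CLASSES) * IMAGES_PER_CLASS

print(len(CLASSES), TOTAL_IMAGES)
```

The class count and the total follow directly from the description: 10 + 4 + 30 = 44 classes, and 44 × 1,000 = 44,000 images.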
The data was collected on a drawing canvas written in Python. The canvas code was distributed to more than one hundred (100+) native users of the Devanagari script of all ages, including both left- and right-handed computer users. Each user wrote 440 characters (44 characters, 10 samples each) on the canvas and saved them on their own computer. All user data was then compiled. Characters are drawn in black on a white background. An advantage of using a canvas is that the images contain no noise. In total, 44,000 (forty-four thousand) character images were collected.
The data was then pre-processed, scaled, and placed in a publicly accessible location. After occluded images and scribbles were removed, the final dataset contains 44,000 digitized images: 10,000 numerals (10 numerals × 1,000 each), 4,000 vowels (4 vowels × 1,000 each), and 30,000 consonants (30 consonants × 1,000 each). Each image is a grayscale .jpeg file with a resolution of 65 by 65 pixels and requires 1.5 kB of storage. As a result, although just 50 MB of data storage was necessary, 162 MB of disk space was needed.
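When planning to load the images into memory, the decoded size matters more than the JPEG file size. The arithmetic below is a rough sketch that assumes each image decodes to one byte per pixel (8-bit grayscale); the image dimensions and the image count come from the description, and the constant names are illustrative.

```python
WIDTH = HEIGHT = 65      # pixels, from the description
BYTES_PER_PIXEL = 1      # assumption: 8-bit grayscale after decoding
TOTAL_IMAGES = 44_000    # from the description

bytes_per_image = WIDTH * HEIGHT * BYTES_PER_PIXEL
total_bytes = bytes_per_image * TOTAL_IMAGES

print(bytes_per_image)          # 4225
print(total_bytes / 1_000_000)  # 185.9 (decimal megabytes)
```

Under that assumption, holding every decoded image in memory at once takes roughly 186 MB, before any conversion to a wider data type such as 32-bit floats.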
The data was manually organized into the appropriate folders. In addition, .csv (Comma-Separated Values) files with training sets (70%) and testing sets (30%) are available for the dataset. A .zip file containing the entire set of images is also available.
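The layout of the provided .csv files is not described here, so the following is only a sketch of how a comparable 70/30 split could be produced from a list of (path, label) pairs. The per-class (stratified) splitting, the function names, and the `path` and `label` column names are assumptions, not the authors' procedure.

```python
import csv
import random
from collections import defaultdict


def stratified_split(samples, train_fraction=0.7, seed=0):
    """Split (path, label) pairs so each label keeps the same ratio."""
    by_label = defaultdict(list)
    for path, label in samples:
        by_label[label].append(path)

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    train, test = [], []
    for label in sorted(by_label):
        paths = sorted(by_label[label])
        rng.shuffle(paths)
        cut = round(len(paths) * train_fraction)
        train += [(p, label) for p in paths[:cut]]
        test += [(p, label) for p in paths[cut:]]
    return train, test


def write_csv(rows, csv_path):
    """Write rows under a hypothetical two-column header."""
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["path", "label"])
        writer.writerows(rows)
```

With 1,000 images per character, this yields 700 training and 300 testing images for each of the 44 classes.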
The entire dataset is organized into two groups.
The first group holds the images of the 34 selected Devanagari characters (4 vowels and 30 consonants). Each character has its own folder, giving 34 folders, and each folder contains 1,000 grayscale JPEG images.
The second group holds the images of the 10 selected Devanagari numerals. Each numeral has its own folder, giving 10 folders, and each folder contains 1,000 grayscale JPEG images.
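After extracting the archive, the folder layout described above can be checked with a short script. This is a sketch only: the actual folder names inside the archive are not given here, so the functions take the root directory of one group as a parameter, and the helper names are hypothetical.

```python
from pathlib import Path


def count_images(root):
    """Return {folder name: number of .jpeg/.jpg files} under root."""
    counts = {}
    for folder in sorted(Path(root).iterdir()):
        if folder.is_dir():
            counts[folder.name] = sum(
                1 for f in folder.iterdir()
                if f.suffix.lower() in {".jpeg", ".jpg"}
            )
    return counts


def unexpected_folders(counts, expected=1000):
    """List folders whose image count differs from the expected value."""
    return {name: n for name, n in counts.items() if n != expected}
```

Run against the character group, `count_images` should report 34 folders, and against the numeral group 10 folders; an empty result from `unexpected_folders` indicates that every folder holds 1,000 images.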
A strength of the dataset is that the images were collected from computer-literate users on a canvas written in Python, so the images contain no noise.
We strongly advise interested users to use this dataset for image processing applications such as CAPTCHA generation and Devanagari character recognition.
The project's official journal publication gives a more thorough description of the test environment and hardware setup.
Dataset Files
- DEVANAGARI CHARACTER DATASET 4 VOWELS 30 CONSONANATS 10 NUMERALS DEVA-C-SET.zip (46.92 MB)
- SCRIPT WRITTEN IN PYTHON SCRIPT.zip (11.43 kB)
Documentation
Attachment | Size |
---|---|
instruction : DEVANAGARI CHARACTER DATASET | 18.33 KB |
Comments
The Devanagari character dataset will be helpful for research scholars working on Devanagari OCR, Devanagari character recognition, and, most importantly, handwritten CAPTCHA recognition and design.
The strength of this Devanagari character dataset is that it is noiseless.