OCR Telugu Image Dataset

Citation Author(s):
Kadavakollu Venkateswara Rao, Boon IT Solutions
Submitted by:
Kadavakollu Rao
Last updated:
Fri, 12/08/2023 - 04:31
DOI:
10.21227/a1kv-rj60

Abstract 

The choice of dataset is key for OCR systems. Unfortunately, there are very few works on Telugu character datasets. The work by Pramod et al. covers 500 word categories for training, with an average of 50 images per category rendered in 50 fonts and four styles, each image of size 48x48. They used the most frequently occurring Telugu words but could not cover the entire vocabulary. Later works were character-level. The dataset by Hastie has 460 classes and 160 samples per class, built from 500 images. However, these works have not covered all possible combinations of Vatus and Gunithas. Here, we propose a dataset that takes into account all possible combinations of Vatus and Gunithas, with 17,387 categories and nearly 560 samples per class. All images are of size 32x32. There are 6,757,044 training samples, 972,309 validation samples, and 1,934,190 test samples, which together amount to about 2 GB of images.
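As a minimal sketch of how the character images might be read, the snippet below assumes a folder-per-class layout (e.g. train/<class_id>/<sample>.png); the directory names, file extension, and layout are assumptions for illustration, not part of the published dataset description.

    from pathlib import Path

    import numpy as np
    from PIL import Image

    def load_split(root: str):
        """Load every image under root into a (N, 32, 32) uint8 array with integer labels."""
        images, labels = [], []
        class_dirs = sorted(p for p in Path(root).iterdir() if p.is_dir())
        for label, class_dir in enumerate(class_dirs):
            for img_path in class_dir.glob("*.png"):
                # Force grayscale and the 32x32 size described in the abstract.
                img = Image.open(img_path).convert("L").resize((32, 32))
                images.append(np.asarray(img, dtype=np.uint8))
                labels.append(label)
        return np.stack(images), np.array(labels)

    # Hypothetical usage; the path is a placeholder:
    # x_train, y_train = load_split("telugu_ocr/train")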

Each character has been augmented using 20 different fonts in 5 different sizes, with random rotations, additive Gaussian noise, and spatial transformations. Our dataset is novel because, unlike other datasets, which only take into account the commonly occurring combinations of characters and Vatus, we span the entire Telugu alphabet together with the corresponding Vatus and Gunithas.
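A hedged sketch of the kind of augmentation described above is given below, using Pillow and NumPy for a random rotation, a small random shift, and additive Gaussian noise; the rotation range, shift range, and noise level are illustrative assumptions, since the exact parameters used to build the dataset are not stated.

    import numpy as np
    from PIL import Image

    def augment(img: Image.Image, rng: np.random.Generator) -> Image.Image:
        # Random rotation within +/- 10 degrees and a shift of up to 2 pixels,
        # filling the exposed border with white.
        dx, dy = rng.integers(-2, 3, size=2)
        moved = img.rotate(rng.uniform(-10, 10), translate=(int(dx), int(dy)),
                           fillcolor=255)
        # Additive Gaussian noise on the pixel array.
        arr = np.asarray(moved, dtype=np.float32)
        arr += rng.normal(0.0, 8.0, size=arr.shape)
        return Image.fromarray(np.clip(arr, 0, 255).astype(np.uint8))

    # Hypothetical usage on a single 32x32 grayscale sample:
    # rng = np.random.default_rng(0)
    # augmented = augment(Image.open("sample.png").convert("L"), rng)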

Machine learning with deep neural network (DNN) models is called "deep learning." Several "neurons" are stacked in layers to form the computational model known as a DNN. An input is given to the network at the beginning of the chain of neurons, transformed as it passes through the network, and emitted as the output at the final layer. The gradient descent technique [Barzilai & Borwein, 1988] is commonly used to adjust the network's weights. The training process is organized into epochs and batches: the batch size is the number of samples the network processes before computing the gradients, and the size of the training dataset determines how many batches make up one epoch. A layer consists of a set of neurons, each of which receives the outputs of the previous layer, applies its own weights, and passes its output on to the next layer. The network cannot learn its hyperparameters, so they must be defined and tuned beforehand; these include the number of layers, the number of neurons in each layer, the batch size, the learning rate, and others.
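To make the epoch, batch, and learning-rate terminology concrete, here is a minimal NumPy sketch of mini-batch gradient descent for a linear softmax classifier over flattened 32x32 images; it illustrates the terms above, not the model actually used with this dataset, and all hyperparameter values are assumptions.

    import numpy as np

    def train(x, y, num_classes, epochs=5, batch_size=128, lr=0.1, seed=0):
        rng = np.random.default_rng(seed)
        n, d = x.shape
        w = np.zeros((d, num_classes))
        b = np.zeros(num_classes)
        for epoch in range(epochs):                    # one epoch = one pass over the data
            order = rng.permutation(n)
            for start in range(0, n, batch_size):      # one batch = batch_size samples
                idx = order[start:start + batch_size]
                logits = x[idx] @ w + b
                logits -= logits.max(axis=1, keepdims=True)
                probs = np.exp(logits)
                probs /= probs.sum(axis=1, keepdims=True)
                probs[np.arange(len(idx)), y[idx]] -= 1.0   # gradient of cross-entropy wrt logits
                w -= lr * (x[idx].T @ probs) / len(idx)     # learning-rate step on the weights
                b -= lr * probs.mean(axis=0)
        return w, b

    # Hypothetical usage with images/labels loaded as above:
    # x = images.reshape(len(images), -1) / 255.0
    # w, b = train(x, labels, num_classes=17387)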

The hyperparameters can be fine-tuned or optimized by repeatedly training the network with a small set of initial hyperparameter values and testing its performance on a validation set; the values are adjusted after each cycle to reach the best results in the least amount of time. Different deep learning layers, each with its own strengths, have emerged over the years. A layer such as LSTM (Long Short-Term Memory) [Hochreiter & Schmidhuber, 1997] lets the network "remember" previously computed data (a character in this study) and learn corrections over an entire sequence (a sentence in this study). The gated recurrent unit (GRU) is another popular recurrent layer [Cho et al., 2014]. LSTM and GRU layers can also process the sequence in both directions. This method, called a bidirectional RNN [Schuster & Paliwal, 1997], enables the layer to pick up on broader context by correcting based on the words that come after as well as those that come before; the data is passed through the layer twice, once in the conventional direction and once in reverse (end to beginning). A variety of models, such as the sequence-to-sequence model [Sutskever, Vinyals, & Le, 2014] built with an encoder-decoder architecture [Cho et al., 2014], are constructed from these layers, and the type and complexity of the task determine the necessary number of encoder and decoder layers. A "dropout" regularization layer is added [Srivastava, Hinton, Krizhevsky, Sutskever, & Salakhutdinov, 2014] to prevent overfitting and to address the vanishing gradient problem [Hochreiter, 1998]: several neurons are dropped at the start of each learning cycle so that the network is forced to generalize rather than rely on a one-to-one correspondence between input and output.
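The sketch below shows, in PyTorch, one way these pieces (a bidirectional LSTM encoder, an LSTM decoder, and dropout) can be combined into an encoder-decoder model; it is an illustrative assumption of an architecture, not the specific model of any cited work, and every size and name in it is a placeholder.

    import torch
    import torch.nn as nn

    class Seq2SeqCorrector(nn.Module):
        def __init__(self, vocab_size=512, emb_dim=64, hidden=128, dropout=0.3):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, emb_dim)
            # Bidirectional encoder reads the input character sequence both ways.
            self.encoder = nn.LSTM(emb_dim, hidden, batch_first=True,
                                   bidirectional=True)
            # Decoder predicts the output sequence from the encoder summary.
            self.decoder = nn.LSTM(emb_dim, 2 * hidden, batch_first=True)
            self.dropout = nn.Dropout(dropout)
            self.out = nn.Linear(2 * hidden, vocab_size)

        def forward(self, src, tgt):
            _, (h, c) = self.encoder(self.dropout(self.embed(src)))
            # Merge forward and backward encoder states into the decoder's initial state.
            h = torch.cat([h[0], h[1]], dim=-1).unsqueeze(0)
            c = torch.cat([c[0], c[1]], dim=-1).unsqueeze(0)
            dec_out, _ = self.decoder(self.dropout(self.embed(tgt)), (h, c))
            return self.out(dec_out)

    # Hypothetical usage with random token ids:
    # model = Seq2SeqCorrector()
    # logits = model(torch.randint(0, 512, (8, 20)), torch.randint(0, 512, (8, 20)))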

Deep learning models built from a variety of DNNs have proven highly effective on many NLP tasks in recent years. In particular, these models are very effective at fixing spelling mistakes [Raaijmakers, 2013], translating text [Mokhtar et al., 2018], and learning editing operations [Cherupara, 2014]. Consequently, using deep learning to fix OCR errors seems a promising direction, particularly the encoder-decoder architecture that is central to the Neural Machine Translation (NMT) approach, which maps (i.e., translates) an input sequence (e.g., a sequence of OCRed characters with errors) to an output sequence. The encoder encodes the sequence into a fixed-size context vector that the decoder uses to predict the output (for example, the correct sequence of characters). Most solutions now incorporate neural networks, as shown by the results of a recent competition in the field [Chiron, Doucet, & Moreaux, 2017; Rigaud et al., 2019].
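As a small illustration of framing OCR post-correction as character-level translation, the sketch below turns noisy/clean string pairs into padded index sequences that a model like the one above could consume; the special-token values, maximum length, and the toy Telugu pair are assumptions for illustration only.

    PAD, SOS, EOS = 0, 1, 2   # assumed special-token ids

    def build_vocab(texts):
        # Map every character seen in the corpus to an id, reserving 0-2 for specials.
        chars = sorted({ch for text in texts for ch in text})
        return {ch: i + 3 for i, ch in enumerate(chars)}

    def encode(text, vocab, max_len=32):
        ids = [SOS] + [vocab[ch] for ch in text] + [EOS]
        return ids[:max_len] + [PAD] * max(0, max_len - len(ids))

    # Toy noisy-OCR / ground-truth pair (hypothetical example):
    # noisy, clean = ["తెలుగ"], ["తెలుగు"]
    # vocab = build_vocab(noisy + clean)
    # src = [encode(t, vocab) for t in noisy]   # source: OCRed characters with errors
    # tgt = [encode(t, vocab) for t in clean]   # target: corrected characters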

 

Dataset Content & Diversity:

 

Containing more than 100 images, this Telugu OCR dataset offers a wide distribution of different types of shopping-list images. Within this dataset, you'll find a variety of handwritten text on shopping lists, including sentences, individual item names, quantities, comments, etc. The images showcase distinct handwriting styles, fonts, font sizes, and writing variations.

To ensure diversity and robustness in training your OCR model, we allow only a limited number (fewer than three) of unique images per handwriting style. This ensures we have diverse types of handwriting to train your OCR model on. Stringent measures have been taken to exclude any personally identifiable information (PII) and to ensure that in each image a minimum of 80% of the space contains visible Telugu text.

The images have been captured under varying lighting conditions, including day and night, as well as different capture angles and backgrounds. This diversity helps build a balanced OCR dataset, featuring images in both portrait and landscape modes.

 

We have observed that a dictionary can never be completely populated with all the words because of the complex morphology of the language. When two words are joined, the last character of the first word and the first character of the second word combine into another character while the rest of the characters remain the same. Sandhis and Samasas therefore produce countless word combinations that cannot all be included in a dictionary, which makes a plain word list of limited use. This problem can be overcome by using a sandhi splitter, which breaks a word into its root words; these root words can then be found easily in the dictionary, making word recognition much simpler. The quality of the image also has a large impact on the later stages of character recognition. We have observed that during binarization, characters break where strokes are thin and the background is faded, and if characters from the reverse side of the page show through and the background is light, they are picked up on the front page as well, which leads to a garbled image. This can be overcome by paying more attention to the digital processing of the image. Training the character variants is highly difficult because of the huge number of possible character combinations, and their nuances add to the complexity. One possible solution is the brute-force method of training all variants of characters with a large number of samples per variant instead of training the vowel modifiers alone. The other approach is to post-process the output to find disconnected character variants and join them.
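As a hedged sketch of the kind of image preprocessing suggested above, the snippet below applies a light Gaussian blur (to suppress show-through from the reverse page) followed by Otsu and adaptive thresholding with OpenCV; the kernel size and threshold parameters are assumptions, not a tuned pipeline for this dataset.

    import cv2

    def binarize(path: str):
        gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
        # Light blur reduces faint show-through text before thresholding.
        smoothed = cv2.GaussianBlur(gray, (3, 3), 0)
        # Global Otsu threshold works when the background is uniform.
        _, otsu = cv2.threshold(smoothed, 0, 255,
                                cv2.THRESH_BINARY + cv2.THRESH_OTSU)
        # Adaptive thresholding copes better with faded or uneven backgrounds.
        adaptive = cv2.adaptiveThreshold(smoothed, 255,
                                         cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                         cv2.THRESH_BINARY, 31, 10)
        return otsu, adaptive

    # Hypothetical usage; the filename is a placeholder:
    # otsu_img, adaptive_img = binarize("shopping_list.jpg")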

 

All these shopping lists were written, and the images captured, by native Telugu speakers to ensure text quality, prevent toxic content, and exclude PII. We utilized the latest iOS and Android mobile devices with cameras above 5 MP to maintain image quality. Images in this training dataset are available in both JPEG and HEIC formats.
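Because the images ship in both JPEG and HEIC, readers may need HEIC support; one option (an assumption about tooling, not a requirement of the dataset) is the pillow-heif plugin, which lets Pillow open .heic files directly, as sketched below.

    import pillow_heif
    from PIL import Image

    pillow_heif.register_heif_opener()   # adds HEIC/HEIF decoding to Pillow

    # Hypothetical usage; the filenames are placeholders:
    # img = Image.open("shopping_list_001.heic").convert("RGB")
    # img.save("shopping_list_001.jpg", quality=95)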

 

Conclusion:

 

Leverage this shopping list image OCR dataset to enhance the training and performance of text recognition, text detection, and optical character recognition models for the Telugu language. Your journey to improved language understanding and processing begins here.

Instructions: 

Telugu Image OCR Data Set

Documentation

Attachment: Telugu Image OCR Data Set (17.14 KB)