MDIW-13 MultiScript Document Database

Citation Author(s): Miguel A. Ferrer, Abhijit Das, Moises Diaz, Cristina Carmona-Duarte, Umapada Pal
Submitted by: Miguel A. Ferrer
Last updated: Fri, 10/25/2019 - 10:52
DOI: 10.21227/656q-hc18
Data Format: images

Keywords: Document Analysis, Multiscript document, Script identification

Abstract

A wide variety of scripts is used to write languages throughout the world. In a multiscript and multi-language environment, it is necessary to know which script is used in every part of a document in order to apply the appropriate document analysis algorithm. Consequently, several approaches for automatic script identification have been proposed in the literature. They can be broadly classified under two categories of techniques: those based on structure and visual appearance, and those based on deep learning. Since most existing techniques have been tested using different datasets and script combinations, a fair comparison between them is difficult. To alleviate this drawback, this paper introduces a multiscript database, which contains both printed and handwritten documents in a wide variety of scripts: Arabic, Bengali, Gujarati, Gurmukhi, Devanagari, Japanese, Kannada, Malayalam, Oriya, Roman, Tamil, Telugu and Thai. The dataset consists of 1,137 documents scanned from local newspapers, as well as handwritten letters and notes. Further, these documents are segmented into lines and words, for a total of 13,983 lines and 86,675 words in the dataset.

Instructions:

The database consists of printed and handwritten documents. We found that the documents of each script contain some sort of watermark, because each script’s documents came from a different native location. As a result, the sheets and some layouts differ depending on their origin. This poses a risk that a classifier, particularly a deep learning-based one, recognizes the document watermark rather than the script.

Segmenting text from the backgrounds of some documents was challenging. Even with state-of-the-art segmentation techniques, the result was not satisfactory: it included a lot of salt-and-pepper noise or black patches, or was missing some parts of the text.

To avoid these drawbacks and provide a dataset for script recognition, all the documents were preprocessed and converted to a white background, while the foreground text ink was equalized. Furthermore, all documents were manually revised. Both original and processed documents are included in the database.

To allow for script recognition at different levels (i.e., document, line and word), each document was divided into lines and each line into words. In the division, a line is defined as an image with 2 or more words, and a word is defined as an image with 2 or more characters.

A. RECORDING OF PRINTED DOCUMENTS

The printed part of the database was recorded from a wide range of local newspapers and magazines to ensure that the samples would be as realistic as possible. The newspaper samples were collected mainly from India (as a wide variety of scripts is used there), Thailand, Japan, the United Arab Emirates and Europe. The database includes 13 different scripts: Arabic, Bengali, Gujarati, Gurmukhi, Devanagari, Japanese, Kannada, Malayalam, Oriya, Roman, Tamil, Telugu and Thai.

The newspapers were scanned at a resolution of 300 dpi. Paragraphs containing only one script were selected for the database (a paragraph here means the headline and body text). Thus, different text sizes, fonts, and styles are included in the database. Further, we tried to ensure that none of the text lines were skewed. All images were saved in PNG format using the script_xxx.png naming convention, where script is an abbreviation or mnemonic for each script and xxx is the file number, starting at 001 for each script.

B. RECORDING OF HANDWRITTEN DOCUMENTS

As with the printed part, the handwritten database includes 13 different scripts: Persian as Arabic, Bengali, Gujarati, Punjabi as Gurmukhi, Devanagari, Japanese, Kannada, Malayalam, Oriya, Roman, Tamil, Telugu and Thai.

Most of the documents were provided by native volunteers capable of writing documents in their respective scripts. Each volunteer wrote a document, scanned it at 300 dpi, and then sent it to us by email. Consequently, the documents had large ink, sheet and scanner quality variations. Some of the Roman sheets came from the IAM handwritten database.

C. BACKGROUND AND INK EQUALIZATION

Due to the broad quality range of the documents, a two-step preprocessing was performed. In the first step, images are binarized by transforming the background into white, while in the second step, an ink equalization is performed.

Because background texture, noise and illumination conditions are the primary factors degrading document image binarization performance, we used an iterative refinement framework to support robust binarization. In the process, the input image is initially transformed into a Bhattacharyya similarity matrix with a Gaussian kernel, which is subsequently converted into a binary image using a maximum entropy classifier. Then, a run-length histogram is used to estimate the character stroke width. After noise elimination, the output image is used for the next round of refinement, and the process terminates when the estimated stroke width is stable. Some documents were nevertheless not correctly binarized, and in such cases, a manual binarization was performed using local thresholds. All the documents were revised and some noise was removed manually.
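The stroke-width step of the refinement loop can be illustrated with a minimal sketch. It assumes a binary image stored as a list of rows with 1 for ink and 0 for background, and it takes the most frequent horizontal run length as the stroke width; that choice of estimator and the function name estimate_stroke_width are assumptions made for illustration, not details given above.

```python
from collections import Counter


def estimate_stroke_width(binary_rows):
    """Return the most frequent horizontal run length of ink pixels (1 = ink).

    Assumption: the mode of the run-length histogram stands in for the
    stroke width; the description above does not specify the estimator.
    """
    runs = Counter()
    for row in binary_rows:
        length = 0
        for pixel in row:
            if pixel:
                length += 1
            elif length:
                runs[length] += 1   # a run of ink just ended
                length = 0
        if length:                  # run that touches the right border
            runs[length] += 1
    if not runs:
        return 0                    # no ink in the image
    # Most frequent run length; ties are broken toward the shorter run.
    return min(runs, key=lambda k: (-runs[k], k))


rows = [
    [0, 1, 1, 0, 0, 1, 1, 0],
    [0, 1, 1, 0, 1, 1, 1, 0],
]
width = estimate_stroke_width(rows)
```

In an iterative scheme of this kind, the estimate would be recomputed after each refinement round and the loop stopped once two consecutive estimates agree.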

For ink equalization, we used an ink deposition model. All the black pixels on the binarized images were considered as ink spots and correlated with a Gaussian of width 0.2 mm. Finally, the image was equalized to duplicate fluid ink.
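To relate the 0.2 mm Gaussian to pixel units, the following sketch converts millimetres to pixels at the 300 dpi scanning resolution and builds a normalized one-dimensional Gaussian kernel. Treating the stated width as the standard deviation and truncating the kernel at three standard deviations are both assumptions, and the helper names are hypothetical.

```python
import math

MM_PER_INCH = 25.4


def mm_to_pixels(length_mm, dpi=300):
    """Convert a physical length in millimetres to pixels at the given dpi."""
    return length_mm * dpi / MM_PER_INCH


def gaussian_kernel_1d(sigma_px, radius=None):
    """Normalized 1-D Gaussian kernel, truncated at 3 sigma by default."""
    if radius is None:
        radius = max(1, math.ceil(3 * sigma_px))
    weights = [math.exp(-(x * x) / (2 * sigma_px ** 2))
               for x in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


# Assumption: the 0.2 mm "width" is interpreted as the standard deviation.
sigma_px = mm_to_pixels(0.2)          # roughly 2.36 pixels at 300 dpi
kernel = gaussian_kernel_1d(sigma_px)
```

Because a Gaussian is separable, the same kernel could be applied along rows and then along columns to approximate the two-dimensional correlation.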

D. TEXT LINE SEGMENTATION

For the lines of a document to be segmented, they must be horizontal; otherwise, a skew correction algorithm must be used [21].

For the line segmentation, each connected object/component of the image is detected, and its convex hull obtained. The result is dilated horizontally in order to connect the objects belonging to the same line and each connected object is labeled. The next step is a line-by-line extraction, performed as follows:

1. Select the top object of the dilated lines and determine its horizontal histogram.

2. If its histogram has a single maximum, then it should be a single line, and the object is used as a mask to segment the line (see Figure 4).

3. If the histogram of the object has several peaks, we assume that there are several lines. To separate them, we follow these steps:

a. The object is horizontally eroded until the top object contains a single peak.

b. The new top object is dilated to recover the original shape and is used as a mask to segment the top line.

4. The top line is deleted, and the process is repeated from step 1 to the end.
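The decision made in steps 2 and 3 can be sketched as follows. The code computes the horizontal histogram (ink pixels per row) of a binary object and counts its peaks. The peak criterion used here, contiguous runs of rows at or above half of the maximum, is an assumption for illustration, since the procedure above does not state how peaks are detected.

```python
def horizontal_histogram(binary_rows):
    """Number of ink pixels in each row (1 = ink, 0 = background)."""
    return [sum(row) for row in binary_rows]


def count_peaks(profile, rel_threshold=0.5):
    """Count contiguous runs of rows with at least rel_threshold * max ink.

    Assumption: this thresholding rule is a stand-in for whatever peak
    detection was actually used to build the database.
    """
    peak = max(profile, default=0)
    if peak == 0:
        return 0
    cutoff = rel_threshold * peak
    count, inside = 0, False
    for value in profile:
        above = value >= cutoff
        if above and not inside:
            count += 1          # entering a new peak region
        inside = above
    return count


single_line = [[0, 0, 0, 0], [1, 1, 1, 1], [1, 1, 1, 0], [0, 0, 0, 0]]
touching_lines = [[1, 1, 1, 1], [0, 1, 0, 0], [1, 1, 1, 1]]

is_single = count_peaks(horizontal_histogram(single_line)) == 1
needs_erosion = count_peaks(horizontal_histogram(touching_lines)) > 1
```

An object with one peak would be used directly as a line mask, while an object with several peaks would go through the erosion and dilation of step 3.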

The segmentation results were manually reviewed, and lines that had been wrongly segmented were manually repaired. The lines were saved as image files and named using the script_xxx_yyy.png format, where yyy is the line number, xxx is the document number and script is the abbreviation for the script, as previously mentioned. Figure 3 presents an example of a segmented line for handwriting. These images are saved in grayscale format.

E. WORD SEGMENTATION

The words were segmented from the lines in two steps, with the first step being completely automatic. Each line was converted to a black and white component, its vertical histogram was obtained, and the points where the histogram was zero were identified as gaps or intersections. Gaps wider than one-third of the line height were labeled as word separations.
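The automatic step can be sketched as below: compute the vertical histogram (ink pixels per column), find the zero-valued gaps, and split wherever a gap is wider than one-third of the line height. The sketch assumes a binary line image given as a list of rows with 1 for ink, and it takes the image height as the line height; the function names are hypothetical.

```python
def vertical_histogram(binary_rows):
    """Number of ink pixels in each column (1 = ink, 0 = background)."""
    return [sum(column) for column in zip(*binary_rows)]


def split_words(binary_rows):
    """Return (start, end) column ranges, end exclusive, one per word."""
    height = len(binary_rows)        # assumption: image height = line height
    profile = vertical_histogram(binary_rows)
    min_gap = height / 3
    words = []
    start = None                     # first ink column of the current word
    last_ink = None                  # most recent column containing ink
    for col, value in enumerate(profile):
        if value == 0:
            continue
        if start is None:
            start = col
        elif col - last_ink - 1 > min_gap:
            # The empty gap before this column is wide enough: close the word.
            words.append((start, last_ink + 1))
            start = col
        last_ink = col
    if start is not None:
        words.append((start, last_ink + 1))
    return words


row = [1, 1, 0, 1, 0, 0, 1, 1]
line_image = [row, row, row]         # height 3, so gaps wider than 1 px split
word_ranges = split_words(line_image)
```

In this toy line, the one-pixel gap is kept inside the first word and the two-pixel gap separates the two words.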

In the second step, failed word segmentations were manually corrected. Each word was saved individually as a black and white image. The files were named using the script_xxx_yyy_zzz.png format, with zzz being the word number of the line script_xxx_yyy. For instance, a file named roma_004_012_004.png contains the black and white image of the fourth word on the 12th line of the 4th document in Roman script.
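The naming convention lends itself to a small parser. The following sketch handles document, line and word file names with one regular expression; it assumes that the script abbreviation consists only of letters and that each numeric field is a run of digits, and the names SampleName and parse_sample_name are hypothetical.

```python
import re
from typing import NamedTuple, Optional


class SampleName(NamedTuple):
    script: str
    document: int
    line: Optional[int]   # None for document-level files
    word: Optional[int]   # None for document- and line-level files


# Assumption: letters-only abbreviation, digit-only numeric fields.
_NAME_PATTERN = re.compile(r"^([A-Za-z]+)_(\d+)(?:_(\d+))?(?:_(\d+))?\.png$")


def parse_sample_name(filename):
    """Split script_xxx[_yyy[_zzz]].png into its components."""
    match = _NAME_PATTERN.match(filename)
    if match is None:
        raise ValueError(f"unexpected file name: {filename!r}")
    script, document, line, word = match.groups()
    return SampleName(
        script,
        int(document),
        int(line) if line is not None else None,
        int(word) if word is not None else None,
    )


sample = parse_sample_name("roma_004_012_004.png")
```

For the example above, the parsed fields are script "roma", document 4, line 12 and word 4.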

In Thai and Japanese, word segmentation is done heuristically because their lines consist of two or three long sequences of characters separated by a larger space. In these scripts, there is generally no gap between two words, and contextual meaning is generally used to decide which characters comprise a word. Since we do not use contextual meaning in the present database, we used the following approach for pseudo-segmentation of Thai and Japanese scripts: for each sequence of characters, the first two characters are the first pseudo-word; the third to the fifth characters are the second pseudo-word; the sixth to the ninth characters are the third pseudo-word, and so on, up to the end of the sequence.
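The pseudo-segmentation rule can be written as a short sketch. It reads "and so on" as group lengths that keep growing by one (2, 3, 4, 5, ...), and it places any leftover characters at the end of a sequence in a final, shorter group; both readings are assumptions, as the rule above does not spell them out.

```python
def pseudo_words(characters):
    """Split a character sequence into groups of length 2, 3, 4, ...

    Assumptions: group sizes keep increasing by one, and leftover
    characters form a final shorter group.
    """
    groups = []
    start, size = 0, 2
    while start < len(characters):
        groups.append(characters[start:start + size])
        start += size
        size += 1
    return groups


# Character positions 1..11 stand in for segmented character images.
groups = pseudo_words(list(range(1, 12)))
```

With eleven characters, this gives groups covering positions 1 to 2, 3 to 5, 6 to 9 and 10 to 11. A real implementation would also need to decide what to do with a single leftover character, since a word is defined above as having two or more characters.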

It should be noted that in this work, our intention is not to develop a new line/word segmentation system. We only use this simple procedure to segment lines and words in a bid to build our database. We thus use a semi-automatic approach, with human verification and correction in case of erroneous segmentation.