LASCID: Latin and Arabic Scene Character Image Dataset

Citation Author(s):
Riadh Harizi (REGIM-Lab, ENIS, University of Sfax, Tunisia)
Rim Walha (REGIM-Lab, ENIS, University of Sfax, Tunisia)
Fadoua Drira (REGIM-Lab, ENIS, University of Sfax, Tunisia)
Submitted by: Rim Walha
Last updated: Thu, 01/18/2024 - 22:34
DOI: 10.21227/akvn-9791
Data Format: PNG

Keywords: Latin and Arabic characters

Abstract

In international contexts, natural scenes may include text in multiple languages. In particular, a Latin and Arabic scene character image dataset is essential for training models to accurately detect and recognize text regions in real-world images. This is crucial for applications such as text translation, image search, content analysis, and autonomous vehicles that need to interpret text in different languages.

The proposed dataset comprises 8034 Latin and Arabic scene character images covering a wide variety of text sizes, styles, fonts, brightness levels, resolutions, and orientations commonly encountered in real-world text-related contexts. Considerable effort went into collecting and labeling 4284 real scene character regions, manually cropped from a set of well-known benchmark datasets, including the ICDAR 2003, 2013, 2015, and 2017 scene text datasets. In addition, our dataset incorporates 1860 synthetic character images from the CharImageDB dataset. Finally, a Generative Adversarial Network (GAN)-based character generator was developed to enhance the diversity of the dataset by creating 1890 synthetic Latin and Arabic character images, so that learning models are exposed to a broader range of visual text information from complex real-world environments.
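
As a quick consistency check on the figures above, the three subset sizes can be tallied to confirm that they add up to the stated total and to see each subset's share. This is only a minimal sketch: the counts come from the abstract, while the subset labels and the `composition` helper are illustrative names and do not reflect the dataset's actual folder or file layout.

```python
# Subset sizes as stated in the abstract.
# The key names are illustrative labels only (an assumption), not dataset folder names.
SUBSET_COUNTS = {
    "real_icdar_crops": 4284,       # manually cropped from ICDAR 2003/2013/2015/2017
    "synthetic_charimagedb": 1860,  # taken from the CharImageDB dataset
    "synthetic_gan": 1890,          # produced by the GAN-based character generator
}


def composition(counts):
    """Return the total image count and each subset's share of it, in percent."""
    total = sum(counts.values())
    shares = {name: 100.0 * n / total for name, n in counts.items()}
    return total, shares


total, shares = composition(SUBSET_COUNTS)
print(f"total images: {total}")
for name, pct in shares.items():
    print(f"{name}: {pct:.1f}%")
```

The total matches the 8034 images reported for the dataset, with real scene crops making up a little over half of the collection.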

Such a Latin and Arabic scene character image dataset is an important resource for advancing research and development in computer vision, OCR, and related fields, helping systems to process and understand textual information in diverse scripts and languages.