TamilCOCO Dataset

Citation Author(s):
Jothi Prakash V
Arul Antran Vijay S
Balamurugan R
Sudharsan S
Sanjai T
Ahamed Sameer A
Submitted by:
Jothi Prakash V
Last updated:
Thu, 01/09/2025 - 10:36
DOI:
10.21227/y8f2-nk02
Data Format:
License:

Abstract 

TamilCOCO is a novel bilingual image captioning dataset specifically designed for Tamil, a low-resource language. This dataset facilitates research in image captioning, cross-lingual natural language processing, and culturally adapted AI applications.

Dataset Statistics


  • Total Rows: 305,340
  • Total Columns: 3
  • Unique Images: 63,062
  • Unique English Captions: 303,036
  • Unique Tamil Captions: not reported (the same Tamil translation can appear in more than one row)
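
The counts above can be recomputed once the data is available. The following is a minimal sketch, assuming the dataset is distributed as a CSV file with a header row that uses the three column names documented on this page. The file layout is an assumption (no files are attached to this record), and the sample rows are invented purely for illustration.

```python
import csv
import io

# Tiny in-memory stand-in for the real file. These rows are invented for
# illustration and are NOT taken from TamilCOCO.
SAMPLE_CSV = """image_id,caption_english,raw_caption_tamil
101,A dog runs on the beach.,ஒரு நாய் கடற்கரையில் ஓடுகிறது.
101,A dog is running by the sea.,ஒரு நாய் கடற்கரையில் ஓடுகிறது.
102,Two children play cricket.,இரண்டு குழந்தைகள் கிரிக்கெட் விளையாடுகிறார்கள்.
"""


def summarize(rows):
    """Return row, column, and uniqueness counts for an iterable of row dicts."""
    rows = list(rows)
    return {
        "total_rows": len(rows),
        "total_columns": len(rows[0]) if rows else 0,
        "unique_images": len({r["image_id"] for r in rows}),
        "unique_english_captions": len({r["caption_english"] for r in rows}),
        "unique_tamil_captions": len({r["raw_caption_tamil"] for r in rows}),
    }


# For a real file, replace io.StringIO(...) with
# open(path, newline="", encoding="utf-8"); the path is up to the user.
stats = summarize(csv.DictReader(io.StringIO(SAMPLE_CSV)))
print(stats)
```

In the sample, two English captions share one Tamil translation, which is the kind of repetition that makes the Tamil count differ from the English one.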

Column Descriptions


  1. image_id: Identifier of the image that the captions describe. It is unique per image, not per row: an image with several captions appears in several rows.
  2. caption_english: The original English caption describing the image.
  3. raw_caption_tamil: The corresponding Tamil caption, translated and culturally adapted for relevance.
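
With 305,340 rows over 63,062 images, each image carries about 4.8 caption pairs on average, so most pipelines will want the rows grouped by image_id. A minimal sketch of that grouping follows; the `CaptionPair` type and `group_by_image` helper are hypothetical names introduced here, and the example rows are invented.

```python
from collections import defaultdict
from typing import NamedTuple


class CaptionPair(NamedTuple):
    """One English caption and its Tamil counterpart (hypothetical helper type)."""
    english: str
    tamil: str


def group_by_image(rows):
    """Map each image_id to the list of caption pairs that describe it."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["image_id"]].append(
            CaptionPair(row["caption_english"], row["raw_caption_tamil"])
        )
    return dict(grouped)


# Invented rows using the documented column names.
rows = [
    {"image_id": "101", "caption_english": "A dog runs on the beach.",
     "raw_caption_tamil": "ஒரு நாய் கடற்கரையில் ஓடுகிறது."},
    {"image_id": "101", "caption_english": "A dog is running by the sea.",
     "raw_caption_tamil": "ஒரு நாய் கடற்கரையில் ஓடுகிறது."},
    {"image_id": "102", "caption_english": "Two children play cricket.",
     "raw_caption_tamil": "இரண்டு குழந்தைகள் கிரிக்கெட் விளையாடுகிறார்கள்."},
]

captions = group_by_image(rows)
for image_id, pairs in captions.items():
    print(image_id, len(pairs), "caption pair(s)")
```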

Features


  • Language Pair: English-Tamil
  • Data Type: Textual descriptions (image captions)
  • Multilingual Support: Bilingual captions in English and Tamil, enabling cross-lingual applications.
  • Cultural Adaptation: Tamil captions incorporate idiomatic expressions and culturally specific terms for enhanced relevance.
Instructions: 

Methodology


Annotation Framework


  1. Semi-Automated Translation: Initial Tamil translations are generated with multilingual models such as mBART.
  2. Cultural Adaptation Module: Refines the translated captions for cultural relevance and semantic accuracy.
  3. Iterative Validation: Community and expert reviewers check the captions for linguistic and cultural fidelity.

Evaluation Metrics


  • Standard Metrics: BLEU, METEOR, CIDEr, SPICE
  • Novel Metric: Cultural Relevance Score (CRS) for assessing cultural adaptation quality.
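
To make the first of the standard metrics concrete, here is a simplified, single-reference, sentence-level BLEU sketch: clipped n-gram precisions combined by a geometric mean and scaled by a brevity penalty. It assumes whitespace tokenization and no smoothing, which is a simplification (Tamil is morphologically rich, and tokenization choices affect scores), so it is meant for intuition rather than as a replacement for an established evaluation toolkit. The Cultural Relevance Score is not sketched because its definition is not given here.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, reference, n):
    """Clipped n-gram precision: each candidate n-gram is credited at most
    as many times as it occurs in the reference."""
    cand = ngrams(candidate, n)
    ref = ngrams(reference, n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    return clipped / total


def sentence_bleu(candidate, reference, max_n=2):
    """Unsmoothed BLEU for one candidate against one reference."""
    precisions = [modified_precision(candidate, reference, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0  # without smoothing, any zero precision zeroes the score
    log_mean = sum(math.log(p) for p in precisions) / max_n
    c, r = len(candidate), len(reference)
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    return brevity_penalty * math.exp(log_mean)


reference = "ஒரு நாய் கடற்கரையில் ஓடுகிறது".split()
candidate = "ஒரு நாய் ஓடுகிறது".split()
print(round(sentence_bleu(candidate, reference), 3))
```

The clipping step is what prevents a caption that repeats one correct word from scoring well, and the brevity penalty discourages captions that are much shorter than the reference.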

Collaborators


This project was made possible by contributions from:

  • Jothi Prakash V, Arul Antran Vijay S, Balamurugan R, Sudharsan S, Sanjai T, and Ahamed Sameer A.

Acknowledgments


We extend our heartfelt thanks to the community of Tamil-speaking volunteers and annotators who meticulously reviewed and validated captions, ensuring both cultural and linguistic accuracy. Special recognition is given to the cultural consultants who advised on idiomatic expressions and traditional references.

Usage


This dataset is suitable for:

  • Training and evaluating image-captioning models in Tamil.
  • Research in cross-lingual and low-resource language processing.
  • Developing culturally aware NLP and AI applications.
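
When training and evaluating captioning models, the rows are best split by image rather than by row, since an image has several captions and a row-level split would leak evaluation images into training. The sketch below shows one possible way to do this with a deterministic hash of image_id. No official split is described on this page, so the 10% validation fraction, the function names, and the example IDs are all assumptions.

```python
import hashlib


def split_for(image_id, val_fraction=0.1):
    """Assign an image to 'train' or 'val' from a stable hash of its ID."""
    digest = hashlib.sha256(str(image_id).encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")   # integer in [0, 2**64)
    threshold = int(val_fraction * 2**64)        # integer comparison avoids float edge cases
    return "val" if bucket < threshold else "train"


def split_rows(rows, val_fraction=0.1):
    """Partition rows so that all captions of an image land in the same split."""
    splits = {"train": [], "val": []}
    for row in rows:
        splits[split_for(row["image_id"], val_fraction)].append(row)
    return splits


# Invented rows: 50 hypothetical images with 3 captions each.
rows = [
    {"image_id": str(i), "caption_english": f"caption {k}", "raw_caption_tamil": ""}
    for i in range(50)
    for k in range(3)
]
splits = split_rows(rows)
print({name: len(part) for name, part in splits.items()})
```

Because the assignment depends only on image_id, the split stays the same across runs and remains stable if rows are later added to the dataset.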

Future Work


TamilCOCO aims to support advancements in cross-lingual image captioning, low-resource NLP, and culturally enriched AI systems. We welcome collaborations and contributions to expand and enhance this dataset.

Contact


For inquiries and collaborations, reach out to:

Dataset Files

    Files have not been uploaded for this dataset