TamilCOCO Dataset
- Submitted by: Jothi Prakash V
- Last updated: Thu, 01/09/2025 - 10:36
- DOI: 10.21227/y8f2-nk02
Abstract
TamilCOCO is a novel bilingual image captioning dataset specifically designed for Tamil, a low-resource language. This dataset facilitates research in image captioning, cross-lingual natural language processing, and culturally adapted AI applications.
Dataset Statistics
- Total Rows: 305,340
- Total Columns: 3
- Unique Images: 63,062
- Unique English Captions: 303,036
- Unique Tamil Captions: N/A (no count is reported because translated captions may repeat)
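The counts above can be recomputed from the three columns. The following is a minimal sketch using only the Python standard library. It assumes the data is available as a CSV file with a header row naming the columns `image_id`, `caption_english`, and `raw_caption_tamil`; the file format is not stated on this page, so that is an assumption, and the rows below are made-up sample data, not drawn from TamilCOCO.

```python
import csv
import io

# Made-up sample rows standing in for the real file (assumed to be CSV).
SAMPLE_CSV = """image_id,caption_english,raw_caption_tamil
1,A dog runs on the beach.,ஒரு நாய் கடற்கரையில் ஓடுகிறது.
1,A dog is running.,ஒரு நாய் ஓடுகிறது.
2,A dog is running.,ஒரு நாய் ஓடுகிறது.
2,Two children play cricket.,இரண்டு குழந்தைகள் கிரிக்கெட் விளையாடுகிறார்கள்.
"""


def dataset_statistics(csv_file):
    """Count rows and distinct values per column from an open CSV file."""
    reader = csv.DictReader(csv_file)
    rows = 0
    images, english, tamil = set(), set(), set()
    for row in reader:
        rows += 1
        images.add(row["image_id"])
        english.add(row["caption_english"])
        tamil.add(row["raw_caption_tamil"])
    return {
        "total_rows": rows,
        "total_columns": len(reader.fieldnames or []),
        "unique_images": len(images),
        "unique_english_captions": len(english),
        "unique_tamil_captions": len(tamil),
    }


stats = dataset_statistics(io.StringIO(SAMPLE_CSV))
print(stats)
```

For a file on disk, the same function can be given `open(path, newline="", encoding="utf-8")`, where `path` is whatever location the downloaded file has.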
Column Descriptions
- image_id: Identifier of the image that the captions describe. It is unique per image, not per row: the statistics above list 305,340 rows for 63,062 images, so the same image_id repeats across that image's captions.
- caption_english: The original English caption describing the image.
- raw_caption_tamil: The corresponding Tamil caption, translated and culturally adapted for relevance.
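Because each row pairs one English caption with one Tamil caption, and an image appears in several rows, a common first step is to group the caption pairs by `image_id`. The sketch below shows one way to do that with the standard library. As before, the CSV layout is an assumption and the sample rows (including the ids `101` and `102`) are invented for illustration.

```python
import csv
import io
from collections import defaultdict

# Made-up sample rows; the CSV layout is an assumption, not a documented format.
SAMPLE_CSV = """image_id,caption_english,raw_caption_tamil
101,A dog runs on the beach.,ஒரு நாய் கடற்கரையில் ஓடுகிறது.
101,A dog is running.,ஒரு நாய் ஓடுகிறது.
102,Two children play cricket.,இரண்டு குழந்தைகள் கிரிக்கெட் விளையாடுகிறார்கள்.
"""


def captions_by_image(csv_file):
    """Map each image_id to its list of (English, Tamil) caption pairs."""
    grouped = defaultdict(list)
    for row in csv.DictReader(csv_file):
        grouped[row["image_id"]].append(
            (row["caption_english"], row["raw_caption_tamil"])
        )
    return dict(grouped)


grouped = captions_by_image(io.StringIO(SAMPLE_CSV))
for image_id, pairs in grouped.items():
    print(image_id, len(pairs))
```

Note that `csv.DictReader` returns every field as a string, so `image_id` values are string keys here; convert them with `int(...)` only if the real identifiers are known to be numeric.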
Features
- Language Pair: English-Tamil
- Data Type: Textual descriptions (image captions)
- Multilingual Support: Bilingual captions in English and Tamil, enabling cross-lingual applications.
- Cultural Adaptation: Tamil captions incorporate idiomatic expressions and culturally specific terms for enhanced relevance.
Methodology
Annotation Framework
- Semi-Automated Translation: Initial translations are generated with multilingual models such as mBART.
- Cultural Adaptation Module: Refines captions for cultural relevance and semantic accuracy.
- Iterative Validation: Community-based and expert reviews to ensure linguistic and cultural fidelity.
Evaluation Metrics
- Standard Metrics: BLEU, METEOR, CIDEr, SPICE
- Novel Metric: Cultural Relevance Score (CRS) for assessing cultural adaptation quality.
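To make the standard metrics more concrete, the sketch below implements modified (clipped) n-gram precision, the building block of BLEU. It is not the full BLEU score (no brevity penalty, no geometric mean over n-gram orders), and it says nothing about how CRS is computed, which this page does not define. Whitespace tokenization of Tamil text is a simplifying assumption, and the example sentences are made up.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(candidate, references, n):
    """Modified n-gram precision: each candidate n-gram count is clipped
    to the maximum count of that n-gram in any single reference."""
    cand_counts = Counter(ngrams(candidate, n))
    if not cand_counts:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


# Made-up Tamil example, tokenized on whitespace (an assumption).
candidate = "ஒரு நாய் ஓடுகிறது".split()
references = ["ஒரு நாய் கடற்கரையில் ஓடுகிறது".split()]

p1 = clipped_precision(candidate, references, 1)
p2 = clipped_precision(candidate, references, 2)
print(p1, p2)
```

In this example all three candidate words occur in the reference, while only one of the two candidate bigrams does. For reported results, an established implementation of BLEU, METEOR, CIDEr, and SPICE is preferable to a hand-rolled one.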
Collaborators
This project was made possible by contributions from:
- Jothi Prakash V, Arul Antran Vijay S, Balamurugan R, Sudharshan S, Sanjai T, and Ahamed Sameer A.
Acknowledgments
We extend our heartfelt thanks to the community of Tamil-speaking volunteers and annotators who meticulously reviewed and validated captions, ensuring both cultural and linguistic accuracy. Special recognition is given to the cultural consultants who advised on idiomatic expressions and traditional references.
Usage
This dataset is suitable for:
- Training and evaluating image-captioning models in Tamil.
- Research in cross-lingual and low-resource language processing.
- Developing culturally aware NLP and AI applications.
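When training and evaluating captioning models on this data, one practical concern is that each image has several captions, so a row-level random split would place captions of the same image in both training and validation sets. The sketch below shows one way to avoid that by assigning splits per `image_id`. The hashing scheme, the function names, and the `val_fraction` values are illustrative assumptions; the page does not define an official split.

```python
import hashlib


def split_for(image_id, val_fraction=0.1):
    """Assign an image to 'train' or 'val' deterministically from its id."""
    digest = hashlib.sha256(str(image_id).encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # value in [0, 1)
    return "val" if bucket < val_fraction else "train"


def split_rows(rows, val_fraction=0.1):
    """Partition row dicts so that all captions of an image share a split."""
    splits = {"train": [], "val": []}
    for row in rows:
        splits[split_for(row["image_id"], val_fraction)].append(row)
    return splits


# Synthetic rows: 50 made-up images with 5 captions each.
rows = [
    {"image_id": str(i), "caption_english": f"caption {j}"}
    for i in range(50)
    for j in range(5)
]
splits = split_rows(rows, val_fraction=0.2)
train_ids = {r["image_id"] for r in splits["train"]}
val_ids = {r["image_id"] for r in splits["val"]}
print(len(train_ids), len(val_ids))
```

Hashing the identifier rather than shuffling keeps the assignment stable across runs and machines. The realized validation share only approximates `val_fraction`, especially on small samples.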
Future Work
TamilCOCO aims to support advancements in cross-lingual image captioning, low-resource NLP, and culturally enriched AI systems. We welcome collaborations and contributions to expand and enhance this dataset.
Contact
For inquiries and collaborations, reach out to:
- Jothi Prakash V: jothiprakashv@gmail.com
- Arul Antran Vijay S: arulantranvijay@gmail.com