
GeoCaption-12K

Citation Author(s):
Xing Zi (University of Technology Sydney)
Submitted by:
Xing Zi
DOI:
10.21227/6qf2-zd04

Abstract

The scarcity of multimodal datasets in remote sensing, particularly those combining high-resolution imagery with descriptive textual annotations, limits advances in context-aware analysis. To address this, we introduce a dataset of 12,473 aerial and satellite images sourced from established benchmarks (RSSCN7, DLRSD, iSAID, LoveDA, and WHU), enriched with automatically generated pseudo-captions and semantic tags. We use a two-step pipeline: first, structured prompts are constructed from polygon-based annotations and passed to GPT-4o, which generates detailed captions articulating spatial layouts and object relationships; the resulting captions average 181.94 words over a vocabulary of 3,961 unique words. Second, the captions are filtered to reduce noise and paired with semantic tags extracted via named entity recognition and part-of-speech tagging, providing domain-specific cues (e.g., “building,” “river,” “runway”). With its comprehensive visual-textual annotations, the dataset substantially reduces manual annotation costs while enabling advanced multimodal remote-sensing tasks such as scene understanding and spatially informed visual representation learning across diverse urban, rural, and natural environments. The dataset is publicly available to foster research in remote sensing and multimodal learning.
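
As a rough illustration of how such a pipeline fits together, the Python sketch below mirrors the two steps described in the abstract: build a structured prompt from polygon-based annotations, query GPT-4o for a detailed caption, then extract semantic tags with named entity recognition and part-of-speech tagging. Everything beyond what the abstract states is an assumption: the prompt wording, the annotation format, the function names, and the use of the openai and spaCy libraries are illustrative, not the authors' released tooling.

# Minimal sketch of the two-step captioning-and-tagging pipeline.
# Assumptions: annotations arrive as (class_name, polygon) pairs, the
# OpenAI Python client is configured via OPENAI_API_KEY, and spaCy's
# small English model stands in for the unspecified NER/POS tooling.
from openai import OpenAI
import spacy

client = OpenAI()
nlp = spacy.load("en_core_web_sm")

def build_prompt(annotations):
    # Step 1a: turn polygon-based annotations into a structured prompt.
    objects = "\n".join(f"- {name}: polygon {points}" for name, points in annotations)
    return ("Describe this remote-sensing scene in detail, covering the "
            "spatial layout and relationships among these objects:\n" + objects)

def generate_caption(prompt):
    # Step 1b: ask GPT-4o for a detailed pseudo-caption.
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def extract_tags(caption):
    # Step 2: pair the caption with semantic tags via NER and POS tagging.
    doc = nlp(caption)
    tags = {ent.text.lower() for ent in doc.ents}                # named entities
    tags |= {t.lemma_.lower() for t in doc if t.pos_ == "NOUN"}  # noun cues
    return sorted(tags)

annotations = [("building", [(10, 12), (40, 12), (40, 55), (10, 55)]),
               ("river",    [(0, 60), (120, 64), (120, 80), (0, 76)])]
caption = generate_caption(build_prompt(annotations))
print(extract_tags(caption))  # e.g. ['building', 'river', ...]

The released captions and tags were produced once, offline; a sketch like this is only relevant if you want to replicate the annotation process on new imagery.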

Instructions:

For easy download, please use Google Drive:

https://drive.google.com/file/d/1T82VRN_5WIHOtZcPOHY3z9gW7P6XtqG1/view?usp=sharing
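
If you prefer a scripted download, the gdown package (pip install gdown) can fetch the file straight from the share link above; the output filename here is an assumption, as the actual archive name is not stated.

# Sketch: download the dataset archive from the Google Drive link above.
import gdown

url = "https://drive.google.com/file/d/1T82VRN_5WIHOtZcPOHY3z9gW7P6XtqG1/view?usp=sharing"
gdown.download(url, "GeoCaption-12K.zip", fuzzy=True)  # fuzzy=True parses share-style URLs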

Dataset Files

Files have not been uploaded to this page; please use the Google Drive link above.