Irrelevance Robust Visual Question Answering (IR-VQA)

Citation Author(s):
Jinhui Yang, Ming Jiang, Qi Zhao
Submitted by:
Jinhui Yang
Last updated:
Sat, 03/01/2025 - 19:38
DOI:
10.21227/ecrz-1412

Abstract 

Large Vision-Language Models (LVLMs) struggle with distractions, particularly in the presence of irrelevant visual or textual inputs. This paper introduces the Irrelevance Robust Visual Question Answering (IR-VQA) benchmark to systematically evaluate and mitigate this "multimodal distractibility". IR-VQA targets three key paradigms: irrelevant visual contexts in image-independent questions, irrelevant textual contexts in image-dependent questions, and text-only distractions. Our experiments reveal that even state-of-the-art models like GPT-4o exhibit significant drops in accuracy and reasoning due to distraction-induced inconsistencies. To address this challenge, we present a novel methodology with the following components. First, we introduce new evaluation metrics, Positive Consistency (PC) and Negative Consistency (NC), to better assess model robustness under distractions. Second, we show that finetuning on our dataset yields significant performance improvements on both traditional benchmarks and IR-VQA, highlighting the value of our dataset in enhancing model reliability and revealing deeper insights into multimodal interactions. This work paves the way for the development of more robust LVLMs for real-world applications.

Instructions: 
Datasets
------------------
Specifically, our dataset contains:

- `train.json`: the training data for model finetuning, containing the default mix of our IR-VQA data and original data from ScienceQA/MMLU (as described in the paper)
- `test.csv`: the evaluation data of our IR-VQA benchmark
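
The sketch below shows one way to load the released files; it is not an official script and assumes `train.json` holds a single JSON array, so adjust accordingly if the file uses a different layout.

```python
# Minimal loading sketch; assumes train.json is a single JSON array and
# test.csv is a standard comma-separated file.
import json

import pandas as pd

with open("train.json") as f:
    train_data = json.load(f)        # IR-VQA training mix plus original ScienceQA/MMLU data

test_data = pd.read_csv("test.csv")  # IR-VQA evaluation benchmark

print(f"Loaded {len(train_data)} training records and {len(test_data)} test rows.")
```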

Images
------------------
Download all necessary images and arrange them under the `images` directory using the following structure:

```
<dataset_root>
    -- ScienceQA/
        -- training_images/
        -- validation_images/
        -- test_images/
    -- mscoco/
        -- train2017/
        -- val2017/
        -- test2017/
    -- MMBench/
```
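
Interpreting `<dataset_root>` as the `images` directory (as the MMBench path below suggests), a small convenience sketch for creating the empty layout could look like this:

```python
# Create the expected image directory layout (paths taken from the tree above,
# with <dataset_root> assumed to be the "images" directory).
import os

IMAGE_DIRS = [
    "images/ScienceQA/training_images",
    "images/ScienceQA/validation_images",
    "images/ScienceQA/test_images",
    "images/mscoco/train2017",
    "images/mscoco/val2017",
    "images/mscoco/test2017",
    "images/MMBench",
]

for d in IMAGE_DIRS:
    os.makedirs(d, exist_ok=True)
```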

For the ScienceQA images of each split, please visit the [ScienceQA](https://huggingface.co/datasets/derek-thomas/ScienceQA) HuggingFace page, read each split, and store the images according to their indices. Note that some indices may not have a corresponding image. Save each image as `image_xxx.png` under the corresponding subdirectory.
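
A rough sketch of this export step is shown below. It is not the authors' official script; the exact filename pattern (e.g. whether the index in `image_xxx.png` is zero-padded) is an assumption, so double-check it against the indices referenced by the dataset files.

```python
# Hedged sketch: export ScienceQA images from HuggingFace into the layout above.
# Assumes each example exposes a PIL image (or None) under the "image" field and
# that files are named by their split index (no zero padding assumed).
import os

from datasets import load_dataset

SPLIT_DIRS = {
    "train": "images/ScienceQA/training_images",
    "validation": "images/ScienceQA/validation_images",
    "test": "images/ScienceQA/test_images",
}

dataset = load_dataset("derek-thomas/ScienceQA")

for split, out_dir in SPLIT_DIRS.items():
    os.makedirs(out_dir, exist_ok=True)
    for idx, example in enumerate(dataset[split]):
        image = example["image"]  # None when the question has no image
        if image is None:
            continue
        image.save(os.path.join(out_dir, f"image_{idx}.png"))
```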

For MSCOCO images, please download them from the [official website](https://cocodataset.org/#download).
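
If you prefer to script the download, a hedged Python sketch is given below; the zip URLs are the ones typically linked from the download page, so verify them there before running.

```python
# Hedged sketch: download and unpack the 2017 COCO image zips.
# URLs assumed from the official download page; confirm before use.
import os
import urllib.request
import zipfile

OUT_ROOT = "images/mscoco"
os.makedirs(OUT_ROOT, exist_ok=True)

for split in ["train2017", "val2017", "test2017"]:
    url = f"http://images.cocodataset.org/zips/{split}.zip"
    zip_path = os.path.join(OUT_ROOT, f"{split}.zip")
    urllib.request.urlretrieve(url, zip_path)  # large downloads (train2017 is tens of GB)
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(OUT_ROOT)                # yields images/mscoco/<split>/
    os.remove(zip_path)
```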

For MMBench images, please download them from [here](https://github.com/open-compass/MMBench) and save images from each row under the `./images/MMBench/` directory.
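
As a starting point, the sketch below decodes per-row images from an MMBench TSV release; the column names (`index`, base64-encoded `image`) and the output filename pattern are assumptions, so adapt them to the file you actually download and to how the benchmark references these images.

```python
# Hedged sketch: dump MMBench images from a TSV release into images/MMBench/.
# Assumes an `index` column and a base64-encoded `image` column; the output
# filename pattern (<index>.png) is an assumption.
import base64
import io
import os

import pandas as pd
from PIL import Image

TSV_PATH = "mmbench.tsv"  # hypothetical filename; use the file you downloaded
OUT_DIR = "images/MMBench"
os.makedirs(OUT_DIR, exist_ok=True)

df = pd.read_csv(TSV_PATH, sep="\t")
for _, row in df.iterrows():
    img = Image.open(io.BytesIO(base64.b64decode(row["image"])))
    img.save(os.path.join(OUT_DIR, f"{row['index']}.png"))
```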