- *Analyzing shortcut learning in VLMs across NLI and visual entailment:* Vision-language models (VLMs) achieve strong performance on many tasks, yet they can exhibit shortcut learning, where predictions rely on simple input patterns rather than on full use of the available evidence. For LLMs, this behavior has been observed in natural language inference (NLI), which asks whether a hypothesis follows from a given premise. Prior work has shown that models can often solve NLI by relying mainly on cues in the hypothesis alone, without fully capturing the relationship between the premise and the hypothesis ([Poliak et al., 2018](https://aclanthology.org/S18-2023/), [Yuan et al., 2024](https://aclanthology.org/2024.emnlp-main.679/)). In visual entailment, the premise is an image rather than a text ([Xie et al., 2019](https://arxiv.org/abs/1901.06706), [Kayser et al., 2021](https://openaccess.thecvf.com/content/ICCV2021/papers/Kayser_E-ViL_A_Dataset_and_Benchmark_for_Natural_Language_Explanations_in_ICCV_2021_paper.pdf)). The goal of this project is to investigate whether similar shortcut behavior occurs in vision-language models when performing visual entailment, and to analyze which visual and textual information the models rely on when making inferences. The scope can be adjusted for BSc or MSc, for example by varying the number of models, prompting strategies, or the depth and types of cues analyzed.
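
  The hypothesis-only probe described above can be sketched as follows: compare a model's accuracy when given the full premise-hypothesis input against its accuracy when given only the hypothesis. The labels and predictions below are illustrative toy values, not real model outputs; in the project they would come from the VLMs under study.

  ```python
  # Minimal sketch of a hypothesis-only baseline check, the standard probe
  # for shortcut cues in NLI-style tasks. All data and predictions below
  # are toy stand-ins, not outputs of a real model.

  def accuracy(preds, golds):
      """Fraction of predictions that match the gold labels."""
      return sum(p == g for p, g in zip(preds, golds)) / len(golds)

  # Gold labels for a toy dev set of premise-hypothesis pairs.
  golds      = ["entailment", "contradiction", "neutral", "entailment"]
  # Predictions from a model that sees both premise and hypothesis.
  full_preds = ["entailment", "contradiction", "neutral", "entailment"]
  # Predictions from a model that sees only the hypothesis; if this comes
  # close to the full model, the task is partly solvable via shortcuts.
  hypo_preds = ["entailment", "contradiction", "entailment", "entailment"]

  acc_full = accuracy(full_preds, golds)   # 1.0
  acc_hypo = accuracy(hypo_preds, golds)   # 0.75
  print(f"full input: {acc_full:.2f}  hypothesis-only: {acc_hypo:.2f}")
  ```

  For visual entailment the same probe applies with the image as the premise: a text-only run of the VLM on the hypothesis serves as the shortcut baseline.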