arXiv Paper Digest

Snapshot: 20260427_0405

When Prompts Override Vision: Prompt-Induced Hallucinations in LVLMs

Authors: Pegah Khayatan, Jayneel Parekh, Arnaud Dapogny, Mustafa Shukor, Alasdair Newson, Matthieu Cord

First: 2026-04-23T17:54:36+00:00 · Latest: 2026-04-23T17:54:36+00:00

Abstract

Despite impressive progress in the capabilities of large vision-language models (LVLMs), these systems remain vulnerable to hallucinations, i.e., outputs that are not grounded in the visual input. Prior work has attributed hallucinations in LVLMs to factors such as limitations of the vision backbone or the dominance of the language component, yet the relative importance of these factors remains unclear. To resolve this ambiguity, we propose HalluScope, a benchmark to better understand the extent to which different factors induce hallucinations. Our analysis indicates that hallucinations largely stem from excessive reliance on textual priors and background knowledge, especially information introduced through textual instructions. To mitigate hallucinations induced by textual instruction priors, we propose HalluVL-DPO, a framework for fine-tuning off-the-shelf LVLMs towards more visually grounded responses. HalluVL-DPO leverages preference optimization using a curated training dataset that we construct, guiding the model to prefer grounded responses over hallucinated ones. We demonstrate that our optimized model effectively mitigates the targeted hallucination failure mode, while preserving or improving performance on other hallucination benchmarks and visual capability evaluations. To support reproducibility and further research, we will publicly release our evaluation benchmark, preference training dataset, and code at https://pegah-kh.github.io/projects/prompts-override-vision/ .

Summary

This study investigates the causes of hallucinations in large vision-language models (LVLMs) and proposes HalluScope, a benchmark for measuring how much different factors contribute to them. It finds that hallucinations stem mainly from over-reliance on textual priors and background knowledge, especially information introduced through textual instructions. To address this, the authors introduce HalluVL-DPO, a fine-tuning framework that applies preference optimization on a curated dataset to steer the model toward visually grounded responses. The optimized model mitigates the targeted failure mode while preserving or improving performance on other hallucination benchmarks and visual capability evaluations.
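
The abstract says only that HalluVL-DPO uses preference optimization to prefer grounded responses over hallucinated ones; it does not spell out the objective. As a rough illustration, the sketch below computes the standard DPO loss for a single preference pair. The function name `dpo_loss`, its arguments, and the default `beta` are assumptions made for this example and are not taken from the paper.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) response pair.

    Inputs are sequence log-probabilities under the policy being trained
    (logp_*) and under a frozen reference model (ref_logp_*).
    Hypothetical helper for illustration; not the authors' code.
    """
    # How much more the policy favors the chosen (grounded) response over the
    # rejected (hallucinated) one, relative to the reference model.
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    x = beta * margin
    # -log(sigmoid(x)), evaluated in a numerically stable way.
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))
```

When the policy and the reference agree, the margin is zero and the loss equals log 2; it falls below that value as the policy shifts probability toward the chosen response.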

Counterfactual Segmentation Reasoning: Diagnosing and Mitigating Pixel-Grounding Hallucination

Authors: Xinzhuo Li, Adheesh Juvekar, Jiaxun Zhang, Xingyou Liu, Muntasir Wahed, Kiet A. Nguyen, Yifan Shen, Tianjiao Yu, Ismini Lourentzou

First: 2025-06-26T17:59:12+00:00 · Latest: 2026-04-23T17:42:55+00:00

Comments: Project webpage: https://plan-lab.github.io/hallusegbench/

Abstract

Segmentation Vision-Language Models (VLMs) have significantly advanced grounded visual understanding, yet they remain prone to pixel-grounding hallucinations, producing masks for incorrect objects or for objects that are entirely absent. Existing evaluations rely almost entirely on text- or label-based perturbations, which check only whether the predicted mask matches the queried label. Such evaluations overlook the spatial footprint and severity of hallucination and therefore fail to reveal vision-driven hallucinations, which are more challenging and more prevalent. To address this gap, we formalize the task of Counterfactual Segmentation Reasoning (CSR), where a model must segment the referenced object in the factual image and abstain in its counterfactual counterpart. To support this task, we curate HalluSegBench, the first large-scale benchmark to diagnose referring and reasoning expression segmentation hallucinations using controlled visual counterfactuals, alongside new evaluation metrics that measure hallucination severity and disentangle vision- and language-driven failure modes. We further introduce RobustSeg, a segmentation VLM trained with counterfactual fine-tuning (CFT) to learn when to segment and when to abstain. Experimental results confirm RobustSeg reduces hallucinations by 30%, while improving segmentation performance on FP-RefCOCO(+/g).

Summary

This paper addresses pixel-grounding hallucinations in segmentation vision-language models (VLMs) by formalizing Counterfactual Segmentation Reasoning (CSR) and introducing the HalluSegBench benchmark. The benchmark pairs factual images with controlled visual counterfactuals and adds evaluation metrics that measure hallucination severity. The proposed RobustSeg model, trained with counterfactual fine-tuning, reduces hallucinations by 30% while improving segmentation performance on FP-RefCOCO(+/g).
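
The CSR setup asks a model to segment the referenced object in the factual image and to abstain in the counterfactual one. The sketch below shows one simple way such a pair could be scored, assuming masks are represented as sets of (row, col) pixels. The names `iou` and `csr_scores` and the two quantities returned are illustrative assumptions; they are not the HalluSegBench metrics, which the abstract does not define.

```python
def iou(pred, target):
    """Intersection-over-union of two masks given as sets of (row, col) pixels."""
    union = pred | target
    # Two empty masks agree perfectly, so treat that case as IoU 1.0.
    return len(pred & target) / len(union) if union else 1.0

def csr_scores(pred_factual, gt_factual, pred_counterfactual, image_area):
    """Score one factual/counterfactual pair (hypothetical helper).

    Returns a tuple of:
      - IoU between the prediction and ground truth on the factual image;
      - the fraction of the counterfactual image the model still segments,
        where 0.0 means it abstained as intended.
    """
    factual_iou = iou(pred_factual, gt_factual)
    hallucinated_fraction = len(pred_counterfactual) / image_area
    return factual_iou, hallucinated_fraction
```

A model that segments correctly and then abstains would score (1.0, 0.0) under this toy scheme; any pixels predicted on the counterfactual image raise the second value, giving a crude view of the spatial footprint of the hallucination.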

Fake or Real, Can Robots Tell? Evaluating VLM Robustness to Domain Shift in Single-View Robotic Scene Understanding

Authors: Federico Tavella, Amber Drinkwater, Angelo Cangelosi

First: 2025-06-24T12:45:09+00:00 · Latest: 2026-04-23T17:05:26+00:00