arXiv Paper Digest

Snapshot: 20260426_0404

When Prompts Override Vision: Prompt-Induced Hallucinations in LVLMs

Authors: Pegah Khayatan, Jayneel Parekh, Arnaud Dapogny, Mustafa Shukor, Alasdair Newson, Matthieu Cord

First: 2026-04-23T17:54:36+00:00 · Latest: 2026-04-23T17:54:36+00:00

Abstract

Despite impressive progress in the capabilities of large vision-language models (LVLMs), these systems remain vulnerable to hallucinations, i.e., outputs that are not grounded in the visual input. Prior work has attributed hallucinations in LVLMs to factors such as limitations of the vision backbone or the dominance of the language component, yet the relative importance of these factors remains unclear. To resolve this ambiguity, we propose HalluScope, a benchmark to better understand the extent to which different factors induce hallucinations. Our analysis indicates that hallucinations largely stem from excessive reliance on textual priors and background knowledge, especially information introduced through textual instructions. To mitigate hallucinations induced by textual instruction priors, we propose HalluVL-DPO, a framework for fine-tuning off-the-shelf LVLMs towards more visually grounded responses. HalluVL-DPO leverages preference optimization using a curated training dataset that we construct, guiding the model to prefer grounded responses over hallucinated ones. We demonstrate that our optimized model effectively mitigates the targeted hallucination failure mode, while preserving or improving performance on other hallucination benchmarks and visual capability evaluations. To support reproducibility and further research, we will publicly release our evaluation benchmark, preference training dataset, and code at https://pegah-kh.github.io/projects/prompts-override-vision/ .

Summary

This study investigates the causes of hallucinations in large vision-language models (LVLMs) and proposes HalluScope, a benchmark for measuring how much different factors contribute to them. The analysis finds that hallucinations stem largely from over-reliance on textual priors and background knowledge, especially information introduced through textual instructions. To address this, the authors introduce HalluVL-DPO, a fine-tuning framework that applies preference optimization on a curated dataset to steer the model toward visually grounded responses over hallucinated ones. The optimized model mitigates the targeted failure mode while preserving or improving performance on other hallucination benchmarks and visual capability evaluations.
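
To make the preference-optimization idea concrete, here is a minimal sketch of the standard Direct Preference Optimization (DPO) objective for a single preference pair. The abstract does not state the exact loss HalluVL-DPO uses, so treating it as plain DPO is an assumption; the function name `dpo_loss`, the `beta` value, and the example log-probabilities are all hypothetical.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) response pair.

    Assumption: "chosen" is the visually grounded response and "rejected"
    is the hallucinated one. Inputs are sequence log-probabilities under
    the policy being trained and under a frozen reference model.
    """
    # Implicit reward margin: how much more the policy prefers the chosen
    # response than the reference model does, scaled by beta.
    margin = beta * (
        (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    )
    # Numerically stable -log(sigmoid(margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical log-probabilities: the policy favours the grounded answer
# in the first call and the hallucinated answer in the second.
favours_grounded = dpo_loss(-4.0, -6.0, -5.0, -5.0)
favours_hallucinated = dpo_loss(-6.0, -4.0, -5.0, -5.0)
```

In this sketch the loss is lower when the policy assigns relatively more probability to the grounded response than the reference model does, which is the direction the fine-tuning is described as pushing the model.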

Counterfactual Segmentation Reasoning: Diagnosing and Mitigating Pixel-Grounding Hallucination

Authors: Xinzhuo Li, Adheesh Juvekar, Jiaxun Zhang, Xingyou Liu, Muntasir Wahed, Kiet A. Nguyen, Yifan Shen, Tianjiao Yu, Ismini Lourentzou

First: 2025-06-26T17:59:12+00:00 · Latest: 2026-04-23T17:42:55+00:00

Comments: Project webpage: https://plan-lab.github.io/hallusegbench/

Abstract

Segmentation Vision-Language Models (VLMs) have significantly advanced grounded visual understanding, yet they remain prone to pixel-grounding hallucinations, producing masks for incorrect objects or for objects that are entirely absent. Existing evaluations rely almost entirely on text- or label-based perturbations, which check only whether the predicted mask matches the queried label. Such evaluations overlook the spatial footprint and severity of hallucination and therefore fail to reveal vision-driven hallucinations, which are more challenging and more prevalent. To address this gap, we formalize the task of Counterfactual Segmentation Reasoning (CSR), where a model must segment the referenced object in the factual image and abstain in its counterfactual counterpart. To support this task, we curate HalluSegBench, the first large-scale benchmark to diagnose referring and reasoning expression segmentation hallucinations using controlled visual counterfactuals, alongside new evaluation metrics that measure hallucination severity and disentangle vision- and language-driven failure modes. We further introduce RobustSeg, a segmentation VLM trained with counterfactual fine-tuning (CFT) to learn when to segment and when to abstain. Experimental results confirm RobustSeg reduces hallucinations by 30%, while improving segmentation performance on FP-RefCOCO(+/g).
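
The CSR setup described above (segment in the factual image, abstain in the counterfactual one) can be illustrated with a small scoring sketch. The abstract does not define the HalluSegBench metrics, so the two functions below are illustrative stand-ins rather than the paper's metrics: `iou` is ordinary intersection-over-union, and `predicted_area_ratio` is a hypothetical severity proxy for how much mask a model still produces when it should abstain. Masks are assumed to be equally sized nested lists of 0/1 values.

```python
def iou(pred, gt):
    """Intersection-over-union of two binary masks (nested lists of 0/1)."""
    inter = sum(p and g for pr, gr in zip(pred, gt) for p, g in zip(pr, gr))
    union = sum(p or g for pr, gr in zip(pred, gt) for p, g in zip(pr, gr))
    # Two empty masks agree perfectly.
    return inter / union if union else 1.0


def predicted_area_ratio(pred):
    """Fraction of pixels predicted as foreground.

    On a counterfactual image where the referenced object is absent, any
    predicted pixel is hallucinated, so 0.0 means the model abstained.
    """
    total = sum(len(row) for row in pred)
    return sum(sum(row) for row in pred) / total if total else 0.0


# Hypothetical 2x3 masks.
gt_factual = [[0, 1, 1], [0, 1, 1]]
pred_factual = [[0, 1, 1], [0, 0, 1]]
pred_counterfactual = [[0, 1, 0], [0, 0, 0]]

factual_score = iou(pred_factual, gt_factual)
hallucination_severity = predicted_area_ratio(pred_counterfactual)
```

Scoring the two images separately keeps segmentation quality and abstention failures apart, which mirrors the abstract's point that label-matching alone overlooks the spatial footprint of a hallucination.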

Fake or Real, Can Robots Tell? Evaluating VLM Robustness to Domain Shift in Single-View Robotic Scene Understanding

Authors: Federico Tavella, Amber Drinkwater, Angelo Cangelosi

First: 2025-06-24T12:45:09+00:00 · Latest: 2026-04-23T17:05:26+00:00