arXiv Paper Digest

Snapshot: 20260425_0410

When Prompts Override Vision: Prompt-Induced Hallucinations in LVLMs

Authors: Pegah Khayatan, Jayneel Parekh, Arnaud Dapogny, Mustafa Shukor, Alasdair Newson, Matthieu Cord

First: 2026-04-23T17:54:36+00:00 · Latest: 2026-04-23T17:54:36+00:00

Abstract

Despite impressive progress in the capabilities of large vision-language models (LVLMs), these systems remain vulnerable to hallucinations, i.e., outputs that are not grounded in the visual input. Prior work has attributed hallucinations in LVLMs to factors such as limitations of the vision backbone or the dominance of the language component, yet the relative importance of these factors remains unclear. To resolve this ambiguity, we propose HalluScope, a benchmark to better understand the extent to which different factors induce hallucinations. Our analysis indicates that hallucinations largely stem from excessive reliance on textual priors and background knowledge, especially information introduced through textual instructions. To mitigate hallucinations induced by textual instruction priors, we propose HalluVL-DPO, a framework for fine-tuning off-the-shelf LVLMs towards more visually grounded responses. HalluVL-DPO leverages preference optimization using a curated training dataset that we construct, guiding the model to prefer grounded responses over hallucinated ones. We demonstrate that our optimized model effectively mitigates the targeted hallucination failure mode, while preserving or improving performance on other hallucination benchmarks and visual capability evaluations. To support reproducibility and further research, we will publicly release our evaluation benchmark, preference training dataset, and code at https://pegah-kh.github.io/projects/prompts-override-vision/ .

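Title/Abstract (translated from Chinese)

Title: When Prompts Override Vision: Prompt-Induced Hallucinations in LVLMs

Although the capabilities of large vision-language models (LVLMs) have advanced impressively, these systems are still prone to hallucinations, i.e., outputs that are not grounded in the visual input. Earlier studies attribute hallucinations in LVLMs to limitations of the vision backbone or to the dominance of the language component, but the relative importance of these factors is unclear. To resolve this ambiguity, we propose the HalluScope benchmark to better understand the extent to which different factors induce hallucinations. Our analysis shows that hallucinations mainly stem from excessive reliance on textual priors and background knowledge, especially information introduced through textual instructions. To mitigate hallucinations induced by textual instruction priors, we propose the HalluVL-DPO framework, which fine-tunes off-the-shelf LVLMs so that their responses are more consistent with the visual input. HalluVL-DPO applies preference optimization on a curated training dataset that we construct, guiding the model to prefer grounded responses over hallucinated ones. We show that the optimized model effectively mitigates the targeted hallucination failure mode while maintaining or improving performance on other hallucination benchmarks and visual capability evaluations. To support reproducibility and further research, we will publicly release our evaluation benchmark, preference training dataset, and code at https://pegah-kh.github.io/projects/prompts-override-vision/ .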
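
The abstract above says only that HalluVL-DPO uses preference optimization to favor grounded responses over hallucinated ones; it does not state the exact objective. As an illustration, the following minimal sketch assumes the standard DPO loss for a single preference pair. The function name `dpo_loss`, its parameters, and the default `beta` are hypothetical and are not taken from the paper.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) response pair.

    Assumption: each argument is the summed log-probability of a full
    response under the policy or the frozen reference model. Here the
    "chosen" response would be the visually grounded one and the
    "rejected" response the hallucinated one.
    """
    # How much more the policy favors each response than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)

    # -log(sigmoid(logits)), written in a numerically stable form.
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))
```

When the policy and the reference agree on both responses the loss equals log 2, and it falls below that value as the policy shifts probability toward the grounded response relative to the reference.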

Counterfactual Segmentation Reasoning: Diagnosing and Mitigating Pixel-Grounding Hallucination

Authors: Xinzhuo Li, Adheesh Juvekar, Jiaxun Zhang, Xingyou Liu, Muntasir Wahed, Kiet A. Nguyen, Yifan Shen, Tianjiao Yu, Ismini Lourentzou

First: 2025-06-26T17:59:12+00:00 · Latest: 2026-04-23T17:42:55+00:00

Comments: Project webpage: https://plan-lab.github.io/hallusegbench/


Abstract

Segmentation Vision-Language Models (VLMs) have significantly advanced grounded visual understanding, yet they remain prone to pixel-grounding hallucinations, producing masks for incorrect objects or for objects that are entirely absent. Existing evaluations rely almost entirely on text- or label-based perturbations, which check only whether the predicted mask matches the queried label. Such evaluations overlook the spatial footprint and severity of hallucination and therefore fail to reveal vision-driven hallucinations, which are more challenging and more prevalent. To address this gap, we formalize the task of Counterfactual Segmentation Reasoning (CSR), where a model must segment the referenced object in the factual image and abstain in its counterfactual counterpart. To support this task, we curate HalluSegBench, the first large-scale benchmark to diagnose referring and reasoning expression segmentation hallucinations using controlled visual counterfactuals, alongside new evaluation metrics that measure hallucination severity and disentangle vision- and language-driven failure modes. We further introduce RobustSeg, a segmentation VLM trained with counterfactual fine-tuning (CFT) to learn when to segment and when to abstain. Experimental results confirm RobustSeg reduces hallucinations by 30%, while improving segmentation performance on FP-RefCOCO(+/g).

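Title/Abstract (translated from Chinese)

Title: Counterfactual Segmentation Reasoning: Diagnosing and Mitigating Pixel-Grounding Hallucination

Segmentation Vision-Language Models (VLMs) have made significant progress in grounded visual understanding, but they are still prone to pixel-grounding hallucinations, i.e., generating masks for the wrong objects or for objects that are entirely absent. Existing evaluations rely almost entirely on text- or label-based perturbations and check only whether the predicted mask matches the queried label. Such evaluations ignore the spatial footprint and severity of hallucination and therefore cannot reveal vision-driven hallucinations, which are more challenging and more prevalent. To address this gap, we formalize the task of Counterfactual Segmentation Reasoning (CSR), in which a model must segment the referenced object in the factual image and abstain in its counterfactual counterpart. To support this task, we build HalluSegBench, the first large-scale benchmark that uses controlled visual counterfactuals to diagnose hallucinations in referring and reasoning expression segmentation, and we introduce new evaluation metrics that measure hallucination severity and separate vision-driven from language-driven failure modes. We also introduce RobustSeg, a segmentation VLM trained with counterfactual fine-tuning (CFT) so that it learns when to segment and when to abstain. Experimental results show that RobustSeg reduces hallucinations by 30% while improving segmentation performance on FP-RefCOCO(+/g).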

Summary

The paper addresses pixel-grounding hallucinations in Segmentation Vision-Language Models (VLMs) by formalizing the Counterfactual Segmentation Reasoning (CSR) task and introducing a new benchmark, HalluSegBench. In CSR, a model must segment the referenced object in the factual image and abstain in its counterfactual counterpart. The accompanying model, RobustSeg, is trained with counterfactual fine-tuning (CFT) and reduces hallucinations by 30% while improving segmentation performance on FP-RefCOCO(+/g).

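By introducing Counterfactual Segmentation Reasoning (CSR) and the new benchmark HalluSegBench, the paper tackles pixel-grounding hallucinations in Segmentation Vision-Language Models (VLMs). The method requires the model to segment the target correctly in the original image and to produce no segmentation in the counterfactual image, which reduces hallucinations by 30% while improving segmentation performance on FP-RefCOCO(+/g).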
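
The CSR setup in this entry (segment in the factual image, abstain in the counterfactual one) can be illustrated with a small scoring sketch. This is not the HalluSegBench metric suite, which the abstract does not define. The function names, the nested 0/1 list mask format, and the `abstain_threshold` parameter are all assumptions made for illustration.

```python
def iou(pred, target):
    """Intersection-over-union of two binary masks (nested lists of 0/1)."""
    inter = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                inter += 1
            if p or t:
                union += 1
    # Two empty masks are treated as a perfect match.
    return inter / union if union else 1.0


def mask_area_fraction(mask):
    """Fraction of pixels that are switched on in a binary mask."""
    total = sum(len(row) for row in mask)
    on = sum(1 for row in mask for p in row if p)
    return on / total if total else 0.0


def csr_scores(factual_pred, factual_gt, counterfactual_pred,
               abstain_threshold=0.0):
    """Score one factual/counterfactual pair (hypothetical scoring rule).

    The factual prediction is scored by IoU against the ground truth.
    Any mask predicted on the counterfactual image, where the referenced
    object is absent, counts as hallucinated area.
    """
    hallucinated = mask_area_fraction(counterfactual_pred)
    return {
        "factual_iou": iou(factual_pred, factual_gt),
        "hallucinated_fraction": hallucinated,
        "abstained": hallucinated <= abstain_threshold,
    }
```

Reporting the hallucinated area as a fraction, rather than only a yes/no flag, reflects the entry's point that the spatial footprint of a hallucination matters, not just whether one occurred.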

Fake or Real, Can Robots Tell? Evaluating VLM Robustness to Domain Shift in Single-View Robotic Scene Understanding

Authors: Federico Tavella, Amber Drinkwater, Angelo Cangelosi

First: 2025-06-24T12:45:09+00:00 · Latest: 2026-04-23T17:05:26+00:00