arXiv Paper Digest

2026-04-25 04:10
Snapshot: 20260425_0410
When Prompts Override Vision: Prompt-Induced Hallucinations in LVLMs
Authors: Pegah Khayatan, Jayneel Parekh, Arnaud Dapogny, Mustafa Shukor, Alasdair Newson, Matthieu Cord
First: 2026-04-23T17:54:36+00:00 · Latest: 2026-04-23T17:54:36+00:00
Abstract
Despite impressive progress in capabilities of large vision-language models (LVLMs), these systems remain vulnerable to hallucinations, i.e., outputs that are not grounded in the visual input. Prior work has attributed hallucinations in LVLMs to factors such as limitations of the vision backbone or the dominance of the language component, yet the relative importance of these factors remains unclear. To resolve this ambiguity, we propose HalluScope, a benchmark to better understand the extent to which different factors induce hallucinations. Our analysis indicates that hallucinations largely stem from excessive reliance on textual priors and background knowledge, especially information introduced through textual instructions. To mitigate hallucinations induced by textual instruction priors, we propose HalluVL-DPO, a framework for fine-tuning off-the-shelf LVLMs towards more visually grounded responses. HalluVL-DPO leverages preference optimization using a curated training dataset that we construct, guiding the model to prefer grounded responses over hallucinated ones. We demonstrate that our optimized model effectively mitigates the targeted hallucination failure mode, while preserving or improving performance on other hallucination benchmarks and visual capability evaluations. To support reproducibility and further research, we will publicly release our evaluation benchmark, preference training dataset, and code at https://pegah-kh.github.io/projects/prompts-override-vision/ .
中文标题/摘要
标题:当提示超越视觉:LVLM中的提示诱导幻觉
尽管大型视觉-语言模型(LVLMs)的能力取得了令人印象深刻的进展,但这些系统仍然容易出现幻觉,即与视觉输入无关的输出。先前的研究将LVLM中的幻觉归因于视觉骨干的局限性或语言组件的主导地位,但这些因素的重要性尚不明确。为了解决这一模糊性,我们提出了HalluScope基准,以更好地理解不同因素诱导幻觉的程度。我们的分析表明,幻觉主要源自对文本先验和背景知识的过度依赖,尤其是通过文本指令引入的信息。为了减轻由文本指令先验诱导的幻觉,我们提出了HalluVL-DPO框架,这是一种针对现成LVLM进行微调的方法,使其产生更符合视觉输入的响应。HalluVL-DPO利用我们精心构建的训练数据集中的偏好优化,引导模型更倾向于生成符合实际的响应而非幻觉。我们证明,优化后的模型有效地缓解了目标幻觉失败模式,同时在其他幻觉基准测试和视觉能力评估中保持或提高了性能。为了支持可重复性和进一步的研究,我们将在https://pegah-kh.github.io/projects/prompts-override-vision/ 公开发布我们的评估基准、偏好训练数据集和代码。
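The abstract above says HalluVL-DPO uses preference optimization to make the model prefer grounded responses over hallucinated ones. As a rough illustration only, the sketch below computes the standard DPO loss for a single preference pair. The function name, the `beta` value, and the use of summed sequence log-probabilities are assumptions for illustration; the paper's actual training setup may differ.

```python
import math

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) response pair.

    Inputs are sequence log-probabilities under the trainable policy and a
    frozen reference model. Here "chosen" would be the grounded response and
    "rejected" the hallucinated one (hypothetical naming).
    """
    chosen_margin = policy_logp_chosen - ref_logp_chosen
    rejected_margin = policy_logp_rejected - ref_logp_rejected
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logits)) == softplus(-logits); guard against overflow.
    if logits < -30.0:
        return -logits
    return math.log1p(math.exp(-logits))

# The loss shrinks as the policy favors the grounded response more than
# the reference model does.
loss_good = dpo_loss(-1.0, -5.0, -2.0, -2.0)
loss_bad = dpo_loss(-5.0, -1.0, -2.0, -2.0)
```

When the policy and reference agree exactly, the loss equals log 2, which is the value before any preference has been learned.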
Counterfactual Segmentation Reasoning: Diagnosing and Mitigating Pixel-Grounding Hallucination
Authors: Xinzhuo Li, Adheesh Juvekar, Jiaxun Zhang, Xingyou Liu, Muntasir Wahed, Kiet A. Nguyen, Yifan Shen, Tianjiao Yu, Ismini Lourentzou
First: 2025-06-26T17:59:12+00:00 · Latest: 2026-04-23T17:42:55+00:00
Comments: Project webpage: https://plan-lab.github.io/hallusegbench/
Abstract
Segmentation Vision-Language Models (VLMs) have significantly advanced grounded visual understanding, yet they remain prone to pixel-grounding hallucinations, producing masks for incorrect objects or for objects that are entirely absent. Existing evaluations rely almost entirely on text- or label-based perturbations, which check only whether the predicted mask matches the queried label. Such evaluations overlook the spatial footprint and severity of hallucination and therefore fail to reveal vision-driven hallucinations, which are more challenging and more prevalent. To address this gap, we formalize the task of Counterfactual Segmentation Reasoning (CSR), where a model must segment the referenced object in the factual image and abstain in its counterfactual counterpart. To support this task, we curate HalluSegBench, the first large-scale benchmark to diagnose referring and reasoning expression segmentation hallucinations using controlled visual counterfactuals, alongside new evaluation metrics that measure hallucination severity and disentangle vision- and language-driven failure modes. We further introduce RobustSeg, a segmentation VLM trained with counterfactual fine-tuning (CFT) to learn when to segment and when to abstain. Experimental results confirm RobustSeg reduces hallucinations by 30%, while improving segmentation performance on FP-RefCOCO(+/g).
中文标题/摘要
标题:反事实分割推理:诊断和缓解像素定位幻觉
分割视觉语言模型(VLMs)在增强基于视觉的语义理解方面取得了显著进展,但它们仍然容易产生像素定位幻觉,即为错误的对象生成掩码或为完全不存在的对象生成掩码。现有的评估几乎完全依赖于基于文本或标签的扰动,仅检查预测的掩码是否与查询标签匹配。这种评估忽略了幻觉的空间足迹和严重程度,因此无法揭示由视觉驱动的幻觉,这些幻觉更具挑战性且更为普遍。为解决这一差距,我们形式化了反事实分割推理(CSR)任务,其中模型必须在事实图像中分割参考对象,并在反事实对应物中避免。为了支持这一任务,我们构建了HalluSegBench,这是首个使用受控视觉反事实来诊断引用和推理表达分割幻觉的大规模基准,并引入了新的评估指标来衡量幻觉的严重程度并分离视觉和语言驱动的失败模式。我们还引入了RobustSeg,这是一种通过反事实微调(CFT)训练的分割VLM,使其学习何时分割何时避免。实验结果表明,RobustSeg将幻觉减少了30%,同时在FP-RefCOCO(+/g)上提高了分割性能。
Summary / 总结
The paper addresses pixel-grounding hallucinations in Segmentation Vision-Language Models (VLMs) by formalizing Counterfactual Segmentation Reasoning (CSR) and introducing a new benchmark, HalluSegBench. The proposed model, RobustSeg, is trained with counterfactual fine-tuning to segment the referenced object in the factual image and abstain on its counterfactual counterpart, reducing hallucinations by 30% while improving segmentation performance on FP-RefCOCO(+/g).
论文通过引入Counterfactual Segmentation Reasoning (CSR) 和新的基准HalluSegBench,解决了Segmentation Vision-Language Models (VLMs) 中的像素定位幻觉问题。该方法要求模型在原始图像中正确分割目标,在反事实图像中则不进行分割,从而将幻觉减少了30%,同时在FP-RefCOCO(+/g)上提高了分割性能。
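The CSR task described above asks a model to segment the referenced object in the factual image and to abstain on the counterfactual one. The sketch below shows one simple way such a pair could be scored: IoU on the factual image and the fraction of pixels wrongly predicted on the counterfactual image. Both function names and the scoring choice are illustrative assumptions, not the HalluSegBench metrics themselves.

```python
def iou(pred, target):
    """Intersection over union of two binary masks (lists of 0/1 rows)."""
    inter = sum(p and t for pr, tr in zip(pred, target) for p, t in zip(pr, tr))
    union = sum(p or t for pr, tr in zip(pred, target) for p, t in zip(pr, tr))
    # Two empty masks agree perfectly.
    return inter / union if union else 1.0

def hallucinated_fraction(pred):
    """Share of pixels predicted as object when the object is absent.

    On a counterfactual image the ideal prediction is an empty mask,
    so any positive pixel counts as hallucinated (assumed definition).
    """
    total = sum(len(row) for row in pred)
    return sum(sum(row) for row in pred) / total

factual_pred = [[1, 1], [0, 0]]
factual_target = [[1, 0], [0, 0]]
counterfactual_pred = [[0, 0], [0, 1]]

factual_iou = iou(factual_pred, factual_target)
cf_hallucination = hallucinated_fraction(counterfactual_pred)
```

Measuring the hallucinated area rather than a yes/no label match reflects the abstract's point that the spatial footprint of a hallucination matters.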
Fake or Real, Can Robots Tell? Evaluating VLM Robustness to Domain Shift in Single-View Robotic Scene Understanding
Authors: Federico Tavella, Amber Drinkwater, Angelo Cangelosi
First: 2025-06-24T12:45:09+00:00 · Latest: 2026-04-23T17:05:26+00:00
Abstract
Robotic scene understanding increasingly relies on Vision-Language Models (VLMs) to generate natural language descriptions of the environment. In this work, we systematically evaluate single-view object captioning for tabletop scenes captured by a robotic manipulator, introducing a controlled physical domain shift that contrasts real-world tools with geometrically similar 3D-printed counterparts that differ in texture, colour, and material. We benchmark a suite of state-of-the-art, locally deployable VLMs across multiple metrics to assess semantic alignment and factual grounding. Our results demonstrate that while VLMs describe common real-world objects effectively, performance degrades markedly on 3D-printed items despite their structurally familiar forms. We further expose critical vulnerabilities in standard evaluation metrics, showing that some fail to detect domain shifts entirely or reward fluent but factually incorrect captions. These findings highlight the limitations of deploying foundation models for embodied agents and the need for more robust architectures and evaluation protocols in physical robotic applications.
Summary / 总结
This study evaluates the robustness of Vision-Language Models (VLMs) in single-view robotic scene understanding by introducing a controlled physical domain shift between real-world tools and their 3D-printed counterparts. The research benchmarks several state-of-the-art VLMs using multiple metrics to assess semantic alignment and factual grounding. Results show that VLMs perform well on common real-world objects but struggle with 3D-printed items, despite their similar structures. The study also reveals limitations in current evaluation metrics: some fail to detect the domain shift at all, and others reward fluent but factually incorrect captions. These findings underscore the need for more robust architectures and evaluation methods for VLMs in physical robotic applications.
研究通过引入实物工具与其3D打印相似但材质、颜色和纹理不同的替代品之间的可控物理域移,评估了Vision-Language模型(VLM)在单视角机器人场景理解中的鲁棒性。研究使用多个指标对几种最先进的VLM进行基准测试,以评估语义对齐和事实基础。结果表明,尽管VLM在常见实物对象上的表现良好,但在3D打印物品上的性能显著下降,尽管它们的结构相似。研究还揭示了标准评估指标的关键缺陷,有时无法检测到域移或奖励错误的描述。这些发现强调了在物理机器人应用中部署基础模型时需要更鲁棒的架构和评估协议的必要性。
From Codebooks to VLMs: Evaluating Automated Visual Discourse Analysis for Climate Change on Social Media
Authors: Katharina Prasse, Steffen Jung, Isaac Bravo, Stefanie Walter, Patrick Knab, Christian Bartelt, Margret Keuper
First: 2026-04-23T15:44:14+00:00 · Latest: 2026-04-23T15:44:14+00:00
Abstract
Social media platforms have become primary arenas for climate communication, generating millions of images and posts that - if systematically analysed - can reveal which communication strategies mobilise public concern and which fall flat. We aim to facilitate such research by analysing how computer vision methods can be used for social media discourse analysis. This analysis includes application-based taxonomy design, model selection, prompt engineering, and validation. We benchmark six promptable vision-language models and 15 zero-shot CLIP-like models on two datasets from X (formerly Twitter) - a 1,038-image expert-annotated set and a larger corpus of over 1.2 million images, with 50,000 labels manually validated - spanning five annotation dimensions: animal content, climate change consequences, climate action, image setting, and image type. Among the models benchmarked, Gemini-3.1-flash-lite outperforms all others across all super-categories and both datasets, while the gap to open-weight models of moderate size remains relatively small. Beyond instance-level metrics, we advocate for distributional evaluation: VLM predictions can reliably recover population level trends even when per-image accuracy is moderate, making them a viable starting point for discourse analysis at scale. We find that chain-of-thought reasoning reduces rather than improves performance, and that annotation dimension specific prompt design improves performance. We release tweet IDs and labels along with our code at https://github.com/KathPra/Codebooks2VLMs.git.
中文标题/摘要
标题:从代码本到VLM:评估自动化视觉话语分析在社交媒体上的气候变迁研究
社交媒体平台已成为气候沟通的主要场所,生成了数百万张图片和帖子,如果系统地分析这些内容,可以揭示哪些沟通策略能激发公众关注,哪些则不然。我们旨在通过分析计算机视觉方法在社交媒体话语分析中的应用来促进此类研究。该分析包括基于应用的分类学设计、模型选择、提示工程和验证。我们在X(原Twitter)的两个数据集上对六种可提示的视觉-语言模型和十五种零样本CLIP-like模型进行了基准测试——一个由1,038张专家标注的图片集和一个包含超过120万张图片的更大语料库,其中5万个标签由人工验证,涵盖了五个标注维度:动物内容、气候变迁后果、气候行动、图片场景和图片类型。在基准测试的模型中,Gemini-3.1-flash-lite在所有超类别和两个数据集上均表现出色,与开放权重的中等规模模型之间的差距相对较小。除了实例级指标外,我们提倡分布式评估:VLM预测即使在单张图片准确率较低时也能可靠地恢复总体趋势,使它们成为大规模话语分析的可行起点。我们发现,链式推理反而降低了性能,而针对特定标注维度的提示设计则提高了性能。我们将在https://github.com/KathPra/Codebooks2VLMs.git发布推特ID和标签,并提供我们的代码。
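The abstract above argues for distributional evaluation: predictions can recover population-level trends even when per-image accuracy is moderate. The toy sketch below contrasts the two views. Total variation distance is an assumed choice of distribution-level measure, and the labels are invented for illustration.

```python
from collections import Counter

def accuracy(pred, true):
    """Fraction of images whose predicted label matches the annotation."""
    return sum(p == t for p, t in zip(pred, true)) / len(true)

def label_distribution(labels):
    """Relative frequency of each label in a collection."""
    counts = Counter(labels)
    return {label: count / len(labels) for label, count in counts.items()}

def total_variation(pred, true):
    """Total variation distance between predicted and true label shares."""
    p, q = label_distribution(pred), label_distribution(true)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in set(p) | set(q))

# Hypothetical labels: half the images are mislabeled, yet the overall
# share of each category is recovered exactly.
true_labels = ["animal", "animal", "protest", "protest"]
pred_labels = ["animal", "protest", "animal", "protest"]

instance_acc = accuracy(pred_labels, true_labels)
dist_gap = total_variation(pred_labels, true_labels)
```

Errors that cancel out across categories leave the aggregate picture intact, which is why the two metrics can disagree this sharply.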
MaskDiME: Adaptive Masked Diffusion for Precise and Efficient Visual Counterfactual Explanations
Authors: Changlu Guo, Anders Nymark Christensen, Anders Bjorholm Dahl, Morten Rieger Hannemose
First: 2026-02-21T10:53:50+00:00 · Latest: 2026-04-23T15:15:48+00:00
Comments: Accepted by CVPR2026
Abstract
Visual counterfactual explanations aim to reveal the minimal semantic modifications that can alter a model's prediction, providing causal and interpretable insights into deep neural networks. However, existing diffusion-based counterfactual generation methods are often computationally expensive, slow to sample, and imprecise in localizing the modified regions. To address these limitations, we propose MaskDiME, a simple, fast, yet effective diffusion framework that unifies semantic consistency and spatial precision through localized sampling. Our approach adaptively focuses on decision-relevant regions to achieve localized and semantically consistent counterfactual generation while preserving high image fidelity. Our training-free framework, MaskDiME, performs inference over 30x faster than the baseline and achieves comparable or state-of-the-art performance across five benchmark datasets spanning diverse visual domains, establishing a practical and generalizable solution for efficient counterfactual explanation.
Summary / 总结
MaskDiME generates precise and efficient visual counterfactual explanations by unifying semantic consistency and spatial precision through localized sampling. It adaptively focuses on decision-relevant regions to achieve localized and semantically consistent counterfactual generation while maintaining high image fidelity. The training-free framework performs inference over 30x faster than the baseline and achieves comparable or state-of-the-art performance across five benchmark datasets.
MaskDiME 通过局部采样统一语义一致性和空间精度,以生成精确且高效的视觉反事实解释。它会自适应地关注决策相关区域,以实现局部和语义一致的反事实生成,同时保持高图像保真度。MaskDiME 的推理速度比基线快 30 倍,并在五个涵盖不同视觉领域的基准数据集上实现了可比或最先进的性能,从而提供了一种实用且通用的高效反事实解释解决方案。
Ramen: Robust Test-Time Adaptation of Vision-Language Models with Active Sample Selection
Authors: Wenxuan Bao, Yanjun Zhao, Xiyuan Yang, Jingrui He
Venue: CVPR 2026
First: 2026-04-23T14:33:27+00:00 · Latest: 2026-04-23T14:33:27+00:00
Comments: Accepted by CVPR 2026 (Findings Track)
Abstract
Pretrained vision-language models such as CLIP exhibit strong zero-shot generalization but remain sensitive to distribution shifts. Test-time adaptation adapts models during inference without access to source data or target labels, offering a practical way to handle such shifts. However, existing methods typically assume that test samples come from a single, consistent domain, while in practice, test data often include samples from mixed domains with distinct characteristics. Consequently, their performance degrades under mixed-domain settings. To address this, we present Ramen, a framework for robust test-time adaptation through active sample selection. For each incoming test sample, Ramen retrieves a customized batch of relevant samples from previously seen data based on two criteria: domain consistency, which ensures that adaptation focuses on data from similar domains, and prediction balance, which mitigates adaptation bias caused by skewed predictions. To improve efficiency, Ramen employs an embedding-gradient cache that stores the embeddings and sample-level gradients of past test images. The stored embeddings are used to retrieve relevant samples, and the corresponding gradients are aggregated for model updates, eliminating the need for any additional forward or backward passes. Our theoretical analysis provides insight into why the proposed adaptation mechanism is effective under mixed-domain shifts. Experiments on multiple image corruption and domain-shift benchmarks demonstrate that Ramen achieves strong and consistent performance, offering robust and efficient adaptation in complex mixed-domain scenarios. Our code is available at https://github.com/baowenxuan/Ramen .
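The abstract above describes an embedding-gradient cache: stored embeddings are used to retrieve relevant past samples, and their stored gradients are aggregated without further forward or backward passes. The sketch below is a loose illustration of that idea. The cache layout, the cosine-similarity ranking, the per-class cap used to mimic prediction balance, and all names are assumptions rather than the Ramen implementation.

```python
import math
from collections import defaultdict

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def select_and_aggregate(query_emb, cache, per_class=2):
    """Pick cached samples and average their stored gradients.

    cache: list of dicts with keys 'emb', 'grad', 'pred' (hypothetical layout).
    Domain consistency is approximated by ranking on embedding similarity;
    prediction balance is approximated by keeping at most `per_class`
    samples for each predicted class.
    """
    by_class = defaultdict(list)
    for entry in cache:
        by_class[entry["pred"]].append(entry)
    selected = []
    for entries in by_class.values():
        entries.sort(key=lambda e: cosine(query_emb, e["emb"]), reverse=True)
        selected.extend(entries[:per_class])
    if not selected:
        return None
    dim = len(selected[0]["grad"])
    # Averaging cached gradients needs no extra forward or backward pass.
    return [sum(e["grad"][i] for e in selected) / len(selected) for i in range(dim)]

toy_cache = [
    {"emb": [1.0, 0.0], "grad": [1.0, 0.0], "pred": 0},
    {"emb": [0.0, 1.0], "grad": [5.0, 5.0], "pred": 0},
    {"emb": [1.0, 0.1], "grad": [3.0, 2.0], "pred": 1},
]
update = select_and_aggregate([1.0, 0.0], toy_cache, per_class=1)
```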
Causal Disentanglement for Full-Reference Image Quality Assessment
Authors: Zhen Zhang, Jielei Chu, Tian Zhang, Weide Liu, Fengmao Lv, Tianrui Li, Jun Cheng, Yuming Fang
First: 2026-04-23T13:18:13+00:00 · Latest: 2026-04-23T13:18:13+00:00
Abstract
Existing deep network-based full-reference image quality assessment (FR-IQA) models typically work by performing pairwise comparisons of deep features from the reference and distorted images. In this paper, we approach this problem from a different perspective and propose a novel FR-IQA paradigm based on causal inference and decoupled representation learning. Unlike typical feature comparison-based FR-IQA models, our approach formulates degradation estimation as a causal disentanglement process guided by intervention on latent representations. We first decouple degradation and content representations by exploiting the content invariance between the reference and distorted images. Second, inspired by the human visual masking effect, we design a masking module to model the causal relationship between image content and degradation features, thereby extracting content-influenced degradation features from distorted images. Finally, quality scores are predicted from these degradation features using either supervised regression or label-free dimensionality reduction. Extensive experiments demonstrate that our method achieves highly competitive performance on standard IQA benchmarks across fully supervised, few-label, and label-free settings. Furthermore, we evaluate the approach on diverse non-standard natural image domains with scarce data, including underwater, radiographic, medical, neutron, and screen-content images. Benefiting from its ability to perform scenario-specific training and prediction without labeled IQA data, our method exhibits superior cross-domain generalization compared to existing training-free FR-IQA models.
中文标题/摘要
标题:因果分离在全参考图像质量评估中的应用
现有的基于深度网络的全参考图像质量评估(FR-IQA)模型通常通过比较参考图像和失真图像的深度特征来进行成对比较。本文从不同角度出发,提出了一种基于因果推理和解耦表示学习的新型FR-IQA范式。与典型的基于特征比较的FR-IQA模型不同,我们的方法将退化估计公式化为由潜在表示干预引导的因果分离过程。首先,通过利用参考图像和失真图像之间的内容不变性,我们解耦退化表示和内容表示。其次,受到人类视觉掩蔽效应的启发,我们设计了一个掩蔽模块来建模图像内容和退化特征之间的因果关系,从而从失真图像中提取受内容影响的退化特征。最后,我们使用监督回归或无标签降维从这些退化特征预测质量评分。大量实验表明,我们的方法在标准图像质量评估基准上实现了高度竞争力的性能,涵盖全监督、少量标签和无标签设置。此外,我们还在包括水下、放射学、医学、中子和屏幕内容图像在内的多种非标准自然图像领域进行了评估,这些领域数据稀缺。得益于其能够在没有标注图像质量评估数据的情况下进行场景特定的训练和预测的能力,我们的方法在跨域泛化方面优于现有的无训练FR-IQA模型。
Summary / 总结
This paper proposes a novel full-reference image quality assessment (FR-IQA) method based on causal inference and decoupled representation learning. Unlike traditional feature comparison-based models, it formulates degradation estimation as a causal disentanglement process. The method first decouples degradation and content representations, then uses a masking module to extract content-influenced degradation features, and finally predicts quality scores. Experiments show that the proposed method performs competitively across various IQA benchmarks and demonstrates superior cross-domain generalization in diverse image domains with scarce data.
本文提出了一种基于因果推理和解耦表示学习的全参考图像质量评估(FR-IQA)方法。不同于传统的基于特征对比的方法,该方法将退化估计视为因果分离过程。方法首先分离退化和内容表示,然后使用遮罩模块提取内容影响的退化特征,最后使用监督回归或无标签降维预测质量得分。实验表明,所提出的方法在各种IQA基准测试中表现出色,并在具有稀缺数据的多种图像域中展示了优于现有无监督FR-IQA模型的跨域泛化能力。
LiveVLM: Efficient Online Video Understanding via Streaming-Oriented KV Cache and Retrieval
Authors: Zhenyu Ning, Guangda Liu, Qihao Jin, Chengwei Li, Wenchao Ding, Minyi Guo, Jieru Zhao
Venue: 63rd ACM/IEEE Design Automation Conference (DAC '26), July 2026
First: 2025-05-21T08:47:15+00:00 · Latest: 2026-04-23T12:54:38+00:00
Comments: Accepted by DAC'26
Abstract
Recent developments in Video Large Language Models (Video LLMs) have enabled models to process hour-long videos and exhibit exceptional performance. Nonetheless, the Key-Value (KV) cache expands linearly over time, leading to substantial memory overhead and response delay, both critical challenges in various real-world online applications, such as Deepseek services, autonomous driving and robotics. To mitigate these issues, we propose LiveVLM, a training-free and query-agnostic framework specifically designed for online video understanding and real-time interaction. LiveVLM employs a Vision Sink Bucketing (VSB) mechanism to process video streams in real time, retain long-term video details and eliminate redundant KVs. This mechanism utilizes vision-to-vision attention scores as the metric and seeks to maximize the coverage of contextual information during compression. Noting that KV cache compressed in a query-agnostic manner inevitably retains irrelevant information for specific queries, LiveVLM incorporates a Position-agnostic KV Retrieval (PaR) mechanism to reduce interference from redundant context. The key point of PaR lies in decoupling positional embeddings to enhance the similarity between key tensors, thereby supporting efficient retrieval at the granularity of pages. Extensive experiments demonstrate that LiveVLM enables the foundation LLaVA-OneVision model to achieve state-of-the-art accuracy among both training-free query-agnostic methods and training-based online models.
中文标题/摘要
标题:LiveVLM:通过流式导向的KV缓存和检索实现高效的在线视频理解
近期视频大型语言模型(Video LLMs)的发展使模型能够处理长达一小时的视频并表现出色。然而,KV缓存会随着时间线性扩展,导致显著的内存开销和响应延迟——这是各种实际在线应用中的关键挑战,如Deepseek服务、自动驾驶和机器人技术。为了解决这些问题,我们提出了一种名为$\textbf{LiveVLM}$的无需训练且与查询无关的框架,专门用于在线视频理解和实时交互。LiveVLM采用视觉桶化(VSB)机制实时处理视频流,保留长期视频细节并消除冗余的KV。该机制利用视觉到视觉注意力分数作为度量标准,并力求在压缩过程中最大化上下文信息的覆盖范围。鉴于以查询无关的方式压缩的KV缓存不可避免地保留了特定查询的相关信息,LiveVLM引入了一种位置无关的KV检索(PaR)机制以减少冗余上下文的干扰。PaR的关键在于解耦位置嵌入以增强关键张量之间的相似性,从而支持在页面级别进行高效的检索。大量实验表明,LiveVLM使基础的LLaVA-OneVision模型在无需训练的查询无关方法和基于训练的在线模型中均达到了最先进的准确率。
Summary / 总结
LiveVLM is a framework designed to address the memory overhead and response delay issues in Video Large Language Models (Video LLMs) for online applications. It uses a Vision Sink Bucketing (VSB) mechanism to process video streams in real time and a Position-agnostic KV Retrieval (PaR) mechanism to reduce interference from query-irrelevant context. Experiments show that LiveVLM enables the LLaVA-OneVision model to achieve state-of-the-art accuracy among both training-free query-agnostic methods and training-based online models.
LiveVLM 是一个框架,旨在解决视频大型语言模型(Video LLMs)在在线应用中的内存开销和响应延迟问题。它使用 Vision Sink Bucketing (VSB) 机制实时处理视频流,并使用 Position-agnostic KV Retrieval (PaR) 机制减少无关信息。实验表明,LiveVLM 使 LLaVA-OneVision 模型在训练-free 查询-agnostic 方法和训练-based 在线模型中均达到最先进的准确率。
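The abstract above notes that the KV cache grows linearly with video length. A back-of-the-envelope sketch makes the scale concrete. The model configuration, tokens per frame, and frame rate below are hypothetical values chosen for illustration and are not taken from the paper.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Memory for an uncompressed KV cache.

    The factor 2 accounts for storing both keys and values;
    bytes_per_value=2 corresponds to fp16 or bf16 storage.
    """
    return 2 * num_layers * num_kv_heads * head_dim * num_tokens * bytes_per_value

# Hypothetical configuration (assumed, not from the paper).
LAYERS, KV_HEADS, HEAD_DIM = 28, 4, 128
TOKENS_PER_FRAME, FPS = 196, 1

def cache_gib_after(seconds):
    """Cache size in GiB after streaming `seconds` of video."""
    tokens = seconds * FPS * TOKENS_PER_FRAME
    return kv_cache_bytes(tokens, LAYERS, KV_HEADS, HEAD_DIM) / 2**30

one_minute = cache_gib_after(60)
one_hour = cache_gib_after(3600)
```

Because every factor other than the token count is fixed by the model, an hour of video costs 60 times the memory of a minute, which is the growth that compression mechanisms such as VSB aim to curb.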
Reasoning on the Manifold: Bidirectional Consistency for Self-Verification in Diffusion Language Models
Authors: Jiaoyang Ruan, Xin Gao, Yinda Chen, Hengyu Zeng, Liang Du, Guanghao Li, Jie Fu, Jian Pu
First: 2026-04-17T10:17:16+00:00 · Latest: 2026-04-23T12:41:25+00:00
Comments: 30 pages, 5 figures
Abstract
While Diffusion Large Language Models (dLLMs) offer structural advantages for global planning, efficiently verifying that they arrive at correct answers via valid reasoning traces remains a critical challenge. In this work, we propose a geometric perspective: Reasoning on the Manifold. We hypothesize that valid generation trajectories reside as stable attractors on the high-density manifold of the learned distribution, whereas invalid paths exhibit off-manifold drift. To operationalize this, we introduce Bidirectional Manifold Consistency (BMC), a training-free, unsupervised metric that quantifies the stability of the generated sequence through a forward-masking and backward-reconstruction cycle. Empirically, we demonstrate BMC's versatility across the full reasoning lifecycle: (1) in Diagnosis, it serves as a robust discriminator of solution validity without ground-truth answers; (2) in Inference, it enables rejection resampling to effectively concentrate computational resources on complex reasoning tasks; and (3) in Alignment, it functions as a dense geometric reward that transforms sparse outcome supervision into fine-grained guidance, empowering models to self-evolve beyond standard baselines. Our results establish intrinsic geometric stability as a robust indicator of correctness for dLLMs.
中文标题/摘要
标题:流形上的推理:双向一致性在扩散语言模型中的自我验证
虽然扩散大型语言模型(dLLMs)在全局规划方面具有结构优势,但高效验证它们是否通过有效的推理轨迹到达正确答案仍然是一个关键挑战。在本文中,我们提出了一种几何视角:流形上的推理。我们假设有效的生成轨迹作为学习分布高密度流形上的稳定吸引子存在,而无效路径则表现出流形外的漂移。为了实现这一点,我们引入了双向流形一致性(BMC),这是一种无需训练、无监督的度量标准,通过前向掩蔽和后向重构循环量化生成序列的稳定性。实证上,我们展示了BMC在推理生命周期的全过程中具有灵活性:(1)在诊断中,它作为稳健的解决方案有效性鉴别器,无需参考答案;(2)在推理中,它使拒绝采样得以有效集中计算资源于复杂推理任务;(3)在对齐中,它作为密集的几何奖励,将稀疏的结果监督转化为精细的指导,使模型能够超越标准基线自我进化。我们的结果确立了内在的几何稳定性作为dLLMs正确性的稳健指标。
Summary / 总结
This work addresses the challenge of verifying the correctness of answers generated by Diffusion Large Language Models (dLLMs) through a geometric perspective called Reasoning on the Manifold. The proposed Bidirectional Manifold Consistency (BMC) metric quantifies the stability of generated sequences through a forward-masking and backward-reconstruction cycle. BMC is shown to be versatile, serving as a robust discriminator for solution validity, enabling rejection resampling in complex reasoning tasks, and providing dense geometric rewards for model alignment. The results indicate that intrinsic geometric stability is a reliable indicator of correctness for dLLMs.
本文通过几何视角“Reasoning on the Manifold”解决了验证扩散大型语言模型(dLLMs)生成答案正确性的挑战。提出的Bidirectional Manifold Consistency(BMC)方法通过前向掩码和后向重构比较来量化生成序列的稳定性。BMC在诊断、推理和对齐方面表现出色,作为解决方案有效性的稳健鉴别器,能够高效地在复杂推理任务中进行重采样,并提供密集的几何奖励以细化指导模型的进化。结果表明,内在的几何稳定性是dLLMs正确性的可靠指标。
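The abstract above describes BMC as a forward-masking and backward-reconstruction cycle that measures how stable a generated sequence is. The sketch below illustrates one plausible reading: mask random positions, let a model fill them back in, and report how often the original tokens return. The agreement-rate score, the `reconstruct` callable, and all parameter values are assumptions; the paper's actual definition of BMC is not given in the abstract.

```python
import random

def masked_agreement(tokens, reconstruct, mask_ratio=0.3, trials=8,
                     seed=0, mask_token="<mask>"):
    """Fraction of masked tokens that the model reconstructs unchanged.

    reconstruct: hypothetical callable taking a token list containing
    mask_token entries and returning a fully filled token list.
    """
    rng = random.Random(seed)
    n_mask = max(1, int(len(tokens) * mask_ratio))
    agree = 0
    for _ in range(trials):
        # Forward step: hide a random subset of positions.
        positions = rng.sample(range(len(tokens)), n_mask)
        masked = list(tokens)
        for i in positions:
            masked[i] = mask_token
        # Backward step: let the model fill the gaps, then compare.
        rebuilt = reconstruct(masked)
        agree += sum(rebuilt[i] == tokens[i] for i in positions)
    return agree / (trials * n_mask)

def toy_reconstruct(masked):
    """Stand-in model that believes token i should be the string str(i)."""
    return [str(i) if tok == "<mask>" else tok for i, tok in enumerate(masked)]

stable = masked_agreement(["0", "1", "2", "3", "4"], toy_reconstruct)
unstable = masked_agreement(["x", "x", "x", "x", "x"], toy_reconstruct)
```

A sequence the model would regenerate by itself scores high, while one it would rewrite scores low, which matches the intuition of valid traces sitting on the learned manifold.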
Process Supervision via Verbal Critique Improves Reasoning in Large Language Models
Authors: Hao-Yuan Chen
First: 2026-04-23T12:36:12+00:00 · Latest: 2026-04-23T12:36:12+00:00
Abstract
Inference-time scaling for LLM reasoning has focused on three axes: chain depth, sample breadth, and learned step-scorers (PRMs). We introduce a fourth axis, granularity of external verbal supervision, via Verbal Process Supervision (VPS), a training-free framework that uses structured natural-language critique from a stronger supervisor to guide an iterative generate-critique-refine loop up to a round budget R. Across GPQA Diamond, AIME 2025, and LiveCodeBench V6 (covering both closed and open models), VPS yields three key results. First, on GPQA Diamond, GPT-5.4 (High) | GPT-5.4 (Low) reaches 94.9% at R=4, surpassing the 94.1% state of the art without gradient updates. Second, on AIME 2025, VPS enables strong weak-actor rescue, boosting scores from 11.7-26.7% to 63.3-90.0% (up to +63.3 points). Third, at matched compute, VPS outperforms Reflexion by +8.5 to +12.1 points and Self-Consistency@5 by +5.0 pp (GPQA) and +8.3 pp (LiveCodeBench), isolating critique granularity as the key driver. Performance scales with the supervisor-actor capability gap (Pearson r=0.90) and degrades when errors are not linguistically expressible (e.g., code synthesis), motivating hybrid verbal-executable methods. These results establish critique granularity as a new axis of inference-time scaling.
中文标题/摘要
标题:通过口头批评进行过程监督以提高大型语言模型的推理能力
针对大语言模型(LLM)推理时的扩展性,研究主要集中在三个维度上:推理链深度、样本广度和学习步骤评分器(PRMs)。我们引入了第四个维度,即外部口头监督的精细度,通过口头过程监督(VPS)框架,该框架利用更强的监督者提供的结构化自然语言批评来引导生成-批评-修正循环,直到轮次预算R。在GPQA Diamond、AIME 2025和LiveCodeBench V6(涵盖封闭和开放模型)上,VPS取得了三个关键结果。首先,在GPQA Diamond上,GPT-5.4(高)| GPT-5.4(低)在R=4时达到94.9%,超越了94.1%的最新技术水平,无需梯度更新。其次,在AIME 2025上,VPS使弱演员救援变得强大,分数从11.7%-26.7%提升到63.3%-90.0%(最多提升63.3分)。第三,在匹配计算资源的情况下,VPS在Reflexion上高出8.5到12.1分,在GPQA上比Self-Consistency@5高出5.0分,在LiveCodeBench上高出8.3分,将批评的精细度作为关键驱动因素。性能与监督者-演员能力差距成正比(皮尔逊r=0.90),当错误无法用语言表达(例如代码合成)时性能会下降,这促使了混合口头-执行方法的发展。这些结果确立了批评的精细度作为推理时扩展的新维度。
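The abstract above describes VPS as an iterative generate-critique-refine loop bounded by a round budget R. The sketch below shows that control flow with plain callables standing in for the actor and supervisor models. The callable signatures and the early stop when the supervisor raises no objection are assumptions for illustration.

```python
def verbal_process_supervision(problem, actor, supervisor, max_rounds):
    """Generate-critique-refine loop with a round budget.

    actor(problem, previous_answer, critique) -> answer   (hypothetical)
    supervisor(problem, answer) -> critique text, or None to accept
    """
    answer = actor(problem, None, None)  # initial generation
    for _ in range(max_rounds):
        critique = supervisor(problem, answer)
        if critique is None:
            break  # assumed early stop: nothing left to fix
        answer = actor(problem, answer, critique)
    return answer

# Toy stand-ins for the two models.
def toy_actor(problem, previous_answer, critique):
    if critique is None:
        return "5"  # weak first attempt
    return critique.split()[-1]  # follow the supervisor's hint

def toy_supervisor(problem, answer):
    return None if answer == "4" else "the sum is 4"

refined = verbal_process_supervision("2+2", toy_actor, toy_supervisor, max_rounds=4)
unrefined = verbal_process_supervision("2+2", toy_actor, toy_supervisor, max_rounds=0)
```

With a budget of zero the actor's first attempt is returned unchanged, so the same function also yields the unsupervised baseline.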
Musical Score Understanding Benchmark: Evaluating Large Language Models' Comprehension of Complete Musical Scores
Authors: Congren Dai, Yue Yang, Krinos Li, Huichi Zhou, Shijie Liang, Bo Zhang, Enyang Liu, Ge Jin, Hongran An, Haosen Zhang, Peiyuan Jing, Kinhei Lee, Zhenxuan Zhang, Xiaobing Li, Maosong Sun
Venue: ACL 2026
First: 2025-11-24T06:40:38+00:00 · Latest: 2026-04-23T11:52:26+00:00
Comments: Accepted to ACL 2026 Main Conference
Abstract
Understanding complete musical scores entails integrated reasoning over pitch, rhythm, harmony, and large-scale structure, yet the ability of Large Language Models and Vision-Language Models to interpret full musical notation remains insufficiently examined. We introduce Musical Score Understanding Benchmark (MSU-Bench), a human-curated benchmark for score-level musical understanding across textual (ABC notation) and visual (PDF) modalities. MSU-Bench contains 1,800 generative question-answer pairs from works by Bach, Beethoven, Chopin, Debussy, and others, organised into four levels of increasing difficulty, ranging from onset information to texture and form. Evaluations of more than fifteen state-of-the-art models, in both zero-shot and fine-tuned settings, reveal pronounced modality gaps, unstable level-wise performance, and challenges in maintaining multilevel correctness. Fine-tuning substantially improves results across modalities while preserving general knowledge, positioning MSU-Bench as a robust foundation for future research in multimodal reasoning. The benchmark and code are available at https://github.com/Congren-Dai/MSU-Bench.
Component-Based Out-of-Distribution Detection
Authors: Wenrui Liu, Hong Chang, Ruibing Hou, Shiguang Shan, Xilin Chen
First: 2026-04-23T11:19:39+00:00 · Latest: 2026-04-23T11:19:39+00:00
Abstract
Out-of-Distribution (OOD) detection requires sensitivity to subtle shifts without overreacting to natural In-Distribution (ID) diversity. However, from the viewpoint of detection granularity, global representations inevitably suppress local OOD cues, while patch-based methods are unstable due to entangled spurious correlations and noise. Neither is effective at detecting compositional OODs composed of valid ID components. Inspired by recognition-by-components theory, we present a training-free Component-Based OOD Detection (CoOD) framework that addresses the existing limitations by decomposing inputs into functional components. To instantiate CoOD, we derive Component Shift Score (CSS) to detect local appearance shifts, and Compositional Consistency Score (CCS) to identify cross-component compositional inconsistencies. Empirically, CoOD achieves consistent improvements on both coarse- and fine-grained OOD detection.
Seeing Isn't Believing: Uncovering Blind Spots in Evaluator Vision-Language Models
Authors: Mohammed Safi Ur Rahman Khan, Sanjay Suryanarayanan, Tushar Anand, Mitesh M. Khapra
First: 2026-04-23T10:36:50+00:00 · Latest: 2026-04-23T10:36:50+00:00
Abstract
Large Vision-Language Models (VLMs) are increasingly used to evaluate outputs of other models, for image-to-text (I2T) tasks such as visual question answering, and text-to-image (T2I) generation tasks. Despite this growing reliance, the reliability of these Evaluator VLMs remains underexplored. In this work, we systematically evaluate the reliability of Evaluator VLMs across both I2T and T2I tasks. We introduce targeted perturbations that degrade output quality along key error dimensions, including object hallucinations, spatial reasoning, factual grounding, and visual fidelity. These perturbations test whether Evaluator VLMs can reliably account for these quality-degrading errors in their evaluations. Using a comprehensive benchmark of over 4000 perturbed instances spanning 40 perturbation dimensions, we evaluate 4 prominent VLMs using single-answer scoring, pairwise comparison, and reference-guided paradigms. Our findings reveal that current VLM evaluators exhibit substantial blind spots: they often fail to detect perturbed outputs, with failure rates exceeding 50% in some cases; they struggle particularly with fine-grained compositional and spatial errors; and they are often insensitive to hallucinated content that contradicts the input image. Pairwise comparison proves more reliable, though failure rates persist. These results highlight the unreliable nature of current Evaluator VLMs and urge caution in their deployment for benchmarking and development decisions. Code and data have been made publicly available.
中文标题/摘要
标题:看而不信:揭示评估者视觉-语言模型的盲点
大型视觉-语言模型(VLMs)越来越多地用于评估其他模型的输出,特别是在图像到文本(I2T)任务如视觉问答和文本到图像(T2I)生成任务中。尽管如此,这些评估者VLMs的可靠性仍然没有得到充分探索。在本研究中,我们系统地评估了评估者VLMs在I2T和T2I任务中的可靠性。我们引入了有针对性的扰动,这些扰动在关键错误维度上降低了输出质量,包括物体幻觉、空间推理、事实基础和视觉保真度。这些扰动测试了评估者VLMs是否能够可靠地在其评估中考虑到这些质量降低的错误。使用涵盖4000多个扰动实例和40个扰动维度的综合基准,我们使用单答案评分、成对比较和参考引导的方法评估了4个主要的VLMs。我们的研究结果揭示了当前VLM评估器存在显著的盲点:它们经常无法检测到扰动输出,在某些情况下超过50%;特别难以处理细粒度的组合和空间错误;并且对与输入图像相矛盾的幻觉内容往往不够敏感。成对比较虽然更可靠,但失败率仍然存在。这些结果突显了当前评估者VLMs的不可靠性,并要求在基准测试和开发决策中谨慎使用。代码和数据已公开。
Summary / 总结
This study evaluates the reliability of Evaluator Vision-Language Models (VLMs) in assessing image-to-text and text-to-image tasks. By introducing targeted perturbations that degrade output quality, the research reveals significant blind spots in these models, especially in detecting fine-grained compositional and spatial errors, and hallucinations that contradict input images. The study uses a comprehensive benchmark and multiple evaluation paradigms to show that current VLMs often fail to reliably detect quality-degrading errors, highlighting the need for caution in their deployment for benchmarking and development decisions.
研究评估了评价视觉语言模型(VLMs)在图像到文本和文本到图像任务中的可靠性。通过引入降级输出质量的针对性干扰,研究揭示了这些模型在检测细粒度组合和空间错误以及与输入图像矛盾的幻觉方面存在显著盲点。研究使用了综合基准和多种评估范式,表明当前VLMs往往无法可靠地检测质量下降的错误,强调了在基准测试和开发决策中谨慎使用这些模型的必要性。
PosterForest: Hierarchical Multi-Agent Collaboration for Scientific Poster Generation
Authors: Jiho Choi, Seojeong Park, Seongjong Song, Hyunjung Shim
Venue: ACL 2026
First: 2025-08-29T15:36:06+00:00 · Latest: 2026-04-23T09:35:10+00:00
Abstract
Automating scientific poster generation requires hierarchical document understanding and coherent content-layout planning. Existing methods often rely on flat summarization or optimize content and layout separately. As a result, they often suffer from information loss, weak logical flow, and poor visual balance. We present PosterForest, a training-free framework for scientific poster generation. Our method introduces the Poster Tree, a structured intermediate representation that captures document hierarchy and visual-textual semantics across multiple levels. Building on this representation, content and layout agents perform hierarchical reasoning and recursive refinement, progressively optimizing the poster from global organization to local composition. This joint optimization improves semantic coherence, logical flow, and visual harmony. Experiments show that PosterForest outperforms prior methods in both automatic and human evaluations, without additional training or domain-specific supervision.
中文标题/摘要
标题:PosterForest:科学海报生成的分层多智能体协作
自动化科学海报生成需要分层文档理解和连贯的内容-布局规划。现有方法通常依赖于平面总结或分别优化内容和布局,因此往往存在信息丢失、逻辑流程弱和视觉平衡差的问题。我们提出了PosterForest,一种无需训练的科学海报生成框架。我们的方法引入了Poster树,这是一种结构化的中间表示,能够捕捉多个层次上的文档层次和视觉-文本语义。基于这种表示,内容和布局代理进行分层推理和递归细化,逐步从全局组织到局部组成优化海报。这种联合优化提高了语义连贯性、逻辑流程和视觉和谐性。实验表明,PosterForest在自动和人工评估中均优于先前的方法,无需额外训练或领域特定监督。
Summary / 总结
The research aims to improve the hierarchical understanding and coherent planning of scientific posters. The method introduces a Poster Tree as a structured intermediate representation to capture document hierarchy and visual-textual semantics. Content and layout agents perform hierarchical reasoning and recursive refinement, optimizing the poster from global organization to local composition. Experiments demonstrate that PosterForest outperforms previous methods in both automatic and human evaluations without additional training or domain-specific supervision.
研究旨在提高科学海报的层次理解和连贯规划。方法引入了Poster Tree作为结构化的中间表示,以捕捉文档层次和视觉-文本语义。内容和布局代理进行层次推理和递归细化,从全局组织到局部组成优化海报。实验表明,PosterForest在自动和人工评估中均优于先前的方法,无需额外训练或领域特定监督。
Instance-level Visual Active Tracking with Occlusion-Aware Planning
Authors: Haowei Sun, Kai Zhou, Hao Gao, Shiteng Zhang, Jinwu Hu, Xutao Wen, Qixiang Ye, Mingkui Tan
Venue: CVPR 2026 Poster
First: 2026-04-23T09:11:50+00:00 · Latest: 2026-04-23T09:11:50+00:00
Comments: CVPR 2026 Poster
Abstract
Visual Active Tracking (VAT) aims to control cameras to follow a target in 3D space, which is critical for applications like drone navigation and security surveillance. However, it faces two key bottlenecks in real-world deployment: confusion from visually similar distractors caused by insufficient instance-level discrimination and severe failure under occlusions due to the absence of active planning. To address these, we propose OA-VAT, a unified pipeline with three complementary modules. First, a training-free Instance-Aware Offline Prototype Initialization aggregates multi-view augmented features via DINOv3 to construct discriminative instance prototypes, mitigating distractor confusion. Second, an Online Prototype Enhancement Tracker enhances prototypes online and integrates a confidence-aware Kalman filter for stable tracking under appearance and motion changes. Third, an Occlusion-Aware Trajectory Planner, trained on our new Planning-20k dataset, uses conditional diffusion to generate obstacle-avoiding paths for occlusion recovery. Experiments demonstrate OA-VAT achieves 0.93 average SR on UnrealCV (+2.2% vs. SOTA TrackVLA), 90.8% average CAR on real-world datasets (+12.1% vs. SOTA GC-VAT), and 81.6% TSR on a DJI Tello drone. Running at 35 FPS on an RTX 3090, it delivers robust, real-time performance for practical deployment.
中文标题/摘要
标题:基于实例的视觉主动跟踪与遮挡感知规划
视觉主动跟踪(VAT)旨在控制相机在三维空间内跟随目标,对于无人机导航和安全监控等应用至关重要。然而,实际部署中面临两个关键瓶颈:由于实例级区分不足导致的视觉相似干扰物混淆,以及由于缺乏主动规划而导致的严重遮挡失效。为了解决这些问题,我们提出了OA-VAT,这是一种统一的管道,包含三个互补模块。首先,一种无需训练的实例感知离线原型初始化模块通过DINOv3聚合多视角增强特征,构建区分性实例原型,减轻干扰物混淆。其次,一种在线原型增强跟踪器在线增强原型,并结合一种基于置信度的卡尔曼滤波器,以应对外观和运动变化下的稳定跟踪。第三,一种遮挡感知轨迹规划器,基于我们新构建的Planning-20k数据集进行训练,使用条件扩散生成避障路径,以恢复遮挡。实验表明,OA-VAT在UnrealCV上实现了0.93的平均SR(比SOTA TrackVLA高2.2%),在真实世界数据集上实现了90.8%的平均CAR(比SOTA GC-VAT高12.1%),在DJI Tello无人机上实现了81.6%的TSR。在RTX 3090上运行速度为35 FPS,实现了稳健的实时性能,适用于实际部署。
Summary / 总结
OA-VAT targets two failure modes of visual active tracking, distractor confusion and occlusion, by combining instance-level discrimination with occlusion-aware planning. It uses a unified pipeline with three modules: Instance-Aware Offline Prototype Initialization, Online Prototype Enhancement Tracker, and Occlusion-Aware Trajectory Planner. The system reports 0.93 average SR on UnrealCV (+2.2% vs. TrackVLA), 90.8% average CAR on real-world datasets (+12.1% vs. GC-VAT), and 81.6% TSR on a DJI Tello drone, while running at 35 FPS on an RTX 3090.
OA-VAT 通过整合实例级区分和遮挡处理来解决视觉主动跟踪的挑战,采用一个统一的管道,包含三个模块:实例感知的离线原型初始化、在线原型增强跟踪器和遮挡感知轨迹规划器。该系统展示了卓越的性能,实现在 UnrealCV 上的平均成功率 0.93、真实世界数据集上的平均正确关联率 90.8%,以及在 DJI Tello 无人机上的跟踪成功率 81.6%,显著优于现有方法。
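The OA-VAT abstract mentions a confidence-aware Kalman filter but does not give its formulation. The following is a minimal scalar sketch of one common way to make a Kalman update confidence-aware, assuming the measurement noise is inflated when the tracker's confidence is low. The function name `kalman_step`, the static motion model, and the parameters `q`, `base_r`, and `eps` are illustrative assumptions, not the authors' implementation.

```python
def kalman_step(x, p, z, confidence, q=0.01, base_r=0.1, eps=1e-3):
    """One scalar Kalman update whose measurement noise depends on confidence.

    x, p        : previous state estimate and its variance
    z           : new measurement (e.g. one coordinate of the target position)
    confidence  : detector/tracker confidence in (0, 1] (assumed input)
    q, base_r   : process noise and baseline measurement noise (assumed values)
    """
    # Predict with a static motion model: the state stays, uncertainty grows.
    p_pred = p + q
    # Assumption: low confidence means a noisier measurement, so inflate R.
    r = base_r / max(confidence, eps)
    # Standard Kalman gain and correction.
    k = p_pred / (p_pred + r)
    x_new = x + k * (z - x)
    p_new = (1.0 - k) * p_pred
    return x_new, p_new


# A confident measurement pulls the estimate further than an uncertain one.
confident_x, _ = kalman_step(0.0, 1.0, 10.0, confidence=0.9)
uncertain_x, _ = kalman_step(0.0, 1.0, 10.0, confidence=0.05)
```

With this design, a low-confidence detection (for example during partial occlusion) barely moves the estimate, so the track stays close to its prediction until reliable observations return.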
PLAF: Pixel-wise Language-Aligned Feature Extraction for Efficient 3D Scene Understanding
Authors: Junjie Wen, Junlin He, Fei Ma, Jinqiang Cui
First: 2026-04-17T07:24:14+00:00 · Latest: 2026-04-23T09:00:00+00:00
Comments: Accepted by ICCA 2026
Abstract
Accurate open-vocabulary 3D scene understanding requires semantic representations that are both language-aligned and spatially precise at the pixel level, while remaining scalable when lifted to 3D space. However, existing representations struggle to jointly satisfy these requirements, and densely propagating pixel-wise semantics to 3D often results in substantial redundancy, leading to inefficient storage and querying in large-scale scenes. To address these challenges, we present PLAF, a Pixel-wise Language-Aligned Feature extraction framework that enables dense and accurate semantic alignment in 2D without sacrificing open-vocabulary expressiveness. Building upon this representation, we further design an efficient semantic storage and querying scheme that significantly reduces redundancy across both 2D and 3D domains. Experimental results show that PLAF provides a strong semantic foundation for accurate and efficient open-vocabulary 3D scene understanding. The codes are publicly available at https://github.com/RockWenJJ/PLAF.
中文标题/摘要
标题:PLAF:像素级语言对齐特征提取以实现高效的3D场景理解
准确的开放词汇3D场景理解需要同时在像素级别上具有语义对齐和空间精确性的语义表示,同时在提升到3D空间时保持可扩展性。然而,现有的表示方法难以同时满足这些要求,而密集传播像素级语义到3D通常会导致大量冗余,导致在大规模场景中存储和查询效率低下。为了解决这些挑战,我们提出了PLAF,一种像素级语言对齐特征提取框架,能够在2D中实现密集且准确的语义对齐,而不牺牲开放词汇的表达能力。在此表示基础上,我们进一步设计了一种高效的语义存储和查询方案,显著减少了2D和3D域中的冗余。实验结果表明,PLAF为准确高效的开放词汇3D场景理解提供了强大的语义基础。代码已公开发布在https://github.com/RockWenJJ/PLAF。
Summary / 总结
The research aims to develop a method for accurate open-vocabulary 3D scene understanding by creating a pixel-wise language-aligned feature extraction framework called PLAF. This framework enables dense and precise semantic alignment in 2D while maintaining scalability in 3D. The key experimental finding is that PLAF provides a strong semantic foundation for efficient and accurate 3D scene understanding without substantial redundancy. The codes are publicly available.
研究旨在通过创建像素级语言对齐特征提取框架PLAF来实现准确的开放词汇3D场景理解。该框架在2D中实现密集且精确的语义对齐,同时在3D中保持可扩展性。关键实验发现是,PLAF为高效的3D场景理解提供了强大的语义基础,且没有大量冗余。代码已公开可用。
VG-CoT: Towards Trustworthy Visual Reasoning via Grounded Chain-of-Thought
Authors: Byeonggeuk Lim, Kyeonghyun Kim, JungMin Yun, YoungBin Kim
First: 2026-04-23T08:04:07+00:00 · Latest: 2026-04-23T08:04:07+00:00
Comments: Accepted to LREC 2026
Abstract
The advancement of Large Vision-Language Models (LVLMs) requires precise local region-based reasoning that faithfully grounds the model's logic in actual visual evidence. However, existing datasets face limitations in scalability due to extensive manual annotation and lack of explicit alignment between multi-step reasoning and corresponding image regions, which constrains the evaluation of model trustworthiness. To address these challenges, we propose the Visual Grounding Chain-of-Thought (VG-CoT) dataset, which explicitly links each reasoning step to real visual evidence within the image through a fully automated three-stage pipeline. The pipeline first extracts object- and text-level visual evidence using state-of-the-art detection and OCR models, then generates step-by-step grounded reasoning with GPT-4o, and finally refines the grounding through a rationale-driven open-set detection process. In addition, we introduce a new benchmark that comprehensively evaluates LVLMs reasoning across three complementary dimensions: Rationale Quality, Answer Accuracy, and Reasoning-Answer Alignment. Experiments with representative LVLMs, including LLaVA-1.5 and Qwen2-VL, demonstrate consistent improvements on most evaluation metrics, confirming that VG-CoT effectively enhances trustworthy, evidence-based reasoning while maintaining scalable and cost-efficient dataset construction. The dataset and code will be released publicly upon acceptance to facilitate further research.
中文标题/摘要
标题:VG-CoT:通过基于视觉的链式思考迈向可信赖的视觉推理
大型视觉-语言模型(LVLM)的进步需要精确的局部区域推理,使模型的逻辑与实际视觉证据紧密联系。然而,现有数据集由于大量手动标注和缺乏多步推理与相应图像区域的显式对齐而面临可扩展性限制,这限制了对模型可信度的评估。为解决这些挑战,我们提出了视觉定位链式思考(VG-CoT)数据集,通过全自动三阶段管道将每个推理步骤明确链接到图像中的实际视觉证据。该管道首先使用最先进的检测和OCR模型提取对象和文本级别的视觉证据,然后使用GPT-4o生成逐步的基于视觉的推理,最后通过基于推理的开放集检测过程进行细化。此外,我们引入了一个新的基准,全面评估LVLM在三个互补维度上的推理能力:推理质量、答案准确性以及推理-答案对齐。实验表明,包括LLaVA-1.5和Qwen2-VL在内的代表性LVLM在大多数评估指标上都取得了持续改进,证明VG-CoT有效地增强了基于证据的可信赖推理,同时保持了可扩展和成本效益的数据集构建。数据集和代码将在接受后公开,以促进进一步的研究。
Summary / 总结
VG-CoT is a dataset designed to improve the trustworthiness of visual reasoning by linking each reasoning step to actual visual evidence in images. It uses a three-stage pipeline involving object and text detection, GPT-4o for grounded reasoning, and open-set detection for refinement. This dataset and a new benchmark are used to evaluate Large Vision-Language Models (LVLMs) across rationale quality, answer accuracy, and reasoning-answer alignment, showing consistent improvements in these metrics. The dataset and code will be publicly released.
VG-CoT旨在通过将每个推理步骤与图像中的实际视觉证据联系起来,提高视觉推理的可信度。它使用三阶段流水线来提取视觉证据、生成基于视觉的推理并对其进行精炼。该数据集和一个新的基准被引入,以从推理质量、答案准确性和推理-答案一致性三个方面全面评估LVLMs。实验结果显示,在评估指标上的一致改进,证实了VG-CoT在增强基于证据的推理方面的有效性。
Prototype-Based Test-Time Adaptation of Vision-Language Models
Authors: Zhaohong Huang, Yuxin Zhang, Wenjing Liu, Fei Chao, Rongrong Ji
First: 2026-04-23T07:20:56+00:00 · Latest: 2026-04-23T07:20:56+00:00
Abstract
Test-time adaptation (TTA) has emerged as a promising paradigm for vision-language models (VLMs) to bridge the distribution gap between pre-training and test data. Recent works have focused on backpropagation-free TTA methods that rely on cache-based designs, but these introduce two key limitations. First, inference latency increases as the cache grows with the number of classes, leading to inefficiencies in large-scale settings. Second, suboptimal performance occurs when the cache contains insufficient or incorrect samples. In this paper, we present Prototype-Based Test-Time Adaptation (PTA), an efficient and effective TTA paradigm that uses a set of class-specific knowledge prototypes to accumulate knowledge from test samples. Particularly, knowledge prototypes are adaptively weighted based on the zero-shot class confidence of each test sample, incorporating the sample's visual features into the corresponding class-specific prototype. It is worth highlighting that the knowledge from past test samples is integrated and utilized solely in the prototypes, eliminating the overhead of cache population and retrieval that hinders the efficiency of existing TTA methods. This endows PTA with extremely high efficiency while achieving state-of-the-art performance on 15 image recognition benchmarks and 4 robust point cloud analysis benchmarks. For example, PTA improves CLIP's accuracy from 65.64% to 69.38% on 10 cross-domain benchmarks, while retaining 92% of CLIP's inference speed on large-scale ImageNet-1K. In contrast, the cache-based TDA achieves a lower accuracy of 67.97% and operates at only 50% of CLIP's inference speed.
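The prototype mechanism described in the PTA abstract can be sketched as follows. This is a minimal illustration, assuming cosine-similarity zero-shot classification and a confidence-weighted running sum as the prototype update. The abstract does not give the exact update or fusion rule, so the class name `PrototypeAdapter`, the `temperature` value, and the fusion weight `alpha` are assumptions rather than the authors' method.

```python
import math


def normalize(v):
    """Scale a vector to unit length (leaves an all-zero vector unchanged)."""
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]


class PrototypeAdapter:
    """Keeps one running visual prototype per class instead of a sample cache."""

    def __init__(self, text_embeds, temperature=0.01, alpha=1.0):
        self.text = [normalize(t) for t in text_embeds]
        dim = len(text_embeds[0])
        self.proto_sum = [[0.0] * dim for _ in self.text]
        self.proto_weight = [0.0] * len(self.text)
        self.temperature = temperature  # assumed softmax temperature
        self.alpha = alpha              # assumed prototype fusion weight

    def predict_and_update(self, image_feat):
        feat = normalize(image_feat)
        # Zero-shot class confidence from image-text similarity.
        zero_shot = [dot(feat, t) for t in self.text]
        probs = softmax([s / self.temperature for s in zero_shot])
        # Fuse zero-shot scores with similarity to the accumulated prototypes.
        scores = []
        for c, s in enumerate(zero_shot):
            if self.proto_weight[c] > 0:
                s += self.alpha * dot(feat, normalize(self.proto_sum[c]))
            scores.append(s)
        pred = max(range(len(scores)), key=scores.__getitem__)
        # Fold the feature into the prototype of the most confident zero-shot
        # class, weighted by that confidence.
        top = max(range(len(probs)), key=probs.__getitem__)
        w = probs[top]
        self.proto_sum[top] = [p + w * x for p, x in zip(self.proto_sum[top], feat)]
        self.proto_weight[top] += w
        return pred


adapter = PrototypeAdapter([[1.0, 0.0], [0.0, 1.0]])
first = adapter.predict_and_update([0.9, 0.1])
second = adapter.predict_and_update([0.2, 0.8])
```

Memory and per-sample cost in this sketch depend only on the number of classes and the feature dimension, not on how many test samples have been seen, which is the property the abstract contrasts with cache-based designs.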
Symbolic Grounding Reveals Representational Bottlenecks in Abstract Visual Reasoning
Authors: Mohit Vaishnav, Tanel Tammet
Venue: 30th Conference on Computational Natural Language Learning (CoNLL), 2026
First: 2026-04-23T07:03:48+00:00 · Latest: 2026-04-23T07:03:48+00:00
Abstract
Vision-language models (VLMs) often fail on abstract visual reasoning benchmarks such as Bongard problems, raising the question of whether the main bottleneck lies in reasoning or representation. We study this on Bongard-LOGO, a synthetic benchmark of abstract concept learning with ground-truth generative programs, by comparing end-to-end VLMs on raw images with large language models (LLMs) given symbolic inputs derived from those images. Using symbolic inputs as a diagnostic probe rather than a practical multimodal architecture, our Componential-Grammatical (C-G) paradigm reformulates Bongard-LOGO as a symbolic reasoning task based on LOGO-style action programs or structured descriptions. LLMs achieve large and consistent gains, reaching mid-90s accuracy on Free-form problems, while a strong visual baseline remains near chance under matched task definitions. Ablations on input format, explicit concept prompts, and minimal visual grounding show that these factors matter much less than the shift from pixels to symbolic structure. These results identify representation as a key bottleneck in abstract visual reasoning and show how symbolic input can serve as a controlled diagnostic upper bound.
Summary / 总结
The study investigates whether the bottleneck in vision-language models (VLMs) on abstract visual reasoning tasks such as Bongard problems lies in reasoning or in representation. It compares end-to-end VLMs on raw Bongard-LOGO images with large language models (LLMs) given symbolic inputs derived from those images. The LLMs reach mid-90s accuracy on Free-form problems, while a strong visual baseline remains near chance under matched task definitions. This indicates that representation is a key bottleneck, and that symbolic inputs serve as a useful diagnostic probe.
研究通过将视觉-语言模型(VLMs)与给定图像符号输入的大语言模型(LLMs)在Bongard-LOGO合成基准上的比较,来探究抽象视觉推理中的瓶颈。使用Componential-Grammatical(C-G)范式将任务重新表述为基于LOGO风格动作程序或结构化描述的符号推理问题,结果显示LLMs显著优于VLMs,在Free-form问题上的准确率达到90%以上的中段(mid-90s)。这表明表示是抽象视觉推理中的关键瓶颈,而符号输入可以作为受控的诊断性上限。
Semantic-Fast-SAM: Efficient Semantic Segmenter
Authors: Byunghyun Kim
First: 2026-04-22T04:18:39+00:00 · Latest: 2026-04-23T05:32:11+00:00
Comments: APSIPA ASC 2025
Abstract
We propose Semantic-Fast-SAM (SFS), a semantic segmentation framework that combines the Fast Segment Anything model with a semantic labeling pipeline to achieve real-time performance without sacrificing accuracy. FastSAM is an efficient CNN-based re-implementation of the Segment Anything Model (SAM) that runs much faster than the original transformer-based SAM. Building upon FastSAM's rapid mask generation, we integrate a Semantic-Segment-Anything (SSA) labeling strategy to assign meaningful categories to each mask. The resulting SFS model produces high-quality semantic segmentation maps at a fraction of the computational cost and memory footprint of the original SAM-based approach. Experiments on Cityscapes and ADE20K benchmarks demonstrate that SFS matches the accuracy of prior SAM-based methods (mIoU ~ 70.33 on Cityscapes and 48.01 on ADE20K) while achieving approximately 20x faster inference than SSA in the closed-set setting. We also show that SFS effectively handles open-vocabulary segmentation by leveraging CLIP-based semantic heads, outperforming recent open-vocabulary models on broad class labeling. This work enables practical real-time semantic segmentation with the "segment-anything" capability, broadening the applicability of foundation segmentation models in robotics scenarios. The implementation is available at https://github.com/KBH00/Semantic-Fast-SAM.
Summary / 总结
Semantic-Fast-SAM (SFS) combines FastSAM with a semantic labeling pipeline to achieve real-time semantic segmentation with high accuracy. The model uses FastSAM's efficient mask generation and integrates a Semantic-Segment-Anything (SSA) strategy to assign categories, resulting in high-quality segmentation maps at a fraction of the computational cost of the original SAM-based approach. Experiments show that SFS matches the accuracy of prior SAM-based methods while achieving approximately 20x faster inference in the closed-set setting and outperforms recent open-vocabulary models in broad class labeling. This work enables practical real-time semantic segmentation for robotics applications.
Semantic-Fast-SAM (SFS) 结合了 FastSAM 和语义标注管道,实现了实时语义分割,精度与先前基于 SAM 的方法在 Cityscapes 和 ADE20K 基准上的表现相当,同时比封闭集设置下的 SSA 快约 20 倍。SFS 还通过利用基于 CLIP 的语义头部,在处理广泛类别标注时优于近期的开放词汇模型。该工作使实时语义分割成为现实,适用于机器人场景。
BiTDiff: Fine-Grained 3D Conducting Motion Generation via BiMamba-Transformer Diffusion
Authors: Tianzhi Jia, Kaixing Yang, Xiaole Yang, Xulong Tang, Ke Qiu, Shikui Wei, Yao Zhao
First: 2026-04-06T03:49:36+00:00 · Latest: 2026-04-23T03:50:19+00:00
Comments: 15 pages, 7 figures
Abstract
3D conducting motion generation aims to synthesize fine-grained conductor motions from music, with broad potential in music education, virtual performance, digital human animation, and human-AI co-creation. However, this task remains underexplored due to two major challenges: (1) the lack of large-scale fine-grained 3D conducting datasets and (2) the absence of effective methods that can jointly support long-sequence generation with high quality and efficiency. To address the data limitation, we develop a quality-oriented 3D conducting motion collection pipeline and construct CM-Data, a fine-grained SMPL-X dataset with about 10 hours of conducting motion data. To the best of our knowledge, CM-Data is the first and largest public dataset for 3D conducting motion generation. To address the methodological limitation, we propose BiTDiff, a novel framework for 3D conducting motion generation, built upon a BiMamba-Transformer hybrid model architecture for efficient long-sequence modeling and a Diffusion-based generative strategy with human-kinematic decomposition for high-quality motion synthesis. Specifically, BiTDiff introduces auxiliary physical-consistency losses and a hand-/body-specific forward-kinematics design for better fine-grained motion modeling, while leveraging BiMamba for memory-efficient long-sequence temporal modeling and Transformer for cross-modal semantic alignment. In addition, BiTDiff supports training-free joint-level motion editing, enabling downstream human-AI interaction design. Extensive quantitative and qualitative experiments demonstrate that BiTDiff achieves state-of-the-art (SOTA) performance for 3D conducting motion generation on the CM-Data dataset. Code will be available upon acceptance.
Summary / 总结
The paper addresses the challenge of generating fine-grained 3D conducting motions from music, focusing on the lack of large datasets and effective generation methods. To tackle these issues, the authors developed a quality-oriented 3D conducting motion collection pipeline and built CM-Data, an SMPL-X dataset with about 10 hours of motion that they describe as the first and largest public dataset for this task. They also proposed BiTDiff, a framework that combines a BiMamba-Transformer hybrid model with a diffusion-based generative strategy and adds physical-consistency losses and a hand-/body-specific forward-kinematics design to improve motion quality and efficiency. Experiments show that BiTDiff outperforms existing methods on the CM-Data dataset.
该论文研究如何根据音乐生成精细的3D指挥动作,重点关注大规模数据集缺乏和有效生成方法不足的问题。为解决这些问题,作者开发了一种高质量的3D指挥动作采集流水线,并创建了CM-Data,据作者所知,这是首个也是最大的公开此类数据集。他们还提出了BiTDiff,这是一种使用BiMamba-Transformer混合模型和基于扩散的生成策略的新框架,引入了物理一致性损失和手部/身体特定的正向运动学设计,以提高动作质量和效率。实验表明,BiTDiff在CM-Data数据集上的性能优于现有方法。
PAT3D: Physics-Augmented Text-to-3D Scene Generation
Authors: Guying Lin, Kemeng Huang, Michael Liu, Ruihan Gao, Hanke Chen, Lyuhao Chen, Beijia Lu, Taku Komura, Yuan Liu, Jun-Yan Zhu, Minchen Li
First: 2025-11-26T23:23:58+00:00 · Latest: 2026-04-23T03:17:53+00:00
Comments: 19 pages, 12 figures
Abstract
We introduce PAT3D, the first physics-augmented text-to-3D scene generation framework that integrates vision-language models with physics-based simulation to produce physically plausible, simulation-ready, and intersection-free 3D scenes. Given a text prompt, PAT3D generates 3D objects, infers their spatial relations, and organizes them into a hierarchical scene tree, which is then converted into initial conditions for simulation. A differentiable rigid-body simulator ensures realistic object interactions under gravity, driving the scene toward static equilibrium without interpenetrations. To further enhance scene quality, we introduce a simulation-in-the-loop optimization procedure that guarantees physical stability and non-intersection, while improving semantic consistency with the input prompt. Experiments demonstrate that PAT3D substantially outperforms prior approaches in physical plausibility, semantic consistency, and visual quality. Beyond high-quality generation, PAT3D uniquely enables simulation-ready 3D scenes for downstream tasks such as scene editing and robotic manipulation. Code and data are available at: https://github.com/Simulation-Intelligence/PAT3D.
Summary / 总结
PAT3D is a physics-augmented text-to-3D scene generation framework that integrates vision-language models with physics-based simulation to create physically plausible and simulation-ready 3D scenes. Given a text prompt, PAT3D generates 3D objects, infers their spatial relations, and organizes them into a hierarchical scene tree. A differentiable rigid-body simulator ensures realistic interactions under gravity, and a simulation-in-the-loop optimization procedure further enhances scene quality. Experiments show that PAT3D outperforms previous methods in physical plausibility, semantic consistency, and visual quality, and enables simulation-ready 3D scenes for tasks like scene editing and robotic manipulation.
PAT3D 是一种将视觉语言模型与物理仿真结合的物理增强文本到3D场景生成框架,用于生成物理上合理且可用于仿真的3D场景。给定一个文本提示,PAT3D 生成3D对象,推断它们的空间关系,并组织成层次场景树。一个可微分的刚体仿真器确保在重力作用下的真实交互,并通过仿真循环优化程序进一步提高场景质量。实验表明,PAT3D 在物理合理性、语义一致性和视觉质量方面优于先前的方法,并且能够为场景编辑和机器人操作等任务生成可用于仿真的3D场景。
Reinforcing 3D Understanding in Point-VLMs via Geometric Reward Credit Assignment
Authors: Jingkun Chen, Ruoshi Xu, Mingqi Gao, Shengda Luo, Jungong Han
First: 2026-04-23T00:01:40+00:00 · Latest: 2026-04-23T00:01:40+00:00
Comments: 10 pages, 3 figures, 5 tables
Abstract
Point-Vision-Language Models promise to empower embodied agents with executable spatial reasoning, yet they frequently succumb to geometric hallucination where predicted 3D structures contradict the observed 2D reality. We identify a key cause of this failure not as a representation bottleneck but as a structural misalignment in reinforcement learning, where sparse geometric tokens are drowned out by noisy and broadcasted sequence-level rewards. To resolve this causal dilution, we propose Geometric Reward Credit Assignment, a framework that disentangles holistic supervision into field-specific signals and routes them exclusively to their responsible token spans. This mechanism transforms vague feedback into precise gradient updates and effectively turns generic policy optimization into targeted structural alignment. Furthermore, we internalize physical constraints via a Reprojection-Consistency term which serves as a cross-modal verifier to penalize physically impossible geometries. Validated on a calibrated benchmark derived from ShapeNetCore, our approach bridges the reliability gap by boosting 3D KPA from 0.64 to 0.93, increasing 3D bounding box intersection over union to 0.686, and raising reprojection consistency scores to 0.852. Crucially, these gains are achieved while maintaining robust 2D localization performance, marking a meaningful step from plausible textual outputs toward physically verifiable spatial predictions.
Summary / 总结
This paper addresses the issue of geometric hallucination in Point-Vision-Language Models by proposing Geometric Reward Credit Assignment, which disentangles holistic supervision into field-specific signals and routes them to their responsible token spans. This method improves 3D keypoint accuracy from 0.64 to 0.93, increases 3D bounding box IoU to 0.686, and raises reprojection consistency scores to 0.852, while maintaining 2D localization performance. Additionally, a Reprojection-Consistency term is used to penalize physically impossible geometries, enhancing the model's reliability in generating spatial predictions.
本文提出了一种几何奖励信用分配方法,将整体监督分解为特定领域的信号,并将其路由到负责的标记跨度上,以解决点-视觉-语言模型中的几何幻觉问题。该方法将3D关键点准确性从0.64提升到0.93,将3D边界框IoU提高到0.686,并将重投影一致性分数提高到0.852,同时保持2D定位性能。此外,还使用重投影一致性项来惩罚物理上不可能的几何形状,从而提高模型生成空间预测的可靠性。
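The contrast between a broadcast sequence-level reward and span-routed field rewards can be sketched as follows. This is a simplified illustration of the routing idea only, assuming the output has already been parsed into named fields with known token spans. The field names (`bbox_3d`, `keypoints`), the span format, and both function names are hypothetical and not taken from the paper.

```python
def broadcast_reward(num_tokens, sequence_reward):
    """Baseline: every token receives the same sequence-level reward."""
    return [sequence_reward] * num_tokens


def route_field_rewards(num_tokens, field_spans, field_rewards, default=0.0):
    """Assign each field's reward only to the tokens that produced that field.

    field_spans   : {field_name: (start, end)} with end exclusive (assumed format)
    field_rewards : {field_name: reward} computed per field (assumed given)
    """
    per_token = [default] * num_tokens
    for field, (start, end) in field_spans.items():
        for i in range(start, end):
            per_token[i] = field_rewards[field]
    return per_token


# Hypothetical 10-token output with two geometric fields.
spans = {"bbox_3d": (2, 5), "keypoints": (6, 8)}
rewards = {"bbox_3d": 0.9, "keypoints": -0.4}
per_token = route_field_rewards(10, spans, rewards)
```

In this sketch the tokens outside any geometric field receive the default reward, so a poor keypoint prediction penalizes only the keypoint tokens instead of being diluted across the whole sequence.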
Weighting What Matters: Boosting Sample Efficiency in Medical Report Generation via Token Reweighting
Authors: Alexander Weers, Daniel Rueckert, Martin J. Menten
First: 2026-04-22T20:51:17+00:00 · Latest: 2026-04-22T20:51:17+00:00
Abstract
Training vision-language models (VLMs) for medical report generation is often hindered by the scarcity of high-quality annotated data. This work evaluates the use of a weighted loss function to improve data efficiency. Compared to standard cross-entropy loss, which treats all token prediction errors equally, the reweighted loss shifts the focus to semantically salient tokens with outsized clinical importance. In experiments on ophthalmological report generation, we show that this simple method improves efficiency across multiple data scales, achieving similar report quality with up to ten times less training data.
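A token-reweighted cross-entropy loss of the kind the abstract describes can be sketched as below. The abstract does not say how the weights are chosen or how the loss is normalized, so the per-token `weights` input and the weight-normalized average are assumptions made for illustration.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one token's vocabulary logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]


def weighted_cross_entropy(logits_per_token, targets, weights):
    """Cross-entropy where each target token carries its own weight.

    With all weights equal, this reduces to the standard mean cross-entropy.
    """
    total, norm = 0.0, 0.0
    for logits, target, w in zip(logits_per_token, targets, weights):
        total += -w * log_softmax(logits)[target]
        norm += w
    return total / norm


# Two tokens: the first is predicted well, the second (assumed to be the
# clinically salient one) is predicted badly.
logits = [[2.0, 0.0], [0.0, 2.0]]
targets = [0, 0]
uniform_loss = weighted_cross_entropy(logits, targets, [1.0, 1.0])
salient_loss = weighted_cross_entropy(logits, targets, [1.0, 5.0])
```

Up-weighting the second token makes its error dominate the loss, which is how the reweighting shifts the training signal toward the tokens judged important.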
Foveated Reasoning: Stateful, Action-based Visual Focusing for Vision-Language Models
Authors: Juhong Min, Lazar Valkov, Vitali Petsiuk, Hossein Souri, Deen Dayal Mohan
First: 2026-04-22T20:44:24+00:00 · Latest: 2026-04-22T20:44:24+00:00
Abstract
Vision-language models benefit from high-resolution images, but the increase in visual-token count incurs high compute overhead. Humans resolve this tension via foveation: a coarse view guides "where to look", while selectively acquired high-acuity evidence refines "what to think". We introduce Foveated Reasoner, an autoregressive vision-language framework that unifies foveation and reasoning within a single decoding trajectory. Starting from a low-resolution view, the model triggers foveation only when needed, retrieves high-resolution evidence from selected regions, and injects it back into the same decoding trajectory. We train the method with a two-stage pipeline: coldstart supervision to bootstrap foveation behavior, followed by reinforcement learning to jointly improve evidence acquisition and task accuracy while discouraging trivial "see-everything" solutions. Experiments show that the method learns effective foveation policies and achieves stronger accuracy under tight visual-token budgets across multiple vision-language benchmarks.
InVitroVision: a Multi-Modal AI Model for Automated Description of Embryo Development using Natural Language
Authors: Nicklas Neu, Thomas Ebner, Jasmin Primus, Raphael Zefferer, Bernhard Schenkenfelder, Mathias Brunbauer, Florian Kromp
First: 2026-04-22T20:05:37+00:00 · Latest: 2026-04-22T20:05:37+00:00
Comments: 15 pages, 2 figures
Abstract
The application of artificial intelligence (AI) in IVF has shown promise in improving consistency and standardization of decisions, but often relies on annotated data and does not make use of the multimodal nature of IVF data. We investigated whether foundational vision-language models can be fine-tuned to predict natural language descriptions of embryo morphology and development. Using a publicly available embryo time-lapse dataset, we fine-tuned PaliGemma-2, a multi-modal vision-language model, with only 1,000 images and corresponding captions, describing embryo morphology, embryonic cell cycle and developmental stage. Our results show that the fine-tuned model, InVitroVision, outperformed a commercial model, ChatGPT 5.2, and base models in overall metrics, with performance improving with larger training datasets. This study demonstrates the potential of foundational vision-language models to generalize to IVF tasks with limited data, enabling the prediction of natural language descriptions of embryo morphology and development. This approach may facilitate the use of large language models to retrieve information and scientific evidence from relevant publications and guidelines, and has implications for few-shot adaptation to multiple downstream tasks in IVF.
Unlocking Multi-Spectral Data for Multi-Modal Models with Guided Inputs and Chain-of-Thought Reasoning
Authors: Dahun Kim, Ganesh Satish Mallya, Anelia Angelova
First: 2026-04-22T19:23:52+00:00 · Latest: 2026-04-22T19:23:52+00:00
Comments: Accepted to IGARSS 2026
Abstract
Multi-spectral imagery is a valuable input signal for Remote Sensing applications, such as land-use and land-cover classification and environmental monitoring. However, generalist Large Multi-modal Models (LMMs) are typically trained on RGB images, limiting their applicability to the RGB domain. At the same time, training multi-spectral multi-modal models is expensive and produces uniquely specialized models. To address this, we propose a novel training-free approach that introduces multi-spectral data within the inference pipeline of standard RGB-only LMMs, allowing large gains in performance. Our approach leverages the LMMs' understanding of the visual space by adapting non-RGB inputs to that space and injecting domain-specific information and Chain-of-Thought reasoning as instructions. We demonstrate this with the Gemini 2.5 model and observe strong Zero-Shot performance gains on popular Remote Sensing benchmarks. These results highlight the potential for geospatial professionals to leverage powerful generalist models for specialized sensor inputs, benefiting from rich reasoning capabilities grounded in specialized data.
中文标题/摘要
标题:利用引导输入和链式推理解锁多光谱数据的多模态模型应用
多光谱影像是遥感应用中的宝贵输入信号,例如土地利用和土地覆盖分类以及环境监测。然而,通用的大规模多模态模型(LMMs)通常仅训练于RGB图像,使其适用范围局限于RGB域。同时,训练多光谱多模态模型成本高昂且产生高度专门化的模型。为解决这一问题,我们提出了一种新的无需训练的方法,在标准仅RGB的LMMs推理管道中引入多光谱数据,从而实现性能的巨大提升。该方法通过将非RGB输入适配到视觉空间,并以指令形式注入领域特定信息和链式推理,利用LMMs对视觉空间的理解。我们使用Gemini 2.5模型进行了演示,并在流行的遥感基准测试中观察到了显著的零样本性能提升。这些结果突显了地理空间专业人士利用强大通用模型处理专门传感器输入的潜力,从而受益于丰富的基于专门数据的推理能力。
GeCo: Evaluating Geometric Consistency for Video Generation via Motion and Structure
Authors: Leslie Gu, Junhwa Hur, Charles Herrmann, Fangneng Zhan, Todd Zickler, Deqing Sun, Hanspeter Pfister
First: 2025-12-25T03:28:28+00:00 · Latest: 2026-04-22T18:38:16+00:00
Abstract
We introduce GeCo, a geometry-grounded metric for jointly detecting geometric deformation and occlusion-inconsistency artifacts in static scenes. By fusing residual motion and depth priors, GeCo produces interpretable, dense consistency maps that reveal these artifacts. We use GeCo to systematically benchmark recent video generation models, uncovering common failure modes, and further employ it as a training-free guidance loss to reduce deformation artifacts during video generation.
Summary / 总结
GeCo is a geometry-grounded metric for evaluating geometric consistency in video generation. It fuses residual motion and depth priors to produce interpretable, dense consistency maps that reveal geometric deformation and occlusion-inconsistency artifacts in static scenes. The metric is used to benchmark recent video generation models, uncovering common failure modes, and also serves as a training-free guidance loss that reduces deformation artifacts during generation.
GeCo 是一个基于几何的度量标准,通过结合残余运动和深度先验来检测静态场景中的几何变形和遮挡不一致性。它生成可解释的密集一致性图来揭示这些缺陷。GeCo 用于评估近期的视频生成模型,发现了常见的失败模式,并且还作为无训练指导损失来减少视频生成过程中的变形缺陷。
Thinking Like a Botanist: Challenging Multimodal Language Models with Intent-Driven Chain-of-Inquiry
Authors: Syed Nazmus Sakib, Nafiul Haque, Shahrear Bin Amin, Hasan Muhammad Abdullah, Md. Mehedi Hasan, Mohammad Zabed Hossain, Shifat E. Arman
Venue: ACL 2026
First: 2026-04-22T18:12:07+00:00 · Latest: 2026-04-22T18:12:07+00:00
Comments: Accepted at ACL 2026 Findings
Abstract
Vision evaluations are typically done through multi-step processes. In most contemporary fields, experts analyze images using structured, evidence-based adaptive questioning. In plant pathology, botanists inspect leaf images, identify visual cues, infer diagnostic intent, and probe further with targeted questions that adapt to species, symptoms, and severity. This structured probing is crucial for accurate disease diagnosis and treatment formulation. Yet current vision-language models are evaluated on single-turn question answering. To address this gap, we introduce PlantInquiryVQA, a benchmark for studying multi-step, intent-driven visual reasoning in botanical diagnosis. We formalize a Chain of Inquiry framework modeling diagnostic trajectories as ordered question-answer sequences conditioned on grounded visual cues and explicit epistemic intent. We release a dataset of 24,950 expert-curated plant images and 138,068 question-answer pairs annotated with visual grounding, severity labels, and domain-specific reasoning templates. Evaluations on top-tier Multimodal Large Language Models reveal that while they describe visual symptoms adequately, they struggle with safe clinical reasoning and accurate diagnosis. Importantly, structured question-guided inquiry significantly improves diagnostic correctness, reduces hallucination, and increases reasoning efficiency. We hope PlantInquiryVQA serves as a foundational benchmark in advancing research to train diagnostic agents to reason like expert botanists rather than static classifiers.
中文标题/摘要
标题:像植物学家一样思考:用意图驱动的链式询问挑战多模态语言模型
视觉评估通常通过多步过程进行。在大多数当代领域中,专家使用结构化、基于证据的适应性提问来分析图像。在植物病理学中,植物学家检查叶片图像,识别视觉线索,推断诊断意图,并进一步提出随物种、症状和严重程度调整的针对性问题。这种结构化的探询对于准确的疾病诊断和治疗方案的制定至关重要。然而,当前的视觉-语言模型仅在单轮问答上进行评估。为解决这一差距,我们引入了PlantInquiryVQA,这是一个用于研究植物诊断中多步、意图驱动的视觉推理的基准。我们形式化了一个链式询问框架,将诊断轨迹建模为以有依据的视觉线索和明确的认知意图为条件的有序问题-答案序列。我们发布了一个包含24,950张专家整理的植物图像和138,068个问题-答案对的数据集,这些问答对都标注了视觉定位、严重程度标签和领域特定的推理模板。对顶级多模态大型语言模型的评估显示,虽然它们能够充分描述视觉症状,但在安全临床推理和准确诊断方面存在困难。重要的是,结构化问题引导的询问显著提高了诊断的准确性,减少了幻觉,并提高了推理效率。我们希望PlantInquiryVQA能够作为基础基准,推动研究以训练诊断代理像专家植物学家一样推理,而不是像静态分类器。
Summary / 总结
The work evaluates vision-language models on botanical diagnosis using multi-step, intent-driven questioning. It introduces PlantInquiryVQA, a benchmark of 24,950 expert-curated plant images and 138,068 question-answer pairs. Current models describe visual symptoms adequately but struggle with safe clinical reasoning and accurate diagnosis; structured question-guided inquiry significantly improves diagnostic correctness, reduces hallucination, and increases reasoning efficiency.
该研究通过多步、意图驱动的提问来评估视觉语言模型在植物病害诊断中的表现。研究引入了PlantInquiryVQA基准,包含24,950张专家整理的植物图像和138,068个问答对。结果显示,当前模型能够较好地描述视觉症状,但在安全临床推理和准确诊断方面表现不佳;结构化问题引导的询问则显著提高了诊断正确性,减少了幻觉,并提升了推理效率。
OMIBench: Benchmarking Olympiad-Level Multi-Image Reasoning in Large Vision-Language Model
Authors: Qiguang Chen, Chengyu Luan, Jiajun Wu, Qiming Yu, Yi Yang, Yizhuo Li, Jingqi Tong, Xiachong Feng, Libo Qin, Wanxiang Che
Venue: ACL 2026
First: 2026-04-22T17:37:40+00:00 · Latest: 2026-04-22T17:37:40+00:00
Comments: ACL 2026 Camera Ready
Abstract
Large vision-language models (LVLMs) have made substantial advances in reasoning tasks at the Olympiad level. Nevertheless, current Olympiad-level multimodal reasoning benchmarks for these models often emphasize single-image analysis and fail to exploit contextual information across multiple images. We present OMIBench, a benchmark designed to evaluate Olympiad-level reasoning when the required evidence is distributed over multiple images. It contains problems from biology, chemistry, mathematics, and physics Olympiads, together with manually annotated rationales and evaluation protocols for both exact and semantic answer matching. Across extensive experiments on OMIBench, we observe meaningful performance gaps in existing models. Even the strongest LVLMs, such as Gemini-3-Pro, attain only about 50% on the benchmark. These results position OMIBench as a focused resource for studying and improving multi-image reasoning in LVLMs.
中文标题/摘要
标题:OMIBench:大型视觉语言模型在奥林匹克级多图像推理中的基准测试
大型视觉语言模型(LVLMs)在奥林匹克级别的推理任务上取得了显著进展。然而,当前用于这些模型的奥林匹克级别多模态推理基准往往侧重于单图像分析,未能充分利用多张图像之间的上下文信息。我们提出了OMIBench,一个旨在评估当所需证据分布在多张图像中时奥林匹克级别推理能力的基准。它包含来自生物学、化学、数学和物理奥林匹克竞赛的问题,以及人工标注的推理依据和针对精确匹配与语义匹配两种答案比对方式的评估协议。在OMIBench的广泛实验中,我们观察到现有模型存在明显的性能差距。即使是最强的LVLMs,如Gemini-3-Pro,也只能在基准测试中达到约50%的性能。这些结果将OMIBench定位为研究和改进LVLMs中多图像推理的专门资源。
Summary / 总结
OMIBench is designed to evaluate large vision-language models (LVLMs) in Olympiad-level reasoning tasks that require evidence from multiple images. The benchmark includes problems from various scientific Olympiads with annotated rationales and evaluation protocols. Extensive experiments show that even the strongest LVLMs, like Gemini-3-Pro, achieve only about 50% on OMIBench, highlighting the need for improved multi-image reasoning capabilities in LVLMs.
OMIBench 旨在评估大型视觉-语言模型(LVLM)在需要从多张图片中获取证据的奥林匹克级别推理任务中的表现。基准数据集包含来自不同科学奥林匹克竞赛的问题,并附有注释的推理和评估协议。广泛的实验表明,即使是最强的LVLM,如Gemini-3-Pro,在OMIBench上的得分也只有约50%,这表明需要改进LVLM的多图推理能力。