arXiv 论文速递

Snapshot: 20260401_0408

See it to Place it: Evolving Macro Placements with Vision-Language Models

Authors: Ikechukwu Uchendu, Swati Goel, Karly Hou, Ebrahim Songhori, Kuang-Huei Lee, Joe Wenjie Jiang, Vijay Janapa Reddi, Vincent Zhuang

First: 2026-03-30T17:47:34+00:00 · Latest: 2026-03-30T17:47:34+00:00

Comments: 31 pages, 11 figures, 14 tables

Abs · PDF · Code1 · Code2

Abstract

We propose using Vision-Language Models (VLMs) for macro placement in chip floorplanning, a complex optimization task that has recently shown promising advancements through machine learning methods. Because human designers rely heavily on spatial reasoning to arrange components on the chip canvas, we hypothesize that VLMs with strong visual reasoning abilities can effectively complement existing learning-based approaches. We introduce VeoPlace (Visual Evolutionary Optimization Placement), a novel framework that uses a VLM, without any fine-tuning, to guide the actions of a base placer by constraining them to subregions of the chip canvas. The VLM proposals are iteratively optimized through an evolutionary search strategy with respect to resulting placement quality. On open-source benchmarks, VeoPlace outperforms the best prior learning-based approach on 9 of 10 benchmarks with peak wirelength reductions exceeding 32%. We further demonstrate that VeoPlace generalizes to analytical placers, improving DREAMPlace performance on all 8 evaluated benchmarks with gains up to 4.3%. Our approach opens new possibilities for electronic design automation tools that leverage foundation models to solve complex physical design problems.

中文标题/摘要

标题：见之即置：利用视觉语言模型进行宏放置优化

我们提出使用视觉语言模型（VLMs）进行芯片布图规划中的宏放置，这是一个复杂的优化任务，最近通过机器学习方法显示出有希望的进步。由于人类设计师在安排芯片画布上的组件时高度依赖空间推理，我们假设具有强大视觉推理能力的VLMs可以有效补充现有的基于学习的方法。我们引入了VeoPlace（视觉进化优化放置）框架，该框架使用一个未经微调的VLM来指导基础放置器的动作，将其限制在芯片画布的子区域。VLM的建议通过进化搜索策略迭代优化，以提高最终的放置质量。在开源基准测试中，VeoPlace在10个基准中的9个上优于先前最佳的基于学习的方法，线长峰值降幅超过32%。我们进一步证明VeoPlace可以泛化到分析型放置器，提高DREAMPlace在所有8个评估基准上的性能，最高增幅为4.3%。我们的方法为利用基础模型解决复杂物理设计问题的电子设计自动化工具打开了新的可能性。

Summary / 总结

The research aims to improve macro placement in chip floorplanning, a critical task in electronic design automation, using Vision-Language Models (VLMs). VeoPlace uses a VLM, without fine-tuning, to guide a base placer by constraining its actions to subregions of the chip canvas, and refines the VLM's proposals through an evolutionary search over placement quality. It outperforms the best prior learning-based approach on 9 of 10 open-source benchmarks, with peak wirelength reductions exceeding 32%. VeoPlace also improves DREAMPlace on all 8 evaluated benchmarks, with gains of up to 4.3%, showing that it generalizes to analytical placers.

研究旨在利用视觉语言模型（VLM）改进芯片布图规划中的宏放置，这是电子设计自动化中的关键任务。VeoPlace是一种新型框架，使用未经微调的VLM，将基础放置器的动作约束在芯片画布的特定子区域内，从而引导其操作。VLM的建议通过进化搜索策略迭代优化，以提高放置质量。该方法在10个基准中的9个上优于先前最佳的基于学习的方法，线长峰值降幅超过32%。此外，VeoPlace还提高了DREAMPlace在全部8个评估基准上的性能，最高增幅为4.3%，表明其可以泛化到分析型放置器。
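The propose-place-score loop described in this entry can be illustrated with a minimal elitist evolutionary search. This is only a sketch under stated assumptions, not VeoPlace's implementation: `evolve`, `toy_propose`, and `toy_cost` are hypothetical names, the proposer is a random perturbation standing in for a VLM, and the cost is a toy distance standing in for the wirelength reported by a base placer.

```python
import random


def evolve(initial, propose, evaluate, generations=50, keep=4, seed=0):
    """Elitist evolutionary loop; lower scores are better."""
    rng = random.Random(seed)
    pool = [(evaluate(c), c) for c in initial]
    for _ in range(generations):
        # Keep only the best `keep` candidates as parents.
        pool.sort(key=lambda item: item[0])
        pool = pool[:keep]
        parents = [c for _, c in pool]
        # Hypothetical proposer: in the paper's setting this role is a VLM.
        child = propose(parents, rng)
        pool.append((evaluate(child), child))
    return min(pool, key=lambda item: item[0])


# Toy stand-ins (assumptions for illustration only).
TARGET = (3, 4)


def toy_cost(region):
    """Manhattan distance to a target cell, standing in for wirelength."""
    return abs(region[0] - TARGET[0]) + abs(region[1] - TARGET[1])


def toy_propose(parents, rng):
    """Perturb the best parent by at most one cell in each direction."""
    x, y = parents[0]
    return (x + rng.choice((-1, 0, 1)), y + rng.choice((-1, 0, 1)))


best_cost, best_region = evolve([(10, 10), (0, 9)], toy_propose, toy_cost)
```

Because the best candidate is never discarded, the returned score cannot be worse than the best initial candidate, which is the property that lets proposals improve with feedback over generations.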

SOLE-R1: Video-Language Reasoning as the Sole Reward for On-Robot Reinforcement Learning

Authors: Philip Schroeder, Thomas Weng, Karl Schmeckpeper, Eric Rosen, Stephen Hart, Ondrej Biza

First: 2026-03-30T17:46:31+00:00 · Latest: 2026-03-30T17:46:31+00:00

Abs · PDF · Code1 · Code2

Abstract

Vision-language models (VLMs) have shown impressive capabilities across diverse tasks, motivating efforts to leverage these models to supervise robot learning. However, when used as evaluators in reinforcement learning (RL), today's strongest models often fail under partial observability and distribution shift, enabling policies to exploit perceptual errors rather than solve the task. To address this limitation, we introduce SOLE-R1 (Self-Observing LEarner), a video-language reasoning model explicitly designed to serve as the sole reward signal for online RL. Given only raw video observations and a natural-language goal, SOLE-R1 performs per-timestep spatiotemporal chain-of-thought (CoT) reasoning and produces dense estimates of task progress that can be used directly as rewards. To train SOLE-R1, we develop a large-scale video trajectory and reasoning synthesis pipeline that generates temporally grounded CoT traces aligned with continuous progress supervision. This data is combined with foundational spatial and multi-frame temporal reasoning, and used to train the model with a hybrid framework that couples supervised fine-tuning with RL from verifiable rewards. Across four different simulation environments and a real-robot setting, SOLE-R1 enables zero-shot online RL from random initialization: robots learn previously unseen manipulation tasks without ground-truth rewards, success indicators, demonstrations, or task-specific tuning. SOLE-R1 succeeds on 24 unseen tasks and substantially outperforms strong vision-language rewarders, including GPT-5 and Gemini-3-Pro, while exhibiting markedly greater robustness to reward hacking.

中文标题/摘要

标题：SOLE-R1：仅基于视频-语言推理的机器人在线强化学习奖励

视觉-语言模型（VLMs）在多种任务中展现了令人印象深刻的性能，激发了利用这些模型监督机器人学习的努力。然而，当作为强化学习（RL）中的评估器使用时，当前最强的模型往往在部分可观测性和分布偏移下失效，使策略利用感知错误而非解决任务。为解决这一局限，我们提出了SOLE-R1（自我观察学习者），这是一种专门设计用于作为在线RL唯一奖励信号的视频-语言推理模型。仅给定原始视频观察和自然语言目标，SOLE-R1 每个时间步进行时空链式思考（CoT）推理，并生成可以直接作为奖励使用的任务进度密集估计。为了训练SOLE-R1，我们开发了一个大规模的视频轨迹和推理合成管道，生成与连续进度监督对齐的时间上接地CoT轨迹。该数据结合基础的空间和多帧时空推理，用于训练模型的混合框架，该框架将监督微调与可验证奖励的RL结合在一起。在四个不同的模拟环境中和一个真实机器人设置中，SOLE-R1 使从随机初始化开始的零样本在线RL成为可能：机器人在没有真实奖励、成功指标、演示或任务特定调整的情况下学习以前未见过的操控任务。SOLE-R1 在24个未见过的任务上取得成功，并显著优于包括GPT-5和Gemini-3-Pro在内的强大视觉-语言奖励器，同时表现出明显的抗奖励劫持的鲁棒性。

Vision-Language Agents for Interactive Forest Change Analysis

Authors: James Brock, Ce Zhang, Nantheera Anantrasirichai

First: 2026-01-08T02:02:36+00:00 · Latest: 2026-03-30T17:23:33+00:00

Comments: 5 pages, 4 figures, Accepted into IGARSS 2026

Abs · PDF · Code1 · Code2 · Code3

Abstract

Modern forest monitoring workflows increasingly benefit from the growing availability of high-resolution satellite imagery and advances in deep learning. Two persistent challenges in this context are accurate pixel-level change detection and meaningful semantic change captioning for complex forest dynamics. While large language models (LLMs) are being adapted for interactive data exploration, their integration with vision-language models (VLMs) for remote sensing image change interpretation (RSICI) remains underexplored. To address this gap, we introduce an LLM-driven agent for integrated forest change analysis that supports natural language querying across multiple RSICI tasks. The proposed system builds upon a multi-level change interpretation (MCI) vision-language backbone with LLM-based orchestration. To facilitate adaptation and evaluation in forest environments, we further introduce the Forest-Change dataset, which comprises bi-temporal satellite imagery, pixel-level change masks, and multi-granularity semantic change captions generated using a combination of human annotation and rule-based methods. Experimental results show that the proposed system achieves mIoU and BLEU-4 scores of 67.10% and 40.17% on the Forest-Change dataset, and 88.13% and 34.41% on LEVIR-MCI-Trees, a tree-focused subset of LEVIR-MCI benchmark for joint change detection and captioning. These results highlight the potential of interactive, LLM-driven RSICI systems to improve accessibility, interpretability, and efficiency of forest change analysis. All data and code are publicly available at https://github.com/JamesBrockUoB/ForestChat.

中文标题/摘要

标题：视觉-语言代理在交互式森林变化分析中的应用

现代森林监测工作流程越来越多地得益于高分辨率卫星图像的日益可用和深度学习的进步。在此背景下，准确的像素级变化检测和复杂森林动态的有意义语义变化描述是两个持续存在的挑战。虽然大型语言模型（LLMs）正在被用于交互式数据探索，但它们与视觉-语言模型（VLMs）结合进行遥感图像变化解释（RSICI）的研究仍然不足。为了填补这一空白，我们引入了一个由LLM驱动的集成森林变化分析代理，支持跨多个RSICI任务的自然语言查询。所提出系统基于多级变化解释（MCI）视觉-语言骨干，并通过LLM进行编排。为了在森林环境中促进适应和评估，我们进一步引入了Forest-Change数据集，该数据集包含双时相卫星图像、像素级变化掩码以及使用人类注释和基于规则的方法生成的多粒度语义变化描述。实验结果表明，所提出系统在Forest-Change数据集上的mIoU和BLEU-4得分分别为67.10%和40.17%，在LEVIR-MCI-Trees（LEVIR-MCI基准中专注于树木的子集，用于联合变化检测与描述）上分别为88.13%和34.41%。这些结果突显了交互式、LLM驱动的RSICI系统在提高森林变化分析的可访问性、可解释性和效率方面的潜力。所有数据和代码均可在 https://github.com/JamesBrockUoB/ForestChat 公开获取。

Summary / 总结

This paper addresses the challenges of accurate pixel-level change detection and semantic change captioning in forest monitoring using deep learning and large language models. The authors introduce a vision-language agent driven by an LLM for integrated forest change analysis, which supports natural language querying across multiple tasks. The system is evaluated on the Forest-Change dataset and achieves mIoU and BLEU-4 scores of 67.10% and 40.17%, respectively, demonstrating the potential of interactive LLM-driven systems for improving forest change analysis.

本文旨在利用深度学习和大型语言模型（LLMs）解决森林监测中的像素级变化检测和语义变化描述的挑战。该研究引入了一个结合了LLM和多级变化解释（MCI）骨干的视觉-语言代理，用于遥感图像变化解释（RSICI）。该系统支持自然语言查询，并在Forest-Change数据集上进行评估，分别实现了67.10%的mIoU和40.17%的BLEU-4分数。结果表明，交互式、LLM驱动的RSICI系统能够提升森林变化分析的可访问性、可解释性和效率。所有数据和代码均已公开。

AdaptToken: Entropy-based Adaptive Token Selection for MLLM Long Video Understanding

Authors: Haozhe Qi, Kevin Qu, Mahdi Rad, Rui Wang, Alexander Mathis, Marc Pollefeys

First: 2026-03-30T17:14:15+00:00 · Latest: 2026-03-30T17:14:15+00:00

Comments: Project page: https://haozheqi.github.io/adapt-token

Abs · PDF · Code1 · Code2 · Project1

Abstract

Long video understanding remains challenging for Multi-modal Large Language Models (MLLMs) due to high memory costs and context-length limits. Prior approaches mitigate this by scoring and selecting frames/tokens within short clips, but they lack a principled mechanism to (i) compare relevance across distant video clips and (ii) stop processing once sufficient evidence has been gathered. We propose AdaptToken, a training-free framework that turns an MLLM's self-uncertainty into a global control signal for long-video token selection. AdaptToken splits a video into groups, extracts cross-modal attention to rank tokens within each group, and uses the model's response entropy to estimate each group's prompt relevance. This entropy signal enables a global token budget allocation across groups and further supports early stopping (AdaptToken-Lite), skipping the remaining groups when the model becomes sufficiently certain. Across four long-video benchmarks (VideoMME, LongVideoBench, LVBench, and MLVU) and multiple base MLLMs (7B-72B), AdaptToken consistently improves accuracy (e.g., +6.7 on average over Qwen2.5-VL 7B) and continues to benefit from extremely long inputs (up to 10K frames), while AdaptToken-Lite reduces inference time by about half with comparable performance. Project page: https://haozheqi.github.io/adapt-token

中文标题/摘要

标题：AdaptToken：基于熵的自适应令牌选择方法在MLLM长视频理解中的应用

由于高内存成本和上下文长度限制，长视频理解对多模态大型语言模型（MLLMs）而言仍然具有挑战性。先前的方法通过为短片段内的帧/令牌评分和选择来缓解这一问题，但它们缺乏一种原理性的机制来（i）比较相距较远的视频片段之间的相关性，以及（ii）在收集到足够证据后停止处理。我们提出了一种无需训练的框架AdaptToken，将MLLM的自我不确定性转化为长视频令牌选择的全局控制信号。AdaptToken将视频划分为组，提取跨模态注意力以在每组内对令牌进行排名，并使用模型的响应熵来估计每组与提示的相关性。该熵信号使模型能够在组间分配全局令牌预算，并进一步支持早期停止（AdaptToken-Lite），当模型变得足够确定时，跳过剩余的组。在四个长视频基准（VideoMME、LongVideoBench、LVBench和MLVU）和多个基础MLLM（7B-72B）上，AdaptToken始终能提高准确率（例如，相对于Qwen2.5-VL 7B平均提高6.7个点），并且能够持续从极长的输入中受益（多达10K帧），而AdaptToken-Lite则将推理时间减少了约一半，同时保持了相当的性能。项目页面：https://haozheqi.github.io/adapt-token

Summary / 总结

AdaptToken is a training-free framework that uses an MLLM's self-uncertainty to select relevant tokens in long videos, addressing high memory costs and context-length limits. It splits a video into groups, ranks tokens within each group using cross-modal attention, and allocates a global token budget across groups based on response entropy. The method improves accuracy on four long-video benchmarks, and its early-stopping variant, AdaptToken-Lite, cuts inference time by about half with comparable performance.

AdaptToken 是一个无需训练的框架，通过利用模型的自我不确定性来适应性地选择视频中的令牌，从而增强 MLLMs 对长视频的理解。该方法将视频划分为组，使用跨模态注意力对每组内的令牌进行排序，并使用熵来估计相关性并分配全局令牌预算。这种方法支持早期停止，减少推理时间。在各种基准测试中，AdaptToken 提高了准确性，并且受益于极长的输入，而 AdaptToken-Lite 将推理时间减少了一半，同时保持了相当的性能。
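To make the entropy-to-budget idea concrete, here is a minimal sketch. It assumes that a group's relevance weight decays exponentially with the model's response entropy and that early stopping is a simple threshold test. The function names, the softmax-style weighting, and the threshold are illustrative assumptions, not the rule used in the paper.

```python
import math


def shannon_entropy(probs):
    """Shannon entropy (in nats) of a discrete answer distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def allocate_budget(entropies, total_budget, temperature=1.0):
    """Split a token budget across groups; lower entropy gets more tokens."""
    weights = [math.exp(-h / temperature) for h in entropies]
    norm = sum(weights)
    raw = [total_budget * w / norm for w in weights]
    budget = [int(r) for r in raw]  # floor each share
    # Hand out the leftover tokens by largest fractional part.
    leftover = total_budget - sum(budget)
    order = sorted(range(len(raw)), key=lambda i: raw[i] - budget[i], reverse=True)
    for i in order[:leftover]:
        budget[i] += 1
    return budget


def should_stop(entropy, threshold=0.2):
    """Hypothetical early-stop rule: stop once the model is certain enough."""
    return entropy < threshold


confident = shannon_entropy([0.97, 0.01, 0.01, 0.01])
unsure = shannon_entropy([0.25, 0.25, 0.25, 0.25])
budget = allocate_budget([confident, unsure], total_budget=100)
```

In this toy run, the group that produced the confident answer receives the larger share of the 100-token budget, and only that group would trigger the early-stop rule.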

Hierarchical Concept Embedding & Pursuit for Interpretable Image Classification

Authors: Nghia Nguyen, Tianjiao Ding, René Vidal

Venue: CVPR

First: 2026-02-11T23:53:15+00:00 · Latest: 2026-03-30T16:55:56+00:00

Comments: To be published in Conference on Computer Vision and Pattern Recognition (CVPR) 2026

Abs · PDF · Code1 · Code2

Abstract

Interpretable-by-design models are gaining traction in computer vision because they provide faithful explanations for their predictions. In image classification, these models typically recover human-interpretable concepts from an image and use them for classification. Sparse concept recovery methods leverage the latent space of vision-language models to represent image embeddings as sparse combinations of concept embeddings. However, by ignoring the hierarchical structure of semantic concepts, these methods may produce correct predictions with explanations that are inconsistent with the hierarchy. In this work, we propose Hierarchical Concept Embedding & Pursuit (HCEP), a framework that induces a hierarchy of concept embeddings in the latent space and performs hierarchical sparse coding to recover the concepts present in an image. Given a hierarchy of semantic concepts, we introduce a geometric construction for the corresponding hierarchy of embeddings. Under the assumption that the true concepts form a rooted path in the hierarchy, we derive sufficient conditions for their recovery in the embedding space. We further show that hierarchical sparse coding reliably recovers hierarchical concept embeddings, whereas standard sparse coding fails. Experiments on real-world datasets show that HCEP improves concept precision and recall compared to existing methods while maintaining competitive classification accuracy. Moreover, when the number of samples available for concept estimation and classifier training is limited, HCEP achieves superior classification accuracy and concept recovery. Our results demonstrate that incorporating hierarchical structure into sparse concept recovery leads to more faithful and interpretable image classification models.

中文标题/摘要

标题：层次概念嵌入与追求以实现可解释的图像分类

设计可解释的模型在计算机视觉中正逐渐受到关注，因为它们能够为预测提供忠实的解释。在图像分类中，这些模型通常从图像中恢复出可由人类理解的概念，并使用这些概念进行分类。稀疏概念恢复方法利用视觉-语言模型的潜在空间，将图像嵌入表示为概念嵌入的稀疏组合。然而，这些方法通过忽略概念的层次结构，可能会产生与层次结构不一致的解释。在本文中，我们提出了层次概念嵌入与追求（HCEP）框架，该框架在潜在空间中诱导概念嵌入的层次结构，并执行层次稀疏编码以恢复图像中存在的概念。给定概念的层次结构，我们引入了相应的嵌入层次结构的几何构造。在真实概念形成层次结构中根路径的假设下，我们推导了其在嵌入空间中恢复的充分条件。我们进一步表明，层次稀疏编码可靠地恢复了层次概念嵌入，而标准稀疏编码则失败。在真实世界数据集上的实验表明，与现有方法相比，HCEP 在保持竞争力的分类准确性的同时，提高了概念的精确度和召回率。此外，当用于概念估计和分类器训练的样本数量有限时，HCEP 达到了更高的分类准确性和概念恢复效果。我们的结果表明，将层次结构纳入稀疏概念恢复中可以导致更忠实和可解释的图像分类模型。

Summary / 总结

This work addresses the need for interpretable image classification models by proposing HCEP, a framework that incorporates the hierarchical structure of semantic concepts into sparse concept recovery. HCEP constructs a hierarchy of concept embeddings and uses hierarchical sparse coding to recover the concepts present in an image, improving concept precision and recall while maintaining competitive classification accuracy. Hierarchical sparse coding reliably recovers hierarchical concept embeddings where standard sparse coding fails, and HCEP achieves superior classification accuracy and concept recovery when samples are limited.

本文通过提出层次概念嵌入与追求（HCEP）框架，解决了需要具有解释性的图像分类模型的问题，该框架将语义概念的层次结构纳入考虑。HCEP 构建概念嵌入的层次结构，并使用层次稀疏编码来恢复图像中的概念。实验表明，HCEP 在概念精确度和召回率方面优于现有方法，同时保持了竞争力的分类准确性，尤其是在样本数量有限的情况下。
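The rooted-path assumption in this entry suggests a greedy, top-down pursuit. The sketch below illustrates that idea in the style of matching pursuit, not the HCEP algorithm itself: at each level it picks the child concept most correlated with the residual, then removes that concept's contribution. The tree, the embeddings, and the function names are all hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def hierarchical_pursuit(x, children, embeddings, root):
    """Greedily follow one rooted path through the concept hierarchy."""
    residual = list(x)
    path = []
    node = root
    while children.get(node):
        # Pick the child whose embedding best matches the current residual.
        best = max(children[node], key=lambda c: dot(residual, embeddings[c]))
        e = embeddings[best]
        coef = dot(residual, e) / dot(e, e)
        # Subtract the projection onto the chosen concept.
        residual = [r - coef * ei for r, ei in zip(residual, e)]
        path.append(best)
        node = best
    return path


# Toy hierarchy with orthogonal concept embeddings (illustrative only).
children = {"root": ["animal", "vehicle"], "animal": ["dog", "cat"], "vehicle": []}
embeddings = {
    "animal": [1.0, 0.0, 0.0, 0.0],
    "vehicle": [0.0, 1.0, 0.0, 0.0],
    "dog": [0.0, 0.0, 1.0, 0.0],
    "cat": [0.0, 0.0, 0.0, 1.0],
}
image = [1.0, 0.0, 0.8, 0.1]
path = hierarchical_pursuit(image, children, embeddings, "root")
```

Restricting each step to the children of the previously selected concept is what keeps the recovered explanation consistent with the hierarchy. An unconstrained sparse coder could select "dog" without "animal".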

AMIGO: Agentic Multi-Image Grounding Oracle Benchmark

Authors: Min Wang, Ata Mahjoubfar

First: 2026-03-30T16:48:51+00:00 · Latest: 2026-03-30T16:48:51+00:00

Abs · PDF · Code1 · Code2

Abstract

Agentic vision-language models increasingly act through extended interactions, but most evaluations still focus on single-image, single-turn correctness. We introduce AMIGO (Agentic Multi-Image Grounding Oracle Benchmark), a long-horizon benchmark for hidden-target identification over galleries of visually similar images. In AMIGO, the oracle privately selects a target image, and the model must recover it by asking a sequence of attribute-focused Yes/No/Unsure questions under a strict protocol that penalizes invalid actions with Skip. This setting stresses (i) question selection under uncertainty, (ii) consistent constraint tracking across turns, and (iii) fine-grained discrimination as evidence accumulates. AMIGO also supports controlled oracle imperfections to probe robustness and verification behavior under inconsistent feedback. We instantiate AMIGO with Guess My Preferred Dress task and report metrics covering both outcomes and interaction quality, including identification success, evidence verification, efficiency, protocol compliance, noise tolerance, and trajectory-level diagnostics.

中文标题/摘要

标题：AMIGO：代理多图像定位Oracle基准

代理型视觉-语言模型越来越多地通过扩展交互来行动，但大多数评估仍然集中在单张图像、单轮次的正确性上。我们引入了AMIGO（代理型多图像定位Oracle基准），这是一个针对画廊中视觉相似图像的隐藏目标识别的长期基准。在AMIGO中，Oracle私下选择一个目标图像，模型必须通过一系列属性导向的Yes/No/Unsure问题来逐步恢复它，且必须遵循严格的协议，对无效操作进行惩罚。这一设置强调了(i) 不确定性下的问题选择，(ii) 轮次间一致的约束跟踪，以及(iii) 随着证据积累的精细区分。AMIGO还支持控制Oracle的不完美性，以探究在不一致反馈下的鲁棒性和验证行为。我们以Guess My Preferred Dress任务实例化AMIGO，并报告了涵盖结果和交互质量的指标，包括识别成功率、证据验证、效率、协议合规性、噪声容忍度和轨迹级诊断。

Unsafe2Safe: Controllable Image Anonymization for Downstream Utility

Authors: Mih Dinh, SouYoung Jin

Venue: CVPR 2026

First: 2026-03-30T15:54:47+00:00 · Latest: 2026-03-30T15:54:47+00:00

Comments: Accepted at CVPR 2026 and CVPR 2026 Workshop on Machine Unlearning for Computer Vision

Abs · PDF · Code1 · Code2

Abstract

Large-scale image datasets frequently contain identifiable or sensitive content, raising privacy risks when training models that may memorize and leak such information. We present Unsafe2Safe, a fully automated pipeline that detects privacy-prone images and rewrites only their sensitive regions using multimodally guided diffusion editing. Unsafe2Safe operates in two stages. Stage 1 uses a vision-language model to (i) inspect images for privacy risks, (ii) generate paired private and public captions that respectively include and omit sensitive attributes, and (iii) prompt a large language model to produce structured, identity-neutral edit instructions conditioned on the public caption. Stage 2 employs instruction-driven diffusion editors to apply these dual textual prompts, producing privacy-safe images that preserve global structure and task-relevant semantics while neutralizing private content. To measure anonymization quality, we introduce a unified evaluation suite covering Quality, Cheating, Privacy, and Utility dimensions. Across MS-COCO, Caltech101, and MIT Indoor67, Unsafe2Safe reduces face similarity, text similarity, and demographic predictability by large margins, while maintaining downstream model accuracy comparable to training on raw data. Fine-tuning diffusion editors on our automatically generated triplets (private caption, public caption, edit instruction) further improves both privacy protection and semantic fidelity. Unsafe2Safe provides a scalable, principled solution for constructing large, privacy-safe datasets without sacrificing visual consistency or downstream utility.

中文标题/摘要

标题：Unsafe2Safe：可控图像匿名化以提高下游应用实用性

大规模图像数据集经常包含可识别或敏感内容，这在训练模型时可能引发隐私风险，因为模型可能会记住并泄露这些信息。我们提出了Unsafe2Safe，这是一种全自动流水线，用于检测隐私风险高的图像，并仅通过多模态引导扩散编辑重新写其敏感区域。Unsafe2Safe分为两个阶段。第一阶段使用视觉-语言模型来（i）检查图像中的隐私风险，（ii）生成包含和不包含敏感属性的配对私人和公共描述，（iii）根据公共描述提示一个大型语言模型生成结构化且身份中立的编辑指令。第二阶段采用指令驱动的扩散编辑器应用这些双文本提示，生成隐私安全的图像，这些图像保留了全局结构和任务相关语义，同时消除了私人内容。为了衡量匿名化质量，我们引入了一个统一的评估套件，涵盖质量、作弊、隐私和实用性四个维度。在MS-COCO、Caltech101和MIT Indoor67上，Unsafe2Safe大幅降低了面部相似度、文本相似度和人口统计学可预测性，同时保持下游模型准确性与使用原始数据训练相当。在我们自动生成的三元组（私人描述、公共描述、编辑指令）上微调扩散编辑器进一步提高了隐私保护和语义保真度。Unsafe2Safe提供了一种可扩展且原则性的解决方案，用于构建大型、隐私安全的数据集，同时不牺牲视觉一致性或下游应用实用性。

Summary / 总结

Unsafe2Safe is a pipeline that detects privacy risks in images and anonymizes only sensitive regions using multimodal guidance. It operates in two stages: the first uses a vision-language model to inspect images and generate edit instructions, while the second applies these instructions with diffusion editors. The system significantly reduces face and text similarity and demographic predictability while preserving model accuracy and visual consistency.

Unsafe2Safe 是一个全自动流水线，通过检测隐私风险并仅重写敏感区域来匿名化包含敏感内容的图像，使用多模态指导。该流水线分为两个阶段：第一阶段使用视觉-语言模型检查图像、生成私人和公共描述以及生成编辑指令，第二阶段使用这些指令通过扩散编辑器生成隐私安全的图像。实验结果显示，Unsafe2Safe 显著减少了面部和文本相似度以及人口统计学可预测性，同时保持与使用原始数据训练的下游模型准确性相当，进一步对扩散编辑器进行自动生成的三元组（私人描述、公共描述、编辑指令）微调可以提高隐私保护和语义保真度。

Navigating the Mirage: A Dual-Path Agentic Framework for Robust Misleading Chart Question Answering

Authors: Yanjie Zhang, Yafei Li, Rui Sheng, Zixin Chen, Yanna Lin, Huamin Qu, Lei Chen, Yushi Sun

First: 2026-03-30T15:32:24+00:00 · Latest: 2026-03-30T15:32:24+00:00

Comments: 10 pages, 4 figures

Abs · PDF · Code1 · Code2

Abstract

Despite the success of Vision-Language Models (VLMs), misleading charts remain a significant challenge due to their deceptive visual structures and distorted data representations. We present ChartCynics, an agentic dual-path framework designed to unmask visual deception via a "skeptical" reasoning paradigm. Unlike holistic models, ChartCynics decouples perception from verification: a Diagnostic Vision Path captures structural anomalies (e.g., inverted axes) through strategic ROI cropping, while an OCR-Driven Data Path ensures numerical grounding. To resolve cross-modal conflicts, we introduce an Agentic Summarizer optimized via a two-stage protocol: Oracle-Informed SFT for reasoning distillation and Deception-Aware GRPO for adversarial alignment. This pipeline effectively penalizes visual traps and enforces logical consistency. Evaluations on two benchmarks show that ChartCynics achieves 74.43% and 64.55% accuracy, providing an absolute performance boost of ~29% over the Qwen3-VL-8B backbone, outperforming state-of-the-art proprietary models. Our results demonstrate that specialized agentic workflows can grant smaller open-source models superior robustness, establishing a new foundation for trustworthy chart interpretation.

中文标题/摘要

标题：穿越幻象：稳健误导图表问答的双重路径代理框架

尽管视觉语言模型（VLMs）取得了成功，但由于其欺骗性的视觉结构和失真的数据表示，误导图表仍然是一个重大挑战。我们提出了ChartCynics，这是一种代理双重路径框架，旨在通过“怀疑”推理范式揭露视觉欺骗。与整体模型不同，ChartCynics将感知与验证分离：诊断视觉路径通过策略性ROI裁剪捕获结构异常（例如，倒置的轴），而OCR驱动的数据路径确保数值基础。为了解决跨模态冲突，我们引入了一种通过两阶段协议优化的代理总结器：Oracle指导的SFT进行推理提炼和欺骗感知的GRPO进行对抗对齐。该流水线有效地惩罚视觉陷阱并强制逻辑一致性。在两个基准上的评估显示，ChartCynics的准确率为74.43%和64.55%，相对于Qwen3-VL-8B主干模型绝对性能提升约29%，并优于最先进的专有模型。我们的结果表明，专门的代理工作流程可以赋予较小的开源模型更强的稳健性，为可信图表解释奠定了新的基础。

Summary / 总结

The research addresses the challenge that misleading charts pose for Vision-Language Models by proposing ChartCynics, an agentic dual-path framework. It decouples perception from verification, using a Diagnostic Vision Path to identify structural anomalies and an OCR-Driven Data Path to ensure numerical grounding. An Agentic Summarizer, optimized through a two-stage protocol (Oracle-Informed SFT followed by Deception-Aware GRPO), resolves cross-modal conflicts and enforces logical consistency. On two benchmarks, ChartCynics reaches 74.43% and 64.55% accuracy, an absolute gain of about 29% over its Qwen3-VL-8B backbone, and outperforms state-of-the-art proprietary models.

研究提出了ChartCynics，一种代理式双路径框架，以应对误导性图表的挑战。该框架将感知和验证分离，使用诊断视觉路径检测结构异常，并使用OCR驱动的数据路径确保数值依据。通过两阶段协议优化的代理总结器解决跨模态冲突。实验表明，ChartCynics在两个基准上的准确率分别为74.43%和64.55%，相对于Qwen3-VL-8B主干模型绝对提升约29%，并优于最先进的专有模型，证明了专门的代理工作流程在稳健图表解释中的有效性。

XSPA: Crafting Imperceptible X-Shaped Sparse Adversarial Perturbations for Transferable Attacks on VLMs

Authors: Chengyin Hu, Jiaju Han, Xuemeng Sun, Qike Zhang, Yiwei Wei, Ang Li, Chunlei Meng, Xiang Chen, Jiahuan Long

First: 2026-03-30T15:24:34+00:00 · Latest: 2026-03-30T15:24:34+00:00

Abs · PDF · Code1 · Code2

Abstract

Vision-language models (VLMs) rely on a shared visual-textual representation space to perform tasks such as zero-shot classification, image captioning, and visual question answering (VQA). While this shared space enables strong cross-task generalization, it may also introduce a common vulnerability: small visual perturbations can propagate through the shared embedding space and cause correlated semantic failures across tasks. This risk is particularly important in interactive and decision-support settings, yet it remains unclear whether VLMs are robust to highly constrained, sparse, and geometrically fixed perturbations. To address this question, we propose X-shaped Sparse Pixel Attack (XSPA), an imperceptible structured attack that restricts perturbations to two intersecting diagonal lines. Compared with dense perturbations or flexible localized patches, XSPA operates under a much stricter attack budget and thus provides a more stringent test of VLM robustness. Within this sparse support, XSPA jointly optimizes a classification objective, cross-task semantic guidance, and regularization on perturbation magnitude and along-line smoothness, inducing transferable misclassification as well as semantic drift in captioning and VQA while preserving visual subtlety. Under the default setting, XSPA modifies only about 1.76% of image pixels. Experiments on the COCO dataset show that XSPA consistently degrades performance across all three tasks. Zero-shot accuracy drops by 52.33 points on OpenAI CLIP ViT-L/14 and 67.00 points on OpenCLIP ViT-B/16, while GPT-4-evaluated caption consistency decreases by up to 58.60 points and VQA correctness by up to 44.38 points. These results suggest that even highly sparse and visually subtle perturbations with fixed geometric priors can substantially disrupt cross-task semantics in VLMs, revealing a notable robustness gap in current multimodal systems.

中文标题/摘要

标题：XSPA：为VLMs的转移攻击构建不可感知的X形稀疏对抗扰动

视觉语言模型（VLMs）依赖于共享的视觉-文本表示空间来执行零样本分类、图像描述和视觉问答（VQA）等任务。虽然共享空间能够实现强大的跨任务泛化，但也可能引入一个共同的脆弱性：小的视觉扰动可以通过共享嵌入空间传播，并在任务之间引起相关的语义失败。这种风险在交互式和决策支持环境中尤为重要，但尚不清楚VLMs是否对高度约束、稀疏且几何固定的扰动具有鲁棒性。为了解决这一问题，我们提出了X形稀疏像素攻击（XSPA），这是一种不可感知的结构化攻击，限制扰动仅在两条相交的对角线上。与密集扰动或灵活的局部补丁相比，XSPA在攻击预算上更为严格，因此为VLM的鲁棒性提供了更严格的测试。在稀疏支持下，XSPA同时优化分类目标、跨任务语义指导以及扰动幅度和沿线平滑性的正则化，导致在描述和VQA中出现可转移的错误分类和语义漂移，同时保持视觉上的微妙性。在默认设置下，XSPA仅修改约1.76%的图像像素。在COCO数据集上的实验表明，XSPA在所有三个任务中都一致地降低了性能。零样本准确性分别在OpenAI CLIP ViT-L/14和OpenCLIP ViT-B/16上下降了52.33分和67.00分，而GPT-4评估的描述一致性下降了最多58.60分，VQA的正确性下降了最多44.38分。这些结果表明，即使具有固定几何先验的稀疏且视觉上微妙的扰动也能够显著破坏VLM中的跨任务语义，揭示了当前多模态系统中的显著鲁棒性差距。

Summary / 总结

The research aims to investigate the robustness of vision-language models (VLMs) against sparse and geometrically fixed perturbations. The study introduces X-shaped Sparse Pixel Attack (XSPA), which imposes perturbations on two intersecting diagonal lines, providing a stringent test of VLM robustness. Experiments on the COCO dataset demonstrate that XSPA significantly degrades performance across zero-shot classification, image captioning, and VQA tasks, with up to 67.00 point drops in zero-shot accuracy and substantial decreases in caption and VQA performance. This indicates that even subtle, sparse, and structured perturbations can disrupt cross-task semantics in VLMs, highlighting a robustness gap in current multimodal systems.

论文提出了X形稀疏像素攻击(XSPA)，该方法用于测试视觉语言模型(VLMs)在面对稀疏且几何固定的扰动时的鲁棒性。XSPA将扰动限制在两条相交的对角线上，提供了一个对VLMs鲁棒性的严格测试。在COCO数据集上的实验显示，XSPA显著降低了零样本分类、图像描述和视觉问答任务的性能，最高降幅达67.00分。这表明即使是细微且固定几何结构的扰动也能破坏VLMs的跨任务语义，揭示了当前多模态系统中的鲁棒性差距。
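The geometric constraint in this entry, perturbations confined to two intersecting diagonals, can be made concrete with a small mask builder. This is a sketch under assumptions: the `half_width` parameter and the function names are hypothetical, and the line thickness and image size behind the reported 1.76% pixel coverage are not given in the abstract, so the code does not try to reproduce that figure.

```python
def x_mask(size, half_width=0):
    """Binary size-by-size mask that is 1 on the two diagonals of a square image."""
    last = size - 1
    return [
        [
            1 if abs(r - c) <= half_width or abs(r + c - last) <= half_width else 0
            for c in range(size)
        ]
        for r in range(size)
    ]


def coverage(mask):
    """Fraction of pixels the mask allows an attack to modify."""
    total = sum(len(row) for row in mask)
    return sum(map(sum, mask)) / total


small = x_mask(3)
frac_224 = coverage(x_mask(224))  # one-pixel-wide lines on a 224x224 image
```

A perturbation would be multiplied element-wise by such a mask before being added to the image, so pixels off the X are left untouched. For one-pixel-wide lines the support grows linearly with image side length while the pixel count grows quadratically, which is why the attack budget is so small.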

Domain-Invariant Prompt Learning for Vision-Language Models

Authors: Arsham Gholamzadeh Khoee, Yinan Yu, Robert Feldt

First: 2026-03-30T15:18:31+00:00 · Latest: 2026-03-30T15:18:31+00:00

Abs · PDF · Code1 · Code2

Abstract

Large pre-trained vision-language models like CLIP have transformed computer vision by aligning images and text in a shared feature space, enabling robust zero-shot transfer via prompting. Soft-prompting, such as Context Optimization (CoOp), effectively adapts these models for downstream recognition tasks by learning a set of context vectors. However, CoOp lacks explicit mechanisms for handling domain shifts across unseen distributions. To address this, we propose Domain-invariant Context Optimization (DiCoOp), an extension of CoOp optimized for domain generalization. By employing an adversarial training approach, DiCoOp forces the model to learn domain-invariant prompts while preserving discriminative power for classification. Experimental results show that DiCoOp consistently surpasses CoOp in domain generalization tasks across diverse visual domains.

中文标题/摘要

标题：视觉语言模型的领域不变提示学习

像CLIP这样的大型预训练视觉语言模型通过在共享特征空间中对齐图像和文本，已彻底改变了计算机视觉领域，使这些模型能够通过提示实现稳健的零样本迁移。软提示，如Context Optimization (CoOp)，通过学习一组上下文向量有效地适应这些模型以执行下游识别任务。然而，CoOp缺乏处理未见分布领域转移的显式机制。为解决这一问题，我们提出了一种领域不变的Context Optimization (DiCoOp)，这是一种针对领域泛化的CoOp的扩展。通过采用对抗训练方法，DiCoOp迫使模型学习领域不变的提示，同时保持分类的区分能力。实验结果表明，DiCoOp在多种视觉领域中的领域泛化任务中始终优于CoOp。

Summary / 总结

The research aims to improve the robustness of vision-language models in handling unseen domains by addressing the limitations of existing soft-prompting methods like CoOp. The proposed Domain-invariant Context Optimization (DiCoOp) uses adversarial training to learn domain-invariant prompts, thereby enhancing the model's ability to generalize across different visual domains. Experiments demonstrate that DiCoOp outperforms CoOp in domain generalization tasks across various visual domains.

研究旨在通过解决现有软提示方法（如Context Optimization CoOp）的局限性，提高视觉-语言模型的领域泛化能力。提出的Domain-invariant Context Optimization (DiCoOp) 方法通过对抗训练学习领域不变的提示，从而增强模型处理未见过的数据分布的能力。实验结果表明，DiCoOp 在各种视觉领域中的领域泛化任务中优于 CoOp。

Hydra: Unifying Document Retrieval and Generation in a Single Vision-Language Model

Authors: Athos Georgiou

First: 2026-03-30T15:17:41+00:00 · Latest: 2026-03-30T15:17:41+00:00

Comments: 17 pages, 2 figures, 7 tables. Model cards: https://huggingface.co/athrael-soju/HydraQwen3.5-4B · https://huggingface.co/athrael-soju/HydraQwen2.5-Omni-3B · https://huggingface.co/athrael-soju/ColQwen3.5-4B-controlled-baseline · https://huggingface.co/athrael-soju/DualHead-GritLM-Qwen3.5-4B. Scripts and evals: https://github.com/athrael-soju/hydra

Abs · PDF · Code1 · Code2 · Code3

Abstract

Visual document understanding typically requires separate retrieval and generation models, doubling memory and system complexity. We present Hydra, a dual-head approach that provides both ColBERT-style late-interaction retrieval and autoregressive generation from a single vision-language model (VLM). A single LoRA adapter, trained only for retrieval, is toggled at inference: enabling it produces multi-vector embeddings; disabling it recovers the base model's generation quality -- byte-identical outputs in 100% of 10,500 greedy and stochastic samples, with max delta-ANLS = 0.0044 across 15,301 samples on four VQA benchmarks (three informative; ChartQA is near-zero for both models under greedy decoding) when compared against an independent base-model pipeline. We identify three engineering requirements (attention-mode restoration, lm_head preservation, KV-cache-aware decoding) whose omission silently breaks generation despite correct weight recovery. On ViDoRe V1, Hydra (4B) is within 1 percentage point of a controlled single-head baseline in a single training run, with higher aggregate scores on V2 and V3 that are concentrated on a subset of tasks; multi-seed experiments are needed to confirm these trends. The single-model design reduces peak GPU memory by 41%, though adapter switching introduces throughput overhead under concurrent serving loads. An ablation shows that GritLM-style joint training provides no benefit within the LoRA-based (r=16) training regime. A proof-of-concept extension to Qwen2.5-Omni-3B demonstrates that the mechanism generalizes to audio retrieval and video embedding, with speech generation.

中文标题/摘要

标题：Hydra：在单一视觉语言模型中统一文档检索与生成

视觉文档理解通常需要单独的检索和生成模型，这会加倍内存和系统复杂性。我们提出了Hydra，这是一种双头方法，可以从单一视觉语言模型（VLM）提供ColBERT风格的晚期交互检索和自回归生成。在推理时，通过一个仅用于检索的单个LoRA适配器进行切换：启用它会产生多向量嵌入；禁用它则恢复基模型的生成质量——在10,500个贪婪和随机样本中，字节级输出完全一致，与四个VQA基准（三个信息性；ChartQA在贪婪解码下两者均为接近零）中的15,301个样本相比，最大delta-ANLS = 0.0044。我们确定了三个工程要求（注意力模式恢复、lm_head保护、KV缓存感知解码），其缺失会无声地破坏生成，尽管权重恢复正确。在ViDoRe V1上，Hydra（4B）在单次训练运行中与受控单头基线相差不到1个百分点，V2和V3的综合得分更高，集中在某些任务上；需要多种子实验来确认这些趋势。单一模型设计将峰值GPU内存减少了41%，尽管适配器切换在并发服务负载下引入了吞吐量开销。消融实验表明，在基于LoRA的（r=16）训练制度中，GritLM风格的联合训练没有提供任何好处。Qwen2.5-Omni-3B的原理证明表明，该机制可以推广到音频检索和视频嵌入，以及语音生成。

Summary / 总结

Hydra unifies document retrieval and generation in a single vision-language model, reducing system complexity. A single LoRA adapter, trained only for retrieval, is toggled at inference: enabling it produces multi-vector embeddings for ColBERT-style late-interaction retrieval, and disabling it recovers the base model's generation quality. On ViDoRe V1, Hydra (4B) is within 1 percentage point of a controlled single-head baseline in a single training run, with higher aggregate scores on V2 and V3 that still need multi-seed confirmation. The single-model design reduces peak GPU memory by 41%, though adapter switching introduces throughput overhead under concurrent serving loads.

Hydra 在单一视觉语言模型中统一了文档检索和生成，降低了系统复杂性。通过在推理时切换单一的 LoRA 适配器，在检索模式下生成多向量嵌入，在生成模式下恢复基模型的质量。在 ViDoRe V1 上，Hydra（4B）在单次训练运行中与受控单头基线相差不到 1 个百分点，在 V2 和 V3 上综合得分更高，但仍需多种子实验确认。单模型设计将峰值 GPU 内存减少了 41%，但适配器切换在并发服务负载下会引入吞吐量开销。
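The adapter toggle described in the Hydra abstract can be illustrated with a minimal sketch. It assumes the standard LoRA formulation (frozen weight plus a scaled low-rank update); the class name `ToggleableLoRALinear`, the `adapter_enabled` flag, and the toy matrices are hypothetical and not taken from the paper. The sketch covers only the weight path: the abstract notes that attention-mode restoration, lm_head preservation, and KV-cache-aware decoding are also required in practice.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(matrix: Matrix, vector: Vector) -> Vector:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class ToggleableLoRALinear:
    """Linear layer with a LoRA update that can be switched on or off.

    Hypothetical illustration: the base weight is never modified, so
    disabling the adapter reproduces the base layer's output exactly.
    """

    def __init__(self, weight: Matrix, lora_a: Matrix, lora_b: Matrix,
                 scale: float = 1.0) -> None:
        self.weight = weight      # frozen base weight, shape (out, in)
        self.lora_a = lora_a      # down-projection, shape (r, in)
        self.lora_b = lora_b      # up-projection, shape (out, r)
        self.scale = scale
        self.adapter_enabled = False

    def forward(self, x: Vector) -> Vector:
        base = matvec(self.weight, x)
        if not self.adapter_enabled:
            return base           # generation mode: base model only
        delta = matvec(self.lora_b, matvec(self.lora_a, x))
        # retrieval mode: base output plus the scaled low-rank update
        return [b + self.scale * d for b, d in zip(base, delta)]
```

Because the update is applied additively at call time rather than merged into the weight, switching modes needs no weight copy, which is consistent with the single-model memory saving the abstract reports.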

Coarse-Guided Visual Generation via Weighted h-Transform Sampling

Authors: Yanghao Wang, Ziqi Jiang, Zhen Wang, Long Chen

First: 2026-03-12T15:26:19+00:00 · Latest: 2026-03-30T15:10:02+00:00

Abs · PDF · Code1 · Code2

Abstract

Coarse-guided visual generation, which synthesizes fine visual samples from degraded or low-fidelity coarse references, is essential for various real-world applications. While training-based approaches are effective, they are inherently limited by high training costs and restricted generalization due to paired data collection. Accordingly, recent training-free works propose to leverage pretrained diffusion models and incorporate guidance during the sampling process. However, these training-free methods either require knowing the forward (fine-to-coarse) transformation operator, e.g., bicubic downsampling, or struggle to balance guidance and synthesis quality. To address these challenges, we propose a novel guided method by using the h-transform, a tool that can constrain stochastic processes (e.g., sampling process) under desired conditions. Specifically, we modify the transition probability at each sampling timestep by adding a drift function to the original differential equation, which approximately steers the generation toward the ideal fine sample. To address unavoidable approximation errors, we introduce a noise-level-aware schedule that gradually de-weights the term as the error increases, ensuring both guidance adherence and high-quality synthesis. Extensive experiments across diverse image and video generation tasks demonstrate the effectiveness and generalization of our method.

中文标题/摘要

标题：基于加权h-变换采样的粗粒度视觉生成

粗粒度视觉生成是从退化或低保真度的粗略参考中合成精细视觉样本的关键技术，对于各种实际应用至关重要。虽然基于训练的方法很有效，但它们固有限制于高昂的训练成本和由于配对数据收集受限的泛化能力。因此，最近的无训练方法提出利用预训练的扩散模型，并在采样过程中引入指导。然而，这些无训练方法要么需要知道正向（精细到粗略）变换算子，例如双立方下采样，要么难以在指导和合成质量之间取得平衡。为了解决这些挑战，我们提出了一种新颖的指导方法，使用h-变换，这是一种可以在期望条件下约束随机过程（例如采样过程）的工具。具体来说，我们通过在原始微分方程中添加一个漂移函数来修改每个采样时间步的转换概率，这大约会引导生成向理想的精细样本。为了解决不可避免的近似误差，我们引入了一种噪声级别感知的时间表，随着误差增加逐渐减少该项的权重，从而确保指导的遵守和高质量的合成。广泛的实验表明，我们的方法在各种图像和视频生成任务中具有有效性和泛化能力。

Summary / 总结

The paper addresses the challenge of coarse-guided visual generation, which involves synthesizing fine visual samples from degraded or low-fidelity coarse references. To overcome the limitations of training-based approaches, the authors propose a novel method using the h-transform, which modifies the transition probability at each sampling timestep to guide the generation process. This method introduces a noise-level-aware schedule to de-weight the term as errors increase, ensuring both adherence to guidance and high-quality synthesis. Experiments across various tasks show the method's effectiveness and generalization capabilities.

论文旨在解决从低保真度的粗糙参考中合成高质量的精细视觉样本的问题，这对于各种应用至关重要。提出了一种新颖的指导方法，使用h-transform来约束采样过程，在每个采样时间步添加一个漂移函数来引导生成向理想的精细样本。为处理近似误差，引入了一个噪声级别感知的时间表，逐渐减少该项的权重，确保指导的遵守和高质量的合成。在不同任务上的实验表明了该方法的有效性和泛化能力。
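For orientation, the drift modification mentioned in the abstract above can be written out in the textbook form of Doob's h-transform. The notation below (drift f, diffusion coefficient g, conditioning function h, approximation \hat{h}, and weight w(t)) is assumed for illustration; the abstract does not give the paper's exact equations or schedule.

```latex
% Base sampling SDE (assumed notation, not taken from the paper):
dX_t = f(X_t, t)\,dt + g(t)\,dW_t

% Doob's h-transform conditions the process on a desired event, with
% h(x, t) = P(\text{event} \mid X_t = x); the extra drift is g^2 \nabla_x \log h:
dX_t = \big[ f(X_t, t) + g(t)^2\,\nabla_x \log h(X_t, t) \big]\,dt + g(t)\,dW_t

% Weighted variant with an approximate \hat{h} and a hypothetical
% noise-level-aware schedule w(t) \in [0, 1] that shrinks where the
% approximation error is large:
dX_t = \big[ f(X_t, t) + w(t)\,g(t)^2\,\nabla_x \log \hat{h}(X_t, t) \big]\,dt + g(t)\,dW_t
```

Setting w(t) = 0 recovers the unguided sampler, while w(t) = 1 applies the full approximate guidance term.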

MaskDiME: Adaptive Masked Diffusion for Precise and Efficient Visual Counterfactual Explanations

Authors: Changlu Guo, Anders Nymark Christensen, Anders Bjorholm Dahl, Morten Rieger Hannemose

First: 2026-02-21T10:53:50+00:00 · Latest: 2026-03-30T14:49:34+00:00

Comments: Accepted by CVPR2026

Abs · PDF · Code1 · Code2

Abstract

Visual counterfactual explanations aim to reveal the minimal semantic modifications that can alter a model's prediction, providing causal and interpretable insights into deep neural networks. However, existing diffusion-based counterfactual generation methods are often computationally expensive, slow to sample, and imprecise in localizing the modified regions. To address these limitations, we propose MaskDiME, a simple, fast, yet effective diffusion framework that unifies semantic consistency and spatial precision through localized sampling. Our approach adaptively focuses on decision-relevant regions to achieve localized and semantically consistent counterfactual generation while preserving high image fidelity. Our training-free framework, MaskDiME, performs inference over 30x faster than the baseline and achieves comparable or state-of-the-art performance across five benchmark datasets spanning diverse visual domains, establishing a practical and generalizable solution for efficient counterfactual explanation.

中文标题/摘要

标题：MaskDiME：自适应掩码扩散以实现精确高效的视觉反事实解释

视觉反事实解释旨在揭示能够改变模型预测的最小语义修改，为深度神经网络提供因果和可解释的洞察。然而，现有的基于扩散的反事实生成方法通常计算成本高、采样速度慢且在局部修改区域定位方面不够精确。为了解决这些限制，我们提出了一种名为MaskDiME的简单、快速且有效的扩散框架，通过局部采样统一语义一致性和空间精度。我们的方法适应性地关注决策相关区域，以实现局部和语义一致的反事实生成，同时保持高图像保真度。我们的无需训练框架MaskDiME在推理速度上比基线快30倍，并在五个涵盖不同视觉领域的基准数据集上实现了可比或最先进的性能，为高效的反事实解释提供了一种实用且可推广的解决方案。

Summary / 总结

MaskDiME is a training-free diffusion framework that generates precise and efficient visual counterfactual explanations, addressing the computational cost and imprecise localization of existing diffusion-based methods. It uses localized sampling to focus adaptively on decision-relevant regions, ensuring both semantic consistency and spatial precision. Experimental results show that MaskDiME performs inference over 30x faster than the baseline while achieving comparable or state-of-the-art performance across five benchmark datasets.

MaskDiME 通过局部采样统一语义一致性和空间精度，以生成精确且高效的视觉反事实解释。它会适应性地关注决策相关区域，从而实现局部和语义一致的反事实生成，同时保持高图像保真度。MaskDiME 比基线快 30 倍，并在五个涵盖不同视觉领域的基准数据集上实现了可比或最先进的性能，使其成为一种实用且通用的高效反事实解释解决方案。

INSID3: Training-Free In-Context Segmentation with DINOv3

Authors: Claudia Cuttano, Gabriele Trivigno, Christoph Reich, Daniel Cremers, Carlo Masone, Stefan Roth

Venue: CVPR 2026

First: 2026-03-30T14:16:37+00:00 · Latest: 2026-03-30T14:16:37+00:00

Comments: CVPR 2026. Project page: https://visinf.github.io/INSID3

Abs · PDF · Code1 · Code2 · Code3 · Project1

Abstract

In-context segmentation (ICS) aims to segment arbitrary concepts, e.g., objects, parts, or personalized instances, given one annotated visual example. Existing work relies on (i) fine-tuning vision foundation models (VFMs), which improves in-domain results but harms generalization, or (ii) combining multiple frozen VFMs, which preserves generalization but yields architectural complexity and fixed segmentation granularities. We revisit ICS from a minimalist perspective and ask: Can a single self-supervised backbone support both semantic matching and segmentation, without any supervision or auxiliary models? We show that scaled-up dense self-supervised features from DINOv3 exhibit strong spatial structure and semantic correspondence. We introduce INSID3, a training-free approach that segments concepts at varying granularities only from frozen DINOv3 features, given an in-context example. INSID3 achieves state-of-the-art results across one-shot semantic, part, and personalized segmentation, outperforming previous work by +7.5% mIoU, while using 3x fewer parameters and without any mask or category-level supervision. Code is available at https://github.com/visinf/INSID3 .

中文标题/摘要

标题：INSID3：无需训练的上下文内分割方法与DINOv3

上下文内分割（ICS）旨在根据一个标注的视觉示例分割任意概念，例如对象、部件或个性化实例。现有工作依赖于（i）微调视觉基础模型（VFMs），这在领域内提高了结果但损害了泛化能力，或（ii）结合多个冻结的VFMs，这保留了泛化能力但带来了架构复杂性和固定的分割粒度。我们从极简主义的角度重新审视ICS，并提出：单个自监督骨干能否在无需任何监督或辅助模型的情况下同时支持语义匹配和分割？我们展示了来自DINOv3的大规模密集自监督特征表现出强烈的空间结构和语义对应。我们引入了INSID3，这是一种无需训练的方法，给定一个上下文示例，仅利用冻结的DINOv3特征即可分割不同粒度的概念。INSID3在单样本语义分割、部件分割和个性化分割方面达到了最先进的结果，比之前的工作高出7.5%的mIoU，同时参数量仅为其三分之一，且没有任何掩码或类别级别的监督。代码可在 https://github.com/visinf/INSID3 获取。

Summary / 总结

INSID3 is a training-free in-context segmentation approach that leverages the strong spatial structure and semantic correspondence of scaled-up dense self-supervised features from DINOv3. It segments arbitrary concepts from a single in-context example without any additional supervision or auxiliary models, achieving state-of-the-art results with 3x fewer parameters compared to previous methods.

该论文通过利用DINOv3的密集自监督特征提出了INSID3方法，解决了一次性上下文分割（ICS）的问题。该方法仅从一个标注的示例中分割任意概念，无需额外的监督或辅助模型。INSID3在mIoU上比之前的方法提高了7.5%，使用参数量减少了3倍，并且没有掩码或类别级别的监督。

The Scaffold Effect: How Prompt Framing Drives Apparent Multimodal Gains in Clinical VLM Evaluation

Authors: Doan Nam Long Vu, Simone Balloccu

First: 2026-03-30T12:58:10+00:00 · Latest: 2026-03-30T12:58:10+00:00

Abs · PDF · Code1 · Code2

Abstract

Trustworthy clinical AI requires that performance gains reflect genuine evidence integration rather than surface-level artifacts. We evaluate 12 open-weight vision-language models (VLMs) on binary classification across two clinical neuroimaging cohorts, FOR2107 (affective disorders) and OASIS-3 (cognitive decline). Both datasets come with structural MRI data that carries no reliable individual-level diagnostic signal. Under these conditions, smaller VLMs exhibit gains of up to 58% F1 upon introduction of neuroimaging context, with distilled models becoming competitive with counterparts an order of magnitude larger. A contrastive confidence analysis reveals that merely mentioning MRI availability in the task prompt accounts for 70-80% of this shift, independent of whether imaging data is present, a domain-specific instance of modality collapse we term the scaffold effect. Expert evaluation reveals fabrication of neuroimaging-grounded justifications across all conditions, and preference alignment, while eliminating MRI-referencing behavior, collapses both conditions toward random baseline. Our findings demonstrate that surface evaluations are inadequate indicators of multimodal reasoning, with direct implications for the deployment of VLMs in clinical settings.

中文标题/摘要

标题：支架效应：提示框架如何驱动临床VLM评估中看似多模态的性能提升

可靠的临床AI需要确保性能提升反映的是真正的证据整合，而不是表面的伪影。我们对12个开放权重的视觉-语言模型（VLMs）在两个临床神经影像学队列FOR2107（情感障碍）和OASIS-3（认知衰退）上的二分类任务进行了评估。两个数据集都包含结构MRI数据，但这些数据不含可靠的个体水平诊断信号。在这些条件下，较小的VLM在引入神经影像学上下文后，F1分数可提升高达58%，蒸馏模型甚至可与大一个数量级的模型相竞争。对比置信度分析表明，仅仅在任务提示中提到MRI可用性就可解释70-80%的这种变化，这与是否实际有影像数据无关；这是模态坍塌在特定领域的一个实例，我们称之为“支架效应”。专家评估发现，在所有条件下都存在捏造的、以神经影像学为依据的论证，而偏好对齐虽然消除了MRI引用行为，但使两种条件都向随机基线坍塌。我们的研究结果表明，表面评估不足以反映多模态推理，这对VLM在临床环境中的部署具有直接的含义。

Summary / 总结

This study evaluates 12 open-weight vision-language models on binary classification across two clinical neuroimaging cohorts, FOR2107 and OASIS-3. Although the structural MRI data carries no reliable individual-level diagnostic signal, smaller models gain up to 58% F1 when neuroimaging context is introduced, and distilled models become competitive with counterparts an order of magnitude larger. A contrastive confidence analysis shows that merely mentioning MRI availability in the task prompt accounts for 70-80% of this shift, regardless of whether imaging data is present; the authors call this the 'scaffold effect'. Expert evaluation finds fabricated neuroimaging-grounded justifications across all conditions, and preference alignment eliminates MRI-referencing behavior but collapses both conditions toward the random baseline. Surface-level evaluations are therefore inadequate indicators of multimodal reasoning.

研究评估了12个视觉-语言模型在临床神经影像数据上的表现，发现引入神经影像上下文后，小型模型可以获得高达58%的F1增益，其中70-80%的变化可归因于“支架效应”，即仅在任务提示中提及MRI可用性就会改变模型输出，而与是否实际提供影像数据无关。专家评估表明，这些增益并非来自真正的跨模态推理，而是表面层面的伪影，强调了临床AI中需要更稳健的评估方法。

MALLVI: A Multi-Agent Framework for Integrated Generalized Robotics Manipulation

Authors: Mehrshad Taji, Arad Mahdinezhad Kashani, Iman Ahmadi, AmirHossein Jadidi, Saina Kashani, Babak Khalaj

First: 2026-02-18T21:28:56+00:00 · Latest: 2026-03-30T12:50:11+00:00

Abs · PDF · Code1 · Code2 · Code3

Abstract

Task planning for robotic manipulation with large language models (LLMs) is an emerging area. Prior approaches rely on specialized models, fine tuning, or prompt tuning, and often operate in an open loop manner without robust environmental feedback, making them fragile in dynamic settings. MALLVI presents a Multi Agent Large Language and Vision framework that enables closed-loop feedback driven robotic manipulation. Given a natural language instruction and an image of the environment, MALLVI generates executable atomic actions for a robot manipulator. After action execution, a Vision Language Model (VLM) evaluates environmental feedback and decides whether to repeat the process or proceed to the next step. Rather than using a single model, MALLVI coordinates specialized agents, Decomposer, Localizer, Thinker, and Reflector, to manage perception, localization, reasoning, and high level planning. An optional Descriptor agent provides visual memory of the initial state. The Reflector supports targeted error detection and recovery by reactivating only relevant agents, avoiding full replanning. Experiments in simulation and real-world settings show that iterative closed loop multi agent coordination improves generalization and increases success rates in zero shot manipulation tasks. Code available at https://github.com/iman1234ahmadi/MALLVI .

中文标题/摘要

标题：MALLVI：一种综合通用机器人操作的多智能体框架

使用大型语言模型（LLMs）进行机器人操作的任务规划是一个新兴领域。先前的方法依赖于专门的模型、微调或提示调优，并且通常以开环方式运行，缺乏稳健的环境反馈，使其在动态环境中变得脆弱。MALLVI 提出了一种多智能体大型语言和视觉框架，能够实现闭环反馈驱动的机器人操作。给定自然语言指令和环境图像，MALLVI 生成可执行的原子动作供机器人执行。执行动作后，视觉语言模型（VLM）评估环境反馈并决定是否重复该过程或进行下一步。MALLVI 不使用单一模型，而是协调分解器、定位器、思考者和反思者等专门智能体来管理感知、定位、推理和高级规划。可选的描述者智能体提供初始状态的视觉记忆。反思者通过重新激活相关智能体进行有针对性的错误检测和恢复，避免全面重新规划。在模拟和真实世界设置中的实验表明，迭代闭环多智能体协调可以提高泛化能力并增加零样本操作任务的成功率。代码可在 https://github.com/iman1234ahmadi/MALLVI 获取。

Summary / 总结

MALLVI is a multi-agent framework that uses large language models and vision to enable closed-loop feedback-driven robotic manipulation. Given natural language instructions and environmental images, MALLVI generates executable actions and uses a Vision Language Model to evaluate feedback, deciding whether to repeat actions or proceed. The framework includes specialized agents for perception, localization, reasoning, and planning, with an optional visual memory agent. Experiments show that this approach improves generalization and success rates in zero-shot manipulation tasks.

MALLVI 是一个多代理框架，利用大型语言模型和视觉技术实现闭环反馈驱动的机器人操作。给定自然语言指令和图像后，MALLVI 生成原子动作，然后由视觉语言模型评估并决定后续动作。该框架包括感知、定位、推理和规划的专业代理，还有一个可选的代理用于视觉记忆。实验表明，这种迭代闭环方法可以提高零样本操作任务的一般化能力和成功率。

Improving Semantic Uncertainty Quantification in LVLMs with Semantic Gaussian Processes

Authors: Joseph Hoche, Andrei Bursuc, David Brellmann, Gilles Louppe, Pavel Izmailov, Angela Yao, Gianni Franchi

First: 2025-12-16T08:15:24+00:00 · Latest: 2026-03-30T12:43:20+00:00

Abs · PDF · Code1 · Code2

Abstract

Large Vision-Language Models (LVLMs) often produce plausible but unreliable outputs, making robust uncertainty estimation essential. Recent work on semantic uncertainty estimates relies on external models to cluster multiple sampled responses and measure their semantic consistency. However, these clustering methods are often fragile, highly sensitive to minor phrasing variations, and can incorrectly group or separate semantically similar answers, leading to unreliable uncertainty estimates. We propose Semantic Gaussian Process Uncertainty (SGPU), a Bayesian framework that quantifies semantic uncertainty by analyzing the geometric structure of answer embeddings, avoiding brittle clustering. SGPU maps generated answers into a dense semantic space, computes the Gram matrix of their embeddings, and summarizes their semantic configuration via the eigenspectrum. This spectral representation is then fed into a Gaussian Process Classifier that learns to map patterns of semantic consistency to predictive uncertainty, and that can be applied in both black-box and white-box settings. Across six LLMs and LVLMs on eight datasets spanning VQA, image classification, and textual QA, SGPU consistently achieves state-of-the-art calibration (ECE) and discriminative (AUROC, AUARC) performance. We further show that SGPU transfers across models and modalities, indicating that its spectral representation captures general patterns of semantic uncertainty.

中文标题/摘要

标题：使用语义高斯过程提高LVLM的语义不确定性量化

大型视觉-语言模型（LVLMs）经常生成看似合理但实际上不可靠的输出，因此稳健的不确定性估计至关重要。最近关于语义不确定性估计的工作依赖于外部模型对多个采样响应进行聚类并测量它们的语义一致性。然而，这些聚类方法往往很脆弱，对细微的措辞变化非常敏感，并且可能会错误地将语义相似的答案分组或分开，导致不可靠的不确定性估计。我们提出了语义高斯过程不确定性（SGPU），这是一种贝叶斯框架，通过分析答案嵌入的几何结构来量化语义不确定性，避免了脆弱的聚类。SGPU 将生成的答案映射到密集的语义空间，计算它们嵌入的格拉姆矩阵，并通过特征谱总结它们的语义配置。然后将这种谱表示输入到高斯过程分类器中，该分类器学习将语义一致性的模式映射到预测不确定性，并且可以在黑盒和白盒设置中应用。在六个LLM和LVLM的八个数据集上，包括VQA、图像分类和文本问答，SGPU始终实现了最先进的校准（ECE）和区分（AUROC、AUARC）性能。我们进一步表明，SGPU可以在模型和模态之间进行迁移，表明其谱表示捕捉到了语义不确定性的普遍模式。

Summary / 总结

The research aims to improve the reliability of semantic uncertainty quantification in Large Vision-Language Models (LVLMs) by proposing Semantic Gaussian Process Uncertainty (SGPU), a Bayesian framework that avoids fragile clustering methods. SGPU maps answers into a semantic space, computes the Gram matrix of their embeddings, and uses the eigenspectrum to represent semantic consistency, which is then used by a Gaussian Process Classifier to estimate uncertainty. Across various LLMs and LVLMs on different datasets, SGPU achieves state-of-the-art calibration and discriminative performance, and demonstrates transferability across models and modalities.

论文针对大型视觉-语言模型（LVLMs）产生不可靠输出的问题，提出了语义高斯过程不确定性（SGPU）框架，避免了脆弱的聚类方法。SGPU将答案映射到语义空间，计算嵌入的格拉姆矩阵，并通过特征谱来量化语义不确定性。实验结果显示，SGPU在六个LLM和LVLM的八个数据集上实现了最先进的校准和区分性能，并且在不同模型和模态之间具有可转移性，表明其特征谱捕捉到了语义不确定性的一般模式。

SEA: Evaluating Sketch Abstraction Efficiency via Element-level Commonsense Visual Question Answering

Authors: Jiho Park, Sieun Choi, Jaeyoon Seo, Minho Sohn, Yeana Kim, Jihie Kim

First: 2026-03-30T12:30:51+00:00 · Latest: 2026-03-30T12:30:51+00:00

Abs · PDF · Code1 · Code2

Abstract

A sketch is a distilled form of visual abstraction that conveys core concepts through simplified yet purposeful strokes while omitting extraneous detail. Despite its expressive power, quantifying the efficiency of semantic abstraction in sketches remains challenging. Existing evaluation methods that rely on reference images, low-level visual features, or recognition accuracy do not capture abstraction, the defining property of sketches. To address these limitations, we introduce SEA (Sketch Evaluation metric for Abstraction efficiency), a reference-free metric that assesses how economically a sketch represents class-defining visual elements while preserving semantic recognizability. These elements are derived per class from commonsense knowledge about features typically depicted in sketches. SEA leverages a visual question answering model to determine the presence of each element and returns a quantitative score that reflects semantic retention under visual economy. To support this metric, we present CommonSketch, the first semantically annotated sketch dataset, comprising 23,100 human-drawn sketches across 300 classes, each paired with a caption and element-level annotations. Experiments show that SEA aligns closely with human judgments and reliably discriminates levels of abstraction efficiency, while CommonSketch serves as a benchmark providing systematic evaluation of element-level sketch understanding across various vision-language models.

中文标题/摘要

标题：SEA：通过元素级常识视觉问答评估素描抽象效率

素描是一种简化的视觉抽象形式，通过简化的但有目的的笔触传达核心概念，同时省略不必要的细节。尽管具有表达力，但量化素描中语义抽象的效率仍然具有挑战性。现有的依赖参考图像、低级视觉特征或识别准确性的评估方法无法捕捉到素描的本质属性——抽象。为了解决这些局限性，我们引入了SEA（素描评估度量法，用于衡量抽象效率），这是一种无需参考图像的度量方法，评估素描如何经济地表示类定义的视觉元素，同时保持语义可识别性。这些元素是根据关于通常在素描中描绘的特征的常识知识，按类别提取的。SEA 利用视觉问答模型来确定每个元素的存在，并返回一个反映在视觉经济下的语义保留程度的定量分数。为了支持这一度量方法，我们提出了CommonSketch，这是第一个语义标注的素描数据集，包含23,100个人工绘制的素描，跨越300个类别，每个素描都配有一个描述和元素级注释。实验表明，SEA 与人类判断高度一致，并可靠地区分不同抽象效率的水平，而CommonSketch 作为基准，提供了对各种视觉-语言模型在元素级素描理解上的系统评估。

Summary / 总结

The paper introduces SEA, a reference-free metric to evaluate the efficiency of sketch abstraction by assessing how economically a sketch represents class-defining visual elements while preserving semantic recognizability. SEA uses a visual question answering model to determine the presence of each element and returns a quantitative score. The method leverages a new dataset, CommonSketch, which includes 23,100 human-drawn sketches across 300 classes with element-level annotations. Experiments show that SEA aligns well with human judgments and reliably discriminates levels of abstraction efficiency.

论文提出了SEA，这是一种无参考的度量方法，通过评估素描如何经济地表示类定义的视觉元素并保持语义可识别性来评估素描抽象的效率。SEA 使用视觉问答模型来确定这些元素的存在，并提供一个定量评分。该方法得到了一个名为CommonSketch的语义注释素描数据集的支持，该数据集包含23,100幅人工绘制的素描，覆盖300个类别，可用于系统评估视觉-语言模型的元素级素描理解能力。实验表明，SEA 与人类判断一致，并能可靠地区分抽象效率的不同水平。

CoE: Collaborative Entropy for Uncertainty Quantification in Agentic Multi-LLM Systems

Authors: Kangkang Sun, Jun Wu, Jianhua Li, Minyi Guo, Xiuzhen Che, Jianwei Huang

Venue: ICLR

First: 2026-03-30T12:28:26+00:00 · Latest: 2026-03-30T12:28:26+00:00

Comments: 18 pages, 7 figures, published at the ICLR workshop "Agentic AI in the Wild: From Hallucinations to Reliable Autonomy"

Abs · PDF · Code1 · Code2

Abstract

Uncertainty estimation in multi-LLM systems remains largely single-model-centric: existing methods quantify uncertainty within each model but do not adequately capture semantic disagreement across models. To address this gap, we propose Collaborative Entropy (CoE), a unified information-theoretic metric for semantic uncertainty in multi-LLM collaboration. CoE is defined on a shared semantic cluster space and combines two components: intra-model semantic entropy and inter-model divergence to the ensemble mean. CoE is not a weighted ensemble predictor; it is a system-level uncertainty measure that characterizes collaborative confidence and disagreement. We analyze several core properties of CoE, including non-negativity, zero-value certainty under perfect semantic consensus, and the behavior of CoE when individual models collapse to delta distributions. These results clarify when reducing per-model uncertainty is sufficient and when residual inter-model disagreement remains. We also present a simple CoE-guided, training-free post-hoc coordination heuristic as a practical application of the metric. Experiments on TriviaQA and SQuAD with LLaMA-3.1-8B-Instruct, Qwen-2.5-7B-Instruct, and Mistral-7B-Instruct show that CoE provides stronger uncertainty estimation than standard entropy- and divergence-based baselines, with gains becoming larger as additional heterogeneous models are introduced. Overall, CoE offers a useful uncertainty-aware perspective on multi-LLM collaboration.

中文标题/摘要

标题：CoE：用于智能体多LLM系统不确定性量化的协作熵

多LLM系统中的不确定性估计仍然主要集中在单模型上：现有方法在每个模型内部量化不确定性，但未能充分捕捉模型间的语义分歧。为解决这一问题，我们提出了协作熵（CoE），这是一种统一的信息论度量，用于多LLM协作中的语义不确定性。CoE 基于共享的语义簇空间，并结合了两个组成部分：模型内部的语义熵和各模型相对于集成均值的散度。CoE 不是一个加权的集成预测器，而是一个系统级的不确定性度量，能够表征协作的信心和分歧。我们分析了CoE 的几个核心属性，包括非负性、完美语义共识下的零值确定性，以及当个体模型退化为δ分布时CoE 的行为。这些结果阐明了何时减少每个模型的不确定性就足够了，以及何时残余的模型间分歧仍然存在。我们还提出了一种简单的CoE引导的、无需训练的后处理协调启发式方法，作为该度量的实际应用。在TriviaQA和SQuAD上使用LLaMA-3.1-8B-Instruct、Qwen-2.5-7B-Instruct和Mistral-7B-Instruct进行的实验表明，CoE 提供了比标准的基于熵和散度的基线更强的不确定性估计，随着引入更多异构模型，这种优势变得更为明显。总体而言，CoE 为多LLM协作提供了一个有用的、具备不确定性感知的视角。

Summary / 总结

The paper addresses the limitation of single-model-centric uncertainty estimation in multi-LLM systems by proposing Collaborative Entropy (CoE), which quantifies semantic uncertainty through a unified information-theoretic metric. CoE combines intra-model semantic entropy and inter-model divergence to the ensemble mean, providing a system-level measure of collaborative confidence and disagreement. Experiments on TriviaQA and SQuAD demonstrate that CoE outperforms standard entropy- and divergence-based baselines, especially when additional heterogeneous models are included, offering a stronger uncertainty estimation.

论文针对多LLM系统中单模型中心的不确定性估计不足，提出了协作熵（CoE），用于跨模型量化语义不确定性。CoE 结合了模型内部的语义熵和与整体平均值的模型间分歧，提供了一个系统级的协作信心和分歧度量。实验结果表明，CoE 在TriviaQA 和 SQuAD 上的表现优于标准的熵和分歧度量方法，特别是在引入异构模型时效果更佳。
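The two components named in the CoE abstract can be sketched numerically. This is an assumed instantiation, not the paper's definition: it sums the weighted per-model entropy and the weighted KL divergence from each model to the ensemble mean, over a shared set of semantic clusters. The function name `collaborative_entropy_sketch` and the uniform default weights are hypothetical.

```python
import math
from typing import List, Optional, Sequence


def entropy(p: Sequence[float]) -> float:
    """Shannon entropy in nats, skipping zero-probability clusters."""
    return -sum(x * math.log(x) for x in p if x > 0)


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q); q must be positive wherever p is positive."""
    return sum(x * math.log(x / y) for x, y in zip(p, q) if x > 0)


def collaborative_entropy_sketch(
    dists: List[Sequence[float]],
    weights: Optional[Sequence[float]] = None,
) -> float:
    """Assumed form: intra-model entropy + divergence to the ensemble mean.

    Each entry of `dists` is one model's distribution over the same
    semantic clusters. Weights default to uniform (an assumption).
    """
    n = len(dists)
    weights = weights or [1.0 / n] * n
    k = len(dists[0])
    mean = [sum(w * d[i] for w, d in zip(weights, dists)) for i in range(k)]
    intra = sum(w * entropy(d) for w, d in zip(weights, dists))
    inter = sum(w * kl_divergence(d, mean) for w, d in zip(weights, dists))
    return intra + inter
```

Under this assumed form, two models that are each fully confident but pick different clusters have zero intra-model entropy yet a positive score, which mirrors the abstract's point that residual inter-model disagreement can remain after per-model uncertainty is removed.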

From Observation to Action: Latent Action-based Primitive Segmentation for VLA Pre-training in Industrial Settings

Authors: Jiajie Zhang, Sören Schwertfeger, Alexander Kleiner

Venue: CVPR 2026

First: 2025-11-26T14:19:44+00:00 · Latest: 2026-03-30T11:44:12+00:00

Comments: 10 pages, 5 figures, Accepted to CVPR 2026

Abs · PDF · Code1 · Code2

Abstract

We present a novel unsupervised framework to unlock vast unlabeled human demonstration data from continuous industrial video streams for Vision-Language-Action (VLA) model pre-training. Our method first trains a lightweight motion tokenizer to encode motion dynamics, then employs an unsupervised action segmenter leveraging a novel "Latent Action Energy" metric to discover and segment semantically coherent action primitives. The pipeline outputs both segmented video clips and their corresponding latent action sequences, providing structured data directly suitable for VLA pre-training. Evaluations on public benchmarks and a proprietary electric motor assembly dataset demonstrate effective segmentation of key tasks performed by humans at workstations. Further clustering and quantitative assessment via a Vision-Language Model confirm the semantic coherence of the discovered action primitives. To our knowledge, this is the first fully automated end-to-end system for extracting and organizing VLA pre-training data from unstructured industrial videos, offering a scalable solution for embodied AI integration in manufacturing.

中文标题/摘要

标题：从观察到行动：面向工业场景VLA预训练的基于潜在动作的基元分割

我们提出了一种新颖的无监督框架，以解锁来自连续工业视频流的大量未标记的人类演示数据，用于Vision-Language-Action (VLA) 模型的预训练。该方法首先训练一个轻量级的运动分词器以编码运动动力学，然后采用一个无监督动作分割器，利用一种新颖的“潜在动作能量”度量来发现和分割语义上一致的动作基元。该流水线输出分割后的视频片段及其对应的潜在动作序列，提供直接适用于VLA预训练的结构化数据。在公共基准测试和一个专有的电动机装配数据集上的评估表明，该方法能够有效分割人类在工位上执行的关键任务。进一步的聚类和通过视觉语言模型进行的定量评估证实了发现的动作基元的语义一致性。据我们所知，这是第一个从非结构化工业视频中提取和组织VLA预训练数据的全自动端到端系统，为制造业中的具身智能集成提供了可扩展的解决方案。

ScenePilot-4K: A Large-Scale First-Person Dataset and Benchmark for Vision-Language Models in Autonomous Driving

Authors: Yujin Wang, Yutong Zheng, Wenxian Fan, Tianyi Wang, Hongqing Chu, Li Zhang, Bingzhao Gao, Daxin Tian, Jianqiang Wang, Hong Chen

First: 2026-01-27T13:17:50+00:00 · Latest: 2026-03-30T11:42:45+00:00

Abs · PDF · Code1 · Code2 · Code3

Abstract

In this paper, we introduce ScenePilot-4K, a large-scale first-person dataset for safety-aware vision-language learning and evaluation in autonomous driving. Built from public online driving videos, ScenePilot-4K contains 3,847 hours of video and 27.7M front-view frames spanning 63 countries/regions and 1,210 cities. It jointly provides scene-level natural-language descriptions, risk assessment labels, key-participant annotations, ego trajectories, and camera parameters through a unified multi-stage annotation pipeline. Building on this dataset, we establish ScenePilot-Bench, a standardized benchmark that evaluates vision-language models along four complementary axes: scene understanding, spatial perception, motion planning, and GPT-based semantic alignment. The benchmark includes fine-grained metrics and geographic generalization settings that expose model robustness under cross-region and cross-traffic domain shifts. Baseline results on representative open-source and proprietary vision-language models show that current models remain competitive in high-level scene semantics but still exhibit substantial limitations in geometry-aware perception and planning-oriented reasoning. Beyond the released dataset itself, the proposed annotation pipeline serves as a reusable and extensible recipe for scalable dataset construction from public Internet driving videos. The codes and supplementary materials are available at: https://github.com/yjwangtj/ScenePilot-4K, with the dataset available at https://huggingface.co/datasets/larswangtj/ScenePilot-4K.

中文标题/摘要

标题：ScenePilot-4K：面向自动驾驶视觉语言模型的大规模第一人称数据集与基准

在本文中，我们介绍了ScenePilot-4K，这是一个用于自动驾驶中安全感知视觉语言学习和评估的大规模第一人称数据集。该数据集源自公共在线驾驶视频，包含3,847小时的视频和27.7M个前视帧，覆盖63个国家和地区及1,210个城市。通过统一的多阶段注释流水线，它同时提供了场景级自然语言描述、风险评估标签、关键参与者注释、自车轨迹和相机参数。基于此数据集，我们建立了ScenePilot-Bench，这是一个标准化基准，从场景理解、空间感知、运动规划和基于GPT的语义对齐四个方面评估视觉语言模型。基准包括细粒度的度量标准和地理泛化设置，以暴露模型在跨区域和跨交通领域变化下的鲁棒性。代表性的开源和专有视觉语言模型的基线结果显示，当前模型在高层次场景语义方面仍具有竞争力，但在几何感知和规划导向推理方面仍存在显著局限。除了发布的数据集本身，提出的注释流水线还为从公共互联网驾驶视频构建可扩展数据集提供了一个可重用和可扩展的方案。相关代码和补充材料可在 https://github.com/yjwangtj/ScenePilot-4K 获取，数据集可在 https://huggingface.co/datasets/larswangtj/ScenePilot-4K 获取。

Summary / 总结

ScenePilot-4K is a large-scale first-person dataset for safety-aware vision-language learning in autonomous driving, containing 3,847 hours of video and 27.7M front-view frames spanning 63 countries/regions and 1,210 cities. It provides scene-level descriptions, risk assessment labels, key-participant annotations, ego trajectories, and camera parameters. ScenePilot-Bench, a benchmark built on this dataset, evaluates models on scene understanding, spatial perception, motion planning, and GPT-based semantic alignment. Baseline results indicate that current models remain competitive in high-level scene semantics but show substantial limitations in geometry-aware perception and planning-oriented reasoning. The annotation pipeline is presented as a reusable and extensible recipe for constructing datasets from public driving videos.

ScenePilot-4K 是一个大规模的第一人称数据集，包含来自 63 个国家/地区、1,210 个城市的 3,847 小时视频和 27.7M 个前视帧。它包括场景描述、风险标签以及关键参与者、自车轨迹和相机参数的注释。基准测试 ScenePilot-Bench 评估模型在场景理解、空间感知、运动规划和语义对齐方面的表现，揭示了在几何感知和规划导向推理方面的局限性。代码和数据集已公开发布。

Generating Findings for Jaw Cysts in Dental Panoramic Radiographs Using a GPT-Based VLM: A Preliminary Study on Building a Two-Stage Self-Correction Loop with Structured Output (SLSO) Framework

Authors: Nanaka Hosokawa, Ryo Takahashi, Tomoya Kitano, Yukihiro Iida, Chisako Muramatsu, Tatsuro Hayashi, Yuta Seino, Xiangrong Zhou, Takeshi Hara, Akitoshi Katsumata, Hiroshi Fujita

First: 2025-10-02T13:22:13+00:00 · Latest: 2026-03-30T10:54:10+00:00

Comments: Revised manuscript; supplementary materials added. Submitted to Diagnostics

Abs · PDF · Code1 · Code2

Abstract

Vision-language models (VLMs) such as GPT (Generative Pre-Trained Transformer) have shown potential for medical image interpretation; however, challenges remain in generating reliable radiological findings in clinical practice, as exemplified by dental pathologies. This study proposes a Self-correction Loop with Structured Output (SLSO) framework as an integrated processing methodology to enhance the accuracy and reliability of AI-generated findings for jaw cysts in dental panoramic radiographs. Dental panoramic radiographs with jaw cysts were used to implement a 10-step integrated processing framework incorporating image analysis, structured data generation, tooth number extraction, consistency checking, and iterative regeneration. The framework functioned as an external validation mechanism for GPT outputs. Performance was compared against the conventional Chain-of-Thought (CoT) method across seven evaluation items: transparency, internal structure, borders, root resorption, tooth movement, relationships with other structures, and tooth number. The SLSO framework improved output accuracy for multiple items compared to the CoT method, with the most notable improvements observed in tooth number identification, tooth movement detection, and root resorption assessment. In successful cases, consistently structured outputs were achieved after up to five regenerations. The framework enforced explicit negative finding descriptions and suppressed hallucinations, although accurate identification of extensive lesions spanning multiple teeth remained limited. This investigation established the feasibility of the proposed integrated processing methodology and provided a foundation for future validation studies with larger, more diverse datasets.

中文标题/摘要

标题：使用基于GPT的VLM为牙科全景片中的颌骨囊肿生成影像所见：构建带结构化输出的两阶段自我校正循环（SLSO）框架的初步研究

视觉语言模型（VLMs）如GPT（生成预训练变换器）在医学图像解释方面显示出潜力；然而，在临床实践中生成可靠的放射学发现仍面临挑战，牙科病变即是一例。本研究提出了一种带结构化输出的自我校正循环（SLSO）框架作为集成处理方法，以提高AI为牙科全景片中的颌骨囊肿生成的影像所见的准确性和可靠性。使用包含颌骨囊肿的牙科全景片实施了一个10步集成处理框架，结合了图像分析、结构化数据生成、牙位编号提取、一致性检查和迭代重新生成。该框架作为GPT输出的外部验证机制。在七个评估项目（透明度、内部结构、边界、牙根吸收、牙齿移位、与其他结构的关系和牙位编号）上，与传统的思维链（CoT）方法进行了性能比较。SLSO框架在多个项目上提高了输出准确性，其中牙位编号识别、牙齿移位检测和牙根吸收评估的改进最为显著。在成功案例中，经过最多五次重新生成后实现了结构一致的输出。该框架强制要求明确描述阴性所见并抑制幻觉，尽管对跨越多颗牙齿的广泛病变的准确识别仍然有限。这项研究确立了所提出集成处理方法的可行性，并为未来使用更大、更多样数据集的验证研究奠定了基础。

Summary / 总结

This study introduces a Self-correction Loop with Structured Output (SLSO) framework to improve the accuracy and reliability of AI-generated findings for jaw cysts in dental panoramic radiographs using a GPT-based VLM. The framework incorporates image analysis, structured data generation, tooth number extraction, and iterative regeneration, and was compared against the conventional Chain-of-Thought (CoT) method across seven evaluation items. The SLSO framework showed improvements in tooth number identification, tooth movement detection, and root resorption assessment, with consistently structured outputs achieved after up to five regenerations. However, accurate identification of extensive lesions spanning multiple teeth was still limited.

本研究提出了一种Self-correction Loop with Structured Output (SLSO)框架，利用基于GPT的视觉-语言模型来提高牙颌囊肿在全景牙科X光片中的AI生成诊断的准确性。该框架包括图像分析、结构化数据生成和迭代再生，并与Chain-of-Thought (CoT)方法在七个评估项目中进行了比较。SLSO框架在识别牙齿数量、检测牙齿移动和评估根吸收方面提高了准确性，且在最多五次再生后实现了一致的输出结果。
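
The generate, check, and regenerate cycle described for SLSO above can be sketched roughly as follows. This is a minimal illustration and not the authors' 10-step pipeline: the `generate` callable, the `REQUIRED_KEYS` schema, and the FDI two-digit tooth-number check in `is_consistent` are all assumptions introduced here. Only the cap of five regenerations comes from the abstract.

```python
import json
from typing import Callable, Optional

# Hypothetical schema; the paper's actual structured-output fields may differ.
REQUIRED_KEYS = {"borders", "root_resorption", "tooth_movement", "tooth_numbers"}
MAX_REGENERATIONS = 5  # the abstract reports success after up to five regenerations


def is_consistent(finding: dict) -> bool:
    """Assumed check: every key is present and tooth numbers are valid FDI codes."""
    if not REQUIRED_KEYS.issubset(finding):
        return False
    teeth = finding["tooth_numbers"]
    if not isinstance(teeth, list) or not teeth:
        return False
    # FDI permanent teeth: quadrant digit 1-4 followed by tooth digit 1-8.
    return all(
        isinstance(t, int) and 1 <= t // 10 <= 4 and 1 <= t % 10 <= 8
        for t in teeth
    )


def self_correct(generate: Callable[[str], str], prompt: str) -> Optional[dict]:
    """Call the model, validate its structured output, and regenerate on failure."""
    feedback = ""
    for _ in range(1 + MAX_REGENERATIONS):
        raw = generate(prompt + feedback)
        try:
            finding = json.loads(raw)
        except json.JSONDecodeError:
            feedback = "\nPrevious output was not valid JSON."
            continue
        if isinstance(finding, dict) and is_consistent(finding):
            return finding
        feedback = "\nPrevious output failed the consistency check."
    return None  # no consistent output within the regeneration budget
```

Keeping the validator outside the model call mirrors the abstract's description of the framework as an external validation mechanism: the model is never asked to judge its own output.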

DIV-Nav: Open-Vocabulary Spatial Relationships for Multi-Object Navigation

Authors: Jesús Ortega-Peimbert, Finn Lukas Busch, Timon Homberger, Quantao Yang, Olov Andersson

First: 2025-10-18T14:22:32+00:00 · Latest: 2026-03-30T10:11:54+00:00

Abs · PDF · Code1 · Code2 · Project1

Abstract

Advances in open-vocabulary semantic mapping and object navigation have enabled robots to perform an informed search of their environment for an arbitrary object. However, such zero-shot object navigation is typically designed for simple queries with an object name like "television" or "blue rug". Here, we consider more complex free-text queries with spatial relationships, such as "find the remote on the table", while still leveraging the robustness of a semantic map. We present DIV-Nav, a real-time navigation system that efficiently addresses this problem through a series of relaxations: i) Decomposing natural language instructions with complex spatial constraints into simpler object-level queries on a semantic map, ii) computing the Intersection of individual semantic belief maps to identify regions where all objects co-exist, and iii) Validating the discovered objects against the original, complex spatial constraints via an LVLM. We further investigate how to adapt the frontier exploration objectives of online semantic mapping to such spatial search queries to more effectively guide the search process. We validate our system through extensive experiments on the MultiON benchmark and real-world deployment on a Boston Dynamics Spot robot using a Jetson Orin AGX. More details and videos are available at https://anonsub42.github.io/reponame/

中文标题/摘要

标题：DIV-Nav: 开放词汇空间关系的多对象导航

开放词汇语义映射和对象导航的进步使机器人能够对其环境进行有信息量的搜索以找到任意对象。然而，这种零样本对象导航通常仅针对简单的查询，如“电视机”或“蓝色地毯”。在这里，我们考虑更复杂的自然语言查询，包含空间关系，例如“找到桌子上的遥控器”，同时仍然利用语义地图的鲁棒性。我们提出了DIV-Nav，这是一种实时导航系统，通过一系列放松措施高效地解决这一问题：i) 将具有复杂空间约束的自然语言指令分解为语义地图上的简单对象级查询，ii) 计算个体语义信念图的交集以确定所有对象共存的区域，iii) 通过LVLM验证发现的对象是否符合原始的复杂空间约束。我们进一步研究如何将在线语义映射的前沿探索目标适应此类空间搜索查询，以更有效地引导搜索过程。我们通过在MultiON基准上的广泛实验以及在Boston Dynamics Spot机器人上的实际部署进行了验证，使用的是Jetson Orin AGX。更多细节和视频可在 https://anonsub42.github.io/reponame/ 获取。

Summary / 总结

The research aims to enable robots to navigate and find objects based on more complex free-text queries with spatial relationships. DIV-Nav decomposes natural language instructions into simpler object-level queries, computes the intersection of semantic belief maps to identify co-existing regions, and validates objects against spatial constraints using an LVLM. The system was validated through experiments on the MultiON benchmark and real-world deployment on a Boston Dynamics Spot robot.

研究旨在使机器人能够基于包含空间关系的复杂自然语言查询（如“找到桌子上的遥控器”）来导航和寻找物体。DIV-Nav将自然语言指令分解为更简单的查询，计算语义信念图的交集，并通过空间约束验证发现的物体。在MultiON基准测试和Boston Dynamics Spot机器人上的实验表明，该系统能够有效处理复杂查询和空间关系。
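
The intersection step of DIV-Nav can be illustrated with a toy grid. The sketch below assumes each object query yields a 2D belief map with values in [0, 1] and uses an element-wise minimum as the intersection operator. Both the operator and the example values are assumptions made for illustration, since the abstract does not specify how the intersection is computed.

```python
from typing import List, Tuple

Grid = List[List[float]]


def intersect_beliefs(maps: List[Grid]) -> Grid:
    """Element-wise minimum: a cell scores high only if every object is likely there."""
    rows, cols = len(maps[0]), len(maps[0][0])
    return [[min(m[r][c] for m in maps) for c in range(cols)] for r in range(rows)]


def best_cell(grid: Grid) -> Tuple[int, int]:
    """Return the (row, col) of the highest joint belief, a candidate search goal."""
    cells = ((r, c) for r in range(len(grid)) for c in range(len(grid[0])))
    return max(cells, key=lambda rc: grid[rc[0]][rc[1]])


# Hypothetical belief maps for the query "find the remote on the table".
remote = [[0.1, 0.8, 0.2],
          [0.0, 0.7, 0.9]]
table = [[0.2, 0.9, 0.1],
         [0.1, 0.3, 0.4]]

joint = intersect_beliefs([remote, table])
goal = best_cell(joint)
```

In this toy case the cell where the remote alone is most likely, (1, 2), loses to (0, 1) once the table belief is taken into account. A found candidate would still need the final validation step, which the abstract assigns to an LVLM.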

From Unlearning to UNBRANDING: A Benchmark for Trademark-Safe Text-to-Image Generation

Authors: Dawid Malarz, Filip Manjak, Maciej Zięba, Przemysław Spurek, Artur Kasymov

First: 2025-12-15T23:15:36+00:00 · Latest: 2026-03-30T09:44:18+00:00

Abs · PDF · Code1 · Code2 · Project1

Abstract

The rapid progress of text-to-image diffusion models raises significant concerns regarding the unauthorized reproduction of trademarked content. While prior work targets general concepts (e.g., styles, celebrities), it fails to address specific brand identifiers. Brand recognition is multi-dimensional, extending beyond explicit logos to encompass distinctive structural features (e.g., a car's front grille). To tackle this, we introduce unbranding, a novel task for the fine-grained removal of both trademarks and subtle structural brand features, while preserving semantic coherence. We construct a benchmark dataset and introduce a novel evaluation framework combining Vision Language Models (VLMs) with segmentation-based classifiers trained on human annotations of logos and trade dress features, addressing the limitations of existing brand detectors that fail to capture abstract trade dress. Furthermore, we observe that newer, higher-fidelity systems (SDXL, FLUX) synthesize brand identifiers more readily than older models, highlighting the urgency of this challenge. Our results confirm that unbranding is a distinct problem requiring specialized techniques. Project Page: https://gmum.github.io/UNBRANDING/.

中文标题/摘要

标题：从遗忘学习到UNBRANDING：面向商标安全文本到图像生成的基准

文本到图像扩散模型的迅速进步引发了对未经授权复制商标内容的重大关切。尽管先前的工作针对一般概念（例如，风格、名人），但未能解决特定品牌标识的问题。品牌识别是多维度的，不仅包括显性的商标标志，还包括独特的结构特征（例如，汽车的前格栅）。为了解决这一问题，我们引入了去品牌化这一新任务，旨在精细地去除商标和微妙的品牌结构特征，同时保持语义连贯性。我们构建了一个基准数据集，并引入了一种结合视觉语言模型（VLMs）与基于人类注释的标志和商标特征分割分类器的新评估框架，解决了现有品牌检测器无法捕捉抽象商标特征的局限性。此外，我们观察到，较新的、更高保真度的系统（SDXL、FLUX）比较旧的模型更容易合成品牌标识，突显了这一挑战的紧迫性。我们的结果证实，去品牌化是一个需要专门技术的独特问题。项目页面：https://gmum.github.io/UNBRANDING/

Summary / 总结

The paper addresses the issue of unauthorized reproduction of trademarked content in text-to-image generation models. It introduces the task of unbranding, which involves removing both explicit and subtle brand features while maintaining semantic coherence. The authors create a benchmark dataset and evaluation framework using Vision Language Models and segmentation-based classifiers. They find that newer models like SDXL and FLUX are more likely to generate brand identifiers, emphasizing the need for specialized unbranding techniques.

该论文针对生成文本到图像内容时不复制商标的问题，引入了一个新的任务——去品牌化。作者开发了一个基准数据集和评估框架，使用视觉语言模型和基于分割的分类器来评估显性logo和细微品牌特征的去除情况。他们发现，像SDXL和FLUX这样的更高保真度的新模型更容易生成品牌标识，强调了需要专门的去品牌化技术的重要性。结果表明，去品牌化是一个需要独特方法来确保文本到图像生成中的商标安全的独立问题。

Explaining CLIP Zero-shot Predictions Through Concepts

Authors: Onat Ozdemir, Anders Christensen, Stephan Alaniz, Zeynep Akata, Emre Akbas

Venue: CVPR 2026

First: 2026-03-30T09:31:33+00:00 · Latest: 2026-03-30T09:31:33+00:00

Comments: Accepted to CVPR 2026

Abs · PDF · Code1 · Code2 · Code3

Abstract

Large-scale vision-language models such as CLIP have achieved remarkable success in zero-shot image recognition, yet their predictions remain largely opaque to human understanding. In contrast, Concept Bottleneck Models provide interpretable intermediate representations by reasoning through human-defined concepts, but they rely on concept supervision and lack the ability to generalize to unseen classes. We introduce EZPC that bridges these two paradigms by explaining CLIP's zero-shot predictions through human-understandable concepts. Our method projects CLIP's joint image-text embeddings into a concept space learned from language descriptions, enabling faithful and transparent explanations without additional supervision. The model learns this projection via a combination of alignment and reconstruction objectives, ensuring that concept activations preserve CLIP's semantic structure while remaining interpretable. Extensive experiments on five benchmark datasets, CIFAR-100, CUB-200-2011, Places365, ImageNet-100, and ImageNet-1k, demonstrate that our approach maintains CLIP's strong zero-shot classification accuracy while providing meaningful concept-level explanations. By grounding open-vocabulary predictions in explicit semantic concepts, our method offers a principled step toward interpretable and trustworthy vision-language models. Code is available at https://github.com/oonat/ezpc.

中文标题/摘要

标题：通过概念解释CLIP零样本预测

大规模的跨模态模型如CLIP在零样本图像识别方面取得了显著成功，但其预测对人类理解仍然不透明。相比之下，概念瓶颈模型通过人类定义的概念进行推理，提供可解释的中间表示，但它们依赖于概念监督，缺乏泛化到未见过的类别的能力。我们提出了EZPC，通过人类可理解的概念来解释CLIP的零样本预测，从而弥合了这两种范式的差距。我们的方法将CLIP的联合图像-文本嵌入投影到从语言描述中学习的概念空间，无需额外监督即可实现忠实和透明的解释。模型通过对齐和重建目标的学习来实现这种投影，确保概念激活保留CLIP的语义结构同时保持可解释性。在五个基准数据集CIFAR-100、CUB-200-2011、Places365、ImageNet-100和ImageNet-1k上的广泛实验表明，我们的方法在保持CLIP强大的零样本分类准确性的同时，还提供了有意义的概念级解释。通过将开放词汇的预测与明确的语义概念联系起来，我们的方法为可解释和可信的跨模态模型提供了一个原则性的步骤。代码可在https://github.com/oonat/ezpc/获取。

Summary / 总结

This paper introduces EZPC, which combines the strengths of CLIP and Concept Bottleneck Models to provide interpretable explanations for CLIP's zero-shot image recognition predictions. EZPC projects CLIP's joint image-text embeddings into a concept space learned from language descriptions, enabling faithful and transparent explanations without additional supervision. Experiments on five benchmark datasets show that EZPC maintains CLIP's strong zero-shot classification accuracy while offering meaningful concept-level explanations.

研究旨在通过利用人类定义的概念来增强CLIP零样本图像识别预测的可解释性。EZPC将CLIP的联合图像-文本嵌入投影到从语言描述中学到的概念空间中，从而在无需额外监督的情况下实现透明且忠实的解释。在五个基准数据集上的实验表明，EZPC保持了CLIP强大的零样本分类准确性，同时提供了有意义的概念级解释，推动了视觉语言模型的可解释性和可信性。
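
To make the idea of explaining a zero-shot prediction through concepts concrete, here is a toy sketch. It assumes that image and class-text embeddings are mapped by one shared linear projection into a concept space, and that a class score is the sum of per-concept products. That scoring rule, the function names, and all numbers are illustrative assumptions, not the EZPC method as published.

```python
from typing import Dict, List, Tuple

Vector = List[float]


def project(concept_matrix: List[Vector], embedding: Vector) -> Vector:
    """Map an embedding to concept activations (one matrix row per concept)."""
    return [sum(w * x for w, x in zip(row, embedding)) for row in concept_matrix]


def explain(
    image_emb: Vector,
    class_embs: Dict[str, Vector],
    concept_matrix: List[Vector],
    concept_names: List[str],
    top_k: int = 1,
) -> Tuple[str, List[Tuple[str, float]]]:
    """Predict a class and return the concepts contributing most to its score."""
    image_concepts = project(concept_matrix, image_emb)
    contributions = {}
    for name, emb in class_embs.items():
        text_concepts = project(concept_matrix, emb)
        contributions[name] = [a * b for a, b in zip(image_concepts, text_concepts)]
    predicted = max(contributions, key=lambda n: sum(contributions[n]))
    ranked = sorted(
        zip(concept_names, contributions[predicted]),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return predicted, ranked[:top_k]


# Toy 3-d embeddings and an identity projection, purely for illustration.
concept_names = ["has wings", "has fur", "has wheels"]
concept_matrix = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
class_embs = {"bird": [1.0, 0.0, 0.0], "cat": [0.0, 1.0, 0.0]}

predicted, top_concepts = explain([0.9, 0.1, 0.0], class_embs, concept_matrix, concept_names)
```

Because the class score decomposes into a sum over concepts, each term can be read as that concept's share of the decision, which is the kind of concept-level explanation the abstract refers to.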

Memory-Augmented Vision-Language Agents for Persistent and Semantically Consistent Object Captioning

Authors: Tommaso Galliena, Stefano Rosa, Tommaso Apicella, Pietro Morerio, Alessio Del Bue, Lorenzo Natale

First: 2026-03-25T12:52:32+00:00 · Latest: 2026-03-30T09:01:07+00:00

Comments: 24 pages, 7 figures, 7 tables (including Supplementary Materials)

Abs · PDF · Code1 · Code2 · Project1

Abstract

Vision-Language Models (VLMs) often yield inconsistent descriptions of the same object across viewpoints, hindering the ability of embodied agents to construct consistent semantic representations over time. Previous methods resolved inconsistencies using offline multi-view aggregation or multi-stage pipelines that decouple exploration, data association, and caption learning, with limited capacity to reason over previously observed objects. In this paper, we introduce a unified, memory-augmented Vision-Language agent that simultaneously handles data association, object captioning, and exploration policy within a single autoregressive framework. The model processes the current RGB observation, a top-down explored map, and an object-level episodic memory serialized into object-level tokens, ensuring persistent object identity and semantic consistency across extended sequences. To train the model in a self-supervised manner, we collect a dataset in photorealistic 3D environments using a disagreement-based policy and a pseudo-captioning model that enforces consistency across multi-view caption histories. Extensive evaluation on a manually annotated object-level test set demonstrates improvements of up to +11.86% in standard captioning scores and +7.39% in caption self-similarity over baseline models, while enabling scalable performance through a compact scene representation. Code, model weights, and data are available at https://hsp-iit.github.io/epos-vlm/.

中文标题/摘要

标题：具有持久且语义一致对象描述能力的记忆增强视觉-语言代理

视觉-语言模型（VLMs）经常在不同视角下对同一对象给出不一致的描述，这妨碍了具身代理在时间上构建一致的语义表示的能力。先前的方法通过离线多视角聚合或拆分探索、数据关联和描述学习的多阶段管道来解决不一致性，但这些方法在推理已观察对象方面的能力有限。在本文中，我们提出了一种统一的记忆增强视觉-语言代理，该代理在单个自回归框架内同时处理数据关联、对象描述和探索策略。该模型处理当前的RGB观察、自上而下的探索地图以及序列化为对象级标记的对象级情景记忆，确保在长时间序列中保持对象身份和语义一致性。为了以自监督方式训练该模型，我们使用基于分歧的策略和一个伪描述模型在逼真的3D环境中收集数据集，该伪描述模型强制执行多视角描述历史的一致性。在手动标注的对象级测试集上的广泛评估表明，与基线模型相比，标准描述分数提高了最多11.86%，描述自我相似性提高了7.39%，同时通过紧凑的场景表示实现可扩展性能。代码、模型权重和数据可在https://hsp-iit.github.io/epos-vlm/获取。

Summary / 总结

This paper addresses the inconsistency in object descriptions across different viewpoints in Vision-Language Models (VLMs), which hinders the construction of consistent semantic representations by embodied agents. The authors propose a unified memory-augmented Vision-Language agent that integrates data association, object captioning, and exploration policy within an autoregressive framework. The model uses an explored map and object-level episodic memory to maintain persistent object identity and semantic consistency. The model is trained in a self-supervised manner using a dataset collected in photorealistic 3D environments. Experimental results show improvements of up to 11.86% in standard captioning scores and 7.39% in caption self-similarity compared to baseline models, while maintaining scalable performance through a compact scene representation.

本文通过引入一个统一的记忆增强视觉-语言代理解决了视觉-语言模型在不同视角下对同一物体描述不一致的问题。该模型在一个自回归框架内整合了数据关联、物体描述和探索策略，使用物体级别的 episodic 记忆。模型通过在照片级真实 3D 环境中收集的数据进行自监督训练。实验结果表明，与基线模型相比，该模型在标准描述评分上提高了最多 11.86%，在描述一致性上提高了 7.39%，同时通过紧凑的场景表示保持了可扩展性。

FA-Seg: A Fast and Accurate Diffusion-Based Method for Open-Vocabulary Segmentation

Authors: Huy Che, Vinh-Tiep Nguyen

Venue: Neurocomputing 660 (2026) 131844

First: 2025-06-29T16:41:41+00:00 · Latest: 2026-03-30T08:35:44+00:00

Abs · PDF · Code1 · Code2 · Code3

Abstract

Open-vocabulary semantic segmentation (OVSS) aims to segment objects from arbitrary text categories without requiring densely annotated datasets. Although contrastive learning based models enable zero-shot segmentation, they often lose fine spatial precision at pixel level, due to global representation bias. In contrast, diffusion-based models naturally encode fine-grained spatial features via attention mechanisms that capture both global context and local details. However, they often face challenges in balancing the computation costs and the quality of the segmentation mask. In this work, we present FA-Seg, a Fast and Accurate training-free framework for open-vocabulary segmentation based on diffusion models. FA-Seg performs segmentation using only a (1+1)-step from a pretrained diffusion model. Moreover, instead of running multiple times for different classes, FA-Seg performs segmentation for all classes at once. To further enhance the segmentation quality, FA-Seg introduces three key components: (i) a dual-prompt mechanism for discriminative, class-aware attention extraction, (ii) a Hierarchical Attention Refinement Method (HARD) that enhances semantic precision via multi-resolution attention fusion, and (iii) a Test-Time Flipping (TTF) scheme designed to improve spatial consistency. Extensive experiments show that FA-Seg achieves state-of-the-art training-free performance, obtaining 43.8% average mIoU across PASCAL VOC, PASCAL Context, and COCO Object benchmarks while maintaining superior inference efficiency. Our results demonstrate that FA-Seg provides a strong foundation for extendability, bridging the gap between segmentation quality and inference efficiency. The source code is available at https://github.com/chequanghuy/FA-Seg.

中文标题/摘要

标题：FA-Seg：一种基于扩散模型的快速准确开放词汇分割方法

开放词汇语义分割（OVSS）旨在从任意文本类别中分割对象，无需密集标注数据集。尽管对比学习模型能够实现零样本分割，但它们往往在像素级别失去精细的空间精度，由于全局表示偏差。相比之下，基于扩散模型自然通过注意力机制编码细粒度的空间特征，捕捉全局上下文和局部细节。然而，它们往往面临在计算成本和分割掩码质量之间平衡的挑战。在本文中，我们提出了FA-Seg，一种基于扩散模型的无需训练的快速准确开放词汇分割框架。FA-Seg仅使用预训练扩散模型的（1+1）步进行分割。此外，FA-Seg一次对所有类别进行分割，而不是多次运行。为了进一步提高分割质量，FA-Seg引入了三个关键组件：（i）一种判别性、类别感知的提示机制，用于提取注意力，（ii）一种层次注意力精炼方法（HARD），通过多分辨率注意力融合增强语义精度，以及（iii）一种测试时翻转（TTF）方案，旨在提高空间一致性。广泛实验表明，FA-Seg在PASCAL VOC、PASCAL Context和COCO Object基准测试中实现了最先进的无需训练性能，平均mIoU为43.8%，同时保持了优越的推理效率。我们的结果表明，FA-Seg为可扩展性提供了坚实的基础，弥合了分割质量和推理效率之间的差距。源代码可在https://github.com/chequanghuy/FA-Seg/ 获取。

Summary / 总结

FA-Seg is a fast and accurate training-free framework for open-vocabulary segmentation using diffusion models. It performs segmentation using only a (1+1)-step pass from a pretrained diffusion model, handles all classes at once, and introduces a dual-prompt mechanism, a Hierarchical Attention Refinement Method (HARD), and Test-Time Flipping (TTF) to enhance segmentation quality. Experiments show that FA-Seg achieves state-of-the-art training-free performance with 43.8% average mIoU across PASCAL VOC, PASCAL Context, and COCO Object benchmarks while maintaining high inference efficiency.

FA-Seg 是一个基于扩散模型的快速且准确的无需训练框架，用于开放词汇分割。它仅使用预训练扩散模型的（1+1）步进行分割，并引入了双提示机制、层次注意力精炼方法和测试时翻转方案以提升分割质量。实验结果显示，FA-Seg 在 PASCAL VOC、PASCAL Context 和 COCO Object 基准上的平均 mIoU 达到 43.8%，同时保持了高效的推理效率。
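
Test-time flipping, the third FA-Seg component, is commonly implemented by running the predictor on the image and on its mirror, un-flipping the second output, and averaging. The sketch below follows that common recipe. Whether FA-Seg averages attention maps in exactly this way is an assumption, and `predict` is a hypothetical stand-in for the diffusion-based predictor.

```python
from typing import Callable, List

Grid = List[List[float]]


def hflip(grid: Grid) -> Grid:
    """Mirror a 2D map left to right."""
    return [row[::-1] for row in grid]


def predict_with_ttf(predict: Callable[[Grid], Grid], image: Grid) -> Grid:
    """Average the direct prediction with the un-flipped prediction of the mirror."""
    direct = predict(image)
    mirrored = hflip(predict(hflip(image)))
    return [
        [(a + b) / 2 for a, b in zip(row_direct, row_mirrored)]
        for row_direct, row_mirrored in zip(direct, mirrored)
    ]


def biased_predict(image: Grid) -> Grid:
    """Toy predictor with a left-to-right bias: adds the column index to each pixel."""
    return [[value + col for col, value in enumerate(row)] for row in image]


smoothed = predict_with_ttf(biased_predict, [[1.0, 2.0, 3.0]])
```

With this toy predictor the column-dependent bias (0, 1, 2) turns into a uniform offset of 1 after averaging, which shows how flipping can cancel a spatially asymmetric error.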

Multimodal Graph Network Modeling for Human-Object Interaction Detection with PDE Graph Diffusion

Authors: Wenxuan Ji, Haichao Shi, Xiao-Yu Zhang

First: 2025-09-16T01:17:49+00:00 · Latest: 2026-03-30T08:23:28+00:00

Abs · PDF · Code1 · Code2

Abstract

Existing GNN-based Human-Object Interaction (HOI) detection methods rely on simple MLPs to fuse instance features and propagate information. However, this mechanism is largely empirical and lacks a targeted information propagation process. To address this problem, we propose Multimodal Graph Network Modeling (MGNM) for HOI detection with Partial Differential Equation (PDE) graph diffusion. Specifically, we first design a multimodal graph network framework that explicitly models the HOI detection task within a four-stage graph structure. Next, we propose a novel PDE diffusion mechanism to facilitate information propagation within this graph. This mechanism leverages multimodal features to propagate information via a white-box PDE diffusion equation. Furthermore, we design a variational information squeezing (VIS) mechanism to further refine the multimodal features extracted from CLIP, thereby mitigating the impact of noise inherent in pretrained Vision-Language Models. Extensive experiments demonstrate that our MGNM achieves state-of-the-art performance on two widely used benchmarks: HICO-DET and V-COCO. Moreover, when integrated with a more advanced object detector, our method yields significant performance gains while maintaining an effective balance between rare and non-rare categories.

中文标题/摘要

标题：基于偏微分方程图扩散的人机物交互检测的多模态图网络建模

现有的基于GNN的人机物交互（HOI）检测方法依赖于简单的MLP来融合实例特征并传播信息。然而，这种机制主要是经验性的，并缺乏针对的信息传播过程。为了解决这一问题，我们提出了基于偏微分方程（PDE）图扩散的人机物交互检测的多模态图网络建模（MGNM）。具体来说，我们首先设计了一个多模态图网络框架，该框架在四阶段图结构中明确建模了HOI检测任务。接着，我们提出了一种新颖的PDE扩散机制，以促进该图中的信息传播。该机制利用多模态特征通过白盒PDE扩散方程传播信息。此外，我们设计了一种变分信息挤压（VIS）机制，以进一步细化从CLIP提取的多模态特征，从而减轻预训练视觉-语言模型中固有的噪声影响。广泛的实验表明，我们的MGNM在两个广泛使用的基准：HICO-DET和V-COCO上达到了最先进的性能。此外，当与更先进的对象检测器结合使用时，我们的方法在保持稀有类别和非稀有类别之间有效平衡的同时，还实现了显著的性能提升。

Summary / 总结

The paper proposes Multimodal Graph Network Modeling (MGNM) for Human-Object Interaction (HOI) detection using a PDE graph diffusion mechanism. It designs a four-stage graph structure to explicitly model HOI detection and introduces a novel PDE diffusion mechanism to propagate information. Additionally, a variational information squeezing (VIS) mechanism is used to refine multimodal features. Experiments show that MGNM outperforms existing methods on HICO-DET and V-COCO benchmarks and improves performance when integrated with advanced object detectors.

论文通过提出一种多模态图网络建模（MGNM）框架来解决现有基于图神经网络（GNN）的人-物交互（HOI）检测方法的局限性。该框架采用四阶段图结构和一种新颖的偏微分方程（PDE）扩散机制来促进目标导向的信息传播。此外，还引入了一种变分信息挤压（VIS）机制来细化从CLIP提取的多模态特征。实验表明，MGNM在HICO-DET和V-COCO基准上优于现有方法，并且在与高级物体检测器集成时提高了性能。
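
As background for the phrase "PDE graph diffusion", the textbook example is the graph heat equation dx/dt = -Lx, where L is the graph Laplacian. The sketch below integrates it with explicit Euler steps on scalar node features. It is the standard heat equation and not MGNM's diffusion equation, which the abstract does not spell out. The step size and the tiny path graph are illustrative choices.

```python
from typing import List, Tuple


def diffuse(
    features: List[float],
    edges: List[Tuple[int, int]],
    step: float = 0.1,
    num_steps: int = 10,
) -> List[float]:
    """Explicit Euler integration of dx/dt = -L x on an undirected graph."""
    x = list(features)
    for _ in range(num_steps):
        delta = [0.0] * len(x)
        for i, j in edges:
            flow = x[j] - x[i]  # information moves from the higher to the lower value
            delta[i] += flow
            delta[j] -= flow
        x = [value + step * change for value, change in zip(x, delta)]
    return x


# Path graph 0 - 1 - 2 with all of the signal starting on node 0.
result = diffuse([1.0, 0.0, 0.0], [(0, 1), (1, 2)])
```

Each step conserves the total signal and moves every node toward its neighbours, so features spread along edges in a way that can be inspected term by term, in contrast to an opaque MLP update.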

CoT2-Meta: Budgeted Metacognitive Control for Test-Time Reasoning

Authors: Siyuan Ma, Bo Gao, Zikai Xiao, Hailong Wang, Xinlei Yu, Rui Qian, Jiayu Qian, Luqi Gong, Yang Liu

First: 2026-03-30T07:59:47+00:00 · Latest: 2026-03-30T07:59:47+00:00

Abs · PDF · Code1 · Code2

Abstract

Recent test-time reasoning methods improve performance by generating more candidate chains or searching over larger reasoning trees, but they typically lack explicit control over when to expand, what to prune, how to repair, and when to abstain. We introduce CoT2-Meta, a training-free metacognitive reasoning framework that combines object-level chain-of-thought generation with meta-level control over partial reasoning trajectories. The framework integrates four components: strategy-conditioned thought generation, tree-structured search, an online process oracle for step-level reasoning evaluation, and a meta-controller that allocates computation through expansion, pruning, repair, stopping, and fallback decisions. Under matched inference budgets, CoT2-Meta consistently outperforms strong single-path, sampling-based, and search-based baselines, including ReST-MCTS. On the default backbone, it achieves 92.8 EM on MATH, 90.4 accuracy on GPQA, 98.65 EM on GSM8K, 75.8 accuracy on BBEH, 85.6 accuracy on MMMU-Pro, and 48.8 accuracy on HLE, with gains over the strongest non-CoT2-Meta baseline of +3.6, +5.2, +1.15, +2.0, +4.3, and +4.3 points, respectively. Beyond these core results, the framework remains effective across a broader 15-benchmark suite spanning knowledge and QA, multi-hop reasoning, coding, and out-of-distribution evaluation. Additional analyses show better compute scaling, improved calibration, stronger selective prediction, targeted repair behavior, and consistent gains across backbone families. These results suggest that explicit metacognitive control is a practical design principle for reliable and compute-efficient test-time reasoning systems.

中文标题/摘要

标题：CoT2-Meta: 预算内的元认知控制测试时推理

最近的测试时推理方法通过生成更多的候选链或搜索更大的推理树来提高性能，但它们通常缺乏对何时扩展、什么内容剪枝、如何修复以及何时弃权的显式控制。我们引入了CoT2-Meta，这是一种无需训练的元认知推理框架，结合了对象级的思维链生成和对部分推理轨迹的元级控制。该框架整合了四个组件：策略条件下的思维生成、树状结构搜索、在线过程oracle进行步骤级推理评估以及一个元控制器，该控制器通过扩展、剪枝、修复、停止和回退决策分配计算。在匹配的推理预算下，CoT2-Meta 一致地优于强大的单路径、采样基于和搜索基于的基线，包括ReST-MCTS。在默认骨干网络上，它在MATH上达到92.8的EM，在GPQA上达到90.4的准确率，在GSM8K上达到98.65的EM，在BBEH上达到75.8的准确率，在MMMU-Pro上达到85.6的准确率，在HLE上达到48.8的准确率，分别比最强的非CoT2-Meta基线高出+3.6、+5.2、+1.15、+2.0、+4.3和+4.3分。除了这些核心结果，该框架在更广泛的15个基准套件中仍然有效，这些基准套件涵盖了知识和问答、多跳推理、编程和离分布评估。额外的分析显示了更好的计算扩展性、改进的校准、更强的选择性预测、目标修复行为以及跨骨干家族的一致收益。这些结果表明，显式的元认知控制是可靠且计算高效的测试时推理系统的实用设计原则。

Summary / 总结

CoT2-Meta is a metacognitive reasoning framework that enhances test-time performance by integrating object-level chain-of-thought generation with meta-level control over reasoning trajectories. It includes components for strategy-conditioned thought generation, tree-structured search, step-level evaluation, and a meta-controller for computation allocation. CoT2-Meta outperforms various baselines on multiple benchmarks, achieving significant improvements in metrics such as EM and accuracy, and demonstrating robust performance across a wide range of tasks.

CoT2-Meta 是一种结合对象级链推理生成和元级控制推理轨迹的元认知推理框架。它包括策略条件下的思想生成、树结构搜索、在线过程 oracle 以及用于决策的元控制器。CoT2-Meta 在多个基准测试中超越了各种基线，实现了在 EM 和准确率等指标上的显著提升，并在不同后端模型中展示了有效的计算扩展和改进的校准。
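
The five meta-level decisions listed for CoT2-Meta (expansion, pruning, repair, stopping, fallback) can be pictured as a small policy over a partial reasoning trajectory. In the sketch below, the thresholds, the scalar `step_score` standing in for the process oracle's output, and the ordering of the rules are all hypothetical. The paper's controller is presumably more elaborate.

```python
from enum import Enum


class Action(Enum):
    EXPAND = "expand"
    PRUNE = "prune"
    REPAIR = "repair"
    STOP = "stop"
    FALLBACK = "fallback"


def decide(
    step_score: float,
    is_complete: bool,
    budget_left: int,
    prune_below: float = 0.2,
    repair_below: float = 0.5,
    accept_above: float = 0.8,
) -> Action:
    """Pick a meta-level action for one partial trajectory (thresholds are assumed)."""
    if budget_left <= 0:
        return Action.FALLBACK  # out of compute: return the best answer found so far
    if is_complete and step_score >= accept_above:
        return Action.STOP  # confident, finished chain
    if step_score < prune_below:
        return Action.PRUNE  # not worth further compute
    if step_score < repair_below:
        return Action.REPAIR  # salvageable: rewrite the weak step
    return Action.EXPAND  # promising but unfinished: keep searching


decisions = [
    decide(0.9, True, 10),
    decide(0.1, False, 10),
    decide(0.4, False, 10),
    decide(0.7, False, 10),
    decide(0.9, True, 0),
]
```

Checking the budget first is what makes the policy budgeted: once the allowance is spent, no score can trigger further expansion.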

RadImageNet-VQA: A Large-Scale CT and MRI Dataset for Radiologic Visual Question Answering

Authors: Léo Butsanets, Charles Corbière, Julien Khlaut, Pierre Manceron, Corentin Dancette

First: 2025-12-19T09:47:54+00:00 · Latest: 2026-03-30T07:58:13+00:00

Comments: Preprint, 33 pages, 15 figures, 11 tables

Abs · PDF · Code1 · Code2 · Code3

Abstract

In this work, we introduce RadImageNet-VQA, a large-scale dataset designed to advance radiologic visual question answering (VQA) on CT and MRI exams. Existing medical VQA datasets are limited in scale, dominated by X-ray imaging or biomedical illustrations, and often prone to text-based shortcuts. RadImageNet-VQA is built from expert-curated annotations and provides 750K images paired with 7.5M question-answer samples. It covers three key tasks - abnormality detection, anatomy recognition, and pathology identification - spanning eight anatomical regions and 97 pathology categories, and supports open-ended, closed-ended, and multiple-choice questions. Extensive experiments show that state-of-the-art vision-language models still struggle with fine-grained pathology identification, particularly in open-ended settings and even after fine-tuning. Text-only analysis further reveals that model performance collapses to near-random without image inputs, confirming that RadImageNet-VQA is free from linguistic shortcuts. The full dataset and benchmark are publicly available at https://huggingface.co/datasets/raidium/RadImageNet-VQA.

中文标题/摘要

标题：RadImageNet-VQA：用于CT和MRI影像放射学视觉问答的大规模数据集

在本工作中，我们介绍了RadImageNet-VQA，这是一个旨在推进CT和MRI影像放射学视觉问答（VQA）的大规模数据集。现有的医学VQA数据集在规模上有限，主要以X光成像或生物医学插图为主，并且经常容易出现基于文本的捷径。RadImageNet-VQA是从专家标注中构建的，提供了75万张图像配对750万组问题-答案样本。它涵盖了三个关键任务——异常检测、解剖结构识别和病理识别，涉及八个解剖区域和97个病理类别，并支持开放式、封闭式和多项选择问题。广泛的实验表明，最先进的视觉-语言模型在细粒度病理识别方面仍然存在困难，特别是在开放式设置中，即使经过微调也是如此。仅基于文本的分析进一步表明，在没有图像输入的情况下，模型性能会崩溃到近乎随机，这证实了RadImageNet-VQA没有语言捷径。完整的数据集和基准可以在https://huggingface.co/datasets/raidium/RadImageNet-VQA上公开获取。

Summary / 总结

The research introduces RadImageNet-VQA, a large-scale dataset for radiologic VQA on CT and MRI exams, addressing limitations of existing datasets. It includes 750K images with 7.5M question-answer pairs, covering three tasks and 97 pathology categories. Experiments show that state-of-the-art models struggle with fine-grained pathology identification, especially in open-ended questions, and text-only analysis confirms the necessity of image inputs. The dataset is publicly available for research purposes.

该研究介绍了RadImageNet-VQA，这是一个用于CT和MRI影像的放射学视觉问答的大规模数据集，解决了现有医学VQA数据集的局限性。它包含75万张图像和750万对问题-答案，涵盖了三个关键任务和八个解剖区域。实验表明，最先进的模型在细粒度病理识别方面存在困难，尤其是在开放式设置中，纯文本分析进一步证实了图像输入的必要性。该数据集已公开供研究使用。
