arXiv Paper Digest

Snapshot: 20260419_0358

ArrowGEV: Grounding Events in Video via Learning the Arrow of Time

Authors: Fangxu Yu, Ziyao Lu, Liqiang Niu, Fandong Meng, Jie Zhou

Venue: ACL 2026

First: 2026-01-10T13:05:23+00:00 · Latest: 2026-04-16T17:52:47+00:00

Comments: Accepted to Findings of ACL 2026

Abstract

Grounding events in videos serves as a fundamental capability in video analysis. While Vision Language Models (VLMs) are increasingly employed for this task, existing approaches predominantly train models to associate events with timestamps in the forward video only. This paradigm hinders VLMs from capturing the inherent temporal structure and directionality of events, thereby limiting robustness and generalization. To address this limitation, inspired by the arrow of time in physics, which characterizes the intrinsic directionality of temporal processes, we propose ArrowGEV, a reinforcement learning framework that explicitly models temporal directionality in events to improve both event grounding and temporal directionality understanding in VLMs. Specifically, we categorize events into time-sensitive (e.g., putting down a bag) and time-insensitive (e.g., holding a towel in the left hand). The former denote events whose reversal substantially alters their meaning, while the latter remain semantically unchanged under reversal. For time-sensitive events, ArrowGEV introduces a reward that encourages VLMs to discriminate between forward and backward videos, whereas for time-insensitive events, it enforces consistent grounding across both directions. Extensive experiments demonstrate that ArrowGEV not only improves grounding precision and temporal directionality recognition, but also enhances general video understanding and reasoning ability.
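
The two reward cases described in the abstract can be illustrated with a minimal sketch. The abstract does not give the reward formula, so everything below is an assumption for illustration: the function names, the use of temporal IoU as the grounding score, and the convention that `pred_bwd` is `None` when the model reports that the event does not occur in the reversed clip.

```python
def temporal_iou(a, b):
    """IoU of two (start, end) intervals, in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def mirror(span, duration):
    """Where a forward-video span lies on the timeline of the reversed video."""
    start, end = span
    return (duration - end, duration - start)


def direction_reward(pred_fwd, pred_bwd, gt, duration, time_sensitive):
    """Hypothetical reward: grounding accuracy plus a direction-aware bonus."""
    grounding = temporal_iou(pred_fwd, gt)
    if time_sensitive:
        # Reversal changes the event, so the model should not ground it
        # in the backward clip (assumed form of "discrimination").
        bonus = 1.0 if pred_bwd is None else 0.0
    else:
        # Reversal leaves the event unchanged, so the backward prediction
        # should match the mirrored ground-truth span.
        bonus = 0.0 if pred_bwd is None else temporal_iou(pred_bwd, mirror(gt, duration))
    return grounding + bonus
```

In this sketch a time-insensitive event such as "holding a towel" earns the full bonus only when the backward prediction covers the mirrored interval, while a time-sensitive event such as "putting down a bag" earns it only when the model declines to ground the event in the reversed clip.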

Summary

The paper proposes ArrowGEV, a reinforcement learning framework that addresses the limitation of existing Vision Language Models (VLMs) in capturing the temporal directionality of events in videos. By categorizing events into time-sensitive and time-insensitive types, ArrowGEV encourages VLMs to distinguish between forward and backward videos for time-sensitive events and ensures consistent grounding for time-insensitive events. The experiments show that ArrowGEV improves event grounding precision, temporal directionality recognition, and general video understanding and reasoning ability.

Why Do Vision Language Models Struggle To Recognize Human Emotions?

Authors: Madhav Agarwal, Sotirios A. Tsaftaris, Laura Sevilla-Lara, Steven McDonagh

First: 2026-04-16T17:49:58+00:00 · Latest: 2026-04-16T17:49:58+00:00

Abstract

Understanding emotions is a fundamental ability for intelligent systems to be able to interact with humans. Vision-language models (VLMs) have made tremendous progress in the last few years for many visual tasks, potentially offering a promising solution for understanding emotions. However, it is surprising that even the most sophisticated contemporary VLMs struggle to recognize human emotions or to outperform even specialized vision-only classifiers. In this paper we ask the question "Why do VLMs struggle to recognize human emotions?", and observe that the inherently continuous and dynamic task of facial expression recognition (DFER) exposes two critical VLM vulnerabilities. First, emotion datasets are naturally long-tailed, and the web-scale data used to pre-train VLMs exacerbates this head-class bias, causing them to systematically collapse rare, under-represented emotions into common categories. We propose alternative sampling strategies that prevent favoring common concepts. Second, temporal information is critical for understanding emotions. However, VLMs are unable to represent temporal information over dense frame sequences, as they are limited by context size and the number of tokens that can fit in memory, which poses a clear challenge for emotion recognition. We demonstrate that the sparse temporal sampling strategy used in VLMs is inherently misaligned with the fleeting nature of micro-expressions (0.25-0.5 seconds), which are often the most critical affective signal. As a diagnostic probe, we propose a multi-stage context enrichment strategy that utilizes the information from "in-between" frames by first converting them into natural language summaries. This enriched textual context is provided as input to the VLM alongside sparse keyframes, preventing attentional dilution from excessive visual data while preserving the emotional trajectory.
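
The mismatch between sparse sampling and micro-expressions can be made concrete with a small calculation. The sketch below assumes frames are sampled uniformly and that the micro-expression starts at a random moment relative to the sampling grid; under those assumptions, the chance that at least one sampled frame falls inside the event is the event duration times the sampling rate, capped at 1. The sampling rates used are hypothetical and not taken from the paper.

```python
def capture_probability(event_seconds, sampled_fps):
    """Probability that uniform sampling at `sampled_fps` frames per second
    places at least one frame inside an event of length `event_seconds`,
    assuming the event start is uniformly random relative to the grid."""
    return min(1.0, event_seconds * sampled_fps)


# Hypothetical settings: 1 frame per second, and 16 frames spread over 60 seconds.
for fps in (1.0, 16 / 60):
    for duration in (0.25, 0.5):
        p = capture_probability(duration, fps)
        print(f"{duration:.2f}s event at {fps:.2f} fps: p = {p:.3f}")
```

Under these assumed rates most micro-expressions in the 0.25-0.5 second range fall entirely between sampled frames, which is the situation the context enrichment strategy is meant to probe.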

Summary

This paper investigates why vision-language models struggle to recognize human emotions. It identifies two critical vulnerabilities: first, emotion datasets are long-tailed, leading VLMs to collapse rare emotions into common categories due to the head-class bias from web-scale data. Second, VLMs are limited in representing temporal information over dense frame sequences, which is crucial for understanding emotions. The authors propose alternative sampling strategies to counter the head-class bias and, as a diagnostic probe, a multi-stage context enrichment strategy that converts 'in-between' frames into natural language summaries, which are then provided to the VLM alongside sparse keyframes to preserve the emotional trajectory without diluting attention.

StreamCacheVGGT: Streaming Visual Geometry Transformers with Robust Scoring and Hybrid Cache Compression

Authors: Xuanyi Liu, Deyi Ji, Chunan Yu, Qi Zhu, Xuanfu Li, Jin Ma, Tianrun Chen, Lanyun Zhu

First: 2026-04-16T17:12:10+00:00 · Latest: 2026-04-16T17:12:10+00:00

Abstract

Reconstructing dense 3D geometry from continuous video streams requires stable inference under a constant memory budget. Existing $O(1)$ frameworks primarily rely on a "pure eviction" paradigm, which suffers from significant information destruction due to binary token deletion and evaluation noise from localized, single-layer scoring. To address these bottlenecks, we propose StreamCacheVGGT, a training-free framework that reimagines cache management through two synergistic modules: Cross-Layer Consistency-Enhanced Scoring (CLCES) and Hybrid Cache Compression (HCC). CLCES mitigates activation noise by tracking token importance trajectories across the Transformer hierarchy, employing order-statistical analysis to identify sustained geometric salience. Leveraging these robust scores, HCC transcends simple eviction by introducing a three-tier triage strategy that merges moderately important tokens into retained anchors via nearest-neighbor assignment on the key-vector manifold. This approach preserves essential geometric context that would otherwise be lost. Extensive evaluations on five benchmarks (7-Scenes, NRGBD, ETH3D, Bonn, and KITTI) demonstrate that StreamCacheVGGT sets a new state-of-the-art, delivering superior reconstruction accuracy and long-term stability while strictly adhering to constant-cost constraints.
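
The score-then-triage idea can be sketched in a few lines of plain Python. This is an illustrative toy, not the authors' implementation: the median is used as one possible order statistic for cross-layer scoring, merged values are simple averages, and the names `robust_scores` and `triage` as well as the tier sizes are assumptions.

```python
import math
import statistics


def robust_scores(per_layer_scores):
    """Median importance of each token across layers (assumed order statistic).
    `per_layer_scores[l][t]` is the score of token t at layer l."""
    return [statistics.median(col) for col in zip(*per_layer_scores)]


def triage(keys, values, scores, n_keep, n_merge):
    """Three tiers: keep the top `n_keep` tokens as anchors, merge the next
    `n_merge` tokens into their nearest anchor in key space, evict the rest."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    keep, merge = order[:n_keep], order[n_keep:n_keep + n_merge]
    groups = {a: [a] for a in keep}
    for t in merge:
        nearest = min(keep, key=lambda a: math.dist(keys[a], keys[t]))
        groups[nearest].append(t)
    merged_values = []
    for a in keep:
        members = groups[a]
        dim = len(values[a])
        # Anchor keeps its own key; its value becomes the group average (assumption).
        merged_values.append(
            [sum(values[m][d] for m in members) / len(members) for d in range(dim)]
        )
    return keep, merged_values
```

The cache size after triage is always `n_keep`, so memory stays constant, while tokens in the middle tier still contribute to the retained anchors instead of being deleted outright.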

RadAgent: A tool-using AI agent for stepwise interpretation of chest computed tomography

Authors: Mélanie Roschewitz, Kenneth Styppa, Yitian Tao, Jiwoong Sohn, Jean-Benoit Delbrouck, Benjamin Gundersen, Nicolas Deperrois, Christian Bluethgen, Julia Vogt, Bjoern Menze, Farhad Nooralahzadeh, Michael Krauthammer, Michael Moor

First: 2026-04-16T17:09:30+00:00 · Latest: 2026-04-16T17:09:30+00:00

Abstract

Vision-language models (VLMs) have markedly advanced AI-driven interpretation and reporting of complex medical imaging, such as computed tomography (CT). Yet, existing methods largely relegate clinicians to passive observers of final outputs, offering no interpretable reasoning trace for them to inspect, validate, or refine. To address this, we introduce RadAgent, a tool-using AI agent that generates CT reports through a stepwise and interpretable process. Each resulting report is accompanied by a fully inspectable trace of intermediate decisions and tool interactions, allowing clinicians to examine how the reported findings are derived. In our experiments, we observe that RadAgent improves Chest CT report generation over its 3D VLM counterpart, CT-Chat, across three dimensions. Clinical accuracy improves by 6.0 points (36.4% relative) in macro-F1 and 5.4 points (19.6% relative) in micro-F1. Robustness under adversarial conditions improves by 24.7 points (41.9% relative). Furthermore, RadAgent achieves 37.0% in faithfulness, a new capability entirely absent in its 3D VLM counterpart. By structuring the interpretation of chest CT as an explicit, tool-augmented and iterative reasoning trace, RadAgent brings us closer toward transparent and reliable AI for radiology.
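
Because the gains are stated both in absolute points and as relative percentages, the approximate baseline scores can be recovered by division. The helper below is a back-of-envelope sketch with a hypothetical function name; the values it prints are implied by the rounded figures in the abstract and are not numbers reported by the authors.

```python
def implied_scores(gain_points, relative_gain):
    """Return (baseline, improved) implied by an absolute gain in points
    and the same gain expressed as a fraction of the baseline."""
    baseline = gain_points / relative_gain
    return baseline, baseline + gain_points


reported = [
    ("macro-F1", 6.0, 0.364),
    ("micro-F1", 5.4, 0.196),
    ("adversarial robustness", 24.7, 0.419),
]
for name, gain, rel in reported:
    base, new = implied_scores(gain, rel)
    print(f"{name}: roughly {base:.1f} -> {new:.1f}")
```

The distinction matters when comparing methods: a 6.0-point gain is a large relative change only because the implied macro-F1 baseline is low.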

Summary

RadAgent is an AI agent that generates stepwise and interpretable CT reports, providing clinicians with a traceable reasoning process. Compared to CT-Chat, RadAgent improves clinical accuracy by 6.0 points in macro-F1 and 5.4 points in micro-F1, enhances robustness by 24.7 points under adversarial conditions, and achieves 37.0% faithfulness, a new capability not present in CT-Chat. This tool-augmented approach brings transparency and reliability to AI in radiology.

Adaptive Layer Selection for Layer-Wise Token Pruning in LLM Inference

Authors: Rei Taniguchi, Yuyang Dong, Makoto Onizuka, Chuan Xiao

Venue: ACL 2026

First: 2026-01-12T15:47:35+00:00 · Latest: 2026-04-16T16:46:40+00:00

Comments: ACL 2026 Findings. Source code available at https://github.com/TANIGUCHIREI/ASL

Abstract

Due to the prevalence of large language models (LLMs), key-value (KV) cache reduction for LLM inference has received remarkable attention. Among numerous works that have been proposed in recent years, layer-wise token pruning approaches, which select a subset of tokens at particular layers to retain in KV cache and prune others, are one of the most popular schemes. They primarily adopt a set of pre-defined layers, at which tokens are selected. Such design is inflexible in the sense that the accuracy significantly varies across tasks and deteriorates in harder tasks such as KV retrieval. In this paper, we propose ASL, a training-free method that adaptively chooses the selection layer for KV cache reduction, exploiting the variance of token ranks ordered by attention score. The proposed method balances the performance across different tasks while meeting the user-specified KV budget requirement. ASL operates during the prefilling stage and can be jointly used with existing KV cache reduction methods such as SnapKV to optimize the decoding stage. By evaluations on the InfiniteBench, RULER, and NIAH benchmarks, we show that ASL, equipped with one-shot token selection, adaptively trades inference speed for accuracy, outperforming state-of-the-art layer-wise token pruning methods in difficult tasks.

Summary

This paper addresses the issue of key-value (KV) cache reduction in large language model inference, focusing on layer-wise token pruning. The authors propose ASL, a training-free method that adaptively selects the layer for token selection based on the variance of token ranks ordered by attention score. ASL improves performance across different tasks while meeting the user-specified KV budget requirement and can be used with existing KV cache reduction methods. Experimental results on InfiniteBench, RULER, and NIAH benchmarks demonstrate that ASL outperforms state-of-the-art layer-wise token pruning methods in difficult tasks by adaptively trading inference speed for accuracy.

VisPCO: Visual Token Pruning Configuration Optimization via Budget-Aware Pareto-Frontier Learning for Vision-Language Models

Authors: Huawei Ji, Yuanhao Sun, Yuan Jin, Cheng Deng, Jiaxin Ding, Luoyi Fu, Xinbing Wang

First: 2026-04-16T16:21:05+00:00 · Latest: 2026-04-16T16:21:05+00:00

Abstract

Visual token pruning methods effectively mitigate the quadratic computational growth caused by processing high-resolution images and video frames in vision-language models (VLMs). However, existing approaches rely on predefined pruning configurations without determining whether they achieve computation-performance optimality. In this work, we introduce VisPCO, a novel framework that formulates visual token pruning as a Pareto configuration optimization problem to automatically identify optimal configurations. Our approach employs continuous relaxation and straight-through estimators to enable gradient-based search, solved via the Augmented Lagrangian method. Extensive experiments across 8 visual benchmarks demonstrate that VisPCO effectively approximates the empirical Pareto frontier obtained through grid search and generalizes well across various pruning methods and VLM architectures. Furthermore, through learnable kernel functions, we investigate layer-wise pruning patterns and reveal that multi-step progressive pruning captures VLMs' hierarchical compression structure, achieving superior accuracy-efficiency trade-offs compared to single-layer approaches.
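
The empirical Pareto frontier that the abstract uses as a reference can be computed from grid-search results with a short routine. This sketch covers only that reference frontier, not the gradient-based search itself; the function name and the (cost, accuracy) tuple layout are assumptions, and any numbers passed to it would be hypothetical.

```python
def pareto_frontier(configs):
    """Return the non-dominated (cost, accuracy) pairs, sorted by cost.
    A configuration is dominated if another one costs no more and is
    at least as accurate, with at least one strict improvement."""
    frontier = []
    best_accuracy = float("-inf")
    # Sort by cost ascending; among equal costs, try the most accurate first.
    for cost, accuracy in sorted(configs, key=lambda c: (c[0], -c[1])):
        if accuracy > best_accuracy:
            frontier.append((cost, accuracy))
            best_accuracy = accuracy
    return frontier
```

A learned configuration can then be judged by how close it lies to this frontier at the same compute budget.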

Summary

The research aims to optimize visual token pruning configurations in vision-language models to improve computational efficiency without compromising performance. The method formulates the pruning problem as a Pareto optimization task and uses continuous relaxation and gradient-based search to find optimal configurations. Experiments across eight visual benchmarks show that the approach effectively approximates the Pareto frontier and generalizes well across different pruning methods and model architectures. Additionally, the study reveals that multi-step progressive pruning outperforms single-layer pruning in terms of accuracy-efficiency trade-offs.

Agent-Aided Design for Dynamic CAD Models

Authors: Mitch Adler, Matthew Russo, Michael Cafarella

First: 2026-04-16T16:15:23+00:00 · Latest: 2026-04-16T16:15:23+00:00

Comments: 6 pages, 3 figures, to be published in CAIS'26

Abstract

In the past year, researchers have started to create agentic systems that can design real-world CAD-style objects in a training-free setting, a new variety of system that we call Agent-Aided Design. Generally speaking, these systems place an agent in a feedback loop in which it can write code, compile that code to an assembly of CAD model(s), visualize the model, and then iteratively refine its code based on visual and other feedback. Despite rapid progress, a key problem remains: none of these systems can build complex 3D assemblies with moving parts. For example, no existing system can build a piston, a pendulum, or even a pair of scissors. In order for Agent-Aided Design to make a real impact in industrial manufacturing, we need a system that is capable of generating such 3D assemblies. In this paper we present a prototype of AADvark, an agentic system designed for this task. Unlike previous state-of-the-art systems, AADvark captures the dynamic part interactions with one or more degrees-of-freedom. This design decision allows AADvark to reason directly about assemblies with moving parts and can thereby achieve cross-cutting goals, including but not limited to mechanical movements. Unfortunately, current LLMs are imperfect spatial reasoners, a problem that AADvark addresses by incorporating external constraint solver tools with a specialized visual feedback mechanism. We demonstrate that, by modifying the agent's tools (FreeCAD and the assembly solver), we are able to create a strong verification signal which enables our system to build 3D assemblies with movable parts.

One Token per Highly Selective Frame: Towards Extreme Compression for Long Video Understanding

Authors: Zheyu Zhang, Ziqi Pang, Shixing Chen, Xiang Hao, Vimal Bhat, Yu-Xiong Wang

Venue: NeurIPS 2025

First: 2026-04-15T17:59:52+00:00 · Latest: 2026-04-16T15:48:38+00:00

Comments: Appear in the proceedings of NeurIPS 2025

Abstract

Long video understanding is inherently challenging for vision-language models (VLMs) because of the extensive number of frames. With each video frame typically expanding into tens or hundreds of tokens, the limited context length of large language models (LLMs) forces the VLMs to perceive the frames sparsely and lose temporal information. To address this, we explore extreme video token compression towards one token per frame at the final LLM layer. Our key insight is that heuristic-based compression, widely adopted by previous methods, is prone to information loss, and this necessitates supervising LLM layers into learnable and progressive modules for token-level compression (LP-Comp). Such compression enables our VLM to digest 2x-4x more frames with improved performance. To further increase the token efficiency, we investigate frame-level compression, which selects the frames most relevant to the queries via the internal attention scores of the LLM layers, named question-conditioned compression (QC-Comp). As a notable distinction from previous studies, we mitigate the position bias of LLM attention in long contexts, i.e., the over-concentration on the beginning and end of a sequence, by splitting long videos into short segments and employing local attention. Collectively, our combined token-level and frame-level compression leads to an extreme compression model for long video understanding, named XComp, achieving a significantly larger compression ratio and enabling denser frame sampling. Our XComp is finetuned from VideoChat-Flash with a data-efficient supervised compression tuning stage that only requires 2.5% of the supervised fine-tuning data, yet boosts the accuracy from 42.9% to 46.2% on LVBench and enhances multiple other long video benchmarks.
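
The segment-wise frame selection can be pictured with a toy routine. The abstract does not specify how frames are chosen within a segment, so the per-segment top-k rule, the function name, and the assumption that each frame already has a single question-relevance score are all illustrative choices rather than the paper's method.

```python
def select_frames(scores, segment_len, per_segment):
    """Split frame indices into consecutive segments and keep the
    `per_segment` highest-scoring frames of each segment, in time order.
    `scores[i]` is a hypothetical question-relevance score for frame i."""
    picked = []
    for start in range(0, len(scores), segment_len):
        segment = range(start, min(start + segment_len, len(scores)))
        top = sorted(segment, key=lambda i: scores[i], reverse=True)[:per_segment]
        picked.extend(sorted(top))
    return picked
```

Selecting within each segment, rather than over the whole sequence, means that frames in the middle of a long video compete only with their neighbors, which is one way to keep a bias toward the start and end of the sequence from dominating the choice.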

IROSA: Interactive Robot Skill Adaptation using Natural Language

Authors: Markus Knauer, Samuel Bustamante, Thomas Eiband, Alin Albu-Schäffer, Freek Stulp, João Silvério

Venue: IEEE Robotics and Automation Letters (RA-L), 2026

First: 2026-03-04T09:54:09+00:00 · Latest: 2026-04-16T15:37:03+00:00

Comments: Accepted IEEE Robotics and Automation Letters (RA-L) journal, 8 pages, 5 figures, 3 tables, 1 listing. Code available: https://github.com/DLR-RM/IROSA

Abstract

Foundation models have demonstrated impressive capabilities across diverse domains, while imitation learning provides principled methods for robot skill adaptation from limited data. Combining these approaches holds significant promise for direct application to robotics, yet this combination has received limited attention, particularly for industrial deployment. We present a novel framework that enables open-vocabulary skill adaptation through a tool-based architecture, maintaining a protective abstraction layer between the language model and robot hardware. Our approach leverages pre-trained LLMs to select and parameterize specific tools for adapting robot skills without requiring fine-tuning or direct model-to-robot interaction. We demonstrate the framework on a 7-DoF torque-controlled robot performing an industrial bearing ring insertion task, showing successful skill adaptation through natural language commands for speed adjustment, trajectory correction, and obstacle avoidance while maintaining safety, transparency, and interpretability.

OpenMobile: Building Open Mobile Agents with Task and Trajectory Synthesis

Authors: Kanzhi Cheng, Zehao Li, Zheng Ma, Nuo Chen, Jialin Cao, Qiushi Sun, Zichen Ding, Fangzhi Xu, Hang Yan, Jiajun Chen, Anh Tuan Luu, Jianbing Zhang, Lewei Lu, Dahua Lin

First: 2026-04-16T14:53:08+00:00 · Latest: 2026-04-16T14:53:08+00:00

Comments: Work in progress

Abstract

Mobile agents powered by vision-language models have demonstrated impressive capabilities in automating mobile tasks, with recent leading models achieving a marked performance leap, e.g., nearly 70% success on AndroidWorld. However, these systems keep their training data closed and remain opaque about their task and trajectory synthesis recipes. We present OpenMobile, an open-source framework that synthesizes high-quality task instructions and agent trajectories, with two key components: (1) a scalable task synthesis pipeline that constructs a global environment memory from exploration, then leverages it to generate diverse and grounded instructions, and (2) a policy-switching strategy for trajectory rollout that alternates between learner and expert models to capture essential error-recovery data often missing in standard imitation learning. Agents trained on our data achieve competitive results across three dynamic mobile agent benchmarks: notably, our fine-tuned Qwen2.5-VL and Qwen3-VL reach 51.7% and 64.7% on AndroidWorld, far surpassing existing open-data approaches. Furthermore, we conduct transparent analyses on the overlap between our synthetic instructions and benchmark test sets, and verify that performance gains stem from broad functionality coverage rather than benchmark overfitting. We release data and code at https://njucckevin.github.io/openmobile/ to bridge the data gap and facilitate broader mobile agent research.
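
A policy-switching rollout can be sketched as a loop that occasionally hands control from one policy to the other. The abstract does not say when or how the switch happens, so the random switching rule, the function signature, and the toy environment interface below are assumptions made for illustration only.

```python
import random


def rollout(learner, expert, env_step, state, max_steps, switch_prob, rng):
    """Collect (state, action, actor) steps while randomly alternating control.
    `learner` and `expert` map a state to an action; `env_step` maps
    (state, action) to (next_state, done). All three are hypothetical callables."""
    trajectory, actor = [], "learner"
    for _ in range(max_steps):
        if rng.random() < switch_prob:
            actor = "expert" if actor == "learner" else "learner"
        policy = learner if actor == "learner" else expert
        action = policy(state)
        trajectory.append((state, action, actor))
        state, done = env_step(state, action)
        if done:
            break
    return trajectory


# Toy usage: the learner always errs, the expert always makes progress.
steps = rollout(
    learner=lambda s: "wrong",
    expert=lambda s: "right",
    env_step=lambda s, a: (s + 1, s + 1 >= 2) if a == "right" else (s, False),
    state=0, max_steps=10, switch_prob=0.5, rng=random.Random(0),
)
```

Whenever the expert takes over after a learner mistake, the resulting steps show how to recover from a state the expert alone would rarely visit, which is the kind of data the abstract says standard imitation learning tends to miss.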

Beyond Visual Cues: Semantic-Driven Token Filtering and Expert Routing for Anytime Person ReID

Authors: Jiaxuan Li, Xin Wen, Zhihang Li

First: 2026-04-16T14:49:30+00:00 · Latest: 2026-04-16T14:49:30+00:00

Abstract

Any-Time Person Re-identification (AT-ReID) necessitates the robust retrieval of target individuals under arbitrary conditions, encompassing both modality shifts (daytime and nighttime) and extensive clothing-change scenarios, ranging from short-term to long-term intervals. However, existing methods rely heavily on purely visual features, which are prone to change with environmental and temporal factors, resulting in significant performance deterioration in scenarios involving illumination-induced modality shifts or clothing changes. In this paper, we propose Semantic-driven Token Filtering and Expert Routing (STFER), a novel framework that leverages the ability of Large Vision-Language Models (LVLMs) to generate identity-consistent text, which provides identity-discriminative features that are robust to both clothing variations and cross-modality shifts between RGB and IR. Specifically, we employ instructions to guide the LVLM in generating identity-intrinsic semantic text that captures biometric constants and drives the semantic model. The text token is further used for Semantic-driven Visual Token Filtering (SVTF), which enhances informative visual regions and suppresses redundant background noise. Meanwhile, the text token is also used for Semantic-driven Expert Routing (SER), which integrates the semantic text into expert routing, resulting in more robust multi-scenario gating. Extensive experiments on the Any-Time ReID dataset (AT-USTC) demonstrate that our model achieves state-of-the-art results. Moreover, the model trained on AT-USTC was evaluated across 5 widely used ReID benchmarks, demonstrating superior generalization capabilities with highly competitive results. Our code will be available soon.

Summary

The research addresses the challenge of robust person re-identification (ReID) under varying conditions, proposing STFER, a framework that uses Large Vision-Language Models to generate identity-consistent text. This text is used for semantic-driven visual token filtering and expert routing, enhancing the model's robustness to clothing changes and cross-modality shifts. Experiments show that STFER outperforms existing methods on the AT-USTC dataset and demonstrates strong generalization across multiple ReID benchmarks.

DocVAL: Validated Chain-of-Thought Distillation for Grounded Document VQA

Authors: Ahmad Mohammadshirazi, Pinaki Prasad Guha Neogi, Ser-Nam Lim, Rajiv Ramnath

First: 2025-11-27T15:00:58+00:00 · Latest: 2026-04-16T14:40:49+00:00

Abstract

Document visual question answering requires models not only to answer questions correctly, but also to precisely localize answers within complex document layouts. While large vision-language models (VLMs) achieve strong spatial grounding, their inference cost and latency limit real-world deployment. Compact VLMs are more efficient, but they often suffer substantial localization degradation under standard fine-tuning or distillation. To address this gap, we propose DocVAL, a validated chain-of-thought (CoT) distillation framework that transfers explicit spatial reasoning from large teacher models to compact, deployable student VLMs. DocVAL combines (1) teacher-generated spatial CoT supervision, (2) a rule-based dual-mode validator that filters low-quality training signals and provides fine-grained, pixel-level corrective feedback, and (3) a validation-driven two-stage training procedure with iterative refinement. Text detection is used only as training-time scaffolding for supervision and validation, enabling the final student to operate as a pure VLM without OCR or detection at inference. Across multiple document understanding benchmarks, DocVAL yields consistent improvements of up to 6-7 ANLS points over comparable compact VLMs. We further introduce mean Average Precision (mAP) as a localization metric for document question answering and report strong spatial grounding performance under this new evaluation. We release 95K validator-verified CoT traces and show that high-quality, validated supervision is more effective than scaling unfiltered data, enabling efficient and trustworthy document grounding. Dataset and implementation: https://github.com/ahmad-shirazi/DocVAL
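
For readers unfamiliar with the ANLS metric quoted above, the following is a minimal sketch of Average Normalized Levenshtein Similarity as it is commonly defined for document VQA, with the usual threshold of 0.5. It is not taken from the DocVAL code base; the lowercasing and whitespace stripping are a common convention assumed here, and scores are on a 0-1 scale, whereas the "points" in the abstract presumably refer to the same quantity multiplied by 100.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def anls(predictions, references, threshold=0.5):
    """Mean over questions of the best similarity to any reference answer.
    Similarity is 1 - normalized edit distance, or 0 once the normalized
    distance reaches the threshold."""
    total = 0.0
    for pred, golds in zip(predictions, references):
        best = 0.0
        for gold in golds:
            p, g = pred.strip().lower(), gold.strip().lower()
            longest = max(len(p), len(g))
            nl = levenshtein(p, g) / longest if longest else 0.0
            best = max(best, 1.0 - nl if nl < threshold else 0.0)
        total += best
    return total / len(predictions)
```

The threshold makes the metric tolerant of small OCR-style slips while giving no credit to answers that are mostly wrong.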

UniDoc-RL: Coarse-to-Fine Visual RAG with Hierarchical Actions and Dense Rewards

Authors: Jun Wang, Shuo Tan, Zelong Sun, Tiancheng Gu, Yongle Zhao, Ziyong Feng, Kaicheng Yang, Cewu Lu

First: 2026-04-16T13:03:32+00:00 · Latest: 2026-04-16T13:03:32+00:00

Comments: 17 pages, 11 figures

Abstract

Retrieval-Augmented Generation (RAG) extends Large Vision-Language Models (LVLMs) with external visual knowledge. However, existing visual RAG systems typically rely on generic retrieval signals that overlook the fine-grained visual semantics essential for complex reasoning. To address this limitation, we propose UniDoc-RL, a unified reinforcement learning framework in which an LVLM agent jointly performs retrieval, reranking, active visual perception, and reasoning. UniDoc-RL formulates visual information acquisition as a sequential decision-making problem with a hierarchical action space. Specifically, it progressively refines visual evidence from coarse-grained document retrieval to fine-grained image selection and active region cropping, allowing the model to suppress irrelevant content and attend to information-dense regions. For effective end-to-end training, we introduce a dense multi-reward scheme that provides task-aware supervision for each action. Based on Group Relative Policy Optimization (GRPO), UniDoc-RL aligns agent behavior with multiple objectives without relying on a separate value network. To support this training paradigm, we curate a comprehensive dataset of high-quality reasoning trajectories with fine-grained action annotations. Experiments on three benchmarks demonstrate that UniDoc-RL consistently surpasses state-of-the-art baselines, yielding up to 17.7% gains over prior RL-based methods.
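
The reason GRPO needs no separate value network is that each sampled response is scored relative to the other responses drawn for the same prompt. A minimal sketch of that group-relative advantage follows; the function name is hypothetical, the population standard deviation and the small `eps` are implementation choices assumed here, and how UniDoc-RL combines its multiple rewards into one scalar per trajectory is not specified in the abstract.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the mean and standard deviation of
    its own sampling group, which serves as the baseline in GRPO."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical scalar rewards for four trajectories sampled from one query.
advantages = group_relative_advantages([1.0, 0.0, 0.5, 0.5])
```

Trajectories that beat their group average receive positive advantages and the rest negative ones; if every trajectory in a group earns the same reward, all advantages are zero and that group contributes no gradient signal.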

Dual-Axis Generative Reward Model Toward Semantic and Turn-taking Robustness in Interactive Spoken Dialogue Models

Authors: Yifu Chen, Shengpeng Ji, Zhengqing Liu, Qian Chen, Wen Wang, Ziqing Wang, Yangzhuo Li, Tianle Liang, Zhou Zhao

First: 2026-04-16T12:03:50+00:00 · Latest: 2026-04-16T12:03:50+00:00

Abstract

Achieving seamless, human-like interaction remains a key challenge for full-duplex spoken dialogue models (SDMs). Reinforcement learning (RL) has substantially enhanced text- and vision-language models, and well-designed reward signals are crucial to its performance. We consider RL a promising strategy for addressing this challenge in SDMs. However, a fundamental barrier persists: prevailing automated metrics for assessing interaction quality rely on superficial proxies, such as behavioral statistics or timing-prediction accuracy, and fail to provide reliable reward signals for RL. Human evaluations, despite their richness, remain costly, inconsistent, and difficult to scale. We tackle this barrier by proposing a Dual-Axis Generative Reward Model. Trained on a detailed taxonomy and an annotated dataset to understand complex interaction dynamics, it produces a single overall score and, crucially, separate evaluations of semantic quality and interaction timing. These dual outputs furnish precise diagnostic feedback for SDMs and deliver a dependable, instructive reward signal suitable for online reinforcement learning. Our model achieves state-of-the-art performance on interaction-quality assessment across a wide spectrum of datasets, spanning synthetic dialogues and complex real-world interactions.

Summary / 总结

The research aims to improve full-duplex spoken dialogue models by addressing the challenge of achieving human-like interaction. It proposes a Dual-Axis Generative Reward Model trained to evaluate both semantic quality and interaction timing, providing a reliable reward signal for reinforcement learning. The model outperforms existing methods in assessing interaction quality across various datasets.

ADAPT: Benchmarking Commonsense Planning under Unspecified Affordance Constraints

Authors: Pei-An Chen, Yong-Ching Liang, Jia-Fong Yeh, Hung-Ting Su, Yi-Ting Chen, Min Sun, Winston Hsu

First: 2026-04-16T11:46:30+00:00 · Latest: 2026-04-16T11:46:30+00:00

Abstract

Intelligent embodied agents should not simply follow instructions, as real-world environments often involve unexpected conditions and exceptions. However, existing methods usually focus on directly executing instructions, without considering whether the target objects can actually be manipulated, meaning they fail to assess available affordances. To address this limitation, we introduce DynAfford, a benchmark that evaluates embodied agents in dynamic environments where object affordances may change over time and are not specified in the instruction. DynAfford requires agents to perceive object states, infer implicit preconditions, and adapt their actions accordingly. To enable this capability, we introduce ADAPT, a plug-and-play module that augments existing planners with explicit affordance reasoning. Experiments demonstrate that incorporating ADAPT significantly improves robustness and task success across both seen and unseen environments. We also show that a domain-adapted, LoRA-finetuned vision-language model used as the affordance inference backend outperforms a commercial LLM (GPT-4o), highlighting the importance of task-aligned affordance grounding.

Summary / 总结

The research aims to improve the adaptability of intelligent embodied agents in dynamic environments by addressing the limitation of existing methods that do not consider object affordances. The study introduces DynAfford, a benchmark for evaluating agents in dynamic settings where object affordances can change. ADAPT, a module that enhances existing planners with explicit affordance reasoning, is proposed. Experiments show that ADAPT improves robustness and task success in both seen and unseen environments. Additionally, a domain-adapted, LoRA-finetuned vision-language model outperforms GPT-4o in affordance inference, emphasizing the importance of task-aligned grounding.

Reasoning Dynamics and the Limits of Monitoring Modality Reliance in Vision-Language Models

Authors: Danae Sánchez Villegas, Samuel Lewis-Lim, Nikolaos Aletras, Desmond Elliott

First: 2026-04-16T11:28:53+00:00 · Latest: 2026-04-16T11:28:53+00:00

Abstract

Recent advances in vision language models (VLMs) offer reasoning capabilities, yet how these unfold and integrate visual and textual information remains unclear. We analyze reasoning dynamics in 18 VLMs covering instruction-tuned and reasoning-trained models from two different model families. We track confidence over Chain-of-Thought (CoT), measure the corrective effect of reasoning, and evaluate the contribution of intermediate reasoning steps. We find that models are prone to answer inertia, in which early commitments to a prediction are reinforced, rather than revised during reasoning steps. While reasoning-trained models show stronger corrective behavior, their gains depend on modality conditions, from text-dominant to vision-only settings. Using controlled interventions with misleading textual cues, we show that models are consistently influenced by these cues even when visual evidence is sufficient, and assess whether this influence is recoverable from CoT. Although this influence can appear in the CoT, its detectability varies across models and depends on what is being monitored. Reasoning-trained models are more likely to explicitly refer to the cues, but their longer and fluent CoTs can still appear visually grounded while actually following textual cues, obscuring modality reliance. In contrast, instruction-tuned models refer to the cues less explicitly, but their shorter traces reveal inconsistencies with the visual input. Taken together, these findings indicate that CoT provides only a partial view of how different modalities drive VLM decisions, with important implications for the transparency and safety of multimodal systems.

RACER: Retrieval-Augmented Contextual Rapid Speculative Decoding

Authors: Zihong Zhang, Zuchao Li, Lefei Zhang, Ping Wang, Hai Zhao

Venue: ACL 2026

First: 2026-04-16T11:23:55+00:00 · Latest: 2026-04-16T11:23:55+00:00

Comments: Accepted to Findings of ACL 2026

Abstract

Autoregressive decoding in Large Language Models (LLMs) generates one token per step, causing high inference latency. Speculative decoding (SD) mitigates this through a guess-and-verify strategy, but existing training-free variants face trade-offs: retrieval-based drafts break when no exact match exists, while logits-based drafts lack structural guidance. We propose RACER (Retrieval-Augmented Contextual Rapid Speculative Decoding), a lightweight and training-free method that integrates retrieved exact patterns with logit-driven future cues. This unification supplies both reliable anchors and flexible extrapolation, yielding richer speculative drafts. Experiments on Spec-Bench, HumanEval, and MGSM-ZH demonstrate that RACER consistently accelerates inference, achieving more than 2× speedup over autoregressive decoding, and outperforms prior training-free methods, offering a scalable, plug-and-play solution for efficient LLM decoding. Our source code is available at https://github.com/hkr04/RACER.

Summary / 总结

RACER is a lightweight and training-free method that combines retrieved exact patterns with logit-driven future cues to improve speculative decoding in Large Language Models (LLMs). This approach accelerates inference by more than 2 times compared to autoregressive decoding and outperforms previous training-free methods on Spec-Bench, HumanEval, and MGSM-ZH benchmarks, providing a scalable solution for efficient LLM decoding.
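The guess-and-verify strategy behind speculative decoding can be sketched in a few lines. This is a generic greedy-verification sketch, not RACER's drafting method: how the draft is built (retrieved patterns plus logit cues) is left out, `target_next` is a hypothetical stand-in for the target model's greedy next-token choice, and a real system would score all draft positions in one forward pass instead of calling the model once per token.

```python
from typing import Callable, List


def verify_draft(
    prefix: List[int],
    draft: List[int],
    target_next: Callable[[List[int]], int],
) -> List[int]:
    """Return the tokens to append after checking a draft against the target.

    The longest draft prefix that agrees with the target is kept, followed by
    one token from the target itself: a correction at the first mismatch, or
    a bonus token when the whole draft is accepted.
    """
    accepted: List[int] = []
    context = list(prefix)
    for tok in draft:
        expected = target_next(context)
        if tok != expected:
            accepted.append(expected)  # replace the first wrong guess
            return accepted
        accepted.append(tok)
        context.append(tok)
    accepted.append(target_next(context))  # every guess matched
    return accepted


def toy_target(context: List[int]) -> int:
    """Toy greedy model: the next token is the last token plus one, mod 10."""
    return (context[-1] + 1) % 10


new_tokens = verify_draft([1, 2], [3, 4, 9], toy_target)
```

Every verification step yields at least one token, and the result is identical to what greedy autoregressive decoding of the target would produce, so any speedup comes purely from how many draft tokens are accepted per step.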

MetaDent: Labeling Clinical Images for Vision-Language Models in Dentistry

Authors: Meng-Xun Li, Wen-Hui Deng, Zhi-Xing Wu, Chun-Xiao Jin, Jia-Min Wu, Yue Han, James Kit Hon Tsoi, Gui-Song Xia, Cui Huang

Venue: Journal of Dental Research, p.00220345261424242 (2026)

First: 2026-04-16T10:56:54+00:00 · Latest: 2026-04-16T10:56:54+00:00

Comments: Project website: https://menxli.github.io/metadent

Abstract

Vision-Language Models (VLMs) have demonstrated significant potential in medical image analysis, yet their application in intraoral photography remains largely underexplored due to the lack of fine-grained, annotated datasets and comprehensive benchmarks. To address this, we present MetaDent, a comprehensive resource that includes (1) a novel and large-scale dentistry image dataset collected from clinical, public, and web sources; (2) a semi-structured annotation framework designed to capture the hierarchical and clinically nuanced nature of dental photography; and (3) comprehensive benchmark suites for evaluating state-of-the-art VLMs on clinical image understanding. Our labeling approach combines a high-level image summary with point-by-point, free-text descriptions of abnormalities. This method enables rich, scalable, and task-agnostic representations. We curated 60,669 dental images from diverse sources and annotated a representative subset of 2,588 images using this meta-labeling scheme. Leveraging Large Language Models (LLMs), we derive standardized benchmarks: approximately 15K Visual Question Answering (VQA) pairs and an 18-class multi-label classification dataset, which we validated with human review and error analysis to justify that the LLM-driven transition reliably preserves fidelity and semantic accuracy. We then evaluate state-of-the-art VLMs across VQA, classification, and image captioning tasks. Quantitative results reveal that even the most advanced models struggle with a fine-grained understanding of intraoral scenes, achieving moderate accuracy and producing inconsistent or incomplete descriptions in image captioning. We publicly release our dataset, annotations, and tools to foster reproducible research and accelerate the development of vision-language systems for dental applications.

Summary / 总结

The research aims to address the lack of fine-grained, annotated datasets for vision-language models in dentistry. MetaDent is introduced, which includes a large-scale dentistry image dataset, a semi-structured annotation framework, and comprehensive benchmark suites. The dataset comprises 60,669 dental images, with 2,588 images annotated using a meta-labeling scheme combining high-level summaries and detailed descriptions. The benchmarks include 15K Visual Question Answering pairs and an 18-class multi-label classification dataset. Experimental results show that state-of-the-art models perform moderately in fine-grained understanding and produce inconsistent captions. The dataset and tools are publicly released to promote reproducible research and advance dental applications of vision-language systems.

Revisiting Compositionality in Dual-Encoder Vision-Language Models: The Role of Inference

Authors: Imanol Miranda, Ander Salaberria, Eneko Agirre, Gorka Azkune

First: 2026-04-13T14:03:18+00:00 · Latest: 2026-04-16T10:51:43+00:00

Abstract

Dual-encoder Vision-Language Models (VLMs) such as CLIP are often characterized as bag-of-words systems due to their poor performance on compositional benchmarks. We argue that this limitation may stem less from deficient representations than from the standard inference protocol based on global cosine similarity. First, through controlled diagnostic experiments, we show that explicitly enforcing fine-grained region-segment alignment at inference dramatically improves compositional performance without updating pretrained encoders. We then introduce a lightweight transformer that learns such alignments directly from frozen patch and token embeddings. Comparing against full fine-tuning and prior end-to-end compositional training methods, we find that although these approaches improve in-domain retrieval, their gains do not consistently transfer under distribution shift. In contrast, learning localized alignment over frozen representations matches full fine-tuning on in-domain retrieval while yielding substantial improvements on controlled out-of-domain compositional benchmarks. These results identify global embedding matching as a key bottleneck in dual-encoder VLMs and highlight the importance of alignment mechanisms for robust compositional generalization.
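The contrast between global embedding matching and localized alignment can be made concrete with a toy scoring rule. The sketch below is an illustration only: mean pooling stands in for the encoder's global embedding, and the max-over-patches rule is a generic late-interaction score, assumed here in place of the paper's region-segment alignment and its learned transformer.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def mean_pool(vectors):
    """Average a list of vectors into one global vector."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def global_score(patches, tokens):
    """One cosine similarity between pooled image and pooled text vectors."""
    return cosine(mean_pool(patches), mean_pool(tokens))


def aligned_score(patches, tokens):
    """Match each text token to its best image patch, then average."""
    return sum(max(cosine(t, p) for p in patches) for t in tokens) / len(tokens)


# Toy embeddings: two distinct image patches, text that refers to only one.
patches = [[1.0, 0.0], [0.0, 1.0]]
tokens = [[1.0, 0.0], [1.0, 0.0]]
scores = (global_score(patches, tokens), aligned_score(patches, tokens))
```

In this toy case pooling dilutes the match with the unrelated patch, while the aligned score credits each token with the patch it actually describes. Both rules read the same frozen embeddings; only the inference step differs.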

Summary / 总结

The study revisits compositionality in dual-encoder VLMs such as CLIP, arguing that poor performance on compositional benchmarks may stem less from inadequate representations than from the standard inference protocol based on global cosine similarity. Diagnostic experiments show that enforcing fine-grained region-segment alignment at inference dramatically improves compositional performance without updating the pretrained encoders, and a lightweight transformer is introduced to learn such alignments from frozen patch and token embeddings. Full fine-tuning and prior end-to-end compositional training methods improve in-domain retrieval, but their gains do not consistently transfer under distribution shift. In contrast, learning localized alignment over frozen representations matches full fine-tuning on in-domain retrieval while yielding substantial improvements on controlled out-of-domain compositional benchmarks, which identifies global embedding matching as a key bottleneck and highlights the importance of alignment mechanisms for robust compositional generalization.

Zero-Shot Retail Theft Detection via Orchestrated Vision Models: A Model-Agnostic, Cost-Effective Alternative to Trained Single-Model Systems

Authors: Haileab Yagersew

First: 2026-04-16T10:32:20+00:00 · Latest: 2026-04-16T10:32:20+00:00

Comments: 16 pages, 3 figures, Code to be released at https://github.com/xHaileab/Paza-AI

Abstract

Retail theft costs the global economy over $100 billion annually, yet existing AI-based detection systems require expensive custom model training on proprietary datasets and charge $200-500/month per store. We present Paza, a zero-shot retail theft detection framework that achieves practical concealment detection without training any model. Our approach orchestrates multiple existing models in a layered pipeline: cheap object detection and pose estimation run continuously, and an expensive vision-language model (VLM) is invoked only when behavioral pre-filters trigger. A multi-signal suspicion pre-filter (requiring dwell time plus at least one behavioral signal) reduces VLM invocations by 240x compared to per-frame analysis, bounding calls to at most 10 per minute and enabling a single GPU to serve 10-20 stores. The architecture is model-agnostic: the VLM component accepts any OpenAI-compatible endpoint, enabling operators to swap between models such as Gemma 4, Qwen3.5-Omni, GPT-4o, or future releases without code changes, so that the system improves as the VLM landscape evolves. We evaluate the VLM component on the DCSASS synthesized shoplifting dataset (169 clips, controlled environment), achieving 89.5% precision and 92.8% specificity at 59.3% recall in the zero-shot setting. We attribute the recall gap to sparse frame sampling in offline evaluation rather than to VLM reasoning failures, and we treat precision and specificity as the operationally critical metrics because they determine false-alarm rates. We present a detailed cost model showing viability at $50-100/month per store (3-10x cheaper than commercial alternatives), and introduce a privacy-preserving design that obfuscates faces in the detection pipeline. The source code is available at https://github.com/xHaileab/Paza-AI.
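The gating logic described above (dwell time plus at least one behavioral signal, with VLM calls capped per minute) might look like the following sketch. The class name, method name, and the dwell threshold are hypothetical; only the two-part trigger condition and the cap of 10 calls per minute come from the abstract, and the released code may structure this differently.

```python
from collections import deque


class SuspicionGate:
    """Decide whether a tracked person warrants an expensive VLM call."""

    def __init__(self, min_dwell_s=8.0, max_calls=10, window_s=60.0):
        self.min_dwell_s = min_dwell_s  # hypothetical dwell threshold
        self.max_calls = max_calls      # cap taken from the abstract
        self.window_s = window_s
        self._calls = deque()           # timestamps of recent VLM calls

    def should_invoke(self, now_s, dwell_s, signals):
        """Return True when both pre-filter conditions and the budget allow a call.

        signals: iterable of booleans produced by cheap behavioral detectors.
        """
        if dwell_s < self.min_dwell_s or not any(signals):
            return False
        # Forget calls that have left the sliding window.
        while self._calls and now_s - self._calls[0] >= self.window_s:
            self._calls.popleft()
        if len(self._calls) >= self.max_calls:
            return False
        self._calls.append(now_s)
        return True


gate = SuspicionGate(min_dwell_s=5.0, max_calls=2, window_s=60.0)
decisions = [
    gate.should_invoke(0.0, 6.0, [True]),          # dwell and signal present
    gate.should_invoke(1.0, 6.0, [False, False]),  # no behavioral signal
    gate.should_invoke(2.0, 3.0, [True]),          # dwell too short
    gate.should_invoke(3.0, 6.0, [True]),          # second call in window
    gate.should_invoke(4.0, 6.0, [True]),          # budget exhausted
    gate.should_invoke(61.0, 6.0, [True]),         # first call has expired
]
```

Checking the cheap conditions before touching the rate limiter means rejected frames never consume call budget, which is what keeps the per-minute bound meaningful.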

Summary / 总结

The paper presents Paza, a zero-shot retail theft detection framework that uses a layered pipeline of existing models to detect concealment without custom model training. A multi-signal suspicion pre-filter reduces VLM invocations by 240x relative to per-frame analysis, enabling a single GPU to serve 10-20 stores. The VLM component, which accepts any OpenAI-compatible endpoint, achieves 89.5% precision and 92.8% specificity at 59.3% recall, and a cost model indicates viability at $50-100/month per store, 3-10x cheaper than commercial alternatives.

POP: Prefill-Only Pruning for Efficient Large Model Inference

Authors: Junhui He, Zhihui Fu, Jun Wang, Qingan Li

First: 2026-02-03T09:22:26+00:00 · Latest: 2026-04-16T10:22:57+00:00

Abstract

Large Language Models (LLMs) and Vision-Language Models (VLMs) have demonstrated remarkable capabilities. However, their deployment is hindered by significant computational costs. Existing structured pruning methods, while hardware-efficient, often suffer from significant accuracy degradation. In this paper, we argue that this failure stems from a stage-agnostic pruning approach that overlooks the asymmetric roles between the prefill and decode stages. By introducing a virtual gate mechanism, our importance analysis reveals that deep layers are critical for next-token prediction (decode) but largely redundant for context encoding (prefill). Leveraging this insight, we propose Prefill-Only Pruning (POP), a stage-aware inference strategy that safely omits deep layers during the computationally intensive prefill stage while retaining the full model for the sensitive decode stage. To enable the transition between stages, we introduce independent Key-Value (KV) projections to maintain cache integrity, and a boundary handling strategy to ensure the accuracy of the first generated token. Extensive experiments on Llama-3.1, Qwen3-VL, and Gemma-3 across diverse modalities demonstrate that POP achieves up to 1.37× speedup in prefill latency with minimal performance loss, effectively overcoming the accuracy-efficiency trade-off limitations of existing structured pruning methods.

Summary / 总结

This paper addresses the computational challenges of deploying large language models and vision-language models by proposing Prefill-Only Pruning (POP), a stage-aware inference strategy. POP identifies the critical role of deep layers in the decode stage and their redundancy in the prefill stage, leading to a significant speedup of up to 1.37 times in prefill latency with minimal performance loss. This method overcomes the accuracy-efficiency trade-off limitations of existing structured pruning techniques.

MEBench: A Novel Benchmark for Understanding Mutual Exclusivity Bias in Vision-Language Models

Authors: Anh Thai, Stefan Stojanov, Zixuan Huang, Bikram Boote, James M. Rehg

First: 2025-05-26T15:23:18+00:00 · Latest: 2026-04-16T10:15:26+00:00

Abstract

This paper introduces MEBench, a novel benchmark for evaluating mutual exclusivity (ME) bias, a cognitive phenomenon observed in children during word learning. Unlike traditional ME tasks, MEBench further incorporates spatial reasoning to create more challenging and realistic evaluation settings. To facilitate controlled experimentation, we also present a flexible and scalable data generation pipeline that supports the construction of diverse annotated scenes. We assess the performance of various vision-language models (VLMs) on this benchmark using novel evaluation metrics that capture key aspects of ME-based reasoning. We find that these VLMs exhibit weak ME bias, while showing some ability to leverage extra spatial context to resolve ambiguity in multiple novel object settings. Project page: http://mebench.github.io/.

Knowing When Not to Answer: Evaluating Abstention in Multimodal Reasoning Systems

Authors: Nishanth Madhusudhan, Vikas Yadav, Alexandre Lacoste

First: 2026-04-16T09:23:22+00:00 · Latest: 2026-04-16T09:23:22+00:00

Comments: 10 pages and 4 figures (excluding appendix)

Abstract

Effective abstention (EA), recognizing evidence insufficiency and refraining from answering, is critical for reliable multimodal systems. Yet existing evaluation paradigms for vision-language models (VLMs) and multi-agent systems (MAS) assume answerability, pushing models to always respond. Abstention has been studied in text-only settings but remains underexplored multimodally; current benchmarks either ignore unanswerability or rely on coarse methods that miss realistic failure modes. We introduce MM-AQA, a benchmark that constructs unanswerable instances from answerable ones via transformations along two axes: visual modality dependency and evidence sufficiency. Evaluating three frontier VLMs spanning closed and open-source models and two MAS architectures across 2079 samples, we find: (1) under standard prompting, VLMs rarely abstain; even simple confidence baselines outperform this setup, (2) MAS improves abstention but introduces an accuracy-abstention trade-off, (3) sequential designs match or exceed iterative variants, suggesting the bottleneck is miscalibration rather than reasoning depth, and (4) models abstain when image or text evidence is absent, but attempt reconciliation with degraded or contradictory evidence. Effective multimodal abstention requires abstention-aware training rather than better prompting or more agents.

Summary / 总结

The paper addresses effective abstention in multimodal systems: recognizing insufficient evidence and refraining from answering. It introduces MM-AQA, a benchmark that transforms answerable instances into unanswerable ones along two axes, visual modality dependency and evidence sufficiency. Evaluations of three VLMs and two MAS architectures show that VLMs rarely abstain under standard prompting, to the point that simple confidence baselines outperform this setup, and that MAS improves abstention but introduces an accuracy-abstention trade-off. Sequential MAS designs match or exceed iterative variants, which suggests that the bottleneck is miscalibration rather than reasoning depth. Models abstain when evidence is absent but attempt reconciliation when evidence is degraded or contradictory. The authors conclude that effective multimodal abstention requires abstention-aware training rather than better prompting or more agents.

AIM: Asymmetric Information Masking for Visual Question Answering Continual Learning

Authors: Peifeng Zhang, Zice Qiu, Donghua Yu, Shilei Cao, Juepeng Zheng, Yutong Lu, Haohuan Fu

Venue: ACM MM 2026

First: 2026-04-16T08:39:02+00:00 · Latest: 2026-04-16T08:39:02+00:00

Comments: 18 pages, 9 figures. Submitted to ACM MM 2026

Abstract

In continual visual question answering (VQA), existing Continual Learning (CL) methods are mostly built for symmetric, unimodal architectures. However, modern Vision-Language Models (VLMs) violate this assumption, as their trainable components are inherently asymmetric. This structural mismatch renders VLMs highly prone to catastrophic forgetting when learning from continuous data streams. Specifically, the asymmetry causes standard global regularization to favor the massive language decoder during optimization, leaving the smaller but critical visual projection layers highly vulnerable to interference. Consequently, this localized degradation leads to a severe loss of compositional reasoning capabilities. To address this, we propose Asymmetric Information Masking (AIM), which balances stability and plasticity by applying targeted masks based on modality-specific sensitivity. Experiments on VQA v2 and GQA under continual VQA settings show that AIM achieves state-of-the-art performance in both Average Performance (AP) and Average Forgetting (AF), while better preserving generalization to novel skill-concept compositions.

Summary / 总结

The research aims to address the issue of catastrophic forgetting in continual visual question answering (VQA) by focusing on the asymmetric architecture of modern Vision-Language Models (VLMs). The proposed method, Asymmetric Information Masking (AIM), introduces targeted masks based on modality-specific sensitivity to balance stability and plasticity. Experiments demonstrate that AIM outperforms existing methods in both Average Performance (AP) and Average Forgetting (AF), while maintaining generalization to new skill-concept compositions.

Zero-Effort Image-to-Music Generation: An Interpretable RAG-based VLM Approach

Authors: Zijian Zhao, Dian Jin, Zijing Zhou

First: 2025-09-26T14:07:29+00:00 · Latest: 2026-04-16T08:15:21+00:00

Abstract

Recently, Image-to-Music (I2M) generation has garnered significant attention, with potential applications in fields such as gaming, advertising, and multi-modal art creation. However, due to the ambiguous and subjective nature of I2M tasks, most end-to-end methods lack interpretability, leaving users puzzled about the generation results. Even methods based on emotion mapping face controversy, as emotion represents only a singular aspect of art. Additionally, most learning-based methods require substantial computational resources and large datasets for training, hindering accessibility for common users. To address these challenges, we propose the first Vision Language Model (VLM)-based I2M framework that offers high interpretability and low computational cost. Specifically, we utilize ABC notation to bridge the text and music modalities, enabling the VLM to generate music using natural language. We then apply multi-modal Retrieval-Augmented Generation (RAG) and self-refinement techniques to allow the VLM to produce high-quality music without external training. Furthermore, we leverage the generated motivations in text and the attention maps from the VLM to provide explanations for the generated results in both text and image modalities. To validate our method, we conduct both human studies and machine evaluations, where our method outperforms others in terms of music quality and music-image consistency, indicating promising results. Our code is available at https://github.com/RS2002/Image2Music .

Summary / 总结

The paper addresses the challenges of generating music from images by proposing an interpretable Vision Language Model (VLM) approach. It uses ABC notation to bridge text and music modalities and applies multi-modal Retrieval-Augmented Generation (RAG) and self-refinement techniques to produce high-quality music without external training. The method provides explanations through generated motivations and attention maps, and it outperforms other methods in terms of music quality and consistency with images in human and machine evaluations.

SGA-MCTS: Decoupling Planning from Execution via Training-Free Atomic Experience Retrieval

Authors: Xin Xie, Dongyun Xue, Wuguannan Yao, Mingxiao Feng, Wengang Zhou, Xiang Qi, Houqiang Li, Peng Zhang

First: 2026-04-16T07:22:36+00:00 · Latest: 2026-04-16T07:22:36+00:00

Abstract

LLM-powered systems require complex multi-step decision-making abilities to solve real-world tasks, yet current planning approaches face a trade-off between the high latency of inference-time search and the limited generalization of supervised fine-tuning. To address this limitation, we introduce SGA-MCTS, a framework that casts LLM planning as non-parametric retrieval. Offline, we leverage Monte Carlo Tree Search (MCTS) to explore the solution space and distill high-fidelity trajectories into State-Goal-Action (SGA) atoms. These atoms are de-lexicalized primitives that abstract concrete entities into symbolic slots, preserving reusable causal logic while discarding domain-specific noise. Online, a retrieval-augmented agent employs a hybrid symbolic-semantic mechanism to fetch relevant SGAs and re-ground them into the current context as soft reasoning hints. Empirical results on complex benchmarks demonstrate that this paradigm enables frozen, open-weights models to match the performance of SOTA systems (e.g., GPT-5) without task-specific fine-tuning. By effectively amortizing the heavy computational cost of search, SGA-MCTS achieves System 2 reasoning depth at System 1 inference speeds, rendering autonomous planning both scalable and real-time feasible.

Summary / 总结

SGA-MCTS decouples planning from execution to improve the multi-step decision-making of LLM-powered systems. Offline, it uses Monte Carlo Tree Search (MCTS) to explore the solution space and distills high-fidelity trajectories into de-lexicalized State-Goal-Action (SGA) atoms. Online, a retrieval-augmented agent fetches relevant SGAs through a hybrid symbolic-semantic mechanism and re-grounds them into the current context as soft reasoning hints. The reported results show that frozen, open-weights models using this approach can match state-of-the-art systems such as GPT-5 without task-specific fine-tuning, making autonomous planning efficient and scalable.
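
To make the de-lexicalize, retrieve, and re-ground loop concrete, here is a minimal Python sketch. It is not the authors' implementation: the `SGAAtom` class, the slot syntax `<CITY>`, the Jaccard-plus-bag-of-words scoring, the `alpha` weight, and the flight-booking example are all assumptions made for illustration, and a real system would presumably use a learned embedding model for the semantic part.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class SGAAtom:
    """Hypothetical container for one de-lexicalized State-Goal-Action atom."""
    state: str
    goal: str
    action: str


def delexicalize(text, entities):
    """Replace concrete entity strings with symbolic slots such as <CITY>."""
    for surface, slot in entities.items():
        text = text.replace(surface, f"<{slot}>")
    return text


def reground(template, bindings):
    """Fill symbolic slots with entities from the current context."""
    for slot, surface in bindings.items():
        template = template.replace(f"<{slot}>", surface)
    return template


def slot_set(text):
    """Collect the slot names that appear in a de-lexicalized string."""
    return set(re.findall(r"<([A-Z_]+)>", text))


def cosine(a, b):
    """Bag-of-words cosine similarity (a stand-in for an embedding model)."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(
        sum(v * v for v in cb.values())
    )
    return dot / norm if norm else 0.0


def hybrid_score(query, atom, alpha=0.5):
    """Blend symbolic slot overlap (Jaccard) with semantic similarity."""
    key = f"{atom.state} {atom.goal}"
    q_slots, a_slots = slot_set(query), slot_set(key)
    union = q_slots | a_slots
    symbolic = len(q_slots & a_slots) / len(union) if union else 0.0
    return alpha * symbolic + (1.0 - alpha) * cosine(query, key)


def retrieve(query, atoms):
    """Return the atom with the highest hybrid score for the query."""
    return max(atoms, key=lambda atom: hybrid_score(query, atom))


# Offline: de-lexicalize one step of a (hypothetical) successful trajectory.
past = {"Paris": "CITY", "2026-05-01": "DATE"}
flight_atom = SGAAtom(
    state=delexicalize("user wants a flight to Paris on 2026-05-01", past),
    goal=delexicalize("book flight to Paris", past),
    action=delexicalize("search_flights(dest=Paris, date=2026-05-01)", past),
)
weather_atom = SGAAtom(
    state="user asks about weather in <CITY>",
    goal="report weather in <CITY>",
    action="get_weather(city=<CITY>)",
)

# Online: de-lexicalize the new request, retrieve, then re-ground the action.
now = {"Tokyo": "CITY", "2026-06-10": "DATE"}
query = delexicalize(
    "user wants a flight to Tokyo on 2026-06-10 book flight to Tokyo", now
)
best = retrieve(query, [weather_atom, flight_atom])
hint = reground(best.action, {slot: surface for surface, slot in now.items()})
```

In this toy setup the stored atom never mentions Paris or Tokyo, so the same causal step can be reused across requests; `hint` is the re-grounded action that would be handed to the model as a soft reasoning hint.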

G-MIXER: Geodesic Mixup-based Implicit Semantic Expansion and Explicit Semantic Re-ranking for Zero-Shot Composed Image Retrieval

Authors: Jiyoung Lim, Heejae Yang, Jee-Hyong Lee

Venue: CVPR 2026

First: 2026-04-16T07:21:21+00:00 · Latest: 2026-04-16T07:21:21+00:00

Comments: CVPR 2026 Accepted

Abstract

Composed Image Retrieval (CIR) aims to retrieve target images by integrating a reference image with a corresponding modification text. CIR requires jointly considering the explicit semantics specified in the query and the implicit semantics embedded within its bi-modal composition. Recent training-free Zero-Shot CIR (ZS-CIR) methods leverage Multimodal Large Language Models (MLLMs) to generate detailed target descriptions, converting the implicit information into explicit textual expressions. However, these methods rely heavily on the textual modality and fail to capture the fuzzy retrieval nature that requires considering diverse combinations of candidates. This leads to reduced diversity and accuracy in retrieval results. To address this limitation, we propose a novel training-free method, Geodesic Mixup-based Implicit semantic eXpansion and Explicit semantic Re-ranking for ZS-CIR (G-MIXER). G-MIXER constructs composed query features that reflect the implicit semantics of reference image-text pairs through geodesic mixup over a range of mixup ratios, and builds a diverse candidate set. The generated candidates are then re-ranked using explicit semantics derived from MLLMs, improving both retrieval diversity and accuracy. Our proposed G-MIXER achieves state-of-the-art performance across multiple ZS-CIR benchmarks, effectively handling both implicit and explicit semantics without additional training. Our code will be available at https://github.com/maya0395/gmixer.
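
The geodesic mixup step in the abstract can be illustrated with spherical linear interpolation between unit-normalized features. The sketch below assumes that "geodesic mixup" means interpolation along the great circle between an image feature and a text feature; the function names, the toy 2-D vectors, and the chosen ratios are illustrative assumptions, and the paper's exact formulation may differ.

```python
import math


def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def geodesic_mixup(a, b, lam):
    """Spherical interpolation between a and b; lam=0 gives a, lam=1 gives b."""
    a, b = normalize(a), normalize(b)
    dot = max(-1.0, min(1.0, sum(x * y for x, y in zip(a, b))))
    theta = math.acos(dot)  # angle between the two unit vectors
    sin_theta = math.sin(theta)
    if sin_theta < 1e-6:
        # Degenerate case (parallel or antipodal vectors): the geodesic is
        # trivial or not unique, so this sketch simply returns a.
        return a
    w_a = math.sin((1.0 - lam) * theta) / sin_theta
    w_b = math.sin(lam * theta) / sin_theta
    return [w_a * x + w_b * y for x, y in zip(a, b)]


def composed_queries(image_feat, text_feat, ratios):
    """Build one composed query feature per mixup ratio."""
    return [geodesic_mixup(image_feat, text_feat, lam) for lam in ratios]


# Toy 2-D features standing in for real image and text embeddings.
queries = composed_queries([1.0, 0.0], [0.0, 1.0], [0.25, 0.5, 0.75])
```

Unlike plain linear mixing, every interpolated feature stays on the unit sphere, so cosine-similarity retrieval treats all mixup ratios on an equal footing. Each composed query would then retrieve its own candidates, and the union forms the diverse candidate set that is later re-ranked.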

SAM3-I: Segment Anything with Instructions

Authors: Jingjing Li, Yue Feng, Yuchen Guo, Jincai Huang, Wei Ji, Qi Bi, Yongri Piao, Miao Zhang, Xiaoqi Zhao, Qiang Chen, Shihao Zou, Huchuan Lu, Li Cheng

First: 2025-12-04T09:00:25+00:00 · Latest: 2026-04-16T07:12:40+00:00

Abstract

Segment Anything Model 3 (SAM3) advances open-vocabulary segmentation through promptable concept segmentation, enabling users to segment all instances associated with a given concept using short noun-phrase (NP) prompts. While effective for concept-level grounding, real-world interactions often involve far richer natural-language instructions that combine attributes, relations, actions, states, or implicit reasoning. Currently, SAM3 relies on external multi-modal agents to convert complex instructions into NPs and conducts iterative mask filtering, leading to coarse representations and limited instance specificity. In this work, we present SAM3-I, an instruction-following extension of the SAM family that unifies concept-level grounding and instruction-level reasoning within a single segmentation framework. Built upon SAM3, SAM3-I introduces an instruction-aware cascaded adaptation mechanism with dedicated alignment losses that progressively aligns expressive instruction semantics with SAM3's vision-language representations, enabling direct interpretation of natural-language instructions while preserving its strong concept recall ability. To enable instruction-following learning, we introduce HMPL-Instruct, a large-scale instruction-centric dataset that systematically covers hierarchical instruction semantics and diverse target granularities. Experiments demonstrate that SAM3-I achieves appealing performance across referring and reasoning-based segmentation, showing that SAM3 can be effectively extended to follow complex natural-language instructions without sacrificing its original concept-driven strengths. Code and dataset are available at https://github.com/debby-0527/SAM3-I.

One RL to See Them All: Visual Triple Unified Reinforcement Learning

Authors: Yan Ma, Linge Du, Xuyang Shen, Shaoxiang Chen, Pengfei Li, Qibing Ren, Lizhuang Ma, Yuchao Dai, Pengfei Liu, Junjie Yan

First: 2025-05-23T17:41:14+00:00 · Latest: 2026-04-16T06:56:57+00:00

Comments: Technical Report

Abstract

Reinforcement learning (RL) is becoming an important direction for post-training vision-language models (VLMs), but public training methodologies for unified multimodal RL remain much less mature, especially for heterogeneous reasoning and perception-heavy tasks. We propose V-Triune, a Visual Triple Unified Reinforcement Learning methodology for unified multimodal RL. It organizes training around three coordinated abstractions: Sample-Level Reward Routing, Verifier-Level Outcome Verification, and Source-Level Diagnostics. Within this methodology, Dynamic IoU provides localization-specific reward shaping that avoids reward ambiguity under loose thresholds and reward sparsity under strict ones. Built on V-Triune, we develop Orsta (7B, 32B), a family of models jointly trained on eight reasoning and perception tasks. Under matched budgets, unified training matches or outperforms specialist mixtures. The final Orsta models improve over their backbones on MEGA-Bench, compare favorably with strong multi-task RL-VLM baselines, and transfer these gains to a broad set of downstream benchmarks. These results show that unified RL can improve both reasoning and perception within a single VLM RL pipeline. The V-Triune system, along with the Orsta models, is publicly available at https://github.com/MiniMax-AI/One-RL-to-See-Them-All.

Summary / 总结

The research develops V-Triune, a unified multimodal reinforcement learning (RL) methodology for vision-language models (VLMs), to address the lack of mature public training methods for heterogeneous reasoning and perception-heavy tasks. V-Triune is organized around three abstractions: Sample-Level Reward Routing, Verifier-Level Outcome Verification, and Source-Level Diagnostics. Its Dynamic IoU component shapes localization rewards so that they are neither ambiguous under loose thresholds nor sparse under strict ones. The Orsta models (7B, 32B), jointly trained on eight reasoning and perception tasks, show that unified training matches or outperforms specialist mixtures under matched budgets. They improve over their backbones on MEGA-Bench, compare favorably with strong multi-task RL-VLM baselines, and carry these gains over to a broad set of downstream benchmarks.
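
One way to read the Dynamic IoU idea is as an IoU threshold that tightens as training progresses: early on a loose threshold keeps rewards from being sparse, and later a strict threshold keeps them from being ambiguous. The sketch below illustrates that reading only. The schedule values, the `progress` argument, and the choice to pay out the raw IoU above the threshold are assumptions for illustration, not details taken from the paper.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Hypothetical schedule: (fraction of training completed, IoU threshold).
SCHEDULE = ((0.0, 0.5), (0.3, 0.75), (0.7, 0.9))


def dynamic_threshold(progress, schedule=SCHEDULE):
    """Return the threshold of the latest schedule stage already reached."""
    threshold = schedule[0][1]
    for start, value in schedule:
        if progress >= start:
            threshold = value
    return threshold


def dynamic_iou_reward(pred_box, target_box, progress):
    """Pay out the IoU only when it clears the current threshold."""
    score = iou(pred_box, target_box)
    return score if score >= dynamic_threshold(progress) else 0.0


# The same prediction (IoU 0.8) is rewarded early in training but not late.
pred, target = (0.0, 0.0, 10.0, 10.0), (0.0, 0.0, 10.0, 8.0)
early = dynamic_iou_reward(pred, target, progress=0.1)
late = dynamic_iou_reward(pred, target, progress=0.9)
```

With a fixed loose threshold, a box with IoU 0.8 and a box with IoU 0.99 would both count as correct, while a fixed strict threshold would give an untrained model almost no signal; a staged threshold is one simple way to avoid both failure modes.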

DETR-ViP: Detection Transformer with Robust Discriminative Visual Prompts

Authors: Bo Qian, Dahu Shi, Xing Wei

Venue: ICLR 2026

First: 2026-04-16T06:40:44+00:00 · Latest: 2026-04-16T06:40:44+00:00

Comments: Published as a conference paper at ICLR 2026

Abstract

Visual prompted object detection enables interactive and flexible definition of target categories, thereby facilitating open-vocabulary detection. Since visual prompts are derived directly from image features, they often outperform text prompts in recognizing rare categories. Nevertheless, research on visual prompted detection has been largely overlooked, and it is typically treated as a byproduct of training text prompted detectors, which hinders its development. To fully unlock the potential of visual-prompted detection, we investigate the reasons why its performance is suboptimal and reveal that the underlying issue lies in the absence of global discriminability in visual prompts. Motivated by these observations, we propose DETR-ViP, a robust object detection framework that yields class-distinguishable visual prompts. On top of basic image-text contrastive learning, DETR-ViP incorporates global prompt integration and visual-textual prompt relation distillation to learn more discriminative prompt representations. In addition, DETR-ViP employs a selective fusion strategy that ensures stable and robust detection. Extensive experiments on COCO, LVIS, ODinW, and Roboflow100 demonstrate that DETR-ViP achieves substantially higher performance in visual prompt detection compared to other state-of-the-art counterparts. A series of ablation studies and analyses further validate the effectiveness of the proposed improvements and shed light on the underlying reasons for the enhanced detection capability of visual prompts.

Summary / 总结

DETR-ViP improves visual-prompted object detection by addressing the lack of global discriminability in visual prompts. On top of basic image-text contrastive learning, it adds global prompt integration and visual-textual prompt relation distillation to learn more discriminative prompt representations, and it uses a selective fusion strategy for stable, robust detection. Experiments on COCO, LVIS, ODinW, and Roboflow100 show substantially higher visual-prompt detection performance than other state-of-the-art methods, and ablation studies confirm the effectiveness of the proposed improvements.
