arXiv Paper Digest

Snapshot: 20260420_0359

ArrowGEV: Grounding Events in Video via Learning the Arrow of Time

Authors: Fangxu Yu, Ziyao Lu, Liqiang Niu, Fandong Meng, Jie Zhou

Venue: ACL 2026

First: 2026-01-10T13:05:23+00:00 · Latest: 2026-04-16T17:52:47+00:00

Comments: Accepted to Findings of ACL 2026

Abstract

Grounding events in videos serves as a fundamental capability in video analysis. While Vision Language Models (VLMs) are increasingly employed for this task, existing approaches predominantly train models to associate events with timestamps in the forward video only. This paradigm hinders VLMs from capturing the inherent temporal structure and directionality of events, thereby limiting robustness and generalization. To address this limitation, inspired by the arrow of time in physics, which characterizes the intrinsic directionality of temporal processes, we propose ArrowGEV, a reinforcement learning framework that explicitly models temporal directionality in events to improve both event grounding and temporal directionality understanding in VLMs. Specifically, we categorize events into time-sensitive (e.g., putting down a bag) and time-insensitive (e.g., holding a towel in the left hand). The former denote events whose reversal substantially alters their meaning, while the latter remain semantically unchanged under reversal. For time-sensitive events, ArrowGEV introduces a reward that encourages VLMs to discriminate between forward and backward videos, whereas for time-insensitive events, it enforces consistent grounding across both directions. Extensive experiments demonstrate that ArrowGEV not only improves grounding precision and temporal directionality recognition, but also enhances general video understanding and reasoning ability.
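
The abstract does not give the reward formula, so the following is only a minimal sketch of one way the consistency requirement for time-insensitive events could be scored. It assumes predictions are `(start, end)` segments in seconds, and that a segment predicted on the reversed video is mirrored back onto the forward timeline before comparison. The function names and the use of temporal IoU as the agreement measure are illustrative assumptions, not the authors' implementation.

```python
def temporal_iou(a, b):
    """Intersection-over-union of two (start, end) segments in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def mirror_segment(segment, duration):
    """Map a segment predicted on the reversed video back to forward time."""
    start, end = segment
    return (duration - end, duration - start)


def consistency_reward(forward_pred, backward_pred, duration):
    """Hypothetical reward for a time-insensitive event: agreement between
    the forward prediction and the mirrored backward prediction."""
    return temporal_iou(forward_pred, mirror_segment(backward_pred, duration))


if __name__ == "__main__":
    # A 10-second clip; the backward prediction (5, 8) mirrors to (2, 5).
    print(consistency_reward((2.0, 5.0), (5.0, 8.0), duration=10.0))
```

For time-sensitive events the abstract describes a different signal (discriminating forward from backward videos); that case is not covered by this sketch.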

Summary

ArrowGEV is a reinforcement learning framework that improves event grounding in video by explicitly modeling the temporal directionality of events. Existing approaches associate events with timestamps in the forward video only, which limits the robustness and generalization of VLMs. Inspired by the arrow of time in physics, which characterizes the intrinsic directionality of temporal processes, ArrowGEV rewards VLMs for discriminating between forward and backward videos on time-sensitive events and for grounding time-insensitive events consistently in both directions. Experiments show improved grounding precision and temporal directionality recognition, along with better general video understanding and reasoning.

Why Do Vision Language Models Struggle To Recognize Human Emotions?

Authors: Madhav Agarwal, Sotirios A. Tsaftaris, Laura Sevilla-Lara, Steven McDonagh

First: 2026-04-16T17:49:58+00:00 · Latest: 2026-04-16T17:49:58+00:00

Abstract

Understanding emotions is a fundamental ability for intelligent systems to be able to interact with humans. Vision-language models (VLMs) have made tremendous progress in the last few years for many visual tasks, potentially offering a promising solution for understanding emotions. However, it is surprising that even the most sophisticated contemporary VLMs struggle to recognize human emotions or to outperform even specialized vision-only classifiers. In this paper we ask the question "Why do VLMs struggle to recognize human emotions?", and observe that the inherently continuous and dynamic task of facial expression recognition (DFER) exposes two critical VLM vulnerabilities. First, emotion datasets are naturally long-tailed, and the web-scale data used to pre-train VLMs exacerbates this head-class bias, causing them to systematically collapse rare, under-represented emotions into common categories. We propose alternative sampling strategies that prevent favoring common concepts. Second, temporal information is critical for understanding emotions. However, VLMs are unable to represent temporal information over dense frame sequences, as they are limited by context size and the number of tokens that can fit in memory, which poses a clear challenge for emotion recognition. We demonstrate that the sparse temporal sampling strategy used in VLMs is inherently misaligned with the fleeting nature of micro-expressions (0.25-0.5 seconds), which are often the most critical affective signal. As a diagnostic probe, we propose a multi-stage context enrichment strategy that utilizes the information from "in-between" frames by first converting them into natural language summaries. This enriched textual context is provided as input to the VLM alongside sparse keyframes, preventing attentional dilution from excessive visual data while preserving the emotional trajectory.
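
The mismatch between sparse sampling and micro-expressions can be made concrete with a small calculation. The sketch below assumes frames are sampled on a uniform grid and that a micro-expression starts at a uniformly random time, so the chance that at least one sampled frame falls inside it is roughly the event duration divided by the sampling gap. The clip length and frame count are hypothetical values chosen for illustration, not figures from the paper.

```python
def capture_probability(clip_seconds, num_frames, event_seconds):
    """Approximate chance that a uniformly sampled frame lands inside an
    event of the given duration (boundary effects ignored)."""
    gap = clip_seconds / num_frames  # time between consecutive samples
    return min(1.0, event_seconds / gap)


if __name__ == "__main__":
    clip_seconds, num_frames = 10.0, 8  # hypothetical sampling setup
    for event_seconds in (0.25, 0.5):   # micro-expression range from the abstract
        p = capture_probability(clip_seconds, num_frames, event_seconds)
        print(f"{event_seconds:.2f}s event: capture probability ~{p:.2f}")
```

Under these assumed numbers the sampling gap is 1.25 seconds, several times longer than a micro-expression, which is the sense in which sparse sampling is misaligned with the signal.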

Summary

This paper investigates why vision-language models struggle to recognize human emotions, highlighting two critical vulnerabilities. First, emotion datasets are long-tailed, and web-scale data exacerbates this bias, causing models to collapse rare emotions into common categories. Second, VLMs are limited by context size and cannot effectively represent temporal information over dense frame sequences, which is crucial for understanding emotions. The authors propose a multi-stage context enrichment strategy to address these issues by converting 'in-between' frames into natural language summaries and providing them as input to the VLM, thus preserving the emotional trajectory without diluting attention.

StreamCacheVGGT: Streaming Visual Geometry Transformers with Robust Scoring and Hybrid Cache Compression

Authors: Xuanyi Liu, Deyi Ji, Chunan Yu, Qi Zhu, Xuanfu Li, Jin Ma, Tianrun Chen, Lanyun Zhu

First: 2026-04-16T17:12:10+00:00 · Latest: 2026-04-16T17:12:10+00:00

Abstract

Reconstructing dense 3D geometry from continuous video streams requires stable inference under a constant memory budget. Existing O(1) frameworks primarily rely on a "pure eviction" paradigm, which suffers from significant information destruction due to binary token deletion and evaluation noise from localized, single-layer scoring. To address these bottlenecks, we propose StreamCacheVGGT, a training-free framework that reimagines cache management through two synergistic modules: Cross-Layer Consistency-Enhanced Scoring (CLCES) and Hybrid Cache Compression (HCC). CLCES mitigates activation noise by tracking token importance trajectories across the Transformer hierarchy, employing order-statistical analysis to identify sustained geometric salience. Leveraging these robust scores, HCC transcends simple eviction by introducing a three-tier triage strategy that merges moderately important tokens into retained anchors via nearest-neighbor assignment on the key-vector manifold. This approach preserves essential geometric context that would otherwise be lost. Extensive evaluations on five benchmarks (7-Scenes, NRGBD, ETH3D, Bonn, and KITTI) demonstrate that StreamCacheVGGT sets a new state-of-the-art, delivering superior reconstruction accuracy and long-term stability while strictly adhering to constant-cost constraints.
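
The two modules can be illustrated with a rough NumPy sketch. Several choices here are assumptions rather than details from the abstract: the median across layers stands in for the "order-statistical analysis", Euclidean distance stands in for nearest-neighbor assignment on the key-vector manifold, and merged tokens are folded into their anchor with a running mean. The function names and tier sizes are hypothetical, and `n_keep` is assumed to be at least 1.

```python
import numpy as np


def cross_layer_score(layer_scores):
    """layer_scores has shape (num_layers, num_tokens). The median across
    layers is one order statistic that damps single-layer noise."""
    return np.median(layer_scores, axis=0)


def triage(keys, values, scores, n_keep, n_merge):
    """Three tiers: keep the top tokens as anchors, merge the next tier
    into its nearest anchor, and evict everything else."""
    order = np.argsort(-scores)  # most important token first
    keep = order[:n_keep]
    merge = order[n_keep:n_keep + n_merge]
    new_keys = keys[keep].astype(float)
    new_values = values[keep].astype(float)
    counts = np.ones(len(keep))
    if len(merge) > 0:
        # Distance from each merge token to each anchor in key space.
        dists = np.linalg.norm(
            keys[merge][:, None, :] - keys[keep][None, :, :], axis=-1
        )
        nearest = dists.argmin(axis=1)
        for tok, anchor in zip(merge, nearest):
            counts[anchor] += 1
            # Running mean of everything assigned to this anchor.
            new_keys[anchor] += (keys[tok] - new_keys[anchor]) / counts[anchor]
            new_values[anchor] += (values[tok] - new_values[anchor]) / counts[anchor]
    return new_keys, new_values


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    layer_scores = rng.random((6, 10))  # 6 layers, 10 cached tokens
    keys = rng.normal(size=(10, 4))
    values = rng.normal(size=(10, 4))
    k, v = triage(keys, values, cross_layer_score(layer_scores), n_keep=3, n_merge=4)
    print(k.shape, v.shape)  # cache size stays at n_keep entries
```

The cache returned by `triage` always has `n_keep` entries, which is the constant-memory property the abstract emphasizes; pure eviction corresponds to `n_merge=0`.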

RadAgent: A tool-using AI agent for stepwise interpretation of chest computed tomography

Authors: Mélanie Roschewitz, Kenneth Styppa, Yitian Tao, Jiwoong Sohn, Jean-Benoit Delbrouck, Benjamin Gundersen, Nicolas Deperrois, Christian Bluethgen, Julia Vogt, Bjoern Menze, Farhad Nooralahzadeh, Michael Krauthammer, Michael Moor

First: 2026-04-16T17:09:30+00:00 · Latest: 2026-04-16T17:09:30+00:00

Abstract

Vision-language models (VLM) have markedly advanced AI-driven interpretation and reporting of complex medical imaging, such as computed tomography (CT). Yet, existing methods largely relegate clinicians to passive observers of final outputs, offering no interpretable reasoning trace for them to inspect, validate, or refine. To address this, we introduce RadAgent, a tool-using AI agent that generates CT reports through a stepwise and interpretable process. Each resulting report is accompanied by a fully inspectable trace of intermediate decisions and tool interactions, allowing clinicians to examine how the reported findings are derived. In our experiments, we observe that RadAgent improves Chest CT report generation over its 3D VLM counterpart, CT-Chat, across three dimensions. Clinical accuracy improves by 6.0 points (36.4% relative) in macro-F1 and 5.4 points (19.6% relative) in micro-F1. Robustness under adversarial conditions improves by 24.7 points (41.9% relative). Furthermore, RadAgent achieves 37.0% in faithfulness, a new capability entirely absent in its 3D VLM counterpart. By structuring the interpretation of chest CT as an explicit, tool-augmented and iterative reasoning trace, RadAgent brings us closer toward transparent and reliable AI for radiology.
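
The abstract reports each gain both in absolute points and as a relative percentage, which implies approximate baseline scores. The short calculation below assumes the relative figure is the absolute gain divided by the CT-Chat baseline; the resulting values are back-of-the-envelope estimates derived from rounded numbers, not scores reported by the authors.

```python
def implied_scores(abs_gain_points, rel_gain_percent):
    """Return (baseline, improved) implied by an absolute and a relative gain,
    assuming relative gain = absolute gain / baseline."""
    baseline = abs_gain_points / (rel_gain_percent / 100.0)
    return baseline, baseline + abs_gain_points


if __name__ == "__main__":
    reported = [
        ("macro-F1", 6.0, 36.4),
        ("micro-F1", 5.4, 19.6),
        ("robustness", 24.7, 41.9),
    ]
    for name, gain, rel in reported:
        baseline, improved = implied_scores(gain, rel)
        print(f"{name}: baseline ~{baseline:.1f}, RadAgent ~{improved:.1f}")
```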

Adaptive Layer Selection for Layer-Wise Token Pruning in LLM Inference

Authors: Rei Taniguchi, Yuyang Dong, Makoto Onizuka, Chuan Xiao

Venue: ACL 2026

First: 2026-01-12T15:47:35+00:00 · Latest: 2026-04-16T16:46:40+00:00

Comments: ACL 2026 Findings. Source code available at https://github.com/TANIGUCHIREI/ASL

Abstract

Due to the prevalence of large language models (LLMs), key-value (KV) cache reduction for LLM inference has received remarkable attention. Among numerous works that have been proposed in recent years, layer-wise token pruning approaches, which select a subset of tokens at particular layers to retain in KV cache and prune others, are one of the most popular schemes. They primarily adopt a set of pre-defined layers, at which tokens are selected. Such design is inflexible in the sense that the accuracy significantly varies across tasks and deteriorates in harder tasks such as KV retrieval. In this paper, we propose ASL, a training-free method that adaptively chooses the selection layer for KV cache reduction, exploiting the variance of token ranks ordered by attention score. The proposed method balances the performance across different tasks while meeting the user-specified KV budget requirement. ASL operates during the prefilling stage and can be jointly used with existing KV cache reduction methods such as SnapKV to optimize the decoding stage. By evaluations on the InfiniteBench, RULER, and NIAH benchmarks, we show that ASL, equipped with one-shot token selection, adaptively trades inference speed for accuracy, outperforming state-of-the-art layer-wise token pruning methods in difficult tasks.

Summary

The study proposes ASL, a training-free method that adaptively chooses the selection layer for layer-wise token pruning in LLM inference in order to reduce the KV cache. The method balances performance across tasks while meeting a user-specified KV budget, and it can be used jointly with existing KV cache reduction methods such as SnapKV. Evaluations on InfiniteBench, RULER, and NIAH show that ASL adaptively trades inference speed for accuracy and outperforms state-of-the-art layer-wise token pruning methods on difficult tasks.

VisPCO: Visual Token Pruning Configuration Optimization via Budget-Aware Pareto-Frontier Learning for Vision-Language Models

Authors: Huawei Ji, Yuanhao Sun, Yuan Jin, Cheng Deng, Jiaxin Ding, Luoyi Fu, Xinbing Wang

First: 2026-04-16T16:21:05+00:00 · Latest: 2026-04-16T16:21:05+00:00

Abstract

Visual token pruning methods effectively mitigate the quadratic computational growth caused by processing high-resolution images and video frames in vision-language models (VLMs). However, existing approaches rely on predefined pruning configurations without determining whether they achieve computation-performance optimality. In this work, we introduce VisPCO, a novel framework that formulates visual token pruning as a Pareto configuration optimization problem to automatically identify optimal configurations. Our approach employs continuous relaxation and straight-through estimators to enable gradient-based search, solved via the Augmented Lagrangian method. Extensive experiments across 8 visual benchmarks demonstrate that VisPCO effectively approximates the empirical Pareto frontier obtained through grid search and generalizes well across various pruning methods and VLM architectures. Furthermore, through learnable kernel functions, we investigate layer-wise pruning patterns and reveal that multi-step progressive pruning captures VLMs' hierarchical compression structure, achieving superior accuracy-efficiency trade-offs compared to single-layer approaches.
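
For readers unfamiliar with the term, the empirical Pareto frontier that the method is compared against can be computed from grid-search results as the set of configurations that no other configuration beats on both cost and accuracy. The sketch below shows only that reference computation, not the gradient-based search itself; the `(cost, accuracy)` pairs are made-up numbers for illustration.

```python
def pareto_frontier(points):
    """points: iterable of (cost, accuracy) pairs, where lower cost and
    higher accuracy are better. Returns the non-dominated points by cost."""
    frontier = []
    best_acc = float("-inf")
    # Sort by cost, breaking ties in favor of higher accuracy.
    for cost, acc in sorted(points, key=lambda p: (p[0], -p[1])):
        if acc > best_acc:
            frontier.append((cost, acc))
            best_acc = acc
    return frontier


if __name__ == "__main__":
    # Hypothetical grid-search results: (relative compute, benchmark accuracy).
    grid = [(1.0, 0.60), (2.0, 0.58), (2.0, 0.70), (3.0, 0.69), (4.0, 0.75)]
    print(pareto_frontier(grid))
```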

Summary

The research aims to optimize visual token pruning configurations in vision-language models to balance computational efficiency and performance. The proposed VisPCO framework formulates the pruning problem as a Pareto optimization task, using continuous relaxation and gradient-based search to find optimal configurations. Experiments across eight visual benchmarks show that VisPCO effectively approximates the Pareto frontier and generalizes well across different pruning methods and model architectures, demonstrating superior accuracy-efficiency trade-offs with multi-step progressive pruning compared to single-layer approaches.

Agent-Aided Design for Dynamic CAD Models

Authors: Mitch Adler, Matthew Russo, Michael Cafarella

First: 2026-04-16T16:15:23+00:00 · Latest: 2026-04-16T16:15:23+00:00

Comments: 6 pages, 3 figures, to be published in CAIS'26

Abstract

In the past year, researchers have started to create agentic systems that can design real-world CAD-style objects in a training-free setting, a new variety of system that we call Agent-Aided Design. Generally speaking, these systems place an agent in a feedback loop in which it can write code, compile that code to an assembly of CAD model(s), visualize the model, and then iteratively refine its code based on visual and other feedback. Despite rapid progress, a key problem remains: none of these systems can build complex 3D assemblies with moving parts. For example, no existing system can build a piston, a pendulum, or even a pair of scissors. In order for Agent-Aided Design to make a real impact in industrial manufacturing, we need a system that is capable of generating such 3D assemblies. In this paper we present a prototype of AADvark, an agentic system designed for this task. Unlike previous state-of-the-art systems, AADvark captures the dynamic part interactions with one or more degrees-of-freedom. This design decision allows AADvark to reason directly about assemblies with moving parts and can thereby achieve cross-cutting goals, including but not limited to mechanical movements. Unfortunately, current LLMs are imperfect spatial reasoners, a problem that AADvark addresses by incorporating external constraint solver tools with a specialized visual feedback mechanism. We demonstrate that, by modifying the agent's tools (FreeCAD and the assembly solver), we are able to create a strong verification signal which enables our system to build 3D assemblies with movable parts.

Summary

The research aims to develop an agentic system capable of designing complex 3D assemblies with moving parts, which is essential for industrial manufacturing. AADvark, a prototype system, captures dynamic part interactions and uses a specialized visual feedback mechanism along with external constraint solvers to address the limitations of current language models. The system successfully builds 3D assemblies with movable parts by modifying the agent's tools (FreeCAD and the assembly solver).

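This paper presents AADvark, a design system that can generate complex 3D assemblies with moving parts, addressing the inability of existing systems to produce such assemblies. By introducing an external constraint solver and a specialized visual feedback mechanism, AADvark can design objects such as pistons, pendulums, and scissors. The system uses tools such as FreeCAD and an assembly solver to provide a strong verification signal, allowing it to create dynamic CAD models with one or more degrees of freedom.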

One Token per Highly Selective Frame: Towards Extreme Compression for Long Video Understanding

Authors: Zheyu Zhang, Ziqi Pang, Shixing Chen, Xiang Hao, Vimal Bhat, Yu-Xiong Wang

Venue: NeurIPS 2025

First: 2026-04-15T17:59:52+00:00 · Latest: 2026-04-16T15:48:38+00:00

Comments: Appear in the proceedings of NeurIPS 2025

Abstract

Long video understanding is inherently challenging for vision-language models (VLMs) because of the extensive number of frames. With each video frame typically expanding into tens or hundreds of tokens, the limited context length of large language models (LLMs) forces the VLMs to perceive the frames sparsely and lose temporal information. To address this, we explore extreme video token compression towards one token per frame at the final LLM layer. Our key insight is that heuristic-based compression, widely adopted by previous methods, is prone to information loss, and this necessitates supervising LLM layers into learnable and progressive modules for token-level compression (LP-Comp). Such compression enables our VLM to digest 2x-4x more frames with improved performance. To further increase the token efficiency, we investigate frame-level compression, which selects the frames most relevant to the queries via the internal attention scores of the LLM layers, named question-conditioned compression (QC-Comp). As a notable distinction from previous studies, we mitigate the position bias of LLM attention in long contexts, i.e., the over-concentration on the beginning and end of a sequence, by splitting long videos into short segments and employing local attention. Collectively, our combined token-level and frame-level compression leads to an extreme compression model for long video understanding, named XComp, achieving a significantly larger compression ratio and enabling denser frame sampling. Our XComp is finetuned from VideoChat-Flash with a data-efficient supervised compression tuning stage that only requires 2.5% of the supervised fine-tuning data, yet boosts the accuracy from 42.9% to 46.2% on LVBench and enhances multiple other long video benchmarks.
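
The segment-wise idea behind frame selection can be sketched as follows. This is an illustrative assumption about how splitting a video into short segments might combine with score-based selection: frames compete only within their own segment, so frames at the start or end of the full sequence cannot crowd out the middle. In the paper the scores come from internal LLM attention; here they are plain numbers, and `select_frames`, `segment_len`, and `k_per_segment` are hypothetical names.

```python
import numpy as np


def select_frames(scores, segment_len, k_per_segment):
    """Pick the top-k scoring frames inside each fixed-length segment and
    return their indices in temporal order."""
    scores = np.asarray(scores, dtype=float)
    selected = []
    for start in range(0, len(scores), segment_len):
        segment = scores[start:start + segment_len]
        top = np.argsort(-segment)[:k_per_segment] + start
        selected.extend(sorted(top.tolist()))
    return selected


if __name__ == "__main__":
    # Hypothetical per-frame relevance scores for an 8-frame clip.
    frame_scores = [0.9, 0.1, 0.2, 0.8, 0.3, 0.7, 0.05, 0.6]
    print(select_frames(frame_scores, segment_len=4, k_per_segment=2))
```

A global top-k over the same scores would favor whichever region has the largest raw values, which is where a position bias in the scores would show up.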

Summary

The paper addresses the challenge of long video understanding by proposing a method to compress video tokens to one per frame at the final layer of large language models (LLMs). It introduces a learnable and progressive compression module (LP-Comp) to mitigate information loss and a question-conditioned compression (QC-Comp) to select relevant frames. The combined approach, named XComp, significantly increases the number of frames processed while improving performance on long video benchmarks.

IROSA: Interactive Robot Skill Adaptation using Natural Language

Authors: Markus Knauer, Samuel Bustamante, Thomas Eiband, Alin Albu-Schäffer, Freek Stulp, João Silvério

Venue: IEEE Robotics and Automation Letters (RA-L), 2026

First: 2026-03-04T09:54:09+00:00 · Latest: 2026-04-16T15:37:03+00:00

Comments: Accepted IEEE Robotics and Automation Letters (RA-L) journal, 8 pages, 5 figures, 3 tables, 1 listing. Code available: https://github.com/DLR-RM/IROSA

Abstract

Foundation models have demonstrated impressive capabilities across diverse domains, while imitation learning provides principled methods for robot skill adaptation from limited data. Combining these approaches holds significant promise for direct application to robotics, yet this combination has received limited attention, particularly for industrial deployment. We present a novel framework that enables open-vocabulary skill adaptation through a tool-based architecture, maintaining a protective abstraction layer between the language model and robot hardware. Our approach leverages pre-trained LLMs to select and parameterize specific tools for adapting robot skills without requiring fine-tuning or direct model-to-robot interaction. We demonstrate the framework on a 7-DoF torque-controlled robot performing an industrial bearing ring insertion task, showing successful skill adaptation through natural language commands for speed adjustment, trajectory correction, and obstacle avoidance while maintaining safety, transparency, and interpretability.

Summary

The paper introduces IROSA, a framework for interactive robot skill adaptation using natural language. It combines foundation models and imitation learning to adapt robot skills with limited data, maintaining safety and interpretability. The framework uses pre-trained language models to select and parameterize tools for skill adaptation without requiring fine-tuning or direct interaction with the robot. Experiments on a 7-DoF robot showed successful skill adaptation through natural language commands for speed adjustment, trajectory correction, and obstacle avoidance, while ensuring safety and transparency.

OpenMobile: Building Open Mobile Agents with Task and Trajectory Synthesis

Authors: Kanzhi Cheng, Zehao Li, Zheng Ma, Nuo Chen, Jialin Cao, Qiushi Sun, Zichen Ding, Fangzhi Xu, Hang Yan, Jiajun Chen, Anh Tuan Luu, Jianbing Zhang, Lewei Lu, Dahua Lin

First: 2026-04-16T14:53:08+00:00 · Latest: 2026-04-16T14:53:08+00:00

Comments: Work in progress

Abstract

Mobile agents powered by vision-language models have demonstrated impressive capabilities in automating mobile tasks, with recent leading models achieving a marked performance leap, e.g., nearly 70% success on AndroidWorld. However, these systems keep their training data closed and remain opaque about their task and trajectory synthesis recipes. We present OpenMobile, an open-source framework that synthesizes high-quality task instructions and agent trajectories, with two key components: (1) a scalable task synthesis pipeline that constructs a global environment memory from exploration, then leverages it to generate diverse and grounded instructions, and (2) a policy-switching strategy for trajectory rollout that alternates between learner and expert models to capture essential error-recovery data often missing in standard imitation learning. Agents trained on our data achieve competitive results across three dynamic mobile agent benchmarks: notably, our fine-tuned Qwen2.5-VL and Qwen3-VL reach 51.7% and 64.7% on AndroidWorld, far surpassing existing open-data approaches. Furthermore, we conduct transparent analyses on the overlap between our synthetic instructions and benchmark test sets, and verify that performance gains stem from broad functionality coverage rather than benchmark overfitting. We release data and code at https://njucckevin.github.io/openmobile/ to bridge the data gap and facilitate broader mobile agent research.
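
The policy-switching idea can be shown on a toy problem. The abstract does not say when control passes between the learner and the expert, so this sketch assumes a fixed alternation in blocks of steps. The one-dimensional environment, both policies, and all names are hypothetical. The point is only that when the expert takes over from a state the learner has wandered into, the logged trajectory contains the expert's recovery actions.

```python
def rollout(learner, expert, step, is_done, state, max_steps=20, block=2):
    """Alternate control in fixed blocks: the learner acts for `block`
    steps, then the expert, and so on. Every step is logged with its actor."""
    trajectory = []
    for t in range(max_steps):
        if is_done(state):
            break
        if (t // block) % 2 == 0:
            actor, policy = "learner", learner
        else:
            actor, policy = "expert", expert
        action = policy(state)
        trajectory.append((state, action, actor))
        state = step(state, action)
    return trajectory, state


if __name__ == "__main__":
    goal = 3
    learner = lambda s: -1 if s == 1 else 1   # makes a mistake at state 1
    expert = lambda s: 1 if s < goal else -1  # always moves toward the goal
    step = lambda s, a: s + a
    is_done = lambda s: s == goal
    trajectory, final_state = rollout(learner, expert, step, is_done, state=0)
    for record in trajectory:
        print(record)
    print("final state:", final_state)
```

In this toy run the learner steps backward at state 1, and the expert's later action at that same state is the kind of error-recovery example that pure expert demonstrations would not contain.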

Beyond Visual Cues: Semantic-Driven Token Filtering and Expert Routing for Anytime Person ReID

Authors: Jiaxuan Li, Xin Wen, Zhihang Li

First: 2026-04-16T14:49:30+00:00 · Latest: 2026-04-16T14:49:30+00:00

Abstract

Any-Time Person Re-identification (AT-ReID) necessitates the robust retrieval of target individuals under arbitrary conditions, encompassing both modality shifts (daytime and nighttime) and extensive clothing-change scenarios, ranging from short-term to long-term intervals. However, existing methods rely heavily on pure visual features, which are prone to change due to environmental and time factors, resulting in significant performance deterioration in scenarios involving illumination-induced modality shifts or clothing changes. In this paper, we propose Semantic-driven Token Filtering and Expert Routing (STFER), a novel framework that leverages the ability of Large Vision-Language Models (LVLMs) to generate identity-consistent text, which provides identity-discriminative features that are robust to both clothing variations and cross-modality shifts between RGB and IR. Specifically, we employ instructions to guide the LVLM in generating identity-intrinsic semantic text that captures biometric constants for semantic-driven modeling. The text token is further used for Semantic-driven Visual Token Filtering (SVTF), which enhances informative visual regions and suppresses redundant background noise. Meanwhile, the text token is also used for Semantic-driven Expert Routing (SER), which integrates the semantic text into expert routing, resulting in more robust multi-scenario gating. Extensive experiments on the Any-Time ReID dataset (AT-USTC) demonstrate that our model achieves state-of-the-art results. Moreover, the model trained on AT-USTC was evaluated across 5 widely-used ReID benchmarks, demonstrating superior generalization capabilities with highly competitive results. Our code will be available soon.

Summary

The paper addresses the challenge of robust person re-identification (ReID) under varying conditions, such as lighting changes and clothing variations. It proposes STFER, a framework that uses Large Vision-Language Models to generate identity-consistent text, which enhances visual features and improves robustness. Experiments show that STFER outperforms existing methods on the AT-USTC dataset and demonstrates strong generalization across multiple benchmarks.

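The paper proposes a new framework named STFER that uses large vision-language models to generate identity-consistent text, strengthening robustness to clothing changes and cross-modality shifts. The method comprises semantic-driven visual token filtering and semantic-driven expert routing, achieves state-of-the-art results on the AT-USTC dataset, and shows strong generalization across multiple ReID benchmarks.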

DocVAL: Validated Chain-of-Thought Distillation for Grounded Document VQA

Authors: Ahmad Mohammadshirazi, Pinaki Prasad Guha Neogi, Ser-Nam Lim, Rajiv Ramnath

First: 2025-11-27T15:00:58+00:00 · Latest: 2026-04-16T14:40:49+00:00

Abstract

Document visual question answering requires models not only to answer questions correctly, but also to precisely localize answers within complex document layouts. While large vision-language models (VLMs) achieve strong spatial grounding, their inference cost and latency limit real-world deployment. Compact VLMs are more efficient, but they often suffer substantial localization degradation under standard fine-tuning or distillation. To address this gap, we propose DocVAL, a validated chain-of-thought (CoT) distillation framework that transfers explicit spatial reasoning from large teacher models to compact, deployable student VLMs. DocVAL combines (1) teacher-generated spatial CoT supervision, (2) a rule-based dual-mode validator that filters low-quality training signals and provides fine-grained, pixel-level corrective feedback, and (3) a validation-driven two-stage training procedure with iterative refinement. Text detection is used only as training-time scaffolding for supervision and validation, enabling the final student to operate as a pure VLM without OCR or detection at inference. Across multiple document understanding benchmarks, DocVAL yields consistent improvements of up to 6-7 ANLS points over comparable compact VLMs. We further introduce mean Average Precision (mAP) as a localization metric for document question answering and report strong spatial grounding performance under this new evaluation. We release 95K validator-verified CoT traces and show that high-quality, validated supervision is more effective than scaling unfiltered data, enabling efficient and trustworthy document grounding. Dataset and implementation: https://github.com/ahmad-shirazi/DocVAL
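
The ANLS figures quoted above refer to Average Normalized Levenshtein Similarity, the usual answer-accuracy metric for document VQA. A minimal sketch of the commonly used definition follows: for each question, take the best similarity over the accepted answers, zero it out when the normalized edit distance reaches the threshold (0.5 by convention), and average over questions. Lowercasing and whitespace stripping are assumed preprocessing steps here, and official evaluation scripts may differ in such details.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]


def anls(predictions, references, threshold=0.5):
    """predictions: list of strings; references: list of lists of accepted
    answers. Returns a score in [0, 1]."""
    if not predictions:
        return 0.0
    total = 0.0
    for pred, golds in zip(predictions, references):
        p = pred.strip().lower()
        best = 0.0
        for gold in golds:
            g = gold.strip().lower()
            denom = max(len(p), len(g))
            nl = levenshtein(p, g) / denom if denom else 0.0
            best = max(best, 1.0 - nl if nl < threshold else 0.0)
        total += best
    return total / len(predictions)


if __name__ == "__main__":
    preds = ["Invoice 42", "abcd", "xyz"]
    golds = [["invoice 42"], ["abcf", "abce"], ["abc"]]
    print(anls(preds, golds))
```

Papers typically report ANLS scaled to 0-100, so a gain of 6 ANLS points corresponds to 0.06 on the scale returned by this function.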

Summary

DocVAL is a validated chain-of-thought distillation framework designed to improve the spatial grounding of compact vision-language models for document visual question answering. It uses teacher-generated spatial reasoning, a rule-based validator, and a two-stage training procedure to enhance localization accuracy. DocVAL achieves up to 6-7 ANLS points improvement over comparable compact models and introduces mAP as a new localization metric, demonstrating strong spatial grounding performance. The framework enables efficient and trustworthy document grounding without requiring OCR or detection at inference time.

DocVAL 是一种验证过的链式思考蒸馏框架，旨在提高紧凑型视觉-语言模型在文档视觉问答中的空间定位能力。它利用教师生成的空间推理、基于规则的验证器和两阶段训练程序来增强定位准确性。DocVAL 在多个文档理解基准测试中实现了高达 6-7 ANLS 点的改进，并引入了 mAP 作为新的定位指标，展示了强大的空间定位性能。该框架在推理时不需 OCR 或检测，从而实现高效且可信赖的文档定位。
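The ANLS points quoted above refer to Average Normalized Levenshtein Similarity, a common scoring rule for document question answering. The sketch below follows the usual definition (per-question score of 1 - NL when the normalized edit distance NL is below 0.5, otherwise 0, taking the best match over the reference answers). The lowercasing and whitespace stripping are assumptions of this sketch, not details taken from DocVAL.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def anls(predictions, references, threshold=0.5):
    """predictions: list of strings; references: list of lists of accepted answers."""
    total = 0.0
    for pred, golds in zip(predictions, references):
        p = pred.strip().lower()  # normalization is an assumption of this sketch
        best = 0.0
        for gold in golds:
            g = gold.strip().lower()
            denom = max(len(p), len(g))
            nl = levenshtein(p, g) / denom if denom else 0.0
            score = 1.0 - nl if nl < threshold else 0.0
            best = max(best, score)
        total += best
    return total / len(predictions) if predictions else 0.0
```

Scores are usually reported on a 0-100 scale, so a gain of 6-7 points corresponds to roughly 0.06-0.07 in the value returned here.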

UniDoc-RL: Coarse-to-Fine Visual RAG with Hierarchical Actions and Dense Rewards

Authors: Jun Wang, Shuo Tan, Zelong Sun, Tiancheng Gu, Yongle Zhao, Ziyong Feng, Kaicheng Yang, Cewu Lu

First: 2026-04-16T13:03:32+00:00 · Latest: 2026-04-16T13:03:32+00:00

Comments: 17 pages, 11 figures

Abs · PDF · Code1 · Code2

Abstract

Retrieval-Augmented Generation (RAG) extends Large Vision-Language Models (LVLMs) with external visual knowledge. However, existing visual RAG systems typically rely on generic retrieval signals that overlook the fine-grained visual semantics essential for complex reasoning. To address this limitation, we propose UniDoc-RL, a unified reinforcement learning framework in which an LVLM agent jointly performs retrieval, reranking, active visual perception, and reasoning. UniDoc-RL formulates visual information acquisition as a sequential decision-making problem with a hierarchical action space. Specifically, it progressively refines visual evidence from coarse-grained document retrieval to fine-grained image selection and active region cropping, allowing the model to suppress irrelevant content and attend to information-dense regions. For effective end-to-end training, we introduce a dense multi-reward scheme that provides task-aware supervision for each action. Based on Group Relative Policy Optimization (GRPO), UniDoc-RL aligns agent behavior with multiple objectives without relying on a separate value network. To support this training paradigm, we curate a comprehensive dataset of high-quality reasoning trajectories with fine-grained action annotations. Experiments on three benchmarks demonstrate that UniDoc-RL consistently surpasses state-of-the-art baselines, yielding up to 17.7% gains over prior RL-based methods.
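The abstract notes that GRPO avoids a separate value network. A minimal sketch of the group-relative advantage that makes this possible is shown below: each rollout's reward is normalized against the other rollouts sampled for the same query. The reward component names and equal weights are hypothetical placeholders, not the paper's reward design.

```python
import statistics


def total_reward(components, weights):
    """Weighted sum of per-action reward components (names are hypothetical)."""
    return sum(weights[name] * value for name, value in components.items())


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward by the mean and standard deviation of its group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical dense rewards for three rollouts of the same query.
weights = {"retrieval": 0.25, "selection": 0.25, "crop": 0.25, "answer": 0.25}
rollouts = [
    {"retrieval": 1.0, "selection": 1.0, "crop": 0.5, "answer": 1.0},
    {"retrieval": 1.0, "selection": 0.0, "crop": 0.0, "answer": 0.0},
    {"retrieval": 0.0, "selection": 0.0, "crop": 0.0, "answer": 0.0},
]
rewards = [total_reward(r, weights) for r in rollouts]
advantages = group_relative_advantages(rewards)
```

Rollouts that score above the group mean receive positive advantages and the rest negative ones, so the group itself acts as the baseline.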

Dual-Axis Generative Reward Model Toward Semantic and Turn-taking Robustness in Interactive Spoken Dialogue Models

Authors: Yifu Chen, Shengpeng Ji, Zhengqing Liu, Qian Chen, Wen Wang, Ziqing Wang, Yangzhuo Li, Tianle Liang, Zhou Zhao

First: 2026-04-16T12:03:50+00:00 · Latest: 2026-04-16T12:03:50+00:00

Abs · PDF · Code1 · Code2

Abstract

Achieving seamless, human-like interaction remains a key challenge for full-duplex spoken dialogue models (SDMs). Reinforcement learning (RL) has substantially enhanced text- and vision-language models, while well-designed reward signals are crucial for the performance of RL. We consider RL a promising strategy to address the key challenge for SDMs. However, a fundamental barrier persists: prevailing automated metrics for assessing interaction quality rely on superficial proxies, such as behavioral statistics or timing-prediction accuracy, failing to provide reliable reward signals for RL. On the other hand, human evaluations, despite their richness, remain costly, inconsistent, and difficult to scale. We tackle this critical barrier by proposing a Dual-Axis Generative Reward Model, which is trained to understand complex interaction dynamics using a detailed taxonomy and an annotated dataset, produces a single score and, crucially, provides separate evaluations for semantic quality and interaction timing. Such dual outputs furnish precise diagnostic feedback for SDMs and deliver a dependable, instructive reward signal suitable for online reinforcement learning. Our model achieves state-of-the-art performance on interaction-quality assessment across a wide spectrum of datasets, spanning synthetic dialogues and complex real-world interactions.

中文标题/摘要

标题：面向交互式语音对话模型语义与话轮转换鲁棒性的双轴生成奖励模型

实现无缝、类人交互仍然是全双工语音对话模型（SDMs）的关键挑战。强化学习（RL）显著提升了文本和视觉语言模型，而精心设计的奖励信号对于RL的性能至关重要。我们认为RL是解决SDMs关键挑战的一种有前途的策略。然而，一个根本性的障碍仍然存在：现有的评估交互质量的自动化指标依赖于表面的代理指标，如行为统计或时间预测准确性，无法为RL提供可靠的奖励信号。另一方面，尽管人类评估具有丰富的信息，但仍然成本高昂、不一致且难以扩展。我们通过提出一种双轴生成奖励模型来应对这一关键障碍，该模型使用详细的分类体系和标注数据集来理解复杂的交互动态，生成单一评分，并且最关键的是，分别对语义质量和交互时间进行评估。这种双重输出为SDMs提供精确的诊断反馈，并提供一种可靠且有指导意义的奖励信号，适用于在线强化学习。我们的模型在广泛的数据集上实现了交互质量评估的最新性能，涵盖了合成对话和复杂的现实世界交互。

ADAPT: Benchmarking Commonsense Planning under Unspecified Affordance Constraints

Authors: Pei-An Chen, Yong-Ching Liang, Jia-Fong Yeh, Hung-Ting Su, Yi-Ting Chen, Min Sun, Winston Hsu

First: 2026-04-16T11:46:30+00:00 · Latest: 2026-04-16T11:46:30+00:00

Abs · PDF · Code1 · Code2

Abstract

Intelligent embodied agents should not simply follow instructions, as real-world environments often involve unexpected conditions and exceptions. However, existing methods usually focus on directly executing instructions, without considering whether the target objects can actually be manipulated, meaning they fail to assess available affordances. To address this limitation, we introduce DynAfford, a benchmark that evaluates embodied agents in dynamic environments where object affordances may change over time and are not specified in the instruction. DynAfford requires agents to perceive object states, infer implicit preconditions, and adapt their actions accordingly. To enable this capability, we introduce ADAPT, a plug-and-play module that augments existing planners with explicit affordance reasoning. Experiments demonstrate that incorporating ADAPT significantly improves robustness and task success across both seen and unseen environments. We also show that a domain-adapted, LoRA-finetuned vision-language model used as the affordance inference backend outperforms a commercial LLM (GPT-4o), highlighting the importance of task-aligned affordance grounding.

Summary / 总结

The research aims to improve the adaptability of intelligent embodied agents by addressing the limitations of existing methods that do not consider object affordances. ADAPT, a plug-and-play module, is introduced to augment existing planners with explicit affordance reasoning. Experiments show that incorporating ADAPT enhances robustness and task success in both seen and unseen environments. Additionally, a domain-adapted, LoRA-finetuned vision-language model outperforms GPT-4o in affordance inference, emphasizing the importance of task-aligned grounding.

研究针对现有方法不考虑物体可供性、因而在存在意外条件的现实环境中失败的问题，提出了DynAfford基准，用于在物体可供性可能随时间变化且未在指令中说明的动态环境中评估具身智能体。研究还提出了即插即用模块ADAPT，为现有规划器增加显式的可供性推理，并证明其可以显著提高鲁棒性和任务成功率。此外，经过领域适应和LoRA微调的视觉-语言模型在可供性推理方面优于GPT-4o，突显了任务对齐的可供性定位的重要性。

Reasoning Dynamics and the Limits of Monitoring Modality Reliance in Vision-Language Models

Authors: Danae Sánchez Villegas, Samuel Lewis-Lim, Nikolaos Aletras, Desmond Elliott

First: 2026-04-16T11:28:53+00:00 · Latest: 2026-04-16T11:28:53+00:00

Abs · PDF · Code1 · Code2

Abstract

Recent advances in vision language models (VLMs) offer reasoning capabilities, yet how these unfold and integrate visual and textual information remains unclear. We analyze reasoning dynamics in 18 VLMs covering instruction-tuned and reasoning-trained models from two different model families. We track confidence over Chain-of-Thought (CoT), measure the corrective effect of reasoning, and evaluate the contribution of intermediate reasoning steps. We find that models are prone to answer inertia, in which early commitments to a prediction are reinforced, rather than revised during reasoning steps. While reasoning-trained models show stronger corrective behavior, their gains depend on modality conditions, from text-dominant to vision-only settings. Using controlled interventions with misleading textual cues, we show that models are consistently influenced by these cues even when visual evidence is sufficient, and assess whether this influence is recoverable from CoT. Although this influence can appear in the CoT, its detectability varies across models and depends on what is being monitored. Reasoning-trained models are more likely to explicitly refer to the cues, but their longer and fluent CoTs can still appear visually grounded while actually following textual cues, obscuring modality reliance. In contrast, instruction-tuned models refer to the cues less explicitly, but their shorter traces reveal inconsistencies with the visual input. Taken together, these findings indicate that CoT provides only a partial view of how different modalities drive VLM decisions, with important implications for the transparency and safety of multimodal systems.

Summary / 总结

The study investigates the reasoning dynamics in 18 vision-language models, focusing on how these models integrate visual and textual information. By tracking confidence over Chain-of-Thought (CoT) and evaluating the corrective effect of reasoning, the research finds that models often exhibit answer inertia, reinforcing early predictions rather than revising them. Reasoning-trained models show stronger corrective behavior, but their effectiveness varies with modality conditions. The study also reveals that models are influenced by misleading textual cues, even when visual evidence is sufficient, and that this influence is not always detectable in the CoT, highlighting the limitations of monitoring modality reliance in VLMs.

研究分析了18个视觉语言模型的推理动态，关注这些模型如何整合视觉和文本信息。通过跟踪推理过程中的置信度和评估推理的矫正效果，研究发现模型往往表现出答案惯性，即早期预测被强化而不是被修正。推理训练的模型显示出更强的矫正行为，但其效果在不同模态条件下有所不同。研究还发现，模型即使在有足够的视觉证据时也会受到误导性文本提示的影响，而这种影响在推理过程中的链式思考（CoT）中并不总是可检测的，这突显了监测不同模态依赖性的局限性。

RACER: Retrieval-Augmented Contextual Rapid Speculative Decoding

Authors: Zihong Zhang, Zuchao Li, Lefei Zhang, Ping Wang, Hai Zhao

Venue: ACL 2026

First: 2026-04-16T11:23:55+00:00 · Latest: 2026-04-16T11:23:55+00:00

Comments: Accepted to Findings of ACL 2026

Abs · PDF · Code1 · Code2 · Code3

Abstract

Autoregressive decoding in Large Language Models (LLMs) generates one token per step, causing high inference latency. Speculative decoding (SD) mitigates this through a guess-and-verify strategy, but existing training-free variants face trade-offs: retrieval-based drafts break when no exact match exists, while logits-based drafts lack structural guidance. We propose RACER (Retrieval-Augmented Contextual Rapid Speculative Decoding), a lightweight and training-free method that integrates retrieved exact patterns with logit-driven future cues. This unification supplies both reliable anchors and flexible extrapolation, yielding richer speculative drafts. Experiments on Spec-Bench, HumanEval, and MGSM-ZH demonstrate that RACER consistently accelerates inference, achieving more than 2× speedup over autoregressive decoding, and outperforms prior training-free methods, offering a scalable, plug-and-play solution for efficient LLM decoding. Our source code is available at https://github.com/hkr04/RACER.

中文标题/摘要

标题：RACER：检索增强的上下文快速推测解码

大型语言模型（LLMs）中的自回归解码每次生成一个标记，导致高推理延迟。推测解码（SD）通过猜测和验证策略缓解了这一问题，但现有的无训练版本存在权衡：基于检索的草稿在没有完全匹配时会失效，而基于logits的草稿缺乏结构指导。我们提出了一种轻量级且无训练的方法——RACER（Retrieval-Augmented Contextual Rapid Speculative Decoding），该方法将检索到的精确模式与logits驱动的未来线索结合起来。这种结合提供了可靠的锚点和灵活的外推，生成更丰富的推测草稿。在Spec-Bench、HumanEval和MGSM-ZH上的实验表明，RACER一致地加速了推理，比自回归解码快2倍以上，并优于之前的无训练方法，提供了一种可扩展的、即插即用的高效LLM解码解决方案。我们的源代码可在https://github.com/hkr04/RACER 获取。
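The guess-and-verify strategy described above can be illustrated with a toy example. This sketch covers only a generic retrieval-based draft (copy what followed an earlier occurrence of the current suffix) and greedy verification. It does not implement RACER's combination with logit-driven cues, and the function names and the toy `next_token` model are hypothetical.

```python
def retrieve_draft(tokens, max_suffix=3, draft_len=4):
    """Find the most recent earlier occurrence of the current suffix and
    return the tokens that followed it as a speculative draft."""
    for n in range(min(max_suffix, len(tokens) - 1), 0, -1):
        suffix = tokens[-n:]
        for start in range(len(tokens) - n - 1, -1, -1):
            if tokens[start:start + n] == suffix:
                continuation = tokens[start + n:start + n + draft_len]
                if continuation:
                    return continuation
    return []  # no exact match: the retrieval-only draft is empty


def verify(tokens, draft, next_token):
    """Greedy verification: accept draft tokens while they match the target
    model, replace the first mismatch, and add one extra token if all match.
    A real system would score the whole draft in one parallel forward pass;
    the sequential calls here are for clarity only."""
    accepted = []
    for tok in draft:
        expected = next_token(tokens + accepted)
        if tok != expected:
            accepted.append(expected)
            return accepted
        accepted.append(tok)
    accepted.append(next_token(tokens + accepted))
    return accepted


def toy_next_token(seq):
    """Hypothetical stand-in for the target model: counts modulo 5."""
    return (seq[-1] + 1) % 5


context = [0, 1, 2, 3, 4, 0, 1]
draft = retrieve_draft(context)
step = verify(context, draft, toy_next_token)
```

Because every accepted token is checked against the target model, the output matches plain greedy decoding; the gain comes from emitting several tokens per verification step when the draft is good.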

MetaDent: Labeling Clinical Images for Vision-Language Models in Dentistry

Authors: Meng-Xun Li, Wen-Hui Deng, Zhi-Xing Wu, Chun-Xiao Jin, Jia-Min Wu, Yue Han, James Kit Hon Tsoi, Gui-Song Xia, Cui Huang

Venue: Journal of Dental Research, p.00220345261424242 (2026)

First: 2026-04-16T10:56:54+00:00 · Latest: 2026-04-16T10:56:54+00:00

Comments: Project website: https://menxli.github.io/metadent

Abs · PDF · Code1 · Code2 · Project1

Abstract

Vision-Language Models (VLMs) have demonstrated significant potential in medical image analysis, yet their application in intraoral photography remains largely underexplored due to the lack of fine-grained, annotated datasets and comprehensive benchmarks. To address this, we present MetaDent, a comprehensive resource that includes (1) a novel and large-scale dentistry image dataset collected from clinical, public, and web sources; (2) a semi-structured annotation framework designed to capture the hierarchical and clinically nuanced nature of dental photography; and (3) comprehensive benchmark suites for evaluating state-of-the-art VLMs on clinical image understanding. Our labeling approach combines a high-level image summary with point-by-point, free-text descriptions of abnormalities. This method enables rich, scalable, and task-agnostic representations. We curated 60,669 dental images from diverse sources and annotated a representative subset of 2,588 images using this meta-labeling scheme. Leveraging Large Language Models (LLMs), we derive standardized benchmarks: approximately 15K Visual Question Answering (VQA) pairs and an 18-class multi-label classification dataset, which we validated with human review and error analysis to justify that the LLM-driven transition reliably preserves fidelity and semantic accuracy. We then evaluate state-of-the-art VLMs across VQA, classification, and image captioning tasks. Quantitative results reveal that even the most advanced models struggle with a fine-grained understanding of intraoral scenes, achieving moderate accuracy and producing inconsistent or incomplete descriptions in image captioning. We publicly release our dataset, annotations, and tools to foster reproducible research and accelerate the development of vision-language systems for dental applications.

中文标题/摘要

标题：MetaDent：牙科临床图像的视觉-语言模型标注

视觉-语言模型（VLMs）在医学图像分析中展现了显著的潜力，但由于缺乏细粒度的标注数据集和全面的基准测试，它们在口内摄影中的应用仍未得到充分探索。为解决这一问题，我们提出了MetaDent，一个综合资源，包括（1）从临床、公共和网络来源收集的新型大规模牙科图像数据集；（2）一种半结构化的标注框架，用于捕捉牙科摄影的层次化和临床细微差别；以及（3）用于评估最先进的VLMs在临床图像理解上的综合基准测试套件。我们的标注方法将高层次的图像摘要与对异常的逐点自由文本描述相结合。这种方法能够提供丰富、可扩展且任务无关的表示。我们从多种来源收集整理了60,669张牙科图像，并使用这种元标注方案标注了其中具有代表性的2,588张图像。利用大型语言模型（LLMs），我们推导出标准化基准：约15,000个视觉问答（VQA）对和一个包含18个类别的多标签分类数据集，并通过人工审查和错误分析验证了LLM驱动的转换能够可靠地保持保真度和语义准确性。然后，我们在视觉问答、分类和图像描述任务上评估了最先进的VLMs。定量结果显示，即使是最先进的模型也难以细粒度地理解口内场景，仅达到中等准确度，并在图像描述中产生不一致或不完整的描述。我们公开发布数据集、标注和工具，以促进可重复研究并加速面向牙科应用的视觉-语言系统的发展。

Revisiting Compositionality in Dual-Encoder Vision-Language Models: The Role of Inference

Authors: Imanol Miranda, Ander Salaberria, Eneko Agirre, Gorka Azkune

First: 2026-04-13T14:03:18+00:00 · Latest: 2026-04-16T10:51:43+00:00

Abs · PDF · Code1 · Code2

Abstract

Dual-encoder Vision-Language Models (VLMs) such as CLIP are often characterized as bag-of-words systems due to their poor performance on compositional benchmarks. We argue that this limitation may stem less from deficient representations than from the standard inference protocol based on global cosine similarity. First, through controlled diagnostic experiments, we show that explicitly enforcing fine-grained region-segment alignment at inference dramatically improves compositional performance without updating pretrained encoders. We then introduce a lightweight transformer that learns such alignments directly from frozen patch and token embeddings. Comparing against full fine-tuning and prior end-to-end compositional training methods, we find that although these approaches improve in-domain retrieval, their gains do not consistently transfer under distribution shift. In contrast, learning localized alignment over frozen representations matches full fine-tuning on in-domain retrieval while yielding substantial improvements on controlled out-of-domain compositional benchmarks. These results identify global embedding matching as a key bottleneck in dual-encoder VLMs and highlight the importance of alignment mechanisms for robust compositional generalization.

中文标题/摘要

标题：重新审视双编码器视觉-语言模型中的组合性：推理的作用

双编码器视觉-语言模型（VLMs）如CLIP常被描述为词袋系统，因为它们在组合性基准测试中的表现不佳。我们认为，这一局限性可能更多地源于基于全局余弦相似度的标准推理协议，而非表示本身的不足。首先，通过受控的诊断实验，我们表明，在推理时显式强制细粒度的区域-片段对齐可显著提高组合性性能，而无需更新预训练编码器。然后，我们引入了一个轻量级的Transformer，它直接从冻结的图像块和词元嵌入中学习这类对齐。与完全微调和先前的端到端组合性训练方法相比，我们发现，尽管这些方法改进了领域内检索，但其收益在分布偏移下并不能稳定迁移。相比之下，在冻结表示上学习局部对齐在领域内检索方面与完全微调相当，同时在受控的领域外组合性基准测试中取得了显著改进。这些结果表明全局嵌入匹配是双编码器VLMs的关键瓶颈，并强调了对齐机制对于稳健的组合性泛化的重要性。
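To make the contrast between global embedding matching and localized alignment concrete, the sketch below scores one image-text pair both ways. Mean pooling stands in for the model's global embedding, and the token-to-best-patch maximum is a generic late-interaction score. Both are assumptions chosen for illustration and are not the alignment mechanism proposed in the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def mean_pool(vectors):
    """Average a list of vectors into one global vector."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def global_score(patches, tokens):
    """Single cosine similarity between pooled image and pooled text vectors."""
    return cosine(mean_pool(patches), mean_pool(tokens))


def localized_score(patches, tokens):
    """For each text token, take its best-matching patch, then average."""
    return sum(max(cosine(t, p) for p in patches) for t in tokens) / len(tokens)


# Toy 2-D embeddings: two image patches and one text token.
patches = [[1.0, 0.0], [0.0, 1.0]]
tokens = [[1.0, 0.0]]
g = global_score(patches, tokens)
l = localized_score(patches, tokens)
```

In this toy case the token matches one patch exactly, which the localized score reflects (1.0), while pooling dilutes the match in the global score (about 0.71).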

Zero-Shot Retail Theft Detection via Orchestrated Vision Models: A Model-Agnostic, Cost-Effective Alternative to Trained Single-Model Systems

Authors: Haileab Yagersew

First: 2026-04-16T10:32:20+00:00 · Latest: 2026-04-16T10:32:20+00:00

Comments: 16 pages, 3 figures, Code to be released at https://github.com/xHaileab/Paza-AI

Abs · PDF · Code1 · Code2 · Code3

Abstract

Retail theft costs the global economy over $100 billion annually, yet existing AI-based detection systems require expensive custom model training on proprietary datasets and charge $200-500/month per store. We present Paza, a zero-shot retail theft detection framework that achieves practical concealment detection without training any model. Our approach orchestrates multiple existing models in a layered pipeline - cheap object detection and pose estimation running continuously, with an expensive vision-language model (VLM) invoked only when behavioral pre-filters trigger. A multi-signal suspicion pre-filter (requiring dwell time plus at least one behavioral signal) reduces VLM invocations by 240x compared to per-frame analysis, bounding calls to <=10/minute and enabling a single GPU to serve 10-20 stores. The architecture is model-agnostic: the VLM component accepts any OpenAI-compatible endpoint, enabling operators to swap between models such as Gemma 4, Qwen3.5-Omni, GPT-4o, or future releases without code changes - ensuring the system improves as the VLM landscape evolves. We evaluate the VLM component on the DCSASS synthesized shoplifting dataset (169 clips, controlled environment), achieving 89.5% precision and 92.8% specificity at 59.3% recall zero-shot - where the recall gap is attributable to sparse frame sampling in offline evaluation rather than VLM reasoning failures, as precision and specificity are the operationally critical metrics determining false alarm rates. We present a detailed cost model showing viability at $50-100/month per store (3-10x cheaper than commercial alternatives), and introduce a privacy-preserving design that obfuscates faces in the detection pipeline. The source code is available at https://github.com/xHaileab/Paza-AI.

Summary / 总结

The paper presents Paza, a zero-shot retail theft detection framework that uses a layered pipeline of existing models to achieve practical concealment detection without custom training. The approach involves continuous running of cheap object detection and pose estimation models, with an expensive vision-language model invoked only when behavioral pre-filters are triggered. This reduces VLM invocations by 240x, enabling a single GPU to serve 10-20 stores. The VLM component, which can accept any OpenAI-compatible endpoint, achieves 89.5% precision and 92.8% specificity at 59.3% recall, making the system cost-effective at $50-100/month per store, significantly cheaper than commercial alternatives. The framework also includes a privacy-preserving design that obfuscates faces in the detection pipeline.

论文介绍了Paza，这是一种零样本零售盗窃检测框架，使用现有模型的分层管道来实现实际的藏匿检测，无需进行定制训练。该方法包括连续运行廉价的对象检测和姿态估计模型，仅在行为预过滤器触发时才调用昂贵的视觉语言模型。这将VLM调用次数减少了240倍，使单个GPU能够为10-20家商店提供服务。VLM组件可以接受任何OpenAI兼容的端点，实现了89.5%的精确度和92.8%的特异性，召回率为59.3%，使系统在每家商店50-100美元/月的成本下具有经济性，远低于商业替代品。该框架还包括一种保护隐私的设计，可以在检测管道中模糊人脸。
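The pre-filter logic described in the abstract (dwell time plus at least one behavioral signal, with VLM calls capped at 10 per minute) can be sketched as a small gate in front of the expensive model. The class name, the 8-second dwell threshold, and the signal names are hypothetical; only the rule structure and the per-minute cap come from the abstract.

```python
from collections import deque


class SuspicionPrefilter:
    """Decides whether a tracked person warrants an expensive VLM call."""

    def __init__(self, min_dwell_s=8.0, max_calls_per_min=10):
        self.min_dwell_s = min_dwell_s          # hypothetical threshold
        self.max_calls_per_min = max_calls_per_min
        self.calls = deque()                    # timestamps of recent VLM calls

    def should_invoke(self, now_s, dwell_s, signals):
        # Rule from the abstract: dwell time AND at least one behavioral signal.
        if dwell_s < self.min_dwell_s or not any(signals.values()):
            return False
        # Sliding one-minute window that bounds the VLM call rate.
        while self.calls and now_s - self.calls[0] >= 60.0:
            self.calls.popleft()
        if len(self.calls) >= self.max_calls_per_min:
            return False
        self.calls.append(now_s)
        return True


gate = SuspicionPrefilter()
# Signal names below are illustrative placeholders.
decision = gate.should_invoke(
    now_s=0.0, dwell_s=12.0,
    signals={"hand_near_pocket": True, "looking_around": False},
)
```

Keeping the cheap checks first means most frames never reach the rate limiter, which is what lets a single GPU be shared across stores.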

POP: Prefill-Only Pruning for Efficient Large Model Inference

Authors: Junhui He, Zhihui Fu, Jun Wang, Qingan Li

First: 2026-02-03T09:22:26+00:00 · Latest: 2026-04-16T10:22:57+00:00

Abs · PDF · Code1 · Code2

Abstract

Large Language Models (LLMs) and Vision-Language Models (VLMs) have demonstrated remarkable capabilities. However, their deployment is hindered by significant computational costs. Existing structured pruning methods, while hardware-efficient, often suffer from significant accuracy degradation. In this paper, we argue that this failure stems from a stage-agnostic pruning approach that overlooks the asymmetric roles between the prefill and decode stages. By introducing a virtual gate mechanism, our importance analysis reveals that deep layers are critical for next-token prediction (decode) but largely redundant for context encoding (prefill). Leveraging this insight, we propose Prefill-Only Pruning (POP), a stage-aware inference strategy that safely omits deep layers during the computationally intensive prefill stage while retaining the full model for the sensitive decode stage. To enable the transition between stages, we introduce independent Key-Value (KV) projections to maintain cache integrity, and a boundary handling strategy to ensure the accuracy of the first generated token. Extensive experiments on Llama-3.1, Qwen3-VL, and Gemma-3 across diverse modalities demonstrate that POP achieves up to 1.37× speedup in prefill latency with minimal performance loss, effectively overcoming the accuracy-efficiency trade-off limitations of existing structured pruning methods.

中文标题/摘要

标题：POP：仅预填充剪枝以提高大型模型推理效率

大型语言模型（LLMs）和视觉-语言模型（VLMs）展现了显著的能力。然而，它们的部署受到显著计算成本的阻碍。现有的结构化剪枝方法虽然在硬件效率方面表现出色，但往往会导致显著的准确率下降。在本文中，我们认为这种失败源于一种不区分阶段的剪枝方法，它忽视了预填充和解码阶段之间的不对称作用。通过引入虚拟门机制，我们的重要性分析表明，深层对于下一个标记的预测（解码）至关重要，但在上下文编码（预填充）中则基本冗余。利用这一洞察，我们提出了仅预填充剪枝（POP），这是一种阶段感知的推理策略，在计算密集的预填充阶段安全地省略深层，而在敏感的解码阶段保留完整模型。为了在阶段之间实现过渡，我们引入了独立的键-值（KV）投影以保持缓存的完整性，并采用边界处理策略以确保生成的第一个标记的准确性。在Llama-3.1、Qwen3-VL和Gemma-3上跨多种模态的广泛实验表明，POP在预填充延迟上实现了高达1.37倍的加速，同时性能损失极小，有效地克服了现有结构化剪枝方法的准确率-效率权衡限制。

Summary / 总结

This paper addresses the computational challenges of deploying large language models and vision-language models by proposing Prefill-Only Pruning (POP), a stage-aware inference strategy. POP identifies that deep layers are crucial for the decode stage but redundant for the prefill stage, and thus safely omits them during the prefill stage while retaining the full model for the decode stage. The method introduces independent Key-Value projections and a boundary handling strategy to maintain cache integrity and ensure accuracy. Experiments show that POP achieves up to 1.37 times speedup in prefill latency with minimal performance loss.

本文提出了一种阶段感知的推理策略Prefill-Only Pruning (POP)，以解决大规模语言模型和视觉-语言模型的计算挑战。POP发现深层在解码阶段至关重要，但在预填充阶段冗余，因此在预填充阶段安全地省略它们，而在解码阶段保留完整模型。该方法引入了独立的Key-Value投影和边界处理策略，以保持缓存完整性并确保准确性。实验表明，POP在预填充延迟上最多可实现1.37倍的加速，同时性能损失极小。

MEBench: A Novel Benchmark for Understanding Mutual Exclusivity Bias in Vision-Language Models

Authors: Anh Thai, Stefan Stojanov, Zixuan Huang, Bikram Boote, James M. Rehg

First: 2025-05-26T15:23:18+00:00 · Latest: 2026-04-16T10:15:26+00:00

Abs · PDF · Code1 · Code2 · Project1

Abstract

This paper introduces MEBench, a novel benchmark for evaluating mutual exclusivity (ME) bias, a cognitive phenomenon observed in children during word learning. Unlike traditional ME tasks, MEBench further incorporates spatial reasoning to create more challenging and realistic evaluation settings. To facilitate controlled experimentation, we also present a flexible and scalable data generation pipeline that supports the construction of diverse annotated scenes. We assess the performance of various vision-language models (VLMs) on this benchmark using novel evaluation metrics that capture key aspects of ME-based reasoning. We find that these VLMs exhibit weak ME bias, while showing some ability to leverage extra spatial context to resolve ambiguity in multiple novel object settings. Project page: http://mebench.github.io/.

中文标题/摘要

标题：MEBench：一种理解视觉语言模型中互斥偏见的新基准

本文介绍了MEBench，这是一种用于评估互斥（ME）偏见的新基准，ME偏见是儿童在词汇学习过程中观察到的一种认知现象。与传统的ME任务不同，MEBench进一步结合了空间推理，以创建更具挑战性和现实性的评估环境。为了便于控制实验，我们还提出了一种灵活且可扩展的数据生成管道，支持构建多样化的标注场景。我们使用新颖的评估指标来评估各种视觉语言模型（VLMs）在该基准上的性能，这些指标捕捉了ME推理的关键方面。我们发现这些VLMs表现出较弱的ME偏见，但在解决多个新物体设置中的歧义时能够利用额外的空间上下文。项目页面：http://mebench.github.io/

Summary / 总结

MEBench is a new benchmark designed to evaluate mutual exclusivity (ME) bias in vision-language models, incorporating spatial reasoning to create more realistic evaluation scenarios. It includes a flexible data generation pipeline for diverse annotated scenes. The study finds that vision-language models show weak ME bias but can use spatial context to resolve ambiguities in multiple novel object settings.

MEBench 是一个新的基准，用于评估视觉-语言模型中的互斥性（ME）偏差，结合了空间推理以创建更现实的评估场景。它包含一个灵活的数据生成管道，用于构建多样化的标注场景。研究发现，视觉-语言模型表现出较弱的ME偏差，但可以利用空间上下文来解决多个新物体设置中的歧义。

Knowing When Not to Answer: Evaluating Abstention in Multimodal Reasoning Systems

Authors: Nishanth Madhusudhan, Vikas Yadav, Alexandre Lacoste

First: 2026-04-16T09:23:22+00:00 · Latest: 2026-04-16T09:23:22+00:00

Comments: 10 pages and 4 figures (excluding appendix)

Abs · PDF · Code1 · Code2

Abstract

Effective abstention (EA), recognizing evidence insufficiency and refraining from answering, is critical for reliable multimodal systems. Yet existing evaluation paradigms for vision-language models (VLMs) and multi-agent systems (MAS) assume answerability, pushing models to always respond. Abstention has been studied in text-only settings but remains underexplored multimodally; current benchmarks either ignore unanswerability or rely on coarse methods that miss realistic failure modes. We introduce MM-AQA, a benchmark that constructs unanswerable instances from answerable ones via transformations along two axes: visual modality dependency and evidence sufficiency. Evaluating three frontier VLMs spanning closed and open-source models and two MAS architectures across 2079 samples, we find: (1) under standard prompting, VLMs rarely abstain; even simple confidence baselines outperform this setup, (2) MAS improves abstention but introduces an accuracy-abstention trade-off, (3) sequential designs match or exceed iterative variants, suggesting the bottleneck is miscalibration rather than reasoning depth, and (4) models abstain when image or text evidence is absent, but attempt reconciliation with degraded or contradictory evidence. Effective multimodal abstention requires abstention-aware training rather than better prompting or more agents.

AIM: Asymmetric Information Masking for Visual Question Answering Continual Learning

Authors: Peifeng Zhang, Zice Qiu, Donghua Yu, Shilei Cao, Juepeng Zheng, Yutong Lu, Haohuan Fu

Venue: ACM MM 2026

First: 2026-04-16T08:39:02+00:00 · Latest: 2026-04-16T08:39:02+00:00

Comments: 18 pages, 9 figures. Submitted to ACM MM 2026

Abs · PDF · Code1 · Code2

Abstract

In continual visual question answering (VQA), existing Continual Learning (CL) methods are mostly built for symmetric, unimodal architectures. However, modern Vision-Language Models (VLMs) violate this assumption, as their trainable components are inherently asymmetric. This structural mismatch renders VLMs highly prone to catastrophic forgetting when learning from continuous data streams. Specifically, the asymmetry causes standard global regularization to favor the massive language decoder during optimization, leaving the smaller but critical visual projection layers highly vulnerable to interference. Consequently, this localized degradation leads to a severe loss of compositional reasoning capabilities. To address this, we propose Asymmetric Information Masking (AIM), which balances stability and plasticity by applying targeted masks based on modality-specific sensitivity. Experiments on VQA v2 and GQA under continual VQA settings show that AIM achieves state-of-the-art performance in both Average Performance (AP) and Average Forgetting (AF), while better preserving generalization to novel skill-concept compositions.

Zero-Effort Image-to-Music Generation: An Interpretable RAG-based VLM Approach

Authors: Zijian Zhao, Dian Jin, Zijing Zhou

First: 2025-09-26T14:07:29+00:00 · Latest: 2026-04-16T08:15:21+00:00

Abs · PDF · Code1 · Code2 · Code3

Abstract

Recently, Image-to-Music (I2M) generation has garnered significant attention, with potential applications in fields such as gaming, advertising, and multi-modal art creation. However, due to the ambiguous and subjective nature of I2M tasks, most end-to-end methods lack interpretability, leaving users puzzled about the generation results. Even methods based on emotion mapping face controversy, as emotion represents only a singular aspect of art. Additionally, most learning-based methods require substantial computational resources and large datasets for training, hindering accessibility for common users. To address these challenges, we propose the first Vision Language Model (VLM)-based I2M framework that offers high interpretability and low computational cost. Specifically, we utilize ABC notation to bridge the text and music modalities, enabling the VLM to generate music using natural language. We then apply multi-modal Retrieval-Augmented Generation (RAG) and self-refinement techniques to allow the VLM to produce high-quality music without external training. Furthermore, we leverage the generated motivations in text and the attention maps from the VLM to provide explanations for the generated results in both text and image modalities. To validate our method, we conduct both human studies and machine evaluations, where our method outperforms others in terms of music quality and music-image consistency, indicating promising results. Our code is available at https://github.com/RS2002/Image2Music .

Summary / 总结

The paper addresses the challenges of generating music from images by proposing a Vision Language Model (VLM)-based framework that offers high interpretability and low computational cost. The method uses ABC notation to bridge text and music modalities, enabling the VLM to generate music through natural language. Multi-modal Retrieval-Augmented Generation (RAG) and self-refinement techniques are applied to produce high-quality music without external training. The method also provides explanations for the generated results using generated motivations and attention maps. Experimental results show that the proposed method outperforms others in terms of music quality and consistency with images.

论文提出了一种基于VLM的框架，以高可解释性和低计算成本应对从图像生成音乐的挑战。该方法使用ABC记谱法连接文本和音乐模态，并采用多模态RAG和自我精炼技术生成高质量音乐，而无需额外训练。该框架利用生成的创作动机和注意力图，在文本和图像两种模态上为生成结果提供解释，并在人类和机器评估中在音乐质量以及音乐与图像的一致性方面优于其他方法。

SGA-MCTS: Decoupling Planning from Execution via Training-Free Atomic Experience Retrieval

Authors: Xin Xie, Dongyun Xue, Wuguannan Yao, Mingxiao Feng, Wengang Zhou, Xiang Qi, Houqiang Li, Peng Zhang

First: 2026-04-16T07:22:36+00:00 · Latest: 2026-04-16T07:22:36+00:00

Abs · PDF · Code1 · Code2

Abstract

LLM-powered systems require complex multi-step decision-making abilities to solve real-world tasks, yet current planning approaches face a trade-off between the high latency of inference-time search and the limited generalization of supervised fine-tuning. To address this limitation, we introduce SGA-MCTS, a framework that casts LLM planning as non-parametric retrieval. Offline, we leverage Monte Carlo Tree Search (MCTS) to explore the solution space and distill high-fidelity trajectories into State-Goal-Action (SGA) atoms. These atoms are de-lexicalized primitives that abstract concrete entities into symbolic slots, preserving reusable causal logic while discarding domain-specific noise. Online, a retrieval-augmented agent employs a hybrid symbolic-semantic mechanism to fetch relevant SGAs and re-ground them into the current context as soft reasoning hints. Empirical results on complex benchmarks demonstrate that this paradigm enables frozen, open-weights models to match the performance of SOTA systems (e.g., GPT-5) without task-specific fine-tuning. By effectively amortizing the heavy computational cost of search, SGA-MCTS achieves System 2 reasoning depth at System 1 inference speeds, rendering autonomous planning both scalable and real-time feasible.

Summary / 总结

SGA-MCTS addresses the trade-off in LLM planning between the high latency of inference-time search and the limited generalization of supervised fine-tuning. Offline, it uses Monte Carlo Tree Search to explore the solution space and distills the resulting trajectories into reusable, de-lexicalized State-Goal-Action (SGA) atoms. Online, a retrieval-augmented agent fetches relevant atoms and re-grounds them in the current context as soft reasoning hints. The authors report that this lets frozen, open-weights models match SOTA systems without task-specific fine-tuning, making autonomous planning scalable and feasible in real time.
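
To make the de-lexicalize, retrieve, and re-ground steps concrete, here is a minimal Python sketch. It is not the authors' implementation: the `SGAAtom` fields, the `<CITY>`-style slot syntax, the example atoms, and the scoring rule are all hypothetical, and plain token overlap stands in for whatever semantic similarity the paper actually uses.

```python
import re
from dataclasses import dataclass

SLOT = re.compile(r"<[A-Z_]+>")  # hypothetical slot syntax, e.g. <CITY>


@dataclass(frozen=True)
class SGAAtom:
    state: str   # de-lexicalized description of the situation
    goal: str    # de-lexicalized description of the objective
    action: str  # de-lexicalized action template


def delexicalize(text, entities):
    """Replace each concrete entity with its symbolic slot."""
    for surface, slot in entities.items():
        text = text.replace(surface, slot)
    return text


def reground(text, bindings):
    """Fill symbolic slots with entities from the current context."""
    for slot, surface in bindings.items():
        text = text.replace(slot, surface)
    return text


def overlap(a, b):
    """Jaccard overlap between two collections of items."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


def score(atom, state, goal, alpha=0.5):
    """Hybrid score: slot overlap (symbolic) plus token overlap (semantic stand-in)."""
    atom_text, query_text = atom.state + " " + atom.goal, state + " " + goal
    symbolic = overlap(SLOT.findall(atom_text), SLOT.findall(query_text))
    semantic = overlap(atom_text.lower().split(), query_text.lower().split())
    return alpha * symbolic + (1 - alpha) * semantic


def retrieve(atoms, state, goal, k=1):
    """Return the k atoms that best match the de-lexicalized query."""
    return sorted(atoms, key=lambda a: score(a, state, goal), reverse=True)[:k]


# Hypothetical atom store, as might be distilled offline from search trajectories.
atoms = [
    SGAAtom("user has no flight to <CITY>", "book flight to <CITY>",
            "search_flights(destination=<CITY>)"),
    SGAAtom("file <PATH> is missing", "restore <PATH>",
            "restore_backup(path=<PATH>)"),
]

entities = {"Paris": "<CITY>"}
query_state = delexicalize("user has no flight to Paris", entities)
query_goal = delexicalize("book flight to Paris", entities)
best = retrieve(atoms, query_state, query_goal)[0]
hint = reground(best.action, {"<CITY>": "Paris"})
print(hint)  # search_flights(destination=Paris)
```

In this sketch the retrieved action is only a hint that would be placed in the prompt of a frozen model. Nothing is executed and no weights are updated, which matches the training-free framing of the abstract.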

G-MIXER: Geodesic Mixup-based Implicit Semantic Expansion and Explicit Semantic Re-ranking for Zero-Shot Composed Image Retrieval

Authors: Jiyoung Lim, Heejae Yang, Jee-Hyong Lee

Venue: CVPR 2026

First: 2026-04-16T07:21:21+00:00 · Latest: 2026-04-16T07:21:21+00:00

Comments: Accepted to CVPR 2026

Abstract

Composed Image Retrieval (CIR) aims to retrieve target images by integrating a reference image with a corresponding modification text. CIR requires jointly considering the explicit semantics specified in the query and the implicit semantics embedded within its bi-modal composition. Recent training-free Zero-Shot CIR (ZS-CIR) methods leverage Multimodal Large Language Models (MLLMs) to generate detailed target descriptions, converting the implicit information into explicit textual expressions. However, these methods rely heavily on the textual modality and fail to capture the fuzzy retrieval nature that requires considering diverse combinations of candidates. This leads to reduced diversity and accuracy in retrieval results. To address this limitation, we propose a novel training-free method, Geodesic Mixup-based Implicit semantic eXpansion and Explicit semantic Re-ranking for ZS-CIR (G-MIXER). G-MIXER constructs composed query features that reflect the implicit semantics of reference image-text pairs through geodesic mixup over a range of mixup ratios, and builds a diverse candidate set. The generated candidates are then re-ranked using explicit semantics derived from MLLMs, improving both retrieval diversity and accuracy. Our proposed G-MIXER achieves state-of-the-art performance across multiple ZS-CIR benchmarks, effectively handling both implicit and explicit semantics without additional training. Our code will be available at https://github.com/maya0395/gmixer.

Summary / 总结

G-MIXER is a training-free method for Zero-Shot Composed Image Retrieval (ZS-CIR). It expands implicit semantics by building composed query features through geodesic mixup over a range of mixup ratios, which yields a diverse candidate set, and then re-ranks the candidates using explicit semantics derived from Multimodal Large Language Models. The authors report improved retrieval diversity and accuracy and state-of-the-art performance on multiple ZS-CIR benchmarks.
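
As a rough illustration of geodesic mixup, the sketch below interpolates along the great circle between two unit-normalized feature vectors (spherical linear interpolation) for several mixup ratios. It assumes the mixup is taken between a reference-image embedding and a modification-text embedding. The function names, the toy 2-D features, and the ratio values are hypothetical, and the paper's exact formulation may differ.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def geodesic_mixup(u, v, lam):
    """Interpolate on the unit hypersphere; lam=0 gives u, lam=1 gives v."""
    u, v = normalize(u), normalize(v)
    dot = max(-1.0, min(1.0, sum(a * b for a, b in zip(u, v))))
    theta = math.acos(dot)  # angle between the two features
    s = math.sin(theta)
    if s < 1e-8:
        # Degenerate case (parallel or antiparallel): fall back to u.
        return u
    w_u = math.sin((1.0 - lam) * theta) / s
    w_v = math.sin(lam * theta) / s
    return [w_u * a + w_v * b for a, b in zip(u, v)]


def composed_queries(image_feat, text_feat, ratios):
    """One composed query feature per mixup ratio."""
    return [geodesic_mixup(image_feat, text_feat, lam) for lam in ratios]


# Toy 2-D features; real embeddings would come from a vision-language encoder.
queries = composed_queries([1.0, 0.0], [0.0, 1.0], ratios=[0.0, 0.5, 1.0])
for q in queries:
    print([round(x, 4) for x in q])
```

Unlike linear mixup, every interpolated feature stays on the unit sphere, so cosine-similarity retrieval treats all composed queries on the same footing. Each ratio would then retrieve its own candidates, and the union of those candidates forms the set that is later re-ranked.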

SAM3-I: Segment Anything with Instructions

Authors: Jingjing Li, Yue Feng, Yuchen Guo, Jincai Huang, Wei Ji, Qi Bi, Yongri Piao, Miao Zhang, Xiaoqi Zhao, Qiang Chen, Shihao Zou, Huchuan Lu, Li Cheng

First: 2025-12-04T09:00:25+00:00 · Latest: 2026-04-16T07:12:40+00:00

Abstract

Segment Anything Model 3 (SAM3) advances open-vocabulary segmentation through promptable concept segmentation, enabling users to segment all instances associated with a given concept using short noun-phrase (NP) prompts. While effective for concept-level grounding, real-world interactions often involve far richer natural-language instructions that combine attributes, relations, actions, states, or implicit reasoning. Currently, SAM3 relies on external multi-modal agents to convert complex instructions into NPs and conducts iterative mask filtering, leading to coarse representations and limited instance specificity. In this work, we present SAM3-I, an instruction-following extension of the SAM family that unifies concept-level grounding and instruction-level reasoning within a single segmentation framework. Built upon SAM3, SAM3-I introduces an instruction-aware cascaded adaptation mechanism with dedicated alignment losses that progressively aligns expressive instruction semantics with SAM3's vision-language representations, enabling direct interpretation of natural-language instructions while preserving its strong concept recall ability. To enable instruction-following learning, we introduce HMPL-Instruct, a large-scale instruction-centric dataset that systematically covers hierarchical instruction semantics and diverse target granularities. Experiments demonstrate that SAM3-I achieves appealing performance across referring and reasoning-based segmentation, showing that SAM3 can be effectively extended to follow complex natural-language instructions without sacrificing its original concept-driven strengths. Code and dataset are available at https://github.com/debby-0527/SAM3-I.

One RL to See Them All: Visual Triple Unified Reinforcement Learning

Authors: Yan Ma, Linge Du, Xuyang Shen, Shaoxiang Chen, Pengfei Li, Qibing Ren, Lizhuang Ma, Yuchao Dai, Pengfei Liu, Junjie Yan

First: 2025-05-23T17:41:14+00:00 · Latest: 2026-04-16T06:56:57+00:00

Comments: Technical Report

Abstract

Reinforcement learning (RL) is becoming an important direction for post-training vision-language models (VLMs), but public training methodologies for unified multimodal RL remain much less mature, especially for heterogeneous reasoning and perception-heavy tasks. We propose V-Triune, a Visual Triple Unified Reinforcement Learning methodology for unified multimodal RL. It organizes training around three coordinated abstractions: Sample-Level Reward Routing, Verifier-Level Outcome Verification, and Source-Level Diagnostics. Within this methodology, Dynamic IoU provides localization-specific reward shaping that avoids reward ambiguity under loose thresholds and reward sparsity under strict ones. Built on V-Triune, we develop Orsta (7B, 32B), a family of models jointly trained on eight reasoning and perception tasks. Under matched budgets, unified training matches or outperforms specialist mixtures. The final Orsta models improve over their backbones on MEGA-Bench, compare favorably with strong multi-task RL-VLM baselines, and transfer these gains to a broad set of downstream benchmarks. These results show that unified RL can improve both reasoning and perception within a single VLM RL pipeline. The V-Triune system, along with the Orsta models, is publicly available at https://github.com/MiniMax-AI/One-RL-to-See-Them-All.
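
The abstract does not spell out how Dynamic IoU is computed, so the following Python sketch shows only one plausible reading: a binary localization reward whose IoU threshold tightens as training progresses. The schedule values, the `progress` argument, and the binary reward form are assumptions made for illustration, not details taken from the paper.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


# Hypothetical schedule: (fraction of training completed, IoU threshold).
SCHEDULE = ((0.0, 0.5), (0.3, 0.75), (0.7, 0.9))


def dynamic_threshold(progress, schedule=SCHEDULE):
    """Pick the threshold of the latest stage that has started."""
    threshold = schedule[0][1]
    for start, value in schedule:
        if progress >= start:
            threshold = value
    return threshold


def dynamic_iou_reward(pred, target, progress):
    """Reward 1.0 when the prediction clears the current threshold, else 0.0."""
    return 1.0 if iou(pred, target) >= dynamic_threshold(progress) else 0.0


pred, target = (0, 0, 10, 10), (0, 0, 10, 8)  # IoU = 80 / 100 = 0.8
early = dynamic_iou_reward(pred, target, progress=0.1)  # threshold 0.5
late = dynamic_iou_reward(pred, target, progress=0.9)   # threshold 0.9
print(early, late)  # 1.0 0.0
```

A loose threshold early on keeps rewards from being sparse while predictions are still rough, and a strict threshold later removes the ambiguity of rewarding imprecise boxes. That is the trade-off the abstract attributes to Dynamic IoU.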

DETR-ViP: Detection Transformer with Robust Discriminative Visual Prompts

Authors: Bo Qian, Dahu Shi, Xing Wei

Venue: ICLR 2026

First: 2026-04-16T06:40:44+00:00 · Latest: 2026-04-16T06:40:44+00:00

Comments: Published as a conference paper at ICLR 2026

Abstract

Visual prompted object detection enables interactive and flexible definition of target categories, thereby facilitating open-vocabulary detection. Since visual prompts are derived directly from image features, they often outperform text prompts in recognizing rare categories. Nevertheless, research on visual prompted detection has been largely overlooked, and it is typically treated as a byproduct of training text prompted detectors, which hinders its development. To fully unlock the potential of visual-prompted detection, we investigate the reasons why its performance is suboptimal and reveal that the underlying issue lies in the absence of global discriminability in visual prompts. Motivated by these observations, we propose DETR-ViP, a robust object detection framework that yields class-distinguishable visual prompts. On top of basic image-text contrastive learning, DETR-ViP incorporates global prompt integration and visual-textual prompt relation distillation to learn more discriminative prompt representations. In addition, DETR-ViP employs a selective fusion strategy that ensures stable and robust detection. Extensive experiments on COCO, LVIS, ODinW, and Roboflow100 demonstrate that DETR-ViP achieves substantially higher performance in visual prompt detection compared to other state-of-the-art counterparts. A series of ablation studies and analyses further validate the effectiveness of the proposed improvements and shed light on the underlying reasons for the enhanced detection capability of visual prompts.

Summary / 总结

DETR-ViP targets visual-prompted object detection, whose suboptimal performance the authors trace to a lack of global discriminability in visual prompts. On top of basic image-text contrastive learning, the framework adds global prompt integration and visual-textual prompt relation distillation to learn more discriminative prompt representations, together with a selective fusion strategy for stable and robust detection. Experiments on COCO, LVIS, ODinW, and Roboflow100 show substantially higher visual-prompt detection performance than other state-of-the-art methods, and a series of ablation studies supports the effectiveness of the proposed improvements.
