arXiv Paper Digest

Snapshot: 20260417_0421

One Token per Highly Selective Frame: Towards Extreme Compression for Long Video Understanding

Authors: Zheyu Zhang, Ziqi Pang, Shixing Chen, Xiang Hao, Vimal Bhat, Yu-Xiong Wang

First: 2026-04-15T17:59:52+00:00 · Latest: 2026-04-15T17:59:52+00:00

Abstract

Long video understanding is inherently challenging for vision-language models (VLMs) because of the extensive number of frames. With each video frame typically expanding into tens or hundreds of tokens, the limited context length of large language models (LLMs) forces the VLMs to perceive the frames sparsely and lose temporal information. To address this, we explore extreme video token compression towards one token per frame at the final LLM layer. Our key insight is that heuristic-based compression, widely adopted by previous methods, is prone to information loss, and this necessitates supervising LLM layers into learnable and progressive modules for token-level compression (LP-Comp). Such compression enables our VLM to digest 2x-4x more frames with improved performance. To further increase the token efficiency, we investigate frame-level compression, which selects the frames most relevant to the queries via the internal attention scores of the LLM layers, named question-conditioned compression (QC-Comp). As a notable distinction from previous studies, we mitigate the position bias of LLM attention in long contexts, i.e., the over-concentration on the beginning and end of a sequence, by splitting long videos into short segments and employing local attention. Collectively, our combined token-level and frame-level compression leads to an extreme compression model for long video understanding, named \name, achieving a significantly larger compression ratio and enabling denser frame sampling. Our \name is finetuned from VideoChat-Flash with a data-efficient supervised compression tuning stage that only requires 2.5% of the supervised fine-tuning data, yet boosts the accuracy from 42.9% to 46.2% on LVBench and enhances multiple other long video benchmarks.
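The abstract does not say how QC-Comp turns attention scores into a frame selection, so the following is only a minimal sketch of the general idea of ranking frames within short segments rather than over the whole video. The function name `select_frames_by_segment`, the fixed per-segment budget, and the example scores are all assumptions made for illustration and are not taken from the paper.

```python
from typing import List


def select_frames_by_segment(
    scores: List[float], segment_len: int, keep_per_segment: int
) -> List[int]:
    """Return the indices of kept frames, in temporal order.

    scores[i] is a hypothetical relevance score for frame i, for example the
    question-to-frame attention mass computed locally inside one segment.
    """
    if segment_len <= 0 or keep_per_segment <= 0:
        raise ValueError("segment_len and keep_per_segment must be positive")
    kept: List[int] = []
    for start in range(0, len(scores), segment_len):
        segment = range(start, min(start + segment_len, len(scores)))
        # Rank frames only against other frames of the same segment.
        top = sorted(segment, key=lambda i: scores[i], reverse=True)
        kept.extend(sorted(top[:keep_per_segment]))
    return kept


# Hypothetical scores that are inflated at the start of the sequence,
# mimicking a position bias toward early frames.
scores = [0.9, 0.8, 0.7, 0.6, 0.1, 0.3, 0.2, 0.05]

# A global top-2 keeps only early frames.
global_top2 = sorted(
    sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:2]
)

# Segment-wise selection keeps one frame from each half of the video.
selected = select_frames_by_segment(scores, segment_len=4, keep_per_segment=1)

print(global_top2)
print(selected)
```

With these made-up scores, the global ranking picks two frames from the first segment, while the segment-wise ranking keeps temporal coverage across both segments.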

HiVLA: A Visual-Grounded-Centric Hierarchical Embodied Manipulation System

Authors: Tianshuo Yang, Guanyu Chen, Yutian Chen, Zhixuan Liang, Yitian Liu, Zanxin Chen, Chunpu Xu, Haotian Liang, Jiangmiao Pang, Yao Mu, Ping Luo

First: 2026-04-15T17:50:07+00:00 · Latest: 2026-04-15T17:50:07+00:00

Comments: Project Page: https://tianshuoy.github.io/HiVLA-page/

Abstract

While end-to-end Vision-Language-Action (VLA) models offer a promising paradigm for robotic manipulation, fine-tuning them on narrow control data often compromises the profound reasoning capabilities inherited from their base Vision-Language Models (VLMs). To resolve this fundamental trade-off, we propose HiVLA, a visual-grounded-centric hierarchical framework that explicitly decouples high-level semantic planning from low-level motor control. In the high-level part, a VLM planner first performs task decomposition and visual grounding to generate structured plans, each comprising a subtask instruction and a precise target bounding box. Then, to translate this plan into physical actions, we introduce a flow-matching Diffusion Transformer (DiT) action expert in the low-level part, equipped with a novel cascaded cross-attention mechanism. This design sequentially fuses global context, high-resolution object-centric crops, and skill semantics, enabling the DiT to focus purely on robust execution. Our decoupled architecture preserves the VLM's zero-shot reasoning while allowing independent improvement of both components. Extensive experiments in simulation and the real world demonstrate that HiVLA significantly outperforms state-of-the-art end-to-end baselines, particularly excelling in long-horizon skill composition and the fine-grained manipulation of small objects in cluttered scenes.
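As a rough illustration of the interface between the two levels, the sketch below models a structured plan as a subtask instruction plus a target bounding box, and derives an object-centric crop region from it. The class `SubtaskPlan`, the function `crop_region`, the pixel-coordinate box format, and the fixed padding are hypothetical choices for this example; the abstract does not specify any of them.

```python
from dataclasses import dataclass
from typing import Tuple

# Assumed box format: (x_min, y_min, x_max, y_max) in pixel coordinates.
Box = Tuple[int, int, int, int]


@dataclass(frozen=True)
class SubtaskPlan:
    """Hypothetical container for one step emitted by a high-level planner."""
    instruction: str
    target_box: Box


def crop_region(plan: SubtaskPlan, width: int, height: int, pad: int = 10) -> Box:
    """Expand the target box by `pad` pixels and clamp it to the image bounds."""
    x0, y0, x1, y1 = plan.target_box
    if x1 <= x0 or y1 <= y0:
        raise ValueError("target_box must have positive width and height")
    return (
        max(0, x0 - pad),
        max(0, y0 - pad),
        min(width, x1 + pad),
        min(height, y1 + pad),
    )


plan = SubtaskPlan("pick up the red screw", (100, 50, 200, 150))
region = crop_region(plan, width=640, height=480)
print(region)  # (90, 40, 210, 160)
```

A low-level policy could then read the high-resolution pixels inside `region` alongside the full frame and the instruction text, which mirrors the three inputs the abstract lists (global context, object-centric crops, skill semantics).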

Summary

HiVLA is a hierarchical embodied manipulation system that decouples high-level semantic planning from low-level motor control, using a VLM planner for task decomposition and visual grounding, and a flow-matching Diffusion Transformer (DiT) for precise action execution. Experiments show that HiVLA outperforms end-to-end baselines, especially in long-horizon skill composition and fine-grained manipulation in cluttered scenes.

Why and When Visual Token Pruning Fails? A Study on Relevant Visual Information Shift in MLLMs Decoding

Authors: Jiwan Kim, Kibum Kim, Wonjoong Kim, Byung-Kwan Lee, Chanyoung Park

First: 2026-04-14T06:48:31+00:00 · Latest: 2026-04-15T17:43:53+00:00

Comments: Preprint, Project: https://ptkjw1997.github.io/DSTP-page/

Abstract

Recently, visual token pruning has been studied to handle the vast number of visual tokens in Multimodal Large Language Models. However, we observe that while existing pruning methods perform reliably on simple visual understanding, they struggle to effectively generalize to complex visual reasoning tasks, a critical gap underexplored in previous studies. Through a systematic analysis, we identify Relevant Visual Information Shift (RVIS) during decoding as the primary failure driver. To address this, we propose Decoding-stage Shift-aware Token Pruning (DSTP), a training-free add-on framework that enables existing pruning methods to align visual tokens with shifting reasoning requirements during the decoding stage. Extensive experiments demonstrate that DSTP significantly mitigates performance degradation of pruning methods in complex reasoning tasks, while consistently yielding performance gains even across visual understanding benchmarks. Furthermore, DSTP demonstrates effectiveness across diverse state-of-the-art architectures, highlighting its generalizability and efficiency with minimal computational overhead.

Hierarchical DLO Routing with Reinforcement Learning and In-Context Vision-language Models

Authors: Mingen Li, Houjian Yu, Yixuan Huang, Youngjin Hong, Hantao Ye, Changhyun Choi

First: 2025-10-22T05:57:23+00:00 · Latest: 2026-04-15T17:36:28+00:00

Comments: 8 pages, 6 figures, 3 tables

Abstract

Long-horizon routing tasks of deformable linear objects (DLOs), such as cables and ropes, are common in industrial assembly lines and everyday life. These tasks are particularly challenging because they require robots to manipulate DLOs with long-horizon planning and reliable skill execution. Successfully completing such tasks demands adapting to their nonlinear dynamics, decomposing abstract routing goals, and generating multi-step plans composed of multiple skills, all of which require accurate high-level reasoning during execution. In this paper, we propose a fully autonomous hierarchical framework for solving challenging DLO routing tasks. Given an implicit or explicit routing goal expressed in language, our framework leverages vision-language models (VLMs) for in-context high-level reasoning to synthesize feasible plans, which are then executed by low-level skills trained via reinforcement learning. To improve robustness over long horizons, we further introduce a failure recovery mechanism that reorients the DLO into insertion-feasible states. Our approach generalizes to diverse scenes involving object attributes, spatial descriptions, implicit language commands, and extended 5-clip settings. It achieves an overall success rate of 92% across long-horizon routing scenarios. Please refer to our project page: https://icra2026-dloroute.github.io/DLORoute/

UI-Zoomer: Uncertainty-Driven Adaptive Zoom-In for GUI Grounding

Authors: Fei Tang, Bofan Chen, Zhengxi Lu, Tongbo Chen, Songqin Nong, Tao Jiang, Wenhao Xu, Weiming Lu, Jun Xiao, Yueting Zhuang, Yongliang Shen

First: 2026-04-15T17:32:28+00:00 · Latest: 2026-04-15T17:32:28+00:00

Comments: Project Page: https://zju-real.github.io/UI-Zoomer Code: https://github.com/ZJU-REAL/UI-Zoomer

Abstract

GUI grounding, which localizes interface elements from screenshots given natural language queries, remains challenging for small icons and dense layouts. Test-time zoom-in methods improve localization by cropping and re-running inference at higher resolution, but apply cropping uniformly across all instances with fixed crop sizes, ignoring whether the model is actually uncertain on each case. We propose UI-Zoomer, a training-free adaptive zoom-in framework that treats both the trigger and scale of zoom-in as a prediction uncertainty quantification problem. A confidence-aware gate fuses spatial consensus among stochastic candidates with token-level generation confidence to selectively trigger zoom-in only when localization is uncertain. When triggered, an uncertainty-driven crop sizing module decomposes prediction variance into inter-sample positional spread and intra-sample box extent, deriving a per-instance crop radius via the law of total variance. Extensive experiments on ScreenSpot-Pro, UI-Vision, and ScreenSpot-v2 demonstrate consistent improvements over strong baselines across multiple model architectures, achieving gains of up to +13.4%, +10.3%, and +4.2% respectively, with no additional training required.
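The variance decomposition mentioned above can be made concrete with a small sketch of the law of total variance, Var(X) = Var(E[X | sample]) + E[Var(X | sample)]. It assumes, purely for illustration, that each sampled box is a uniform distribution over its extent (so its per-axis variance is width squared over 12) and that the crop radius is a fixed multiple of the total standard deviation. The names `total_variance_1d` and `crop_radius` and the `scale` factor are hypothetical and are not the paper's formulation.

```python
import math
from typing import List, Tuple

# Assumed box format: (x_min, y_min, x_max, y_max).
Box = Tuple[float, float, float, float]


def total_variance_1d(centers: List[float], widths: List[float]) -> float:
    """Inter-sample spread of the centers plus mean intra-sample box variance."""
    n = len(centers)
    if n == 0 or n != len(widths):
        raise ValueError("centers and widths must be non-empty and equal length")
    mean_c = sum(centers) / n
    inter = sum((c - mean_c) ** 2 for c in centers) / n   # Var(E[X | sample])
    intra = sum(w * w / 12.0 for w in widths) / n         # E[Var(X | sample)]
    return inter + intra


def crop_radius(boxes: List[Box], scale: float = 2.0) -> float:
    """Hypothetical radius: `scale` standard deviations of the pooled prediction."""
    cx = [(x0 + x1) / 2 for x0, _, x1, _ in boxes]
    cy = [(y0 + y1) / 2 for _, y0, _, y1 in boxes]
    ws = [x1 - x0 for x0, _, x1, _ in boxes]
    hs = [y1 - y0 for _, y0, _, y1 in boxes]
    return scale * math.sqrt(total_variance_1d(cx, ws) + total_variance_1d(cy, hs))


# Two sampled boxes that tile [0, 24] along x and coincide along y.
boxes = [(0.0, 0.0, 12.0, 12.0), (12.0, 0.0, 24.0, 12.0)]
var_x = total_variance_1d([6.0, 18.0], [12.0, 12.0])
radius = crop_radius(boxes)
print(var_x, radius)
```

As a sanity check on the decomposition, the two boxes together form a uniform distribution over [0, 24] along x, whose variance is 24² / 12 = 48, which matches the sum of the inter-sample term (36) and the intra-sample term (12).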

3D Instruction Ambiguity Detection

Authors: Jiayu Ding, Haoran Tang, Hongbo Jin, Wei Gao, Ge Li

First: 2026-01-09T18:17:11+00:00 · Latest: 2026-04-15T17:18:28+00:00

Abstract

In safety-critical domains, linguistic ambiguity can have severe consequences; a vague command like "Pass me the vial" in a surgical setting could lead to catastrophic errors. Yet, most embodied AI research overlooks this, assuming instructions are clear and focusing on execution rather than confirmation. To address this critical safety gap, we are the first to define 3D Instruction Ambiguity Detection, a fundamental new task where a model must determine if a command has a single, unambiguous meaning within a given 3D scene. To support this research, we build Ambi3D, a large-scale benchmark for this task, featuring over 700 diverse 3D scenes and around 22k instructions. Our analysis reveals a surprising limitation: state-of-the-art 3D Large Language Models (LLMs) struggle to reliably determine if an instruction is ambiguous. To address this challenge, we propose AmbiVer, a two-stage framework that collects explicit visual evidence from multiple views and uses it to guide a vision-language model (VLM) in judging instruction ambiguity. Extensive experiments demonstrate the challenge of our task and the effectiveness of AmbiVer, paving the way for safer and more trustworthy embodied AI. Code and dataset available at https://jiayuding031020.github.io/ambi3d/.

Summary

The research addresses a safety gap in embodied AI: ambiguous instructions in 3D scenes can lead to severe errors. It introduces Ambi3D, a benchmark with over 700 scenes and around 22,000 instructions, and proposes AmbiVer, a two-stage framework that collects visual evidence from multiple views to guide a vision-language model in judging instruction ambiguity. Experiments show that state-of-the-art 3D LLMs struggle with this task and demonstrate the effectiveness of AmbiVer. This work paves the way for safer and more trustworthy embodied AI systems.

Training-Free Semantic Multi-Object Tracking with Vision-Language Models

Authors: Laurence Bonat, Francesco Tonini, Elisa Ricci, Lorenzo Vaquero

First: 2026-04-15T16:44:57+00:00 · Latest: 2026-04-15T16:44:57+00:00

Comments: Accepted to the 20th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2026)