arXiv Paper Digest

2026-04-17 04:21
Snapshot: 20260417_0421
One Token per Highly Selective Frame: Towards Extreme Compression for Long Video Understanding
Authors: Zheyu Zhang, Ziqi Pang, Shixing Chen, Xiang Hao, Vimal Bhat, Yu-Xiong Wang
First: 2026-04-15T17:59:52+00:00 · Latest: 2026-04-15T17:59:52+00:00
Abstract
Long video understanding is inherently challenging for vision-language models (VLMs) because of the extensive number of frames. With each video frame typically expanding into tens or hundreds of tokens, the limited context length of large language models (LLMs) forces the VLMs to perceive the frames sparsely and lose temporal information. To address this, we explore extreme video token compression towards one token per frame at the final LLM layer. Our key insight is that heuristic-based compression, widely adopted by previous methods, is prone to information loss, and this necessitates supervising LLM layers into learnable and progressive modules for token-level compression (LP-Comp). Such compression enables our VLM to digest 2x-4x more frames with improved performance. To further increase the token efficiency, we investigate frame-level compression, which selects the frames most relevant to the queries via the internal attention scores of the LLM layers, named question-conditioned compression (QC-Comp). As a notable distinction from previous studies, we mitigate the position bias of LLM attention in long contexts, i.e., the over-concentration on the beginning and end of a sequence, by splitting long videos into short segments and employing local attention. Collectively, our combined token-level and frame-level compression leads to an extreme compression model for long video understanding, named \name, achieving a significantly larger compression ratio and enabling denser frame sampling. Our \name is finetuned from VideoChat-Flash with a data-efficient supervised compression tuning stage that only requires 2.5% of the supervised fine-tuning data, yet boosts the accuracy from 42.9% to 46.2% on LVBench and enhances multiple other long video benchmarks.
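The frame-level selection described in the abstract can be pictured with a minimal sketch. It assumes each frame already carries one relevance score (hypothetically, the attention mass that question tokens place on that frame's token, computed within its own segment) and keeps the top-k frames per segment, so frames at the start and end of the whole video cannot crowd out the middle. The names `select_frames`, `segment_len`, and `keep_per_segment` are illustrative and do not come from the paper.

```python
from typing import List


def select_frames(scores: List[float], segment_len: int, keep_per_segment: int) -> List[int]:
    """Return indices of frames kept after segment-local top-k selection."""
    kept: List[int] = []
    for start in range(0, len(scores), segment_len):
        segment = range(start, min(start + segment_len, len(scores)))
        # Rank frames inside this segment only (assumed stand-in for local attention).
        top = sorted(segment, key=lambda i: scores[i], reverse=True)[:keep_per_segment]
        kept.extend(sorted(top))  # restore temporal order within the segment
    return kept


# Hypothetical attention-derived relevance scores for 8 frames.
scores = [0.9, 0.1, 0.2, 0.8, 0.05, 0.6, 0.7, 0.95]
selected = select_frames(scores, segment_len=4, keep_per_segment=2)
print(selected)
```

A global top-k over the same scores would ignore segment boundaries; ranking per segment is one simple way to guarantee that every part of the video contributes frames.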
HiVLA: A Visual-Grounded-Centric Hierarchical Embodied Manipulation System
Authors: Tianshuo Yang, Guanyu Chen, Yutian Chen, Zhixuan Liang, Yitian Liu, Zanxin Chen, Chunpu Xu, Haotian Liang, Jiangmiao Pang, Yao Mu, Ping Luo
First: 2026-04-15T17:50:07+00:00 · Latest: 2026-04-15T17:50:07+00:00
Comments: Project Page: https://tianshuoy.github.io/HiVLA-page/
Abstract
While end-to-end Vision-Language-Action (VLA) models offer a promising paradigm for robotic manipulation, fine-tuning them on narrow control data often compromises the profound reasoning capabilities inherited from their base Vision-Language Models (VLMs). To resolve this fundamental trade-off, we propose HiVLA, a visual-grounded-centric hierarchical framework that explicitly decouples high-level semantic planning from low-level motor control. In the high-level part, a VLM planner first performs task decomposition and visual grounding to generate structured plans, comprising a subtask instruction and a precise target bounding box. Then, to translate this plan into physical actions, we introduce a flow-matching Diffusion Transformer (DiT) action expert in the low-level part, equipped with a novel cascaded cross-attention mechanism. This design sequentially fuses global context, high-resolution object-centric crops and skill semantics, enabling the DiT to focus purely on robust execution. Our decoupled architecture preserves the VLM's zero-shot reasoning while allowing independent improvement of both components. Extensive experiments in simulation and the real world demonstrate that HiVLA significantly outperforms state-of-the-art end-to-end baselines, particularly excelling in long-horizon skill composition and the fine-grained manipulation of small objects in cluttered scenes.
Summary
HiVLA is a hierarchical embodied manipulation system that decouples high-level semantic planning from low-level motor control, using a VLM planner for task decomposition and visual grounding, and a flow-matching Diffusion Transformer (DiT) for precise action execution. Experiments show that HiVLA outperforms end-to-end baselines, especially in long-horizon skill composition and fine-grained manipulation in cluttered scenes.
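As a rough illustration of the interface between the two levels, the sketch below models a structured plan (subtask instruction plus target bounding box) and derives an object-centric crop window from it. The `SubtaskPlan` type, the `crop_window` helper, and the 25% margin are assumptions made for this example, not details taken from HiVLA.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class SubtaskPlan:
    """Hypothetical planner output: one subtask and where its target is."""
    instruction: str
    target_box: Tuple[int, int, int, int]  # (x1, y1, x2, y2) in pixels


def crop_window(plan: SubtaskPlan, image_w: int, image_h: int,
                margin: float = 0.25) -> Tuple[int, int, int, int]:
    """Expand the target box by a relative margin and clamp it to the image."""
    x1, y1, x2, y2 = plan.target_box
    pad_x = int((x2 - x1) * margin)
    pad_y = int((y2 - y1) * margin)
    return (max(0, x1 - pad_x), max(0, y1 - pad_y),
            min(image_w, x2 + pad_x), min(image_h, y2 + pad_y))


plan = SubtaskPlan("pick up the screw", (10, 20, 50, 60))
window = crop_window(plan, image_w=640, image_h=480)
print(window)
```

Keeping the plan as plain data is what lets the planner and the action expert be improved independently, as the abstract describes.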
Why and When Visual Token Pruning Fails? A Study on Relevant Visual Information Shift in MLLMs Decoding
Authors: Jiwan Kim, Kibum Kim, Wonjoong Kim, Byung-Kwan Lee, Chanyoung Park
First: 2026-04-14T06:48:31+00:00 · Latest: 2026-04-15T17:43:53+00:00
Comments: Preprint, Project : https://ptkjw1997.github.io/DSTP-page/
Abstract
Recently, visual token pruning has been studied to handle the vast number of visual tokens in Multimodal Large Language Models. However, we observe that while existing pruning methods perform reliably on simple visual understanding, they struggle to effectively generalize to complex visual reasoning tasks, a critical gap underexplored in previous studies. Through a systematic analysis, we identify Relevant Visual Information Shift (RVIS) during decoding as the primary failure driver. To address this, we propose Decoding-stage Shift-aware Token Pruning (DSTP), a training-free add-on framework that enables existing pruning methods to align visual tokens with shifting reasoning requirements during the decoding stage. Extensive experiments demonstrate that DSTP significantly mitigates performance degradation of pruning methods in complex reasoning tasks, while consistently yielding performance gains even across visual understanding benchmarks. Furthermore, DSTP demonstrates effectiveness across diverse state-of-the-art architectures, highlighting its generalizability and efficiency with minimal computational overhead.
Hierarchical DLO Routing with Reinforcement Learning and In-Context Vision-language Models
Authors: Mingen Li, Houjian Yu, Yixuan Huang, Youngjin Hong, Hantao Ye, Changhyun Choi
First: 2025-10-22T05:57:23+00:00 · Latest: 2026-04-15T17:36:28+00:00
Comments: 8 pages, 6 figures, 3 tables
Abstract
Long-horizon routing tasks of deformable linear objects (DLOs), such as cables and ropes, are common in industrial assembly lines and everyday life. These tasks are particularly challenging because they require robots to manipulate DLOs with long-horizon planning and reliable skill execution. Successfully completing such tasks demands adapting to their nonlinear dynamics, decomposing abstract routing goals, and generating multi-step plans composed of multiple skills, all of which require accurate high-level reasoning during execution. In this paper, we propose a fully autonomous hierarchical framework for solving challenging DLO routing tasks. Given an implicit or explicit routing goal expressed in language, our framework leverages vision-language models (VLMs) for in-context high-level reasoning to synthesize feasible plans, which are then executed by low-level skills trained via reinforcement learning. To improve robustness over long horizons, we further introduce a failure recovery mechanism that reorients the DLO into insertion-feasible states. Our approach generalizes to diverse scenes involving object attributes, spatial descriptions, implicit language commands, and extended 5-clip settings. It achieves an overall success rate of 92% across long-horizon routing scenarios. Please refer to our project page: https://icra2026-dloroute.github.io/DLORoute/
UI-Zoomer: Uncertainty-Driven Adaptive Zoom-In for GUI Grounding
Authors: Fei Tang, Bofan Chen, Zhengxi Lu, Tongbo Chen, Songqin Nong, Tao Jiang, Wenhao Xu, Weiming Lu, Jun Xiao, Yueting Zhuang, Yongliang Shen
First: 2026-04-15T17:32:28+00:00 · Latest: 2026-04-15T17:32:28+00:00
Comments: Project Page: https://zju-real.github.io/UI-Zoomer Code: https://github.com/ZJU-REAL/UI-Zoomer
Abstract
GUI grounding, which localizes interface elements from screenshots given natural language queries, remains challenging for small icons and dense layouts. Test-time zoom-in methods improve localization by cropping and re-running inference at higher resolution, but apply cropping uniformly across all instances with fixed crop sizes, ignoring whether the model is actually uncertain on each case. We propose UI-Zoomer, a training-free adaptive zoom-in framework that treats both the trigger and scale of zoom-in as a prediction uncertainty quantification problem. A confidence-aware gate fuses spatial consensus among stochastic candidates with token-level generation confidence to selectively trigger zoom-in only when localization is uncertain. When triggered, an uncertainty-driven crop sizing module decomposes prediction variance into inter-sample positional spread and intra-sample box extent, deriving a per-instance crop radius via the law of total variance. Extensive experiments on ScreenSpot-Pro, UI-Vision, and ScreenSpot-v2 demonstrate consistent improvements over strong baselines across multiple model architectures, achieving gains of up to +13.4%, +10.3%, and +4.2% respectively, with no additional training required.
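The variance decomposition mentioned above can be made concrete with a small sketch. By the law of total variance, the per-axis variance of a predicted location equals the variance of the box centers across samples (inter-sample spread) plus the average variance within each box (intra-sample extent). The sketch assumes each box is a uniform distribution over its area, so a side of length s contributes s²/12, and it assumes a crop radius of `k` standard deviations; both choices, and all function names, are illustrative rather than UI-Zoomer's actual formulation.

```python
import math
from statistics import fmean, pvariance
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def total_variance(boxes: List[Box]) -> Tuple[float, float]:
    """Per-axis total variance = Var(centers) + mean within-box variance."""
    cx = [(x1 + x2) / 2 for x1, _, x2, _ in boxes]
    cy = [(y1 + y2) / 2 for _, y1, _, y2 in boxes]
    # A uniform distribution over a side of length s has variance s**2 / 12.
    intra_x = fmean([(x2 - x1) ** 2 / 12 for x1, _, x2, _ in boxes])
    intra_y = fmean([(y2 - y1) ** 2 / 12 for _, y1, _, y2 in boxes])
    return pvariance(cx) + intra_x, pvariance(cy) + intra_y


def crop_radius(boxes: List[Box], k: float = 2.0) -> float:
    """Hypothetical rule: k standard deviations along the more uncertain axis."""
    var_x, var_y = total_variance(boxes)
    return k * math.sqrt(max(var_x, var_y))


# Two stochastic candidates that disagree horizontally by 4 pixels.
candidates = [(0.0, 0.0, 12.0, 12.0), (4.0, 0.0, 16.0, 12.0)]
var_x, var_y = total_variance(candidates)
radius = crop_radius(candidates)
print(var_x, var_y, radius)
```

In this toy case the horizontal disagreement between candidates raises the x-variance above the y-variance, so the crop grows only as much as the observed uncertainty warrants.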
3D Instruction Ambiguity Detection
Authors: Jiayu Ding, Haoran Tang, Hongbo Jin, Wei Gao, Ge Li
First: 2026-01-09T18:17:11+00:00 · Latest: 2026-04-15T17:18:28+00:00
Abstract
In safety-critical domains, linguistic ambiguity can have severe consequences; a vague command like "Pass me the vial" in a surgical setting could lead to catastrophic errors. Yet, most embodied AI research overlooks this, assuming instructions are clear and focusing on execution rather than confirmation. To address this critical safety gap, we are the first to define 3D Instruction Ambiguity Detection, a fundamental new task where a model must determine if a command has a single, unambiguous meaning within a given 3D scene. To support this research, we build Ambi3D, a large-scale benchmark for this task, featuring over 700 diverse 3D scenes and around 22k instructions. Our analysis reveals a surprising limitation: state-of-the-art 3D Large Language Models (LLMs) struggle to reliably determine if an instruction is ambiguous. To address this challenge, we propose AmbiVer, a two-stage framework that collects explicit visual evidence from multiple views and uses it to guide a vision-language model (VLM) in judging instruction ambiguity. Extensive experiments demonstrate the challenge of our task and the effectiveness of AmbiVer, paving the way for safer and more trustworthy embodied AI. Code and dataset available at https://jiayuding031020.github.io/ambi3d/.
Summary
The research addresses a critical safety gap in embodied AI: ambiguous instructions, which can lead to severe errors. It defines the task of detecting whether a command has a single, unambiguous meaning within a given 3D scene, introduces Ambi3D, a benchmark with over 700 scenes and around 22,000 instructions, and proposes AmbiVer, a two-stage framework that collects visual evidence from multiple views to guide a vision-language model in judging instruction ambiguity. Experiments show that state-of-the-art 3D LLMs struggle with this task and demonstrate the effectiveness of AmbiVer. This work paves the way for safer and more trustworthy embodied AI systems.
Training-Free Semantic Multi-Object Tracking with Vision-Language Models
Authors: Laurence Bonat, Francesco Tonini, Elisa Ricci, Lorenzo Vaquero
First: 2026-04-15T16:44:57+00:00 · Latest: 2026-04-15T16:44:57+00:00
Comments: Accepted to the 20th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2026)
Abstract
Semantic Multi-Object Tracking (SMOT) extends multi-object tracking with semantic outputs such as video summaries, instance-level captions, and interaction labels, aiming to move from trajectories to human-interpretable descriptions of dynamic scenes. Existing SMOT systems are trained end-to-end, coupling progress to expensive supervision, limiting the ability to rapidly adapt to new foundation models and new interactions. We propose TF-SMOT, a training-free SMOT pipeline that composes pretrained components for detection, mask-based tracking, and video-language generation. TF-SMOT combines D-FINE and the promptable SAM2 segmentation tracker to produce temporally consistent tracklets, uses contour grounding to generate video summaries and instance captions with InternVideo2.5, and aligns extracted interaction predicates to BenSMOT WordNet synsets via gloss-based semantic retrieval with LLM disambiguation. On BenSMOT, TF-SMOT achieves state-of-the-art tracking performance within the SMOT setting and improves summary and caption quality compared to prior art. Interaction recognition, however, remains challenging under strict exact-match evaluation on the fine-grained and long-tailed WordNet label space; our analysis and ablations indicate that semantic overlap and label granularity substantially affect measured performance.
Summary
The research develops a training-free approach to Semantic Multi-Object Tracking (SMOT) by composing pretrained models for detection, mask-based tracking, and video-language generation. The TF-SMOT pipeline combines D-FINE and SAM2 for tracking, uses contour grounding with InternVideo2.5 to generate video summaries and instance captions, and aligns interaction predicates to BenSMOT WordNet synsets via gloss-based semantic retrieval with LLM disambiguation. On the BenSMOT dataset, TF-SMOT achieves state-of-the-art tracking performance and improves summary and caption quality, though interaction recognition remains challenging under strict exact-match evaluation on the fine-grained and long-tailed label space.
Hydra: Unifying Document Retrieval and Generation in a Single Vision-Language Model
Authors: Athos Georgiou
First: 2026-03-30T15:17:41+00:00 · Latest: 2026-04-15T16:17:29+00:00
Comments: 18 pages, 2 figures, 7 tables, 1 algorithm. v2: lm_head alias via Qwen3.5 weight-tying raises the peak GPU memory reduction from 41% to 48% (10.5 -> 9.2 GB); bitwise-identical outputs verified over 50+ greedy samples, 10 decodes at 1024 tokens, 50 mode-switch round-trips. Code: github.com/athrael-soju/hydra ; HF models under huggingface.co/athrael-soju
Abstract
Visual document understanding typically requires separate retrieval and generation models, doubling memory and system complexity. We present Hydra, a dual-head approach that provides both ColBERT-style late-interaction retrieval and autoregressive generation from a single vision-language model (VLM). A single LoRA adapter, trained only for retrieval, is toggled at inference: enabling it produces multi-vector embeddings; disabling it recovers the base model's generation quality -- byte-identical outputs in 100% of 10,500 greedy and stochastic samples, with max delta-ANLS = 0.0044 across 15,301 samples on four VQA benchmarks (three informative; ChartQA is near-zero for both models under greedy decoding) when compared against an independent base-model pipeline. We identify three engineering requirements (attention-mode restoration, lm_head preservation, KV-cache-aware decoding) whose omission silently breaks generation despite correct weight recovery. On ViDoRe V1, Hydra (4B) is within 1 percentage point of a controlled single-head baseline in a single training run, with higher aggregate scores on V2 and V3 that are concentrated on a subset of tasks; multi-seed experiments are needed to confirm these trends. The single-model design cuts peak GPU memory from 17.9 GB (two-model baseline) to 9.2 GB -- a 48% reduction, though adapter switching introduces throughput overhead under concurrent serving loads. An ablation shows that GritLM-style joint training provides no benefit within the LoRA-based (r=16) training regime. A proof-of-concept extension to Qwen2.5-Omni-3B demonstrates that the mechanism generalizes to audio retrieval and video embedding, with speech generation.
Summary
Hydra unifies document retrieval and generation in a single vision-language model, reducing memory and system complexity. It uses a dual-head approach with a single LoRA adapter, trained only for retrieval, that is toggled at inference: enabling it produces multi-vector embeddings for ColBERT-style late-interaction retrieval, and disabling it recovers the base model's generation. With the adapter disabled, outputs are byte-identical to the base model's in all 10,500 greedy and stochastic samples, and the maximum delta-ANLS against an independent base-model pipeline is 0.0044 across four VQA benchmarks. The single-model design cuts peak GPU memory from 17.9 GB to 9.2 GB, a 48% reduction relative to the two-model baseline, though adapter switching introduces throughput overhead under concurrent serving loads.
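For readers unfamiliar with ColBERT-style late interaction, the scoring rule is short enough to sketch: every query vector is matched to its most similar document vector, and the per-query maxima are summed. The toy vectors and the function name `maxsim_score` below are illustrative; Hydra's actual embedding dimensions and similarity details are not specified in the digest.

```python
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product (assumes vectors are already normalized)."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_vecs: List[Vector], doc_vecs: List[Vector]) -> float:
    """Late interaction: sum over query vectors of the best document match."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


# Toy multi-vector embeddings: two query tokens, two candidate pages.
query = [[1.0, 0.0], [0.0, 1.0]]
page_a = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
page_b = [[0.5, 0.5], [0.5, 0.5]]
score_a = maxsim_score(query, page_a)
score_b = maxsim_score(query, page_b)
print(score_a, score_b)
```

Because each query token is scored against every document vector independently, a page that matches each token precisely (page_a) outranks one that only matches on average (page_b).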
MAny: Merge Anything for Multimodal Continual Instruction Tuning
Authors: Zijian Gao, Wangwang Jia, Xingxing Zhang, Pengfei Qian, Tao Sun, Bo Ding, Yong Dou, Huaimin Wang, Kele Xu
First: 2026-04-15T15:57:23+00:00 · Latest: 2026-04-15T15:57:23+00:00
Abstract
Multimodal Continual Instruction Tuning (MCIT) is essential for sequential task adaptation of Multimodal Large Language Models (MLLMs) but is severely restricted by catastrophic forgetting. While existing literature focuses on the reasoning language backbone, in this work, we expose a critical yet neglected dual-forgetting phenomenon across both perception drift in Cross-modal Projection Space and reasoning collapse in Low-rank Parameter Space. To resolve this, we present MAny (Merge Anything), a framework that merges task-specific knowledge through Cross-modal Projection Merging (CPM) and Low-rank Parameter Merging (LPM). Specifically, CPM recovers perceptual alignment by adaptively merging cross-modal visual representations via visual-prototype guidance, ensuring accurate feature recovery during inference. Simultaneously, LPM eliminates mutual interference among task-specific low-rank modules by recursively merging low-rank weight matrices. By leveraging recursive least squares, LPM provides a closed-form solution that mathematically guarantees an optimal fusion trajectory for reasoning stability. Notably, MAny operates as a training-free paradigm that achieves knowledge merging via efficient CPU-based algebraic operations, eliminating additional gradient-based optimization beyond initial tuning. Our extensive evaluations confirm the superior performance and robustness of MAny across multiple MLLMs and benchmarks. Specifically, on the UCIT benchmark, MAny achieves significant leads of up to 8.57% and 2.85% in final average accuracy over state-of-the-art methods across two different MLLMs, respectively.
Summary
The research addresses the issue of catastrophic forgetting in Multimodal Continual Instruction Tuning (MCIT) for Multimodal Large Language Models (MLLMs). It introduces MAny, a framework that merges task-specific knowledge through Cross-modal Projection Merging (CPM) and Low-rank Parameter Merging (LPM). CPM recovers perceptual alignment by merging cross-modal visual representations, while LPM eliminates mutual interference among task-specific low-rank modules. MAny achieves superior performance and robustness, leading to up to 8.57% and 2.85% improvements in final average accuracy on the UCIT benchmark compared to state-of-the-art methods.
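As background for the recursive least squares step, the textbook recursion is reproduced below. It is the generic ridge-regularized form, not MAny's specific merging rule, which the digest does not spell out. The symbols are assumptions for illustration: inputs x_t in R^d, targets y_t in R^m, weights W_t in R^{d x m}, inverse-covariance estimate P_t, and regularizer lambda > 0.

```latex
\[
\begin{aligned}
P_0 &= \lambda^{-1} I, \qquad W_0 = 0, \\
K_t &= \frac{P_{t-1}\, x_t}{1 + x_t^{\top} P_{t-1}\, x_t}, \\
W_t &= W_{t-1} + K_t \left( y_t^{\top} - x_t^{\top} W_{t-1} \right), \\
P_t &= P_{t-1} - K_t\, x_t^{\top} P_{t-1}, \\[4pt]
\text{so that}\quad
W_t &= \Bigl( \lambda I + \sum_{i=1}^{t} x_i x_i^{\top} \Bigr)^{-1} \sum_{i=1}^{t} x_i y_i^{\top}.
\end{aligned}
\]
```

The last line is why the recursion counts as closed-form: after every update, W_t equals the batch ridge solution over all data seen so far, with no gradient steps involved.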
Reward Design for Physical Reasoning in Vision-Language Models
Authors: Derek Lilienthal, Manisha Mukherjee, Sameera Horawalavithana
First: 2026-04-15T15:36:26+00:00 · Latest: 2026-04-15T15:36:26+00:00
Abstract
Physical reasoning over visual inputs demands tight integration of visual perception, domain knowledge, and multi-step symbolic inference. Yet even state-of-the-art Vision Language Models (VLMs) fall far short of human performance on physics benchmarks. While post-training algorithms such as Supervised Fine-Tuning (SFT) and Group Relative Policy Optimization (GRPO) have demonstrated strong reasoning gains in language models, how reward design shapes VLM physical reasoning behavior remains poorly understood. We present a systematic reward ablation study for GRPO-based VLM training on physical reasoning. We compare four reward signals of increasing semantic richness: format compliance, answer accuracy, a composite rubric reward (answer correctness, physics principle identification, and unit consistency), and a novel internal reward derived from model attention weights over input image regions. We evaluate on PhyX, a 3,000-problem benchmark spanning six physics domains and six reasoning types across multiple-choice and open-ended formats, using IBM Granite Vision 3.3 (2B). Across both formats, GRPO with accuracy-based rewards outperforms SFT on most domains, though gains vary substantially by reward type and domain. Reward design does not uniformly improve performance. Instead, it induces domain-specific reasoning behaviors. Accuracy-based rewards provide the strongest overall gains. Rubric rewards improve structured reasoning quality without consistent accuracy improvements. Attention-based rewards enhance spatial reasoning while degrading performance in symbolic domains. Our internal attention-weight reward requires no spatial annotations and improves spatial relation accuracy from 0.27 to 0.50, suggesting that supervising where the model attends during generation is a promising direction for visually grounded physical reasoning.
Summary
The study examines how reward design influences physical reasoning in Vision-Language Models (VLMs). It conducts a systematic reward ablation for Group Relative Policy Optimization (GRPO) training on PhyX, a 3,000-problem benchmark, using IBM Granite Vision 3.3 (2B). Four reward signals of increasing semantic richness are compared: format compliance, answer accuracy, a composite rubric reward, and a novel internal reward derived from model attention weights over input image regions. Across both multiple-choice and open-ended formats, GRPO with accuracy-based rewards outperforms Supervised Fine-Tuning (SFT) on most domains, but gains vary substantially by reward type and domain. Rubric rewards improve structured reasoning quality without consistent accuracy improvements. The attention-based reward, which requires no spatial annotations, improves spatial relation accuracy from 0.27 to 0.50 while degrading performance in symbolic domains.
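To make the reward pipeline concrete, the sketch below combines a composite rubric reward with the group-relative advantage that GRPO uses, in which each sampled response is scored relative to the mean and standard deviation of its own group. The equal rubric weights, the population standard deviation, and the function names are assumptions for illustration; the study's actual weighting and implementation details are not given in the digest.

```python
from statistics import fmean, pstdev
from typing import List


def rubric_reward(answer_ok: bool, principle_ok: bool, units_ok: bool) -> float:
    """Equal-weight composite of the three rubric items (weights are assumed)."""
    return (answer_ok + principle_ok + units_ok) / 3


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """GRPO-style advantage: standardize each reward within its sampled group."""
    mean = fmean(rewards)
    std = pstdev(rewards)  # some implementations use the sample std instead
    return [(r - mean) / (std + eps) for r in rewards]


# Four hypothetical responses sampled for the same physics problem.
rewards = [
    rubric_reward(True, True, True),
    rubric_reward(False, True, True),
    rubric_reward(False, False, True),
    rubric_reward(False, False, False),
]
advantages = group_relative_advantages(rewards)
print(rewards, advantages)
```

Since advantages are relative within a group, a reward that gives every response the same score provides no learning signal, which is one reason the choice of reward can shape behavior so differently across domains.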
MApLe: Multi-instance Alignment of Diagnostic Reports and Large Medical Images
Authors: Felicia Bader, Philipp Seeböck, Anastasia Bartashova, Ulrike Attenberger, Georg Langs
First: 2026-04-15T15:19:54+00:00 · Latest: 2026-04-15T15:19:54+00:00
Comments: Accepted for MIDL 2026; Reviews available at https://openreview.net/forum?id=M8OO3CRbL9#discussion
Abstract
In diagnostic reports, experts encode complex imaging data into clinically actionable information. They describe subtle pathological findings that are meaningful in their anatomical context. Reports follow relatively consistent structures, expressing diagnostic information with few words that are often associated with tiny but consequential image observations. Standard vision language models struggle to identify the associations between these informative text components and small locations in the images. Here, we propose "MApLe", a multi-task, multi-instance vision language alignment approach that overcomes these limitations. It disentangles the concepts of anatomical region and diagnostic finding, and links local image information to sentences in a patch-wise approach. Our method consists of a text embedding trained to capture anatomical and diagnostic concepts in sentences, a patch-wise image encoder conditioned on anatomical structures, and a multi-instance alignment of these representations. We demonstrate that MApLe can successfully align different image regions and multiple diagnostic findings in free-text reports. We show that our model improves the alignment performance compared to state-of-the-art baseline models when evaluated on several downstream tasks. The code is available at https://github.com/cirmuw/MApLe.
SiLVR: A Simple Language-based Video Reasoning Framework
Authors: Ce Zhang, Yan-Bo Lin, Ziyang Wang, Mohit Bansal, Gedas Bertasius
First: 2025-05-30T17:59:19+00:00 · Latest: 2026-04-15T15:09:43+00:00
Comments: Accepted by TMLR (01/2026)
Abstract
Recent advances in test-time optimization have led to remarkable reasoning capabilities in Large Language Models (LLMs), enabling them to solve highly complex problems in math and coding. However, the reasoning capabilities of multimodal LLMs (MLLMs) still significantly lag, especially for complex video-language tasks. To address this issue, we present SiLVR, a Simple Language-based Video Reasoning framework that decomposes complex video understanding into two stages. In the first stage, SiLVR transforms raw video into language-based representations using multisensory inputs, such as short clip captions and audio/speech subtitles. In the second stage, language descriptions are fed into a powerful reasoning LLM to solve complex video-language understanding tasks. To handle long-context multisensory inputs, we use an Adaptive Context Reduction scheme, which dynamically determines the temporal granularity with which to sample the tokens. Our simple, modular, and training-free video reasoning framework achieves the best-reported results on Video-MME (long), Video-MMMU (comprehension), Video-MMLU, CGBench, and EgoLife. Furthermore, our empirical study focused on video reasoning capabilities shows that, despite not being explicitly trained on video, strong reasoning LLMs can effectively aggregate multisensory input information from video, speech, and audio for complex temporal, causal, long-context, and knowledge acquisition reasoning tasks in video. More details can be found at https://sites.google.com/cs.unc.edu/silvr.
Summary
SiLVR is a framework that enhances the reasoning capabilities of multimodal large language models (MLLMs) for complex video-language tasks. It decomposes video understanding into two stages: transforming raw video into language-based representations and using a reasoning LLM to solve complex tasks. SiLVR employs an Adaptive Context Reduction scheme to handle long-context inputs. The framework achieves state-of-the-art results on several benchmarks and demonstrates that reasoning LLMs can effectively integrate multisensory information for complex reasoning tasks in video without explicit video training.
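One simple way to picture an adaptive context reduction step is a loop that coarsens the clip length until all clip captions fit in the reasoning model's context. The sketch below assumes a fixed token cost per caption and a doubling schedule; both are simplifications chosen for illustration, and `choose_clip_seconds` is a hypothetical name rather than SiLVR's actual procedure.

```python
import math


def choose_clip_seconds(video_seconds: float, tokens_per_clip: int,
                        token_budget: int, start_seconds: float = 1.0) -> float:
    """Coarsen the clip length until all clip captions fit in the token budget."""
    if tokens_per_clip > token_budget:
        raise ValueError("even a single clip caption exceeds the budget")
    clip_seconds = start_seconds
    # Doubling the clip length roughly halves the number of captions.
    while math.ceil(video_seconds / clip_seconds) * tokens_per_clip > token_budget:
        clip_seconds *= 2
    return clip_seconds


# Hypothetical numbers: 50 tokens per caption, 32k-token context budget.
long_video = choose_clip_seconds(3600, tokens_per_clip=50, token_budget=32_000)
short_video = choose_clip_seconds(60, tokens_per_clip=50, token_budget=32_000)
print(long_video, short_video)
```

Under these assumptions a one-hour video is captioned in 8-second clips while a one-minute video keeps 1-second clips, so short videos retain fine temporal detail and long ones still fit in context.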
GeoAgentBench: A Dynamic Execution Benchmark for Tool-Augmented Agents in Spatial Analysis
Authors: Bo Yu, Cheng Yang, Dongyang Hou, Chengfu Liu, Jiayao Liu, Chi Wang, Zhiming Zhang, Haifeng Li, Wentao Yang
First: 2026-04-15T13:55:34+00:00 · Latest: 2026-04-15T13:55:34+00:00
Comments: 20 pages, 3 figures, 6 tables
Abstract
The integration of Large Language Models (LLMs) into Geographic Information Systems (GIS) marks a paradigm shift toward autonomous spatial analysis. However, evaluating these LLM-based agents remains challenging due to the complex, multi-step nature of geospatial workflows. Existing benchmarks primarily rely on static text or code matching, neglecting dynamic runtime feedback and the multimodal nature of spatial outputs. To address this gap, we introduce GeoAgentBench (GABench), a dynamic and interactive evaluation benchmark tailored for tool-augmented GIS agents. GABench provides a realistic execution sandbox integrating 117 atomic GIS tools, encompassing 53 typical spatial analysis tasks across 6 core GIS domains. Recognizing that precise parameter configuration is the primary determinant of execution success in dynamic GIS environments, we designed the Parameter Execution Accuracy (PEA) metric, which utilizes a "Last-Attempt Alignment" strategy to quantify the fidelity of implicit parameter inference. Complementing this, a Vision-Language Model (VLM) based verification is proposed to assess data-spatial accuracy and cartographic style adherence. Furthermore, to address the frequent task failures caused by parameter misalignments and runtime anomalies, we developed a novel agent architecture, Plan-and-React, that mimics expert cognitive workflows by decoupling global orchestration from step-wise reactive execution. Extensive experiments with seven representative LLMs demonstrate that the Plan-and-React paradigm significantly outperforms traditional frameworks, achieving the optimal balance between logical rigor and execution robustness, particularly in multi-step reasoning and error recovery. Our findings highlight current capability boundaries and establish a robust standard for assessing and advancing the next generation of autonomous GeoAI.
中文标题/摘要
标题:GeoAgentBench:空间分析中工具增强代理的动态执行基准
将大型语言模型(LLMs)集成到地理信息系统(GIS)中标志着空间分析自主化的范式转变。然而,由于地理空间工作流的复杂性和多步骤性,评估这些基于LLM的代理仍然具有挑战性。现有的基准主要依赖于静态文本或代码匹配,忽视了动态运行时反馈和空间输出的多模态性。为了解决这一差距,我们引入了GeoAgentBench(GABench),这是一种针对工具增强GIS代理的动态和交互式评估基准。GABench提供了一个包含117个原子GIS工具的现实执行沙箱,涵盖了6个核心GIS领域中的53个典型空间分析任务。认识到精确的参数配置是动态GIS环境中执行成功的主要决定因素,我们设计了参数执行准确性(PEA)度量,该度量利用“最后尝试对齐”策略来量化隐式参数推断的准确性。此外,我们提出了基于视觉-语言模型(VLM)的验证来评估数据-空间准确性和制图风格的符合性。为进一步解决由于参数对齐错误和运行时异常导致的频繁任务失败,我们开发了一种新的代理架构——计划-反应架构,该架构通过将全局协调与逐步反应执行解耦来模拟专家认知工作流程。通过与七种代表性LLM的广泛实验表明,计划-反应范式显著优于传统框架,实现了逻辑严谨性和执行鲁棒性的最佳平衡,特别是在多步推理和错误恢复方面。我们的研究结果突显了当前的能力边界,并为评估和推进下一代自主GeoAI建立了坚实的标准。
Summary / 总结
GeoAgentBench (GABench) is introduced to evaluate LLM-based agents in GIS by providing a dynamic and interactive sandbox for 117 atomic GIS tools. It uses the Parameter Execution Accuracy (PEA) metric and a Vision-Language Model (VLM) for verification, and proposes a Plan-and-React agent architecture to handle task failures. Experiments with seven LLMs show that the Plan-and-React paradigm outperforms traditional frameworks in multi-step reasoning and error recovery, highlighting current capability boundaries in autonomous GeoAI.
GeoAgentBench (GABench) 提供了一个动态且互动的沙盒环境,整合了117个原子GIS工具,用于评估基于LLM的GIS代理。它使用参数执行准确性(PEA)度量和视觉语言模型(VLM)进行验证,并提出了一种计划和反应的代理架构来处理任务失败。实验表明,计划和反应的架构在多步推理和错误恢复方面优于传统框架,突显了当前自主GeoAI的能力边界。
Democratising Pathology Co-Pilots: An Open Pipeline and Dataset for Whole-Slide Vision-Language Modelling
Authors: Sander Moonemans, Sebastiaan Ram, Frédérique Meeuwsen, Carlijn Lems, Jeroen van der Laak, Geert Litjens, Francesco Ciompi
First: 2025-12-19T08:14:58+00:00 · Latest: 2026-04-15T13:29:02+00:00
Comments: 12 pages, 4 figures
Abstract
Vision-language models (VLMs) have the potential to become co-pilots for pathologists. However, most VLMs either focus on small regions of interest within whole-slide images, provide only static slide-level outputs, or rely on data that is not publicly available, limiting reproducibility. Furthermore, training data containing WSIs paired with detailed clinical reports is scarce, restricting progress toward transparent and generalisable VLMs. We address these limitations with three main contributions. First, we introduce Polysome, a standardised tool for synthetic instruction generation. Second, we apply Polysome to the public HISTAI dataset, generating HISTAI-Instruct, a large whole-slide instruction tuning dataset spanning 24,259 slides and over 1.1 million instruction-response pairs. Finally, we use HISTAI-Instruct to train ANTONI-α, a VLM capable of visual-question answering (VQA). We show that ANTONI-α outperforms MedGemma on WSI-level VQA tasks of tissue identification, neoplasm detection, and differential diagnosis. We also compare the performance of multiple incarnations of ANTONI-α trained with different amounts of data. All methods, data, and code are publicly available.
Summary / 总结
This paper addresses the limitations of current vision-language models (VLMs) for pathology by introducing Polysome, a tool for generating synthetic instructions, and creating HISTAI-Instruct, a large dataset for whole-slide instruction tuning. Using this dataset, the authors trained ANTONI-α, a VLM that outperforms MedGemma on WSI-level VQA tasks such as tissue identification, neoplasm detection, and differential diagnosis. All methods, data, and code are publicly available.
本文通过引入用于生成合成指令的Polysome工具和创建包含大量全切片指令-响应对的HISTAI-Instruct数据集,解决了当前用于病理学的视觉-语言模型(VLMs)的局限性。使用该数据集,作者训练了ANTONI-α,一种在组织识别、肿瘤检测和鉴别诊断等WSI级别VQA任务上优于MedGemma的VLM。所有方法、数据和代码均公开可用。
VLMs Need Words: Vision Language Models Ignore Visual Detail In Favor of Semantic Anchors
Authors: Haz Sameen Shahgir, Xiaofu Chen, Yu Fu, Erfan Shayegani, Nael Abu-Ghazaleh, Yova Kementchedjhieva, Yue Dong
First: 2026-04-02T19:40:56+00:00 · Latest: 2026-04-15T13:13:04+00:00
Abstract
Vision-language models (VLMs) have achieved impressive performance across a wide range of multimodal tasks. However, they often fail on tasks that require fine-grained visual perception, even when the required information is still present in their internal representations. Prior work has attributed this "hidden-in-plain-sight" gap to the language model, but the cause remains unexplained. In this work, we demonstrate that this gap arises from the language model's lack of semantic labels for fine-grained visual details: when visual entities can be mapped to known concepts, VLMs bypass visual comparison and reason through language; when they cannot, VLMs resort to brittle and hallucinated descriptions. We verify this across semantic correspondence, synthetic shape matching, and face matching, and find that VLMs perform much better when the relevant entities are nameable than when they are unnameable. Mechanistically, Logit Lens analysis confirms that VLMs explicitly recover semantic labels for nameable entities and surface more unique tokens compared to unnameable entities. Furthermore, we show that this limitation can be addressed: teaching completely arbitrary names for unknown entities improves performance. More importantly, task-specific finetuning yields even stronger generalization without relying on language priors, i.e., through real visual perception. Our findings suggest that current VLM failures on visual tasks reflect a learned shortcut rather than a fundamental limitation of multimodal reasoning.
Gaslight, Gatekeep, V1-V3: Early Visual Cortex Alignment Shields Vision-Language Models from Sycophantic Manipulation
Authors: Arya Shah, Vaibhav Tripathi, Mayank Singh, Chaklam Silpasuwanchai
First: 2026-04-15T12:38:51+00:00 · Latest: 2026-04-15T12:38:51+00:00
Comments: 28 pages, 9 figures, 13 tables
Abstract
Vision-language models are increasingly deployed in high-stakes settings, yet their susceptibility to sycophantic manipulation remains poorly understood, particularly in relation to how these models represent visual information internally. Whether models whose visual representations more closely mirror human neural processing are also more resistant to adversarial pressure is an open question with implications for both neuroscience and AI safety. We investigate this question by evaluating 12 open-weight vision-language models spanning 6 architecture families and a 40× parameter range (256M-10B) along two axes: brain alignment, measured by predicting fMRI responses from the Natural Scenes Dataset across 8 human subjects and 6 visual cortex regions of interest, and sycophancy, measured through 76,800 two-turn gaslighting prompts spanning 5 categories and 10 difficulty levels. Region-of-interest analysis reveals that alignment specifically in early visual cortex (V1-V3) is a reliable negative predictor of sycophancy (r = -0.441, BCa 95% CI [-0.740, -0.031]), with all 12 leave-one-out correlations negative and the strongest effect for existence denial attacks (r = -0.597, p = 0.040). This anatomically specific relationship is absent in higher-order category-selective regions, suggesting that faithful low-level visual encoding provides a measurable anchor against adversarial linguistic override in vision-language models. We release our code on GitHub (https://github.com/aryashah2k/Gaslight-Gatekeep-Sycophantic-Manipulation) and our dataset on Hugging Face (https://huggingface.co/datasets/aryashah00/Gaslight-Gatekeep-V1-V3).
中文标题/摘要
标题:Gaslight, Gatekeep, V1-V3:早期视觉皮层对齐保护视觉语言模型免受奉承操控
视觉语言模型在高风险环境中越来越被部署,但它们对奉承操控的易感性仍然知之甚少,尤其是在这些模型如何内部表示视觉信息方面。模型的视觉表示与其更接近人类神经处理的模型是否也更抗御对抗性压力,这是一个开放问题,对神经科学和人工智能安全都有影响。我们通过评估12个开放权重的视觉语言模型,涵盖6个架构家族和40倍参数范围(256M-10B),从自然场景数据集中8个人类受试者和6个感兴趣区域的fMRI响应预测,以及通过76,800个两轮Gaslighting提示(涵盖5个类别和10个难度级别)来衡量大脑对齐和奉承性,来研究这个问题。区域分析表明,特定于早期视觉皮层(V1-V3)的对齐是奉承性的可靠负预测因子(r = -0.441,BCa 95% CI [-0.740, -0.031]),所有12个模型的留一交叉验证相关性均为负,效果最强的是存在否认攻击(r = -0.597,p = 0.040)。这种解剖学特异性关系在高级别类别选择区域中不存在,表明忠实的低级视觉编码为视觉语言模型提供了对抗性语言覆盖的可测量锚点。我们在GitHub(https://github.com/aryashah2k/Gaslight-Gatekeep-Sycophantic-Manipulation)和Hugging Face(https://huggingface.co/datasets/aryashah00/Gaslight-Gatekeep-V1-V3)上发布了我们的代码和数据集。
Summary / 总结
This study investigates whether vision-language models that better align with human brain activity in early visual cortex are more resistant to sycophantic manipulation. By evaluating 12 models across various architectures and parameter sizes, the research finds that alignment in early visual cortex (V1-V3) is negatively correlated with sycophancy, with the strongest effect observed in existence denial attacks. This suggests that accurate low-level visual encoding can serve as a defense against adversarial linguistic manipulation in vision-language models.
研究探讨了哪些视觉语言模型在早期视觉皮层(V1-V3)与人类大脑活动更一致,是否更能抵抗奉承操纵。通过评估12种不同架构和参数规模的模型,研究发现,早期视觉皮层的对齐与奉承行为呈负相关,特别是在存在否认攻击中效果最明显。这表明准确的低级视觉编码可以作为对抗敌对语言操纵的防御措施。
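The leave-one-out check mentioned in the abstract above (recomputing the correlation with each model dropped in turn) can be illustrated with a minimal sketch. This is not the authors' analysis code: the function names are hypothetical, and it computes plain Pearson correlations only, without the BCa bootstrap intervals the abstract reports.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


def leave_one_out_r(xs, ys):
    """Recompute Pearson r once per data point, each time with that point removed."""
    return [
        pearson_r(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        for i in range(len(xs))
    ]


# Hypothetical toy values (one pair per model), NOT data from the paper.
alignment = [0.10, 0.20, 0.30, 0.40, 0.50]
sycophancy = [0.90, 0.70, 0.75, 0.50, 0.40]
all_negative = all(r < 0 for r in leave_one_out_r(alignment, sycophancy))
```

If every leave-one-out value keeps the same sign, no single model is driving the direction of the correlation, which is the robustness property the abstract claims for its 12 models.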
Failure Identification in Imitation Learning Via Statistical and Semantic Filtering
Authors: Quentin Rolland, Fabrice Mayran de Chamisso, Jean-Baptiste Mouret
Venue: ICRA 2026
First: 2026-04-15T12:27:32+00:00 · Latest: 2026-04-15T12:27:32+00:00
Comments: 8 pages, Appendix coming soon, accepted at ICRA 2026
Abstract
Imitation learning (IL) policies in robotics deliver strong performance in controlled settings but remain brittle in real-world deployments: rare events such as hardware faults, defective parts, unexpected human actions, or any state that lies outside the training distribution can lead to failed executions. Vision-based Anomaly Detection (AD) methods have emerged as an appropriate solution to detect these anomalous failure states but do not distinguish failures from benign deviations. We introduce FIDeL (Failure Identification in Demonstration Learning), a policy-independent failure detection module. Leveraging recent AD methods, FIDeL builds a compact representation of nominal demonstrations and aligns incoming observations via optimal transport matching to produce anomaly scores and heatmaps. Spatio-temporal thresholds are derived with an extension of conformal prediction, and a Vision-Language Model (VLM) performs semantic filtering to discriminate benign anomalies from genuine failures. We also introduce BotFails, a multimodal dataset of real-world tasks for failure detection in robotics. FIDeL consistently outperforms state-of-the-art baselines, yielding +5.30% AUROC in anomaly detection and +17.38% failure-detection accuracy on BotFails compared to existing methods.
Summary / 总结
The paper addresses the issue of rare events leading to failed executions in imitation learning policies for robotics. It introduces FIDeL, a failure detection module that uses statistical and semantic filtering to distinguish between benign anomalies and genuine failures. FIDeL builds a compact representation of nominal demonstrations and uses optimal transport matching to produce anomaly scores and heatmaps. It also employs conformal prediction and a Vision-Language Model for semantic filtering. Experiments on the BotFails dataset show that FIDeL outperforms existing methods, achieving a 5.30% improvement in AUROC and a 17.38% improvement in failure-detection accuracy.
论文针对机器人中模仿学习策略在罕见事件下导致执行失败的问题,提出了FIDeL,一种使用统计和语义过滤来区分良性异常和真实故障的故障检测模块。FIDeL构建了名义演示的紧凑表示,并使用最优运输匹配生成异常分数和热图。它还使用符合预测和视觉语言模型进行语义过滤。在BotFails数据集上的实验表明,FIDeL优于现有方法,实现了5.30%的AUROC改进和17.38%的故障检测准确率改进。
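The abstract above derives its thresholds from an extension of conformal prediction. As background, here is a minimal sketch of the standard split-conformal threshold on scalar anomaly scores. It is not FIDeL's spatio-temporal extension, and the function names and the `alpha` default are assumptions made for illustration.

```python
import math


def conformal_threshold(calibration_scores, alpha=0.05):
    """Standard split-conformal threshold on nominal (failure-free) scores.

    Returns the ceil((n + 1) * (1 - alpha))-th smallest calibration score, so a
    new exchangeable nominal score exceeds it with probability at most alpha.
    """
    n = len(calibration_scores)
    if n == 0:
        raise ValueError("need at least one calibration score")
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        # Too few calibration points to certify this alpha: never flag.
        return math.inf
    return sorted(calibration_scores)[rank - 1]


def is_anomalous(score, threshold):
    """Flag an observation whose anomaly score exceeds the calibrated threshold."""
    return score > threshold
```

A flagged observation is only a candidate here. In the pipeline the abstract describes, a VLM then filters out benign anomalies before a failure is declared.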
Saber: An Efficient Sampling with Adaptive Acceleration and Backtracking Enhanced Remasking for Diffusion Language Model
Authors: Yihong Dong, Zhaoyu Ma, Xue Jiang, Zhiyuan Fan, Jiaru Qian, Yongmin Li, Jianha Xiao, Zhi Jin, Rongyu Cao, Binhua Li, Fei Huang, Yongbin Li, Ge Li
Venue: ACL 2026
First: 2025-10-20T23:38:12+00:00 · Latest: 2026-04-15T11:38:13+00:00
Comments: Accepted to ACL 2026 (main)
Abstract
Diffusion language models (DLMs) are emerging as a powerful and promising alternative to the dominant autoregressive paradigm, offering inherent advantages in parallel generation and bidirectional context modeling. However, the performance of DLMs on code generation tasks, which have stronger structural constraints, is significantly hampered by the critical trade-off between inference speed and output quality. We observed that accelerating the code generation process by reducing the number of sampling steps usually leads to a catastrophic collapse in performance. In this paper, we introduce efficient Sampling with Adaptive acceleration and Backtracking Enhanced Remasking (i.e., Saber), a novel training-free sampling algorithm for DLMs to achieve better inference speed and output quality in code generation. Specifically, Saber is motivated by two key insights in the DLM generation process: 1) it can be adaptively accelerated as more of the code context is established; 2) it requires a backtracking mechanism to reverse the generated tokens. Extensive experiments on multiple mainstream code generation benchmarks show that Saber boosts Pass@1 accuracy by an average improvement of 1.9% over mainstream DLM sampling methods, meanwhile achieving an average 251.4% inference speedup. By leveraging the inherent advantages of DLMs, our work significantly narrows the performance gap with autoregressive models in code generation.
中文标题/摘要
标题:Saber:一种高效的自适应加速和回溯增强重掩码采样算法以提高扩散语言模型的代码生成性能
扩散语言模型(DLMs)正在成为一种强大的替代自回归范式的有力选择,提供了并行生成和双向上下文建模的内在优势。然而,DLMs在具有更强结构约束的代码生成任务上的性能受到推理速度和输出质量之间关键权衡的严重限制。我们观察到,通过减少采样步骤来加速代码生成过程通常会导致性能灾难性下降。在本文中,我们提出了一种新的无需训练的采样算法——自适应加速和回溯增强重掩码(即Saber),以实现DLMs在代码生成中的更好推理速度和输出质量。具体而言,Saber受到DLM生成过程中的两个关键洞察的启发:1)随着更多代码上下文的建立,它可以自适应加速;2)它需要一个回溯机制来撤销生成的标记。在多个主流代码生成基准上的广泛实验表明,与主流DLM采样方法相比,Saber在Pass@1准确率上平均提高了1.9%,同时实现了平均251.4%的推理速度提升。通过利用DLMs的内在优势,我们的工作显著缩小了与自回归模型在代码生成中的性能差距。
Summary / 总结
Saber is a novel training-free sampling algorithm for diffusion language models (DLMs) designed to improve inference speed and output quality in code generation tasks. Motivated by the ability of DLMs to adaptively accelerate and the need for a backtracking mechanism, Saber enhances sampling efficiency without compromising performance. Experiments on multiple code generation benchmarks demonstrate that Saber improves Pass@1 accuracy by 1.9% and achieves a 251.4% inference speedup compared to mainstream DLM sampling methods.
Saber 是一种针对扩散语言模型(DLMs)的新型无训练采样算法,旨在提高代码生成任务中的推理速度和输出质量。受 DLMs 能够自适应加速和需要回退机制的启发,Saber 提高了采样效率而不牺牲性能。在多个代码生成基准测试上的实验表明,Saber 将 Pass@1 准确性提高了 1.9%,同时实现了比主流 DLM 采样方法快 251.4% 的推理速度。
Calibrated Speculative Decoding: Frequency-Guided Candidate Selection for Efficient Inference
Authors: Xuwen Zhou, Fangxin Liu, Chao Wang, Xiao Zheng, Hao Zheng, Min He, Li Jiang, Haibing Guan
Venue: ACL 2026
First: 2026-04-15T09:01:54+00:00 · Latest: 2026-04-15T09:01:54+00:00
Comments: ACL 2026 Main Conference
Abstract
Speculative decoding accelerates autoregressive generation by letting draft tokens bypass full verification, but conventional frameworks suffer from frequent false rejections, particularly when draft models produce semantically correct but lexically divergent outputs. In this paper, we present Calibrated Speculative Decoding (CSD), a training-free framework that recovers valid tokens discarded by standard verification. Guided by the principle of "Frequency-Guided Candidate Selection and Probability-Guarded Acceptance," CSD incorporates two lightweight modules: Online Correction Memory, which aggregates historical rejections to propose recurring divergence patterns as rescue candidates, and Semantic Consistency Gating, which verifies candidate admissibility using probability ratios instead of exact token matching. Our evaluation across diverse large language models demonstrates that CSD outperforms existing methods, achieving a peak throughput speedup of 2.33x. CSD preserves model accuracy across all tasks while further boosting performance on complex reasoning datasets. These results establish CSD as a highly effective, lightweight solution for practical LLM deployments.
中文标题/摘要
标题:校准推测性解码:基于频率的候选选择以实现高效推理
推测性解码通过让草稿令牌绕过全面验证来加速自回归生成,但传统框架经常遭受频繁的误拒,尤其是在草稿模型产生语义正确但词汇上偏离的输出时。本文中,我们提出了校准推测性解码(CSD),这是一种无需训练的框架,能够恢复标准验证中丢弃的有效令牌。CSD 通过“基于频率的候选选择和概率保护的接受”原则,结合了两个轻量级模块:在线纠正记忆,它汇总历史拒接以提出反复出现的偏离模式作为救援候选;以及语义一致性门控,它使用概率比而不是精确的令牌匹配来验证候选的可接受性。我们在多种大型语言模型上的评估表明,CSD 在所有任务中均优于现有方法,峰值吞吐量加速达到 2.33 倍。CSD 保持了模型在所有任务中的准确性,同时在复杂推理数据集上进一步提升了性能。这些结果确立了 CSD 作为实用的大规模语言模型部署中高效、轻量级的解决方案。
Summary / 总结
Calibrated Speculative Decoding (CSD) addresses the issue of false rejections in speculative decoding by incorporating two lightweight modules: Online Correction Memory, which proposes recurring divergence patterns as rescue candidates, and Semantic Consistency Gating, which verifies candidate admissibility using probability ratios. CSD improves throughput by up to 2.33x, preserves model accuracy across tasks, and further improves results on complex reasoning datasets, making it a lightweight option for large language model deployments.
Calibrated Speculative Decoding (CSD) 通过引入在线修正记忆模块和语义一致性筛选模块,解决了传统推测解码中频繁误拒的问题。这些模块有助于恢复有效令牌并验证候选令牌的可接受性。CSD 可将吞吐量提高多达 2.33 倍,同时保持模型准确性,特别是在复杂推理任务上表现更佳,使其成为大型语言模型部署的有效解决方案。
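The two modules described above can be pictured with a rough sketch. This is one reading of the abstract, not the paper's algorithm: the class and function names, the pairing of draft and target tokens, and the `min_count` and `min_ratio` thresholds are all assumptions for illustration.

```python
from collections import Counter


class CorrectionMemory:
    """Counts rejected (draft_token, target_token) pairs seen so far."""

    def __init__(self):
        self._counts = Counter()

    def record_rejection(self, draft_token, target_token):
        self._counts[(draft_token, target_token)] += 1

    def count(self, draft_token, target_token):
        return self._counts[(draft_token, target_token)]


def should_rescue(memory, draft_token, target_token, target_probs,
                  min_count=2, min_ratio=0.3):
    """Decide whether to keep a draft token that exact matching would reject.

    target_probs maps token -> probability under the target model at this step.
    min_count and min_ratio are hypothetical thresholds, not values from the paper.
    """
    # Frequency check: only recurring divergences become rescue candidates.
    if memory.count(draft_token, target_token) < min_count:
        return False
    p_target = target_probs.get(target_token, 0.0)
    if p_target <= 0.0:
        return False
    # Probability-ratio gate in place of exact token matching.
    return target_probs.get(draft_token, 0.0) / p_target >= min_ratio
```

The ratio gate means a frequent but improbable substitution is still rejected, which is how this sketch tries to reflect the "probability-guarded acceptance" idea named in the abstract.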
SocialMirror: Reconstructing 3D Human Interaction Behaviors from Monocular Videos with Semantic and Geometric Guidance
Authors: Qi Xia, Peishan Cong, Ziyi Wang, Yujing Sun, Qin Sun, Xinge Zhu, Mao Ye, Ruigang Yang, Yuexin Ma
First: 2026-04-15T07:41:52+00:00 · Latest: 2026-04-15T07:41:52+00:00
Abstract
Accurately reconstructing human behavior in close-interaction scenarios is crucial for enabling realistic virtual interactions in augmented reality, precise motion analysis in sports, and natural collaborative behavior in human-robot tasks. Reliable reconstruction in these contexts significantly enhances the realism and effectiveness of AI-driven interactive applications. However, human reconstruction from monocular videos in close-interaction scenarios remains challenging due to severe mutual occlusions, which lead to local motion ambiguity, disrupted temporal continuity, and spatial relationship errors. In this paper, we propose SocialMirror, a diffusion-based framework that integrates semantic and geometric cues to effectively address these issues. Specifically, we first leverage high-level interaction descriptions generated by a vision-language model to guide a semantic-guided motion infiller, hallucinating occluded bodies and resolving local pose ambiguities. Next, we propose a sequence-level temporal refiner that enforces smooth, jitter-free motions, while incorporating geometric constraints during sampling to ensure plausible contact and spatial relationships. Evaluations on multiple interaction benchmarks show that SocialMirror achieves state-of-the-art performance in reconstructing interactive human meshes, demonstrating strong generalization across unseen datasets and in-the-wild scenarios. The code will be released upon publication.
Simplicity Prevails: The Emergence of Generalizable AIGI Detection in Visual Foundation Models
Authors: Yue Zhou, Xinan He, Kaiqing Lin, Bing Fan, Feng Ding, Bin Li
First: 2026-02-02T07:20:02+00:00 · Latest: 2026-04-15T07:37:32+00:00
Abstract
While specialized detectors for AI-Generated Images (AIGI) achieve near-perfect accuracy on curated benchmarks, they suffer from a dramatic performance collapse in realistic, in-the-wild scenarios. In this work, we demonstrate that simplicity prevails over complex architectural designs. A simple linear classifier trained on the frozen features of modern Vision Foundation Models, including Perception Encoder, MetaCLIP 2, and DINOv3, establishes a new state-of-the-art. Through a comprehensive evaluation spanning traditional benchmarks, unseen generators, and challenging in-the-wild distributions, we show that this baseline not only matches specialized detectors on standard benchmarks but also decisively outperforms them on in-the-wild datasets, boosting accuracy by striking margins of over 30%. We posit that this superior capability is an emergent property driven by the massive scale of pre-training data containing synthetic content. We trace the source of this capability to two distinct manifestations of data exposure: Vision-Language Models internalize an explicit semantic concept of forgery, while Self-Supervised Learning models implicitly acquire discriminative forensic features from the pretraining data. However, we also reveal persistent limitations: these models suffer from performance degradation under recapture and transmission, and remain blind to VAE reconstruction and localized editing. We conclude by advocating for a paradigm shift in AI forensics, moving from overfitting on static benchmarks to harnessing the evolving world knowledge of foundation models for real-world reliability.
Summary / 总结
This study addresses the limitations of specialized detectors for AI-generated images (AIGI) in real-world scenarios by demonstrating that a simple linear classifier trained on the frozen features of modern Vision Foundation Models achieves superior performance. The classifier matches specialized detectors on standard benchmarks and outperforms them on in-the-wild datasets, with accuracy improvements of over 30%. The authors attribute this success to the massive scale of pre-training data containing synthetic content, which enables the models to internalize semantic concepts of forgery and acquire discriminative forensic features. However, the models still degrade under recapture and transmission and remain blind to VAE reconstruction and localized editing. The research suggests a shift towards leveraging the evolving knowledge of foundation models for more reliable AI forensics in real-world applications.
本研究通过展示一个简单的线性分类器在现代视觉基础模型的冻结特征上进行训练,能够在现实世界场景中超越专门的AI生成图像(AIGI)检测器,实现了更优的性能。该分类器在传统基准测试和野外数据集上均表现出色,准确率提高了超过30%。作者认为这种成功得益于大量包含合成内容的预训练数据,使模型能够内化伪造的语义概念并获得区分性的法医特征。然而,模型在处理重新捕获和传输的图像以及局部编辑方面仍存在局限性。研究建议转向利用基础模型不断发展的知识来进行更可靠的AI取证,以适应现实世界的应用需求。
UHR-BAT: Budget-Aware Token Compression Vision-Language model for Ultra-High-Resolution Remote Sensing
Authors: Yunkai Dang, Minxin Dai, Yuekun Yang, Zhangnan Li, Wenbin Li, Feng Miao, Yang Gao
First: 2026-04-15T07:21:37+00:00 · Latest: 2026-04-15T07:21:37+00:00
Abstract
Ultra-high-resolution (UHR) remote sensing imagery couples kilometer-scale context with query-critical evidence that may occupy only a few pixels. Such vast spatial scale leads to a quadratic explosion of visual tokens and hinders the extraction of information from small objects. Previous works utilize direct downsampling, dense tiling, or global top-k pruning, which either compromise query-critical image details or incur unpredictable compute. In this paper, we propose UHR-BAT, a query-guided and region-faithful token compression framework to efficiently select visual tokens under a strict context budget. Specifically, we leverage text-guided, multi-scale importance estimation for visual tokens, effectively tackling the challenge of achieving precise yet low-cost feature extraction. Furthermore, by introducing region-wise preserve and merge strategies, we mitigate visual token redundancy, further driving down the computational budget. Experimental results show that UHR-BAT achieves state-of-the-art performance across various benchmarks. Code will be available at https://github.com/Yunkaidang/UHR.
中文标题/摘要
标题:UHR-BAT:预算感知的视觉语言模型,用于超高清遥感
超高清(UHR)遥感图像结合了千米级的上下文和可能仅占据几个像素的关键查询证据。这种巨大的空间尺度导致视觉标记数量呈平方级爆炸,阻碍了对小目标信息的提取。以往的工作利用直接下采样、密集平铺或全局top-k剪枝,要么牺牲关键查询的图像细节,要么导致不可预测的计算开销。在本文中,我们提出了一种UHR-BAT,这是一种在严格上下文预算下高效选择视觉标记的查询引导和区域忠实标记压缩框架。具体而言,我们利用文本引导的多尺度重要性估计视觉标记,有效地解决了精确而低成本特征提取的挑战。此外,通过引入区域级保留和合并策略,我们减少了视觉标记的冗余,进一步降低了计算预算。实验结果表明,UHR-BAT在各种基准测试中达到了最先进的性能。代码将在https://github.com/Yunkaidang/UHR/上提供。
CLIP Architecture for Abdominal CT Image-Text Alignment and Zero-Shot Learning: Investigating Batch Composition and Data Scaling
Authors: Shivika, Kartik Bose, Pankaj Gupta
First: 2026-04-15T07:10:01+00:00 · Latest: 2026-04-15T07:10:01+00:00
Abstract
Vision-language models trained with contrastive learning on paired medical images and reports show strong zero-shot diagnostic capabilities, yet the effect of training batch composition on learned representations remains unexplored for 3D medical imaging. We reproduce Merlin, a dual-encoder model that aligns 3D abdominal CT volumes with radiology reports using symmetric InfoNCE loss, achieving a zero-shot macro F1 of 74.45% across 30 findings (original: 73.00%). We then investigate two axes of variation. First, we control the normal-to-abnormal ratio within training batches at 25:75, 50:50, and 75:25 using section-level balanced sampling on the full dataset. All three configurations underperform the unbalanced baseline by 2.4 to 2.8 points, with 75:25 achieving the best result (72.02%) among balanced variants. Second, we conduct data scaling ablations on a 4,362-study subset, training with 20%, 40%, and 100% of the data. Performance scales sub-linearly from 65.26% to 71.88%, with individual findings varying dramatically in data sensitivity. Enforcing 50:50 balanced sampling on the same subset further degrades performance to 68.01%, confirming that explicit class balancing hurts regardless of dataset or balancing granularity. Our results indicate that the stochastic diversity of random sampling, combined with Merlin's alternating batching over anatomical subsections, provides more effective regularization than engineered class ratios at the small batch sizes required by 3D medical volumes.
中文标题/摘要
标题:CLIP架构在腹部CT图像-文本对齐及零样本学习中的应用:探究批次组成和数据规模的影响
使用对比学习在配对医学图像和报告上训练的视觉-语言模型展示了强大的零样本诊断能力,但3D医学成像中训练批次组成对学习表示的影响尚未被探索。我们重现了Merlin,这是一种双编码器模型,使用对称的InfoNCE损失将3D腹部CT体积与放射学报告对齐,实现了30个发现上74.45%的零样本宏F1(原始值为73.00%)。然后我们研究了两个变量轴。首先,我们通过在完整数据集上进行节段级平衡采样,将训练批次中的正常与异常比例控制在25:75、50:50和75:25。所有三种配置均比不平衡基线低2.4到2.8个百分点,其中75:25在平衡变体中表现最佳(72.02%)。其次,我们在4,362个研究子集上进行了数据规模消融实验,分别使用20%、40%和100%的数据进行训练。性能从65.26%次线性增长到71.88%,不同发现的数据敏感性差异显著。在相同子集上强制执行50:50的平衡采样进一步降低了性能至68.01%,证实了明确的类别平衡无论是在哪个数据集或平衡粒度下都会损害性能。我们的结果表明,随机采样的随机多样性与Merlin在解剖子部分上的交替批次相结合,提供了比工程化的类别比例在小批次大小下更有效的正则化。
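For readers unfamiliar with the symmetric InfoNCE loss named in the abstract above, here is a minimal pure-Python sketch of the general CLIP-style formulation. It is not Merlin's implementation: the function names are hypothetical, embeddings are assumed to be L2-normalised, and the temperature default of 0.07 is an assumed value.

```python
import math


def _log_softmax(row):
    """Numerically stable log-softmax of one row of logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(v - m) for v in row))
    return [v - log_z for v in row]


def symmetric_info_nce(image_emb, text_emb, temperature=0.07):
    """Average of image-to-text and text-to-image cross-entropy over a batch.

    image_emb[i] and text_emb[i] form the matching pair; every other
    combination in the batch serves as a negative.
    """
    n = len(image_emb)
    # Similarity logits: dot product of each image with each text, scaled.
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in text_emb]
        for img in image_emb
    ]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-image direction
    loss_i2t = -sum(_log_softmax(logits[i])[i] for i in range(n)) / n
    loss_t2i = -sum(_log_softmax(logits_t[i])[i] for i in range(n)) / n
    return 0.5 * (loss_i2t + loss_t2i)
```

Because every other item in the batch acts as a negative, the mix of normal and abnormal studies in a batch changes which negatives the model sees, which is why batch composition is a meaningful variable to ablate.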
Training-Free Test-Time Contrastive Learning for Large Language Models
Authors: Kaiwen Zheng, Kai Zhou, Jinwu Hu, Te Gu, Mingkai Peng, Fei Liu
Venue: ACL 2026
First: 2026-04-15T06:56:35+00:00 · Latest: 2026-04-15T06:56:35+00:00
Comments: Accepted by Findings ACL 2026
Abstract
Large language models (LLMs) demonstrate strong reasoning capabilities, but their performance often degrades under distribution shift. Existing test-time adaptation (TTA) methods rely on gradient-based updates that require white-box access and incur substantial overhead, while training-free alternatives are either static or depend on external guidance. In this paper, we propose Training-Free Test-Time Contrastive Learning (TF-TTCL), a training-free adaptation framework that enables a frozen LLM to improve online by distilling supervision from its own inference experiences. Specifically, TF-TTCL implements a dynamic "Explore-Reflect-Steer" loop through three core modules: 1) Semantic Query Augmentation first diversifies problem views via multi-agent role-playing to generate different reasoning trajectories; 2) Contrastive Experience Distillation then captures the semantic gap between superior and inferior trajectories, distilling them into explicit textual rules; and 3) Contextual Rule Retrieval finally activates these stored rules during inference to dynamically steer the frozen LLM toward robust reasoning patterns while avoiding observed errors. Extensive experiments on closed-ended reasoning tasks and open-ended evaluation tasks demonstrate that TF-TTCL consistently outperforms strong zero-shot baselines and representative TTA methods under online evaluation. Code is available at https://github.com/KevinSCUTer/TF-TTCL.
Summary / 总结
The paper proposes TF-TTCL, a training-free test-time adaptation method for large language models (LLMs) that improves performance under distribution shift without requiring gradient-based updates. It uses a dynamic 'Explore-Reflect-Steer' loop with three modules: Semantic Query Augmentation, Contrastive Experience Distillation, and Contextual Rule Retrieval. Experiments show TF-TTCL outperforms zero-shot baselines and other TTA methods on both closed-ended and open-ended tasks.
论文提出了一种无需训练的测试时自适应方法TF-TTCL,用于在分布变化下提升大型语言模型(LLM)的表现,而不需要梯度更新。该方法使用一个动态的‘探索-反思-引导’循环,包含三个模块:语义查询增强、对比经验蒸馏和上下文规则检索。实验表明,TF-TTCL在封闭式和开放式任务上均优于零样本基线和其他自适应方法。
Free Lunch for Unified Multimodal Models: Enhancing Generation via Reflective Rectification with Inherent Understanding
Authors: Yibo Jiang, Tao Wu, Rui Jiang, Yehao Lu, Chaoxiang Cai, Zequn Qin, Xi Li
First: 2026-04-15T06:41:56+00:00 · Latest: 2026-04-15T06:41:56+00:00
Abstract
Unified Multimodal Models (UMMs) aim to integrate visual understanding and generation within a single structure. However, these models exhibit a notable capability mismatch, where their understanding capability significantly outperforms their generation. This mismatch indicates that the model's rich internal knowledge, while effective for understanding tasks, remains underactivated during generation. To address this, we draw inspiration from the human "Thinking-While-Drawing" paradigm, where humans continuously reflect to activate their knowledge and rectify intermediate results. In this paper, we propose UniRect-CoT, a training-free unified rectification chain-of-thought framework. Our approach unlocks the "free lunch" hidden in the UMM's powerful inherent understanding to continuously reflect, activating its internal knowledge and rectifying intermediate results during generation. We regard the diffusion denoising process in UMMs as an intrinsic visual reasoning process and align the intermediate results with the target instruction understood by the model, serving as a self-supervisory signal to rectify UMM generation. Extensive experiments demonstrate that UniRect-CoT can be easily integrated into existing UMMs, significantly enhancing generation quality across diverse complex tasks.
Summary / 总结
The paper addresses the capability mismatch in Unified Multimodal Models (UMMs), where understanding is strong but generation is weak. It proposes UniRect-CoT, a training-free framework inspired by human 'Thinking-While-Drawing' to continuously reflect and activate internal knowledge during generation. Experiments show that UniRect-CoT enhances generation quality across various complex tasks when integrated into existing UMMs.
论文针对统一多模态模型(UMMs)中理解能力强而生成能力弱的问题,提出了一种无需训练的框架UniRect-CoT,灵感来源于人类的‘边画边思考’过程,以连续反思和激活内部知识来提升生成能力。实验表明,将UniRect-CoT集成到现有的UMMs中,可以显著提高多种复杂任务的生成质量。
Addressing Overthinking in Large Vision-Language Models via Gated Perception-Reasoning Optimization
Authors: Xingjian Diao, Zheyuan Liu, Chunhui Zhang, Weiyi Wu, Keyi Kong, Lin Shi, Kaize Ding, Soroush Vosoughi, Jiang Gui
Venue: ACL 2026
First: 2026-01-07T23:05:17+00:00 · Latest: 2026-04-15T06:39:57+00:00
Comments: Accepted to Annual Meeting of the Association for Computational Linguistics (ACL 2026)
Abstract
Large Vision-Language Models (LVLMs) have exhibited strong reasoning capabilities through chain-of-thought mechanisms that generate step-by-step rationales. However, such slow-thinking approaches often lead to overthinking, where models produce excessively verbose responses even for simple queries, resulting in test-time inefficiency and even degraded accuracy. Prior work has attempted to mitigate this issue via adaptive reasoning strategies, but these methods largely overlook a fundamental bottleneck: visual perception failures. We argue that stable reasoning critically depends on low-level visual grounding, and that reasoning errors often originate from imperfect perception rather than insufficient deliberation. To address this limitation, we propose Gated Perception-Reasoning Optimization (GPRO), a meta-reasoning controller that dynamically routes computation among three decision paths at each generation step: a lightweight fast path, a slow perception path for re-examining visual inputs, and a slow reasoning path for internal self-reflection. To learn this distinction, we derive large-scale failure attribution supervision from approximately 790k samples, using teacher models to distinguish perceptual hallucinations from reasoning errors. We then train the controller with multi-objective reinforcement learning to optimize the trade-off between task accuracy and computational cost under uncertainty. Experiments on five benchmarks demonstrate that GPRO substantially improves both accuracy and efficiency, outperforming recent slow-thinking methods while generating significantly shorter responses.
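Title / abstract (translated from Chinese)
Title: Addressing Overthinking in Large Vision-Language Models via Gated Perception-Reasoning Optimization
Large vision-language models (LVLMs) show strong reasoning ability through chain-of-thought mechanisms that generate step-by-step rationales. However, this slow-thinking approach often leads to overthinking: models produce verbose responses to simple queries, which makes test-time inference inefficient and can even lower accuracy. Prior work has tried to mitigate the problem with adaptive reasoning strategies, but these methods mostly overlook a fundamental bottleneck, namely visual perception failures. We argue that stable reasoning depends on low-level visual grounding and that reasoning errors usually stem from imperfect perception rather than insufficient deliberation. To address this limitation, we propose Gated Perception-Reasoning Optimization (GPRO), a meta-reasoning controller that dynamically allocates computation among three decision paths at each generation step: a lightweight fast path, a slow perception path that re-examines the visual input, and a slow reasoning path for internal self-reflection. To learn this distinction, we derive large-scale failure-attribution supervision from about 790k samples, using teacher models to separate perceptual hallucinations from reasoning errors. We then train the controller with multi-objective reinforcement learning to optimize the trade-off between task accuracy and computational cost under uncertainty. Experiments on five benchmarks show that GPRO markedly improves both accuracy and efficiency, outperforming recent slow-thinking methods while generating significantly shorter responses.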
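To illustrate the routing and the accuracy/cost trade-off described in the abstract above, here is a minimal sketch. It is not the authors' implementation: it assumes greedy routing over three gate logits and a scalarized reward (correctness minus a weighted compute cost). The path names, the cost values, and the weight `lam` are all hypothetical.

```python
import math

# Hypothetical per-step compute costs for the three decision paths
# (illustrative values only, not taken from the paper).
PATH_COST = {"fast": 1.0, "slow_perception": 4.0, "slow_reasoning": 6.0}


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def choose_path(gate_logits):
    """Greedy routing: return the path with the highest gate probability."""
    names = list(PATH_COST)
    probs = softmax(gate_logits)
    best = max(range(len(names)), key=lambda i: probs[i])
    return names[best], probs[best]


def scalarized_reward(correct, paths_taken, lam=0.05):
    """Assumed reward: 1 for a correct answer, minus lam times total cost."""
    cost = sum(PATH_COST[p] for p in paths_taken)
    return (1.0 if correct else 0.0) - lam * cost
```

Under this assumed reward, a correct answer reached mostly through the fast path scores higher than the same answer reached through repeated slow steps, which is the kind of pressure toward shorter responses that the abstract describes.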
Evolvable Embodied Agent for Robotic Manipulation via Long Short-Term Reflection and Optimization
Authors: Jianzong Wang, Botao Zhao, Yayun He, Junqing Peng, Xulong Zhang
First: 2026-04-15T06:29:02+00:00 · Latest: 2026-04-15T06:29:02+00:00
Comments: This work has been accepted for publication in the Proceedings of the 2026 International Joint Conference on Neural Networks (IJCNN 2026)
Abstract
Achieving general-purpose robotics requires empowering robots to adapt and evolve based on their environment and feedback. Traditional methods face limitations such as extensive training requirements, difficulties in cross-task generalization, and lack of interpretability. Prompt learning offers new opportunities for self-evolving robots without extensive training, but simply reflecting on past experiences is not enough, because extracting meaningful insights from task successes and failures remains a challenge. To this end, we propose the evolvable embodied agent (EEAgent) framework, which leverages large vision-language models (VLMs) for better environmental interpretation and policy planning. To enhance reflection on past experiences, we propose a long short-term reflective optimization (LSTRO) mechanism that dynamically refines prompts based on both past experiences and newly learned lessons, facilitating continuous self-evolution and thereby enhancing overall task success rates. Evaluations on six VIMA-Bench tasks reveal that our approach sets a new state-of-the-art, notably outperforming baselines in complex scenarios.
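Title / abstract (translated from Chinese)
Title: Evolvable Embodied Agent for Robotic Manipulation via Long Short-Term Reflection and Optimization
Achieving general-purpose robotics requires giving robots the ability to adapt and evolve based on their environment and feedback. Traditional methods are limited by extensive training requirements, difficulty generalizing across tasks, and a lack of interpretability. Prompt learning offers new opportunities for self-evolving robots without extensive training, but relying only on reflection over past experiences remains challenging. To this end, we propose the evolvable embodied agent (EEAgent) framework, which uses large vision-language models (VLMs) for better environmental interpretation and policy planning. To strengthen reflection on past experiences, we propose a long short-term reflective optimization (LSTRO) mechanism that dynamically adjusts prompts based on past experiences and newly learned lessons, promoting continuous self-evolution and thereby raising overall task success rates. Evaluations on six VIMA-Bench tasks show that our approach sets a new state of the art and clearly outperforms baselines, especially in complex scenarios.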
Summary / 总结
The research aims to develop a self-evolving robotic manipulation system that can adapt to different environments and tasks. The proposed evolvable embodied agent (EEAgent) framework uses large vision-language models for better environmental understanding and policy planning. It introduces a long short-term reflective optimization (LSTRO) mechanism to dynamically refine prompts based on past experiences and new lessons, enabling continuous self-improvement. Experiments on six VIMA-Bench tasks demonstrate that this approach outperforms existing methods, particularly in complex scenarios.
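(Translated from Chinese) The research aims to develop a self-evolving robotic manipulation system that can adapt to different environments and tasks. The proposed evolvable embodied agent (EEAgent) framework uses large vision-language models for better environmental understanding and policy planning. It introduces a long short-term reflective optimization (LSTRO) mechanism that dynamically adjusts prompts based on past experiences and newly learned lessons, enabling continuous self-improvement. Experiments on six VIMA-Bench tasks show that the approach outperforms existing methods in complex scenarios.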
Visual Self-Fulfilling Alignment: Shaping Safety-Oriented Personas via Threat-Related Images
Authors: Qishun Yang, Shu Yang, Lijie Hu, Di Wang
First: 2026-03-09T15:20:53+00:00 · Latest: 2026-04-15T03:21:48+00:00
Abstract
Multimodal large language models (MLLMs) face safety misalignment, where visual inputs enable harmful outputs. To address this, existing methods require explicit safety labels or contrastive data; yet, threat-related concepts are concrete and visually depictable, while safety concepts, like helpfulness, are abstract and lack visual referents. Inspired by the self-fulfilling mechanism underlying emergent misalignment, we propose Visual Self-Fulfilling Alignment (VSFA). VSFA fine-tunes vision-language models (VLMs) on neutral VQA tasks constructed around threat-related images, without any safety labels. Through repeated exposure to threat-related visual content, models internalize the implicit semantics of vigilance and caution, shaping safety-oriented personas. Experiments across multiple VLMs and safety benchmarks demonstrate that VSFA reduces the attack success rate, improves response quality, and mitigates over-refusal while preserving general capabilities. Our work extends the self-fulfilling mechanism from text to visual modalities, offering a label-free approach to VLM alignment.
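Title / abstract (translated from Chinese)
Title: Visual Self-Fulfilling Alignment: Shaping Safety-Oriented Personas via Threat-Related Images
Multimodal large language models (MLLMs) face safety misalignment, where visual inputs can lead to harmful outputs. To address this, existing methods require explicit safety labels or contrastive data; however, threat-related concepts are concrete and can be depicted visually, whereas safety concepts such as helpfulness are abstract and lack visual referents. Inspired by the self-fulfilling mechanism underlying emergent misalignment, we propose Visual Self-Fulfilling Alignment (VSFA). VSFA fine-tunes vision-language models (VLMs) on neutral visual question answering tasks built around threat-related images, without any safety labels. Through repeated exposure to threat-related visual content, models internalize the implicit semantics of vigilance and caution, shaping safety-oriented personas. Experiments across multiple VLMs and safety benchmarks show that VSFA lowers the attack success rate, improves response quality, and reduces over-refusal while preserving general capabilities. Our work extends the self-fulfilling mechanism from text to the visual modality and offers a label-free approach to VLM alignment.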
MaMe & MaRe: Matrix-Based Token Merging and Restoration for Efficient Visual Perception and Synthesis
Authors: Simin Huo, Ning Li
Venue: CVPR 2026
First: 2026-04-15T03:06:24+00:00 · Latest: 2026-04-15T03:06:24+00:00
Comments: 20 pages. Extended version of CVPR 2026 Findings paper. Neurocomputing (Elsevier) under review
Abstract
Token compression is crucial for mitigating the quadratic complexity of self-attention mechanisms in Vision Transformers (ViTs), which often involve numerous input tokens. Existing methods, such as ToMe, rely on GPU-inefficient operations (e.g., sorting, scattered writes), introducing overheads that limit their effectiveness. We introduce MaMe, a training-free, differentiable token merging method based entirely on matrix operations, which is GPU-friendly to accelerate ViTs. Additionally, we present MaRe, its inverse operation, for token restoration, forming a MaMe+MaRe pipeline for image synthesis. When applied to pre-trained models, MaMe doubles ViT-B throughput with a 2% accuracy drop. Notably, fine-tuning the last layer with MaMe boosts ViT-B accuracy by 1.0% at 1.1x speed. In SigLIP2-B@512 zero-shot classification, MaMe provides 1.3x acceleration with negligible performance degradation. In video tasks, MaMe accelerates VideoMAE-L by 48.5% on Kinetics-400 with only a 0.84% accuracy loss. Furthermore, MaMe achieves simultaneous improvements in both performance and speed on some tasks. In image synthesis, the MaMe+MaRe pipeline enhances quality while reducing Stable Diffusion v2.1 generation latency by 31%. Collectively, these results demonstrate the effectiveness of MaMe and MaRe in accelerating vision models. The code is available at https://github.com/cominder/mame.
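Title / abstract (translated from Chinese)
Title: MaMe & MaRe: Matrix-Based Token Merging and Restoration for Efficient Visual Perception and Synthesis
Token compression is crucial for mitigating the quadratic complexity of self-attention in Vision Transformers (ViTs), which often process a large number of input tokens. Existing methods such as ToMe rely on GPU-inefficient operations (for example, sorting and scattered writes), which add overhead that limits their effectiveness. We introduce MaMe, a training-free, differentiable token merging method based on matrix operations, which makes it GPU-friendly and accelerates ViTs. We also present MaRe, its inverse operation, for token restoration, forming a MaMe+MaRe pipeline for image synthesis. When applied to pre-trained models, MaMe doubles ViT-B throughput with a 2% accuracy drop. Notably, fine-tuning the last layer with MaMe raises ViT-B accuracy by 1.0% at 1.1x speed. In SigLIP2-B@512 zero-shot classification, MaMe provides 1.3x acceleration with negligible performance degradation. In video tasks, MaMe accelerates VideoMAE-L by 48.5% on Kinetics-400 with an accuracy loss of only 0.84%. In addition, MaMe improves both performance and speed on some tasks. In image synthesis, the MaMe+MaRe pipeline improves quality while reducing Stable Diffusion v2.1 generation latency by 31%. Together, these results demonstrate the effectiveness of MaMe and MaRe in accelerating vision models. The code is available at https://github.com/cominder/mame.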
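To make the idea of token merging through dense matrix operations more concrete, here is a minimal sketch. It is not the MaMe algorithm from the paper: it assumes a generic soft-assignment scheme in which every token is softly assigned to a small set of anchor tokens and the merged tokens are weighted averages computed with a matrix product, so no sorting or scattered writes are needed. The anchor selection rule, the `temperature` parameter, and all function names are hypothetical, and the input is assumed to contain at least `num_merged` tokens.

```python
import math


def matmul(a, b):
    """Plain-Python matrix product of a (p x q) and b (q x r)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def unit(v):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def merge_tokens(tokens, num_merged, temperature=0.1):
    """Merge n tokens into num_merged tokens using only dense matrix steps.

    Assumption: every (n // num_merged)-th token serves as an anchor. This
    anchor choice is hypothetical and is not taken from the paper.
    """
    n = len(tokens)
    stride = n // num_merged
    anchors = [tokens[i * stride] for i in range(num_merged)]

    # Cosine similarity between every token and every anchor (n x m).
    tok_u = [unit(t) for t in tokens]
    anc_u = [unit(a) for a in anchors]
    sim = matmul(tok_u, [list(col) for col in zip(*anc_u)])

    # Soft assignment: softmax over anchors for each token.
    assign = []
    for row in sim:
        exps = [math.exp(s / temperature) for s in row]
        z = sum(exps)
        assign.append([e / z for e in exps])

    # Normalise per anchor so each merged token is a weighted average,
    # then compute merged = W^T X as a single matrix product.
    col_sums = [sum(col) for col in zip(*assign)]
    weights_t = [[assign[i][j] / col_sums[j] for i in range(n)]
                 for j in range(num_merged)]
    return matmul(weights_t, tokens)
```

Because every step is a dense product or an elementwise operation, the same structure maps directly onto batched GPU kernels, which is the property the abstract emphasizes.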
Reading Between the Pixels: Linking Text-Image Embedding Alignment to Typographic Attack Success on Vision-Language Models
Authors: Ravikumar Balakrishnan, Sanket Mendapara, Ankit Garg
Venue: ICLR 2026
First: 2026-04-14T06:59:27+00:00 · Latest: 2026-04-15T02:42:08+00:00
Comments: Accepted at ICLR 2026 Workshop on Agents in the Wild
Abstract
We study typographic prompt injection attacks on vision-language models (VLMs), where adversarial text is rendered as images to bypass safety mechanisms, posing a growing threat as VLMs serve as the perceptual backbone of autonomous agents, from browser automation and computer-use systems to camera-equipped embodied agents. In practice, the attack surface is heterogeneous: adversarial text appears at varying font sizes and under diverse visual conditions, while the growing ecosystem of VLMs exhibits substantial variation in vulnerability, complicating defensive approaches. Evaluating 1,000 prompts from SALAD-Bench across four VLMs, namely, GPT-4o, Claude Sonnet 4.5, Mistral-Large-3, and Qwen3-VL-4B-Instruct under varying font sizes (6--28px) and visual transformations (rotation, blur, noise, contrast changes), we find: (1) font size significantly affects attack success rate (ASR), with very small fonts (6px) yielding near-zero ASR while mid-range fonts achieve peak effectiveness; (2) text attacks are more effective than image attacks for GPT-4o (36% vs 8%) and Claude (47% vs 22%), while Qwen3-VL and Mistral show comparable ASR across modalities; (3) text-image embedding distance from two multimodal embedding models (JinaCLIP and Qwen3-VL-Embedding) shows strong negative correlation with ASR across all four models (r = -0.71 to -0.93, p < 0.01); (4) heavy degradations increase embedding distance by 10--12% and reduce ASR by 34--96%, while rotation asymmetrically affects models (Mistral drops 50%, GPT-4o unchanged). These findings highlight that model-specific robustness patterns preclude one-size-fits-all defenses and offer empirical guidance for practitioners selecting VLM backbones for agentic systems operating in adversarial environments.
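Finding (3) in the abstract above reports a Pearson correlation between text-image embedding distance and ASR. As a reminder of how such a coefficient is computed, here is a minimal sketch. The function name is hypothetical, and the numbers in the usage lines are placeholders invented for illustration; they are not data from the paper.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Placeholder values for illustration only; NOT measurements from the paper.
embedding_distance = [0.20, 0.35, 0.50, 0.65, 0.80]
attack_success_rate = [0.60, 0.45, 0.40, 0.20, 0.10]
r = pearson_r(embedding_distance, attack_success_rate)
```

A negative `r` means that conditions with a larger embedding distance tend to show lower attack success, which is the direction of the relationship the abstract reports.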