Persistent Visual Memory: Sustaining Perception for Deep Generation in LVLMs
Authors: Siyuan Huang, Xiaoye Qu, Yafu Li, Tong Zhu, Zefeng He, Muxin Fu, Daizong Liu, Wei-Long Zheng, Yu Cheng
First: 2026-05-01T17:54:37+00:00 · Latest: 2026-05-01T17:54:37+00:00
Abstract
While autoregressive Large Vision-Language Models (LVLMs) demonstrate remarkable proficiency in multimodal tasks, they face a "Visual Signal Dilution" phenomenon, where the accumulation of textual history expands the attention partition function, causing visual attention to decay inversely with generated sequence length. To counteract this, we propose Persistent Visual Memory (PVM), a lightweight learnable module designed to ensure sustained, on-demand visual perception. Integrated as a parallel branch alongside the Feed-Forward Network (FFN) in LVLMs, PVM establishes a distance-agnostic retrieval pathway that directly provides visual embeddings for precise visual perception, thereby structurally mitigating the signal suppression inherent to deep generation. Extensive experiments on Qwen3-VL models demonstrate that PVM brings notable improvements with negligible parameter overhead, delivering consistent average accuracy gains across both 4B and 8B scales, particularly in complex reasoning tasks that demand persistent visual perception. Furthermore, in-depth analysis reveals that PVM can resist length-induced signal decay and accelerate internal prediction convergence.
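The dilution effect described above can be illustrated with a toy calculation. The sketch below is not the paper's analysis: it assumes, purely for illustration, that every visual token shares one attention logit and every text token shares another, so the softmax mass on the visual tokens has a closed form. The function name `visual_attention_mass` and the token counts are hypothetical.

```python
import math

def visual_attention_mass(num_visual, num_text, visual_logit=0.0, text_logit=0.0):
    """Softmax mass on visual tokens when logits are uniform per modality (toy assumption)."""
    visual = num_visual * math.exp(visual_logit)
    text = num_text * math.exp(text_logit)
    # The denominator is the attention partition function; it grows with num_text.
    return visual / (visual + text)

for generated in (64, 256, 1024, 4096):
    mass = visual_attention_mass(num_visual=256, num_text=generated)
    print(f"{generated:5d} text tokens -> visual attention mass {mass:.3f}")
```

Under these assumptions the visual mass falls roughly as the inverse of the text length once text tokens dominate, which is the trend the abstract attributes to the growing partition function.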
ScreenParse: Moving Beyond Sparse Grounding with Complete Screen Parsing Supervision
Authors: A. Said Gurbuz, Sunghwan Hong, Ahmed Nassar, Marc Pollefeys, Peter Staar
Venue: ICML 2026
First: 2026-02-15T19:00:02+00:00 · Latest: 2026-05-01T17:32:06+00:00
Comments: Accepted at ICML 2026. 28 pages, 15 figures
Abstract
Modern computer-use agents (CUA) must perceive a screen as a structured state (what elements are visible, where they are, and what text they contain) before they can reliably ground instructions and act. Yet most available grounding datasets provide sparse supervision, with insufficient and low-diversity labels that annotate only a small subset of task-relevant elements per screen, which limits both coverage and generalization; moreover, practical deployment requires efficiency to enable low-latency, on-device use. We introduce ScreenParse, a large-scale dataset for complete screen parsing, with dense annotations of all visible UI elements (boxes, 55-class types, and text) across 771K web screenshots (21M elements). ScreenParse is generated by Webshot, an automated, scalable pipeline that renders diverse URLs, extracts annotations, and applies VLM-based relabeling and quality filtering. Using ScreenParse, we train ScreenVLM, a compact, 316M-parameter vision language model (VLM) that decodes a compact ScreenTag markup representation with a structure-aware loss that upweights structure-critical tokens. ScreenVLM substantially outperforms much larger foundation VLMs on dense parsing (e.g., 0.592 vs. 0.294 PageIoU on ScreenParse) and shows strong transfer to public benchmarks. Moreover, finetuning foundation VLMs on ScreenParse consistently improves their grounding performance, suggesting that dense screen supervision provides transferable structural priors for UI understanding. Project page: https://saidgurbuz.github.io/screenparse/.
Make Your LVLM KV Cache More Lightweight
Authors: Xihao Chen, Yangyang Guo, Roger Zimmermann
First: 2026-05-01T17:11:39+00:00 · Latest: 2026-05-01T17:11:39+00:00
Comments: Accepted to Transactions on Machine Learning Research (TMLR), 2026
Abstract
Key-Value (KV) cache has become a de facto component of modern Large Vision-Language Models (LVLMs) for inference. While it enhances decoding efficiency in Large Language Models (LLMs), its direct adoption in LVLMs introduces substantial GPU memory overhead due to the large number of vision tokens processed during the prefill stage. To tackle this problem, we propose LightKV, a novel approach that reduces KV cache size by exploiting the redundancy among vision-token embeddings. Guided by text prompts, LightKV employs cross-modality message passing to aggregate informative messages across vision tokens and progressively compress them during prefill. This prompt-aware guidance distinguishes our method from prior vision-only compression strategies. We evaluate LightKV on eight open-source LVLMs across eight public benchmark datasets, e.g., MME and SeedBench. Experimental results demonstrate that with only 55% of the original vision tokens, LightKV (a) halves the vision-token KV cache size, (b) reduces computation by up to 40%, and (c) preserves general-purpose performance while significantly outperforming existing baselines.
Summary
The paper addresses the memory overhead issue of Key-Value (KV) cache in Large Vision-Language Models (LVLMs) by proposing LightKV, which reduces the cache size through cross-modality message passing and prompt-aware compression. The method halves the vision-token KV cache size and reduces computation by up to 40% while maintaining performance, outperforming existing baselines on eight open-source LVLMs across eight public benchmark datasets.
LLM DNA: Tracing Model Evolution via Functional Representations
Authors: Zhaomin Wu, Haodong Zhao, Ziyang Wang, Jizhou Guo, Qian Wang, Bingsheng He
Venue: ICLR 2026 Oral
First: 2025-09-29T09:09:57+00:00 · Latest: 2026-05-01T14:35:45+00:00
Comments: ICLR 2026 (Oral)
Abstract
The explosive growth of large language models (LLMs) has created a vast but opaque landscape: millions of models exist, yet their evolutionary relationships through fine-tuning, distillation, or adaptation are often undocumented or unclear, complicating LLM management. Existing methods are limited by task specificity, fixed model sets, or strict assumptions about tokenizers or architectures. Inspired by biological DNA, we address these limitations by mathematically defining LLM DNA as a low-dimensional, bi-Lipschitz representation of functional behavior. We prove that LLM DNA satisfies inheritance and genetic determinism properties and establish the existence of DNA. Building on this theory, we derive a general, scalable, training-free pipeline for DNA extraction. In experiments across 305 LLMs, DNA aligns with prior studies on limited subsets and achieves superior or competitive performance on specific tasks. Beyond these tasks, DNA comparisons uncover previously undocumented relationships among LLMs. We further construct the evolutionary tree of LLMs using phylogenetic algorithms, which align with shifts from encoder-decoder to decoder-only architectures, reflect temporal progression, and reveal distinct evolutionary speeds across LLM families.
Summary
This paper addresses the challenge of understanding the evolution of large language models (LLMs) through fine-tuning, distillation, or adaptation, which is often undocumented. It introduces LLM DNA as a low-dimensional, bi-Lipschitz representation of functional behavior, proving its inheritance and genetic determinism properties. The authors develop a scalable, training-free pipeline for DNA extraction and demonstrate its effectiveness in aligning with prior studies and uncovering previously undocumented relationships among LLMs. Additionally, they construct an evolutionary tree of LLMs using phylogenetic algorithms, which reflects the historical progression and distinct evolutionary speeds across different LLM families.
InpaintSLat: Inpainting Structured 3D Latents via Initial Noise Optimization
Authors: Jaeyoung Chung, Suyoung Lee, Kyoung Mu Lee
First: 2026-05-01T13:45:25+00:00 · Latest: 2026-05-01T13:45:25+00:00
Comments: project page: https://robot0321.github.io/InpaintSLat/index.html
Abstract
We present a training-free approach for controllable 3D inpainting based on initial noise optimization. In the structured 3D latent diffusion framework, we observe that the underlying geometric structure is established during the early stages of the diffusion process and exhibits high sensitivity to the initial noise. Such characteristics compromise stability in tasks like inpainting and editing, where the model must ensure strict alignment with the existing context while synthesizing a new structure. In this paper, we introduce a strategy to optimize the initial noise within the structured 3D latent diffusion framework, ensuring high-fidelity 3D inpainting. Specifically, we update the initial noise by leveraging a backpropagation approximation grounded in the rectified flow model, with the spectral parameterization specially designed for robust and efficient structured 3D latent optimization. Experiments demonstrate consistent improvements in contextual consistency and prompt alignment over representative training-free inpainting baselines, establishing initial noise control as an independent dimension for 3D inpainting, orthogonal to conventional sampling trajectory manipulation.
Summary
The paper presents a training-free approach for 3D inpainting using initial noise optimization. It leverages the structured 3D latent diffusion framework, where the geometric structure is established early and is highly sensitive to initial noise. The authors introduce a method to optimize the initial noise using a backpropagation approximation and spectral parameterization, which improves contextual consistency and prompt alignment. Experiments show consistent improvements over existing training-free inpainting methods.
Adaptive Dual-Teacher Distillation with Subnetwork Rectification for Bridging Semantic Gaps in Black-Box Domain Adaptation
Authors: Zhe Zhang, Jing Li, Wanli Xue, Xu Cheng, Jianhua Zhang, Qinghua Hu, Shengyong Chen
First: 2026-03-24T07:54:19+00:00 · Latest: 2026-05-01T13:36:31+00:00
Comments: Under Review
Abstract
Assuming that neither source data nor source model parameters are accessible, black-box domain adaptation (BBDA) represents a highly practical yet challenging setting, where transferable knowledge is limited to the predictions of a black-box source model. Existing approaches exploit such knowledge via pseudo-label refinement or by leveraging vision-language models (ViLs), but they often fail to reconcile the inherent discrepancy between task-specific knowledge from black-box models and language-aligned semantic priors of ViLs, resulting in suboptimal integration and degraded adaptation performance. To address this challenge, we propose adaptive Dual-Teacher Distillation with Subnetwork Rectification (DDSR), a framework that explicitly reconciles these complementary yet inconsistent knowledge sources. DDSR employs an adaptive prediction fusion strategy to integrate predictions from the black-box source model and a ViL, generating reliable pseudo-labels for the target domain. A subnetwork-based regularization mechanism mitigates overfitting to noisy supervision by enforcing output consistency and gradient divergency. Furthermore, progressively improved target predictions iteratively refine both pseudo-labels and ViL prompts, enhancing semantic alignment. Finally, class-wise prototypes are used to further optimize target predictions via self-training. Extensive experiments on multiple benchmark datasets demonstrate that DDSR consistently outperforms state-of-the-art methods, including those with access to source data or source model parameters.
Summary
The paper proposes Adaptive Dual-Teacher Distillation with Subnetwork Rectification (DDSR) to address the challenge of black-box domain adaptation where neither source data nor model parameters are accessible. DDSR integrates predictions from a black-box source model and a vision-language model to generate reliable pseudo-labels, and uses a subnetwork-based regularization mechanism to mitigate overfitting. The method iteratively refines pseudo-labels and vision-language model prompts, and employs class-wise prototypes for self-training. Experiments show that DDSR outperforms existing methods on multiple benchmark datasets.
Paired-CSLiDAR: Height-Stratified Registration for Cross-Source Aerial-Ground LiDAR Pose Refinement
Authors: Montana Hoover, Jing Liang, Tianrui Guan, Dinesh Manocha
First: 2026-05-01T13:14:20+00:00 · Latest: 2026-05-01T13:14:20+00:00
Comments: 8 pages, 4 figures. Dataset and code are being prepared for public release
Abstract
We introduce Paired-CSLiDAR (CSLiDAR), a cross-source aerial-ground LiDAR benchmark for single-scan pose refinement: refining a ground-scan pose within a 50 m-radius aerial crop. The benchmark contains 12,683 ground-aerial pairs across 6 evaluation sites and per-scan reference 6-DoF alignments for sub-meter root-mean-square error (RMSE) evaluation. Because aerial scans capture rooftops and canopy while ground scans capture facades and under-canopy, the two modalities share only a fraction of their geometry, primarily the terrain surface, causing standard registration methods and learned correspondence models to converge to metrically incorrect local minima. We propose Residual-Guided Stratified Registration (RGSR), a training-free, geometry-only refinement pipeline that exploits the shared ground plane through height-stratified ICP, reversed registration directions, and confidence-gated accept-if-better selection. RGSR achieves 86.0% S@0.75 m and 99.8% S@1.0 m on the primary benchmark of 9,012 scans, outperforming both the confidence-gated cascade at 83.7% and GeoTransformer at 76.3%. We validate RMSE-based pose selection with independent survey control and trajectory consistency, and show that added Fourier-Mellin BEV proposals can reduce RMSE while increasing actual pose error under extreme partial overlap. The dataset and code are being prepared for public release.
Backpropagation-Free Test-Time Adaptation via Probabilistic Gaussian Alignment
Authors: Youjia Zhang, Youngeun Kim, Young-Geun Choi, Hongyeob Kim, Huiling Liu, Sungeun Hong
First: 2025-08-21T13:42:49+00:00 · Latest: 2026-05-01T12:25:42+00:00
Abstract
Test-time adaptation (TTA) enhances the zero-shot robustness under distribution shifts by leveraging unlabeled test data during inference. Despite notable advances, several challenges still limit its broader applicability. First, most methods rely on backpropagation or iterative optimization, which limits scalability and hinders real-time deployment. Second, they lack explicit modeling of class-conditional feature distributions. This modeling is crucial for producing reliable decision boundaries and calibrated predictions, but it remains underexplored due to the lack of both source data and supervision at test time. In this paper, we propose ADAPT, an Advanced Distribution-Aware and backPropagation-free Test-time adaptation method. We reframe TTA as a Gaussian probabilistic inference task by modeling class-conditional likelihoods using gradually updated class means and a shared covariance matrix. This enables closed-form, training-free inference. To correct potential likelihood bias, we introduce lightweight regularization guided by CLIP priors and a historical knowledge bank. ADAPT requires no source data, no gradient updates, and no full access to target data, supporting both online and transductive settings. Extensive experiments across diverse benchmarks demonstrate that our method achieves state-of-the-art performance under a wide range of distribution shifts with superior scalability and robustness.
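The closed-form inference idea above can be sketched as a Gaussian classifier whose class means are updated online, with no gradients. This is a minimal illustration and not the ADAPT implementation: it assumes equal class priors and a diagonal shared covariance, updates means from hard pseudo-labels, and omits the CLIP-prior regularization and the historical knowledge bank. The class name `GaussianHead` and all values are hypothetical.

```python
import math

class GaussianHead:
    """Gradient-free classifier: per-class means, one shared diagonal covariance (simplifying assumption)."""

    def __init__(self, init_means, shared_var):
        self.means = [list(m) for m in init_means]
        self.counts = [1] * len(init_means)
        self.var = list(shared_var)

    def posterior(self, x):
        # Gaussian log-likelihood per class, up to a class-independent constant (equal priors assumed).
        scores = [
            -0.5 * sum((xi - mi) ** 2 / v for xi, mi, v in zip(x, mean, self.var))
            for mean in self.means
        ]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        return [e / total for e in exps]

    def update(self, x):
        # Predict in closed form, then move the predicted class mean toward x (running average).
        probs = self.posterior(x)
        label = max(range(len(probs)), key=probs.__getitem__)
        self.counts[label] += 1
        rate = 1.0 / self.counts[label]
        self.means[label] = [m + rate * (xi - m) for m, xi in zip(self.means[label], x)]
        return label, probs

head = GaussianHead(init_means=[[0.0, 0.0], [4.0, 4.0]], shared_var=[1.0, 1.0])
label, probs = head.update([3.5, 4.2])
print(label, [round(p, 4) for p in probs], head.means[label])
```

Because every step is a closed-form expression, each test sample costs one pass over the class means, which is why this family of methods avoids backpropagation.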
Summary
This paper proposes ADAPT, a backpropagation-free test-time adaptation method that reframes TTA as Gaussian probabilistic inference, modeling class-conditional likelihoods with gradually updated class means and a shared covariance matrix. This enables closed-form, training-free inference, and potential likelihood bias is corrected by lightweight regularization guided by CLIP priors and a historical knowledge bank. The method achieves state-of-the-art performance on diverse benchmarks under a wide range of distribution shifts, with superior scalability and robustness.
Color Conditional Generation with Sliced Wasserstein Guidance
Authors: Alexander Lobashev, Maria Larchenko, Dmitry Guskov
Venue: NeurIPS 2025 spotlight
First: 2025-03-24T18:06:03+00:00 · Latest: 2026-05-01T12:23:10+00:00
Comments: NeurIPS 2025, spotlight
Abstract
We propose SW-Guidance, a training-free approach for image generation conditioned on the color distribution of a reference image. While it is possible to generate an image with fixed colors by first creating an image from a text prompt and then applying a color style transfer method, this approach often results in semantically meaningless colors in the generated image. Our method solves this problem by modifying the sampling process of a diffusion model to incorporate the differentiable Sliced 1-Wasserstein distance between the color distribution of the generated image and the reference palette. Our method outperforms state-of-the-art techniques for color-conditional generation in terms of color similarity to the reference, producing images that not only match the reference colors but also maintain semantic coherence with the original text prompt. Our source code is available at https://github.com/alobashev/sw-guidance/.
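For readers unfamiliar with the distance named above, the following sketch computes a Monte Carlo estimate of the sliced 1-Wasserstein distance between two equally sized sets of RGB colors. It only evaluates the distance; the paper's method additionally differentiates it and uses it inside diffusion sampling, which is not shown. The function name `sliced_w1`, the number of projections, and the sample colors are assumptions made for illustration.

```python
import math
import random

def sliced_w1(colors_a, colors_b, num_projections=64, seed=0):
    """Average 1-D Wasserstein-1 distance over random projection directions in color space."""
    if len(colors_a) != len(colors_b):
        raise ValueError("this sketch assumes equally sized color sets")
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_projections):
        # Draw a random unit direction in RGB space.
        direction = [rng.gauss(0.0, 1.0) for _ in range(3)]
        length = math.sqrt(sum(c * c for c in direction))
        direction = [c / length for c in direction]
        # In 1-D, optimal transport between equal-size sets matches sorted samples.
        proj_a = sorted(sum(c * w for c, w in zip(p, direction)) for p in colors_a)
        proj_b = sorted(sum(c * w for c, w in zip(p, direction)) for p in colors_b)
        total += sum(abs(x - y) for x, y in zip(proj_a, proj_b)) / len(proj_a)
    return total / num_projections

palette = [(0.1, 0.2, 0.3), (0.9, 0.8, 0.7), (0.5, 0.5, 0.5)]
shifted = [(r + 0.2, g + 0.2, b + 0.2) for r, g, b in palette]
print(sliced_w1(palette, palette), sliced_w1(palette, shifted))
```

The distance depends only on the two color distributions, not on pixel positions, which is what lets it constrain the palette without dictating image content.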
Summary
The research aims to address the issue of semantically meaningless colors in color-conditional image generation. The method, SW-Guidance, modifies the sampling process of a diffusion model using the differentiable Sliced 1-Wasserstein distance to align the color distribution of the generated image with that of a reference image. The results show that SW-Guidance outperforms existing techniques in terms of color similarity and maintains semantic coherence with the original text prompt.
The Determinism of Randomness: Latent Space Degeneracy in Diffusion Model
Authors: Song Yan, Chenfeng Wang, Wei Zhai, Xinliang Bi, Jian Yang, Yancheng Cai, Yusen Zhang, Yunwei Lan, Tao Zhang, GuanYe Xiong, Min Li, Zheng-Jun Zha
First: 2025-11-11T02:12:38+00:00 · Latest: 2026-05-01T12:09:32+00:00
Abstract
Diffusion models initialize generation from an isotropic Gaussian latent, yet changing only the random seed can substantially alter prompt faithfulness, composition, and visual quality. We explain this gap by distinguishing the Euclidean geometry of the prior from the semantic geometry induced by the sampler: the effective map from initial noise to semantic outcome has many semantic-invariant directions and a much smaller set of semantic-sensitive directions. This induces a degenerate pullback semi-metric on the latent space and provides a geometric view of the seed lottery. Guided by this view, we propose a training-free inference procedure that estimates a prompt-aligned horizontal proxy from a single high-noise cold-start probe and applies tangential seed injection followed by spherical retraction to remain on the prior's typical shell. Across image, video, and 3D generation benchmarks, the method improves alignment and quality metrics over standard sampling.
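The last two steps named above, tangential injection and spherical retraction, can be sketched geometrically. The abstract does not give formulas, so the code below is a generic step on a sphere under stated assumptions: the update direction is projected onto the tangent space at the seed, and the result is rescaled to the seed's original norm as a stand-in for the prior's typical shell. How the paper estimates the direction from its probe is not modeled; `tangential_step` and the random direction are hypothetical.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def tangential_step(seed, direction, step_size):
    """Move the seed along the tangential part of `direction`, then retract to the seed's norm."""
    radius = norm(seed)
    # Remove the radial component so the update is tangent to the sphere at `seed`.
    scale = dot(direction, seed) / (radius * radius)
    tangent = [d - scale * s for d, s in zip(direction, seed)]
    moved = [s + step_size * t for s, t in zip(seed, tangent)]
    # Spherical retraction: rescale back to the original radius.
    shrink = radius / norm(moved)
    return [m * shrink for m in moved], tangent

rng = random.Random(0)
seed = [rng.gauss(0.0, 1.0) for _ in range(512)]
direction = [rng.gauss(0.0, 1.0) for _ in range(512)]  # placeholder for a prompt-aligned direction
new_seed, tangent = tangential_step(seed, direction, step_size=0.3)
print(norm(seed), norm(new_seed), dot(tangent, seed))
```

Keeping the norm fixed matters because a high-dimensional Gaussian sample concentrates near a shell of radius about the square root of the dimension, so a seed pushed off that shell is atypical for the sampler.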
Intrinsic Gradient Suppression for Label-Noise Prompt Tuning in Vision-Language Models
Authors: Jiayu Li, Jiaxin Qi, Sheng Zhou, Jiaqiang Huang, Xiansheng Hua
First: 2026-05-01T11:57:51+00:00 · Latest: 2026-05-01T11:57:51+00:00
Abstract
Contrastive vision-language models like CLIP exhibit remarkable zero-shot generalization. However, prompt tuning remains highly sensitive to label noise, as mislabeled samples generate disproportionately large gradients that can overwhelm pre-trained priors. We argue that because CLIP already provides a near-optimal initialization, adaptation should be inherently conservative, particularly against the extreme gradient updates common in noisy settings. To this end, we propose Double-Softmax Prompt Tuning (DSPT), a hyperparameter-free method for intrinsic gradient suppression. By applying a sequential probabilistic normalization, DSPT induces a self-adaptive saturation zone that suppresses gradients from high-error noisy samples while maintaining informative updates. We also provide both theoretical analysis and empirical evidence about how this mechanism achieves adaptive suppression. This design transforms "gradient vanishing", traditionally a training bottleneck, into a principled noise-filtering shield for label-noise prompt tuning. Extensive experiments confirm that this simple, drop-in design achieves state-of-the-art robustness across various noisy benchmarks, outperforming methods with complex architectures and handcrafted hyperparameters.
Summary
The research addresses the issue of label noise sensitivity in prompt tuning for vision-language models like CLIP, which exhibit strong zero-shot generalization. The proposed method, Double-Softmax Prompt Tuning (DSPT), suppresses gradients from high-error noisy samples while preserving informative updates through a self-adaptive saturation zone. Experiments show that DSPT achieves state-of-the-art robustness across various noisy benchmarks, outperforming more complex methods with handcrafted hyperparameters.
Jailbreaking Vision-Language Models Through the Visual Modality
Authors: Aharon Azulay, Jan Dubiński, Zhuoyun Li, Atharv Mittal, Yossi Gandelsman
Venue: ICML 2026
First: 2026-05-01T11:43:21+00:00 · Latest: 2026-05-01T11:43:21+00:00
Comments: Accepted to ICML 2026
Abstract
The visual modality of vision-language models (VLMs) is an underexplored attack surface for bypassing safety alignment. We introduce four jailbreak attacks exploiting the vision component: (1) encoding harmful instructions as visual symbol sequences with a decoding legend, (2) replacing harmful objects with benign substitutes (e.g., bomb -> banana) then prompting for harmful actions using the substitute term, (3) replacing harmful text in images (e.g., on book covers) with benign words while visual context preserves the original meaning, and (4) visual analogy puzzles whose solution requires inferring a prohibited concept. Evaluating across six frontier VLMs, our visual attacks bypass safety alignment and expose a cross-modality alignment gap: text-based safety training does not automatically generalize to harmful intent conveyed visually. For example, our visual cipher achieves 40.9% attack success on Claude-Haiku-4.5 versus 10.7% for an equivalent textual cipher. To further our insight into the attack mechanism, we present preliminary interpretability and mitigation results. These findings highlight that robust VLM alignment requires treating vision as a first-class target for safety post-training.
Summary
The paper explores the vulnerability of vision-language models (VLMs) through the visual modality, introducing four jailbreak attacks: encoding harmful instructions as visual symbol sequences, replacing harmful objects with benign substitutes, replacing harmful text in images with benign words, and visual analogy puzzles. These attacks bypass safety alignment in six frontier VLMs, exposing a cross-modality alignment gap: text-based safety training does not automatically generalize to harmful intent conveyed visually. For instance, a visual cipher achieved a 40.9% attack success rate on Claude-Haiku-4.5, compared to 10.7% for an equivalent textual cipher. The study also provides preliminary interpretability and mitigation results.
Certain Head, Uncertain Tail: Expert-Sample for Test-Time Scaling in Fine-Grained MoE
Authors: Yuanteng Chen, Peisong Wang, Nanxin Zeng, Yuantian Shao, Shuang Qiu, Gang Li, Jing Liu, Jian Cheng
Venue: ICML
First: 2026-02-02T18:39:33+00:00 · Latest: 2026-05-01T11:10:35+00:00
Comments: 25 pages, 13 figures
Abstract
Test-time scaling improves LLM performance by generating multiple candidate solutions, yet token-level sampling requires temperature tuning that trades off diversity against stability. Fine-grained MoE, featuring hundreds of well-trained experts per layer and multi-expert activation per token, offers an unexplored alternative through its rich routing space. We empirically characterize fine-grained MoE routing and uncover an informative pattern: router scores exhibit a certain head of high-confidence experts followed by an uncertain tail of low-confidence candidates. While single-run greedy accuracy remains stable when fewer experts are activated, multi-sample pass@n degrades significantly, suggesting that the certain head governs core reasoning capability while the uncertain tail correlates with reasoning diversity. Motivated by these findings, we propose Expert-Sample, a training-free method that preserves high-confidence selections while injecting controlled stochasticity into the uncertain tail, enabling diverse generation without destabilizing outputs. Evaluated on multiple fine-grained MoE models across math, knowledge reasoning, and code tasks, Expert-Sample consistently improves pass@n and verification-based accuracy. On Qwen3-30B-A3B-Instruct evaluated on GPQA-Diamond with 32 parallel samples, pass@32 rises from 85.4% to 91.9%, and accuracy improves from 59.1% to 62.6% with Best-of-N verification.
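The head/tail idea above can be sketched as a routing function that always keeps the highest-scoring experts and samples the remaining slots from the rest. This is an illustrative reading of the abstract, not the paper's algorithm: the fixed `head_size`, the softmax-weighted sampling without replacement, and the `temperature` parameter are all assumptions, and `expert_sample` is a hypothetical name.

```python
import math
import random

def expert_sample(router_scores, k, head_size, temperature=1.0, rng=None):
    """Pick k experts: the top `head_size` deterministically, the rest sampled from the tail."""
    if not 0 <= head_size <= k <= len(router_scores):
        raise ValueError("need 0 <= head_size <= k <= number of experts")
    rng = rng or random.Random()
    order = sorted(range(len(router_scores)), key=lambda i: router_scores[i], reverse=True)
    chosen = order[:head_size]   # certain head: kept as-is
    tail = order[head_size:]     # uncertain tail: candidates for sampling
    if tail:
        top = router_scores[tail[0]]
        weights = [math.exp((router_scores[i] - top) / temperature) for i in tail]
    for _ in range(k - head_size):
        # Sample one tail expert in proportion to its weight, without replacement.
        pick = rng.choices(range(len(tail)), weights=weights)[0]
        chosen.append(tail.pop(pick))
        weights.pop(pick)
    return chosen

scores = [2.0, 1.5, 0.10, 0.05, 0.02, 0.01]
for trial in range(3):
    print(expert_sample(scores, k=4, head_size=2, rng=random.Random(trial)))
```

Setting `head_size` equal to `k` recovers ordinary top-k routing, so the amount of injected randomness is controlled by how many slots are left to the tail.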
Colorful-Noise: Training-Free Low-Frequency Noise Manipulation for Color-Based Conditional Image Generation
Authors: Nadav Z. Cohen, Ofir Abramovich, Ariel Shamir
Venue: SIGGRAPH 2026
First: 2026-05-01T10:02:14+00:00 · Latest: 2026-05-01T10:02:14+00:00
Comments: SIGGRAPH 2026 Conference Paper. Project Page at: https://nadavc220.github.io/colorful-noise/
Abstract
Text-to-image diffusion models generate images by gradually converting white Gaussian noise into a natural image. White Gaussian noise is well suited for producing diverse outputs from a single text prompt due to its absence of structure. However, this very property limits control over, and predictability of, specific visual attributes, as the noise is not human-interpretable. In this work, we investigate the characteristics of the input noise in diffusion models. We show that, although all frequencies in white Gaussian noise have comparable statistical energy, low-frequency components primarily determine the image's global structure and color composition, while high-frequency components control finer details. Building on this observation, we demonstrate that simple manipulations of the low-frequency noise using low-frequency image priors can effectively condition the generation process to reconstruct these low-frequency visual cues. This allows us to define a simple, training-free method with minimal overhead that steers overall image structure and color, while letting high-frequency components freely emerge as fine details, enabling variability across generated outputs.
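The low-frequency manipulation described above can be illustrated on a single-channel noise grid. The sketch below is not the paper's procedure: it uses block averages as a crude stand-in for a low-pass filter (a frequency-domain filter would be the more faithful choice), and the blending rule, the `strength` parameter, and the function names are assumptions. It also assumes the grid height and width are divisible by the block size.

```python
import random

def block_means(grid, block):
    """Mean of each block x block tile: a crude low-frequency summary of the grid."""
    height, width = len(grid), len(grid[0])
    return [
        [
            sum(grid[y][x] for y in range(by, by + block) for x in range(bx, bx + block)) / (block * block)
            for bx in range(0, width, block)
        ]
        for by in range(0, height, block)
    ]

def inject_low_frequency(noise, prior, block, strength=1.0):
    """Shift each tile of the noise so its mean moves toward the prior's tile mean."""
    noise_low = block_means(noise, block)
    prior_low = block_means(prior, block)
    return [
        [
            # Only the tile mean changes; within-tile (high-frequency) variation is kept.
            noise[y][x] + strength * (prior_low[y // block][x // block] - noise_low[y // block][x // block])
            for x in range(len(noise[0]))
        ]
        for y in range(len(noise))
    ]

rng = random.Random(0)
noise = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(4)]
prior = [[1.0, 1.0, -1.0, -1.0] for _ in range(4)]  # hypothetical coarse layout: bright left, dark right
steered = inject_low_frequency(noise, prior, block=2)
print(block_means(steered, 2))
```

With `strength=1.0` the tile means of the steered noise match those of the prior, while the deviations inside each tile are the original random values, mirroring the split between controlled coarse structure and free fine detail.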
Summary
This work explores the characteristics of input noise in text-to-image diffusion models and finds that low-frequency components of white Gaussian noise primarily determine the global structure and color composition of generated images, while high-frequency components control finer details. By manipulating low-frequency noise using low-frequency image priors, the authors propose a training-free method to condition the generation process, allowing for control over overall image structure and color while maintaining variability in fine details.
GenRecEdit: Adapting Model Editing for Generative Recommendation with Cold-Start Items
Authors: Chenglei Shen, Teng Shi, Weijie Yu, Xiao Zhang, Jun Xu
First: 2026-03-15T07:31:28+00:00 · Latest: 2026-05-01T09:55:23+00:00
Abstract
Generative recommendation (GR) has shown strong potential for sequential recommendation in an end-to-end generation paradigm. However, existing GR models suffer from severe cold-start collapse: their recommendation accuracy on cold-start items can drop to near zero. Current solutions typically rely on retraining with cold-start interactions, which is hindered by sparse feedback, high computational cost, and delayed updates, limiting practical utility in rapidly evolving recommendation catalogs. Inspired by model editing in NLP, which enables training-free knowledge injection into large language models, we explore how to bring this paradigm to generative recommendation. This, however, faces two key challenges: GR lacks the explicit subject-object binding common in natural language, making targeted edits difficult; and GR does not exhibit stable token co-occurrence patterns, making the injection of multi-token item representations unreliable. To address these challenges, we propose GenRecEdit, a model editing framework tailored for generative recommendation. GenRecEdit explicitly models the relationship between the full sequence context and next-token generation, adopts iterative token-level editing to inject multi-token item representations, and introduces a one-to-one trigger mechanism to reduce interference among multiple edits during inference. Extensive experiments on multiple datasets show that GenRecEdit substantially improves recommendation performance on cold-start items while preserving the model's original recommendation quality. Moreover, it achieves these gains using only about 9.5% of the training time required for retraining, enabling more efficient and frequent model updates.
Summary / 总结
The paper addresses cold-start collapse in generative recommendation models, where accuracy on new items can drop to near zero. It proposes GenRecEdit, a model editing framework that uses iterative token-level editing to inject multi-token item representations without retraining. Experiments show that GenRecEdit improves cold-start recommendation performance while preserving original quality, using only about 9.5% of the training time that retraining requires.
论文针对生成推荐模型中新项目推荐精度急剧下降的问题,提出了GenRecEdit框架,通过迭代的词元级编辑注入多词元项目表示,无需重新训练。实验表明,GenRecEdit在提高新项目推荐性能的同时保持了原有质量,且所需训练时间仅约为重新训练的9.5%。
Adapting Large VLMs with Iterative and Manual Instructions for Generative Low-light Enhancement
Authors: Xiaoran Sun, Liyan Wang, Yeying Jin, Kin-man Lam, Zhixun Su, Yang Yang, Jinshan Pan, Cong Wang
First: 2025-07-24T03:35:20+00:00 · Latest: 2026-05-01T09:45:37+00:00
Comments: 11 pages, 8 figures, CVPR 2026 Findings
Abstract
Most existing low-light image enhancement (LLIE) methods rely on pre-trained model priors, low-light inputs, or both, while neglecting the semantic guidance available from normal-light images. This limitation hinders their effectiveness in complex lighting conditions. In this paper, we propose VLM-IMI, a framework that adapts large vision-language models with iterative and manual instructions for generative LLIE. VLM-IMI mainly contains two branches: Normal-Light Instruction Prior Generation (NL-IPG) and Instruction-aware Light Enhancement Diffusion (IA-LED). The NL-IPG incorporates textual descriptions of the desired normal-light content as enhancement cues, enabling semantically informed restoration. IA-LED incorporates instruction priors from the NL-IPG to guide the diffusion process, enabling precise illumination enhancement. To effectively integrate cross-modal priors, we introduce a learnable instruction prior fusion module, which dynamically aligns and fuses image and text features, promoting the generation of detailed and semantically coherent outputs. During inference, as the ground-truth normal-light images are not available, we propose an iterative-instruction inference strategy that refines the textual instructions, progressively improving visual quality. Our VLM-IMI also inherently supports manual instruction control by allowing users to directly input custom instructions into the LLM to generate user-expected outputs. Experiments across diverse scenarios demonstrate that VLM-IMI outperforms SOTA methods in terms of perception and realism. The source code is available at: https://github.com/sunxiaoran01/VLM-IMI.
High-Speed Vision Improves Zero-Shot Semantic Understanding of Human Actions
Authors: Yongpeng Cao, Yuji Yamakawa
First: 2026-05-01T08:12:26+00:00 · Latest: 2026-05-01T08:12:26+00:00
Abstract
Understanding human actions from visual observations is essential for human-robot interaction, particularly when semantic interpretation of unfamiliar or hard-to-annotate actions is required. In scenarios such as rapid and less common activities, collecting sufficient labeled data for supervised learning is challenging, making zero-shot approaches a practical alternative for semantic understanding without task-specific training. While recent advances in large-scale pretrained models enable such zero-shot reasoning, the impact of temporal resolution, especially for rapid and fine-grained motions, remains underexplored.
In this study, we investigate how temporal resolution affects zero-shot semantic understanding of high-speed human actions. Using kendo as a representative case of rapid and subtle motion patterns, we propose a training-free pipeline that combines a pre-trained video-language model for semantic representation with large language model-based reasoning for pairwise action comparison. Through controlled experiments across multiple frame rates (120 Hz, 60 Hz, and 30 Hz), we show that higher temporal resolution significantly improves semantic separability in zero-shot settings. We further analyze the role of tracking-based human joint information under both full and partial observation scenarios. Quantitative evaluation using a nearest-class prototype strategy demonstrates that high-speed video provides more stable and interpretable semantic representations for fast actions. These findings highlight the importance of temporal resolution in training-free action recognition and suggest that high-speed perception can enhance semantic understanding capabilities.
Summary / 总结
This study investigates the impact of temporal resolution on zero-shot semantic understanding of high-speed human actions, using kendo as a case study. The research proposes a training-free pipeline combining a pre-trained video-language model and large language model-based reasoning. Experiments across different frame rates (120 Hz, 60 Hz, and 30 Hz) show that higher temporal resolution significantly improves semantic separability in zero-shot settings, providing more stable and interpretable semantic representations for fast actions.
研究探讨了时间分辨率对高速人类动作零样本语义理解的影响,以剑道为例。研究提出了一种无需训练的管道,结合预训练的视频-语言模型和基于大型语言模型的推理。不同帧率(120 Hz、60 Hz 和 30 Hz)的实验表明,更高的时间分辨率显著改善了零样本设置中的语义区分性,提供了更稳定和可解释的语义表示,适用于快速动作。
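The nearest-class prototype evaluation mentioned in the abstract can be illustrated with a minimal sketch. It assumes each clip has already been embedded as a plain list of floats, that a prototype is the per-class mean embedding, and that cosine similarity is the comparison metric; the function names and class labels are hypothetical and are not taken from the paper.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def class_prototypes(embeddings, labels):
    """Average the embeddings of each class to get one prototype per class."""
    sums, counts = {}, {}
    for vec, lab in zip(embeddings, labels):
        if lab not in sums:
            sums[lab] = [0.0] * len(vec)
            counts[lab] = 0
        sums[lab] = [s + v for s, v in zip(sums[lab], vec)]
        counts[lab] += 1
    return {lab: [s / counts[lab] for s in sums[lab]] for lab in sums}

def nearest_prototype(query, prototypes):
    """Return the class whose prototype is most similar to the query embedding."""
    return max(prototypes, key=lambda lab: cosine(query, prototypes[lab]))

# Toy 2-D embeddings with hypothetical class names.
protos = class_prototypes(
    [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]],
    ["strike_a", "strike_a", "strike_b", "strike_b"],
)
prediction = nearest_prototype([0.8, 0.2], protos)
```

Under this reading, comparing frame rates amounts to recomputing embeddings from 120 Hz, 60 Hz, and 30 Hz video and checking how often the nearest prototype matches the true class.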
Leveraging Vision-Language Models as Weak Annotators in Active Learning
Authors: Phuong Ngoc Nguyen, Kaito Shiku, Ryoma Bise, Seiichi Uchida, Shinnosuke Matsuo
First: 2026-05-01T07:40:49+00:00 · Latest: 2026-05-01T07:40:49+00:00
Comments: Accepted at ICIP2026
Abstract
Active learning aims to reduce annotation cost by selectively querying informative samples for supervision under a limited labeling budget. In this work, we investigate how vision-language models (VLMs) can be leveraged to further reduce the reliance on costly human annotation within the active learning paradigm. To this end, we find that the reliability of VLMs varies significantly with label granularity in fine-grained recognition tasks: they perform poorly on fine-grained labels but can provide accurate coarse-grained labels. Leveraging this property, we propose an active learning framework that combines fine-grained human annotations with coarse-grained VLM-generated weak labels through instance-wise label assignment. We further model the systematic noise in VLM-generated labels using a small set of trusted full labels. Experiments on CUB200 and FGVC-Aircraft show that the proposed framework consistently outperforms existing active learning methods under the same annotation budget.
中文标题/摘要
标题:利用视觉语言模型作为弱注释者在主动学习中的应用
主动学习旨在通过在有限的标注预算下选择性地查询信息性样本进行监督来降低标注成本。在本工作中,我们研究了如何利用视觉语言模型(VLMs)进一步减少对昂贵的人工标注的依赖。为此,我们发现VLMs在细粒度识别任务中的可靠性随标签粒度的变化而显著不同:它们在细粒度标签上表现不佳,但在粗粒度标签上可以提供准确的标注。利用这一特性,我们提出了一种结合细粒度的人工标注和粗粒度的VLM生成的弱标注的主动学习框架,通过实例级别的标签分配。我们进一步使用少量可信的完整标注来建模VLM生成标签中的系统性噪声。在CUB200和FGVC-Aircraft上的实验表明,在相同的标注预算下,所提出的框架始终优于现有的主动学习方法。
Summary / 总结
The research aims to reduce annotation costs in active learning by utilizing vision-language models (VLMs) as weak annotators. The study finds that VLMs perform poorly on fine-grained labels but can provide accurate coarse-grained labels. An active learning framework is proposed that combines fine-grained human annotations with coarse-grained VLM-generated weak labels, and models the systematic noise in VLM-generated labels using trusted full labels. The framework consistently outperforms existing methods under the same annotation budget on CUB200 and FGVC-Aircraft datasets.
研究探讨了使用视觉-语言模型(VLMs)在主动学习中减少对人工标注的依赖。研究发现,VLMs在粗粒度标签上比细粒度标签更可靠。提出了一种结合细粒度人工标注和粗粒度VLM生成的弱标注的主动学习框架,并用可信的完整标注来建模VLM标注中的系统噪声。在CUB200和FGVC-Aircraft上的实验表明,该方法在相同的标注预算下优于现有方法。
Thinking in Text and Images: Interleaved Vision-Language Reasoning Traces for Long-Horizon Robot Manipulation
Authors: Jinkun Liu, Haohan Chi, Lingfeng Zhang, Yifan Xie, YuAn Wang, Long Chen, Hangjun Ye, Xiaoshuai Hao, Wenbo Ding
First: 2026-05-01T06:15:43+00:00 · Latest: 2026-05-01T06:15:43+00:00
Abstract
Long-horizon robotic manipulation requires plans that are both logically coherent and geometrically grounded. Existing Vision-Language-Action policies usually hide planning in latent states or expose only one modality: text-only chain-of-thought encodes causal order but misses spatial constraints, while visual prediction provides geometric cues but often remains local and semantically underconstrained. We introduce Interleaved Vision-Language Reasoning (IVLR), a policy framework built around a trace, an explicit intermediate representation that alternates textual subgoals with visual keyframes over the full task horizon. At test time, a single native multimodal transformer self-generates this global semantic-geometric trace from the initial observation and instruction, caches it, and conditions a closed-loop action decoder on the trace, original instruction, and current observation. Because standard robot datasets lack such traces, we construct pseudo-supervision by temporally segmenting demonstrations and captioning each stage with a vision-language model. Across simulated benchmarks for long-horizon manipulation and visual distribution shift, IVLR reaches 95.5% average success on LIBERO, including 92.4% on LIBERO-Long, and 59.4% overall success on SimplerEnv-WidowX. Ablations show that both modalities are necessary: without traces, LIBERO-Long success drops to 37.7%; text-only and vision-only traces reach 62.0% and 68.4%, while the full interleaved trace reaches 92.4%. Stress tests with execution perturbations and masked trace content show moderate degradation, suggesting that the trace can tolerate local corruption and moderate execution drift, but remains limited under stale or incorrect global plans.
中文标题/摘要
标题:思考文字与图像:交错视觉-语言推理轨迹在长时机器人操作中的应用
长时机器人操作需要既逻辑连贯又几何基础的计划。现有视觉-语言-动作策略通常将规划隐藏在潜在状态中或仅暴露一种模态:仅文本的因果链编码因果顺序但忽略了空间约束,而视觉预测提供几何线索但通常局限于局部且语义约束不足。我们引入交错视觉-语言推理(IVLR),这是一种围绕trace构建的策略框架,trace是一种显式的中间表示,交替包含文本子目标和视觉关键帧,覆盖整个任务时间范围。在测试时,单一原生多模态变换器从初始观察和指令自动生成这个全局语义-几何轨迹,缓存它,并在轨迹、原始指令和当前观察的基础上条件化闭环动作解码器。由于标准机器人数据集缺乏此类轨迹,我们通过时间分割演示并用视觉-语言模型为每个阶段添加描述来构建伪监督。在模拟长时操作基准和视觉分布偏移中,该方法在LIBERO上的平均成功率达到了95.5%,包括在LIBERO-Long上的92.4%,以及在SimplerEnv-WidowX上的59.4%的整体成功率。消融实验表明,两种模态都是必要的:没有轨迹时,LIBERO-Long的成功率降至37.7%;仅文本和仅视觉轨迹分别达到62.0%和68.4%,而完整的交错轨迹达到92.4%。执行扰动和遮蔽轨迹内容的压力测试显示适度退化,表明轨迹可以容忍局部损坏和适度执行漂移,但在过时或错误的全局计划下仍有限制。
Summary / 总结
This paper addresses the challenge of long-horizon robotic manipulation by introducing Interleaved Vision-Language Reasoning (IVLR), which uses a global semantic-geometric trace to combine textual and visual information. The trace alternates between textual subgoals and visual keyframes, allowing the policy to generate a plan covering the full task horizon. IVLR achieves 95.5% average success on LIBERO, with 92.4% on LIBERO-Long, and 59.4% overall success on SimplerEnv-WidowX. Ablation studies show that both modalities are essential: on LIBERO-Long, text-only and vision-only traces reach 62.0% and 68.4%, against 92.4% for the full interleaved trace and 37.7% without traces. The method degrades only moderately under execution perturbations and masked trace content, but remains limited when the global plan is stale or incorrect.
本文通过引入交替视觉-语言推理(IVLR)来解决长期机器人操作的挑战,该方法使用全局语义-几何轨迹结合文本和视觉信息。轨迹交替包含文本子目标和视觉关键帧,使策略能够生成全面的计划。IVLR在LIBERO上的平均成功率达到了95.5%,其中在LIBERO-Long上的成功率为92.4%,在SimplerEnv-WidowX上的总体成功率为59.4%。消融研究表明,两种模态都是必要的,单独使用文本或视觉轨迹的表现不如完整的交替轨迹。该方法还对执行扰动和轨迹内容遮蔽具有一定的鲁棒性。
InterChart: Benchmarking Visual Reasoning Across Decomposed and Distributed Chart Information
Authors: Anirudh Iyengar Kaniyar Narayana Iyengar, Srija Mukhopadhyay, Adnan Qidwai, Shubhankar Singh, Dan Roth, Vivek Gupta
Venue: Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics 2025, 2046-2067
First: 2025-08-11T05:19:23+00:00 · Latest: 2026-05-01T04:34:46+00:00
Comments: 22 pages, 8 figures, 14 tables. Accepted at IJCNLP-AACL 2025
Abstract
We introduce InterChart, a diagnostic benchmark that evaluates how well vision-language models (VLMs) reason across multiple related charts, a task central to real-world applications such as scientific reporting, financial analysis, and public policy dashboards. Unlike prior benchmarks focusing on isolated, visually uniform charts, InterChart challenges models with diverse question types ranging from entity inference and trend correlation to numerical estimation and abstract multi-step reasoning grounded in 2-3 thematically or structurally related charts. We organize the benchmark into three tiers of increasing difficulty: (1) factual reasoning over individual charts, (2) integrative analysis across synthetically aligned chart sets, and (3) semantic inference over visually complex, real-world chart pairs. Our evaluation of state-of-the-art open- and closed-source VLMs reveals consistent and steep accuracy declines as chart complexity increases. We find that models perform better when we decompose multi-entity charts into simpler visual units, underscoring their struggles with cross-chart integration. By exposing these systematic limitations, InterChart provides a rigorous framework for advancing multimodal reasoning in complex, multi-visual environments.
RTPrune: Reading-Twice Inspired Token Pruning for Efficient DeepSeek-OCR Inference
Authors: Ben Wan, Yan Feng, Zihan Tang, Weizhe Huang, Yuting Zeng, Jia Wang, Tongxuan Liu
First: 2026-05-01T04:30:16+00:00 · Latest: 2026-05-01T04:30:16+00:00
Comments: 19 pages, accepted by ICML 2026
Abstract
DeepSeek-OCR leverages visual-text compression to reduce long-text processing costs and accelerate inference, yet visual tokens remain prone to redundant textual and structural information. Moreover, current token pruning methods for conventional vision-language models (VLMs) fail to preserve textual fidelity due to improper compression mechanisms. By analyzing the decoding process of DeepSeek-OCR, we find a distinct two-stage reading trajectory: the model initially prioritizes the majority of high-norm tokens, then redistributes its attention to the remaining ones. Motivated by this insight, we propose RTPrune, a two-stage token pruning method tailored for DeepSeek-OCR. In the first stage, we prioritize high-norm visual tokens that capture salient textual and structural information. In the second stage, the remaining tokens are paired and merged based on optimal transport theory to achieve efficient feature aggregation. We further introduce a dynamic pruning ratio that adapts to token similarity and textual density for OCR tasks, enabling a better efficiency-accuracy trade-off. Extensive experiments demonstrate state-of-the-art performance, as evidenced by 99.47% accuracy and 1.23× faster prefill on OmniDocBench, achieved with 84.25% token retention when applied to DeepSeek-OCR-Large.
中文标题/摘要
标题:RTPrune: 二次阅读启发的标记剪枝方法以提高DeepSeek-OCR推理效率
DeepSeek-OCR 利用视觉文本压缩来减少长文本处理成本并加速推理,但视觉标记仍然容易包含冗余的文本和结构信息。此外,当前用于传统视觉语言模型(VLMs)的标记剪枝方法由于压缩机制不当,无法保留文本的准确性。通过分析 DeepSeek-OCR 的解码过程,我们发现了一个独特的两阶段阅读轨迹:模型首先优先处理高范数标记,然后重新分配注意力到剩余的标记。受此启发,我们提出了一种针对 DeepSeek-OCR 的两阶段标记剪枝方法 RTPrune。在第一阶段,我们优先处理能够捕捉重要文本和结构信息的高范数视觉标记。在第二阶段,剩余的标记基于最优传输理论进行配对和合并,以实现高效的特征聚合。我们还引入了一种动态剪枝比,根据标记相似性和文本密度适应 OCR 任务,从而实现更好的效率-准确度权衡。广泛的实验表明,RTPrune 达到了最先进的性能,例如在 OmniDocBench 上实现了 99.47% 的准确率和 1.23 倍的预填充加速,同时保留了 84.25% 的标记。
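The first pruning stage described above (prioritizing high-norm visual tokens) can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: tokens are plain lists of floats, the L2 norm is the ranking criterion, and `keep_ratio` is a fixed hypothetical parameter, whereas the paper describes a dynamic ratio. The optimal-transport merging of the remaining tokens is not shown.

```python
import math

def l2_norm(vec):
    """Euclidean norm of a token embedding."""
    return math.sqrt(sum(x * x for x in vec))

def select_high_norm(tokens, keep_ratio):
    """Split token indices into (kept, remaining) by descending L2 norm.

    keep_ratio is a hypothetical fixed fraction in (0, 1]; both index lists
    are returned in original sequence order.
    """
    n_keep = max(1, round(len(tokens) * keep_ratio))
    ranked = sorted(range(len(tokens)), key=lambda i: l2_norm(tokens[i]), reverse=True)
    kept = sorted(ranked[:n_keep])
    remaining = sorted(ranked[n_keep:])
    return kept, remaining

# Toy example: four 2-D tokens with norms 5, ~0.14, 1, and 10.
tokens = [[3.0, 4.0], [0.1, 0.1], [1.0, 0.0], [6.0, 8.0]]
kept, remaining = select_high_norm(tokens, keep_ratio=0.5)
```

In a two-stage scheme like the one in the abstract, the `remaining` indices would then be handed to the merging step rather than discarded outright.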
Odysseus: Scaling VLMs to 100+ Turn Decision-Making in Games via Reinforcement Learning
Authors: Chengshuai Shi, Wenzhe Li, Xinran Liang, Yizhou Lu, Wenjia Yang, Ruirong Feng, Seth Karten, Ziran Yang, Zihan Ding, Gabriel Sarch, Danqi Chen, Karthik Narasimhan, Chi Jin
First: 2026-05-01T02:05:56+00:00 · Latest: 2026-05-01T02:05:56+00:00
Abstract
Given the rapidly growing capabilities of vision-language models (VLMs), extending them to interactive decision-making tasks such as video games has emerged as a promising frontier. However, existing approaches either rely on large-scale supervised fine-tuning (SFT) on human trajectories or apply reinforcement learning (RL) only in relatively short-horizon settings (typically around 20-30 turns). In this work, we study RL-based training of VLMs for long-horizon decision-making in Super Mario Land, a visually grounded environment requiring 100+ turns of interaction with coordinated perception, reasoning, and action. We begin with a systematic investigation of key algorithmic components and propose an adapted variant of PPO with a lightweight turn-level critic, which substantially improves training stability and sample efficiency over critic-free methods such as GRPO and Reinforce++. We further show that pretrained VLMs provide strong action priors, significantly improving sample efficiency during RL training and reducing the need for manual design choices such as action engineering, compared to classical deep RL trained from scratch. Building on these insights, we introduce Odysseus, an open training framework for VLM agents, achieving substantial gains across multiple levels of the game and at least 3 times the average game progress of frontier models. Moreover, the trained models exhibit consistent improvements under both in-game and cross-game generalization settings, while maintaining general-domain capabilities. Overall, our results identify key ingredients for making RL stable and effective in long-horizon, multi-modal settings, and provide practical guidance for developing VLMs as embodied agents.
DiffMI: Breaking Face Recognition Privacy via Diffusion-Driven Training-Free Model Inversion
Authors: Hanrui Wang, Shuo Wang, Chun-Shien Lu, Isao Echizen
Venue: IEEE Transactions on Information Forensics and Security, vol. 21, 2026. 4275-4290
First: 2025-04-25T01:53:27+00:00 · Latest: 2026-05-01T02:03:47+00:00
Comments: IEEE Transactions on Information Forensics and Security
Abstract
Face recognition poses serious privacy risks due to its reliance on sensitive and immutable biometric data. While modern systems mitigate privacy risks by mapping facial images to embeddings (commonly regarded as privacy-preserving), model inversion attacks reveal that identity information can still be recovered, exposing critical vulnerabilities. However, existing attacks are often computationally expensive and lack generalization, especially those requiring target-specific training. Even training-free approaches suffer from limited identity controllability, hindering faithful reconstruction of nuanced or unseen identities. In this work, we propose DiffMI, the first diffusion-driven, training-free model inversion attack. DiffMI introduces a novel pipeline combining robust latent code initialization, a ranked adversarial refinement strategy, and a statistically grounded, confidence-aware optimization objective. DiffMI applies directly to unseen target identities and face recognition models, offering greater adaptability than training-dependent approaches while significantly reducing computational overhead. Our method achieves 84.42%--92.87% attack success rates against inversion-resilient systems and outperforms the best prior training-free GAN-based approach by 4.01%--9.82%. The implementation is available at https://github.com/azrealwang/DiffMI.
Summary / 总结
This paper exposes privacy risks in face recognition by proposing DiffMI, a diffusion-driven, training-free model inversion attack. DiffMI combines robust latent code initialization, a ranked adversarial refinement strategy, and a statistically grounded, confidence-aware optimization objective, achieving attack success rates of 84.42% to 92.87% against inversion-resilient systems. It outperforms the best prior training-free GAN-based approach by 4.01% to 9.82% and offers greater adaptability than training-dependent approaches with reduced computational overhead.
本文提出了一种扩散驱动、无需训练的模型反演攻击方法DiffMI,以揭示人脸识别中的隐私风险。DiffMI 结合了鲁棒的潜在代码初始化、排序对抗细化策略和基于统计的置信度感知优化目标,在抗反演系统上实现了84.42%到92.87%的攻击成功率。与此前最佳的无需训练的GAN方法相比,其性能提高了4.01%到9.82%,并且具有更强的适应性和更低的计算开销。
A Survey on Vision-Language-Action Models for Embodied AI
Authors: Yueen Ma, Zixing Song, Yuzheng Zhuang, Jianye Hao, Irwin King
Venue: IEEE Transactions on Neural Networks and Learning Systems (Early Access), 2026
First: 2024-05-23T01:43:54+00:00 · Latest: 2026-05-01T01:50:44+00:00
Comments: Project page: https://github.com/yueen-ma/Awesome-VLA
Abstract
Embodied AI is widely recognized as a cornerstone of artificial general intelligence (AGI) because it involves controlling embodied agents to perform tasks in the physical world. Building on the success of large language models (LLMs) and vision-language models (VLMs), a new category of multimodal models -- referred to as vision-language-action (VLA) models -- has emerged to address language-conditioned robotic tasks in embodied AI by leveraging their distinct ability to generate actions. The recent proliferation of VLAs necessitates a comprehensive survey to capture the rapidly evolving landscape. To this end, we present the first survey on VLAs for embodied AI. This work provides a detailed taxonomy of VLAs, organized into three major lines of research. The first line focuses on individual components of VLAs. The second line is dedicated to developing VLA-based control policies adept at predicting low-level actions. The third line comprises high-level task planners capable of decomposing long-horizon tasks into a sequence of subtasks, thereby guiding VLAs to follow more general user instructions. Furthermore, we provide an extensive summary of relevant resources, including datasets, simulators, and benchmarks. Finally, we discuss the challenges facing VLAs and outline promising future directions in embodied AI. A curated repository associated with this survey is available at: https://github.com/yueen-ma/Awesome-VLA.
Prompt-Induced Score Variance in Zero-Shot Binary Vision-Language Safety Classification
Authors: Charles Weng, Dingwen Li, Alexander Martin
First: 2026-05-01T01:06:30+00:00 · Latest: 2026-05-01T01:06:30+00:00
Comments: Preprint. 19 pages, 5 figures
Abstract
Single-prompt first-token probabilities from zero-shot vision-language model (VLM) safety classifiers are treated as decision scores, but we show they are unreliable under semantically equivalent prompt reformulation: even when the binary label is constrained to a fixed output position, equivalent prompts can induce materially different unsafe probabilities for the same sample. Across multimodal safety benchmarks and multiple VLM families, cross-prompt variance is strongly associated with prompt-level disagreement and higher error, making it a useful fragility diagnostic. A training-free mean ensemble improves NLL on all 14 dataset-model evaluation pairs and ECE on 12/14 relative to a train-selected single-prompt baseline, and wins more head-to-head NLL comparisons than labeled temperature scaling, Platt scaling, and isotonic regression applied to the same prompt. Ranking gains are consistent against the train-selected baseline on both AUROC and AUPRC, and against the full 15-prompt distribution remain consistent on AUPRC while softening on AUROC. Labeled calibration on top of the mean provides further gains when labels are available, identifying prompt averaging as a strong label-free first stage rather than a replacement for calibration. We frame this as a reliability stress test for zero-shot VLM first-token safety scores and recommend prompt-family evaluation with mean aggregation as a standard label-free reliability baseline.
中文标题/摘要
标题:零样本二元视觉-语言安全分类中的提示诱导评分变异
零样本视觉-语言模型(VLM)安全分类器的单提示首词概率被视为决策评分,但我们表明,在语义等效提示重新表述下它们是不可靠的:即使二元标签被限制在固定输出位置,等效提示仍可对同一样本诱导出实质性不同的不安全概率。在多模态安全基准和多个VLM家族中,跨提示变异强烈关联于提示级别分歧和更高错误率,使其成为有用的脆弱性诊断指标。相对于训练选择的单提示基线,无需训练的均值集成在所有14个数据集-模型评估对上改善了NLL,并在其中12个上改善了ECE,且在NLL的两两比较中,胜出次数多于应用于同一提示的有标签温度缩放、Platt缩放和保序回归。相对于训练选择的基线,排名收益在AUROC和AUPRC上均保持一致;相对于完整的15个提示分布,AUPRC上仍保持一致,而AUROC上有所减弱。在有标签时,在均值之上进行有标签校准可带来进一步收益,这表明提示平均是强有力的无标签第一阶段,而非校准的替代品。我们将其视为零样本VLM首词安全评分的可靠性压力测试,并建议使用均值聚合的提示家族评估作为标准的无标签可靠性基线。
Summary / 总结
The study investigates the reliability of zero-shot vision-language model safety classifiers by examining how single-prompt first-token probabilities vary under semantically equivalent prompt reformulations. Even when the binary label is constrained to a fixed output position, equivalent prompts can induce materially different unsafe probabilities for the same sample, and this cross-prompt variance is strongly associated with prompt-level disagreement and higher error. A training-free mean ensemble over prompts improves negative log-likelihood on all 14 dataset-model pairs and expected calibration error on 12 of 14 relative to a train-selected single-prompt baseline, supporting its use as a label-free reliability baseline for zero-shot VLM safety scores.
研究通过在语义等价的提示重新表述下检查单提示首词概率的变化,考察零样本视觉-语言模型安全分类器的可靠性。研究发现,即使二元标签被限制在固定输出位置,等价提示也会对同一样本导致显著不同的不安全概率,且这种跨提示变异与提示级别的分歧和更高的错误率密切相关。相对于训练选择的单提示基线,无需训练的提示均值集成在全部14个数据集-模型对上改善了负对数似然,并在其中12个上改善了预期校准误差,表明其可作为零样本VLM安全评分的无标签可靠性基线。
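The mean aggregation and the cross-prompt variance diagnostic described in this entry reduce to simple arithmetic over the per-prompt unsafe probabilities. The sketch below assumes the scores are averaged in probability space and that population variance is the spread measure; both choices, and the function name, are illustrative assumptions rather than details confirmed by the abstract.

```python
from statistics import mean, pvariance

def aggregate_prompt_scores(unsafe_probs):
    """Summarize P(unsafe) for one sample across semantically equivalent prompts.

    unsafe_probs: one probability per prompt (hypothetical input format).
    Returns the mean-ensemble score and the cross-prompt variance.
    """
    return {
        "mean_score": mean(unsafe_probs),
        "cross_prompt_variance": pvariance(unsafe_probs),
    }

# Two prompts that disagree strongly about the same sample.
fragile = aggregate_prompt_scores([0.2, 0.8])
# Three prompts that agree closely.
stable = aggregate_prompt_scores([0.70, 0.72, 0.71])
```

A sample like `fragile` would be flagged by a high variance even though its mean score looks unremarkable, which is the sense in which the variance acts as a fragility diagnostic.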
Online Self-Calibration Against Hallucination in Vision-Language Models
Authors: Minghui Chen, Chenxu Yang, Hengjie Zhu, Dayan Wu, Zheng Lin, Qingyi Si
Venue: IJCAI 2026
First: 2026-05-01T01:03:05+00:00 · Latest: 2026-05-01T01:03:05+00:00
Comments: IJCAI 2026
Abstract
Large Vision-Language Models (LVLMs) often suffer from hallucinations, generating descriptions that include visual details absent from the input image. Recent preference alignment methods typically rely on supervision distilled from stronger models such as GPT. However, this offline paradigm introduces a Supervision-Perception Mismatch: the student model is forced to align with fine-grained details beyond its perceptual capacity, learning to guess rather than to see. To obtain reliable self-supervision for online learning, we identify a Generative-Discriminative Gap within LVLMs, where models exhibit higher accuracy on discriminative verification than open-ended generation. Leveraging this capability, we propose Online Self-CAlibRation (OSCAR), a framework that integrates Monte Carlo Tree Search with a Dual-Granularity Reward Mechanism to construct preference data and iteratively refines the model via Direct Preference Optimization. Extensive experiments demonstrate that OSCAR achieves state-of-the-art performance on hallucination benchmarks while improving general multimodal capabilities.
中文标题/摘要
标题:视觉语言模型中对抗幻觉的在线自校准
大型视觉语言模型(LVLMs)经常遭受幻觉问题,生成包含输入图像中不存在的视觉细节的描述。最近的偏好对齐方法通常依赖于从更强的模型如GPT中提取的监督信息。然而,这种离线范式引入了监督感知不匹配:学生模型被迫对超出其感知能力的细粒度细节进行对齐,学习猜测而不是观察。为了获得可靠的自我监督以进行在线学习,我们识别了LVLMs中的生成-判别差距,其中模型在判别验证方面比开放生成方面表现出更高的准确性。利用这一能力,我们提出了 Online Self-CAlibRation (OSCAR) 框架,该框架结合了蒙特卡洛树搜索与双粒度奖励机制来构建偏好数据,并通过直接偏好优化迭代地改进模型。广泛的实验表明,OSCAR在幻觉基准测试中达到了最先进的性能,同时提高了多模态的一般能力。
Summary / 总结
The research aims to address the issue of hallucination in large vision-language models (LVLMs) by proposing OSCAR, a framework that leverages the generative-discriminative gap within LVLMs. OSCAR uses Monte Carlo Tree Search and a dual-granularity reward mechanism to iteratively refine the model through direct preference optimization. Experiments show that OSCAR outperforms existing methods on hallucination benchmarks and enhances general multimodal capabilities.
研究旨在通过提出一种新的在线自校准方法OSCAR来解决大型视觉-语言模型(LVLM)中的幻觉问题。OSCAR利用LVLM中的生成-判别差距,并结合蒙特卡洛树搜索和双粒度奖励机制来逐步优化模型。实验表明,OSCAR在幻觉基准测试中优于现有方法,并增强了通用的多模态能力。
Sparse VideoGen2: Accelerate Video Generation with Sparse Attention via Semantic-Aware Permutation
Authors: Shuo Yang, Haocheng Xi, Yilong Zhao, Muyang Li, Jintao Zhang, Han Cai, Yujun Lin, Xiuyu Li, Chenfeng Xu, Jianfei Chen, Song Han, Kurt Keutzer, Ion Stoica
First: 2025-05-24T21:30:29+00:00 · Latest: 2026-04-30T23:18:30+00:00
Abstract
Diffusion Transformers (DiTs) are essential for video generation but suffer from significant latency due to the quadratic complexity of attention. By computing only critical tokens, sparse attention reduces computational costs and offers a promising acceleration approach. However, we identify that existing methods fail to approach optimal generation quality under the same computation budget for two reasons: (1) Inaccurate critical token identification: current methods cluster tokens based on position rather than semantics, leading to imprecise aggregated representations. (2) Excessive computation waste: critical tokens are scattered among non-critical ones, leading to wasted computation on GPUs, which are optimized for processing contiguous tokens. In this paper, we propose SVG2, a training-free framework that maximizes identification accuracy and minimizes computation waste, achieving a Pareto frontier trade-off between generation quality and efficiency. The core of SVG2 is semantic-aware permutation, which clusters and reorders tokens based on semantic similarity using k-means. This approach ensures both a precise cluster representation, improving identification accuracy, and a densified layout of critical tokens, enabling efficient computation without padding. Additionally, SVG2 integrates top-p dynamic budget control and customized kernel implementations, achieving up to 2.30x and 1.89x speedup while maintaining a PSNR of up to 30 and 26 on HunyuanVideo and Wan 2.1, respectively. Our code is open-sourced at https://github.com/svg-project/Sparse-VideoGen.
Summary / 总结
The research aims to accelerate video generation by addressing the computational inefficiencies of diffusion transformers. The method, SVG2, employs semantic-aware permutation to cluster and reorder tokens based on semantic similarity, reducing wasted computation and improving accuracy. Experiments show SVG2 achieves up to 2.30x and 1.89x speedup while maintaining PSNR of up to 30 and 26 on HunyuanVideo and Wan 2.1, respectively.
研究旨在通过解决扩散变换器的计算效率问题来加速视频生成。方法SVG2采用基于语义相似性的排列,将令牌聚类并重新排序,减少不必要的计算并提高准确性。实验表明,SVG2在HunyuanVideo和Wan 2.1上分别实现了最高2.30倍和1.89倍的加速,同时保持最高30和26的PSNR。
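The reordering half of semantic-aware permutation can be shown with a small sketch. It assumes cluster ids have already been produced by some clustering step (k-means in the abstract; not implemented here) and only demonstrates how a stable sort on those ids makes each cluster contiguous and how the inverse permutation restores the original token order. All names are hypothetical, and the custom kernels and top-p budget control are out of scope.

```python
def cluster_permutation(cluster_ids):
    """Build a permutation that groups tokens of the same cluster together.

    Returns (perm, inverse): perm[new_pos] = old_pos, inverse[old_pos] = new_pos.
    Python's sort is stable, so tokens keep their relative order within a cluster.
    """
    perm = sorted(range(len(cluster_ids)), key=lambda i: cluster_ids[i])
    inverse = [0] * len(perm)
    for new_pos, old_pos in enumerate(perm):
        inverse[old_pos] = new_pos
    return perm, inverse

def apply_permutation(items, perm):
    """Reorder items so that position k holds items[perm[k]]."""
    return [items[i] for i in perm]

# Five tokens assigned to clusters 2, 0, 2, 1, 0 (assumed k-means output).
tokens = ["a", "b", "c", "d", "e"]
perm, inverse = cluster_permutation([2, 0, 2, 1, 0])
grouped = apply_permutation(tokens, perm)        # clusters are now contiguous
restored = apply_permutation(grouped, inverse)   # back to the original order
```

Keeping the inverse permutation around is what lets attention run on the densified layout while the output is still returned in the original token order.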
Unlocking Zero-Shot Geospatial Reasoning via Indirect Rewards
Authors: Chenhui Xu, Fuxun Yu, Michael J. Bianco, Jacob Kovarskiy, Raphael Tang, Qi Zhang, Zirui Xu, Will LeVine, Brandon Dubbs, Heming Liao, Cassandra Burgess, Suvam Bag, Jay Patravali, Rupanjali Kukal, Mikael Figueroa, Rishi Madhok, Nikolaos Karianakis, Jinjun Xiong
Venue: ICML 2026
First: 2025-09-29T21:34:55+00:00 · Latest: 2026-04-30T21:51:08+00:00
Comments: ICML 2026
Abstract
Training robust reasoning vision-language models (VLMs) in rare domains (such as geospatial) is fundamentally constrained by supervision scarcity. While raw geospatial imagery is abundant, the amount of task-direct supervision falls far behind that of common domains. In this work, we validate an important conclusion: indirect verifiable rewards, derived from seemingly unrelated metadata, are sufficient to induce sophisticated and generalizable geospatial reasoning across a wide range of downstream tasks (25+). We present Geo-R1 as one empirical instantiation of this paradigm. Rather than relying on limited task-specific annotations (i.e., direct rewards), Geo-R1 utilizes scalable, verifiable indirect proxy rewards based on cross-view alignment with metadata (geolocation information) to drive reinforcement learning at scale. Such indirect rewards successfully motivate the model to discover and internalize zero-shot geospatial reasoning across diverse tasks, achieving extraordinary zero-shot transfer on out-of-distribution benchmarks and even surpassing fully supervised specialists on certain benchmarks. These findings indicate that optimizing for indirect verifiable rewards may provide a scalable pathway to unlock generalized reasoning capabilities in rare domains with massive unlabeled data archives. Our code is available at: https://github.com/miniHuiHui/Geo-R1.
Summary / 总结
This work addresses the challenge of training vision-language models for geospatial reasoning by leveraging indirect verifiable rewards derived from metadata. The method, Geo-R1, uses cross-view alignment with geolocation information to drive reinforcement learning, enabling zero-shot geospatial reasoning across various tasks. The model demonstrates exceptional zero-shot transfer on out-of-distribution benchmarks and even outperforms fully supervised specialists on some benchmarks, indicating the potential of indirect rewards for scalable reasoning in rare domains with abundant unlabeled data.
该研究通过利用元数据中的间接可验证奖励解决了训练用于地理空间推理的视觉语言模型的挑战。方法Geo-R1 使用跨视图与地理位置信息的对齐来驱动强化学习,使模型能够在各种任务中实现零样本地理空间推理。该模型在分布外基准测试中表现出色的零样本转移,并在某些基准测试中甚至超过了完全监督的专业模型,表明间接奖励在稀有领域中利用大量未标记数据档案实现泛化推理能力的潜力。
Training-Free Time Series Classification via In-Context Reasoning with LLM Agents
Authors: Songyuan Sui, Zihang Xu, Xia Hu
First: 2025-10-07T14:07:43+00:00 · Latest: 2026-04-30T21:41:24+00:00
Comments: 8 pages main content, 12 pages total including appendix, 1 figure
Abstract
Time series classification (TSC) spans diverse application scenarios, yet labeled data are often scarce, making task-specific training costly and inflexible. Recent reasoning-oriented large language models (LLMs) show promise in understanding temporal patterns, but purely zero-shot usage remains suboptimal. We propose FETA, a multi-agent framework for training-free TSC via exemplar-based in-context reasoning. FETA decomposes a multivariate series into channel-wise subproblems, retrieves a few structurally similar labeled examples for each channel, and leverages a reasoning LLM to compare the query against these exemplars, producing channel-level labels with self-assessed confidences; a confidence-weighted aggregator then fuses all channel decisions. This design eliminates the need for pretraining or fine-tuning, improves efficiency by pruning irrelevant channels and controlling input length, and enhances interpretability through exemplar grounding and confidence estimation. On nine challenging UEA datasets, FETA achieves strong accuracy under a fully training-free setting, surpassing multiple trained baselines. These results demonstrate that a multi-agent in-context reasoning framework can transform LLMs into competitive, plug-and-play TSC solvers without any parameter training. The code is available at https://github.com/SongyuanSui/FETATSC.
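Chinese title/abstract (translated)
Title: Training-Free Time Series Classification via In-Context Reasoning
Time series classification (TSC) covers many application scenarios, but labeled data are often scarce, which makes task-specific training costly and inflexible. Recent reasoning-oriented large language models (LLMs) show potential for understanding temporal patterns, yet purely zero-shot use remains unsatisfactory. We propose FETA, a multi-agent framework for training-free TSC via exemplar-based in-context reasoning. FETA decomposes a multivariate series into channel-level subproblems, retrieves a few structurally similar labeled examples for each channel, and uses a reasoning LLM to compare the query against these examples, producing channel-level labels with self-assessed confidences; a confidence-weighted aggregator then fuses all channel decisions. This design removes the need for pretraining or fine-tuning, improves efficiency by pruning irrelevant channels and controlling input length, and enhances interpretability through exemplar grounding and confidence estimation. On nine challenging UEA datasets, FETA achieves strong accuracy in a fully training-free setting, surpassing multiple trained baselines. These results show that a multi-agent in-context reasoning framework can turn LLMs into plug-and-play TSC solvers without any parameter training. The code is available at https://github.com/SongyuanSui/FETATSC.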
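The FETA abstract describes per-channel exemplar retrieval followed by confidence-weighted fusion of channel-level labels. The following is a minimal sketch of those two steps only, not the authors' implementation. The function names (`znorm`, `nearest_exemplars`, `fuse_channel_votes`), the z-normalized Euclidean distance, and the sum-of-confidences voting rule are all assumptions made for illustration; the LLM comparison step is left out and its output is represented by plain (label, confidence) pairs.

```python
import math
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Exemplar = Tuple[Sequence[float], str]  # (single-channel series, label)


def znorm(series: Sequence[float]) -> List[float]:
    """Z-normalize one channel so that retrieval compares shape, not scale."""
    mean = sum(series) / len(series)
    std = math.sqrt(sum((v - mean) ** 2 for v in series) / len(series))
    if std == 0:
        return [0.0] * len(series)
    return [(v - mean) / std for v in series]


def nearest_exemplars(query: Sequence[float], pool: List[Exemplar], k: int = 3) -> List[Exemplar]:
    """Return the k labeled exemplars closest to the query channel.

    Assumption: all series have equal length and similarity is Euclidean
    distance after z-normalization. The paper may use a different measure.
    """
    q = znorm(query)
    return sorted(pool, key=lambda item: math.dist(q, znorm(item[0])))[:k]


def fuse_channel_votes(votes: List[Tuple[str, float]]) -> Tuple[str, Dict[str, float]]:
    """Fuse per-channel (label, confidence) pairs into one series-level label.

    Assumption: each label's score is the sum of the confidences of the
    channels that voted for it, with confidences clamped to [0, 1].
    """
    if not votes:
        raise ValueError("need at least one channel vote")
    scores: Dict[str, float] = defaultdict(float)
    for label, confidence in votes:
        scores[label] += min(max(confidence, 0.0), 1.0)
    winner = max(scores, key=scores.get)
    return winner, dict(scores)


if __name__ == "__main__":
    pool = [([0, 1, 2, 3], "up"), ([3, 2, 1, 0], "down"), ([0, 2, 4, 6], "up")]
    print([label for _, label in nearest_exemplars([1, 2, 3, 4], pool, k=2)])
    # In FETA a reasoning LLM would produce these pairs; here they are hard-coded.
    print(fuse_channel_votes([("walking", 0.9), ("running", 0.6), ("running", 0.5)]))
```

In this sketch a single high-confidence channel can be outvoted by several moderately confident channels, which is one plausible reading of "confidence-weighted aggregator"; other weightings (for example, normalizing per channel) would fit the description equally well.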
FreeRet: MLLMs as Training-Free Retrievers
Authors: Yuhan Zhu, Xiangyu Zeng, Chenting Wang, Xinhao Li, Chunxu Liu, Yicheng Xu, Ziang Yan, Yi Wang, Limin Wang
Venue: ICML 2026
First: 2025-09-29T11:28:42+00:00 · Latest: 2026-04-30T20:31:56+00:00
Comments: ICML 2026
Abstract
Multimodal large language models (MLLMs) are emerging as versatile foundations for mixed-modality retrieval. Yet, they often require heavy post-hoc training to convert them into contrastive encoders for retrieval. This work asks: Can off-the-shelf MLLMs serve as powerful retrievers without additional training? We present FreeRet, a plug-and-play framework that turns any MLLM into a two-stage retriever. FreeRet first derives semantically grounded embeddings directly from the model for fast candidate search, and then exploits its reasoning ability for precise reranking. The framework contributes three advances: bypassing lexical alignment layers to obtain semantically faithful embeddings, conditioning representation generation with explicit priors, and mitigating the framing effect in reranking via neutral choice framing. On the MMEB and MMEB-V2 benchmarks spanning 46 datasets, FreeRet substantially outperforms models trained on millions of pairs. Beyond benchmarks, FreeRet is model-agnostic and scales seamlessly across MLLM families and sizes, preserves their generative abilities, supports arbitrary modality combinations, and unifies retrieval, reranking, and generation into end-to-end RAG within a single model. Our findings demonstrate that pretrained MLLMs, when carefully harnessed, can serve as strong retrieval engines without training, closing a critical gap in their role as generalists.
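The two-stage pattern described in the FreeRet abstract (fast embedding search, then precise reranking) can be sketched generically as follows. This is an illustration of retrieve-then-rerank under stated assumptions, not FreeRet's code: `cosine`, `two_stage_retrieve`, and the `rerank_score` callable are hypothetical names, the embeddings are toy vectors rather than MLLM outputs, and the reranker is a stand-in for the model's reasoning-based judgment.

```python
import math
from typing import Callable, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def two_stage_retrieve(
    query_vec: Sequence[float],
    corpus: List[Tuple[str, Sequence[float]]],
    rerank_score: Callable[[str], float],
    k: int = 2,
) -> List[str]:
    """Stage 1: shortlist the k candidates by embedding similarity.
    Stage 2: reorder only that shortlist with a more expensive scorer.

    Assumption: rerank_score maps a candidate id to a relevance score; in a
    FreeRet-like system this would come from prompting the MLLM.
    """
    shortlist = sorted(corpus, key=lambda item: cosine(query_vec, item[1]), reverse=True)[:k]
    return sorted((doc_id for doc_id, _ in shortlist), key=rerank_score, reverse=True)


if __name__ == "__main__":
    corpus = [("a", [1.0, 0.0]), ("b", [0.9, 0.1]), ("c", [0.0, 1.0])]
    toy_scores = {"a": 0.2, "b": 0.8, "c": 1.0}  # hypothetical reranker outputs
    print(two_stage_retrieve([1.0, 0.0], corpus, toy_scores.get, k=2))
```

Note that the reranker sees only the shortlist, so candidate "c" cannot be returned here even though its toy rerank score is highest; the quality of the first-stage embeddings bounds what reranking can recover.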