arXiv Paper Digest

Snapshot: 20260505_0436

Persistent Visual Memory: Sustaining Perception for Deep Generation in LVLMs

Authors: Siyuan Huang, Xiaoye Qu, Yafu Li, Tong Zhu, Zefeng He, Muxin Fu, Daizong Liu, Wei-Long Zheng, Yu Cheng

First: 2026-05-01T17:54:37+00:00 · Latest: 2026-05-01T17:54:37+00:00

Abstract

While autoregressive Large Vision-Language Models (LVLMs) demonstrate remarkable proficiency in multimodal tasks, they face a "Visual Signal Dilution" phenomenon, where the accumulation of textual history expands the attention partition function, causing visual attention to decay inversely with generated sequence length. To counteract this, we propose Persistent Visual Memory (PVM), a lightweight learnable module designed to ensure sustained, on-demand visual perception. Integrated as a parallel branch alongside the Feed-Forward Network (FFN) in LVLMs, PVM establishes a distance-agnostic retrieval pathway that directly provides visual embeddings for precise visual perception, thereby structurally mitigating the signal suppression inherent to deep generation. Extensive experiments on Qwen3-VL models demonstrate that PVM brings notable improvements with negligible parameter overhead, delivering consistent average accuracy gains across both 4B and 8B scales, particularly in complex reasoning tasks that demand persistent visual perception. Furthermore, in-depth analysis reveals that PVM can resist length-induced signal decay and accelerate internal prediction convergence.
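The "Visual Signal Dilution" argument in this abstract can be illustrated with a small numerical sketch. It is not the authors' analysis: it assumes a single softmax attention head in which every visual token shares one logit and every text token shares another, so the attention mass on the image reduces to a ratio whose denominator (the partition function) grows with each generated text token. The function name and the token counts are hypothetical.

```python
import math


def visual_attention_share(num_visual, num_text, visual_logit=0.0, text_logit=0.0):
    """Fraction of softmax attention mass that lands on visual tokens.

    Simplifying assumption (not from the paper): all visual tokens share one
    logit and all text tokens share another, so the softmax collapses to a
    ratio of two sums.
    """
    visual_mass = num_visual * math.exp(visual_logit)
    text_mass = num_text * math.exp(text_logit)
    # The denominator is the attention partition function; it grows as
    # textual history accumulates while the numerator stays fixed.
    return visual_mass / (visual_mass + text_mass)


if __name__ == "__main__":
    # 256 visual tokens is an arbitrary illustrative image budget.
    for generated in (0, 256, 1024, 4096):
        share = visual_attention_share(num_visual=256, num_text=generated)
        print(f"{generated:>5} text tokens -> visual share {share:.3f}")
```

With equal logits the share is simply V / (V + T), so it falls from 1.0 with no text to 0.5, 0.2, and roughly 0.059 as the text grows to 256, 1024, and 4096 tokens. This is the inverse decay with sequence length that PVM is designed to sidestep with a retrieval path that does not depend on distance.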

ScreenParse: Moving Beyond Sparse Grounding with Complete Screen Parsing Supervision

Authors: A. Said Gurbuz, Sunghwan Hong, Ahmed Nassar, Marc Pollefeys, Peter Staar

Venue: ICML 2026

First: 2026-02-15T19:00:02+00:00 · Latest: 2026-05-01T17:32:06+00:00

Comments: Accepted at ICML 2026. 28 pages, 15 figures

Abstract

Modern computer-use agents (CUA) must perceive a screen as a structured state (what elements are visible, where they are, and what text they contain) before they can reliably ground instructions and act. Yet most available grounding datasets provide sparse supervision, with insufficient and low-diversity labels that annotate only a small subset of task-relevant elements per screen, which limits both coverage and generalization; moreover, practical deployment requires efficiency to enable low-latency, on-device use. We introduce ScreenParse, a large-scale dataset for complete screen parsing, with dense annotations of all visible UI elements (boxes, 55-class types, and text) across 771K web screenshots (21M elements). ScreenParse is generated by Webshot, an automated, scalable pipeline that renders diverse URLs, extracts annotations, and applies VLM-based relabeling and quality filtering. Using ScreenParse, we train ScreenVLM, a compact, 316M-parameter vision-language model (VLM) that decodes a compact ScreenTag markup representation and is trained with a structure-aware loss that upweights structure-critical tokens. ScreenVLM substantially outperforms much larger foundation VLMs on dense parsing (e.g., 0.592 vs. 0.294 PageIoU on ScreenParse) and shows strong transfer to public benchmarks. Moreover, finetuning foundation VLMs on ScreenParse consistently improves their grounding performance, suggesting that dense screen supervision provides transferable structural priors for UI understanding. Project page: https://saidgurbuz.github.io/screenparse/.
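The abstract does not spell out the form of the structure-aware loss, so the following is only a minimal sketch of one common way to upweight selected tokens: a weighted negative log-likelihood. The function name, the boolean mask marking which tokens count as structure-critical, and the default weight of 2.0 are all assumptions made for illustration, not details from the paper.

```python
import math


def structure_aware_nll(token_probs, is_structure_token, structure_weight=2.0):
    """Weighted mean negative log-likelihood over one target sequence.

    token_probs: probability the model assigned to each ground-truth token.
    is_structure_token: True where the token is treated as structure-critical
        (for example a markup tag); which tokens qualify is an assumption here.
    structure_weight: hypothetical upweighting factor; 1.0 recovers the
        ordinary unweighted cross-entropy.
    """
    if len(token_probs) != len(is_structure_token):
        raise ValueError("token_probs and is_structure_token must have equal length")
    if not token_probs:
        raise ValueError("sequence must contain at least one token")
    weights = [structure_weight if flag else 1.0 for flag in is_structure_token]
    losses = [-math.log(p) for p in token_probs]
    # Normalise by the total weight so the loss scale stays comparable
    # across different values of structure_weight.
    return sum(w * l for w, l in zip(weights, losses)) / sum(weights)


if __name__ == "__main__":
    probs = [0.25, 0.5]      # model confidence on a tag token and a text token
    mask = [True, False]     # first token is structure-critical
    print(structure_aware_nll(probs, mask, structure_weight=1.0))
    print(structure_aware_nll(probs, mask, structure_weight=3.0))
```

Raising the weight makes a poorly predicted structure token dominate the average, which is the intended effect of upweighting. Dividing by the sum of weights rather than the token count is a design choice made in this sketch.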

Make Your LVLM KV Cache More Lightweight

Authors: Xihao Chen, Yangyang Guo, Roger Zimmermann

First: 2026-05-01T17:11:39+00:00 · Latest: 2026-05-01T17:11:39+00:00

Comments: Accepted to Transactions on Machine Learning Research (TMLR), 2026