arXiv Paper Digest

Snapshot: 20260411_0356

When Numbers Speak: Aligning Textual Numerals and Visual Instances in Text-to-Video Diffusion Models

Authors: Zhengyang Sun, Yu Chen, Xin Zhou, Xiaofan Li, Xiwu Chen, Dingkang Liang, Xiang Bai

Venue: CVPR 2026

First: 2026-04-09T17:59:57+00:00 · Latest: 2026-04-09T17:59:57+00:00

Comments: Accepted by CVPR 2026. Project page: https://h-embodvis.github.io/NUMINA


Abstract

Text-to-video diffusion models have enabled open-ended video synthesis, but often struggle with generating the correct number of objects specified in a prompt. We introduce NUMINA, a training-free identify-then-guide framework for improved numerical alignment. NUMINA identifies prompt-layout inconsistencies by selecting discriminative self- and cross-attention heads to derive a countable latent layout. It then refines this layout conservatively and modulates cross-attention to guide regeneration. On the introduced CountBench, NUMINA improves counting accuracy by up to 7.4% on Wan2.1-1.3B, and by 4.9% and 5.5% on 5B and 14B models, respectively. Furthermore, CLIP alignment is improved while maintaining temporal consistency. These results demonstrate that structural guidance complements seed search and prompt enhancement, offering a practical path toward count-accurate text-to-video diffusion. The code is available at https://github.com/H-EmbodVis/NUMINA.
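To make the idea of a "countable layout" concrete, here is a minimal sketch of a prompt-layout consistency check. It is not NUMINA's procedure: the abstract does not say how the layout is derived from attention heads, so the binary grid below is a hypothetical stand-in for a thresholded attention map, and the helper names `count_instances` and `count_mismatch` are assumptions made for illustration.

```python
from typing import List


def count_instances(mask: List[List[int]]) -> int:
    """Count 4-connected foreground regions in a binary grid."""
    rows = len(mask)
    cols = len(mask[0]) if rows else 0
    seen = [[False] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if mask[r][c] and not seen[r][c]:
                count += 1
                seen[r][c] = True
                stack = [(r, c)]
                # Flood-fill the region so it is counted only once.
                while stack:
                    y, x = stack.pop()
                    for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                        if 0 <= ny < rows and 0 <= nx < cols and mask[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            stack.append((ny, nx))
    return count


def count_mismatch(mask: List[List[int]], requested: int) -> int:
    """Detected instances minus the count requested in the prompt."""
    return count_instances(mask) - requested


# Hypothetical layout with two blobs, for a prompt that asks for three objects.
layout = [
    [1, 1, 0, 0, 0],
    [1, 1, 0, 0, 1],
    [0, 0, 0, 0, 1],
]
print(count_mismatch(layout, requested=3))  # -1: one instance is missing
```

A negative result means the layout has too few instances and a positive one too many. Under this sketch, a nonzero mismatch is the signal that would trigger a guided regeneration.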

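Chinese Title / Abstract (English translation)

Title: When Numbers Speak: Aligning Textual Numerals and Visual Instances in Text-to-Video Diffusion Models

Text-to-video diffusion models have made open-ended video synthesis possible, but they often fail to generate the number of objects specified in the prompt. We propose NUMINA, a training-free identify-then-guide framework that improves numerical alignment. NUMINA identifies prompt-layout inconsistencies by selecting discriminative self-attention and cross-attention heads, from which it derives a countable latent layout. It then conservatively refines this layout and modulates cross-attention to guide regeneration. On the newly introduced CountBench, NUMINA improves counting accuracy by up to 7.4% on Wan2.1-1.3B, and by 4.9% and 5.5% on the 5B and 14B models, respectively. CLIP alignment also improves while temporal consistency is maintained. These results show that structural guidance complements seed search and prompt enhancement, offering a practical path toward count-accurate text-to-video diffusion. The code is available at https://github.com/H-EmbodVis/NUMINA.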

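Summary

The research aims to improve numerical alignment in text-to-video synthesis, where models often generate the wrong number of objects. NUMINA, a training-free identify-then-guide framework, derives a countable latent layout from selected attention heads, refines it, and modulates cross-attention to guide regeneration. On the newly introduced CountBench it improves counting accuracy by up to 7.4% on Wan2.1-1.3B, while also improving CLIP alignment and maintaining temporal consistency. The results indicate that structural guidance complements seed search and prompt enhancement.

The study tackles the challenge of generating the specified number of objects in order to improve numerical alignment in text-to-video synthesis. The proposed NUMINA framework identifies and guides the generation of counted objects without retraining, raising counting accuracy by up to 7.4% on Wan2.1-1.3B. It also improves CLIP alignment and maintains temporal consistency, showing that structural guidance can complement existing methods for more accurate text-to-video synthesis.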

ParseBench: A Document Parsing Benchmark for AI Agents

Authors: Boyang Zhang, Sebastián G. Acosta, Preston Carlson, Sacha Bron, Pierre-Loïc Doulcet, Simon Suo

First: 2026-04-09T17:59:36+00:00 · Latest: 2026-04-09T17:59:36+00:00


Abstract

AI agents are changing the requirements for document parsing. What matters is semantic correctness: parsed output must preserve the structure and meaning needed for autonomous decisions, including correct table structure, precise chart data, semantically meaningful formatting, and visual grounding. Existing benchmarks do not fully capture this setting for enterprise automation, relying on narrow document distributions and text-similarity metrics that miss agent-critical failures. We introduce ParseBench, a benchmark of about 2,000 human-verified pages from enterprise documents spanning insurance, finance, and government, organized around five capability dimensions: tables, charts, content faithfulness, semantic formatting, and visual grounding. Across 14 methods spanning vision-language models, specialized document parsers, and LlamaParse, the benchmark reveals a fragmented capability landscape: no method is consistently strong across all five dimensions. LlamaParse Agentic achieves the highest overall score, and the benchmark highlights the remaining capability gaps across current systems. Dataset and evaluation code are available on HuggingFace (https://huggingface.co/datasets/llamaindex/ParseBench) and GitHub (https://github.com/run-llama/ParseBench).

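Chinese Title / Abstract (English translation)

Title: ParseBench: A Document Parsing Benchmark for AI Agents

AI agents are changing what document parsing must deliver. The key requirement is semantic correctness: parsed output must preserve the structure and meaning that autonomous decisions depend on, including correct table structure, precise chart data, semantically meaningful formatting, and visual grounding. Existing benchmarks do not fully capture this setting for enterprise automation; they rely on narrow document distributions and text-similarity metrics that miss agent-critical failures. We introduce ParseBench, a benchmark of about 2,000 human-verified pages from enterprise documents in insurance, finance, and government, organized around five capability dimensions: tables, charts, content faithfulness, semantic formatting, and visual grounding. Across 14 methods covering vision-language models, specialized document parsers, and LlamaParse, the benchmark reveals a fragmented capability landscape: no method is consistently strong on all five dimensions. LlamaParse Agentic achieves the highest overall score, and the benchmark highlights the capability gaps that remain in current systems. The dataset and evaluation code are available on HuggingFace (https://huggingface.co/datasets/llamaindex/ParseBench) and GitHub (https://github.com/run-llama/ParseBench).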

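Summary

ParseBench is introduced to evaluate the semantic correctness of document parsing for AI agents, focusing on table structure, chart data, content faithfulness, semantic formatting, and visual grounding. It consists of about 2,000 human-verified pages from enterprise documents in insurance, finance, and government. Across 14 methods, no single approach excels in all five dimensions, with LlamaParse Agentic achieving the highest overall score. The benchmark highlights the remaining capability gaps in current systems.

The study addresses the need for semantic correctness when AI agents parse documents, which is essential for autonomous decisions. It introduces the ParseBench benchmark, which contains about 2,000 human-verified pages from enterprise documents in insurance, finance, and government. The benchmark evaluates five dimensions: tables, charts, content faithfulness, semantic formatting, and visual grounding. Across 14 methods, including vision-language models and specialized parsers, no method performs well on every dimension, and LlamaParse Agentic obtains the highest overall score. The benchmark exposes the capability gaps in current systems.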

Meta-learning In-Context Enables Training-Free Cross Subject Brain Decoding

Authors: Mu Nan, Muquan Yu, Weijian Mai, Jacob S. Prince, Hossein Adeli, Rui Zhang, Jiahang Cao, Benjamin Becker, John A. Pyles, Margaret M. Henderson, Chunfeng Song, Nikolaus Kriegeskorte, Michael J. Tarr, Xiaoqing Hu, Andrew F. Luo

Venue: CVPR 2026

First: 2026-04-09T17:59:32+00:00 · Latest: 2026-04-09T17:59:32+00:00

Comments: Accepted to CVPR 2026, website: https://github.com/ezacngm/brainCodec


Abstract

Visual decoding from brain signals is a key challenge at the intersection of computer vision and neuroscience, requiring methods that bridge neural representations and computational models of vision. A field-wide goal is to achieve generalizable, cross-subject models. A major obstacle towards this goal is the substantial variability in neural representations across individuals, which has so far required training bespoke models or fine-tuning separately for each subject. To address this challenge, we introduce a meta-optimized approach for semantic visual decoding from fMRI that generalizes to novel subjects without any fine-tuning. By simply conditioning on a small set of image-brain activation examples from the new individual, our model rapidly infers their unique neural encoding patterns to facilitate robust and efficient visual decoding. Our approach is explicitly optimized for in-context learning of the new subject's encoding model and performs decoding by hierarchical inference, inverting the encoder. First, for multiple brain regions, we estimate the per-voxel visual response encoder parameters by constructing a context over multiple stimuli and responses. Second, we construct a context consisting of encoder parameters and response values over multiple voxels to perform aggregated functional inversion. We demonstrate strong cross-subject and cross-scanner generalization across diverse visual backbones without retraining or fine-tuning. Moreover, our approach requires neither anatomical alignment nor stimulus overlap. This work is a critical step towards a generalizable foundation model for non-invasive brain decoding.

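Chinese Title / Abstract (English translation)

Title: Meta-learning In-Context Enables Training-Free Cross Subject Brain Decoding

Visual decoding from brain signals is a key challenge at the intersection of computer vision and neuroscience, and it requires methods that connect neural representations with computational models of vision. A field-wide goal is to build generalizable, cross-subject models. The main obstacle is the large variability in neural representations across individuals, which has so far required training bespoke models or fine-tuning separately for each subject. To address this, we propose a meta-optimized approach for semantic visual decoding from fMRI that generalizes to new subjects without any fine-tuning. By conditioning only on a small set of image-brain activation examples from the new individual, our model quickly infers that person's unique neural encoding patterns, enabling robust and efficient visual decoding. Our approach is explicitly optimized for in-context learning of the new subject's encoding model and decodes through hierarchical inference, inverting the encoder. First, for multiple brain regions, we estimate the per-voxel visual response encoder parameters by building a context over multiple stimuli and responses. Second, we build a context of encoder parameters and response values over multiple voxels to perform aggregated functional inversion. We demonstrate strong cross-subject and cross-scanner generalization across diverse visual backbones without retraining or fine-tuning. In addition, our approach requires neither anatomical alignment nor stimulus overlap. This work is an important step toward a generalizable foundation model for non-invasive brain decoding.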

Summary

The research aims to develop a method for cross-subject visual decoding from fMRI, addressing the variability in neural representations across individuals. The approach uses a meta-optimized model that generalizes to new subjects without fine-tuning by conditioning on a small set of image-brain activation examples. Key findings include strong cross-subject and cross-scanner generalization, and the ability to perform decoding without anatomical alignment or stimulus overlap, marking a step towards a generalizable foundation model for brain decoding.

The study aims to develop a brain decoding model that generalizes across subjects without retraining or fine-tuning. The method uses a small number of image-brain activation examples to infer each new subject's unique neural encoding patterns. Key findings include strong cross-subject and cross-scanner generalization across a range of visual backbones, with no need for anatomical alignment or stimulus overlap.

What They Saw, Not Just Where They Looked: Semantic Scanpath Similarity via VLMs and NLP metric

Authors: Mohamed Amine Kerkouri, Marouane Tliba, Bin Wang, Aladine Chetouani, Ulas Bagci, Alessandro Bruno

First: 2026-04-09T17:36:22+00:00 · Latest: 2026-04-09T17:36:22+00:00

Comments: Accepted at ETRA 2026 GenAI workshop


Abstract

Scanpath similarity metrics are central to eye-movement research, yet existing methods predominantly evaluate spatial and temporal alignment while neglecting semantic equivalence between attended image regions. We present a semantic scanpath similarity framework that integrates vision-language models (VLMs) into eye-tracking analysis. Each fixation is encoded under controlled visual context (patch-based and marker-based strategies) and transformed into concise textual descriptions, which are aggregated into scanpath-level representations. Semantic similarity is then computed using embedding-based and lexical NLP metrics and compared against established spatial measures, including MultiMatch and DTW. Experiments on free-viewing eye-tracking data demonstrate that semantic similarity captures partially independent variance from geometric alignment, revealing cases of high content agreement despite spatial divergence. We further analyze the impact of contextual encoding on description fidelity and metric stability. Our findings suggest that multimodal foundation models enable interpretable, content-aware extensions of classical scanpath analysis, providing a complementary dimension for gaze research within the ETRA community.
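The contrast between geometric and semantic agreement can be sketched in a few lines. This is an illustration under assumptions, not the paper's pipeline: the fixation descriptions are hand-written stand-ins for VLM output, token-level Jaccard overlap stands in for the unnamed lexical metrics, and a plain DTW over fixation coordinates stands in for the spatial measures (MultiMatch is not implemented here).

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def dtw_distance(a: List[Point], b: List[Point]) -> float:
    """Dynamic time warping cost between two fixation sequences (Euclidean step cost)."""
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = math.dist(a[i - 1], b[j - 1])
            cost[i][j] = step + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def jaccard_similarity(desc_a: List[str], desc_b: List[str]) -> float:
    """Token-set overlap between two scanpath-level descriptions."""
    tokens_a = {w for d in desc_a for w in d.lower().split()}
    tokens_b = {w for d in desc_b for w in d.lower().split()}
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 1.0


# Two hypothetical viewers attend to the same objects at different screen positions.
path_a = [(100.0, 100.0), (400.0, 120.0)]
path_b = [(600.0, 500.0), (200.0, 480.0)]
desc_a = ["red car", "stop sign"]
desc_b = ["stop sign", "red car"]

print("spatial DTW cost:", dtw_distance(path_a, path_b))
print("semantic overlap:", jaccard_similarity(desc_a, desc_b))
```

Here the DTW cost is large while the lexical overlap is perfect, which is the "high content agreement despite spatial divergence" case the abstract describes. An embedding-based metric would replace `jaccard_similarity` with a similarity between sentence vectors.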

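Chinese Title / Abstract (English translation)

Title: What They Saw, Not Just Where They Looked: Semantic Scanpath Similarity via VLMs and NLP metric

Scanpath similarity metrics are central to eye-movement research, but existing methods mainly evaluate spatial and temporal alignment and neglect the semantic equivalence of the attended image regions. We propose a semantic scanpath similarity framework that integrates vision-language models (VLMs) into eye-tracking analysis. Each fixation is encoded under a controlled visual context (patch-based and marker-based strategies) and converted into a concise textual description; these descriptions are then aggregated into scanpath-level representations. Semantic similarity is computed with embedding-based and lexical NLP metrics and compared against established spatial measures, including MultiMatch and DTW. Experiments on free-viewing eye-tracking data show that semantic similarity captures variance that is partially independent of geometric alignment, revealing cases where content agreement is high despite spatial divergence. We further analyze how contextual encoding affects description fidelity and metric stability. Our findings suggest that multimodal foundation models enable interpretable, content-aware extensions of classical scanpath analysis, providing a complementary dimension for gaze research within the ETRA community.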

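Summary

The research aims to improve scanpath similarity metrics by incorporating semantic analysis, which is often overlooked in favor of spatial and temporal alignment. The study uses vision-language models to encode fixations under controlled visual context and transform them into textual descriptions, then computes semantic similarity using embedding-based and lexical NLP metrics. Experiments show that semantic similarity captures variance that is partially independent of traditional spatial measures, highlighting cases where content agreement exists despite spatial divergence. The findings indicate that multimodal foundation models can enhance classical scanpath analysis, offering a complementary perspective for gaze research.

The study targets a limitation of existing scanpath similarity metrics, which focus on spatial and temporal alignment and neglect semantic equivalence. It proposes a framework that uses vision-language models to encode fixations under controlled visual context, converts them into textual descriptions, and aggregates the descriptions into scanpath-level representations. Experiments show that semantic similarity captures additional variance that spatial measures do not, highlighting cases where content agrees even though the scanpaths diverge spatially. The study also evaluates how different contextual encoding strategies affect description fidelity and metric stability, showing that multimodal foundation models extend classical scanpath analysis and add a new dimension to gaze research.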

LAMP: Lift Image-Editing as General 3D Priors for Open-world Manipulation

Authors: Jingjing Wang, Zhengdong Hong, Chong Bao, Yuke Zhu, Junhan Sun, Guofeng Zhang

First: 2026-04-09T17:14:00+00:00 · Latest: 2026-04-09T17:14:00+00:00


Abstract

Human-like generalization in the open world remains a fundamental challenge for robotic manipulation. Existing learning-based methods, including reinforcement learning, imitation learning, and vision-language-action models (VLAs), often struggle with novel tasks and unseen environments. Another promising direction is to explore generalizable representations that capture fine-grained spatial and geometric relations for open-world manipulation. While large language models (LLMs) and vision-language models (VLMs) provide strong semantic reasoning based on language or annotated 2D representations, their limited 3D awareness restricts their applicability to fine-grained manipulation. To address this, we propose LAMP, which lifts image-editing as 3D priors to extract inter-object 3D transformations as continuous, geometry-aware representations. Our key insight is that image-editing inherently encodes rich 2D spatial cues, and lifting these implicit cues into 3D transformations provides fine-grained and accurate guidance for open-world manipulation. Extensive experiments demonstrate that LAMP delivers precise 3D transformations and achieves strong zero-shot generalization in open-world manipulation. Project page: https://zju3dv.github.io/LAMP/.

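Chinese Title / Abstract (English translation)

Title: LAMP: Lift Image-Editing as General 3D Priors for Open-world Manipulation

Achieving human-like generalization in the open world remains a fundamental challenge for robotic manipulation. Existing learning-based methods, including reinforcement learning, imitation learning, and vision-language-action models (VLAs), often have difficulty with new tasks and unseen environments. Another promising direction is to explore generalizable representations that capture the fine-grained spatial and geometric relations involved in open-world manipulation. Although large language models (LLMs) and vision-language models (VLMs) offer strong semantic reasoning based on language or annotated 2D representations, their limited 3D awareness restricts their use in fine-grained manipulation. To address this, we propose LAMP, which lifts image editing into 3D priors in order to extract inter-object 3D transformations as continuous, geometry-aware representations. Our core insight is that image editing inherently encodes rich 2D spatial cues, and lifting these implicit cues into 3D transformations provides fine-grained and accurate guidance for open-world manipulation. Extensive experiments show that LAMP delivers precise 3D transformations and achieves strong zero-shot generalization in open-world manipulation. Project page: https://zju3dv.github.io/LAMP/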

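Summary

The paper addresses the challenge of human-like generalization in open-world robotic manipulation. It proposes LAMP, which uses 3D priors derived from image-editing to extract continuous, geometry-aware representations of inter-object 3D transformations. Experiments show that LAMP provides precise 3D transformations and excels in zero-shot generalization for open-world manipulation tasks.

The paper addresses the difficulty of human-level generalization in open-world robotic manipulation and proposes LAMP, which uses image editing as a 3D prior to extract continuous, geometry-aware representations of inter-object 3D transformations. The method exploits the rich 2D spatial cues inherent in image editing and lifts them into 3D transformations, giving precise guidance for manipulation tasks. Experiments show that LAMP achieves precise 3D transformations and strong zero-shot generalization in open-world manipulation scenarios.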

OVS-DINO: Open-Vocabulary Segmentation via Structure-Aligned SAM-DINO with Language Guidance

Authors: Haoxi Zeng, Qiankun Liu, Yi Bin, Haiyue Zhang, Yujuan Ding, Guoqing Wang, Deqiang Ouyang, Heng Tao Shen

First: 2026-04-09T16:57:11+00:00 · Latest: 2026-04-09T16:57:11+00:00

Comments: 14 pages, 12 figures, 5 tables


Abstract

Open-Vocabulary Segmentation (OVS) aims to segment image regions beyond predefined category sets by leveraging semantic descriptions. While CLIP based approaches excel in semantic generalization, they frequently lack the fine-grained spatial awareness required for dense prediction. Recent efforts have incorporated Vision Foundation Models (VFMs) like DINO to alleviate these limitations. However, these methods still struggle with the precise edge perception necessary for high fidelity segmentation. In this paper, we analyze internal representations of DINO and discover that its inherent boundary awareness is not absent but rather undergoes progressive attenuation as features transition into deeper transformer blocks. To address this, we propose OVS-DINO, a novel framework that revitalizes latent edge-sensitivity of DINO through structural alignment with the Segment Anything Model (SAM). Specifically, we introduce a Structure-Aware Encoder (SAE) and a Structure-Modulated Decoder (SMD) to effectively activate boundary features of DINO using SAM's structural priors, complemented by a supervision strategy utilizing SAM generated pseudo-masks. Extensive experiments demonstrate that our method achieves state-of-the-art performance across multiple weakly-supervised OVS benchmarks, improving the average score by 2.1% (from 44.8% to 46.9%). Notably, our approach significantly enhances segmentation accuracy in complex, cluttered scenarios, with a gain of 6.3% on Cityscapes (from 36.6% to 42.9%).

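Chinese Title / Abstract (English translation)

Title: OVS-DINO: Open-Vocabulary Segmentation via Structure-Aligned SAM-DINO with Language Guidance

Open-Vocabulary Segmentation (OVS) aims to segment image regions beyond predefined category sets by using semantic descriptions. CLIP-based approaches generalize well semantically, but they often lack the fine-grained spatial awareness that dense prediction requires. Recent work has brought in Vision Foundation Models (VFMs) such as DINO to ease these limitations. However, these methods still struggle with the precise edge perception needed for high-fidelity segmentation. In this paper, we analyze DINO's internal representations and find that its inherent boundary awareness is not absent; rather, it gradually attenuates as features pass into deeper transformer blocks. To address this, we propose OVS-DINO, a novel framework that reactivates DINO's latent edge sensitivity through structural alignment with the Segment Anything Model (SAM). Specifically, we introduce a Structure-Aware Encoder (SAE) and a Structure-Modulated Decoder (SMD) that use SAM's structural priors to activate DINO's boundary features, complemented by a supervision strategy based on SAM-generated pseudo-masks. Extensive experiments show that our method achieves state-of-the-art performance on multiple weakly-supervised OVS benchmarks, improving the average score by 2.1% (from 44.8% to 46.9%). Notably, our approach substantially improves segmentation accuracy in complex, cluttered scenes, with a gain of 6.3% on Cityscapes (from 36.6% to 42.9%).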

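Summary

OVS-DINO addresses the limitations of existing Open-Vocabulary Segmentation methods by revitalizing DINO's latent edge-sensitivity through structural alignment with SAM, resulting in state-of-the-art performance across multiple weakly-supervised benchmarks. The method improves the average score by 2.1% (from 44.8% to 46.9%) and achieves a 6.3% gain on Cityscapes (from 36.6% to 42.9%), a complex, cluttered setting.

OVS-DINO reactivates DINO's boundary awareness by structurally aligning DINO with SAM's structural priors, and it introduces a Structure-Aware Encoder and a Structure-Modulated Decoder to strengthen fine-grained segmentation. Experiments show that OVS-DINO achieves state-of-the-art performance on multiple weakly-supervised OVS benchmarks, improving the average score by 2.1% and the segmentation accuracy on the complex scenes of Cityscapes by 6.3%.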

CrashSight: A Phase-Aware, Infrastructure-Centric Video Benchmark for Traffic Crash Scene Understanding and Reasoning

Authors: Rui Gan, Junyi Ma, Pei Li, Xingyou Yang, Kai Chen, Sikai Chen, Bin Ran

First: 2026-04-09T16:52:04+00:00 · Latest: 2026-04-09T16:52:04+00:00


Abstract

Cooperative autonomous driving requires traffic scene understanding from both vehicle and infrastructure perspectives. While vision-language models (VLMs) show strong general reasoning capabilities, their performance in safety-critical traffic scenarios remains insufficiently evaluated due to the ego-vehicle focus of existing benchmarks. To bridge this gap, we present CrashSight, a large-scale vision-language benchmark for roadway crash understanding using real-world roadside camera data. The dataset comprises 250 crash videos, annotated with 13K multiple-choice question-answer pairs organized under a two-tier taxonomy. Tier 1 evaluates the visual grounding of scene context and involved parties, while Tier 2 probes higher-level reasoning, including crash mechanics, causal attribution, temporal progression, and post-crash outcomes. We benchmark 8 state-of-the-art VLMs and show that, despite strong scene description capabilities, current models struggle with temporal and causal reasoning in safety-critical scenarios. We provide a detailed analysis of failure scenarios and discuss directions for improving VLM crash understanding. The benchmark provides a standardized evaluation framework for infrastructure-assisted perception in cooperative autonomous driving. The CrashSight benchmark, including the full dataset and code, is accessible at https://mcgrche.github.io/crashsight.

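Chinese Title / Abstract (English translation)

Title: CrashSight: A Phase-Aware, Infrastructure-Centric Video Benchmark for Traffic Crash Scene Understanding and Reasoning

Cooperative autonomous driving requires understanding traffic scenes from both the vehicle and the infrastructure perspective. Although vision-language models (VLMs) show strong general reasoning capabilities, their performance in safety-critical traffic scenarios has not been adequately evaluated because existing benchmarks focus on the ego vehicle. To close this gap, we present CrashSight, a large-scale vision-language benchmark for roadway crash understanding built from real-world roadside camera data. The dataset contains 250 crash videos annotated with 13K multiple-choice question-answer pairs organized under a two-tier taxonomy. Tier 1 evaluates visual grounding of the scene context and the parties involved, while Tier 2 probes higher-level reasoning, including crash mechanics, causal attribution, temporal progression, and post-crash outcomes. We benchmark 8 state-of-the-art VLMs and show that, although these models describe scenes well, they still struggle with temporal and causal reasoning in safety-critical scenarios. We analyze the failure scenarios in detail and discuss directions for improving VLM crash understanding. The benchmark provides a standardized evaluation framework for infrastructure-assisted perception in cooperative autonomous driving. The CrashSight benchmark, including the full dataset and code, is available at https://mcgrche.github.io/crashsight/.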

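Summary

CrashSight is a large-scale vision-language benchmark for traffic crash scene understanding using real-world roadside camera data. It evaluates the visual grounding and higher-level reasoning capabilities of models through a two-tier taxonomy of 13K question-answer pairs. The study benchmarks 8 state-of-the-art vision-language models and finds that they struggle with temporal and causal reasoning in safety-critical scenarios, highlighting the need for improved crash understanding capabilities in autonomous driving systems.

CrashSight is a new infrastructure-centric vision-language benchmark for understanding traffic crash scenes from roadside camera footage. It contains 250 real-world crash videos and 13K question-answer pairs, organized in two tiers that evaluate visual grounding and higher-level reasoning. Eight state-of-the-art vision-language models are tested, revealing their limitations in temporal and causal reasoning in safety-critical scenarios. The benchmark is intended to standardize the evaluation of infrastructure-assisted perception in cooperative autonomous driving.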

Entropy-Gradient Grounding: Training-Free Evidence Retrieval in Vision-Language Models

Authors: Marcel Gröpl, Jaewoo Jung, Seungryong Kim, Marc Pollefeys, Sunghwan Hong

First: 2026-04-09T16:51:42+00:00 · Latest: 2026-04-09T16:51:42+00:00

Comments: Project Page : https://entropy-gradient-grounding.github.io/


Abstract

Despite rapid progress, pretrained vision-language models still struggle when answers depend on tiny visual details or on combining clues spread across multiple regions, as in documents and compositional queries. We address this by framing grounding as test-time evidence retrieval: given a query, the model should actively identify where to look next to resolve ambiguity. To this end, we propose a training-free, model-intrinsic grounding method that uses uncertainty as supervision. Specifically, we compute the entropy of the model's next-token distribution and backpropagate it to the visual token embeddings to obtain an entropy-gradient relevance map, without auxiliary detectors or attention-map heuristics. We then extract and rank multiple coherent regions to support multi-evidence queries, and introduce an iterative zoom-and-reground procedure with a spatial-entropy stopping rule to avoid over-refinement. Experiments on seven benchmarks across four VLM architectures demonstrate consistent improvements over existing methods, with the largest gains on detail-critical and high-resolution settings, while also producing more interpretable evidence localizations.
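The core computation, entropy of the next-token distribution differentiated with respect to the visual token embeddings, can be shown on a toy model. The sketch below is an assumption-heavy illustration: `toy_logits` is a hypothetical stand-in for a real VLM, the gradient is derived by hand instead of by automatic differentiation, and scoring each token by the L2 norm of its gradient is one plausible choice that the abstract does not confirm.

```python
import math
from typing import List

Vector = List[float]


def softmax(logits: Vector) -> Vector:
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs: Vector) -> float:
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def toy_logits(tokens: List[Vector], weights: List[Vector]) -> Vector:
    # Hypothetical stand-in for a VLM head:
    # logit_j = sum over tokens t and dims d of weights[j][d] * tanh(tokens[t][d])
    return [
        sum(row[d] * math.tanh(tok[d]) for tok in tokens for d in range(len(tok)))
        for row in weights
    ]


def entropy_grad(tokens: List[Vector], weights: List[Vector]) -> List[Vector]:
    """Gradient of the output entropy with respect to every visual token embedding."""
    probs = softmax(toy_logits(tokens, weights))
    h = entropy(probs)
    # Closed form for a softmax output: dH/dz_j = -p_j * (log p_j + H)
    dh_dz = [-p * (math.log(p) + h) for p in probs]
    grads = []
    for tok in tokens:
        grads.append([
            # Chain rule through tanh: d tanh(x)/dx = 1 - tanh(x)^2
            sum(g * row[d] for g, row in zip(dh_dz, weights)) * (1.0 - math.tanh(tok[d]) ** 2)
            for d in range(len(tok))
        ])
    return grads


def relevance_map(tokens: List[Vector], weights: List[Vector]) -> List[float]:
    """One relevance score per visual token: the L2 norm of its entropy gradient."""
    return [math.sqrt(sum(g * g for g in grad)) for grad in entropy_grad(tokens, weights)]


# Three hypothetical 2-dimensional visual tokens and a 3-word vocabulary.
tokens = [[0.1, -0.2], [2.0, 0.5], [0.0, 0.0]]
weights = [[1.0, -0.5], [0.3, 0.8], [-0.7, 0.2]]
print(relevance_map(tokens, weights))
```

With a real model the same quantity would come from a backward pass through the network. Tokens with large scores are those whose perturbation most changes the model's uncertainty, which is what makes them candidates for the zoom-and-reground step.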

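Chinese Title / Abstract (English translation)

Title: Entropy-Gradient Grounding: Training-Free Evidence Retrieval in Vision-Language Models

Despite rapid progress, pretrained vision-language models still struggle when the answer depends on tiny visual details or on combining clues spread across multiple regions, as in documents and compositional queries. We address this by recasting grounding as test-time evidence retrieval: given a query, the model should actively identify where to look next in order to resolve ambiguity. To this end, we propose a training-free, model-intrinsic grounding method that uses uncertainty as supervision. Specifically, we compute the entropy of the model's next-token distribution and backpropagate it to the visual token embeddings to obtain an entropy-gradient relevance map, without auxiliary detectors or attention-map heuristics. We then extract and rank multiple coherent regions to support multi-evidence queries, and introduce an iterative zoom-and-reground procedure with a spatial-entropy stopping rule to avoid over-refinement. Experiments on seven benchmarks across four VLM architectures show consistent improvements over existing methods, with the largest gains in detail-critical and high-resolution settings, while also producing more interpretable evidence localizations.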

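Summary

The research addresses the challenge of pretrained vision-language models struggling with queries that require attention to small visual details or combining clues from multiple regions. It proposes an entropy-gradient grounding method that uses the model's uncertainty as supervision to retrieve relevant visual evidence without additional training. The method computes the entropy of the model's next-token distribution and backpropagates it to the visual token embeddings to generate an entropy-gradient relevance map, which helps in identifying multiple coherent regions for complex queries. Experiments show consistent improvements over existing methods, especially in detail-critical and high-resolution settings, and the method produces more interpretable evidence localizations.

The study addresses the difficulty pretrained vision-language models have with queries that require attention to small visual details or the combination of clues from multiple regions. It proposes a training-free entropy-gradient grounding method that uses the model's uncertainty to generate an entropy-gradient relevance map, which helps identify multiple coherent regions for evidence retrieval, and it introduces an iterative zoom-and-reground procedure that avoids over-refinement. Experiments show consistent improvements across benchmarks, with the largest gains in detail-critical and high-resolution settings, along with more interpretable evidence localizations.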

BLaDA: Bridging Language to Functional Dexterous Actions within 3DGS Fields

Authors: Fan Yang, Wenrui Chen, Guorun Yan, Ruize Liao, Wanjun Jia, Dongsheng Luo, Kailun Yang, Zhiyong Li, Yaonan Wang

First: 2026-04-09T16:10:20+00:00 · Latest: 2026-04-09T16:10:20+00:00

Comments: Code will be publicly available at https://github.com/PopeyePxx/BLaDA


Abstract

In unstructured environments, functional dexterous grasping calls for the tight integration of semantic understanding, precise 3D functional localization, and physically interpretable execution. Modular hierarchical methods are more controllable and interpretable than end-to-end VLA approaches, but existing ones still rely on predefined affordance labels and lack the tight semantic-pose coupling needed for functional dexterous manipulation. To address this, we propose BLaDA (Bridging Language to Dexterous Actions in 3DGS fields), an interpretable zero-shot framework that grounds open-vocabulary instructions as perceptual and control constraints for functional dexterous manipulation. BLaDA establishes an interpretable reasoning chain by first parsing natural language into a structured sextuple of manipulation constraints via a Knowledge-guided Language Parsing (KLP) module. To achieve pose-consistent spatial reasoning, we introduce the Triangular Functional Point Localization (TriLocation) module, which utilizes 3D Gaussian Splatting as a continuous scene representation and identifies functional regions under triangular geometric constraints. Finally, the 3D Keypoint Grasp Matrix Transformation Execution (KGT3D+) module decodes these semantic-geometric constraints into physically plausible wrist poses and finger-level commands. Extensive experiments on complex benchmarks demonstrate that BLaDA significantly outperforms existing methods in both affordance grounding precision and the success rate of functional manipulation across diverse categories and tasks. Code will be publicly available at https://github.com/PopeyePxx/BLaDA.

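Chinese Title / Abstract (English translation)

Title: BLaDA: Bridging Language to Functional Dexterous Actions within 3DGS Fields

In unstructured environments, functional dexterous grasping requires tight integration of semantic understanding, precise 3D functional localization, and physically interpretable execution. Modular hierarchical methods are more controllable and interpretable than end-to-end VLA approaches, but existing ones still depend on predefined affordance labels and lack the tight semantic-pose coupling that functional dexterous manipulation needs. To address this, we propose BLaDA (Bridging Language to Dexterous Actions in 3DGS fields), an interpretable zero-shot framework that grounds open-vocabulary instructions as perceptual and control constraints for functional dexterous manipulation. BLaDA builds an interpretable reasoning chain by first parsing natural language into a structured sextuple of manipulation constraints through a Knowledge-guided Language Parsing (KLP) module. For pose-consistent spatial reasoning, we introduce the Triangular Functional Point Localization (TriLocation) module, which uses 3D Gaussian Splatting as a continuous scene representation and identifies functional regions under triangular geometric constraints. Finally, the 3D Keypoint Grasp Matrix Transformation Execution (KGT3D+) module decodes these semantic-geometric constraints into physically plausible wrist poses and finger-level commands. Extensive experiments on complex benchmarks show that BLaDA significantly outperforms existing methods in both affordance grounding precision and functional manipulation success rate across diverse categories and tasks. Code will be publicly available at https://github.com/PopeyePxx/BLaDA.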

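Summary

BLaDA is an interpretable zero-shot framework designed for functional dexterous manipulation in unstructured environments. It integrates natural language instructions into perceptual and control constraints through a Knowledge-guided Language Parsing module, and uses Triangular Functional Point Localization and 3D Keypoint Grasp Matrix Transformation Execution modules to achieve precise pose localization and physically plausible grasp execution. Experiments show that BLaDA outperforms existing methods in both affordance grounding precision and functional manipulation success rate across various categories and tasks.

BLaDA is an interpretable zero-shot framework for functional dexterous manipulation in unstructured environments. It converts natural language instructions into perceptual and control constraints through a Knowledge-guided Language Parsing module, and it uses the Triangular Functional Point Localization and 3D Keypoint Grasp Matrix Transformation Execution modules for precise pose localization and physically plausible grasp execution. Experiments show that BLaDA outperforms existing methods in both functional manipulation success rate and affordance grounding precision across categories and tasks.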

Phantasia: Context-Adaptive Backdoors in Vision Language Models

Authors: Nam Duong Tran, Phi Le Nguyen

Venue: CVPR 2026

First: 2026-04-09T15:55:33+00:00 · Latest: 2026-04-09T15:55:33+00:00

Comments: CVPR 2026 Findings


Abstract

Recent advances in Vision-Language Models (VLMs) have greatly enhanced the integration of visual perception and linguistic reasoning, driving rapid progress in multimodal understanding. Despite these achievements, the security of VLMs, particularly their vulnerability to backdoor attacks, remains significantly underexplored. Existing backdoor attacks on VLMs are still in an early stage of development, with most current methods relying on generating poisoned responses that contain fixed, easily identifiable patterns. In this work, we make two key contributions. First, we demonstrate for the first time that the stealthiness of existing VLM backdoor attacks has been substantially overestimated. By adapting defense techniques originally designed for other domains (e.g., vision-only and text-only models), we show that several state-of-the-art attacks can be detected with surprising ease. Second, to address this gap, we introduce Phantasia, a context-adaptive backdoor attack that dynamically aligns its poisoned outputs with the semantics of each input. Instead of producing static poisoned patterns, Phantasia encourages models to generate contextually coherent yet malicious responses that remain plausible, thereby significantly improving stealth and adaptability. Extensive experiments across diverse VLM architectures reveal that Phantasia achieves state-of-the-art attack success rates while maintaining benign performance under various defensive settings.

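Chinese Title / Abstract (English translation)

Title: Phantasia: Context-Adaptive Backdoors in Vision Language Models

Recent advances in Vision-Language Models (VLMs) have greatly strengthened the integration of visual perception and linguistic reasoning, driving rapid progress in multimodal understanding. Despite these achievements, the security of VLMs, and in particular their vulnerability to backdoor attacks, remains significantly underexplored. Existing backdoor attacks on VLMs are still at an early stage, and most current methods rely on generating poisoned responses that contain fixed, easily identifiable patterns. In this work, we make two key contributions. First, we show for the first time that the stealthiness of existing VLM backdoor attacks has been substantially overestimated. By adapting defense techniques originally designed for other domains (for example, vision-only and text-only models), we show that several state-of-the-art attacks can be detected with surprising ease. Second, to address this gap, we introduce Phantasia, a context-adaptive backdoor attack that dynamically aligns its poisoned outputs with the semantics of each input. Instead of producing static poisoned patterns, Phantasia encourages models to generate contextually coherent yet malicious responses that remain plausible, which significantly improves stealth and adaptability. Extensive experiments across diverse VLM architectures show that Phantasia achieves state-of-the-art attack success rates while maintaining benign performance under various defensive settings.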

Summary

This work addresses the security vulnerabilities of Vision-Language Models (VLMs) by demonstrating that existing backdoor attacks are less stealthy than previously thought. The authors introduce Phantasia, a context-adaptive backdoor attack that generates contextually coherent yet malicious responses, improving stealth and adaptability. Experiments show that Phantasia achieves state-of-the-art attack success rates while maintaining benign performance under defensive settings.

The paper shows that the stealthiness of existing backdoor attacks on vision-language models (VLMs) has been overestimated and proposes Phantasia, a context-adaptive backdoor attack that generates contextually relevant yet malicious responses, improving stealth and adaptability. Experiments show that Phantasia achieves state-of-the-art attack success rates under various defensive settings while maintaining benign performance.

Don't Overthink It: Inter-Rollout Action Agreement as a Free Adaptive-Compute Signal for LLM Agents

Authors: Khushal Sethi

First: 2026-04-09T15:34:22+00:00 · Latest: 2026-04-09T15:34:22+00:00


Abstract

Inference-time compute scaling has emerged as a powerful technique for improving the reliability of large language model (LLM) agents, but existing methods apply compute uniformly: every decision step receives the same budget regardless of its difficulty. We introduce TrACE (Trajectorical Adaptive Compute via agrEement), a training-free controller that allocates LLM calls adaptively across agent timesteps by measuring inter-rollout action agreement. At each step, TrACE samples a small set of candidate next actions and measures how consistently the model commits to the same action. High agreement signals an easy decision; the controller commits immediately. Low agreement signals uncertainty; the controller samples additional rollouts up to a configurable cap before committing to the plurality action. No learned components, no external verifier, and no human labels are required. We evaluate TrACE against greedy decoding and fixed-budget self-consistency (SC-4, SC-8) on two benchmarks spanning single-step reasoning (GSM8K, n=50) and multi-step household navigation (MiniHouse, n=30), using a Qwen 2.5 3B Instruct model running on CPU. TrACE-4 matches SC-4 accuracy while using 33% fewer LLM calls on GSM8K and 39% fewer on MiniHouse. TrACE-8 matches SC-8 accuracy with 55% fewer calls on GSM8K and 65% fewer on MiniHouse. We further show that inter-rollout agreement is a reliable signal of step-level success, validating the core hypothesis that the model's own output consistency encodes difficulty information that can be exploited without training. TrACE is the first training-free, per-timestep adaptive-compute controller for LLM agents to be evaluated on multi-step sequential decision tasks.
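The control loop described above is small enough to sketch directly. This is a minimal illustration, not the paper's implementation: the abstract does not specify the initial sample size or the agreement threshold, so `k_init` and `agree_threshold` are assumptions, and `sample_action` is a hypothetical callable standing in for one LLM call that returns a candidate action.

```python
from collections import Counter
from typing import Callable, Tuple


def trace_step(
    sample_action: Callable[[], str],
    k_init: int = 2,               # assumed size of the initial sample
    cap: int = 4,                  # configurable cap on calls per step
    agree_threshold: float = 1.0,  # assumed: commit early only on full agreement
) -> Tuple[str, int]:
    """Choose an action for one timestep; return (action, number of LLM calls used)."""
    votes = Counter(sample_action() for _ in range(k_init))
    calls = k_init
    while True:
        action, count = votes.most_common(1)[0]
        # Commit when the samples agree enough, or when the budget is exhausted.
        if count / calls >= agree_threshold or calls >= cap:
            return action, calls
        votes[sample_action()] += 1
        calls += 1


# An "easy" step: every sample proposes the same action, so the controller stops early.
print(trace_step(lambda: "open_door"))  # ('open_door', 2)
```

On a step where the samples disagree, the loop keeps drawing until the cap and returns the plurality action, so the per-step cost ranges from `k_init` to `cap` calls instead of always being `cap`, as in fixed-budget self-consistency.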

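Chinese Title / Abstract (English translation)

Title: Don't Overthink It: Inter-Rollout Action Agreement as a Free Adaptive-Compute Signal for LLM Agents

Inference-time compute scaling has become a powerful technique for improving the reliability of large language model (LLM) agents, but existing methods allocate compute uniformly: every decision step receives the same budget regardless of its difficulty. We introduce TrACE (Trajectorical Adaptive Compute via agrEement), a training-free controller that allocates LLM calls adaptively across agent timesteps by measuring inter-rollout action agreement. At each step, TrACE samples a small set of candidate next actions and measures how consistently the model chooses the same action. High agreement signals an easy decision, and the controller commits immediately. Low agreement signals uncertainty, and the controller samples additional rollouts up to a configurable cap before committing to the plurality action. No learned components, external verifier, or human labels are required. Using a Qwen 2.5 3B Instruct model running on CPU, we evaluate TrACE against greedy decoding and fixed-budget self-consistency (SC-4, SC-8) on two benchmarks covering single-step reasoning (GSM8K, n=50) and multi-step household navigation (MiniHouse, n=30). TrACE-4 matches SC-4 accuracy while using 33% fewer LLM calls on GSM8K and 39% fewer on MiniHouse. TrACE-8 matches SC-8 accuracy with 55% fewer calls on GSM8K and 65% fewer on MiniHouse. We further show that inter-rollout agreement is a reliable signal of step-level success, validating the core hypothesis that the model's own output consistency encodes difficulty information that can be exploited without training. TrACE is the first training-free, per-timestep adaptive-compute controller for LLM agents to be evaluated on multi-step sequential decision tasks.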

Summary / 总结

The paper introduces TrACE, a training-free method that allocates compute resources adaptively based on inter-rollout action agreement for large language model agents. It evaluates TrACE on GSM8K and MiniHouse benchmarks, showing that it can achieve the same accuracy as fixed-budget self-consistency methods while using fewer LLM calls. Specifically, TrACE-4 and TrACE-8 match SC-4 and SC-8 accuracy, respectively, with significant reductions in LLM calls on both benchmarks.

论文提出了TrACE，这是一种基于 rollout 行动一致性分配计算资源的无训练方法，用于大型语言模型代理。它在 GSM8K 和 MiniHouse 基准上评估了 TrACE，显示它可以使用更少的 LLM 调用次数达到与固定预算自我一致性方法相同的准确度。具体来说，TrACE-4 和 TrACE-8 分别与 SC-4 和 SC-8 的准确度相当，但在两个基准上的 LLM 调用次数显著减少。
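
The commit-or-resample loop described in the TrACE abstract can be sketched as follows. This is a minimal illustration and not the authors' implementation: `sample_action` is a hypothetical stand-in for one LLM call, and `k_init`, `k_max`, and `threshold` are assumed values chosen for the example.

```python
from collections import Counter
from typing import Callable, Tuple


def adaptive_vote(sample_action: Callable[[], str],
                  k_init: int = 2,
                  k_max: int = 4,
                  threshold: float = 0.75) -> Tuple[str, int]:
    """Return (chosen action, number of samples drawn).

    sample_action is a hypothetical stand-in for one LLM call that
    proposes a next action. k_init, k_max and threshold are
    illustrative assumptions, not values taken from the paper.
    """
    votes = Counter(sample_action() for _ in range(k_init))
    drawn = k_init
    while drawn < k_max:
        top_action, top_count = votes.most_common(1)[0]
        if top_count / drawn >= threshold:
            # High agreement: treat the step as easy and commit now.
            return top_action, drawn
        # Low agreement: spend one more call on this step.
        votes[sample_action()] += 1
        drawn += 1
    # Cap reached: commit to the plurality action.
    return votes.most_common(1)[0][0], drawn
```

With a sampler that always returns the same action, the loop stops after `k_init` calls; with a sampler that disagrees with itself, it spends calls up to `k_max` before committing.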

PokeGym: A Visually-Driven Long-Horizon Benchmark for Vision-Language Models

Authors: Ruizhi Zhang, Ye Huang, Yuangang Pan, Chuanfu Shen, Zhilin Liu, Ting Xie, Wen Li, Lixin Duan

First: 2026-04-09T15:12:36+00:00 · Latest: 2026-04-09T15:12:36+00:00

Comments: Tech report

Abstract

While Vision-Language Models (VLMs) have achieved remarkable progress in static visual understanding, their deployment in complex 3D embodied environments remains severely limited. Existing benchmarks suffer from four critical deficiencies: (1) passive perception tasks circumvent interactive dynamics; (2) simplified 2D environments fail to assess depth perception; (3) privileged state leakage bypasses genuine visual processing; and (4) human evaluation is prohibitively expensive and unscalable. We introduce PokeGym, a visually-driven long-horizon benchmark instantiated within Pokemon Legends: Z-A, a visually complex 3D open-world Role-Playing Game. PokeGym enforces strict code-level isolation: agents operate solely on raw RGB observations while an independent evaluator verifies success via memory scanning, ensuring pure vision-based decision-making and automated, scalable assessment. The benchmark comprises 30 tasks (30-220 steps) spanning navigation, interaction, and mixed scenarios, with three instruction granularities (Visual-Guided, Step-Guided, Goal-Only) to systematically deconstruct visual grounding, semantic reasoning, and autonomous exploration capabilities. Our evaluation reveals a key limitation of current VLMs: physical deadlock recovery, rather than high-level planning, constitutes the primary bottleneck, with deadlocks showing a strong negative correlation with task success. Furthermore, we uncover a metacognitive divergence: weaker models predominantly suffer from Unaware Deadlocks (oblivious to entrapment), whereas advanced models exhibit Aware Deadlocks (recognizing entrapment yet failing to recover). These findings highlight the need to integrate explicit spatial intuition into VLM architectures. The code and benchmark will be available on GitHub.

中文标题/摘要

标题：PokeGym：一种视觉驱动的长时程基准测试，用于视觉-语言模型

尽管视觉-语言模型（VLMs）在静态视觉理解方面取得了显著进展，但其在复杂3D具身环境中的部署仍然受到严重限制。现有基准存在四个关键缺陷：（1）被动感知任务绕过了交互动态；（2）简化的2D环境无法评估深度感知；（3）特权状态泄露绕过了真正的视觉处理；（4）人工评估成本高昂且无法扩展。我们引入了PokeGym，这是一个在《Pokémon Legends: Z-A》这一视觉复杂的3D开放世界角色扮演游戏中构建的视觉驱动长时程基准测试。PokeGym 强制执行严格的代码级隔离：智能体仅基于原始RGB观察进行操作，而独立评估器通过内存扫描验证成功，确保纯粹基于视觉的决策和自动化的可扩展评估。基准测试包括30项任务（30-220步），涵盖导航、交互和混合场景，并提供三种指令粒度（视觉引导、步骤引导、仅目标），以系统地分解视觉定位、语义推理和自主探索能力。我们的评估揭示了当前VLMs的一个关键局限：物理死锁恢复而非高级规划构成了主要瓶颈，死锁与任务成功之间存在强烈负相关。此外，我们发现了一种元认知分歧：较弱的模型主要遭受无意识死锁（未察觉自己被困），而先进的模型表现出有意识死锁（意识到被困但无法恢复）。这些发现突显了将显式空间直觉整合到VLM架构中的必要性。代码和基准测试将在GitHub上提供。

Summary / 总结

PokeGym is a visually-driven long-horizon benchmark for Vision-Language Models (VLMs) within a complex 3D game environment, addressing limitations of existing benchmarks. It involves 30 tasks with varying instruction granularities, ensuring pure vision-based decision-making and automated evaluation. Key findings show that current VLMs struggle with physical deadlock recovery, indicating a need for improved spatial intuition in VLM architectures. The benchmark aims to systematically evaluate visual grounding, semantic reasoning, and autonomous exploration capabilities of VLMs.

PokeGym 是一个基于视觉的长期基准，用于评估 Vision-Language 模型（VLMs）在复杂 3D 开放世界游戏中的表现，通过引入互动动态、深度感知和纯视觉决策来弥补现有基准的不足。它包含30个任务，具有不同的指令粒度，以评估视觉定位、语义推理和自主探索能力。关键发现表明，当前的 VLMs 在物理死锁恢复方面存在困难，这表明需要在 VLM 架构中集成显式的空间直觉。

Weakly-Supervised Lung Nodule Segmentation via Training-Free Guidance of 3D Rectified Flow

Authors: Richard Petersen, Fredrik Kahl, Jennifer Alvén

Venue: MICCAI 2026 (submitted)

First: 2026-04-09T14:46:14+00:00 · Latest: 2026-04-09T14:46:14+00:00

Comments: Submitted to MICCAI 2026

Abstract

Dense annotations, such as segmentation masks, are expensive and time-consuming to obtain, especially for 3D medical images where expert voxel-wise labeling is required. Weakly supervised approaches aim to address this limitation, but often rely on attribution-based methods that struggle to accurately capture small structures such as lung nodules. In this paper, we propose a weakly-supervised segmentation method for lung nodules by combining pretrained state-of-the-art rectified flow and predictor models in a plug-and-play manner. Our approach uses training-free guidance of a 3D rectified flow model, requiring only fine-tuning of the predictor using image-level labels and no retraining of the generative model. The proposed method produces improved-quality segmentations for two separate predictors, consistently detecting lung nodules of varying size and shapes. Experiments on LUNA16 demonstrate improvements over baseline methods, highlighting the potential of generative foundation models as tools for weakly supervised 3D medical image segmentation.

中文标题/摘要

标题：基于3D校正流免训练引导的弱监督肺结节分割

密集标注（如分割掩码）获取成本高且耗时，尤其是在3D医学图像中，需要专家进行逐体素标注。弱监督方法旨在解决这一限制，但通常依赖于基于归因的方法，这些方法难以准确捕捉肺结节这样的小结构。本文提出了一种弱监督肺结节分割方法，以即插即用的方式结合预训练的最先进校正流模型和预测器模型。我们的方法对3D校正流模型进行无需训练的引导，仅需使用图像级标签微调预测器，无需重新训练生成模型。所提出的方法为两个独立的预测器生成了质量更高的分割结果，能够一致地检测不同大小和形状的肺结节。LUNA16上的实验表明，该方法优于基线方法，突显了生成式基础模型作为弱监督3D医学图像分割工具的潜力。
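
As a rough illustration of what training-free guidance of a flow model can look like, the sketch below integrates a flow ODE dx/dt = v(x, t) with Euler steps and adds a scaled guidance term to the velocity at each step. The abstract does not give the guidance rule, so the additive form, the `guidance_grad` callable, and the `scale` parameter are all assumptions made for this example; the scalar state stands in for a 3D volume.

```python
from typing import Callable


def guided_euler_sample(x0: float,
                        velocity: Callable[[float, float], float],
                        guidance_grad: Callable[[float, float], float],
                        steps: int = 10,
                        scale: float = 1.0) -> float:
    """Integrate dx/dt = v(x, t) + scale * g(x, t) from t=0 to t=1.

    velocity plays the role of the frozen, pretrained flow model.
    guidance_grad is a hypothetical signal derived from a predictor;
    adding it to the velocity is an assumed form of guidance, not
    necessarily the rule used in the paper.
    """
    x = x0
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        x = x + dt * (velocity(x, t) + scale * guidance_grad(x, t))
    return x
```

The generative model's weights are never updated here; only the sampling trajectory changes, which is what makes the guidance training-free.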

SeLaR: Selective Latent Reasoning in Large Language Models

Authors: Renyu Fu, Guibo Luo

Venue: ACL 2026

First: 2026-04-09T14:32:07+00:00 · Latest: 2026-04-09T14:32:07+00:00

Comments: Camera-ready for ACL 2026 (main conference)

Abstract

Chain-of-Thought (CoT) has become a cornerstone of reasoning in large language models, yet its effectiveness is constrained by the limited expressiveness of discrete token sampling. Recent latent reasoning approaches attempt to alleviate this limitation by replacing discrete tokens with soft embeddings (probability-weighted mixtures of token embeddings) or hidden states, but they commonly suffer from two issues: (1) global activation injects perturbations into high-confidence steps, impairing reasoning stability; and (2) soft embeddings quickly collapse toward the highest-probability token, limiting exploration of alternative trajectories. To address these challenges, we propose SeLaR (Selective Latent Reasoning), a lightweight and training-free framework. SeLaR introduces an entropy-gated mechanism that activates soft embeddings only at low-confidence steps, while preserving discrete decoding at high-confidence steps. Additionally, we propose an entropy-aware contrastive regularization that pushes soft embeddings away from the dominant (highest-probability) token's direction, encouraging sustained exploration of multiple latent reasoning paths. Experiments on five reasoning benchmarks demonstrate that SeLaR consistently outperforms standard CoT and state-of-the-art training-free methods.

中文标题/摘要

标题：SeLaR：大型语言模型中的选择性潜在推理

思维链（CoT）已成为大型语言模型推理的基石，但其有效性受到离散标记采样表达能力有限的制约。最近的潜在推理方法试图通过用软嵌入（按概率加权的标记嵌入混合）或隐藏状态替换离散标记来缓解这一限制，但它们通常存在两个问题：（1）全局激活会向高置信度步骤注入扰动，损害推理稳定性；（2）软嵌入迅速向最高概率的标记坍缩，限制了对替代路径的探索。为了解决这些挑战，我们提出了SeLaR（选择性潜在推理），这是一种轻量级且无需训练的框架。SeLaR引入了一个熵门控机制，仅在低置信度步骤激活软嵌入，而在高置信度步骤保持离散解码。此外，我们还提出了一种熵感知对比正则化，将软嵌入推离主导（最高概率）标记的方向，鼓励对多条潜在推理路径的持续探索。在五个推理基准上的实验表明，SeLaR始终优于标准CoT和最先进的无需训练方法。

Summary / 总结

SeLaR is a lightweight framework for selective latent reasoning in large language models that addresses the limitations of global activation and soft embedding collapse. By using an entropy-gated mechanism, SeLaR activates soft embeddings only at low-confidence steps, maintaining discrete decoding at high-confidence steps. Additionally, entropy-aware contrastive regularization encourages exploration of multiple reasoning paths. Experiments show that SeLaR outperforms standard Chain-of-Thought and state-of-the-art training-free methods across five reasoning benchmarks.

SeLaR 是一个轻量级且无需训练的框架，通过解决现有潜在推理方法的限制来增强大型语言模型的推理能力。它引入了一个基于熵门控的机制，在低置信度步骤中选择性地激活软嵌入，而在高置信度步骤中保持离散解码。此外，SeLaR 还采用了一种基于熵的对比正则化，以鼓励探索多种潜在推理路径。实验结果显示，SeLaR 在五个推理基准测试中均优于标准的链式思考和最先进的无需训练方法。
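
The entropy gate described above can be sketched in a few lines. This is an illustrative reading of the abstract, not the authors' code: the threshold `tau` and the function names are assumptions, and the contrastive regularization term is left out.

```python
import math
from typing import List, Sequence


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def next_input_embedding(probs: Sequence[float],
                         embeddings: Sequence[Sequence[float]],
                         tau: float) -> List[float]:
    """Choose the embedding fed to the next decoding step.

    tau is a hypothetical entropy threshold. Below it the step counts
    as high-confidence and the argmax token embedding is used (discrete
    decoding); otherwise a probability-weighted mixture of token
    embeddings (a soft embedding) is returned.
    """
    if entropy(probs) < tau:
        best = max(range(len(probs)), key=probs.__getitem__)
        return list(embeddings[best])
    dim = len(embeddings[0])
    return [sum(p * e[d] for p, e in zip(probs, embeddings))
            for d in range(dim)]
```

A peaked distribution such as `[0.99, 0.01]` stays on the discrete path, while a flat one such as `[0.5, 0.5]` yields the averaged embedding.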

Can Vision Language Models Judge Action Quality? An Empirical Evaluation

Authors: Miguel Monte e Freitas, Rui Henriques, Ricardo Rei, Pedro Henrique Martins

First: 2026-04-09T14:29:19+00:00 · Latest: 2026-04-09T14:29:19+00:00

Abstract

Action Quality Assessment (AQA) has broad applications in physical therapy, sports coaching, and competitive judging. Although Vision Language Models (VLMs) hold considerable promise for AQA, their actual performance in this domain remains largely uncharacterised. We present a comprehensive evaluation of state-of-the-art VLMs across activity domains (e.g. fitness, figure skating, diving), tasks, representations, and prompting strategies. Baseline results reveal that Gemini 3.1 Pro, Qwen3-VL and InternVL3.5 models perform only marginally above random chance, and although strategies such as incorporation of skeleton information, grounding instructions, reasoning structures and in-context learning lead to isolated gains, none is consistently effective. Analysis of prediction distributions uncovers two systematic biases: a tendency to predict correct execution regardless of visual evidence, and a sensitivity to superficial linguistic framing. Reformulating tasks contrastively to mitigate these biases yields minimal improvement, suggesting that the models' limitations go beyond these biases, pointing to a fundamental difficulty with fine-grained movement quality assessment. Our findings establish a rigorous baseline for future VLM-based AQA research and provide an actionable outline for failure modes requiring mitigation prior to reliable real-world deployment.

中文标题/摘要

标题：视觉语言模型能否评判动作质量？一项实证评估

动作质量评估（AQA）在物理治疗、体育教练和竞技评判中有着广泛的应用。尽管视觉语言模型（VLMs）在AQA方面具有很大的潜力，但它们在这一领域的实际表现仍然鲜有研究。我们对最先进的VLMs在活动领域（如健身、花样滑冰、跳水）、任务、表示和提示策略方面进行了全面评估。基线结果显示，Gemini 3.1 Pro、Qwen3-VL和InternVL3.5模型的表现仅略高于随机猜测，尽管融入骨骼信息、语义指令、推理结构和上下文学习等策略可以带来孤立的提升，但没有一种策略是始终有效的。对预测分布的分析揭示了两种系统性偏差：一种是无论视觉证据如何，倾向于预测正确的执行，另一种是对表面语言框架的敏感性。通过对比重新表述任务以减轻这些偏差的效果甚微，这表明模型的局限性超出了这些偏差，指向了精细动作质量评估的基本困难。我们的研究结果为未来基于VLM的AQA研究奠定了严格的基线，并提供了在可靠的实际部署前需要缓解的失败模式的可操作指南。

Summary / 总结

The study evaluates the performance of state-of-the-art Vision Language Models (VLMs) in Action Quality Assessment (AQA), a task with applications in physical therapy, sports coaching, and competitive judging. Despite promising potential, the baseline results show that models like Gemini 3.1 Pro, Qwen3-VL, and InternVL3.5 perform only slightly better than random chance. Incorporating skeleton information, grounding instructions, reasoning structures, and in-context learning provided some gains but were not consistently effective. Analysis revealed two biases: a tendency to predict correct execution regardless of visual evidence and sensitivity to superficial linguistic framing. These findings suggest that VLMs face fundamental challenges in assessing fine-grained movement quality, setting a baseline for future research and highlighting areas needing improvement for reliable real-world deployment.

研究评估了最先进的视觉语言模型（VLMs）在动作质量评估（AQA）中的表现，发现如Gemini 3.1 Pro、Qwen3-VL和InternVL3.5等模型的表现仅略高于随机猜测。虽然整合骨架信息、语义指导、推理结构和上下文学习等策略有所提升，但并不稳定。分析发现两种偏见：倾向于无视视觉证据预测正确执行和对表面语言框架的敏感性。重新表述任务并未显著改善结果，表明在评估精细动作质量方面存在根本性挑战。这项研究为未来研究设定了基准，并指出了需要改进的领域以实现可靠的现实世界部署。

EditCaption: Human-Aligned Instruction Synthesis for Image Editing via Supervised Fine-Tuning and Direct Preference Optimization

Authors: Xiangyuan Wang, Honghao Cai, Yunhao Bai, Tianze Zhou, Haohua Chen, Yao Hu, Xu Tang, Yibo Chen, Wei Zhu

First: 2026-04-09T13:11:33+00:00 · Latest: 2026-04-09T13:11:33+00:00

Abstract

High-quality training triplets (source-target image pairs with precise editing instructions) are a critical bottleneck for scaling instruction-guided image editing models. Vision-language models (VLMs) are widely used for automated instruction synthesis, but we identify three systematic failure modes in image-pair settings: orientation inconsistency (e.g., left/right confusion), viewpoint ambiguity, and insufficient fine-grained attribute description. Human evaluation shows that over 47% of instructions from strong baseline VLMs contain critical errors unusable for downstream training. We propose EditCaption, a scalable two-stage post-training pipeline for VLM-based instruction synthesis. Stage 1 builds a 100K supervised fine-tuning (SFT) dataset by combining GLM automatic annotation, EditScore-based filtering, and human refinement for spatial, directional, and attribute-level accuracy. Stage 2 collects 10K human preference pairs targeting the three failure modes and applies direct preference optimization (DPO) for alignment beyond SFT alone. On Eval-400, ByteMorph-Bench, and HQ-Edit, fine-tuned Qwen3-VL models outperform open-source baselines; the 235B model reaches 4.712 on Eval-400 (vs. Gemini-3-Pro 4.706, GPT-4.1 4.220, Kimi-K2.5 4.111) and 4.588 on ByteMorph-Bench (vs. Gemini-3-Pro 4.522, GPT-4.1 3.412). Human evaluation shows critical errors falling from 47.75% to 23% and correctness rising from 41.75% to 66%. The work offers a practical path to scalable, human-aligned instruction synthesis for image editing data.

中文标题/摘要

标题：EditCaption: 通过监督微调和直接偏好优化实现图像编辑的人类对齐指令合成

高质量的训练三元组（带有精确编辑指令的源-目标图像对）是扩展基于指令的图像编辑模型的关键瓶颈。视觉语言模型（VLMs）广泛用于自动化指令合成，但我们发现在图像对设置中存在三种系统性失败模式：方向不一致（例如，左右混淆）、视角模糊以及属性描述不足。人类评估显示，超过47%的来自强大基线VLM的指令包含关键错误，无法用于下游训练。我们提出了EditCaption，一种基于VLM的指令合成的可扩展两阶段后训练管道。第一阶段通过结合GLM自动注释、EditScore过滤和人类细化，构建了10万条监督微调（SFT）数据集，以提高空间、方向和属性级别的准确性。第二阶段收集了1万条人类偏好对，针对三种失败模式，并应用直接偏好优化（DPO）以超越SFT本身实现对齐。在Eval-400、ByteMorph-Bench和HQ-Edit上，微调后的Qwen3-VL模型优于开源基线；2350亿参数模型在Eval-400上达到4.712（与Gemini-3-Pro 4.706、GPT-4.1 4.220、Kimi-K2.5 4.111相比），在ByteMorph-Bench上达到4.588（与Gemini-3-Pro 4.522、GPT-4.1 3.412相比）。人类评估显示，关键错误率从47.75%降至23%，正确性从41.75%升至66%。该工作提供了一条实用路径，实现可扩展且人类对齐的图像编辑数据指令合成。

Summary / 总结

The research addresses the shortage of high-quality training triplets for instruction-guided image editing models. It proposes EditCaption, a two-stage post-training pipeline that combines supervised fine-tuning and direct preference optimization. Stage 1 builds a 100K SFT dataset from automatic annotation, EditScore-based filtering, and human refinement; Stage 2 uses 10K human preference pairs to improve alignment. Fine-tuned Qwen3-VL models outperform open-source baselines, with the 235B model scoring 4.712 on Eval-400 and 4.588 on ByteMorph-Bench. In human evaluation, the critical error rate falls from 47.75% to 23% and correctness rises from 41.75% to 66%. This work provides a practical approach to scalable, human-aligned instruction synthesis for image editing data.

论文旨在解决指令引导的图像编辑模型缺乏高质量训练三元组的问题。提出了EditCaption，这是一种两阶段后训练管道，使用监督微调和直接偏好优化来改进指令合成。第一阶段构建了一个10万条样本的SFT数据集，并通过人工修正提高空间、方向和属性级别的准确性。第二阶段使用人工偏好对进行进一步对齐。结果显示，在Eval-400、ByteMorph-Bench和HQ-Edit上的表现有所提高，235B模型在Eval-400和ByteMorph-Bench上分别获得了4.712和4.588的分数，并且人工评估表明关键错误率从47.75%降至23%，正确性从41.75%升至66%。
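
Stage 2 applies direct preference optimization. For readers unfamiliar with it, the standard DPO loss for a single preference pair can be written as below. The sequence log-probabilities are assumed to be computed elsewhere, and `beta=0.1` is an illustrative value; the paper's actual hyperparameters are not stated in the abstract.

```python
import math


def softplus(z: float) -> float:
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def dpo_loss(policy_chosen_logp: float,
             policy_rejected_logp: float,
             ref_chosen_logp: float,
             ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one (chosen, rejected) instruction pair.

    Inputs are summed token log-probabilities of each instruction
    under the trainable policy and the frozen reference model.
    beta is an assumed, illustrative temperature.
    """
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    # -log(sigmoid(beta * margin)) equals softplus(-beta * margin).
    return softplus(-beta * margin)
```

The loss equals log 2 when the policy and reference agree, and shrinks as the policy raises the human-preferred instruction relative to the rejected one.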

Vision-Language Foundation Models for Comprehensive Automated Pavement Condition Assessment

Authors: Blessing Agyei Kyem, Joshua Kofi Asamoah, Anthony Dontoh, Armstrong Aboah

First: 2026-04-09T13:11:30+00:00 · Latest: 2026-04-09T13:11:30+00:00

Abstract

General-purpose vision-language models demonstrate strong performance in everyday domains but struggle with specialized technical fields requiring precise terminology, structured reasoning, and adherence to engineering standards. This work addresses whether domain-specific instruction tuning can enable comprehensive pavement condition assessment through vision-language models. PaveInstruct, a dataset containing 278,889 image-instruction-response pairs spanning 32 task types, was created by unifying annotations from nine heterogeneous pavement datasets. PaveGPT, a pavement foundation model trained on this dataset, was evaluated against state-of-the-art vision-language models across perception, understanding, and reasoning tasks. Instruction tuning transformed model capabilities, achieving improvements exceeding 20% in spatial grounding, reasoning, and generation tasks while producing ASTM D6433-compliant outputs. These results enable transportation agencies to deploy unified conversational assessment tools that replace multiple specialized systems, simplifying workflows and reducing technical expertise requirements. The approach establishes a pathway for developing instruction-driven AI systems across infrastructure domains including bridge inspection, railway maintenance, and building condition assessment.

中文标题/摘要

标题：视觉-语言基础模型在综合路面状况评估中的应用

通用的视觉-语言模型在日常领域表现出色，但在需要精确术语、结构化推理和遵守工程标准的专业技术领域却表现不佳。本研究探讨了是否可以通过领域特定指令调优使视觉-语言模型能够进行全面的路面状况评估。PaveInstruct数据集包含278,889张图像-指令-响应对，覆盖32种任务类型，由九个异构路面数据集的注释统一而成。PaveGPT是基于该数据集训练的路面基础模型，其在感知、理解和推理任务上与最先进的视觉-语言模型进行了对比评估。指令调优提升了模型的能力，在空间定位、推理和生成任务上取得了超过20%的改进，同时生成符合ASTM D6433标准的输出。这些结果使交通部门能够部署统一的对话式评估工具，替代多个专业系统，简化工作流程并降低技术专长要求。该方法为桥梁检查、铁路维护和建筑状况评估等基础设施领域的指令驱动AI系统开发奠定了路径。

Summary / 总结

This research aims to enhance the performance of vision-language models in specialized technical fields such as pavement condition assessment. It introduces PaveInstruct, a dataset of 278,889 image-instruction-response pairs spanning 32 task types, and PaveGPT, a model trained on this dataset. Instruction tuning improves model capabilities, with gains exceeding 20% in spatial grounding, reasoning, and generation tasks and outputs that comply with ASTM D6433. This would let transportation agencies use unified conversational assessment tools, simplifying workflows and reducing technical expertise requirements.

该研究旨在提升视觉-语言模型在专业技术领域如路面状况评估中的性能。它引入了包含278,889个图像-指令-响应对、覆盖32种任务类型的PaveInstruct数据集，以及基于该数据集训练的PaveGPT模型。研究表明，指令调优显著提升了模型的能力，在空间定位、推理和生成任务中实现了超过20%的性能提升，并生成符合 ASTM D6433 标准的输出。这使得交通部门能够使用统一的对话式评估工具，简化工作流程并减少技术专长要求。

MedVR: Annotation-Free Medical Visual Reasoning via Agentic Reinforcement Learning

Authors: Zheng Jiang, Heng Guo, Chengyu Fang, Changchen Xiao, Xinyang Hu, Lifeng Sun, Minfeng Xu

Venue: ICLR 2026

First: 2026-04-09T13:04:49+00:00 · Latest: 2026-04-09T13:04:49+00:00

Comments: Accepted by ICLR 2026

Abstract

Medical Vision-Language Models (VLMs) hold immense promise for complex clinical tasks, but their reasoning capabilities are often constrained by text-only paradigms that fail to ground inferences in visual evidence. This limitation not only curtails performance on tasks requiring fine-grained visual analysis but also introduces risks of visual hallucination in safety-critical applications. Thus, we introduce MedVR, a novel reinforcement learning framework that enables annotation-free visual reasoning for medical VLMs. Its core innovation lies in two synergistic mechanisms: Entropy-guided Visual Regrounding (EVR) uses model uncertainty to direct exploration, while Consensus-based Credit Assignment (CCA) distills pseudo-supervision from rollout agreement. Without any human annotations for intermediate steps, MedVR achieves state-of-the-art performance on diverse public medical VQA benchmarks, significantly outperforming existing models. By learning to reason directly with visual evidence, MedVR promotes the robustness and transparency essential for accelerating the clinical deployment of medical AI.

中文标题/摘要

标题：MedVR：基于代理强化学习的无注释医学视觉推理

医学视觉-语言模型（VLMs）在复杂临床任务中具有巨大潜力，但其推理能力往往受限于仅基于文本的范式，无法将推断与视觉证据联系起来。这一限制不仅限制了对需要精细视觉分析的任务的性能，还增加了在安全关键应用中出现视觉幻觉的风险。因此，我们提出了MedVR，这是一种新颖的强化学习框架，使医学VLMs能够进行无注释的视觉推理。其核心创新在于两种协同机制：熵导向的视觉再定位（EVR）利用模型不确定性引导探索，而基于共识的信用分配（CCA）从回放一致性中提炼伪监督。在没有任何人类注释的情况下，MedVR在多种公开的医学VQA基准测试中达到了最先进的性能，显著优于现有模型。通过直接与视觉证据进行推理，MedVR促进了对于加速医学AI临床部署至关重要的稳健性和透明度。

Summary / 总结

MedVR is a reinforcement learning framework designed to enhance the visual reasoning capabilities of medical vision-language models (VLMs) without requiring human annotations for intermediate steps. It combines Entropy-guided Visual Regrounding (EVR), which uses model uncertainty to direct exploration, with Consensus-based Credit Assignment (CCA), which distills pseudo-supervision from rollout agreement. MedVR achieves state-of-the-art performance on diverse public medical VQA benchmarks, outperforming existing models and promoting robustness and transparency in medical AI applications.

MedVR 是一种强化学习框架，旨在增强医疗视觉语言模型的视觉推理能力，且无需对中间步骤进行人工标注。它结合了 Entropy-guided Visual Regrounding (EVR) 和 Consensus-based Credit Assignment (CCA)：前者利用模型不确定性引导探索，后者从 rollout 一致性中提炼伪监督。MedVR 在多种公开的医疗 VQA 基准测试中取得了最先进的性能，超越了现有模型，并促进了医疗 AI 应用的稳健性和透明度。

AnomalyVFM -- Transforming Vision Foundation Models into Zero-Shot Anomaly Detectors

Authors: Matic Fučka, Vitjan Zavrtanik, Danijel Skočaj

Venue: CVPR 2026

First: 2026-01-28T12:02:58+00:00 · Latest: 2026-04-09T12:38:57+00:00

Comments: Accepted to CVPR 2026

Abstract

Zero-shot anomaly detection aims to detect and localise abnormal regions in the image without access to any in-domain training images. While recent approaches leverage vision-language models (VLMs), such as CLIP, to transfer high-level concept knowledge, methods based on pure vision foundation models (VFMs), like DINOv2, have lagged behind in performance. We argue that this gap stems from two practical issues: (i) limited diversity in existing auxiliary anomaly detection datasets and (ii) overly shallow VFM adaptation strategies. To address both challenges, we propose AnomalyVFM, a general and effective framework that turns any pretrained VFM into a strong zero-shot anomaly detector. Our approach combines a robust three-stage synthetic dataset generation scheme with a parameter-efficient adaptation mechanism, utilising low-rank feature adapters and a confidence-weighted pixel loss. Together, these components enable modern VFMs to substantially outperform current state-of-the-art methods. More specifically, with RADIO as a backbone, AnomalyVFM achieves an average image-level AUROC of 94.1% across 9 diverse datasets, surpassing previous methods by a significant 3.3 percentage points. Project Page: https://maticfuc.github.io/anomaly_vfm/

中文标题/摘要

标题：AnomalyVFM -- 将视觉基础模型转化为零样本异常检测器

零样本异常检测旨在无需访问任何领域内训练图像的情况下，检测和定位图像中的异常区域。虽然最近的方法利用视觉-语言模型（VLMs），如CLIP，来转移高级概念知识，但基于纯粹视觉基础模型（VFMs）的方法，如DINOv2，在性能上落后。我们认为这种差距源于两个实际问题：(i) 现有辅助异常检测数据集的多样性有限，(ii) VFM的适应策略过于浅显。为了解决这两个挑战，我们提出了AnomalyVFM，这是一种通用且有效的框架，能够将任何预训练的VFM转化为强大的零样本异常检测器。我们的方法结合了一种稳健的三阶段合成数据集生成方案和一种参数高效的适应机制，利用低秩特征适配器和置信加权像素损失。这些组件共同使现代VFMs在性能上显著优于当前最先进的方法。具体而言，以RADIO作为骨干，AnomalyVFM在9个不同数据集上的平均图像级AUROC为94.1%，比之前的方法高出显著的3.3个百分点。项目页面：https://maticfuc.github.io/anomaly_vfm/

Summary / 总结

AnomalyVFM is designed to enhance zero-shot anomaly detection by transforming pretrained vision foundation models (VFMs) into effective detectors. It addresses the limitations of existing methods by introducing a robust three-stage synthetic dataset generation and a parameter-efficient adaptation mechanism. AnomalyVFM significantly outperforms current state-of-the-art methods, achieving an average image-level AUROC of 94.1% across nine diverse datasets, surpassing previous methods by 3.3 percentage points.

AnomalyVFM 是一种利用视觉基础模型（VFMs）如 DINOv2 提升零样本异常检测的方法，通过提出一种稳健的三阶段合成数据集生成方案和参数高效的适应机制来解决现有方法的局限性。该方法在九个不同数据集上的平均图像级 AUROC 达到 94.1%，比之前的方法高出 3.3 个百分点。
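
The headline metric here is image-level AUROC. As a reminder of what that number measures, the sketch below computes AUROC from per-image anomaly scores using its pairwise definition: the probability that a randomly chosen anomalous image scores higher than a randomly chosen normal one. The function name and the toy scores are illustrative and unrelated to the paper's data.

```python
from typing import Sequence


def auroc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Image-level AUROC via pairwise comparison.

    labels use 1 for anomalous and 0 for normal images. Ties between
    an anomalous and a normal score count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one anomalous and one normal image")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

This quadratic form is fine for small checks; evaluation toolkits normally use a rank-based equivalent for large datasets.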

ViVa: A Video-Generative Value Model for Robot Reinforcement Learning

Authors: Jindi Lv, Hao Li, Jie Li, Yifei Nie, Fankun Kong, Yang Wang, Xiaofeng Wang, Zheng Zhu, Chaojun Ni, Qiuping Deng, Hengtao Li, Jiancheng Lv, Guan Huang

First: 2026-04-09T12:28:14+00:00 · Latest: 2026-04-09T12:28:14+00:00

Abstract

Vision-language-action (VLA) models have advanced robot manipulation through large-scale pretraining, but real-world deployment remains challenging due to partial observability and delayed feedback. Reinforcement learning addresses this via value functions, which assess task progress and guide policy improvement. However, existing value models built on vision-language models (VLMs) struggle to capture temporal dynamics, undermining reliable value estimation in long-horizon tasks. In this paper, we propose ViVa, a video-generative value model that repurposes a pretrained video generator for value estimation. Taking the current observation and robot proprioception as input, ViVa jointly predicts future proprioception and a scalar value for the current state. By leveraging the spatiotemporal priors of a pretrained video generator, our approach grounds value estimation in anticipated embodiment dynamics, moving beyond static snapshots to intrinsically couple value with foresight. Integrated into RECAP, ViVa delivers substantial improvements on real-world box assembly. Qualitative analysis across all three tasks confirms that ViVa produces more reliable value signals, accurately reflecting task progress. By leveraging spatiotemporal priors from video corpora, ViVa also generalizes to novel objects, highlighting the promise of video-generative models for value estimation.

中文标题/摘要

标题：ViVa：一种用于机器人强化学习的视频生成价值模型

视觉-语言-动作（VLA）模型通过大规模预训练提升了机器人的操作能力，但由于部分可观测性和延迟反馈，实际部署仍然具有挑战性。强化学习通过价值函数解决了这一问题，评估任务进展并指导策略改进。然而，现有的基于视觉-语言模型（VLMs）的价值模型难以捕捉时间动态，影响了长期任务中可靠价值估计。本文提出了一种视频生成价值模型ViVa，该模型重新利用了预训练的视频生成器进行价值估计。通过将当前观察和机器人本体感觉作为输入，ViVa联合预测未来本体感觉和当前状态的标量值。通过利用预训练视频生成器的时空先验，我们的方法将价值估计与预期的本体动态联系起来，超越了静态快照，内在地将价值与前瞻性联系起来。集成到RECAP中，ViVa在真实世界的盒子组装任务中取得了显著改进。在所有三个任务的定性分析中，证实了ViVa生成了更可靠的价值信号，准确反映了任务进展。通过利用视频数据集中的时空先验，ViVa还能够泛化到新对象，突显了视频生成模型在价值估计中的潜力。

Summary / 总结

The paper proposes ViVa, a video-generative value model for robot reinforcement learning, addressing the challenge of partial observability and delayed feedback in real-world deployment. ViVa uses a pretrained video generator to predict future proprioception and a scalar value for the current state, integrating value estimation with anticipated embodiment dynamics. The model shows substantial improvements in real-world box assembly tasks and produces more reliable value signals that accurately reflect task progress, generalizing to novel objects.

论文提出了一种视频生成价值模型ViVa，用于解决机器人强化学习在现实部署中面临的部分可观测性和延迟反馈难题。ViVa 利用预训练的视频生成器预测未来的本体感觉和当前状态的标量值，将价值估计与预期的本体动态联系起来。该方法在真实世界盒子组装任务中取得了显著改进。定性分析表明，ViVa 提供了更可靠的价值信号，准确反映了任务进度，并能够泛化到新对象。

T-Gated Adapter: A Lightweight Temporal Adapter for Vision-Language Medical Segmentation

Authors: Pranjal Khadka

Venue: CVPR 2026 (PHAROS-AIF-MIH Workshop)

First: 2026-04-09T12:27:50+00:00 · Latest: 2026-04-09T12:27:50+00:00

Comments: Accepted at the PHAROS-AIF-MIH Workshop at CVPR 2026

Abstract

Medical image segmentation traditionally relies on fully supervised 3D architectures that demand a large amount of dense, voxel-level annotations from clinical experts, a prohibitively expensive process. Vision Language Models (VLMs) offer a powerful alternative by leveraging broad visual semantic representations learned from billions of images. However, when applied independently to 2D slices of a 3D scan, these models often produce noisy and anatomically implausible segmentations that violate the inherent continuity of anatomical structures. We propose a temporal adapter that addresses this by injecting adjacent-slice context directly into the model's visual token representations. The adapter comprises a temporal transformer attending across a fixed context window at the token level, a spatial context block refining within-slice representations, and an adaptive gate balancing temporal and single-slice features. Training on 30 labeled volumes from the FLARE22 dataset, our method achieves a mean Dice of 0.704 across 13 abdominal organs with a gain of +0.206 over the baseline VLM trained with no temporal context. Zero-shot evaluation on BTCV and AMOS22 datasets yields consistent improvements of +0.210 and +0.230, with the average cross-domain performance drop reducing from 38.0% to 24.9%. Furthermore, in a cross-modality evaluation on AMOS22 MRI with neither model receiving any MRI supervision, our method achieves a mean Dice of 0.366, outperforming a fully supervised 3D baseline (DynUNet, 0.224) trained exclusively on CT, suggesting that CLIP's visual semantic representations generalize more gracefully across imaging modalities than convolutional features.

中文标题/摘要

标题：T-门适配器：一种轻量级时间适配器用于视觉语言医学分割

医学图像分割传统上依赖于完全监督的3D架构，需要临床专家提供大量密集的体素级注释，这是一个极其昂贵的过程。视觉语言模型（VLMs）通过利用从数十亿张图像中学习到的广泛视觉语义表示提供了有力的替代方案。然而，当独立应用于3D扫描的2D切片时，这些模型通常会产生噪声大且解剖上不合理的分割，违反了解剖结构的内在连续性。我们提出了一种时间适配器，通过直接将相邻切片的上下文注入模型的视觉标记表示中来解决这一问题。该适配器包括一个在标记级别跨固定上下文窗口进行时间变换的时间变换器、一个在切片内进行细化的空间上下文块以及一个平衡时间与单切片特征的自适应门控。在FLARE22数据集中30个标注的体积上进行训练，我们的方法在13个腹部器官上实现了平均Dice值为0.704，比没有时间上下文的基线VLM提高了0.206。在BTCV和AMOS22数据集上的零样本评估中，分别取得了+0.210和+0.230的一致改进，跨域性能下降从38.0%降低到24.9%。此外，在AMOS22 MRI的跨模态评估中，两模型均未接受任何MRI监督，我们的方法实现了平均Dice值为0.366，优于仅在CT上训练的完全监督3D基线（DynUNet，0.224），表明CLIP的视觉语义表示在不同成像模态之间泛化得更好，比卷积特征更自然。

Summary / 总结

The research aims to improve the accuracy and continuity of medical image segmentations using Vision Language Models (VLMs) by addressing their tendency to produce noisy and anatomically implausible results. The proposed T-Gated Adapter injects adjacent-slice context into the model's visual token representations, enhancing the temporal and spatial coherence of the segmentations. The method achieves a mean Dice score of 0.704 across 13 abdominal organs, outperforming the baseline by +0.206. Zero-shot evaluations on BTCV and AMOS22 datasets show consistent improvements, and the method also outperforms a fully supervised 3D baseline in a cross-modality evaluation on AMOS22 MRI data without any MRI supervision, indicating better generalization across imaging modalities.

研究旨在通过解决视觉语言模型（VLMs）生成的分割结果噪声大、解剖上不合理的问题，提高医学图像分割的准确性和连续性。提出的T-Gated Adapter直接将相邻切片的上下文注入模型的视觉标记表示中，增强时间和空间的一致性。该方法在13个腹部器官上实现了平均Dice分数0.704，比基线高出+0.206。在BTCV和AMOS22数据集上的零样本评估显示了一致的改进，并且在AMOS22 MRI数据的跨模态评估中，即使没有MRI监督，该方法也优于完全监督的3D基线（DynUNet，0.224），表明其在不同成像模态之间的泛化能力更强。
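
The adaptive gate that balances temporal and single-slice features can be pictured as a learned convex combination. The sketch below assumes a sigmoid gate and a scalar `gate_logit`; both are assumptions for illustration, since the abstract does not specify how the gate is parameterized.

```python
import math
from typing import List, Sequence


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def gated_fuse(single: Sequence[float],
               temporal: Sequence[float],
               gate_logit: float) -> List[float]:
    """Blend single-slice and temporal features for one visual token.

    gate_logit is a hypothetical scalar that a trained adapter would
    predict from the features; here it is passed in directly. A gate
    near 0 keeps the single-slice feature, a gate near 1 keeps the
    feature enriched with adjacent-slice context.
    """
    g = sigmoid(gate_logit)
    return [g * t + (1.0 - g) * s for s, t in zip(single, temporal)]
```

A gate of this shape lets the model fall back to the original single-slice representation wherever neighbouring slices add nothing useful.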

Small Vision-Language Models are Smart Compressors for Long Video Understanding

Authors: Junjie Fei, Jun Chen, Zechun Liu, Yunyang Xiong, Chong Zhou, Wei Wen, Junlin Han, Mingchen Zhuge, Saksham Suri, Qi Qian, Shuming Liu, Lemeng Wu, Raghuraman Krishnamoorthi, Vikas Chandra, Mohamed Elhoseiny, Chenchen Zhu

First: 2026-04-09T11:40:25+00:00 · Latest: 2026-04-09T11:40:25+00:00

Comments: Project page and demo are available at https://FeiElysia.github.io/tempo-page/

Abstract

Adapting Multimodal Large Language Models (MLLMs) for hour-long videos is bottlenecked by context limits. Dense visual streams saturate token budgets and exacerbate the lost-in-the-middle phenomenon. Existing heuristics, like sparse sampling or uniform pooling, blindly sacrifice fidelity by discarding decisive moments and wasting bandwidth on irrelevant backgrounds. We propose Tempo, an efficient query-aware framework compressing long videos for downstream understanding. Tempo leverages a Small Vision-Language Model (SVLM) as a local temporal compressor, casting token reduction as an early cross-modal distillation process to generate compact, intent-aligned representations in a single forward pass. To enforce strict budgets without breaking causality, we introduce Adaptive Token Allocation (ATA). Exploiting the SVLM's zero-shot relevance prior and semantic front-loading, ATA acts as a training-free $O(1)$ dynamic router. It allocates dense bandwidth to query-critical segments while compressing redundancies into minimal temporal anchors to maintain the global storyline. Extensive experiments show our 6B architecture achieves state-of-the-art performance with aggressive dynamic compression (0.5-16 tokens/frame). On the extreme-long LVBench (4101s), Tempo scores 52.3 under a strict 8K visual budget, outperforming GPT-4o and Gemini 1.5 Pro. Scaling to 2048 frames reaches 53.7. Crucially, Tempo compresses hour-long videos substantially below theoretical limits, proving true long-form video understanding relies on intent-driven efficiency rather than greedily padded context windows.

中文标题/摘要

标题：小型视觉-语言模型是长视频理解的高效压缩器

将多模态大型语言模型（MLLMs）适应一小时长的视频受到上下文限制的瓶颈。密集的视觉流耗尽了令牌预算并加剧了中间信息丢失的现象。现有的启发式方法，如稀疏采样或均匀池化，盲目地牺牲了保真度，丢弃了关键时刻并浪费带宽在无关的背景上。我们提出了 Tempo，一种高效的查询感知框架，用于压缩长视频以供下游理解。Tempo 利用小型视觉-语言模型（SVLM）作为局部时间压缩器，将令牌减少视为早期跨模态蒸馏过程，以在单次前向传递中生成紧凑且意图对齐的表示。为了在不破坏因果关系的情况下严格控制预算，我们引入了自适应令牌分配（ATA）。利用 SVLM 的零样本相关性先验和语义前加载，ATA 作为无训练的 $O(1)$ 动态路由器。它将密集带宽分配给查询关键段，同时将冗余压缩为最小的时间锚点以保持全局故事情节。大量实验表明，我们的 6B 架构在激进的动态压缩（0.5-16 令牌/帧）下实现了最先进的性能。在极端长的 LVBench（4101s）上，Tempo 在严格的 8K 视觉预算下得分 52.3，优于 GPT-4o 和 Gemini 1.5 Pro。扩展到 2048 帧达到 53.7。至关重要的是，Tempo 显著压缩了小时长的视频，证明真正的长视频理解依赖于意图驱动的效率，而不是贪婪填充的上下文窗口。

Summary / 总结

The paper addresses the context-limit bottleneck in applying multimodal large language models to hour-long videos by proposing Tempo, an efficient query-aware compression framework. Tempo uses a Small Vision-Language Model (SVLM) as a local temporal compressor, reducing tokens through early cross-modal distillation while keeping representations aligned with the query intent. Adaptive Token Allocation (ATA), a training-free router, allocates dense bandwidth to query-critical segments and compresses redundant segments into minimal temporal anchors. Experiments show that the 6B model achieves state-of-the-art performance under aggressive compression (0.5-16 tokens/frame): on LVBench (4101s), Tempo scores 52.3 under a strict 8K visual budget, outperforming GPT-4o and Gemini 1.5 Pro, and reaches 53.7 when scaled to 2048 frames.

论文提出 Tempo，一种高效的查询感知框架，以解决使用多模态大型语言模型处理长视频的挑战。Tempo 使用小型视觉语言模型 (SVLM) 通过早期跨模态蒸馏减少令牌数量，同时保持意图对齐。自适应令牌分配 (ATA) 动态地将带宽路由到关键段落，将冗余压缩为最小的时间锚点。实验表明，Tempo 在严格的视觉预算下实现了最先进的性能，超越了其他模型如 GPT-4o 和 Gemini 1.5 Pro，对长视频进行激进压缩。
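
To make the budget idea in the Tempo summary concrete, here is a minimal sketch of relevance-weighted token allocation under a fixed visual budget. It is not the authors' ATA (the abstract describes ATA as a causal $O(1)$ router, while this sketch does a simple global proportional split). The function name `allocate_tokens`, the per-segment relevance scores, and the equal segment length are assumptions for illustration. Only the 0.5-16 tokens/frame range comes from the abstract.

```python
from typing import List


def allocate_tokens(
    relevance: List[float],
    frames_per_segment: int,
    budget: float,
    min_tpf: float = 0.5,
    max_tpf: float = 16.0,
) -> List[float]:
    """Return a tokens-per-frame rate for each segment (hypothetical helper)."""
    n = len(relevance)
    # Every segment keeps at least the minimum rate (a "temporal anchor").
    floor_cost = min_tpf * frames_per_segment * n
    if floor_cost > budget:
        raise ValueError("budget cannot cover the minimum rate for every segment")
    spare = budget - floor_cost
    total_rel = sum(relevance)
    rates = []
    for r in relevance:
        # Split the spare budget in proportion to relevance.
        share = spare * r / total_rel if total_rel > 0 else spare / n
        # Clamping to the maximum rate can only lower total usage.
        rates.append(min(max_tpf, min_tpf + share / frames_per_segment))
    return rates


# Three 100-frame segments; the middle one is most relevant to the query.
rates = allocate_tokens([0.1, 0.8, 0.1], frames_per_segment=100, budget=1000)
```

Because clamping only ever reduces a segment's rate, the total never exceeds the budget. Any budget left over after clamping is simply unused in this sketch.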

OV-Stitcher: A Global Context-Aware Framework for Training-Free Open-Vocabulary Semantic Segmentation

Authors: Seungjae Moon, Seunghyun Oh, Youngmin Ro

First: 2026-04-09T11:28:43+00:00 · Latest: 2026-04-09T11:28:43+00:00

Abs · PDF · Code1 · Code2

Abstract

Training-free open-vocabulary semantic segmentation (TF-OVSS) has recently attracted attention for its ability to perform dense prediction by leveraging the pretrained knowledge of large vision and vision-language models, without requiring additional training. However, due to the limited input resolution of these pretrained encoders, existing TF-OVSS methods commonly adopt a sliding-window strategy that processes cropped sub-images independently. While effective for managing high-resolution inputs, this approach prevents global attention over the full image, leading to fragmented feature representations and limited contextual reasoning. We propose OV-Stitcher, a training-free framework that addresses this limitation by stitching fragmented sub-image features directly within the final encoder block. By reconstructing attention representations from fragmented sub-image features, OV-Stitcher enables global attention within the final encoder block, producing coherent context aggregation and spatially consistent, semantically aligned segmentation maps. Extensive evaluations across eight benchmarks demonstrate that OV-Stitcher establishes a scalable and effective solution for open-vocabulary segmentation, achieving a notable improvement in mean Intersection over Union (mIoU) from 48.7 to 50.7 compared with prior training-free baselines.

中文标题/摘要

标题：OV-Stitcher：一种全局上下文感知的无训练框架，用于开放词汇语义分割

无训练开放词汇语义分割(TF-OVSS)由于能够通过利用大型视觉和视觉语言模型的预训练知识进行密集预测，而无需额外训练，最近引起了关注。然而，由于这些预训练编码器的输入分辨率有限，现有的TF-OVSS方法通常采用滑动窗口策略，独立处理裁剪的子图像。虽然这种方法对于处理高分辨率输入是有效的，但它会阻止对整个图像进行全局关注，导致特征表示碎片化和上下文推理能力有限。我们提出了一种名为OV-Stitcher的无训练框架，通过在最终编码器块内直接拼接碎片化的子图像特征来解决这一限制。通过从碎片化的子图像特征中重构注意力表示，OV-Stitcher能够在最终编码器块内实现全局关注，产生连贯的上下文聚合和空间上一致、语义对齐的分割图。在八个基准上的广泛评估表明，OV-Stitcher提供了一种可扩展且有效的开放词汇分割解决方案，与之前的无训练基线相比，平均交并比(mIoU)提高了从48.7到50.7。

Summary / 总结

OV-Stitcher is a training-free framework for open-vocabulary semantic segmentation that addresses the limitations of existing methods by stitching fragmented sub-image features within the final encoder block, enabling global attention and coherent context aggregation. This approach improves mean Intersection over Union (mIoU) from 48.7 to 50.7 compared to prior training-free baselines across eight benchmarks.

该论文提出了OV-Stitcher，一种训练-free 的开放词汇语义分割框架，通过在最终编码器块内缝合分割的子图像特征来解决现有方法的局限性。这种方法能够实现全局注意力和上下文聚合，使其在八个基准测试中相对于之前的训练-free 基线方法在平均交并比(mIoU)上取得了从48.7到50.7的显著提升。
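
The stitching step described for OV-Stitcher can be illustrated with a small sketch that places per-window token features back onto a full-image token grid and averages the overlaps. This is only an illustration under assumptions: the abstract says stitching happens inside the final encoder block by reconstructing attention representations, whereas this sketch stitches plain feature vectors. The function name `stitch_windows` and the `(top, left, feats)` layout are hypothetical.

```python
from typing import List, Tuple

Grid = List[List[List[float]]]  # grid[row][col] is one token's feature vector


def stitch_windows(
    windows: List[Tuple[int, int, Grid]], grid_h: int, grid_w: int, dim: int
) -> Grid:
    """Merge sliding-window token grids into one full-image grid (hypothetical helper)."""
    acc = [[[0.0] * dim for _ in range(grid_w)] for _ in range(grid_h)]
    cnt = [[0] * grid_w for _ in range(grid_h)]
    for top, left, feats in windows:
        for r, row in enumerate(feats):
            for c, vec in enumerate(row):
                cell = acc[top + r][left + c]
                for k in range(dim):
                    cell[k] += vec[k]
                cnt[top + r][left + c] += 1
    stitched: Grid = []
    for r in range(grid_h):
        out_row = []
        for c in range(grid_w):
            if cnt[r][c] == 0:
                raise ValueError(f"token ({r}, {c}) is not covered by any window")
            # Average the features wherever windows overlap.
            out_row.append([v / cnt[r][c] for v in acc[r][c]])
        stitched.append(out_row)
    return stitched


# Two 2x2 windows on a 2x3 token grid, overlapping in the middle column.
win_a = (0, 0, [[[1.0], [1.0]], [[1.0], [1.0]]])
win_b = (0, 1, [[[3.0], [3.0]], [[3.0], [3.0]]])
full = stitch_windows([win_a, win_b], grid_h=2, grid_w=3, dim=1)
```

Once the tokens sit on a single grid, a global attention layer can attend across what were previously independent crops, which is the property the abstract emphasizes.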

OpenTrack3D: Towards Accurate and Generalizable Open-Vocabulary 3D Instance Segmentation

Authors: Zhishan Zhou, Siyuan Wei, Zengran Wang, Chunjie Wang, Xiaosheng Yan, Xiao Liu

First: 2025-12-03T07:51:03+00:00 · Latest: 2026-04-09T11:20:25+00:00

Abs · PDF · Code1 · Code2

Abstract

Generalizing open-vocabulary 3D instance segmentation (OV-3DIS) to diverse, unstructured, and mesh-free environments is crucial for robotics and AR/VR, yet remains a significant challenge. We attribute this to two key limitations of existing methods: (1) proposal generation relies on dataset-specific proposal networks or mesh-based superpoints, rendering them inapplicable in mesh-free scenarios and limiting generalization to novel scenes; and (2) CLIP-based classifiers have weak textual reasoning and struggle to recognize compositional and functional user queries. To address these issues, we introduce OpenTrack3D, a generalizable and accurate framework. Unlike methods that rely on pre-generated proposals, OpenTrack3D employs a novel visual-spatial tracker to construct cross-view consistent object proposals online. Given an RGB-D stream, our pipeline first leverages a 2D open-vocabulary segmenter to generate masks, which are lifted to 3D point clouds using depth. Mask-guided instance features are then extracted using DINO feature maps, and our tracker fuses visual and spatial cues to maintain instance consistency. The core pipeline is entirely mesh-free, yet we also provide an optional superpoint refinement module to further enhance performance when a scene mesh is available. Finally, we replace CLIP with a multi-modal large language model (MLLM), significantly enhancing compositional reasoning for complex user queries. Extensive experiments on diverse benchmarks, including ScanNet200, Replica, ScanNet++, and SceneFun3D, demonstrate state-of-the-art performance and strong generalization capabilities.

中文标题/摘要

标题：OpenTrack3D：朝向准确且通用的开放词汇3D实例分割

将开放词汇3D实例分割（OV-3DIS）推广到多样、无结构且无网格的环境中对于机器人技术和AR/VR至关重要，但仍然是一个重大挑战。我们将其归因于现有方法的两个关键限制：（1）提案生成依赖于数据集特定的提案网络或基于网格的超点，使其在无网格场景中不适用，并限制了对新场景的泛化；（2）基于CLIP的分类器的弱文本推理能力，难以识别组合性和功能性用户查询。为了解决这些问题，我们提出了OpenTrack3D，这是一种通用且准确的框架。与依赖预生成提案的方法不同，OpenTrack3D采用了一种新颖的视觉-空间跟踪器来在线构建跨视图一致的对象提案。给定一个RGB-D流，我们的流水线首先利用2D开放词汇分割器生成掩码，然后使用深度信息将这些掩码提升到3D点云。掩码引导的实例特征随后使用DINO特征图提取，我们的跟踪器融合视觉和空间线索以保持实例一致性。核心流水线完全无网格，但我们还提供了一个可选的超点细化模块，当场景网格可用时，可以进一步提高性能。最后，我们用多模态大型语言模型（MLLM）取代了CLIP，显著增强了对复杂用户查询的组合性推理能力。在包括ScanNet200、Replica、ScanNet++和SceneFun3D在内的多种基准上的广泛实验表明，该方法具有最先进的性能和强大的泛化能力。

Summary / 总结

OpenTrack3D addresses the challenges of generalizing open-vocabulary 3D instance segmentation to diverse environments by introducing a novel visual-spatial tracker that constructs cross-view consistent object proposals online. The framework leverages a 2D open-vocabulary segmenter to generate masks, which are then lifted to 3D point clouds. Instance features are extracted using DINO feature maps, and a tracker fuses visual and spatial cues to maintain instance consistency. Experiments on various benchmarks show that OpenTrack3D achieves state-of-the-art performance and strong generalization capabilities.

OpenTrack3D通过引入一种新颖的视觉-空间追踪器来在线生成跨视图一致的对象提案，以解决在多样且未结构化环境中的开放词汇3D实例分割挑战。不同于依赖预生成提案或基于网格的超点的方法，OpenTrack3D使用2D开放词汇分割器和DINO特征图从RGB-D流中构建提案，并通过融合视觉和空间线索来保持实例一致性。此外，当场景网格可用时，还可以使用一个可选的超点细化模块。另外，OpenTrack3D用多模态大型语言模型替换CLIP，以更好地处理复杂用户查询的组合推理。在各种基准测试上的实验表明，OpenTrack3D实现了最先进的性能和强大的泛化能力。
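
The step of lifting 2D masks to 3D point clouds using depth can be sketched with the standard pinhole back-projection. This is a generic illustration, not OpenTrack3D's code. The function name `lift_mask_to_points`, the nested-list image layout, and the intrinsics `fx`, `fy`, `cx`, `cy` are assumptions. Points are expressed in the camera frame, so a real pipeline would also apply the camera pose.

```python
from typing import List, Tuple

Point = Tuple[float, float, float]


def lift_mask_to_points(
    mask: List[List[bool]],
    depth: List[List[float]],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
) -> List[Point]:
    """Back-project masked pixels with valid depth into camera-frame 3D points."""
    points: List[Point] = []
    for v, (mask_row, depth_row) in enumerate(zip(mask, depth)):
        for u, (inside, z) in enumerate(zip(mask_row, depth_row)):
            # Skip pixels outside the mask or with missing depth (z <= 0).
            if inside and z > 0:
                points.append(((u - cx) * z / fx, (v - cy) * z / fy, z))
    return points


# Tiny 2x2 example: one masked pixel has valid depth, one has a depth hole.
mask = [[False, True], [True, False]]
depth = [[1.0, 2.0], [0.0, 1.0]]
pts = lift_mask_to_points(mask, depth, fx=1.0, fy=1.0, cx=0.0, cy=0.0)
```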

Understanding Task Transfer in Vision-Language Models

Authors: Bhuvan Sachdeva, Karan Uppal, Abhinav Java, Vineeth N. Balasubramanian

Venue: CVPR 2026 Oral

First: 2025-11-24T05:37:52+00:00 · Latest: 2026-04-09T10:41:06+00:00

Comments: CVPR 2026 (Oral)

Abs · PDF · Code1 · Code2

Abstract

Vision-Language Models (VLMs) perform well on multimodal benchmarks but lag behind humans and specialized models on visual perception tasks like depth estimation or object counting. Finetuning on one task can unpredictably affect performance on others, making task-specific finetuning challenging. In this paper, we address this challenge through a systematic study of task transferability. We examine how finetuning a VLM on one perception task affects its zero-shot performance on others. We introduce Perfection Gap Factor (PGF), a normalized metric that measures change in performance as a result of task transfer. We utilize PGF to compute Task Transferability, which captures both the breadth and the magnitude of transfer induced by a source task. Using three open-weight VLMs evaluated across 13 perception tasks, we construct a task transfer graph that reveals previously unobserved relationships among perception tasks. Our analysis uncovers patterns of positive and negative transfer, identifies groups of tasks that mutually influence each other, organizes tasks into personas based on their transfer behavior and demonstrates how PGF can guide data selection for more efficient training. These findings highlight both opportunities for positive transfer and risks of negative interference, offering actionable guidance for advancing VLMs.

中文标题/摘要

标题：理解视觉语言模型中的任务迁移

视觉语言模型（VLMs）在多模态基准测试中表现良好，但在深度估计或物体计数等视觉感知任务上落后于人类和专门模型。在一项任务上的微调可能会不可预测地影响其他任务的表现，使得针对特定任务的微调具有挑战性。在本文中，我们通过系统研究任务迁移性来应对这一挑战。我们研究了在一项感知任务上微调VLM如何影响其在其他任务上的零样本表现。我们引入了完美差距因子（PGF），这是一种归一化度量，用于衡量由于任务迁移而导致的表现变化。我们利用PGF计算任务迁移性，该度量捕捉了由源任务引起的迁移的广度和幅度。使用三个开源VLMs在13项感知任务上进行评估，我们构建了一个任务迁移图，揭示了感知任务之间以前未被观察到的关系。我们的分析揭示了正迁移和负迁移的模式，确定了相互影响的任务组，根据其迁移行为将任务组织成不同的角色，并展示了PGF如何指导数据选择以实现更高效的训练。这些发现突显了正迁移的机会和负干扰的风险，为推进VLMs提供了可操作的指导。

Summary / 总结

This paper investigates how fine-tuning a Vision-Language Model (VLM) on one visual perception task affects its performance on other tasks. It introduces the Perfection Gap Factor (PGF) to measure the change in performance due to task transfer and defines Task Transferability to capture both the breadth and magnitude of this transfer. Using three open-weight VLMs across 13 perception tasks, the authors construct a task transfer graph that reveals new relationships and patterns of positive and negative transfer, guiding data selection for more efficient training and highlighting both opportunities and risks in VLM development.

本文研究了对视觉语言模型（VLM）进行一个感知任务的微调如何影响其在其他任务上的表现。作者引入了完美差距因子（PGF）来衡量任务转移导致的性能变化，并定义了任务转移性来捕捉这种影响的广度和程度。使用三个开源的VLM和13个感知任务，他们构建了一个任务转移图，揭示了新的关系和正向与负向转移的模式，指导更高效的训练数据选择。这项工作突显了VLM中正向转移的机会和负向干扰的风险。

AtlasOCR: Building the First Open-Source Darija OCR Model with Vision Language Models

Authors: Imane Momayiz, Soufiane Ait Elaouad, Abdeljalil Elmajjodi, Haitame Bouanane

First: 2026-04-09T10:38:23+00:00 · Latest: 2026-04-09T10:38:23+00:00

Abs · PDF · Code1 · Code2

Abstract

Darija, the Moroccan Arabic dialect, is rich in visual content yet lacks specialized Optical Character Recognition (OCR) tools. This paper introduces AtlasOCR, the first open-source Darija OCR model built by fine-tuning a 3B parameter Vision Language Model (VLM). We detail our comprehensive approach, from curating a unique Darija-specific dataset leveraging both synthetic generation with our OCRSmith library and carefully sourced real-world data, to implementing efficient fine-tuning strategies. We utilize QLoRA and Unsloth for parameter-efficient training of Qwen2.5-VL 3B and present comprehensive ablation studies optimizing key hyperparameters. Our evaluation on the newly curated AtlasOCRBench and the established KITAB-Bench demonstrates state-of-the-art performance, challenging larger models and highlighting AtlasOCR's robustness and generalization capabilities for both Darija and standard Arabic OCR tasks.

中文标题/摘要

标题：AtlasOCR：使用视觉语言模型构建首个开源达里雅OCR模型

达里雅，摩洛哥阿拉伯方言，富含视觉内容但缺乏专门的光学字符识别（OCR）工具。本文介绍了AtlasOCR，这是首个使用3B参数视觉语言模型（VLM）微调构建的开源达里雅OCR模型。我们详细介绍了从利用OCRSmith库的合成生成和精心收集的真实世界数据构建独特的达里雅专用数据集的方法，到实施高效的微调策略。我们使用QLoRA和Unsloth对Qwen2.5-VL 3B进行参数高效训练，并进行了全面的消融研究以优化关键超参数。我们在新构建的AtlasOCRBench和已建立的KITAB-Bench上的评估显示了最先进的性能，挑战了更大的模型，并突显了AtlasOCR在达里雅和标准阿拉伯OCR任务中的鲁棒性和泛化能力。

Summary / 总结

The paper addresses the lack of specialized OCR tools for Darija, the Moroccan Arabic dialect, by introducing AtlasOCR, the first open-source Darija OCR model. This model is built by fine-tuning a 3B parameter Vision Language Model. The authors curate a unique Darija dataset using both synthetic generation and real-world data, and employ parameter-efficient training techniques like QLoRA and Unsloth. The model shows state-of-the-art performance on both the AtlasOCRBench and KITAB-Bench, highlighting its robustness and generalization capabilities for Darija and standard Arabic OCR tasks.

论文针对摩洛哥阿拉伯方言Darija缺乏专门的OCR工具，介绍了AtlasOCR这一开源OCR模型。该模型使用一个3B参数的Vision Language Model进行微调，利用合成生成和真实世界数据创建的独特数据集。模型在AtlasOCRBench和KITAB-Bench上的表现达到了最先进的水平，展示了其在Darija和标准阿拉伯OCR任务中的鲁棒性和泛化能力。

Mitigating Visual Context Degradation in Large Multimodal Models: A Training-Free Decoupled Agentic Framework

Authors: Hongrui Jia, Chaoya Jiang, Shikun Zhang, Wei Ye

First: 2025-09-27T14:13:41+00:00 · Latest: 2026-04-09T10:37:26+00:00

Abs · PDF · Code1 · Code2 · Code3

Abstract

With the continuous expansion of Large Language Models (LLMs) and advances in reinforcement learning, LLMs have demonstrated exceptional reasoning capabilities, enabling them to address a wide range of complex problems. Inspired by these achievements, researchers have extended related techniques to Large Multimodal Models (LMMs). However, a critical limitation has emerged, reflected in the progressive loss of visual grounding. As the reasoning chain grows longer, LMMs tend to rely increasingly on the textual information generated in earlier steps, while the initially extracted visual information is rarely revisited or incorporated. This phenomenon often causes the reasoning process to drift away from the actual image content, resulting in visually implausible or even erroneous conclusions. To overcome this fundamental limitation, we propose a novel, training-free agentic paradigm that Decouples cognitive Reasoning from visual Perception (DRP). In this framework, a powerful LLM serves as a strategic Reasoner, orchestrating the inference process by explicitly querying an LMM, acting as a dedicated Observer, to retrieve fine-grained visual details on demand. This approach is lightweight, model-agnostic, and plug-and-play, necessitating no additional training or architectural modifications. Extensive experiments demonstrate our framework DRP's efficacy in regulating the visual reasoning trajectory, significantly mitigating reasoning drift, and enforcing robust visual grounding. Notably, on the MathVision benchmark, the integration of Qwen2.5-VL-7B and Qwen3-32B achieves an accuracy of 47.2%, outperforming GPT-4o's 40.6%. These findings underscore the potential of our approach to enhance multimodal reasoning reliability without the need for costly retraining. Our code is publicly available at https://github.com/hongruijia/DRP.

中文标题/摘要

标题：在大型多模态模型中缓解视觉上下文退化：一种无需训练的解耦自主框架

随着大型语言模型（LLMs）的不断扩展和强化学习的进步，LLMs 展现出了卓越的推理能力，使其能够解决各种复杂问题。受这些成就的启发，研究人员将相关技术扩展到了大型多模态模型（LMMs）中。然而，一个关键的限制出现了，表现为视觉定位的逐步丧失。随着推理链的延长，LMMs 越来越依赖于早期步骤生成的文本信息，而最初提取的视觉信息很少被重新访问或整合。这种现象往往导致推理过程偏离实际图像内容，产生视觉上不合理甚至错误的结论。为克服这一根本限制，我们提出了一种新颖的无需训练的自主范式，即解耦认知推理与视觉感知（DRP）。在该框架中，一个强大的 LLM 作为战略推理者，通过显式查询 LMM（作为专门的观察者）来按需检索细粒度的视觉细节，从而协调推理过程。该方法轻量级、模型无关且即插即用，无需额外的训练或架构修改。大量实验表明，我们的框架 DRP 在调节视觉推理轨迹、显著减轻推理漂移和确保稳健的视觉定位方面具有有效性。值得注意的是，在 MathVision 基准测试中，Qwen2.5-VL-7B 和 Qwen3-32B 的结合实现了 47.2% 的准确率，优于 GPT-4o 的 40.6%。这些发现强调了我们方法在无需昂贵的重新训练的情况下增强多模态推理可靠性的潜力。我们的代码已公开发布在 https://github.com/hongruijia/DRP。

Summary / 总结

The paper addresses visual context degradation in Large Multimodal Models (LMMs), where long reasoning chains drift away from the image content, by proposing DRP, a training-free agentic framework that Decouples cognitive Reasoning from visual Perception. A powerful LLM acts as the Reasoner and queries an LMM, acting as the Observer, for fine-grained visual details on demand. Experiments show that DRP mitigates reasoning drift and strengthens visual grounding: on the MathVision benchmark, combining Qwen2.5-VL-7B with Qwen3-32B reaches 47.2% accuracy, surpassing GPT-4o's 40.6%. The approach is lightweight, model-agnostic, and requires no additional training or architectural changes.

论文提出了一种训练-free 的分离认知推理与视觉感知的框架（DRP），以解决大型多模态模型（LMMs）中的视觉上下文退化问题。该框架允许强大的语言模型根据需要查询LMM以获取细粒度的视觉细节。实验表明，DRP 有效缓解了推理漂移并增强了视觉定位，使其在 MathVision 基准测试中的准确率达到 47.2%，超过了 GPT-4o 的 40.6%。该方法轻量级、模型无关且无需额外的训练或架构修改。
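
The Reasoner/Observer split described for DRP can be sketched as a small control loop in which a text-only reasoner either asks a visual question or commits to an answer. This is a hedged illustration, not the released DRP code. The function `run_decoupled_loop`, the `("ask", ...)` / `("answer", ...)` action format, and the stub models are all assumptions.

```python
from typing import Callable, List, Optional, Tuple

Note = Tuple[str, str]    # (visual query, observer reply)
Action = Tuple[str, str]  # ("ask", query) or ("answer", final text)


def run_decoupled_loop(
    question: str,
    reasoner: Callable[[str, List[Note]], Action],
    observer: Callable[[str], str],
    max_steps: int = 4,
) -> Tuple[Optional[str], List[Note]]:
    """Let the reasoner query the observer until it answers or runs out of steps."""
    notes: List[Note] = []
    for _ in range(max_steps):
        kind, payload = reasoner(question, notes)
        if kind == "answer":
            return payload, notes
        # The observer looks at the image again for each query, so visual
        # evidence is re-read instead of being recalled from earlier text.
        notes.append((payload, observer(payload)))
    return None, notes


# Stub models for illustration only; real use would wrap an LLM and an LMM.
def toy_reasoner(question: str, notes: List[Note]) -> Action:
    if not notes:
        return ("ask", "How many triangles are in the figure?")
    return ("answer", f"The figure contains {notes[-1][1]} triangles.")


def toy_observer(query: str) -> str:
    return "3"


answer, trace = run_decoupled_loop("Count the triangles.", toy_reasoner, toy_observer)
```

Keeping the observer behind a plain callable is what makes such a loop model-agnostic: either side can be swapped without retraining the other.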

3DrawAgent: Teaching LLM to Draw in 3D with Early Contrastive Experience

Authors: Hongcan Xiao, Xinyue Xiao, Yilin Wang, Yue Zhang, Yonggang Qi

Venue: CVPR 2026 Highlight

First: 2026-04-09T09:47:00+00:00 · Latest: 2026-04-09T09:47:00+00:00

Comments: CVPR 2026 Highlight

Abs · PDF · Code1 · Code2

Abstract

Sketching in 3D space enables expressive reasoning about shape, structure, and spatial relationships, yet generating 3D sketches through natural language remains a major challenge. In this work, we introduce 3DrawAgent, a training-free, language-driven framework for 3D sketch generation that leverages large language models (LLMs) to sequentially draw 3D Bezier curves under geometric feedback. Unlike prior 2D sketch agents, our method introduces a relative experience optimization strategy that adapts the recently proposed Group Reward Policy Optimization (GRPO) paradigm. Instead of relying on explicit ground-truth supervision, we construct pairwise comparisons among generated sketches, with each pair consisting of a relatively better and a worse result based on CLIP-based perceptual rewards and LLM-based fine-grained qualitative assessment. These experiences are then used to iteratively refine the prior knowledge of 3D drawing, enabling black-box reinforcement of the model's 3D awareness. This design allows our model to self-improve its spatial understanding and drawing quality without parameter updates. Experiments show that 3DrawAgent can generate complex and coherent 3D Bezier sketches from diverse textual prompts, exhibit emergent geometric reasoning, and generalize to novel shapes, establishing a new paradigm for advancing the field of training-free 3D sketch intelligence.

中文标题/摘要

标题：3DrawAgent：通过早期对比体验教学LLM在3D中绘画

在3D空间中绘制可以表达关于形状、结构和空间关系的推理，但通过自然语言生成3D草图仍然是一个重大挑战。本文介绍了一种无需训练、由语言驱动的3D草图生成框架3DrawAgent，该框架利用大型语言模型（LLMs）在几何反馈下顺序绘制3D贝塞尔曲线。与之前的2D草图代理不同，我们的方法引入了一种相对经验优化策略，该策略适应了最近提出的组奖励策略优化（GRPO）范式。我们不依赖于显式的地面真实监督，而是基于CLIP感知奖励和LLM细粒度的定性评估构建生成草图的成对比较，每对包括一个相对较好的结果和一个较差的结果。这些经验随后被用来逐步细化3D绘画的先验知识，使模型在不更新参数的情况下增强其3D意识。实验表明，3DrawAgent可以从多种文本提示中生成复杂且连贯的3D贝塞尔草图，表现出几何推理能力，并能够泛化到新的形状，为无训练3D草图智能的发展建立了一个新的范式。

Summary / 总结

3DrawAgent is a training-free framework that uses large language models to generate 3D sketches through sequential drawing of Bezier curves with geometric feedback. It employs a relative experience optimization strategy based on Group Reward Policy Optimization (GRPO) and constructs pairwise comparisons using CLIP-based perceptual rewards and LLM-based qualitative assessments. This approach enables the model to iteratively refine its 3D drawing skills without parameter updates, resulting in the generation of complex and coherent 3D sketches from various textual prompts and demonstrating emergent geometric reasoning and generalization to new shapes.

3DrawAgent 是一个无需训练的框架，利用大型语言模型通过绘制贝塞尔曲线生成 3D 草图，并结合几何反馈。它采用基于 Group Reward Policy Optimization (GRPO) 的相对经验优化策略，并使用 CLIP 基础的感知奖励和 LLM 基础的细微质性评估构建两两对比。这种方法使模型能够在不更新参数的情况下逐步提高其 3D 意识和绘图质量。实验表明，3DrawAgent 能够从各种文本提示生成复杂且连贯的 3D 贝塞尔草图，并展示出几何推理能力，展示了其在推进 3D 草图智能方面的潜力。
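
Since 3DrawAgent draws with 3D Bezier curves, a short sketch of how one such curve is evaluated may help. The abstract does not state the curve degree, so the cubic form (four control points) is an assumption, as are the function names `cubic_bezier_3d` and `sample_curve`.

```python
from typing import List, Sequence, Tuple

Point3 = Tuple[float, float, float]


def cubic_bezier_3d(
    p0: Sequence[float],
    p1: Sequence[float],
    p2: Sequence[float],
    p3: Sequence[float],
    t: float,
) -> Point3:
    """Evaluate a cubic Bezier curve at parameter t in [0, 1]."""
    s = 1.0 - t
    # Bernstein basis weights for degree 3; they sum to 1 for any t.
    w0, w1, w2, w3 = s ** 3, 3 * s * s * t, 3 * s * t * t, t ** 3
    x, y, z = (
        w0 * a + w1 * b + w2 * c + w3 * d for a, b, c, d in zip(p0, p1, p2, p3)
    )
    return (x, y, z)


def sample_curve(ctrl: Sequence[Sequence[float]], n: int = 16) -> List[Point3]:
    """Sample n >= 2 evenly spaced points along the curve, including both ends."""
    return [cubic_bezier_3d(*ctrl, i / (n - 1)) for i in range(n)]


# One stroke: starts at the origin and ends at (1, 0, 1).
stroke = [(0.0, 0.0, 0.0), (0.0, 1.0, 0.0), (1.0, 1.0, 0.0), (1.0, 0.0, 1.0)]
polyline = sample_curve(stroke, n=5)
```

A curve is fully described by its control points, so a language model only needs to emit a dozen numbers per stroke. That compactness is one plausible reason a text-only drawing interface is workable.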

LINE: LLM-based Iterative Neuron Explanations for Vision Models

Authors: Vladimir Zaigrajew, Michał Piechota, Gaspar Sekula, Przemysław Biecek

First: 2026-04-09T09:43:26+00:00 · Latest: 2026-04-09T09:43:26+00:00

Abs · PDF · Code1 · Code2

Abstract

Interpreting the concepts encoded by individual neurons in deep neural networks is a crucial step towards understanding their complex decision-making processes and ensuring AI safety. Despite recent progress in neuron labeling, existing methods often limit the search space to predefined concept vocabularies or produce overly specific descriptions that fail to capture higher-order, global concepts. We introduce LINE, a novel, training-free iterative approach tailored for open-vocabulary concept labeling in vision models. Operating in a strictly black-box setting, LINE leverages a large language model and a text-to-image generator to iteratively propose and refine concepts in a closed loop, guided by activation history. We demonstrate that LINE achieves state-of-the-art performance across multiple model architectures, yielding AUC improvements of up to 0.18 on ImageNet and 0.05 on Places365, while discovering, on average, 29% of new concepts missed by massive predefined vocabularies. Beyond identifying the top concept, LINE provides a complete generation history, which enables polysemanticity evaluation and produces supporting visual explanations that rival gradient-dependent activation maximization methods.

中文标题/摘要

标题：LINE：基于LLM的迭代神经元解释方法用于视觉模型

解释深度神经网络中单个神经元所编码的概念是理解其复杂决策过程和确保AI安全的关键步骤。尽管在神经元标记方面取得了进展，但现有方法往往将搜索空间限制在预定义的概念词汇表中，或者产生过于具体的描述，无法捕捉到高层次的全局概念。我们提出了LINE，这是一种新型的无需训练的迭代方法，专门用于视觉模型中的开放词汇概念标记。在严格黑盒设置下，LINE 利用大型语言模型和文本到图像生成器，通过激活历史迭代地提出和细化概念，形成闭环。我们证明，LINE 在多个模型架构上达到了最先进的性能，ImageNet 上 AUC 提高了 0.18，Places365 上提高了 0.05，同时平均发现了 29% 被大规模预定义词汇表遗漏的新概念。除了识别顶级概念外，LINE 还提供了一整套生成历史，这使得多义性评估成为可能，并生成了与梯度依赖激活最大化方法相媲美的支持视觉解释。

Summary / 总结

The research aims to interpret the concepts encoded by individual neurons in deep neural networks, a step toward understanding model decisions and ensuring AI safety. It introduces LINE, a training-free, black-box iterative method for open-vocabulary concept labeling in vision models. LINE uses a large language model and a text-to-image generator to propose and refine concepts in a closed loop guided by activation history. It achieves state-of-the-art performance across multiple architectures, with AUC improvements of up to 0.18 on ImageNet and 0.05 on Places365, while discovering, on average, 29% of new concepts missed by massive predefined vocabularies. Beyond identifying the top concept, LINE provides a complete generation history that enables polysemanticity evaluation and produces supporting visual explanations that rival gradient-dependent activation maximization methods.

LINE 是一种无需训练的迭代方法，用于在视觉模型中标记单个神经元，通过使用大型语言模型和文本到图像生成器在闭环中迭代提出和优化概念，解决了现有方法的局限性。它实现了最先进的性能，分别在 ImageNet 和 Places365 上将 AUC 提高了最多 0.18 和 0.05，并且比大规模预定义词汇表多发现 29% 的概念。LINE 还提供了完整的生成历史和与梯度依赖方法相媲美的视觉解释。

CodecSight: Leveraging Video Codec Signals for Efficient Streaming VLM Inference

Authors: Yulin Zou, Yan Chen, Wenyan Chen, JooYoung Park, Shivaraman Nitin, Luo Tao, Francisco Romero, Dmitrii Ustiugov

First: 2026-04-07T16:31:45+00:00 · Latest: 2026-04-09T09:40:36+00:00

Comments: 18 pages, 34 figures

Abs · PDF · Code1 · Code2

Abstract

Video streaming analytics is a crucial workload for vision-language model serving, but the high cost of multimodal inference limits scalability. Prior systems reduce inference cost by exploiting temporal and spatial redundancy in video streams, but they target either the vision transformer (ViT) or the LLM with a limited view, leaving end-to-end opportunities untapped. Moreover, existing methods incur significant overhead to identify redundancy, either through offline profiling and training or costly online computation, making them ill-suited for dynamic real-time streams. We present CodecSight, a codec-guided streaming video analytics system, built on a key observation that video codecs already extract the temporal and spatial structure of each stream as a byproduct of compression. CodecSight treats this codec metadata as a low-cost runtime signal to unify optimization across video decoding, visual processing, and LLM prefilling, with transmission reduction as an inherent benefit of operating directly on compressed bitstreams. This drives codec-guided patch pruning before ViT encoding and selective key-value cache refresh during LLM prefilling, both of which are fully online and do not require offline training. Experiments show that CodecSight achieves an improvement in throughput of up to 3$\times$, and a reduction of up to 87% in GPU compute over state-of-the-art baselines, maintaining competitive accuracy with only 0$\sim$8% F1 drop.

中文标题/摘要

标题：CodecSight：利用视频编解码信号进行高效的流媒体VLM推理

视频流媒体分析是视觉语言模型服务中的关键工作负载，但多模态推理的高成本限制了其可扩展性。先前的系统通过利用视频流中的时间和空间冗余来降低推理成本，但它们仅针对视觉变换器（ViT）或有限视角的LLM，错过了端到端的机会。此外，现有方法在识别冗余方面产生了显著的开销，要么通过离线配置和训练，要么通过昂贵的在线计算，这使得它们不适合动态实时流媒体。我们提出了CodecSight，这是一种基于编解码器指导的流媒体视频分析系统，基于一个关键观察，即视频编解码器在压缩过程中已经提取了每个流的时间和空间结构。CodecSight 将这种编解码器元数据视为低成本的运行时信号，以统一视频解码、视觉处理和LLM预填充的优化，直接操作压缩位流是其固有的好处。这驱动了在ViT编码之前基于编解码器指导的补丁修剪，并在LLM预填充期间选择性地刷新关键值缓存，两者都是完全在线的，不需要离线训练。实验表明，CodecSight 的吞吐量提高了3倍，GPU计算减少了87%，并且仅以0-8%的F1分数下降保持了竞争力。

Summary / 总结

CodecSight leverages video codec signals to reduce the cost of multimodal inference in streaming video analytics. By treating codec metadata as a low-cost runtime signal, it unifies optimization across video decoding, visual processing, and LLM prefilling through codec-guided patch pruning before ViT encoding and selective key-value cache refresh during prefilling. Compared to state-of-the-art baselines, it improves throughput by up to 3x and reduces GPU compute by up to 87%, with an F1 drop of only 0-8%.

CodecSight 利用视频编解码器信号来降低视频流分析中视觉-语言模型推理的成本。它使用编解码器元数据来统一优化视频解码、视觉处理和LLM预填充，相比最先进的方法，实现了最多3倍的吞吐量提升和87%的GPU计算量减少，同时保持了可接受的准确性，仅损失0-8%的F1分数。
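
The idea of codec-guided patch pruning can be sketched as a filter that re-encodes only the patches the codec marks as changed. This is an illustration under assumptions, not CodecSight's algorithm. The abstract does not give the selection criteria, so the per-patch motion magnitude and residual energy inputs, the thresholds, and the function name `select_patches` are all hypothetical.

```python
from typing import List


def select_patches(
    frame_type: str,
    motion_mag: List[float],
    residual_energy: List[float],
    mv_thresh: float = 1.0,
    res_thresh: float = 0.1,
) -> List[int]:
    """Return indices of patches to send through the vision encoder."""
    if frame_type == "I":
        # Intra-coded frames carry no inter-frame prediction, so refresh everything.
        return list(range(len(motion_mag)))
    # For predicted frames, keep a patch if it moved or its residual is large.
    return [
        i
        for i, (m, r) in enumerate(zip(motion_mag, residual_energy))
        if m > mv_thresh or r > res_thresh
    ]


# A predicted frame where patch 1 moved and patch 2 changed in appearance.
keep = select_patches("P", motion_mag=[0.0, 2.0, 0.5], residual_energy=[0.0, 0.0, 0.5])
```

Patches that are not selected could reuse cached features from the previous frame. The same index list could in principle also decide which key-value cache entries to refresh during prefilling.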
