When Numbers Speak: Aligning Textual Numerals and Visual Instances in Text-to-Video Diffusion Models
Authors: Zhengyang Sun, Yu Chen, Xin Zhou, Xiaofan Li, Xiwu Chen, Dingkang Liang, Xiang Bai
Venue: CVPR 2026
First: 2026-04-09T17:59:57+00:00 · Latest: 2026-04-09T17:59:57+00:00
Comments: Accepted by CVPR 2026. Project page: https://h-embodvis.github.io/NUMINA
Abstract
Text-to-video diffusion models have enabled open-ended video synthesis, but often struggle with generating the correct number of objects specified in a prompt. We introduce NUMINA, a training-free identify-then-guide framework for improved numerical alignment. NUMINA identifies prompt-layout inconsistencies by selecting discriminative self- and cross-attention heads to derive a countable latent layout. It then refines this layout conservatively and modulates cross-attention to guide regeneration. On the introduced CountBench, NUMINA improves counting accuracy by up to 7.4% on Wan2.1-1.3B, and by 4.9% and 5.5% on 5B and 14B models, respectively. Furthermore, CLIP alignment is improved while maintaining temporal consistency. These results demonstrate that structural guidance complements seed search and prompt enhancement, offering a practical path toward count-accurate text-to-video diffusion. The code is available at https://github.com/H-EmbodVis/NUMINA.
中文标题/摘要
标题:数字发声:文本数字与视觉实例在文本到视频扩散模型中的对齐
文本到视频扩散模型已实现开放式的视频合成,但常难以生成提示中指定数量的对象。我们提出NUMINA,一种无需训练的识别后再引导框架,以提高数字对齐效果。NUMINA通过选择区分性的自注意力和跨注意力头来识别提示布局不一致,从而推导出可计数的潜在布局。然后,它保守地细化此布局,并调节跨注意力以引导再生。在引入的CountBench上,NUMINA在Wan2.1-1.3B模型上将计数准确性提高了7.4%,在5B和14B模型上分别提高了4.9%和5.5%。此外,CLIP对齐得到改善,同时保持时间一致性。这些结果表明,结构引导补充了种子搜索和提示增强,提供了一条通往计数准确的文本到视频扩散的实用路径。代码可在https://github.com/H-EmbodVis/NUMINA获取。
ParseBench: A Document Parsing Benchmark for AI Agents
Authors: Boyang Zhang, Sebastián G. Acosta, Preston Carlson, Sacha Bron, Pierre-Loïc Doulcet, Simon Suo
First: 2026-04-09T17:59:36+00:00 · Latest: 2026-04-09T17:59:36+00:00
Abstract
AI agents are changing the requirements for document parsing. What matters is semantic correctness: parsed output must preserve the structure and meaning needed for autonomous decisions, including correct table structure, precise chart data, semantically meaningful formatting, and visual grounding. Existing benchmarks do not fully capture this setting for enterprise automation, relying on narrow document distributions and text-similarity metrics that miss agent-critical failures. We introduce ParseBench, a benchmark of ~2,000 human-verified pages from enterprise documents spanning insurance, finance, and government, organized around five capability dimensions: tables, charts, content faithfulness, semantic formatting, and visual grounding. Across 14 methods spanning vision-language models, specialized document parsers, and LlamaParse, the benchmark reveals a fragmented capability landscape: no method is consistently strong across all five dimensions. LlamaParse Agentic achieves the highest overall score at \agenticoverall\%, and the benchmark highlights the remaining capability gaps across current systems. Dataset and evaluation code are available on HuggingFace (https://huggingface.co/datasets/llamaindex/ParseBench) and GitHub (https://github.com/run-llama/ParseBench).
中文标题/摘要
标题:ParseBench:AI代理的文档解析基准
AI代理正在改变文档解析的要求。关键在于语义正确性:解析输出必须保留用于自主决策所需的结构和意义,包括正确的表格结构、精确的图表数据、语义相关的格式和视觉定位。现有基准未能充分捕捉企业自动化中的这一设置,依赖于狭窄的文档分布和文本相似性度量,这些度量忽略了代理关键的失败。我们引入了ParseBench,一个包含约2000个企业文档中的人工验证页面的基准,这些文档涵盖了保险、金融和政府领域,并围绕五个能力维度组织:表格、图表、内容忠实度、语义格式和视觉定位。在涵盖视觉语言模型、专门的文档解析器和LlamaParse在内的14种方法中,基准揭示了一个碎片化的能力格局:没有一种方法在所有五个维度上都表现出色。LlamaParse Agentic在综合得分上最高,达到\agenticoverall\%,而基准也突显了当前系统中的能力差距。数据集和评估代码可在HuggingFace(https://huggingface.co/datasets/llamaindex/ParseBench)和GitHub(https://github.com/run-llama/ParseBench)上获取。
Summary / 总结
The research aims to address the need for semantic correctness in document parsing for AI agents, focusing on preserving structure, meaning, and visual elements. The study introduces ParseBench, a benchmark of about 2,000 human-verified pages from enterprise documents covering the insurance, finance, and government sectors. It evaluates 14 methods, including vision-language models, specialized parsers, and LlamaParse, across five dimensions: tables, charts, content faithfulness, semantic formatting, and visual grounding. The results show that no single method is consistently strong in all dimensions. LlamaParse Agentic achieves the highest overall score, and the benchmark highlights the capability gaps that remain in current systems.
研究旨在评估AI代理在文档解析方面的语义正确性能力,包括保持表格结构、图表数据、语义格式和视觉定位。研究引入了ParseBench,包含约2,000个来自保险、金融和政府领域企业文档的人工验证页面。在包括视觉-语言模型和专门解析器在内的14种方法中,基准测试揭示了没有单一方法在所有五个能力维度上都表现出色。LlamaParse Agentic获得了最高的整体分数,但基准测试突显了当前系统中仍然存在的能力差距。
Meta-learning In-Context Enables Training-Free Cross Subject Brain Decoding
Authors: Mu Nan, Muquan Yu, Weijian Mai, Jacob S. Prince, Hossein Adeli, Rui Zhang, Jiahang Cao, Benjamin Becker, John A. Pyles, Margaret M. Henderson, Chunfeng Song, Nikolaus Kriegeskorte, Michael J. Tarr, Xiaoqing Hu, Andrew F. Luo
Venue: CVPR 2026
First: 2026-04-09T17:59:32+00:00 · Latest: 2026-04-09T17:59:32+00:00
Comments: Accepted to CVPR 2026, website: https://github.com/ezacngm/brainCodec
Abstract
Visual decoding from brain signals is a key challenge at the intersection of computer vision and neuroscience, requiring methods that bridge neural representations and computational models of vision. A field-wide goal is to achieve generalizable, cross-subject models. A major obstacle towards this goal is the substantial variability in neural representations across individuals, which has so far required training bespoke models or fine-tuning separately for each subject. To address this challenge, we introduce a meta-optimized approach for semantic visual decoding from fMRI that generalizes to novel subjects without any fine-tuning. By simply conditioning on a small set of image-brain activation examples from the new individual, our model rapidly infers their unique neural encoding patterns to facilitate robust and efficient visual decoding. Our approach is explicitly optimized for in-context learning of the new subject's encoding model and performs decoding by hierarchical inference, inverting the encoder. First, for multiple brain regions, we estimate the per-voxel visual response encoder parameters by constructing a context over multiple stimuli and responses. Second, we construct a context consisting of encoder parameters and response values over multiple voxels to perform aggregated functional inversion. We demonstrate strong cross-subject and cross-scanner generalization across diverse visual backbones without retraining or fine-tuning. Moreover, our approach requires neither anatomical alignment nor stimulus overlap. This work is a critical step towards a generalizable foundation model for non-invasive brain decoding.
中文标题/摘要
标题:元学习使上下文内学习能够在无需训练的情况下实现跨个体脑解码
从脑信号进行视觉解码是计算机视觉与神经科学交叉领域的一个关键挑战,需要能够连接神经表示和视觉计算模型的方法。一个领域内的目标是实现可泛化的跨个体模型。这一目标的主要障碍是神经表示在个体间存在显著差异,这要求训练定制模型或为每个个体分别进行微调。为解决这一挑战,我们提出了一种用于fMRI的语义视觉解码的元优化方法,该方法能够在无需任何微调的情况下泛化到新的个体。通过仅对新个体的一小组图像-脑激活示例进行条件化,我们的模型能够快速推断出其独特的神经编码模式,从而实现稳健且高效的视觉解码。该方法明确针对新个体的编码模型的上下文学习进行了优化,并通过分层推理进行解码,逆向推导编码器。首先,对于多个脑区,我们通过构建多个刺激和响应的上下文来估计每个体素的视觉响应编码参数。其次,我们构建一个由多个体素的编码参数和响应值组成的上下文,以执行聚合功能逆向推导。我们展示了在多种视觉骨干网络上具有强大的跨个体和跨扫描仪泛化能力,无需重新训练或微调。此外,我们的方法不需要解剖对齐,也不需要刺激重叠。这项工作是实现非侵入性脑解码通用基础模型的重要一步。
Summary / 总结
This paper addresses the challenge of visual decoding from brain signals by introducing a meta-optimized approach that generalizes to novel subjects without fine-tuning. The method conditions on a small set of image-brain activation examples to infer unique neural encoding patterns, enabling robust and efficient visual decoding. The approach demonstrates strong cross-subject and cross-scanner generalization across various visual backbones, requiring no retraining or fine-tuning and no anatomical alignment or stimulus overlap.
该论文解决了不同个体之间脑信号的视觉解码问题,这是计算机视觉与神经科学交叉领域的一个关键挑战。作者提出了一种元优化方法,可以在无需微调的情况下泛化到新个体。通过条件化一小部分图像-脑激活示例,该模型可以推断出独特的神经编码模式,以实现稳健的视觉解码。该方法在无需重新训练或微调的情况下,展示了强大的跨个体和跨扫描仪泛化能力,且不需要解剖对齐或刺激重叠,标志着向非侵入性脑解码的一般基础模型迈出的重要一步。
What They Saw, Not Just Where They Looked: Semantic Scanpath Similarity via VLMs and NLP metric
Authors: Mohamed Amine Kerkouri, Marouane Tliba, Bin Wang, Aladine Chetouani, Ulas Bagci, Alessandro Bruno
First: 2026-04-09T17:36:22+00:00 · Latest: 2026-04-09T17:36:22+00:00
Comments: Accepted at ETRA 2026 GenAI workshop
Abstract
Scanpath similarity metrics are central to eye-movement research, yet existing methods predominantly evaluate spatial and temporal alignment while neglecting semantic equivalence between attended image regions. We present a semantic scanpath similarity framework that integrates vision-language models (VLMs) into eye-tracking analysis. Each fixation is encoded under controlled visual context (patch-based and marker-based strategies) and transformed into concise textual descriptions, which are aggregated into scanpath-level representations. Semantic similarity is then computed using embedding-based and lexical NLP metrics and compared against established spatial measures, including MultiMatch and DTW. Experiments on free-viewing eye-tracking data demonstrate that semantic similarity captures partially independent variance from geometric alignment, revealing cases of high content agreement despite spatial divergence. We further analyze the impact of contextual encoding on description fidelity and metric stability. Our findings suggest that multimodal foundation models enable interpretable, content-aware extensions of classical scanpath analysis, providing a complementary dimension for gaze research within the ETRA community.
中文标题/摘要
标题:他们所见,而不仅仅是注视的位置:通过VLMs和NLP度量的语义扫描路径相似性
扫描路径相似性度量是眼动研究的核心,但现有方法主要评估空间和时间对齐,而忽视了被注视图像区域之间的语义等价性。我们提出了一种语义扫描路径相似性框架,将视觉语言模型(VLMs)整合到眼动追踪分析中。每个注视点在受控视觉上下文中(基于块和标记策略)进行编码,并转换为简洁的文本描述,然后聚合为扫描路径级表示。语义相似性使用嵌入式和词法NLP度量进行计算,并与多匹配和DTW等现有空间度量进行比较。对自由观看眼动追踪数据的实验表明,语义相似性部分独立于几何对齐,揭示了尽管空间上存在差异但内容高度一致的情况。我们进一步分析了上下文编码对描述准确性和度量稳定性的影响。我们的研究结果表明,多模态基础模型使经典的扫描路径分析具有可解释的内容感知扩展,为ETRA社区的眼动研究提供了补充维度。
Summary / 总结
The research aims to improve scanpath similarity metrics by incorporating semantic analysis, which is often overlooked in existing methods. The study uses vision-language models to encode each fixation with controlled visual context, transforming fixations into textual descriptions and aggregating them into scanpath-level representations. Experiments show that semantic similarity captures additional variance not explained by geometric alignment, highlighting cases where content agreement exists despite spatial differences. The work also explores the impact of different contextual encoding strategies on description accuracy and metric stability, suggesting that multimodal models can provide a more content-aware analysis of eye movements.
研究旨在通过引入语义分析来改进扫描路径相似性度量,而现有方法往往忽略了这一点。研究使用视觉语言模型以受控的视觉上下文来编码每个注视点,并将其转化为文本描述,然后将这些描述聚合为扫描路径级别的表示。实验表明,语义相似性能够捕捉到几何对齐无法解释的额外差异,揭示了即使存在空间差异,内容也存在一致性的案例。研究还探讨了不同上下文编码策略对描述准确性和度量稳定性的影响,表明多模态模型可以提供一种更具有内容意识的眼动分析方法。
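The contrast the scanpath abstract draws between geometric and semantic similarity can be illustrated with a minimal sketch. It assumes fixations are (x, y) pixel coordinates and that each fixation already has a short text description. DTW is named in the abstract, but the Jaccard token overlap used here is only a stand-in for the unspecified "lexical NLP metrics", and the function names `dtw_distance` and `jaccard_similarity` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def dtw_distance(path_a: Sequence[Point], path_b: Sequence[Point]) -> float:
    """Classic dynamic time warping cost between two fixation sequences."""
    n, m = len(path_a), len(path_b)
    inf = float("inf")
    # cost[i][j] = minimal cumulative cost aligning the first i and j fixations
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = math.dist(path_a[i - 1], path_b[j - 1])  # Euclidean distance
            cost[i][j] = step + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def jaccard_similarity(desc_a: List[str], desc_b: List[str]) -> float:
    """Token-set overlap between two scanpath-level description lists."""
    tokens_a = {w for d in desc_a for w in d.lower().split()}
    tokens_b = {w for d in desc_b for w in d.lower().split()}
    if not tokens_a and not tokens_b:
        return 1.0  # two empty descriptions are treated as identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


if __name__ == "__main__":
    # Two viewers look at different places but describe the same content.
    viewer_1 = [(100.0, 100.0), (300.0, 120.0)]
    viewer_2 = [(500.0, 400.0), (700.0, 420.0)]
    print("DTW cost:", dtw_distance(viewer_1, viewer_2))
    print("Jaccard:", jaccard_similarity(["red car", "street sign"],
                                         ["street sign", "red car"]))
```

A large DTW cost together with a high token overlap is the kind of case the abstract describes as content agreement despite spatial divergence. An embedding-based metric would replace `jaccard_similarity` without changing the comparison.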
LAMP: Lift Image-Editing as General 3D Priors for Open-world Manipulation
Authors: Jingjing Wang, Zhengdong Hong, Chong Bao, Yuke Zhu, Junhan Sun, Guofeng Zhang
First: 2026-04-09T17:14:00+00:00 · Latest: 2026-04-09T17:14:00+00:00
Abstract
Human-like generalization in open-world settings remains a fundamental challenge for robotic manipulation. Existing learning-based methods, including reinforcement learning, imitation learning, and vision-language-action models (VLAs), often struggle with novel tasks and unseen environments. Another promising direction is to explore generalizable representations that capture fine-grained spatial and geometric relations for open-world manipulation. While large language models (LLMs) and vision-language models (VLMs) provide strong semantic reasoning based on language or annotated 2D representations, their limited 3D awareness restricts their applicability to fine-grained manipulation. To address this, we propose LAMP, which lifts image-editing as 3D priors to extract inter-object 3D transformations as continuous, geometry-aware representations. Our key insight is that image-editing inherently encodes rich 2D spatial cues, and lifting these implicit cues into 3D transformations provides fine-grained and accurate guidance for open-world manipulation. Extensive experiments demonstrate that LAMP delivers precise 3D transformations and achieves strong zero-shot generalization in open-world manipulation. Project page: https://zju3dv.github.io/LAMP/.
中文标题/摘要
标题:LAMP: 将图像编辑提升为开放世界操纵的一般3D先验
在开放世界中实现类人的泛化仍然是机器人操纵中的一个基本挑战。现有的基于学习的方法,包括强化学习、模仿学习和视觉-语言-动作模型(VLAs),往往难以应对新的任务和未见过的环境。另一个有前景的方向是探索能够捕捉开放世界操纵中精细的空间和几何关系的一般化表示。虽然大型语言模型(LLMs)和视觉-语言模型(VLMs)提供了基于语言或标注的2D表示的强语义推理,但它们有限的3D意识限制了它们在精细操纵中的应用。为了解决这个问题,我们提出了LAMP,它将图像编辑提升为3D先验,以提取物体间的3D变换作为连续的、几何感知的表示。我们的核心见解是,图像编辑本质上编码了丰富的2D空间线索,将这些隐含的线索提升到3D变换中为开放世界操纵提供了精细和准确的指导。广泛的实验表明,LAMP 提供了精确的3D变换,并在开放世界操纵中实现了强大的零样本泛化。项目页面:https://zju3dv.github.io/LAMP/
Summary / 总结
The paper addresses the challenge of human-like generalization in open-world robotic manipulation. It proposes LAMP, which uses image-editing as 3D priors to extract continuous, geometry-aware representations of inter-object 3D transformations. Experiments show that LAMP provides precise 3D transformations and strong zero-shot generalization in open-world manipulation tasks.
论文解决了开放世界中类人的机器人操作泛化问题。提出LAMP,利用图像编辑作为3D先验提取连续的几何感知的物体间3D变换表示。实验表明,LAMP能够提供精确的3D变换,并在开放世界操作任务中实现强大的零样本泛化能力。
OVS-DINO: Open-Vocabulary Segmentation via Structure-Aligned SAM-DINO with Language Guidance
Authors: Haoxi Zeng, Qiankun Liu, Yi Bin, Haiyue Zhang, Yujuan Ding, Guoqing Wang, Deqiang Ouyang, Heng Tao Shen
First: 2026-04-09T16:57:11+00:00 · Latest: 2026-04-09T16:57:11+00:00
Comments: 14 pages, 12 figures, 5 tables
Abstract
Open-Vocabulary Segmentation (OVS) aims to segment image regions beyond predefined category sets by leveraging semantic descriptions. While CLIP based approaches excel in semantic generalization, they frequently lack the fine-grained spatial awareness required for dense prediction. Recent efforts have incorporated Vision Foundation Models (VFMs) like DINO to alleviate these limitations. However, these methods still struggle with the precise edge perception necessary for high fidelity segmentation. In this paper, we analyze internal representations of DINO and discover that its inherent boundary awareness is not absent but rather undergoes progressive attenuation as features transition into deeper transformer blocks. To address this, we propose OVS-DINO, a novel framework that revitalizes latent edge-sensitivity of DINO through structural alignment with the Segment Anything Model (SAM). Specifically, we introduce a Structure-Aware Encoder (SAE) and a Structure-Modulated Decoder (SMD) to effectively activate boundary features of DINO using SAM's structural priors, complemented by a supervision strategy utilizing SAM generated pseudo-masks. Extensive experiments demonstrate that our method achieves state-of-the-art performance across multiple weakly-supervised OVS benchmarks, improving the average score by 2.1% (from 44.8% to 46.9%). Notably, our approach significantly enhances segmentation accuracy in complex, cluttered scenarios, with a gain of 6.3% on Cityscapes (from 36.6% to 42.9%).
中文标题/摘要
标题:OVS-DINO:通过结构对齐SAM-DINO实现开放词汇分割并辅以语言指导
开放词汇分割(OVS)旨在利用语义描述超越预定义类别集对图像区域进行分割。虽然基于CLIP的方法在语义泛化方面表现出色,但在密集预测所需的精细空间意识方面往往不足。最近的努力已经将视觉基础模型(VFMs)如DINO纳入其中,以缓解这些限制。然而,这些方法仍然难以应对高保真分割所需的精确边缘感知。在本文中,我们分析了DINO的内部表示,并发现其固有的边界意识并非不存在,而是随着特征过渡到更深的变压器块而逐渐衰减。为了解决这一问题,我们提出了OVS-DINO,这是一种新颖的框架,通过结构对齐SAM实现DINO的潜在边缘敏感性。具体来说,我们引入了结构感知编码器(SAE)和结构调制解码器(SMD),利用SAM的结构先验有效激活DINO的边界特征,并辅以利用SAM生成的伪掩码的监督策略。广泛的实验表明,我们的方法在多个弱监督OVS基准测试中实现了最先进的性能,平均得分提高了2.1%(从44.8%提高到46.9%)。值得注意的是,我们的方法在复杂、拥挤的场景中显著提高了分割精度,在Cityscapes上提高了6.3%(从36.6%提高到42.9%)。
Summary / 总结
OVS-DINO addresses the limitations of existing Open-Vocabulary Segmentation methods by revitalizing DINO's latent edge-sensitivity through structural alignment with SAM, resulting in state-of-the-art performance across multiple benchmarks. The method introduces a Structure-Aware Encoder and a Structure-Modulated Decoder to activate boundary features of DINO using SAM's structural priors, and achieves a 2.1% improvement in average score, with a 6.3% gain on Cityscapes in complex, cluttered scenarios.
OVS-DINO通过将DINO与SAM的结构先验进行结构对齐,来增强其边界感知能力。该框架包括结构感知编码器和结构调制解码器,提升了DINO的细粒度空间感知。实验结果显示,OVS-DINO在多个弱监督OVS基准测试中达到了最先进的性能,平均得分提高了2.1%,在Cityscapes上的复杂场景分割准确性提高了6.3%。
CrashSight: A Phase-Aware, Infrastructure-Centric Video Benchmark for Traffic Crash Scene Understanding and Reasoning
Authors: Rui Gan, Junyi Ma, Pei Li, Xingyou Yang, Kai Chen, Sikai Chen, Bin Ran
First: 2026-04-09T16:52:04+00:00 · Latest: 2026-04-09T16:52:04+00:00
Abstract
Cooperative autonomous driving requires traffic scene understanding from both vehicle and infrastructure perspectives. While vision-language models (VLMs) show strong general reasoning capabilities, their performance in safety-critical traffic scenarios remains insufficiently evaluated due to the ego-vehicle focus of existing benchmarks. To bridge this gap, we present CrashSight, a large-scale vision-language benchmark for roadway crash understanding using real-world roadside camera data. The dataset comprises 250 crash videos, annotated with 13K multiple-choice question-answer pairs organized under a two-tier taxonomy. Tier 1 evaluates the visual grounding of scene context and involved parties, while Tier 2 probes higher-level reasoning, including crash mechanics, causal attribution, temporal progression, and post-crash outcomes. We benchmark 8 state-of-the-art VLMs and show that, despite strong scene description capabilities, current models struggle with temporal and causal reasoning in safety-critical scenarios. We provide a detailed analysis of failure scenarios and discuss directions for improving VLM crash understanding. The benchmark provides a standardized evaluation framework for infrastructure-assisted perception in cooperative autonomous driving. The CrashSight benchmark, including the full dataset and code, is accessible at https://mcgrche.github.io/crashsight.
中文标题/摘要
标题:CrashSight:一种面向阶段的基础设施中心视频基准,用于交通事故现场理解与推理
合作自动驾驶需要从车辆和基础设施视角理解交通场景。尽管视觉语言模型(VLMs)展示了强大的通用推理能力,但由于现有基准的以自我车辆为中心,它们在安全关键交通场景中的性能仍缺乏充分评估。为弥合这一差距,我们提出了CrashSight,一种使用真实道路旁摄像头数据进行道路事故理解的大规模视觉语言基准。该数据集包含250个事故视频,并用13K个多选题-答案对进行了标注,分为两层分类体系。第一层评估场景背景和涉事方的视觉定位,而第二层则探索更高层次的推理,包括事故机制、因果归因、时间进程以及事故后的结果。我们对8个最先进的VLMs进行了基准测试,并表明,尽管这些模型在场景描述方面表现出色,但在安全关键场景中的时间与因果推理方面仍存在困难。我们详细分析了失败场景,并讨论了改进VLM事故理解的方向。该基准提供了一种标准化的评估框架,用于合作自动驾驶中的基础设施辅助感知。CrashSight基准,包括完整数据集和代码,可在https://mcgrche.github.io/crashsight/获取。
Summary / 总结
CrashSight is a large-scale vision-language benchmark for understanding roadway crashes from the infrastructure perspective, built on real-world roadside camera data. It includes 250 crash videos annotated with 13K multiple-choice question-answer pairs, organized into two tiers that evaluate visual grounding and higher-level reasoning. The benchmark evaluates 8 state-of-the-art vision-language models and finds that they struggle with temporal and causal reasoning in safety-critical scenarios, highlighting the need for improved crash understanding capabilities.
CrashSight 是一个基于真实路侧摄像头数据、从基础设施视角理解交通事故场景的大规模视觉语言基准。它包含250个事故视频和13K个多项选择问答对,分为两层来评估视觉定位和高层次推理。基准测试评估了8个最先进的视觉语言模型,并发现它们在安全关键场景中的时间推理和因果推理方面存在困难,突显了改进的必要性。该基准提供了一个标准化的评估框架,用于合作自动驾驶中的基础设施辅助感知。
Entropy-Gradient Grounding: Training-Free Evidence Retrieval in Vision-Language Models
Authors: Marcel Gröpl, Jaewoo Jung, Seungryong Kim, Marc Pollefeys, Sunghwan Hong
First: 2026-04-09T16:51:42+00:00 · Latest: 2026-04-09T16:51:42+00:00
Comments: Project Page : https://entropy-gradient-grounding.github.io/
Abstract
Despite rapid progress, pretrained vision-language models still struggle when answers depend on tiny visual details or on combining clues spread across multiple regions, as in documents and compositional queries. We address this by framing grounding as test-time evidence retrieval: given a query, the model should actively identify where to look next to resolve ambiguity. To this end, we propose a training-free, model-intrinsic grounding method that uses uncertainty as supervision. Specifically, we compute the entropy of the model's next-token distribution and backpropagate it to the visual token embeddings to obtain an entropy-gradient relevance map, without auxiliary detectors or attention-map heuristics. We then extract and rank multiple coherent regions to support multi-evidence queries, and introduce an iterative zoom-and-reground procedure with a spatial-entropy stopping rule to avoid over-refinement. Experiments on seven benchmarks across four VLM architectures demonstrate consistent improvements over existing methods, with the largest gains on detail-critical and high-resolution settings, while also producing more interpretable evidence localizations.
中文标题/摘要
标题:熵-梯度定位:无需训练的视觉-语言模型证据检索
尽管取得了快速进展,预训练的视觉-语言模型在依赖于微小视觉细节或需要结合分布在多个区域的线索时,仍然难以应对,尤其是在文档和组合查询中。我们通过将定位问题重新定义为测试时的证据检索来解决这一问题:给定一个查询,模型应该积极地识别下一步需要查看的位置以解决歧义。为此,我们提出了一种无需训练、模型内在的定位方法,使用不确定性作为监督。具体来说,我们计算模型的下一个标记分布的熵,并将其反向传播到视觉标记嵌入中,以获得一个熵-梯度相关性图,而无需辅助检测器或注意力图启发式方法。然后,我们提取并排序多个连贯区域以支持多证据查询,并引入一种迭代的缩放和重新定位过程,带有空间熵停止规则,以避免过度细化。在四个VLM架构的七个基准测试上进行的实验表明,该方法在现有方法上具有持续改进,最大的改进出现在细节关键和高分辨率设置中,同时生成了更可解释的证据定位。
Summary / 总结
This paper addresses the challenge of pretrained vision-language models in handling queries that require attention to small visual details or combining clues from multiple regions. It proposes a training-free grounding method called Entropy-Gradient Grounding, which uses the model's uncertainty (measured by entropy) to guide the retrieval of relevant visual evidence. The method computes an entropy-gradient relevance map and ranks multiple coherent regions to support complex queries, leading to consistent improvements across various benchmarks and architectures, especially in detail-critical scenarios.
研究针对预训练的视觉-语言模型在处理需要小视觉细节或结合多个区域线索的查询时遇到的挑战。提出了一种基于熵梯度的定位方法,利用模型的不确定性作为监督来检索相关视觉证据,无需额外训练或启发式方法。该方法通过计算下一个标记分布的熵来生成熵梯度相关图,提取多个一致区域,并通过迭代放大和重新定位过程避免过度细化。实验结果显示在各种基准测试中的一致改进,特别是在细节关键和高分辨率设置中表现尤为突出,并生成了更具解释性的证据定位。
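The first step the Entropy-Gradient Grounding abstract describes, computing the entropy of the next-token distribution and differentiating it, can be worked through in closed form. The sketch below is not the authors' implementation: it stops at the logits rather than backpropagating to visual token embeddings (which would need the model itself), and the function names are hypothetical. For softmax probabilities p and entropy H = -Σ p_k log p_k, the gradient with respect to logit z_k is -p_k (log p_k + H).

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(logits: List[float]) -> float:
    """Shannon entropy (in nats) of the softmax distribution."""
    probs = softmax(logits)
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_grad(logits: List[float]) -> List[float]:
    """Analytic gradient dH/dz_k = -p_k * (log p_k + H)."""
    probs = softmax(logits)
    h = -sum(p * math.log(p) for p in probs if p > 0.0)
    return [-p * (math.log(p) + h) if p > 0.0 else 0.0 for p in probs]


if __name__ == "__main__":
    next_token_logits = [2.0, 0.5, -1.0]  # toy values, not from any model
    print("entropy:", entropy(next_token_logits))
    print("gradient:", entropy_grad(next_token_logits))
```

In a real vision-language model the same quantity would be propagated further by automatic differentiation, through the network to each visual token embedding, and the per-token gradient magnitudes would form the relevance map. That part is left out here because it depends on the specific architecture.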
BLaDA: Bridging Language to Functional Dexterous Actions within 3DGS Fields
Authors: Fan Yang, Wenrui Chen, Guorun Yan, Ruize Liao, Wanjun Jia, Dongsheng Luo, Kailun Yang, Zhiyong Li, Yaonan Wang
First: 2026-04-09T16:10:20+00:00 · Latest: 2026-04-09T16:10:20+00:00
Comments: Code will be publicly available at https://github.com/PopeyePxx/BLaDA
Abstract
In unstructured environments, functional dexterous grasping calls for the tight integration of semantic understanding, precise 3D functional localization, and physically interpretable execution. Modular hierarchical methods are more controllable and interpretable than end-to-end VLA approaches, but existing ones still rely on predefined affordance labels and lack the tight semantic-pose coupling needed for functional dexterous manipulation. To address this, we propose BLaDA (Bridging Language to Dexterous Actions in 3DGS fields), an interpretable zero-shot framework that grounds open-vocabulary instructions as perceptual and control constraints for functional dexterous manipulation. BLaDA establishes an interpretable reasoning chain by first parsing natural language into a structured sextuple of manipulation constraints via a Knowledge-guided Language Parsing (KLP) module. To achieve pose-consistent spatial reasoning, we introduce the Triangular Functional Point Localization (TriLocation) module, which utilizes 3D Gaussian Splatting as a continuous scene representation and identifies functional regions under triangular geometric constraints. Finally, the 3D Keypoint Grasp Matrix Transformation Execution (KGT3D+) module decodes these semantic-geometric constraints into physically plausible wrist poses and finger-level commands. Extensive experiments on complex benchmarks demonstrate that BLaDA significantly outperforms existing methods in both affordance grounding precision and the success rate of functional manipulation across diverse categories and tasks. Code will be publicly available at https://github.com/PopeyePxx/BLaDA.
中文标题/摘要
标题:BLaDA:在3DGS领域内将语言与功能性灵巧动作相结合
在非结构化环境中,功能性灵巧抓取需要语义理解、精确的3D功能定位和物理可解释的执行的紧密集成。模块化分层方法比端到端的VLA方法更可控和可解释,但现有的方法仍然依赖于预定义的功能性使用标签,缺乏功能性灵巧操作所需的语义-姿态耦合。为了解决这个问题,我们提出了BLaDA(在3DGS领域内将语言与灵巧动作相结合),这是一种可解释的零样本框架,将开放词汇指令作为功能性灵巧操作的感知和控制约束。BLaDA通过知识引导的语言解析(KLP)模块将自然语言解析为操作约束的结构化六元组,建立了可解释的推理链。为了实现姿态一致的空间推理,我们引入了三角功能点定位(TriLocation)模块,该模块利用3D高斯点积作为连续场景表示,并在三角几何约束下识别功能性区域。最后,3D关键点抓取矩阵变换执行(KGT3D+)模块将这些语义-几何约束解码为物理上合理的手腕姿态和手指级命令。在复杂基准上的广泛实验表明,BLaDA在功能性操作的精确功能性使用标签接地和不同类别及任务的成功率方面显著优于现有方法。代码将在https://github.com/PopeyePxx/BLaDA公开。
Summary / 总结
BLaDA is an interpretable zero-shot framework that integrates natural language instructions into perceptual and control constraints for functional dexterous manipulation. It uses a Knowledge-guided Language Parsing module to parse natural language into structured manipulation constraints, a Triangular Functional Point Localization module to identify functional regions, and a 3D Keypoint Grasp Matrix Transformation Execution module to generate physically plausible wrist poses and finger commands. Experiments show that BLaDA outperforms existing methods in both affordance grounding precision and functional manipulation success rate across various categories and tasks.
BLaDA 是一个可解释的零样本框架,将自然语言指令与 3D 场景理解结合以实现功能性灵巧操作。它使用知识引导的语言解析模块将指令解析为结构化的操作约束,使用三角功能点定位模块进行姿态一致的空间推理,并使用 3D 关键点抓取矩阵变换执行模块生成物理上可行的抓取命令。实验表明,BLaDA 在各种类别和任务中的功能性操作成功率和功能接地精度方面均优于现有方法。
Phantasia: Context-Adaptive Backdoors in Vision Language Models
Authors: Nam Duong Tran, Phi Le Nguyen
Venue: CVPR 2026
First: 2026-04-09T15:55:33+00:00 · Latest: 2026-04-09T15:55:33+00:00
Comments: CVPR 2026 Findings
Abstract
Recent advances in Vision-Language Models (VLMs) have greatly enhanced the integration of visual perception and linguistic reasoning, driving rapid progress in multimodal understanding. Despite these achievements, the security of VLMs, particularly their vulnerability to backdoor attacks, remains significantly underexplored. Existing backdoor attacks on VLMs are still in an early stage of development, with most current methods relying on generating poisoned responses that contain fixed, easily identifiable patterns. In this work, we make two key contributions. First, we demonstrate for the first time that the stealthiness of existing VLM backdoor attacks has been substantially overestimated. By adapting defense techniques originally designed for other domains (e.g., vision-only and text-only models), we show that several state-of-the-art attacks can be detected with surprising ease. Second, to address this gap, we introduce Phantasia, a context-adaptive backdoor attack that dynamically aligns its poisoned outputs with the semantics of each input. Instead of producing static poisoned patterns, Phantasia encourages models to generate contextually coherent yet malicious responses that remain plausible, thereby significantly improving stealth and adaptability. Extensive experiments across diverse VLM architectures reveal that Phantasia achieves state-of-the-art attack success rates while maintaining benign performance under various defensive settings.
中文标题/摘要
标题:幻象:视觉语言模型中的上下文自适应后门
视觉语言模型(VLMs)的最新进展极大地增强了视觉感知与语言推理的整合,推动了多模态理解的迅速进步。尽管取得了这些成就,VLMs的安全性,尤其是它们对后门攻击的脆弱性,仍然被显著忽视。现有的VLM后门攻击仍处于早期发展阶段,大多数当前方法依赖于生成包含固定、易于识别模式的中毒响应。在本工作中,我们做出了两项关键贡献。首先,我们首次证明现有VLM后门攻击的隐蔽性被严重高估。通过将其他领域(如仅视觉和仅文本模型)中设计的防御技术进行适应,我们展示了多种最先进的攻击可以出乎意料地被检测到。其次,为了解决这一差距,我们引入了Phantasia,一种上下文自适应后门攻击,能够动态地使其中毒输出与每个输入的语义对齐。Phantasia 不是生成静态的中毒模式,而是鼓励模型生成上下文连贯但又恶意的响应,这些响应仍然具有说服力,从而显著提高了隐蔽性和适应性。广泛的实验表明,Phantasia 在各种防御设置下实现了最先进的攻击成功率,同时保持了良好的性能。
Summary / 总结
This work addresses the security vulnerabilities of Vision-Language Models (VLMs) by demonstrating that existing backdoor attacks are less stealthy than previously thought. It introduces Phantasia, a context-adaptive backdoor attack that generates contextually coherent yet malicious responses, improving stealth and adaptability. Experiments show Phantasia achieves high attack success rates while maintaining benign performance under various defensive settings.
该论文通过证明现有视觉语言模型(VLMs)后门攻击的隐蔽性被高估,探讨了VLMs的安全漏洞。作者引入了Phantasia,一种上下文自适应后门攻击,生成上下文相关但恶意的响应,提高了隐蔽性和适应性。实验表明,Phantasia在各种防御设置下实现了最先进的攻击成功率,同时保持了良好的性能。
Don't Overthink It: Inter-Rollout Action Agreement as a Free Adaptive-Compute Signal for LLM Agents
Authors: Khushal Sethi
First: 2026-04-09T15:34:22+00:00 · Latest: 2026-04-09T15:34:22+00:00
Abstract
Inference-time compute scaling has emerged as a powerful technique for improving the reliability of large language model (LLM) agents, but existing methods apply compute uniformly: every decision step receives the same budget regardless of its difficulty. We introduce TrACE (Trajectorical Adaptive Compute via agrEement), a training-free controller that allocates LLM calls adaptively across agent timesteps by measuring inter-rollout action agreement. At each step, TrACE samples a small set of candidate next actions and measures how consistently the model commits to the same action. High agreement signals an easy decision; the controller commits immediately. Low agreement signals uncertainty; the controller samples additional rollouts up to a configurable cap before committing to the plurality action. No learned components, no external verifier, and no human labels are required. We evaluate TrACE against greedy decoding and fixed-budget self-consistency (SC-4, SC-8) on two benchmarks spanning single-step reasoning (GSM8K, n=50) and multi-step household navigation (MiniHouse, n=30), using a Qwen 2.5 3B Instruct model running on CPU. TrACE-4 matches SC-4 accuracy while using 33% fewer LLM calls on GSM8K and 39% fewer on MiniHouse. TrACE-8 matches SC-8 accuracy with 55% fewer calls on GSM8K and 65% fewer on MiniHouse. We further show that inter-rollout agreement is a reliable signal of step-level success, validating the core hypothesis that the model's own output consistency encodes difficulty information that can be exploited without training. TrACE is the first training-free, per-timestep adaptive-compute controller for LLM agents to be evaluated on multi-step sequential decision tasks.
中文标题/摘要
标题:不要过度思考:将rollout间动作一致性作为LLM代理的免费自适应计算信号
推理时计算扩展已成为提高大型语言模型(LLM)代理可靠性的强大技术,但现有方法均匀分配计算资源:每个决策步骤获得相同的预算,不论其难度。我们引入了TrACE(Trajectorical Adaptive Compute via agrEement),这是一种无需训练的控制器,通过测量rollout间动作一致性,在代理的各个时间步之间自适应地分配LLM调用。在每一步,TrACE都会采样一小组候选的下一个动作,并测量模型选择同一动作的一致程度。高一致性表示容易的决策;控制器立即做出决定。低一致性表示不确定性;控制器继续采样额外的rollout,直至达到可配置上限,然后选择得票最多的动作。无需学习组件、外部验证器和人工标签。我们在两个基准测试上将TrACE与贪婪解码和固定预算自一致性(SC-4、SC-8)进行对比,涵盖单步推理(GSM8K,n=50)和多步家庭导航(MiniHouse,n=30),使用Qwen 2.5 3B Instruct模型在CPU上运行。TrACE-4达到与SC-4相同的准确度,在GSM8K上减少33%的LLM调用,在MiniHouse上减少39%。TrACE-8达到与SC-8相同的准确度,在GSM8K上减少55%的调用,在MiniHouse上减少65%。我们进一步表明,rollout间一致性是步骤级成功的一个可靠信号,验证了核心假设,即模型自身输出的一致性包含了可以利用的难度信息,而无需训练。TrACE是第一个在多步序列决策任务中评估的无需训练、每时间步自适应计算的LLM代理控制器。
Summary / 总结
The paper introduces TrACE, a training-free controller that allocates LLM calls adaptively across agent timesteps based on inter-rollout action agreement. It evaluates TrACE against greedy decoding and fixed-budget self-consistency (SC-4, SC-8) on the GSM8K and MiniHouse benchmarks, showing that TrACE matches the accuracy of the fixed-budget methods while using 33% to 65% fewer LLM calls.
论文提出了TrACE,一种基于rollout间动作一致性分配计算资源的无训练方法。它在GSM8K和MiniHouse基准上将TrACE与贪婪解码和固定预算自我一致性进行对比,结果显示TrACE可以在显著减少LLM调用次数的同时达到固定预算方法的准确性。具体来说,TrACE-4和TrACE-8分别与SC-4和SC-8的准确性相当,而LLM调用次数减少了33%到65%。
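The per-step control loop described in the TrACE abstract (sample a few candidate actions, commit on high agreement, otherwise sample more up to a cap and take the plurality action) can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the agreement measure (share of votes for the leading action), the `threshold` and `initial_samples` parameters, and the `sample_action` callable standing in for one LLM rollout are all hypothetical.

```python
from collections import Counter
from typing import Callable, Tuple


def agreement_controller(
    sample_action: Callable[[], str],
    initial_samples: int = 2,
    max_samples: int = 4,
    threshold: float = 1.0,
) -> Tuple[str, int]:
    """Pick an action for one decision step.

    Returns the chosen action and the number of samples (LLM calls) used.
    """
    # Start with a small batch of candidate next actions.
    votes = Counter(sample_action() for _ in range(initial_samples))
    used = initial_samples
    while True:
        action, count = votes.most_common(1)[0]  # current plurality action
        agreement = count / used
        # Commit when the rollouts agree enough, or when the cap is reached.
        if agreement >= threshold or used >= max_samples:
            return action, used
        # Low agreement: spend one more call on this step.
        votes[sample_action()] += 1
        used += 1


if __name__ == "__main__":
    # An "easy" step: every rollout proposes the same action.
    print(agreement_controller(lambda: "open_door"))
    # A "hard" step: rollouts disagree, so the controller samples up to the cap.
    scripted = iter(["go_left", "go_right", "go_left", "go_left"])
    print(agreement_controller(lambda: next(scripted)))
```

With these defaults an easy step costs two calls and a hard step costs at most `max_samples`, which is where the savings over a fixed budget would come from. The reported reductions in the abstract depend on the authors' actual settings, which this sketch does not reproduce.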
PokeGym: A Visually-Driven Long-Horizon Benchmark for Vision-Language Models
Authors: Ruizhi Zhang, Ye Huang, Yuangang Pan, Chuanfu Shen, Zhilin Liu, Ting Xie, Wen Li, Lixin Duan
First: 2026-04-09T15:12:36+00:00 · Latest: 2026-04-09T15:12:36+00:00
Comments: Tech report
Abstract
While Vision-Language Models (VLMs) have achieved remarkable progress in static visual understanding, their deployment in complex 3D embodied environments remains severely limited. Existing benchmarks suffer from four critical deficiencies: (1) passive perception tasks circumvent interactive dynamics; (2) simplified 2D environments fail to assess depth perception; (3) privileged state leakage bypasses genuine visual processing; and (4) human evaluation is prohibitively expensive and unscalable. We introduce PokeGym, a visually-driven long-horizon benchmark instantiated within Pokemon Legends: Z-A, a visually complex 3D open-world Role-Playing Game. PokeGym enforces strict code-level isolation: agents operate solely on raw RGB observations while an independent evaluator verifies success via memory scanning, ensuring pure vision-based decision-making and automated, scalable assessment. The benchmark comprises 30 tasks (30-220 steps) spanning navigation, interaction, and mixed scenarios, with three instruction granularities (Visual-Guided, Step-Guided, Goal-Only) to systematically deconstruct visual grounding, semantic reasoning, and autonomous exploration capabilities. Our evaluation reveals a key limitation of current VLMs: physical deadlock recovery, rather than high-level planning, constitutes the primary bottleneck, with deadlocks showing a strong negative correlation with task success. Furthermore, we uncover a metacognitive divergence: weaker models predominantly suffer from Unaware Deadlocks (oblivious to entrapment), whereas advanced models exhibit Aware Deadlocks (recognizing entrapment yet failing to recover). These findings highlight the need to integrate explicit spatial intuition into VLM architectures. The code and benchmark will be available on GitHub.
中文标题/摘要
标题:PokeGym:一种基于视觉的长时程基准测试,用于视觉-语言模型
尽管视觉-语言模型(VLMs)在静态视觉理解方面取得了显著进展,但在复杂3D具身环境中的部署仍然受到严重限制。现有基准存在四个关键缺陷:(1)被动感知任务绕过了互动动态;(2)简化的2D环境无法评估深度感知;(3)特权状态泄露绕过了真正的视觉处理;(4)人工评估成本高昂且无法扩展。我们引入了PokeGym,这是一种基于视觉的长时程基准测试,建立在《宝可梦传说:Z-A》这一视觉复杂的3D开放世界角色扮演游戏内。PokeGym 强制执行严格的代码级隔离:代理仅基于原始RGB观察进行操作,而独立评估者通过内存扫描验证成功,确保纯粹基于视觉的决策和自动化的可扩展评估。基准测试包括30项任务(30-220步),涵盖导航、交互和混合场景,具有三种指令粒度(视觉引导、步骤引导、仅目标)以系统地分解视觉定位、语义推理和自主探索能力。我们的评估揭示了当前VLMs的关键局限性:物理死锁恢复而非高级规划构成了主要瓶颈,死锁与任务成功之间存在强烈负相关。此外,我们发现了一种元认知分歧:较弱的模型主要遭受无意识死锁(对受困毫无察觉),而先进的模型表现出有意识死锁(认识到受困但无法恢复)。这些发现突显了将显式空间直觉整合到VLM架构中的必要性。代码和基准测试将在GitHub上提供。
Summary / 总结
PokeGym is a visually-driven long-horizon benchmark for Vision-Language Models (VLMs), built inside a complex 3D open-world game to address the limitations of existing benchmarks. It enforces strict code-level isolation: agents act solely on raw RGB observations, while an independent evaluator verifies success via memory scanning. The evaluation finds that physical deadlock recovery, rather than high-level planning, is the primary bottleneck for current VLMs: weaker models mostly suffer from Unaware Deadlocks, whereas more advanced models exhibit Aware Deadlocks. These results point to the need to integrate explicit spatial intuition into VLM architectures.
PokeGym 是一个建立在复杂 3D 环境中的视觉驱动长时程基准,用于视觉-语言模型(VLMs),通过引入互动动态、深度感知和纯视觉决策来弥补现有基准的不足。它包含 30 个不同指令粒度的任务,发现物理死锁恢复是当前 VLMs 的主要瓶颈:较弱的模型往往意识不到死锁,而高级模型则表现出意识到死锁但无法恢复的情况。这表明需要将空间直觉整合到 VLM 架构中。
Weakly-Supervised Lung Nodule Segmentation via Training-Free Guidance of 3D Rectified Flow
Authors: Richard Petersen, Fredrik Kahl, Jennifer Alvén
Venue: MICCAI 2026 (submitted)
First: 2026-04-09T14:46:14+00:00 · Latest: 2026-04-09T14:46:14+00:00
Comments: Submitted to MICCAI 2026
Abstract
Dense annotations, such as segmentation masks, are expensive and time-consuming to obtain, especially for 3D medical images where expert voxel-wise labeling is required. Weakly supervised approaches aim to address this limitation, but often rely on attribution-based methods that struggle to accurately capture small structures such as lung nodules. In this paper, we propose a weakly-supervised segmentation method for lung nodules by combining pretrained state-of-the-art rectified flow and predictor models in a plug-and-play manner. Our approach uses training-free guidance of a 3D rectified flow model, requiring only fine-tuning of the predictor using image-level labels and no retraining of the generative model. The proposed method produces improved-quality segmentations for two separate predictors, consistently detecting lung nodules of varying size and shapes. Experiments on LUNA16 demonstrate improvements over baseline methods, highlighting the potential of generative foundation models as tools for weakly supervised 3D medical image segmentation.
中文标题/摘要
标题:基于3D修正流免训练引导的弱监督肺结节分割
密集标注(如分割掩码)获取成本高且耗时,尤其是在3D医学图像中,需要专家逐体素标注。弱监督方法旨在解决这一限制,但通常依赖于基于归因的方法,这些方法难以准确捕捉肺结节这样的小结构。本文提出了一种弱监督肺结节分割方法,以即插即用的方式结合预训练的最先进修正流模型和预测器模型。我们的方法对3D修正流模型进行免训练引导,仅需使用图像级标签微调预测器,无需重新训练生成模型。所提出的方法为两个独立的预测器生成了质量更高的分割结果,能够一致地检测不同大小和形状的肺结节。LUNA16上的实验表明,该方法优于基线方法,突显了生成基础模型作为弱监督3D医学图像分割工具的潜力。
Summary / 总结
The research addresses the high cost of dense annotations for 3D medical images by proposing a weakly-supervised lung nodule segmentation method. The method combines pretrained rectified flow and predictor models in a plug-and-play manner: the predictor is fine-tuned with image-level labels only, and the generative model is not retrained. The approach improves segmentation quality for two separate predictors, consistently detects nodules of varying size and shape, and outperforms baseline methods in experiments on the LUNA16 dataset.
研究旨在解决3D医学图像中密集标注成本高、耗时长的问题。方法结合了预训练的修正流模型和预测器模型进行弱监督肺结节分割,仅需使用图像级标签微调预测器,无需重新训练生成模型。该方法提高了肺结节分割的质量,并能一致地检测不同大小和形状的结节,在LUNA16数据集上优于基线方法。
SeLaR: Selective Latent Reasoning in Large Language Models
Authors: Renyu Fu, Guibo Luo
Venue: ACL 2026
First: 2026-04-09T14:32:07+00:00 · Latest: 2026-04-09T14:32:07+00:00
Comments: Camera-ready for ACL 2026 (main conference)
Abstract
Chain-of-Thought (CoT) has become a cornerstone of reasoning in large language models, yet its effectiveness is constrained by the limited expressiveness of discrete token sampling. Recent latent reasoning approaches attempt to alleviate this limitation by replacing discrete tokens with soft embeddings (probability-weighted mixtures of token embeddings) or hidden states, but they commonly suffer from two issues: (1) global activation injects perturbations into high-confidence steps, impairing reasoning stability; and (2) soft embeddings quickly collapse toward the highest-probability token, limiting exploration of alternative trajectories. To address these challenges, we propose SeLaR (Selective Latent Reasoning), a lightweight and training-free framework. SeLaR introduces an entropy-gated mechanism that activates soft embeddings only at low-confidence steps, while preserving discrete decoding at high-confidence steps. Additionally, we propose an entropy-aware contrastive regularization that pushes soft embeddings away from the dominant (highest-probability) token's direction, encouraging sustained exploration of multiple latent reasoning paths. Experiments on five reasoning benchmarks demonstrate that SeLaR consistently outperforms standard CoT and state-of-the-art training-free methods.
中文标题/摘要
标题:SeLaR:大型语言模型中的选择性潜在推理
思维链(CoT)已成为大型语言模型推理的基石,但其有效性受到离散令牌采样表达能力有限的限制。最近的潜在推理方法试图通过用软嵌入(概率加权的令牌嵌入混合物)或隐藏状态替换离散令牌来缓解这一限制,但它们通常存在两个问题:(1)全局激活会在高置信度步骤中注入扰动,损害推理稳定性;(2)软嵌入迅速向最高概率令牌靠拢,限制了对替代路径的探索。为了解决这些挑战,我们提出了SeLaR(选择性潜在推理),这是一种轻量级且无需训练的框架。SeLaR引入了一个熵门控机制,仅在低置信度步骤激活软嵌入,而在高置信度步骤保持离散解码。此外,我们提出了一种熵感知对比正则化,将软嵌入推离主导(最高概率)令牌的方向,鼓励对多种潜在推理路径的持续探索。在五个推理基准上的实验表明,SeLaR始终优于标准CoT和最先进的无需训练方法。
Summary / 总结
SeLaR is a lightweight and training-free framework that enhances the reasoning capabilities of large language models by selectively using soft embeddings at low-confidence steps while maintaining discrete decoding at high-confidence steps. It also introduces an entropy-aware contrastive regularization to encourage exploration of multiple reasoning paths. Experiments show that SeLaR outperforms standard Chain-of-Thought and other training-free methods on five reasoning benchmarks.
SeLaR 是一个轻量级且无需训练的框架,通过在低置信度步骤中选择性地使用软嵌入,同时在高置信度步骤中保持离散解码来增强大型语言模型的推理能力。它还引入了一种基于熵的对比正则化,以鼓励探索多种推理路径。实验表明,SeLaR 在五个推理基准上优于标准的链式思考和其它无需训练的方法。
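The entropy-gated step described in the SeLaR abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the entropy threshold of 1.0 nats, and the use of greedy argmax for the discrete branch are assumptions made for illustration, and the entropy-aware contrastive regularization is omitted.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def next_input_embedding(probs, embeddings, threshold=1.0):
    """Choose the embedding fed into the next decoding step.

    probs:      next-token probabilities, one per vocabulary entry.
    embeddings: one embedding vector (list of floats) per vocabulary entry.
    threshold:  hypothetical entropy cutoff between high- and low-confidence steps.
    """
    if entropy(probs) <= threshold:
        # High confidence: keep discrete decoding (assumed greedy argmax here).
        best = max(range(len(probs)), key=lambda i: probs[i])
        return list(embeddings[best]), "discrete"
    # Low confidence: soft embedding, i.e. a probability-weighted
    # mixture of the token embeddings.
    dim = len(embeddings[0])
    mixed = [sum(p * emb[d] for p, emb in zip(probs, embeddings)) for d in range(dim)]
    return mixed, "soft"

# Toy vocabulary of three tokens with 2-dimensional embeddings.
toy_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
confident = next_input_embedding([0.98, 0.01, 0.01], toy_embeddings)
uncertain = next_input_embedding([1 / 3, 1 / 3, 1 / 3], toy_embeddings)
```

In the toy call, the peaked distribution stays on the discrete path, while the uniform distribution (entropy ln 3, just above the assumed threshold) switches to the mixed embedding.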
Can Vision Language Models Judge Action Quality? An Empirical Evaluation
Authors: Miguel Monte e Freitas, Rui Henriques, Ricardo Rei, Pedro Henrique Martins
First: 2026-04-09T14:29:19+00:00 · Latest: 2026-04-09T14:29:19+00:00
Abstract
Action Quality Assessment (AQA) has broad applications in physical therapy, sports coaching, and competitive judging. Although Vision Language Models (VLMs) hold considerable promise for AQA, their actual performance in this domain remains largely uncharacterised. We present a comprehensive evaluation of state-of-the-art VLMs across activity domains (e.g. fitness, figure skating, diving), tasks, representations, and prompting strategies. Baseline results reveal that Gemini 3.1 Pro, Qwen3-VL and InternVL3.5 models perform only marginally above random chance, and although strategies such as incorporation of skeleton information, grounding instructions, reasoning structures and in-context learning lead to isolated gains, none is consistently effective. Analysis of prediction distributions uncovers two systematic biases: a tendency to predict correct execution regardless of visual evidence, and a sensitivity to superficial linguistic framing. Reformulating tasks contrastively to mitigate these biases yields minimal improvement, suggesting that the models' limitations go beyond these biases, pointing to a fundamental difficulty with fine-grained movement quality assessment. Our findings establish a rigorous baseline for future VLM-based AQA research and provide an actionable outline for failure modes requiring mitigation prior to reliable real-world deployment.
中文标题/摘要
标题:视觉语言模型能否评判动作质量?一项实证评估
动作质量评估(AQA)在物理治疗、体育教练和竞技评判中有着广泛的应用。尽管视觉语言模型(VLMs)在AQA方面具有很大的潜力,但它们在这一领域的实际表现仍然鲜有研究。我们对最先进的VLMs在活动领域(如健身、花样滑冰、跳水)、任务、表示和提示策略方面进行了全面评估。基线结果显示,Gemini 3.1 Pro、Qwen3-VL和InternVL3.5模型的表现仅略高于随机猜测,尽管融入骨骼信息、定位指令、推理结构和上下文学习等策略可以带来孤立的提升,但没有一种策略是始终有效的。对预测分布的分析揭示了两种系统性偏差:一种是无论视觉证据如何,都倾向于预测正确的执行,另一种是对表面语言框架的敏感性。以对比方式重新表述任务以减轻这些偏差,带来的改进微乎其微,这表明模型的局限性超出了这些偏差,指向了精细动作质量评估的根本性困难。我们的研究结果为未来基于VLM的AQA研究奠定了严格的基线,并提供了在可靠的实际应用部署前需要缓解的失败模式的可操作指南。
Summary / 总结
The study evaluates the performance of state-of-the-art Vision Language Models (VLMs) in Action Quality Assessment (AQA), a task with applications in physical therapy, sports coaching, and competitive judging. Despite their promise, the models perform only marginally better than random chance. Incorporating skeleton information, grounding instructions, reasoning structures, and in-context learning yielded isolated gains, but none was consistently effective. Analysis revealed two systematic biases: predicting correct execution regardless of visual evidence, and sensitivity to superficial linguistic framing. Reformulating tasks contrastively yielded minimal improvement, indicating a fundamental difficulty with assessing fine-grained movement quality. The research sets a baseline for future VLM-based AQA research and outlines the failure modes that need mitigation before reliable real-world deployment.
研究评估了最先进的视觉语言模型(VLMs)在动作质量评估(AQA)中的表现,AQA在物理治疗、体育教练和竞技评判中有广泛应用。尽管模型有潜力,但其表现仅略高于随机猜测。虽然整合骨架信息、定位指令、推理结构和上下文学习带来了孤立的提升,但没有一种策略始终有效。分析发现两种系统性偏差:无视视觉证据预测正确执行,以及对表面语言框架的敏感性。以对比方式重新表述任务并未显著改善结果,表明评估精细动作质量存在根本性困难。研究为未来基于VLM的AQA研究设定了基准,并指出了在可靠的实际部署前需要缓解的失败模式。
EditCaption: Human-Aligned Instruction Synthesis for Image Editing via Supervised Fine-Tuning and Direct Preference Optimization
Authors: Xiangyuan Wang, Honghao Cai, Yunhao Bai, Tianze Zhou, Haohua Chen, Yao Hu, Xu Tang, Yibo Chen, Wei Zhu
First: 2026-04-09T13:11:33+00:00 · Latest: 2026-04-09T13:11:33+00:00
Abstract
High-quality training triplets (source-target image pairs with precise editing instructions) are a critical bottleneck for scaling instruction-guided image editing models. Vision-language models (VLMs) are widely used for automated instruction synthesis, but we identify three systematic failure modes in image-pair settings: orientation inconsistency (e.g., left/right confusion), viewpoint ambiguity, and insufficient fine-grained attribute description. Human evaluation shows that over 47% of instructions from strong baseline VLMs contain critical errors unusable for downstream training. We propose EditCaption, a scalable two-stage post-training pipeline for VLM-based instruction synthesis. Stage 1 builds a 100K supervised fine-tuning (SFT) dataset by combining GLM automatic annotation, EditScore-based filtering, and human refinement for spatial, directional, and attribute-level accuracy. Stage 2 collects 10K human preference pairs targeting the three failure modes and applies direct preference optimization (DPO) for alignment beyond SFT alone. On Eval-400, ByteMorph-Bench, and HQ-Edit, fine-tuned Qwen3-VL models outperform open-source baselines; the 235B model reaches 4.712 on Eval-400 (vs. Gemini-3-Pro 4.706, GPT-4.1 4.220, Kimi-K2.5 4.111) and 4.588 on ByteMorph-Bench (vs. Gemini-3-Pro 4.522, GPT-4.1 3.412). Human evaluation shows critical errors falling from 47.75% to 23% and correctness rising from 41.75% to 66%. The work offers a practical path to scalable, human-aligned instruction synthesis for image editing data.
中文标题/摘要
标题:EditCaption: 通过监督微调和直接偏好优化实现图像编辑的人类对齐指令合成
高质量的训练三元组(带有精确编辑指令的源目标图像对)是扩展基于指令的图像编辑模型的关键瓶颈。视觉语言模型(VLMs)广泛用于自动化指令合成,但我们发现在图像对设置中存在三种系统性失败模式:方向不一致(例如,左右混淆)、视角模糊以及属性描述不足。人类评估显示,超过47%的来自强大基线VLM的指令包含关键错误,无法用于下游训练。我们提出EditCaption,一种基于VLM的指令合成的可扩展两阶段后训练管道。第一阶段构建一个包含10万监督微调(SFT)数据集,通过结合GLM自动注释、EditScore过滤和人类细化来提高空间、方向和属性级别的准确性。第二阶段收集1万个人类偏好对,针对三种失败模式,并应用直接偏好优化(DPO)以超越SFT本身实现对齐。在Eval-400、ByteMorph-Bench和HQ-Edit上,微调后的Qwen3-VL模型优于开源基线;2350亿参数模型在Eval-400上达到4.712(与Gemini-3-Pro 4.706、GPT-4.1 4.220、Kimi-K2.5 4.111相比)和在ByteMorph-Bench上达到4.588(与Gemini-3-Pro 4.522、GPT-4.1 3.412相比)。人类评估显示关键错误从47.75%降至23%,正确性从41.75%升至66%。该工作提供了一条实用路径,实现可扩展且人类对齐的图像编辑数据指令合成。
Summary / 总结
The research aims to address the challenge of creating high-quality training triplets for instruction-guided image editing models. It proposes EditCaption, a two-stage post-training pipeline that uses supervised fine-tuning and direct preference optimization to improve instruction synthesis. Stage 1 creates a 100K dataset with human refinement for accuracy, and Stage 2 uses 10K human preference pairs to optimize alignment. The study shows that fine-tuned Qwen3-VL models outperform open-source baselines, with the 235B model achieving scores of 4.712 on Eval-400 and 4.588 on ByteMorph-Bench. In human evaluation, critical errors decreased from 47.75% to 23% and correctness increased from 41.75% to 66%. This work provides a practical approach for scalable, human-aligned instruction synthesis for image editing data.
研究旨在解决为指令引导的图像编辑模型创建高质量训练三元组的挑战。提出了一种两阶段后训练管道EditCaption,结合监督微调和直接偏好优化来改进指令合成。第一阶段创建了一个包含人类校正的100K数据集以提高准确性,第二阶段使用1万个人类偏好对进行优化以超越仅监督微调。研究显示,微调后的Qwen3-VL模型优于开源基线,其中235B模型在Eval-400和ByteMorph-Bench上的得分分别为4.712和4.588;人类评估中的关键错误率从47.75%降至23%,正确性从41.75%提升至66%。这项工作提供了一种实用的方法,用于实现可扩展且与人类对齐的图像编辑数据指令合成。
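Stage 2 of EditCaption applies direct preference optimization (DPO) to human preference pairs. As a reminder of what that objective computes, here is a minimal sketch of the standard per-pair DPO loss. It is the textbook formulation rather than the authors' training code; the function name, argument names, and the default beta of 0.1 are assumptions for illustration.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one preference pair.

    logp_*:     sequence log-probabilities under the policy being trained.
    ref_logp_*: sequence log-probabilities under the frozen reference model.
    beta:       strength of the implicit KL constraint (0.1 is an assumed default).
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # loss = -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Policy identical to the reference: the loss equals log(2).
neutral = dpo_loss(-2.0, -2.0, -2.0, -2.0)
# Policy already favors the human-preferred instruction: lower loss.
aligned = dpo_loss(-1.0, -3.0, -2.0, -2.0)
# Policy favors the rejected instruction: higher loss.
misaligned = dpo_loss(-3.0, -1.0, -2.0, -2.0)
```

The loss falls below log 2 only when the policy raises the preferred instruction's likelihood relative to the reference model more than it raises the rejected one's.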
Vision-Language Foundation Models for Comprehensive Automated Pavement Condition Assessment
Authors: Blessing Agyei Kyem, Joshua Kofi Asamoah, Anthony Dontoh, Armstrong Aboah
First: 2026-04-09T13:11:30+00:00 · Latest: 2026-04-09T13:11:30+00:00
Abstract
General-purpose vision-language models demonstrate strong performance in everyday domains but struggle with specialized technical fields requiring precise terminology, structured reasoning, and adherence to engineering standards. This work addresses whether domain-specific instruction tuning can enable comprehensive pavement condition assessment through vision-language models. PaveInstruct, a dataset containing 278,889 image-instruction-response pairs spanning 32 task types, was created by unifying annotations from nine heterogeneous pavement datasets. PaveGPT, a pavement foundation model trained on this dataset, was evaluated against state-of-the-art vision-language models across perception, understanding, and reasoning tasks. Instruction tuning transformed model capabilities, achieving improvements exceeding 20% in spatial grounding, reasoning, and generation tasks while producing ASTM D6433-compliant outputs. These results enable transportation agencies to deploy unified conversational assessment tools that replace multiple specialized systems, simplifying workflows and reducing technical expertise requirements. The approach establishes a pathway for developing instruction-driven AI systems across infrastructure domains including bridge inspection, railway maintenance, and building condition assessment.
中文标题/摘要
标题:视觉-语言基础模型在综合路面状况评估中的应用
通用的视觉-语言模型在日常领域表现出色,但在需要精确术语、结构化推理和遵守工程标准的专业技术领域却表现不佳。本研究探讨了是否可以通过领域特定指令调优使视觉-语言模型能够进行全面的路面状况评估。PaveInstruct数据集包含278,889张图像-指令-响应对,覆盖32种任务类型,由九个异构路面数据集的注释统一而成。PaveGPT是基于该数据集训练的路面基础模型,其在感知、理解和推理任务上与最先进的视觉-语言模型进行了对比评估。指令调优提升了模型的能力,在空间定位、推理和生成任务上取得了超过20%的改进,同时生成符合ASTM D6433标准的输出。这些结果使交通部门能够部署统一的对话式评估工具,取代多个专业系统,简化工作流程并降低技术专长要求。该方法为桥梁检查、铁路维护和建筑状况评估等基础设施领域的指令驱动AI系统开发奠定了路径。
Summary / 总结
This study investigates whether domain-specific instruction tuning can enable vision-language models to perform comprehensive pavement condition assessment. The research introduces PaveInstruct, a dataset of 278,889 image-instruction-response pairs spanning 32 task types, and PaveGPT, a model trained on this dataset. PaveGPT was evaluated against state-of-the-art vision-language models on perception, understanding, and reasoning tasks; instruction tuning yielded improvements exceeding 20% in spatial grounding, reasoning, and generation tasks. The model's outputs comply with ASTM D6433, enabling transportation agencies to use unified conversational assessment tools, simplifying workflows and reducing technical expertise requirements.
该研究探讨领域特定的指令调优能否使视觉-语言模型完成全面的路面状况评估。研究引入了包含278,889个图像-指令-响应对、覆盖32种任务类型的PaveInstruct数据集,以及基于该数据集训练的PaveGPT模型。研究显示,指令调优显著提升了模型的能力,在空间定位、推理和生成任务中取得了超过20%的性能提升,并生成了符合ASTM D6433标准的输出。这使得交通机构能够使用统一的对话评估工具,简化工作流程并减少技术专长要求。
MedVR: Annotation-Free Medical Visual Reasoning via Agentic Reinforcement Learning
Authors: Zheng Jiang, Heng Guo, Chengyu Fang, Changchen Xiao, Xinyang Hu, Lifeng Sun, Minfeng Xu
Venue: ICLR 2026
First: 2026-04-09T13:04:49+00:00 · Latest: 2026-04-09T13:04:49+00:00
Comments: Accepted by ICLR 2026
Abstract
Medical Vision-Language Models (VLMs) hold immense promise for complex clinical tasks, but their reasoning capabilities are often constrained by text-only paradigms that fail to ground inferences in visual evidence. This limitation not only curtails performance on tasks requiring fine-grained visual analysis but also introduces risks of visual hallucination in safety-critical applications. Thus, we introduce MedVR, a novel reinforcement learning framework that enables annotation-free visual reasoning for medical VLMs. Its core innovation lies in two synergistic mechanisms: Entropy-guided Visual Regrounding (EVR) uses model uncertainty to direct exploration, while Consensus-based Credit Assignment (CCA) distills pseudo-supervision from rollout agreement. Without any human annotations for intermediate steps, MedVR achieves state-of-the-art performance on diverse public medical VQA benchmarks, significantly outperforming existing models. By learning to reason directly with visual evidence, MedVR promotes the robustness and transparency essential for accelerating the clinical deployment of medical AI.
中文标题/摘要
标题:MedVR:基于代理强化学习的无注释医学视觉推理
医学视觉-语言模型(VLMs)在复杂临床任务中具有巨大潜力,但其推理能力往往受限于仅基于文本的范式,无法将推断与视觉证据联系起来。这一限制不仅削弱了模型在需要精细视觉分析的任务上的性能,还增加了在安全关键应用中出现视觉幻觉的风险。因此,我们提出了MedVR,这是一种新颖的强化学习框架,使医学VLMs能够进行无注释的视觉推理。其核心创新在于两种协同机制:熵导向的视觉再定位(EVR)利用模型不确定性引导探索,而基于共识的信用分配(CCA)从多次采样(rollout)结果的一致性中提炼伪监督。在中间步骤没有任何人类注释的情况下,MedVR在多种公开的医学VQA基准测试中达到了最先进的性能,显著优于现有模型。通过学习直接利用视觉证据进行推理,MedVR提升了加速医学AI临床部署所必需的稳健性和透明度。
Summary / 总结
MedVR is a reinforcement learning framework designed to enhance the visual reasoning capabilities of medical vision-language models (VLMs) without requiring human annotations for intermediate steps. It combines Entropy-guided Visual Regrounding (EVR), which uses model uncertainty to direct exploration, with Consensus-based Credit Assignment (CCA), which distills pseudo-supervision from rollout agreement. MedVR achieves state-of-the-art performance on various public medical VQA benchmarks, outperforming existing models and promoting robustness and transparency in medical AI applications.
MedVR 是一种无需对中间步骤进行人工标注的强化学习框架,旨在提升医疗视觉语言模型(VLMs)的视觉推理能力。它通过熵引导的视觉重新锚定(EVR)基于模型不确定性进行探索,并通过基于共识的信用分配(CCA)从多次采样(rollout)结果的一致性中提取伪监督。MedVR 在多种公开的医疗 VQA 基准测试中达到了最先进的性能,超越现有模型,并促进医疗 AI 应用的稳健性和透明度。
AnomalyVFM -- Transforming Vision Foundation Models into Zero-Shot Anomaly Detectors
Authors: Matic Fučka, Vitjan Zavrtanik, Danijel Skočaj
Venue: CVPR 2026
First: 2026-01-28T12:02:58+00:00 · Latest: 2026-04-09T12:38:57+00:00
Comments: Accepted to CVPR 2026
Abstract
Zero-shot anomaly detection aims to detect and localise abnormal regions in the image without access to any in-domain training images. While recent approaches leverage vision-language models (VLMs), such as CLIP, to transfer high-level concept knowledge, methods based on purely vision foundation models (VFMs), like DINOv2, have lagged behind in performance. We argue that this gap stems from two practical issues: (i) limited diversity in existing auxiliary anomaly detection datasets and (ii) overly shallow VFM adaptation strategies. To address both challenges, we propose AnomalyVFM, a general and effective framework that turns any pretrained VFM into a strong zero-shot anomaly detector. Our approach combines a robust three-stage synthetic dataset generation scheme with a parameter-efficient adaptation mechanism, utilising low-rank feature adapters and a confidence-weighted pixel loss. Together, these components enable modern VFMs to substantially outperform current state-of-the-art methods. More specifically, with RADIO as a backbone, AnomalyVFM achieves an average image-level AUROC of 94.1% across 9 diverse datasets, surpassing previous methods by a significant 3.3 percentage points. Project Page: https://maticfuc.github.io/anomaly_vfm/
中文标题/摘要
标题:AnomalyVFM -- 将视觉基础模型转化为零样本异常检测器
零样本异常检测旨在无需访问任何领域内训练图像的情况下,检测和定位图像中的异常区域。虽然最近的方法利用视觉-语言模型(VLMs),如CLIP,来转移高级概念知识,但基于纯粹视觉基础模型(VFMs)的方法,如DINOv2,在性能上落后。我们认为这种差距源于两个实际问题:(i) 现有辅助异常检测数据集的多样性有限,(ii) VFM的适应策略过于浅显。为了解决这两个挑战,我们提出了AnomalyVFM,这是一种通用且有效的框架,能够将任何预训练的VFM转化为强大的零样本异常检测器。我们的方法结合了一种稳健的三阶段合成数据集生成方案和一种参数高效的适应机制,利用低秩特征适配器和置信加权像素损失。这些组件共同使现代VFMs在性能上显著优于当前最先进的方法。具体而言,以RADIO作为骨干,AnomalyVFM在9个不同数据集上的平均图像级AUROC为94.1%,比之前的方法高出显著的3.3个百分点。项目页面:https://maticfuc.github.io/anomaly_vfm/
Summary / 总结
AnomalyVFM is a framework that transforms pretrained vision foundation models (VFMs) into zero-shot anomaly detectors. It addresses the limitations of existing methods by using a robust three-stage synthetic dataset generation scheme and a parameter-efficient adaptation mechanism. AnomalyVFM significantly outperforms current state-of-the-art methods, achieving an average image-level AUROC of 94.1% across 9 diverse datasets with RADIO as the backbone, surpassing previous methods by 3.3 percentage points.
研究旨在通过解决现有方法的限制,利用视觉基础模型(VFMs)增强零样本异常检测。AnomalyVFM框架通过稳健的合成数据集生成和参数高效的适应机制,将任何预训练的VFMs转化为强大的零样本异常检测器。该方法在九个不同的数据集上实现了平均图像级AUROC为94.1%,比之前的方法高出3.3个百分点。
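The headline AnomalyVFM number is an image-level AUROC averaged over datasets. For readers less familiar with the metric, the sketch below computes AUROC for one dataset directly from its pairwise-ranking definition. It is a generic illustration and not the paper's evaluation code; the function name and the toy scores are assumptions.

```python
def auroc(scores, labels):
    """Image-level AUROC from anomaly scores and binary labels (1 = anomalous).

    Equals the probability that a randomly chosen anomalous image receives a
    higher score than a randomly chosen normal image, counting ties as 0.5.
    This O(n_pos * n_neg) form is meant for clarity, not for large datasets.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one anomalous and one normal image")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example: two anomalous and two normal images.
toy_auroc = auroc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])
```

An average across datasets is then the arithmetic mean of the per-dataset values.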
ViVa: A Video-Generative Value Model for Robot Reinforcement Learning
Authors: Jindi Lv, Hao Li, Jie Li, Yifei Nie, Fankun Kong, Yang Wang, Xiaofeng Wang, Zheng Zhu, Chaojun Ni, Qiuping Deng, Hengtao Li, Jiancheng Lv, Guan Huang
First: 2026-04-09T12:28:14+00:00 · Latest: 2026-04-09T12:28:14+00:00
Abstract
Vision-language-action (VLA) models have advanced robot manipulation through large-scale pretraining, but real-world deployment remains challenging due to partial observability and delayed feedback. Reinforcement learning addresses this via value functions, which assess task progress and guide policy improvement. However, existing value models built on vision-language models (VLMs) struggle to capture temporal dynamics, undermining reliable value estimation in long-horizon tasks. In this paper, we propose ViVa, a video-generative value model that repurposes a pretrained video generator for value estimation. Taking the current observation and robot proprioception as input, ViVa jointly predicts future proprioception and a scalar value for the current state. By leveraging the spatiotemporal priors of a pretrained video generator, our approach grounds value estimation in anticipated embodiment dynamics, moving beyond static snapshots to intrinsically couple value with foresight. Integrated into RECAP, ViVa delivers substantial improvements on real-world box assembly. Qualitative analysis across all three tasks confirms that ViVa produces more reliable value signals, accurately reflecting task progress. By leveraging spatiotemporal priors from video corpora, ViVa also generalizes to novel objects, highlighting the promise of video-generative models for value estimation.
中文标题/摘要
标题:ViVa:一种用于机器人强化学习的视频生成价值模型
视觉-语言-动作(VLA)模型通过大规模预训练提升了机器人的操作能力,但由于部分可观测性和延迟反馈,实际部署仍然具有挑战性。强化学习通过价值函数解决了这一问题,评估任务进展并指导策略改进。然而,现有的基于视觉-语言模型的价值模型难以捕捉时间动态,影响了长期任务中的可靠价值估计。在本文中,我们提出了一种视频生成价值模型ViVa,该模型重新利用了预训练的视频生成器进行价值估计。通过将当前观察和机器人本体感觉作为输入,ViVa联合预测未来的本体感觉和当前状态的标量值。通过利用预训练视频生成器的时空先验,我们的方法将价值估计与预期的本体动态联系起来,超越了静态快照,内在地将价值与前瞻性联系起来。集成到RECAP中,ViVa在实际的盒子组装任务中取得了显著的改进。在所有三个任务的定性分析中,证实了ViVa生成了更可靠的价值信号,准确反映了任务进展。通过利用视频语料库中的时空先验,ViVa还能够泛化到新的对象,突显了视频生成模型在价值估计中的潜力。
Summary / 总结
The paper proposes ViVa, a video-generative value model for robot reinforcement learning, addressing the challenge of partial observability and delayed feedback in real-world deployment. ViVa uses a pretrained video generator to predict future proprioception and a scalar value for the current state, grounding value estimation in anticipated embodiment dynamics. Experiments show that ViVa improves performance on real-world box assembly tasks and produces more reliable value signals compared to existing methods, demonstrating its potential for generalizing to novel objects.
论文提出了ViVa,一种视频生成的价值模型,旨在通过解决现有视觉-语言模型在捕捉时间动态方面的局限性来改进机器人强化学习。ViVa 利用预训练的视频生成器来预测未来的本体感觉和当前状态的标量价值,并被集成到 RECAP 框架中。实验结果表明,ViVa 在现实世界的盒子组装任务中取得了显著改进,提供了更可靠的价值信号,准确反映了任务进度,并且能够泛化到新的物体,突显了视频生成模型在价值估计中的潜力。
T-Gated Adapter: A Lightweight Temporal Adapter for Vision-Language Medical Segmentation
Authors: Pranjal Khadka
Venue: CVPR 2026 Workshop (PHAROS-AIF-MIH)
First: 2026-04-09T12:27:50+00:00 · Latest: 2026-04-09T12:27:50+00:00
Comments: Accepted at the PHAROS-AIF-MIH Workshop at CVPR 2026
Abstract
Medical image segmentation traditionally relies on fully supervised 3D architectures that demand a large amount of dense, voxel-level annotations from clinical experts, which is prohibitively expensive. Vision Language Models (VLMs) offer a powerful alternative by leveraging broad visual semantic representations learned from billions of images. However, when applied independently to 2D slices of a 3D scan, these models often produce noisy and anatomically implausible segmentations that violate the inherent continuity of anatomical structures. We propose a temporal adapter that addresses this by injecting adjacent-slice context directly into the model's visual token representations. The adapter comprises a temporal transformer attending across a fixed context window at the token level, a spatial context block refining within-slice representations, and an adaptive gate balancing temporal and single-slice features. Training on 30 labeled volumes from the FLARE22 dataset, our method achieves a mean Dice of 0.704 across 13 abdominal organs with a gain of +0.206 over the baseline VLM trained with no temporal context. Zero-shot evaluation on BTCV and AMOS22 datasets yields consistent improvements of +0.210 and +0.230, with the average cross-domain performance drop reducing from 38.0% to 24.9%. Furthermore, in a cross-modality evaluation on AMOS22 MRI with neither model receiving any MRI supervision, our method achieves a mean Dice of 0.366, outperforming a fully supervised 3D baseline (DynUNet, 0.224) trained exclusively on CT, suggesting that CLIP's visual semantic representations generalize more gracefully across imaging modalities than convolutional features.
中文标题/摘要
标题:T-Gated Adapter:一种用于视觉语言医学分割的轻量级时序适配器
医学图像分割传统上依赖于完全监督的3D架构,这需要大量来自临床专家的密集、体素级别的注释,这是一个极其昂贵的过程。视觉语言模型(VLMs)通过利用从数十亿张图像中学习到的广泛视觉语义表示提供了有力的替代方案。然而,当独立应用于3D扫描的2D切片时,这些模型通常会产生噪声大且解剖上不合理的分割,违反了解剖结构的内在连续性。我们提出了一种时序适配器,通过直接将相邻切片的上下文注入模型的视觉标记表示中来解决这一问题。适配器包括一个在标记级别跨固定上下文窗口进行注意力计算的时序Transformer,一个细化切片内表示的空间上下文块,以及一个平衡时序特征和单切片特征的自适应门控。在FLARE22数据集的30个标注体数据上进行训练,我们的方法在13个腹部器官上实现了平均Dice值为0.704,比没有时序上下文的基线VLM提高了0.206。在BTCV和AMOS22数据集上的零样本评估中,分别取得了+0.210和+0.230的一致改进,平均跨域性能下降从38.0%降低到24.9%。此外,在AMOS22 MRI的跨模态评估中,两个模型均未接受任何MRI监督,我们的方法实现了平均Dice值为0.366,优于仅在CT上训练的完全监督3D基线(DynUNet,0.224),表明CLIP的视觉语义表示比卷积特征更能平稳地泛化到不同成像模态。
Summary / 总结
The paper addresses the challenge of obtaining accurate 3D medical image segmentations by proposing a T-Gated Adapter, which injects temporal context into a Vision Language Model (VLM) to improve the quality of 2D slice segmentations. The method uses a temporal transformer, a spatial context block, and an adaptive gate to balance temporal and single-slice features. Experiments show that the T-Gated Adapter achieves a mean Dice score of 0.704 across 13 abdominal organs, outperforming the baseline VLM by +0.206. Zero-shot evaluations on BTCV and AMOS22 datasets also demonstrate consistent improvements, with a reduction in cross-domain performance drop from 38.0% to 24.9%. Additionally, the method performs better than a fully supervised 3D baseline on MRI data without any MRI supervision, indicating better generalization across imaging modalities.
论文提出了一种T-Gated Adapter方法,通过向Vision Language Model (VLM)注入时序上下文来提高2D切片分割的准确性。该方法使用了时序Transformer、空间上下文块和自适应门控机制来平衡时序特征和单切片特征。实验结果显示,T-Gated Adapter在13个腹部器官上实现了0.704的平均Dice分数,比基线VLM提高了0.206。在BTCV和AMOS22数据集上的零样本评估也显示出一致的改进,平均跨域性能下降从38.0%降低到24.9%。此外,在没有任何MRI监督的情况下,该方法在MRI数据上优于完全监督的3D基线(DynUNet,0.224),表明其在不同成像模态之间的泛化能力更强。
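The T-Gated Adapter results are reported as mean Dice over 13 abdominal organs. The sketch below shows how a per-organ Dice score and its mean over organs are typically computed from label maps. It is a generic illustration, not the paper's evaluation script; the function names, the flat-list representation, and the convention of returning 1.0 when an organ is absent from both prediction and ground truth are assumptions.

```python
def dice_score(pred_mask, target_mask):
    """Dice = 2 * |A ∩ B| / (|A| + |B|) for two binary masks given as flat sequences."""
    intersection = sum(1 for p, t in zip(pred_mask, target_mask) if p and t)
    total = sum(1 for p in pred_mask if p) + sum(1 for t in target_mask if t)
    # Assumed convention: an organ absent from both masks counts as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total

def mean_dice(pred_labels, target_labels, organ_ids):
    """Average the per-organ Dice over the given organ label ids."""
    scores = [
        dice_score([p == k for p in pred_labels], [t == k for t in target_labels])
        for k in organ_ids
    ]
    return sum(scores) / len(scores)

# Toy label maps with background 0 and two organs (ids 1 and 2).
toy_mean = mean_dice([1, 1, 2, 0], [1, 2, 2, 0], organ_ids=[1, 2])
```

Each toy organ overlaps its ground truth in one of three labeled voxels, so both organs score 2/3 and so does their mean.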
Small Vision-Language Models are Smart Compressors for Long Video Understanding
Authors: Junjie Fei, Jun Chen, Zechun Liu, Yunyang Xiong, Chong Zhou, Wei Wen, Junlin Han, Mingchen Zhuge, Saksham Suri, Qi Qian, Shuming Liu, Lemeng Wu, Raghuraman Krishnamoorthi, Vikas Chandra, Mohamed Elhoseiny, Chenchen Zhu
First: 2026-04-09T11:40:25+00:00 · Latest: 2026-04-09T11:40:25+00:00
Comments: Project page and demo are available at https://FeiElysia.github.io/tempo-page/
Abstract
Adapting Multimodal Large Language Models (MLLMs) for hour-long videos is bottlenecked by context limits. Dense visual streams saturate token budgets and exacerbate the lost-in-the-middle phenomenon. Existing heuristics, like sparse sampling or uniform pooling, blindly sacrifice fidelity by discarding decisive moments and wasting bandwidth on irrelevant backgrounds. We propose Tempo, an efficient query-aware framework compressing long videos for downstream understanding. Tempo leverages a Small Vision-Language Model (SVLM) as a local temporal compressor, casting token reduction as an early cross-modal distillation process to generate compact, intent-aligned representations in a single forward pass. To enforce strict budgets without breaking causality, we introduce Adaptive Token Allocation (ATA). Exploiting the SVLM's zero-shot relevance prior and semantic front-loading, ATA acts as a training-free $O(1)$ dynamic router. It allocates dense bandwidth to query-critical segments while compressing redundancies into minimal temporal anchors to maintain the global storyline. Extensive experiments show our 6B architecture achieves state-of-the-art performance with aggressive dynamic compression (0.5-16 tokens/frame). On the extreme-long LVBench (4101s), Tempo scores 52.3 under a strict 8K visual budget, outperforming GPT-4o and Gemini 1.5 Pro. Scaling to 2048 frames reaches 53.7. Crucially, Tempo compresses hour-long videos substantially below theoretical limits, proving true long-form video understanding relies on intent-driven efficiency rather than greedily padded context windows.
中文标题/摘要
标题:小型视觉-语言模型是长视频理解的高效压缩器
将多模态大型语言模型(MLLMs)适应一小时长的视频受到上下文限制的瓶颈。密集的视觉流耗尽了令牌预算并加剧了中间信息丢失的现象。现有的启发式方法,如稀疏采样或均匀池化,盲目地牺牲了保真度,丢弃了关键时刻并浪费带宽在无关的背景上。我们提出了 Tempo,一种高效的查询感知框架,用于压缩长视频以供下游理解。Tempo 利用小型视觉-语言模型(SVLM)作为局部时间压缩器,将令牌减少视为早期跨模态蒸馏过程,以在单次前向传递中生成紧凑且意图对齐的表示。为了在不破坏因果关系的情况下严格控制预算,我们引入了自适应令牌分配(ATA)。利用 SVLM 的零样本相关性先验和语义前加载,ATA 作为一种无需训练的 $O(1)$ 动态路由器。它将密集带宽分配给查询关键段,同时将冗余压缩为最小的时间锚点以保持全局故事情节。大量实验表明,我们的 6B 架构在激进的动态压缩(0.5-16 个令牌/帧)下实现了最先进的性能。在极端长的 LVBench(4101 秒)上,Tempo 在严格的 8K 视觉预算下得分为 52.3,优于 GPT-4o 和 Gemini 1.5 Pro。扩展到 2048 帧达到 53.7。至关重要的是,Tempo 将小时长的视频压缩到远低于理论上限的水平,证明真正的长视频理解依赖于意图驱动的效率,而不是贪婪填充的上下文窗口。
Summary / 总结
The paper addresses the challenge of processing long videos using multimodal large language models (MLLMs) by proposing Tempo, an efficient query-aware framework. Tempo uses a Small Vision-Language Model (SVLM) to compress long videos, focusing on generating compact, intent-aligned representations through an early cross-modal distillation process. The framework introduces Adaptive Token Allocation (ATA) to allocate bandwidth dynamically, ensuring causality is maintained while compressing redundant information into minimal temporal anchors. Experiments demonstrate that Tempo achieves state-of-the-art performance with aggressive dynamic compression, outperforming GPT-4o and Gemini 1.5 Pro on the LVBench dataset, and compressing hour-long videos substantially below theoretical limits.
论文解决了使用多模态大型语言模型处理小时长视频时遇到的上下文限制问题。它提出了 Tempo,一种基于小视觉语言模型(SVLM)的高效压缩框架,将标记减少过程视为早期跨模态蒸馏过程,生成紧凑且意图对齐的表示。Adaptive Token Allocation (ATA) 机制确保在不破坏因果关系的情况下严格控制预算,将密集带宽分配给查询关键段,并将冗余压缩为最小时间锚点以保持全局故事情节。实验表明,Tempo 的 6B 架构在动态压缩下实现了最先进的性能,超越了 GPT-4o 和 Gemini 1.5 Pro 在 LVBench 数据集上的表现,并在 2048 帧上达到 53.7 的成绩。
OV-Stitcher: A Global Context-Aware Framework for Training-Free Open-Vocabulary Semantic Segmentation
Authors: Seungjae Moon, Seunghyun Oh, Youngmin Ro
First: 2026-04-09T11:28:43+00:00 · Latest: 2026-04-09T11:28:43+00:00
Abstract
Training-free open-vocabulary semantic segmentation (TF-OVSS) has recently attracted attention for its ability to perform dense prediction by leveraging the pretrained knowledge of large vision and vision-language models, without requiring additional training. However, due to the limited input resolution of these pretrained encoders, existing TF-OVSS methods commonly adopt a sliding-window strategy that processes cropped sub-images independently. While effective for managing high-resolution inputs, this approach prevents global attention over the full image, leading to fragmented feature representations and limited contextual reasoning. We propose OV-Stitcher, a training-free framework that addresses this limitation by stitching fragmented sub-image features directly within the final encoder block. By reconstructing attention representations from fragmented sub-image features, OV-Stitcher enables global attention within the final encoder block, producing coherent context aggregation and spatially consistent, semantically aligned segmentation maps. Extensive evaluations across eight benchmarks demonstrate that OV-Stitcher establishes a scalable and effective solution for open-vocabulary segmentation, achieving a notable improvement in mean Intersection over Union (mIoU) from 48.7 to 50.7 compared with prior training-free baselines.
中文标题/摘要
标题:OV-Stitcher:一种全局上下文感知的无训练框架,用于开放词汇语义分割
无训练开放词汇语义分割(TF-OVSS)由于能够利用大型视觉和视觉语言模型的预训练知识进行密集预测,而无需额外训练,最近引起了关注。然而,由于这些预训练编码器的输入分辨率有限,现有的TF-OVSS方法通常采用滑动窗口策略,独立处理裁剪的子图像。虽然这种方法对于处理高分辨率输入是有效的,但它会阻止对整个图像进行全局关注,导致特征表示碎片化和上下文推理能力有限。我们提出了一种名为OV-Stitcher的无训练框架,通过在最终编码器块内直接拼接碎片化的子图像特征来解决这一问题。通过从碎片化的子图像特征中重建注意力表示,OV-Stitcher能够在最终编码器块中实现全局关注,产生连贯的上下文聚合和空间上一致、语义对齐的分割图。在八个基准上的广泛评估表明,OV-Stitcher提供了一种可扩展且有效的开放词汇分割解决方案,与之前的无训练基线相比,在平均交并比(mIoU)上取得了显著改进,从48.7提高到50.7。
Summary / 总结
OV-Stitcher is a training-free framework for open-vocabulary semantic segmentation that addresses the limitation of fragmented feature representations by stitching sub-image features directly within the final encoder block. This approach enables global attention and coherent context aggregation, improving mean Intersection over Union (mIoU) from 48.7 to 50.7 compared to previous training-free methods.
论文提出了OV-Stitcher,一种无需训练的开放词汇语义分割框架,通过在最终编码器块内直接拼接碎片化的子图像特征,解决现有滑动窗口方法特征表示碎片化的局限性。这种方法能够实现全局注意力和连贯的上下文聚合,在八个基准测试中将平均交并比(mIoU)从之前无训练基线的48.7提升到50.7。
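For readers unfamiliar with the metric quoted above, mean Intersection over Union averages the per-class overlap between predicted and ground-truth label maps. The following is a minimal pure-Python sketch, not the authors' evaluation code; the function name `mean_iou` and the choice to skip classes absent from both maps are assumptions made for illustration.

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU over classes for two equally sized flat lists of class labels."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    ious = []
    for c in range(num_classes):
        # Pixels where both maps agree on class c.
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        # Pixels where either map predicts class c.
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union == 0:
            continue  # assumption: classes absent from both maps are skipped
        ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy example: six pixels, three classes.
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
score = mean_iou(pred, target, num_classes=3)
```

Here the per-class IoUs are 1/3, 2/3, and 1/2, so `score` is 0.5. Benchmark protocols often differ in how they treat ignored labels and absent classes, so reported numbers depend on those conventions.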
OpenTrack3D: Towards Accurate and Generalizable Open-Vocabulary 3D Instance Segmentation
Authors: Zhishan Zhou, Siyuan Wei, Zengran Wang, Chunjie Wang, Xiaosheng Yan, Xiao Liu
First: 2025-12-03T07:51:03+00:00 · Latest: 2026-04-09T11:20:25+00:00
Abstract
Generalizing open-vocabulary 3D instance segmentation (OV-3DIS) to diverse, unstructured, and mesh-free environments is crucial for robotics and AR/VR, yet remains a significant challenge. We attribute this to two key limitations of existing methods: (1) proposal generation relies on dataset-specific proposal networks or mesh-based superpoints, rendering them inapplicable in mesh-free scenarios and limiting generalization to novel scenes; and (2) the weak textual reasoning of CLIP-based classifiers, which struggle to recognize compositional and functional user queries. To address these issues, we introduce OpenTrack3D, a generalizable and accurate framework. Unlike methods that rely on pre-generated proposals, OpenTrack3D employs a novel visual-spatial tracker to construct cross-view consistent object proposals online. Given an RGB-D stream, our pipeline first leverages a 2D open-vocabulary segmenter to generate masks, which are lifted to 3D point clouds using depth. Mask-guided instance features are then extracted using DINO feature maps, and our tracker fuses visual and spatial cues to maintain instance consistency. The core pipeline is entirely mesh-free, yet we also provide an optional superpoints refinement module to further enhance performance when scene mesh is available. Finally, we replace CLIP with a multi-modal large language model (MLLM), significantly enhancing compositional reasoning for complex user queries. Extensive experiments on diverse benchmarks, including ScanNet200, Replica, ScanNet++, and SceneFun3D, demonstrate state-of-the-art performance and strong generalization capabilities.
中文标题/摘要
标题:OpenTrack3D:朝向准确且泛化的开放词汇3D实例分割
将开放词汇3D实例分割(OV-3DIS)泛化到多样、无结构且无网格的环境中对于机器人技术和AR/VR至关重要,但仍然是一个重大挑战。我们将其归因于现有方法的两个关键限制:(1)提案生成依赖于数据集特定的提案网络或基于网格的超点,使其在无网格场景中不适用,并限制了对新场景的泛化;(2)基于CLIP的分类器的弱文本推理能力,难以识别组合性和功能性用户查询。为了解决这些问题,我们提出了OpenTrack3D,这是一种可泛化且准确的框架。与依赖预先生成提案的方法不同,OpenTrack3D采用了一种新颖的视觉-空间跟踪器,在线构建跨视图一致的对象提案。给定一个RGB-D流,我们的流水线首先利用2D开放词汇分割器生成掩码,然后使用深度信息将这些掩码提升到3D点云。掩码引导的实例特征随后使用DINO特征图提取,我们的跟踪器融合视觉和空间线索以保持实例一致性。核心流水线完全无网格,但我们还提供了一个可选的超点细化模块,当场景网格可用时,可以进一步增强性能。最后,我们用多模态大型语言模型(MLLM)取代了CLIP,显著增强了对复杂用户查询的组合性推理能力。在包括ScanNet200、Replica、ScanNet++和SceneFun3D在内的多种基准上的广泛实验表明,该方法具有最先进的性能和强大的泛化能力。
Summary / 总结
OpenTrack3D addresses the challenge of open-vocabulary 3D instance segmentation in diverse and unstructured environments by introducing a novel visual-spatial tracker that generates cross-view consistent object proposals online. The framework uses a 2D open-vocabulary segmenter to generate masks, which are then lifted to 3D point clouds. Instance features are extracted using DINO feature maps, and a tracker fuses visual and spatial cues to maintain instance consistency. Experiments on various benchmarks show that OpenTrack3D achieves state-of-the-art performance and strong generalization capabilities. The method also includes an optional superpoints refinement module for enhanced performance in mesh-based scenes and replaces CLIP with a multi-modal large language model to improve compositional reasoning for complex queries.
OpenTrack3D通过引入一种新型的视觉-空间跟踪器来在线构建跨视图一致的对象提议,解决了将开放词汇3D实例分割推广到多样且未结构化环境的挑战。该框架利用2D开放词汇分割器生成掩码,然后将这些掩码提升到3D点云。使用DINO特征图提取实例特征,并通过融合视觉和空间线索来保持实例一致性。在各种基准上的实验表明,OpenTrack3D在性能上超过了现有方法,并且具有很强的泛化能力。
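The step of lifting 2D masks to 3D points using depth can be illustrated with standard pinhole back-projection. This is a hedged sketch rather than the OpenTrack3D implementation: the function name `lift_mask_to_points`, the intrinsics `fx`, `fy`, `cx`, `cy`, and the rule of dropping non-positive depth values are all assumptions.

```python
def lift_mask_to_points(mask, depth, fx, fy, cx, cy):
    """Back-project masked pixels into camera-frame 3D points.

    mask and depth are row-major 2D lists of equal shape; depth is in metres.
    """
    points = []
    for v, (mask_row, depth_row) in enumerate(zip(mask, depth)):
        for u, (inside, z) in enumerate(zip(mask_row, depth_row)):
            if not inside or z <= 0:
                continue  # assumption: invalid depth is encoded as <= 0
            # Pinhole camera model: pixel (u, v) with depth z maps to (x, y, z).
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append((x, y, z))
    return points


# Toy 2x2 frame with hypothetical intrinsics.
mask = [[0, 1], [1, 0]]
depth = [[2.0, 2.0], [0.0, 4.0]]
points = lift_mask_to_points(mask, depth, fx=1.0, fy=1.0, cx=0.0, cy=0.0)
```

Only the pixel at `(u=1, v=0)` is both inside the mask and has valid depth, so `points` holds a single point. A full pipeline would also transform the points into a shared world frame using the camera pose, which this sketch omits.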
Understanding Task Transfer in Vision-Language Models
Authors: Bhuvan Sachdeva, Karan Uppal, Abhinav Java, Vineeth N. Balasubramanian
Venue: CVPR 2026 Oral
First: 2025-11-24T05:37:52+00:00 · Latest: 2026-04-09T10:41:06+00:00
Comments: CVPR 2026 (Oral)
Abstract
Vision-Language Models (VLMs) perform well on multimodal benchmarks but lag behind humans and specialized models on visual perception tasks like depth estimation or object counting. Finetuning on one task can unpredictably affect performance on others, making task-specific finetuning challenging. In this paper, we address this challenge through a systematic study of task transferability. We examine how finetuning a VLM on one perception task affects its zero-shot performance on others. We introduce Perfection Gap Factor (PGF), a normalized metric that measures change in performance as a result of task transfer. We utilize PGF to compute Task Transferability, which captures both the breadth and the magnitude of transfer induced by a source task. Using three open-weight VLMs evaluated across 13 perception tasks, we construct a task transfer graph that reveals previously unobserved relationships among perception tasks. Our analysis uncovers patterns of positive and negative transfer, identifies groups of tasks that mutually influence each other, organizes tasks into personas based on their transfer behavior and demonstrates how PGF can guide data selection for more efficient training. These findings highlight both opportunities for positive transfer and risks of negative interference, offering actionable guidance for advancing VLMs.
中文标题/摘要
标题:理解视觉语言模型的任务迁移
视觉语言模型(VLMs)在多模态基准测试中表现良好,但在深度估计或物体计数等视觉感知任务上落后于人类和专门模型。在一项任务上的微调可能会不可预测地影响其他任务的表现,使得针对特定任务的微调具有挑战性。在本文中,我们通过系统研究任务迁移性来应对这一挑战。我们研究了在一项感知任务上微调VLM如何影响其在其他任务上的零样本表现。我们引入了完美差距因子(PGF),这是一种归一化度量,用于衡量由于任务迁移而导致的表现变化。我们利用PGF计算任务迁移性,该指标捕捉了由源任务引起的迁移的广度和幅度。使用三个开源VLM在13项感知任务上的评估,我们构建了一个任务迁移图,揭示了感知任务之间以前未被观察到的关系。我们的分析揭示了正迁移和负迁移的模式,确定了相互影响的任务组,根据其迁移行为将任务组织成不同的角色,并展示了PGF如何指导数据选择以实现更高效的训练。这些发现突显了正迁移的机会和负干扰的风险,为推进VLM提供了可操作的指导。
Summary / 总结
This paper investigates how fine-tuning a Vision-Language Model (VLM) on one perception task affects its performance on others. The authors introduce the Perfection Gap Factor (PGF) to measure changes in performance due to task transfer and define Task Transferability to capture both the breadth and magnitude of such effects. Using three open-weight VLMs and 13 perception tasks, they construct a task transfer graph that reveals new relationships among tasks, identifies patterns of positive and negative transfer, and organizes tasks into personas based on their transfer behavior. This study provides actionable insights for improving VLMs by guiding data selection for more efficient training.
本文研究了对视觉语言模型(VLM)进行一个感知任务的微调如何影响其在其他任务上的表现。引入了完美差距因子(PGF)来衡量任务转移导致的性能变化,并计算任务转移性以捕捉转移的广度和幅度。使用三个开放权重的VLM和13个感知任务,构建了一个任务转移图,揭示了新的关系和正向与负向转移的模式,指导更高效的训练数据选择。
AtlasOCR: Building the First Open-Source Darija OCR Model with Vision Language Models
Authors: Imane Momayiz, Soufiane Ait Elaouad, Abdeljalil Elmajjodi, Haitame Bouanane
First: 2026-04-09T10:38:23+00:00 · Latest: 2026-04-09T10:38:23+00:00
Abstract
Darija, the Moroccan Arabic dialect, is rich in visual content yet lacks specialized Optical Character Recognition (OCR) tools. This paper introduces AtlasOCR, the first open-source Darija OCR model built by fine-tuning a 3B parameter Vision Language Model (VLM). We detail our comprehensive approach, from curating a unique Darija-specific dataset leveraging both synthetic generation with our OCRSmith library and carefully sourced real-world data, to implementing efficient fine-tuning strategies. We utilize QLoRA and Unsloth for parameter-efficient training of Qwen2.5-VL 3B and present comprehensive ablation studies optimizing key hyperparameters. Our evaluation on the newly curated AtlasOCRBench and the established KITAB-Bench demonstrates state-of-the-art performance, challenging larger models and highlighting AtlasOCR's robustness and generalization capabilities for both Darija and standard Arabic OCR tasks.
中文标题/摘要
标题:AtlasOCR:使用视觉语言模型构建首个开源达里雅OCR模型
达里雅,摩洛哥阿拉伯方言,富含视觉内容但缺乏专门的光学字符识别(OCR)工具。本文介绍了AtlasOCR,这是首个使用3B参数视觉语言模型(VLM)微调构建的开源达里雅OCR模型。我们详细介绍了从利用OCRSmith库进行合成生成和精心收集的真实世界数据构建独特的达里雅专用数据集的方法,到实施高效的微调策略。我们使用QLoRA和Unsloth对Qwen2.5-VL 3B进行参数高效训练,并进行了全面的消融研究以优化关键超参数。我们在新构建的AtlasOCRBench和已建立的KITAB-Bench上的评估显示了最先进的性能,挑战了更大的模型,并突显了AtlasOCR在达里雅和标准阿拉伯OCR任务中的鲁棒性和泛化能力。
Summary / 总结
The paper addresses the lack of specialized OCR tools for Darija, the Moroccan Arabic dialect, by introducing AtlasOCR, the first open-source Darija OCR model. The model is built by fine-tuning a 3B-parameter Vision Language Model (Qwen2.5-VL 3B) with QLoRA and Unsloth, on a dataset that combines synthetic data generated with the OCRSmith library and carefully sourced real-world data. Evaluation on the newly curated AtlasOCRBench and the established KITAB-Bench shows state-of-the-art performance, demonstrating robustness and generalization on both Darija and standard Arabic OCR tasks.
论文旨在解决摩洛哥阿拉伯方言Darija缺乏专门的OCR工具的问题,通过引入AtlasOCR,这是第一个开源的Darija OCR模型。该模型通过微调一个3B参数的Vision Language Model构建。作者使用合成生成和真实世界数据共同构建了一个独特的Darija数据集,并采用了高效的微调策略,如QLoRA和Unsloth。该模型在AtlasOCRBench和KITAB-Bench上表现出最先进的性能,展示了其在Darija和标准阿拉伯OCR任务中的鲁棒性和泛化能力。
Mitigating Visual Context Degradation in Large Multimodal Models: A Training-Free Decoupled Agentic Framework
Authors: Hongrui Jia, Chaoya Jiang, Shikun Zhang, Wei Ye
First: 2025-09-27T14:13:41+00:00 · Latest: 2026-04-09T10:37:26+00:00
Abstract
With the continuous expansion of Large Language Models (LLMs) and advances in reinforcement learning, LLMs have demonstrated exceptional reasoning capabilities, enabling them to address a wide range of complex problems. Inspired by these achievements, researchers have extended related techniques to Large Multimodal Models (LMMs). However, a critical limitation has emerged, reflected in the progressive loss of visual grounding. As the reasoning chain grows longer, LMMs tend to rely increasingly on the textual information generated in earlier steps, while the initially extracted visual information is rarely revisited or incorporated. This phenomenon often causes the reasoning process to drift away from the actual image content, resulting in visually implausible or even erroneous conclusions. To overcome this fundamental limitation, we propose a novel, training-free agentic paradigm that Decouples cognitive Reasoning from visual Perception (DRP). In this framework, a powerful LLM serves as a strategic Reasoner, orchestrating the inference process by explicitly querying an LMM, acting as a dedicated Observer, to retrieve fine-grained visual details on demand. This approach is lightweight, model-agnostic, and plug-and-play, necessitating no additional training or architectural modifications. Extensive experiments demonstrate our framework DRP's efficacy in regulating the visual reasoning trajectory, significantly mitigating reasoning drift, and enforcing robust visual grounding. Notably, on the MathVision benchmark, the integration of Qwen2.5-VL-7B and Qwen3-32B achieves an accuracy of 47.2%, outperforming GPT-4o's 40.6%. These findings underscore the potential of our approach to enhance multimodal reasoning reliability without the need for costly retraining. Our code is publicly available at https://github.com/hongruijia/DRP.
中文标题/摘要
标题:在大型多模态模型中缓解视觉上下文退化:一种无需训练的解耦代理框架
随着大型语言模型(LLMs)的不断扩展和强化学习的进步,LLMs 展现出了卓越的推理能力,使其能够解决各种复杂问题。受这些成就的启发,研究人员将相关技术扩展到了大型多模态模型(LMMs)中。然而,一个关键的限制出现了,表现为视觉定位的逐步丧失。随着推理链的延长,LMMs 越来越依赖于早期步骤生成的文本信息,而最初提取的视觉信息很少被重新访问或整合。这种现象往往导致推理过程偏离实际图像内容,产生视觉上不合理甚至错误的结论。为克服这一根本限制,我们提出了一种新颖的、无需训练的代理范式,即解耦认知推理与视觉感知(DRP)。在该框架中,一个强大的 LLM 作为战略推理者,通过显式查询 LMM(作为专门的观察者)来按需检索细粒度的视觉细节,从而协调推理过程。该方法轻量级、模型无关且即插即用,无需额外的训练或架构修改。大量实验表明,我们的框架 DRP 在调节视觉推理轨迹、显著减轻推理漂移和确保稳健的视觉定位方面具有有效性。值得注意的是,在 MathVision 基准测试中,Qwen2.5-VL-7B 和 Qwen3-32B 的结合实现了 47.2% 的准确率,优于 GPT-4o 的 40.6%。这些发现强调了我们方法在无需昂贵的重新训练的情况下增强多模态推理可靠性的潜力。我们的代码已公开发布在 https://github.com/hongruijia/DRP。
Summary / 总结
The paper addresses the progressive loss of visual grounding in Large Multimodal Models (LMMs) as reasoning chains grow longer. It proposes DRP, a training-free agentic framework that Decouples cognitive Reasoning from visual Perception. In DRP, a powerful LLM acts as a strategic Reasoner and queries an LMM, serving as a dedicated Observer, for fine-grained visual details on demand, which strengthens visual grounding and reduces reasoning drift. On the MathVision benchmark, combining Qwen2.5-VL-7B with Qwen3-32B reaches 47.2% accuracy, compared with 40.6% for GPT-4o. The approach is lightweight, model-agnostic, and requires no additional training or architectural changes.
论文提出了一种无需训练的代理框架DRP(将认知推理与视觉感知解耦),以解决大型多模态模型(LMMs)中的视觉上下文退化问题。该框架利用强大的LLM作为战略推理者,通过按需查询LMM(作为观察者)获取细粒度的视觉细节,从而减轻推理漂移并增强视觉定位。实验表明,在MathVision基准测试中,Qwen2.5-VL-7B与Qwen3-32B的组合达到47.2%的准确率,优于GPT-4o的40.6%。该方法轻量级且模型无关,无需额外的训练或架构修改。
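The Reasoner/Observer division described above can be sketched as a simple control loop. This is an illustrative sketch only, not the code from the DRP repository: `run_reasoner_observer`, the `("query", ...)` / `("answer", ...)` message convention, and the toy stand-ins for the two models are all hypothetical.

```python
def run_reasoner_observer(question, reasoner, observer, max_turns=5):
    """Alternate between a text-only reasoner and an image-grounded observer.

    reasoner(question, notes) returns ("query", text) or ("answer", text).
    observer(query) returns a textual description of the requested visual detail.
    """
    notes = []  # accumulated (query, observation) pairs
    for _ in range(max_turns):
        kind, text = reasoner(question, notes)
        if kind == "answer":
            return text, notes
        # The reasoner asked for visual evidence: consult the observer again,
        # so the image is revisited instead of relying on earlier text alone.
        notes.append((text, observer(text)))
    return None, notes  # turn budget exhausted without a final answer


# Toy stand-ins for the LLM (reasoner) and the LMM (observer).
def toy_reasoner(question, notes):
    if not notes:
        return ("query", "How many sides does the shape have?")
    return ("answer", f"The shape has {notes[-1][1]} sides.")


def toy_observer(query):
    return "3"


answer, notes = run_reasoner_observer("What shape is shown?", toy_reasoner, toy_observer)
```

In a real system the two callables would wrap model inference calls; the loop structure is the only part this sketch is meant to convey.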
3DrawAgent: Teaching LLM to Draw in 3D with Early Contrastive Experience
Authors: Hongcan Xiao, Xinyue Xiao, Yilin Wang, Yue Zhang, Yonggang Qi
Venue: CVPR 2026 Highlight
First: 2026-04-09T09:47:00+00:00 · Latest: 2026-04-09T09:47:00+00:00
Comments: CVPR 2026 Highlight
Abstract
Sketching in 3D space enables expressive reasoning about shape, structure, and spatial relationships, yet generating 3D sketches through natural language remains a major challenge. In this work, we introduce 3DrawAgent, a training-free, language-driven framework for 3D sketch generation that leverages large language models (LLMs) to sequentially draw 3D Bezier curves under geometric feedback. Unlike prior 2D sketch agents, our method introduces a relative experience optimization strategy that adapts the recently proposed Group Reward Policy Optimization (GRPO) paradigm. Instead of relying on explicit ground-truth supervision, we construct pairwise comparisons among generated sketches, with each pair consisting of a relatively better and a worse result based on CLIP-based perceptual rewards and LLM-based fine-grained qualitative assessment. These experiences are then used to iteratively refine the prior knowledge of 3D drawing, enabling black-box reinforcement of the model's 3D awareness. This design allows our model to self-improve its spatial understanding and drawing quality without parameter updates. Experiments show that 3DrawAgent can generate complex and coherent 3D Bezier sketches from diverse textual prompts, exhibit emergent geometric reasoning, and generalize to novel shapes, establishing a new paradigm for advancing the field of training-free 3D sketch intelligence.
中文标题/摘要
标题:3DrawAgent:通过早期对比体验教学LLM在三维空间中绘画
在三维空间中绘制可以表达关于形状、结构和空间关系的推理,然而通过自然语言生成三维草图仍然是一个重大挑战。本文介绍了一种无需训练、由语言驱动的三维草图生成框架3DrawAgent,该框架利用大型语言模型(LLM)在几何反馈下顺序绘制三维贝塞尔曲线。与之前的二维草图代理不同,我们的方法引入了一种相对经验优化策略,该策略适应了最近提出的组奖励策略优化(GRPO)范式。我们不依赖于显式的地面真实监督,而是基于CLIP感知奖励和LLM细粒度的定性评估构建生成草图的成对比较,每对包括一个相对较好的结果和一个较差的结果。这些经验随后被用来逐步细化三维绘画的先验知识,使模型在不更新参数的情况下实现黑盒强化学习。实验表明,3DrawAgent可以从多种文本提示中生成复杂且连贯的三维贝塞尔草图,表现出新兴的几何推理能力,并能够泛化到新的形状,为推进无需训练的三维草图智能领域树立了新的范式。
Summary / 总结
3DrawAgent is a training-free framework that uses large language models to generate 3D sketches by sequentially drawing 3D Bezier curves under geometric feedback. It employs a relative experience optimization strategy adapted from the GRPO paradigm (commonly expanded as Group Relative Policy Optimization), building pairwise better/worse comparisons among generated sketches from CLIP-based perceptual rewards and LLM-based qualitative assessments. These experiences iteratively refine the model's prior knowledge of 3D drawing without parameter updates. Experiments show that it generates complex and coherent 3D sketches from diverse textual prompts, exhibits emergent geometric reasoning, and generalizes to novel shapes.
研究旨在通过自然语言生成3D草图,引入了3DrawAgent框架,利用大型语言模型在几何反馈下绘制3D贝塞尔曲线。方法采用相对经验优化策略,基于CLIP和LLM评估进行成对比较,以迭代细化3D绘图知识。实验表明,3DrawAgent可以从多种文本提示生成复杂且连贯的3D贝塞尔草图,并展示出几何推理能力,展示了其在无需参数更新的情况下推动3D草图智能发展的潜力。
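As background for the 3D Bezier strokes mentioned above, the sketch below evaluates a cubic Bezier curve in 3D using the Bernstein form. The abstract does not state the curve degree, so the cubic case is an assumption, and the helper names `cubic_bezier_3d` and `sample_curve` are hypothetical.

```python
def cubic_bezier_3d(p0, p1, p2, p3, t):
    """Point on a cubic Bezier curve at parameter t in [0, 1]."""
    s = 1.0 - t
    # Cubic Bernstein basis weights; they sum to 1 for any t.
    w = (s ** 3, 3 * s * s * t, 3 * s * t * t, t ** 3)
    return tuple(
        w[0] * a + w[1] * b + w[2] * c + w[3] * d
        for a, b, c, d in zip(p0, p1, p2, p3)
    )


def sample_curve(control_points, num_samples=5):
    """Sample a stroke as a polyline; num_samples must be at least 2."""
    p0, p1, p2, p3 = control_points
    return [
        cubic_bezier_3d(p0, p1, p2, p3, i / (num_samples - 1))
        for i in range(num_samples)
    ]


# One stroke given by four 3D control points (toy values).
stroke = [(0.0, 0.0, 0.0), (1.0, 2.0, 0.0), (2.0, 2.0, 1.0), (3.0, 0.0, 1.0)]
polyline = sample_curve(stroke, num_samples=5)
```

The curve starts at the first control point and ends at the last; the two middle points shape the stroke without lying on it. This compact parameterization is one reason Bezier curves suit text-based generation: a stroke is only twelve numbers.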
LINE: LLM-based Iterative Neuron Explanations for Vision Models
Authors: Vladimir Zaigrajew, Michał Piechota, Gaspar Sekula, Przemysław Biecek
First: 2026-04-09T09:43:26+00:00 · Latest: 2026-04-09T09:43:26+00:00
Abstract
Interpreting the concepts encoded by individual neurons in deep neural networks is a crucial step towards understanding their complex decision-making processes and ensuring AI safety. Despite recent progress in neuron labeling, existing methods often limit the search space to predefined concept vocabularies or produce overly specific descriptions that fail to capture higher-order, global concepts. We introduce LINE, a novel, training-free iterative approach tailored for open-vocabulary concept labeling in vision models. Operating in a strictly black-box setting, LINE leverages a large language model and a text-to-image generator to iteratively propose and refine concepts in a closed loop, guided by activation history. We demonstrate that LINE achieves state-of-the-art performance across multiple model architectures, yielding AUC improvements of up to 0.18 on ImageNet and 0.05 on Places365, while discovering, on average, 29% of new concepts missed by massive predefined vocabularies. Beyond identifying the top concept, LINE provides a complete generation history, which enables polysemanticity evaluation and produces supporting visual explanations that rival gradient-dependent activation maximization methods.
中文标题/摘要
标题:LINE:基于LLM的迭代神经元解释方法用于视觉模型
在深度神经网络中解释单个神经元所编码的概念是理解其复杂决策过程和确保AI安全的关键步骤。尽管在神经元标记方面取得了进展,但现有方法往往将搜索空间限制在预定义的概念词汇表中,或者产生过于具体的描述,无法捕捉到高层次的全局概念。我们提出了LINE,这是一种新颖的、无需训练的迭代方法,专门用于视觉模型中的开放词汇概念标记。在严格黑盒设置下,LINE 利用一个大型语言模型和一个文本到图像生成器,在激活历史的引导下以闭环方式迭代地提出和细化概念。我们证明,LINE 在多个模型架构上达到了最先进的性能,ImageNet 上 AUC 最多提高 0.18,Places365 上最多提高 0.05,同时平均发现了 29% 的新概念,这些概念被大规模预定义词汇表所遗漏。除了识别顶级概念外,LINE 还提供了一个完整的生成历史,这使得多义性评估成为可能,并生成了与梯度依赖激活最大化方法相媲美的支持视觉解释。
Summary / 总结
The research aims to interpret the concepts encoded by individual neurons in deep neural networks, as a step toward understanding their decision-making and ensuring AI safety. It introduces LINE, a training-free iterative method for open-vocabulary concept labeling in vision models. Operating in a strictly black-box setting, LINE uses a large language model and a text-to-image generator to propose and refine concepts in a closed loop guided by activation history. LINE outperforms existing methods, achieving AUC improvements of up to 0.18 on ImageNet and 0.05 on Places365, while discovering, on average, 29% of new concepts that massive predefined vocabularies miss. It also provides a complete generation history for polysemanticity evaluation and produces supporting visual explanations that rival gradient-dependent activation maximization methods.
研究旨在通过解释深度神经网络中单个神经元编码的概念来提高AI安全性。提出了LINE,一种无需训练的迭代方法,用于在视觉模型中标注开放词汇表的概念。通过使用大型语言模型和文本到图像生成器,LINE基于激活历史迭代地提出和改进概念。该方法达到最先进的性能,分别在ImageNet和Places365上AUC提高最多0.18和0.05,并且平均发现29%被大规模预定义词汇表遗漏的新概念。此外,它还提供了生成历史以进行多义性评估,并生成与梯度依赖激活最大化方法相当的视觉解释。
CodecSight: Leveraging Video Codec Signals for Efficient Streaming VLM Inference
Authors: Yulin Zou, Yan Chen, Wenyan Chen, JooYoung Park, Shivaraman Nitin, Luo Tao, Francisco Romero, Dmitrii Ustiugov
First: 2026-04-07T16:31:45+00:00 · Latest: 2026-04-09T09:40:36+00:00
Comments: 18 pages, 34 figures
Abstract
Video streaming analytics is a crucial workload for vision-language model serving, but the high cost of multimodal inference limits scalability. Prior systems reduce inference cost by exploiting temporal and spatial redundancy in video streams, but they target either the vision transformer (ViT) or the LLM with a limited view, leaving end-to-end opportunities untapped. Moreover, existing methods incur significant overhead to identify redundancy, either through offline profiling and training or costly online computation, making them ill-suited for dynamic real-time streams.
We present CodecSight, a codec-guided streaming video analytics system, built on a key observation that video codecs already extract the temporal and spatial structure of each stream as a byproduct of compression. CodecSight treats this codec metadata as a low-cost runtime signal to unify optimization across video decoding, visual processing, and LLM prefilling, with transmission reduction as an inherent benefit of operating directly on compressed bitstreams. This drives codec-guided patch pruning before ViT encoding and selective key-value cache refresh during LLM prefilling, both of which are fully online and do not require offline training. Experiments show that CodecSight achieves an improvement in throughput of up to 3×, and a reduction of up to 87% in GPU compute over state-of-the-art baselines, maintaining competitive accuracy with only 0-8% F1 drop.
中文标题/摘要
标题:CodecSight:利用视频编解码信号进行高效的流媒体VLM推理
视频流媒体分析是视觉语言模型服务中的关键工作负载,但多模态推理的高成本限制了其扩展性。先前的系统通过利用视频流中的时域和空域冗余来降低推理成本,但它们视角有限,只针对视觉变换器(ViT)或LLM其中之一,未能挖掘端到端的优化机会。此外,现有方法在识别冗余方面产生了显著的开销,无论是通过离线性能分析和训练还是昂贵的在线计算,这使得它们不适合动态实时流媒体。
我们提出了CodecSight,这是一种基于编解码器指导的流媒体视频分析系统,基于一个关键观察,即视频编解码器在压缩过程中已经提取了每个流的时域和空域结构。CodecSight 将这种编解码器元数据视为一种低成本的运行时信号,用于统一视频解码、视觉处理和LLM预填充的优化,直接操作压缩比特流本身具有减少传输的好处。这驱动了在ViT编码前的编解码器指导补丁修剪和在LLM预填充期间的选择性键值缓存刷新,两者都是完全在线的,不需要离线训练。实验表明,与最先进的基线相比,CodecSight 的吞吐量最多提高3倍,GPU计算最多减少87%,并且仅以0~8%的F1分数下降保持了有竞争力的准确率。
Summary / 总结
CodecSight is a codec-guided streaming video analytics system that leverages existing codec metadata to reduce inference costs for vision-language model serving. By treating codec metadata as a low-cost runtime signal, it unifies optimization across video decoding, visual processing, and LLM prefilling, through codec-guided patch pruning before ViT encoding and selective key-value cache refresh during prefilling. It achieves up to a 3× improvement in throughput and up to an 87% reduction in GPU compute compared to state-of-the-art baselines, with an F1 drop of only 0-8%.
CodecSight 利用视频编解码器信号来降低多模态推理的成本。通过将编解码器元数据视为低成本的运行时信号,它在视频解码、视觉处理和LLM预填充之间实现了优化的统一。实验表明,CodecSight 可以将吞吐量提高多达3倍,并将GPU计算量减少多达87%,同时仅导致F1分数最多下降8%。
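The idea of pruning patches before ViT encoding can be illustrated with a toy selection rule. The abstract does not specify the pruning criterion, so everything here is an assumption: the per-patch "change score" (imagined as derived from codec metadata such as motion vectors or residuals), the fixed threshold, and the function name `select_patches` are hypothetical and not CodecSight's actual mechanism.

```python
def select_patches(change_scores, threshold):
    """Return indices of patches whose change score exceeds the threshold.

    change_scores: one non-negative float per image patch (hypothetical signal).
    Patches at or below the threshold are treated as redundant and skipped.
    """
    return [i for i, score in enumerate(change_scores) if score > threshold]


# Toy frame split into 8 patches; most of the scene is static.
scores = [0.0, 0.02, 0.9, 0.4, 0.0, 0.0, 0.05, 0.7]
kept = select_patches(scores, threshold=0.1)
pruned_fraction = 1 - len(kept) / len(scores)
```

In this toy frame three of eight patches are kept, so 62.5% of patches would never reach the encoder. The appeal of a codec-derived signal is that it comes as a byproduct of decoding, whereas computing a comparable score from pixels would add work of its own.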