arXiv Paper Digest

Snapshot: 20260324_0400

Wildfire Spread Scenarios: Increasing Sample Diversity of Segmentation Diffusion Models with Training-Free Methods

Authors: Sebastian Gerard, Josephine Sullivan

Venue: Proceedings of the 7th Northern Lights Deep Learning Conference (NLDL), PMLR, Jan. 2026

First: 2026-03-20T17:59:15+00:00 · Latest: 2026-03-20T17:59:15+00:00

Comments: Accepted at NLDL 2026. This version contains small corrections compared to the initial publication, see appendix for details

Abs · PDF · Code1 · Code2 · Code3

Abstract

Predicting future states in uncertain environments, such as wildfire spread, medical diagnosis, or autonomous driving, requires models that can consider multiple plausible outcomes. While diffusion models can effectively learn such multi-modal distributions, naively sampling from these models is computationally inefficient, potentially requiring hundreds of samples to find low-probability modes that may still be operationally relevant. In this work, we address the challenge of sample-efficient ambiguous segmentation by evaluating several training-free sampling methods that encourage diverse predictions. We adapt two techniques, particle guidance and SPELL, originally designed for the generation of diverse natural images, to discrete segmentation tasks, and additionally propose a simple clustering-based technique. We validate these approaches on the LIDC medical dataset, a modified version of the Cityscapes dataset, and MMFire, a new simulation-based wildfire spread dataset introduced in this paper. Compared to naive sampling, these approaches increase the HM IoU* metric by up to 7.5% on MMFire and 16.4% on Cityscapes, demonstrating that training-free methods can be used to efficiently increase the sample diversity of segmentation diffusion models with little cost to image quality and runtime. Code and dataset: https://github.com/SebastianGer/wildfire-spread-scenarios

中文标题/摘要

标题：野火蔓延场景：使用无训练方法增加分割扩散模型的样本多样性

在不确定环境中预测未来状态，如野火蔓延、医疗诊断或自动驾驶，需要能够考虑多种可能结果的模型。虽然扩散模型可以有效地学习这种多模态分布，但简单地从这些模型中采样是计算上低效的，可能需要数百个样本才能找到可能仍具有操作意义的低概率模式。在本文中，我们通过评估几种无训练采样方法来应对样本高效模糊分割的挑战，这些方法鼓励产生多样化的预测。我们对两种技术进行了适应，即粒子引导和SPELL，这两种技术最初是为生成多样化的自然图像而设计的，并将其应用于离散分割任务，同时提出了一种简单的基于聚类的技术。我们在LIDC医疗数据集、Cityscapes数据集的修改版本以及本文中引入的新模拟野火蔓延数据集MMFire上验证了这些方法。与简单采样相比，这些方法在MMFire上将HM IoU*指标提高了最多7.5%，在Cityscapes上提高了16.4%，表明无训练方法可以以较低的图像质量和运行时间成本来有效地增加分割扩散模型的样本多样性。

Summary / 总结

This work addresses the challenge of generating diverse and operationally relevant scenarios for wildfire spread prediction using diffusion models. It evaluates several training-free sampling methods, including particle guidance, SPELL, and a clustering-based technique, adapted for discrete segmentation tasks. On the MMFire wildfire dataset and Cityscapes, these methods improve the HM IoU* metric over naive sampling by up to 7.5% and 16.4%, respectively, showing that they can increase sample diversity at little cost to image quality and runtime.

该研究旨在通过无训练方法提高野火蔓延场景中模糊分割的样本效率。作者评估了原本用于生成多样化图像的粒子引导和SPELL技术，并提出了一种基于聚类的方法。这些方法在LIDC、Cityscapes和一个新的野火蔓延数据集MMFire上进行了验证。结果表明，与简单采样相比，这些方法在MMFire和Cityscapes上的HM IoU*分别最多提高了7.5%和16.4%，表明这些方法可以在不显著损失图像质量和运行时间的情况下增强分割预测的多样性。

Adaptive Greedy Frame Selection for Long Video Understanding

Authors: Yuning Huang, Fengqing Zhu

First: 2026-03-20T17:55:32+00:00 · Latest: 2026-03-20T17:55:32+00:00

Abs · PDF · Code1 · Code2

Abstract

Large vision-language models (VLMs) are increasingly applied to long-video question answering, yet inference is often bottlenecked by the number of input frames and resulting visual tokens. Naive sparse sampling can miss decisive moments, while purely relevance-driven selection frequently collapses onto near-duplicate frames and sacrifices coverage of temporally distant evidence. We propose a question-adaptive greedy frame selection method that jointly optimizes query relevance and semantic representativeness under a fixed frame budget. Our approach constructs a 1 FPS candidate pool (capped at 1000) with exact timestamp alignment, embeds candidates in two complementary spaces (SigLIP for question relevance and DINOv2 for semantic similarity), and selects frames by greedily maximizing a weighted sum of a modular relevance term and a facility-location coverage term. This objective is normalized, monotone, and submodular, yielding a standard (1-1/e) greedy approximation guarantee. To account for question-dependent trade-offs between relevance and coverage, we introduce four preset strategies and a lightweight text-only question-type classifier that routes each query to its best-performing preset. Experiments on MLVU show consistent accuracy gains over uniform sampling and a strong recent baseline across frame budgets, with the largest improvements under tight budgets.
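The greedy selection described in the abstract above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `greedy_select`, the weight `alpha`, and the toy relevance and similarity values are assumptions, and the similarities are assumed non-negative so that the facility-location term stays monotone and submodular.

```python
import numpy as np

def greedy_select(relevance, similarity, budget, alpha=0.5):
    """Greedily pick up to `budget` frames maximizing
    alpha * (sum of relevance) + (1 - alpha) * (facility-location coverage).

    relevance:  shape (n,), e.g. question-frame scores (hypothetical inputs).
    similarity: shape (n, n), non-negative frame-frame similarities.
    """
    relevance = np.asarray(relevance, dtype=float)
    similarity = np.asarray(similarity, dtype=float)
    n = len(relevance)
    chosen = np.zeros(n, dtype=bool)
    selected = []
    # best_cover[i] = similarity of frame i to its closest selected frame
    best_cover = np.zeros(n)
    for _ in range(min(budget, n)):
        # Marginal coverage gain of adding candidate j (one column per j)
        cover_gain = np.maximum(similarity - best_cover[:, None], 0.0).sum(axis=0)
        gain = alpha * relevance + (1.0 - alpha) * cover_gain / n
        gain[chosen] = -np.inf  # never pick the same frame twice
        j = int(np.argmax(gain))
        selected.append(j)
        chosen[j] = True
        best_cover = np.maximum(best_cover, similarity[:, j])
    return selected

# Toy pool: frames 0 and 1 are near-duplicates, frame 2 is distinct.
rel = np.array([0.9, 0.9, 0.1])
sim = np.array([[1.0, 1.0, 0.0],
                [1.0, 1.0, 0.0],
                [0.0, 0.0, 1.0]])
print(greedy_select(rel, sim, budget=2, alpha=0.2))  # -> [0, 2]
```

With `alpha=1.0` the coverage term vanishes and the same call returns the two near-duplicate frames, which is the collapse that pure relevance ranking is prone to.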

中文标题/摘要

标题：长视频理解中的自适应贪婪帧选择

大型视觉-语言模型（VLMs）越来越多地应用于长视频问答，但在推理过程中，输入帧的数量和由此产生的视觉标记往往成为瓶颈。简单的稀疏采样可能会错过关键时刻，而完全基于相关性的选择则经常陷入几乎重复的帧中，并牺牲了时间上相距较远的证据的覆盖范围。我们提出了一种问题自适应的贪婪帧选择方法，该方法在固定帧预算下联合优化查询相关性和语义代表性。我们的方法构建了一个1 FPS候选池（最多1000个），具有精确的时间戳对齐，将候选者嵌入两个互补的空间（SigLIP用于问题相关性，DINOv2用于语义相似性），并通过贪婪地最大化加权和的模块化相关性项和设施位置覆盖项来选择帧。该目标是归一化的、单调的和次模的，提供了标准的（1-1/e）贪婪近似保证。为了考虑问题之间相关性和覆盖之间的依赖性权衡，我们引入了四种预设策略和一个轻量级的仅文本问题类型分类器，将每个查询路由到其表现最佳的预设。在MLVU上的实验显示，在不同帧预算下，与均匀采样和一个强大的近期基线相比，该方法在准确率上都有一致的提升，尤其是在预算紧张的情况下，提升最为显著。

Summary / 总结

This paper addresses the challenge of efficient frame selection for long-video question answering using large vision-language models. It proposes an adaptive greedy frame selection method that optimizes query relevance and semantic representativeness. The method constructs a 1 FPS candidate pool, embeds candidates in two spaces for relevance and semantic similarity, and selects frames by maximizing a weighted sum of relevance and coverage terms. Experiments show consistent accuracy gains over uniform sampling and a strong baseline, with the largest improvements under tight frame budgets.

论文针对使用大型视觉-语言模型进行长视频问答时的帧选择效率问题，提出了一种适应性的贪婪帧选择方法，该方法同时优化查询的相关性和语义的代表性。该方法构建了一个1 FPS的候选池，将候选帧嵌入到两个空间（SigLIP和DINOv2），并通过最大化相关性和覆盖性的加权和来选择帧。实验结果显示，在各种帧预算下，该方法的一致性准确率提升优于均匀采样和一个强大的基线模型，特别是在帧预算紧张的情况下，提升最为显著。

The Robot's Inner Critic: Self-Refinement of Social Behaviors through VLM-based Replanning

Authors: Jiyu Lim, Youngwoo Yoon, Kwanghyun Park

Venue: ICRA 2026

First: 2026-03-20T17:40:21+00:00 · Latest: 2026-03-20T17:40:21+00:00

Comments: Accepted to ICRA 2026. 8 pages, 9 figures, Project page: https://limjiyu99.github.io/inner-critic/

Abs · PDF · Code1 · Code2 · Project1

Abstract

Conventional robot social behavior generation has been limited in flexibility and autonomy, relying on predefined motions or human feedback. This study proposes CRISP (Critique-and-Replan for Interactive Social Presence), an autonomous framework where a robot critiques and replans its own actions by leveraging a Vision-Language Model (VLM) as a "human-like social critic." CRISP integrates (1) extraction of movable joints and constraints by analyzing the robot's description file (e.g., MJCF), (2) generation of step-by-step behavior plans based on situational context, (3) generation of low-level joint control code by referencing visual information (joint range-of-motion visualizations), (4) VLM-based evaluation of social appropriateness and naturalness, including pinpointing erroneous steps, and (5) iterative refinement of behaviors through reward-based search. This approach is not tied to a specific robot API; it can generate subtly different, human-like motions on various platforms using only the robot's structure file. In a user study involving five different robot types and 20 scenarios, including mobile manipulators and humanoids, our proposed method achieved significantly higher preference and situational appropriateness ratings compared to previous methods. This research presents a general framework that minimizes human intervention while expanding the robot's autonomous interaction capabilities and cross-platform applicability. Detailed result videos and supplementary information regarding this work are available at: https://limjiyu99.github.io/inner-critic/

中文标题/摘要

标题：机器人的内在批评家：通过基于VLM的重规划自我完善社会行为

传统的机器人社会行为生成在灵活性和自主性方面受到限制，依赖于预定义的动作或人类反馈。本研究提出了CRISP（批判与重规划以实现互动社会存在），这是一种自主框架，其中机器人通过利用视觉语言模型（VLM）作为“类似人类的社会批评家”来批判和重规划自己的行为。CRISP 结合了（1）通过分析机器人的描述文件（例如，MJCF）提取可移动关节和约束，（2）基于情境上下文生成逐步行为计划，（3）通过参考视觉信息（关节运动范围可视化）生成低级关节控制代码，（4）基于VLM评估社会适宜性和自然性，包括指出错误步骤，以及（5）通过基于奖励的搜索进行行为的迭代完善。该方法不依赖于特定的机器人API；仅使用机器人的结构文件，它可以在各种平台上生成微妙不同的、类似人类的动作。在涉及五种不同机器人类型和20种场景（包括移动操作臂和类人机器人）的用户研究中，我们提出的方法在偏好和情境适宜性评分方面显著优于先前的方法。这项研究提供了一个通用框架，最大限度地减少了人类干预，同时扩展了机器人的自主交互能力和跨平台适用性。详细的结果视频和关于此工作的补充信息可在：https://limjiyu99.github.io/inner-critic/

Summary / 总结

The study introduces CRISP (Critique-and-Replan for Interactive Social Presence), an autonomous framework where a robot critiques and refines its social behaviors using a Vision-Language Model (VLM) as a social critic. CRISP involves extracting robot joints and constraints, generating behavior plans based on context, referencing visual information for joint control, evaluating social appropriateness with VLM, and iteratively refining behaviors. The method was tested across five robot types in 20 scenarios, showing significantly higher preference and situational appropriateness ratings compared to previous methods. This framework minimizes human intervention and enhances the robot's autonomous interaction capabilities across platforms.

研究提出了CRISP框架，该框架使机器人能够利用视觉语言模型自主批评和改进其社交行为。该过程包括提取机器人关节和约束、生成行为计划、参考视觉信息进行关节控制、使用视觉语言模型评估社交适宜性和进行迭代改进。在涉及五种不同类型机器人和20个场景的用户研究中，CRISP在偏好和情境适宜性评分方面优于以往方法。

On the Cone Effect and Modality Gap in Medical Vision-Language Embeddings

Authors: David Restrepo, Miguel L Martins, Chenwei Wu, Luis Filipe Nakayama, Diego M Lopez, Stergios Christodoulidis, Maria Vakalopoulou, Enzo Ferrante

First: 2026-03-18T01:04:21+00:00 · Latest: 2026-03-20T15:22:06+00:00

Abs · PDF · Code1 · Code2

Abstract

Vision-Language Models (VLMs) exhibit a characteristic "cone effect" in which nonlinear encoders map embeddings into highly concentrated regions of the representation space, contributing to cross-modal separation known as the modality gap. While this phenomenon has been widely observed, its practical impact on supervised multimodal learning, particularly in medical domains, remains unclear. In this work, we introduce a lightweight post-hoc mechanism that keeps pretrained VLM encoders frozen while continuously controlling cross-modal separation through a single hyperparameter λ. This enables systematic analysis of how the modality gap affects downstream multimodal performance without expensive retraining. We evaluate generalist (CLIP, SigLIP) and medically specialized (BioMedCLIP, MedSigLIP) models across diverse medical and natural datasets in supervised multimodal settings. Results consistently show that reducing excessive modality gap improves downstream performance, with medical datasets exhibiting stronger sensitivity to gap modulation; however, fully collapsing the gap is not always optimal, and intermediate, task-dependent separation yields the best results. These findings position the modality gap as a tunable property of multimodal representations rather than a quantity that should be universally minimized.
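The abstract above does not spell out the post-hoc mechanism, so the following is only one plausible sketch of how a single hyperparameter could scale cross-modal separation on frozen embeddings. It assumes the gap is measured as the distance between modality centroids and that each modality is translated along the centroid difference; `adjust_gap`, `centroid_gap`, and the convention that `lam=1` keeps the original gap while `lam=0` collapses it are all hypothetical.

```python
import numpy as np

def centroid_gap(img, txt):
    """Distance between the mean image and mean text embedding."""
    return float(np.linalg.norm(img.mean(axis=0) - txt.mean(axis=0)))

def adjust_gap(img, txt, lam):
    """Translate both modalities so the centroid gap becomes lam * original.

    Assumed convention: lam = 1.0 leaves embeddings unchanged,
    lam = 0.0 makes the two centroids coincide.
    Embeddings are not re-normalized here; a real pipeline might do so.
    """
    gap = img.mean(axis=0) - txt.mean(axis=0)
    shift = 0.5 * (1.0 - lam) * gap
    return img - shift, txt + shift

# Synthetic embeddings standing in for frozen encoder outputs.
rng = np.random.default_rng(0)
img = rng.normal(size=(50, 8)) + 2.0
txt = rng.normal(size=(50, 8)) - 2.0
img_half, txt_half = adjust_gap(img, txt, lam=0.5)
print(centroid_gap(img, txt), centroid_gap(img_half, txt_half))
```

Because only a translation is applied, the encoders stay frozen and a sweep over `lam` costs no retraining, which matches the kind of analysis the abstract describes.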

中文标题/摘要

标题：关于医学视觉-语言嵌入中的圆锥效应和模态差距

视觉-语言模型（VLMs）表现出一种“圆锥效应”，其中非线性编码器将嵌入映射到表示空间中高度集中的区域，导致跨模态分离，即模态差距。尽管这一现象已被广泛观察到，但其对监督多模态学习的实际影响——特别是在医学领域——仍然不清楚。在本文中，我们引入了一种轻量级的后处理机制，该机制在冻结预训练的VLM编码器的同时，通过单一超参数{λ}连续控制跨模态分离。这使得可以系统地分析模态差距如何影响下游多模态性能，而无需昂贵的重新训练。我们在监督多模态设置中对通用（CLIP，SigLIP）和医学专门化（BioMedCLIP，MedSigLIP）模型进行了跨医学和自然数据集的评估。结果一致表明，减少过度的模态差距可以提高下游性能，医学数据集对差距调节的敏感性更强；然而，完全消除差距并不总是最优的，任务相关的中间分离效果最佳。这些发现将模态差距定位为多模态表示的可调属性，而不是应该普遍最小化的量。

Summary / 总结

This study investigates the 'cone effect' and modality gap in vision-language models, focusing on their impact on supervised multimodal learning in medical domains. By introducing a lightweight mechanism to control cross-modal separation, the researchers analyze how modality gap affects performance without retraining. Results indicate that reducing the modality gap improves downstream performance, with medical datasets showing greater sensitivity to gap modulation. However, fully collapsing the gap is not always optimal, and intermediate separation yields the best results, suggesting the modality gap is a tunable property rather than a quantity to be universally minimized.

该研究探讨了视觉-语言模型中的‘锥形效应’和模态差距对医学领域监督多模态学习的影响。通过引入一种轻量级机制来控制跨模态分离，研究人员分析了模态差距如何影响性能而无需重新训练。结果表明，减少模态差距可以提高下游性能，医学数据集对差距调节的敏感度更高。然而，完全消除差距并不总是最优的，中间程度的分离效果最佳，这表明模态差距是一个可调节的属性而非应普遍最小化的量。

3D-Consistent Multi-View Editing by Correspondence Guidance

Authors: Josef Bengtson, David Nilsson, Dong In Lee, Yaroslava Lochman, Fredrik Kahl

First: 2025-11-27T08:48:36+00:00 · Latest: 2026-03-20T14:17:43+00:00

Comments: Added experiments with FLUX.1 editing method

Abs · PDF · Code1 · Code2 · Project1

Abstract

Recent advancements in diffusion and flow models have greatly improved text-based image editing, yet methods that edit images independently often produce geometrically and photometrically inconsistent results across different views of the same scene. Such inconsistencies are particularly problematic for editing of 3D representations such as NeRFs or Gaussian splat models. We propose a training-free guidance framework that enforces multi-view consistency during the image editing process. The key idea is that corresponding points should look similar after editing. To achieve this, we introduce a consistency loss that guides the denoising process toward coherent edits. The framework is flexible and can be combined with widely varying image editing methods, supporting both dense and sparse multi-view editing setups. Experimental results show that our approach significantly improves 3D consistency compared to existing multi-view editing methods. We also show that this increased consistency enables high-quality Gaussian splat editing with sharp details and strong fidelity to user-specified text prompts. Please refer to our project page for video results: https://3d-consistent-editing.github.io/
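The key idea in the abstract above, that corresponding points should look similar after editing, can be illustrated with a toy consistency loss. This is a sketch under stated assumptions rather than the paper's loss: it compares raw pixel colours at given integer correspondences, and the names `consistency_loss`, `pts_a`, and `pts_b` are hypothetical. In a guidance setting, the gradient of such a loss would steer the denoising step.

```python
import numpy as np

def consistency_loss(view_a, view_b, pts_a, pts_b):
    """Mean squared colour difference at corresponding pixels.

    view_a, view_b: float arrays of shape (H, W, 3).
    pts_a, pts_b:   int arrays of shape (N, 2) holding (row, col) pairs;
                    row k of pts_a corresponds to row k of pts_b.
    """
    colours_a = view_a[pts_a[:, 0], pts_a[:, 1]]
    colours_b = view_b[pts_b[:, 0], pts_b[:, 1]]
    return float(np.mean((colours_a - colours_b) ** 2))

# Two tiny views in which pixel (0, 0) of A corresponds to pixel (1, 1) of B.
view_a = np.zeros((2, 2, 3))
view_b = np.zeros((2, 2, 3))
view_a[0, 0] = [1.0, 0.0, 0.0]   # edited to red in view A
view_b[1, 1] = [0.0, 0.0, 1.0]   # edited to blue in view B: inconsistent
pts_a = np.array([[0, 0]])
pts_b = np.array([[1, 1]])
print(consistency_loss(view_a, view_b, pts_a, pts_b))
```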

中文标题/摘要

标题：基于对应指导的3D一致多视图编辑

最近在扩散和流模型方面的进展极大地提高了基于文本的图像编辑效果，但独立编辑图像的方法往往会在同一场景的不同视图中产生几何和光度不一致的结果。这种不一致性特别影响3D表示的编辑，如NeRF或高斯点模型。我们提出了一种无需训练的指导框架，在图像编辑过程中强制执行多视图一致性。关键思想是编辑后对应点应该看起来相似。为此，我们引入了一致性损失，以引导去噪过程向一致的编辑方向发展。该框架具有灵活性，可以与各种不同的图像编辑方法结合使用，支持密集和稀疏的多视图编辑设置。实验结果表明，与现有的多视图编辑方法相比，我们的方法显著提高了3D一致性。我们还展示了这种增加的一致性使得高分辨率的高斯点编辑具有清晰的细节和强烈的用户指定文本提示的保真度。请参阅我们的项目页面以获取视频结果：https://3d-consistent-editing.github.io/

Summary / 总结

The paper addresses the issue of geometric and photometric inconsistencies in multi-view image editing, which is common when editing images independently. It proposes a training-free guidance framework that enforces consistency across different views by ensuring corresponding points look similar after editing. The framework uses a consistency loss to guide the denoising process, making it flexible and compatible with various image editing methods. Experiments show that this approach significantly improves 3D consistency and enables high-quality Gaussian splat editing with sharp details and strong fidelity to user prompts. The paper also includes experiments with the FLUX.1 editing method.

该论文提出了一种无需训练的引导框架，以解决多视图图像编辑中的几何和光度不一致性问题。方法通过确保编辑后对应点看起来相似来实现一致性，并使用一致性损失引导去噪过程。实验结果表明，与现有方法相比，该方法在3D一致性方面有显著改进，能够实现具有清晰细节和强烈用户文本提示一致性的高质Gaussian splat编辑。

HiPath: Hierarchical Vision-Language Alignment for Structured Pathology Report Prediction

Authors: Ruicheng Yuan, Zhenxuan Zhang, Anbang Wang, Liwei Hu, Xiangqian Hua, Yaya Peng, Jiawei Luo, Guang Yang

First: 2026-03-20T13:58:02+00:00 · Latest: 2026-03-20T13:58:02+00:00

Comments: 10 pages, 1 figure, 3 tables

Abs · PDF · Code1 · Code2

Abstract

Pathology reports are structured, multi-granular documents encoding diagnostic conclusions, histological grades, and ancillary test results across one or more anatomical sites; yet existing pathology vision-language models (VLMs) reduce this output to a flat label or free-form text. We present HiPath, a lightweight VLM framework built on frozen UNI2 and Qwen3 backbones that treats structured report prediction as its primary training objective. Three trainable modules totalling 15M parameters address complementary aspects of the problem: a Hierarchical Patch Aggregator (HiPA) for multi-image visual encoding, Hierarchical Contrastive Learning (HiCL) for cross-modal alignment via optimal transport, and Slot-based Masked Diagnosis Prediction (Slot-MDP) for structured diagnosis generation. Trained on 749K real-world Chinese pathology cases from three hospitals, HiPath achieves 68.9% strict and 74.7% clinically acceptable accuracy with a 97.3% safety rate, outperforming all baselines under the same frozen backbone. Cross-hospital evaluation confirms generalisation with only a 3.4pp drop in strict accuracy while maintaining 97.1% safety.

中文标题/摘要

标题：HiPath：分层视觉-语言对齐以预测结构化病理报告

病理报告是结构化的、多粒度的文档，编码了诊断结论、组织学分级和一个或多个解剖部位的辅助测试结果；然而，现有的病理视觉-语言模型（VLMs）将这种输出简化为一个扁平的标签或自由形式的文本。我们提出了HiPath，这是一种基于冻结的UNI2和Qwen3骨干的轻量级VLM框架，将其结构化报告预测作为主要训练目标。三个可训练模块总计1500万个参数解决了问题的不同方面：一个分层补丁聚合器（HiPA）用于多图像视觉编码，分层对比学习（HiCL）通过最优传输进行跨模态对齐，以及基于槽的掩码诊断预测（Slot-MDP）用于结构化诊断生成。HiPath在来自三家医院的749,000例真实世界中文病理病例上进行训练，实现了68.9%的严格准确率和74.7%的临床可接受准确率，安全率为97.3%，在相同的冻结骨干下优于所有基线。跨医院评估证实了泛化能力，仅在严格准确率上下降3.4个百分点，同时保持97.1%的安全率。

Summary / 总结

HiPath is a lightweight vision-language model framework designed to predict structured pathology reports. It uses a frozen UNI2 and Qwen3 backbone and includes three trainable modules: HiPA for multi-image visual encoding, HiCL for cross-modal alignment, and Slot-MDP for structured diagnosis generation. Trained on 749K Chinese pathology cases, HiPath achieves 68.9% strict and 74.7% clinically acceptable accuracy with a 97.3% safety rate, outperforming all baselines.

研究旨在通过解决病理报告的层次性和多粒度特性，提高结构化病理报告预测的准确性和安全性。HiPath 是一个轻量级的视觉-语言模型框架，包含三个可训练模块：HiPA 负责视觉编码，HiCL 负责跨模态对齐，Slot-MDP 负责结构化诊断生成。该模型基于749K份真实世界病例训练，实现了68.9%的严格准确率和74.7%的临床可接受准确率，安全率为97.3%，并优于所有基线模型，且在跨医院测试中保持了良好的泛化能力。

CARES: Context-Aware Resolution Selector for VLMs

Authors: Moshe Kimhi, Nimrod Shabtay, Raja Giryes, Chaim Baskin, Eli Schwartz

First: 2025-10-22T11:44:31+00:00 · Latest: 2026-03-20T12:24:19+00:00

Abs · PDF · Code1 · Code2

Abstract

Large vision-language models (VLMs) commonly process images at native or high resolution to remain effective across tasks. This often inflates visual tokens to 97-99% of total tokens, resulting in high compute and latency, even when low-resolution images would suffice. We introduce CARES, a Context-Aware Resolution Selector, a lightweight preprocessing module that, given an image-query pair, predicts the minimal sufficient input resolution. CARES uses a compact VLM (350M) to extract features and predict when a target pretrained VLM's response converges to its peak ability to answer correctly. Though trained as a discrete classifier over a set of optional resolutions, CARES interpolates continuous resolutions at inference for fine-grained control. Across five multimodal benchmarks spanning documents and natural images, as well as diverse target VLMs, CARES preserves task performance while reducing compute by up to 80%.
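One simple way to turn a discrete resolution classifier into a continuous prediction, as the abstract above describes at a high level, is to take the probability-weighted average of the candidate resolutions. The abstract does not say that CARES interpolates this way, so treat the scheme, the function name `continuous_resolution`, and the candidate set `[224, 448, 896]` as assumptions.

```python
import numpy as np

def continuous_resolution(logits, resolutions):
    """Softmax the class logits and return the expected resolution."""
    logits = np.asarray(logits, dtype=float)
    resolutions = np.asarray(resolutions, dtype=float)
    probs = np.exp(logits - logits.max())  # subtract max for numerical stability
    probs /= probs.sum()
    return float(np.dot(probs, resolutions))

candidates = [224, 448, 896]  # hypothetical resolution classes
# A confident low-resolution prediction stays close to 224 ...
print(continuous_resolution([8.0, 0.0, 0.0], candidates))
# ... while an uncertain one lands between the classes.
print(continuous_resolution([0.0, 0.0, 0.0], candidates))
```

The output is a resolution between the smallest and largest class, so the image can be resized to any intermediate value rather than snapped to a fixed bucket.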

中文标题/摘要

标题：CARES：面向VLMs的上下文感知分辨率选择器

大型视觉-语言模型（VLMs）通常以原生或高分辨率处理图像，以在各种任务中保持有效性。这导致视觉标记通常占总标记的97-99%，即使低分辨率图像足以使用时也会导致高计算量和延迟。我们引入了CARES（上下文感知分辨率选择器），这是一个轻量级的预处理模块，给定一个图像-查询对，预测最小足够的输入分辨率。CARES使用一个紧凑的VLM（350M）来提取特征并预测目标预训练VLM的响应何时达到其正确回答能力的峰值。尽管作为离散分类器在一组可选分辨率上进行训练，但在推理时CARES可以插值连续分辨率以实现细粒度控制。在涵盖文档和自然图像的五个跨模态基准以及各种目标VLM中，CARES在减少高达80%的计算量的同时保持了任务性能。

Summary / 总结

CARES is a context-aware resolution selector designed to reduce the computational cost of large vision-language models (VLMs) by predicting the minimal sufficient input resolution for image-query pairs. It uses a compact VLM to extract features and determine when the target VLM's response converges to its peak performance. CARES achieves up to 80% compute reduction while maintaining task performance across various benchmarks and VLMs.

CARES 是一个轻量级预处理模块，能够预测图像查询对的最小必要输入分辨率，从而在各种基准和 VLM 上将计算量最多减少 80%，同时保持任务性能。它使用紧凑型 VLM 提取特征并预测目标 VLM 的响应何时达到最佳水平，并在推理时进行连续分辨率的插值以实现精细控制。

Medical Image Spatial Grounding with Semantic Sampling

Authors: Andrew Seohwan Yu, Mohsen Hariri, Kunio Nakamura, Mingrui Yang, Xiaojuan Li, Vipin Chaudhary

Venue: MICCAI 2026

First: 2026-03-15T19:54:46+00:00 · Latest: 2026-03-20T11:58:10+00:00

Comments: 10 pages, 2 figures, under review at MICCAI 2026

Abs · PDF · Code1 · Code2

Abstract

Vision language models (VLMs) have shown significant promise in visual grounding for images as well as videos. In medical imaging research, VLMs represent a bridge between object detection and segmentation, and report understanding and generation. However, spatial grounding of anatomical structures in the three-dimensional space of medical images poses many unique challenges. In this study, we examine image modalities, slice directions, and coordinate systems as differentiating factors for vision components of VLMs, and the use of anatomical, directional, and relational terminology as factors for the language components. We then demonstrate that visual and textual prompting systems such as labels, bounding boxes, and mask overlays have varying effects on the spatial grounding ability of VLMs. To enable measurement and reproducibility, we introduce MIS-Ground, a benchmark that comprehensively tests a VLM for vulnerabilities against specific modes of Medical Image Spatial Grounding. We release MIS-Ground to the public at https://anonymous.4open.science/r/mis-ground. In addition, we present MIS-SemSam, a low-cost, inference-time, and model-agnostic optimization of VLMs that improves their spatial grounding ability with the use of Semantic Sampling. We find that MIS-SemSam improves the accuracy of Qwen3-VL-32B on MIS-Ground by 13.06%.

中文标题/摘要

标题：医学图像空间定位的语义采样

视觉语言模型（VLMs）在图像和视频的空间定位方面显示出显著的潜力。在医学成像研究中，VLMs 代表了从目标检测和分割到理解与生成的桥梁。然而，在医学图像的三维空间中对解剖结构进行空间定位提出了许多独特的挑战。在本研究中，我们探讨了图像模态、切片方向和坐标系统作为 VLM 视觉组件的区分因素，以及解剖学、方向和关系术语作为语言组件的因素。然后我们证明了标签、边界框和掩码叠加等视觉和文本提示系统对 VLM 的空间定位能力有不同的影响。为了实现测量和可重复性，我们引入了 MIS-Ground，这是一个基准测试，全面测试了 VLM 对特定医学图像空间定位模式的脆弱性。我们已将 MIS-Ground 公开发布在 https://anonymous.4open.science/r/mis-ground。此外，我们提出了 MIS-SemSam，这是一种低成本、推理时和模型无关的 VLM 优化方法，通过使用语义采样提高了其空间定位能力。我们发现 MIS-SemSam 将 Qwen3-VL-32B 在 MIS-Ground 上的准确性提高了 13.06%。

Summary / 总结

This study investigates the challenges of spatial grounding in medical images using vision language models (VLMs). It examines image modalities, slice directions, and coordinate systems as differentiating factors for vision components, and the use of anatomical, directional, and relational terminology as factors for language components. The authors introduce MIS-Ground, a benchmark for testing VLMs, and present MIS-SemSam, a low-cost optimization method that enhances spatial grounding accuracy by 13.06% on Qwen3-VL-32B.

本研究旨在利用视觉语言模型（VLMs）解决医学图像中解剖结构的空间定位问题。研究考察了图像模态、切片方向和坐标系统对VLMs的影响，以及解剖学、方向和关系术语在语言组件中的作用。作者引入了MIS-Ground，这是一个用于测试VLMs在医学图像空间定位能力的基准。此外，还提出了MIS-SemSam，这是一种低成本的优化技术，在推理时可以提高VLMs的空间定位能力。MIS-SemSam使Qwen3-VL-32B在MIS-Ground上的准确性提高了13.06%。

IsoCLIP: Decomposing CLIP Projectors for Efficient Intra-modal Alignment

Authors: Simone Magistri, Dipam Goswami, Marco Mistretta, Bartłomiej Twardowski, Joost van de Weijer, Andrew D. Bagdanov

First: 2026-03-20T11:25:55+00:00 · Latest: 2026-03-20T11:25:55+00:00

Comments: Accepted at CVPR 2026

Abs · PDF · Code1 · Code2 · Code3

Abstract

Vision-Language Models like CLIP are extensively used for inter-modal tasks which involve both visual and text modalities. However, when the individual modality encoders are applied to inherently intra-modal tasks like image-to-image retrieval, their performance suffers from the intra-modal misalignment. In this paper we study intra-modal misalignment in CLIP with a focus on the role of the projectors that map pre-projection image and text embeddings into the shared embedding space. By analyzing the form of the cosine similarity applied to projected features, and its interaction with the contrastive CLIP loss, we show that there is an inter-modal operator responsible for aligning the two modalities during training, and a second, intra-modal operator that only enforces intra-modal normalization but does nothing to promote intra-modal alignment. Via spectral analysis of the inter-modal operator, we identify an approximately isotropic subspace in which the two modalities are well-aligned, as well as anisotropic directions specific to each modality. We demonstrate that this aligned subspace can be directly obtained from the projector weights and that removing the anisotropic directions improves intra-modal alignment. Our experiments on intra-modal retrieval and classification benchmarks show that our training-free method reduces intra-modal misalignment, greatly lowers latency, and outperforms existing approaches across multiple pre-trained CLIP-like models. The code is publicly available at: https://github.com/simomagi/IsoCLIP.

中文标题/摘要

标题：IsoCLIP：分解CLIP投影器以实现高效的同模态对齐

像CLIP这样的视觉-语言模型广泛用于涉及视觉和文本模态的跨模态任务。然而，当个体模态编码器应用于如图像到图像检索等同模态任务时，它们的性能会因同模态对齐不良而受到影响。本文我们研究CLIP中的同模态对齐问题，重点关注将预投影图像和文本嵌入映射到共享嵌入空间的投影器的作用。通过分析应用于投影特征的余弦相似性的形式及其与对比CLIP损失的相互作用，我们表明，在训练过程中存在一个跨模态操作符负责对齐两种模态，而第二个同模态操作符仅执行同模态归一化，但不促进同模态对齐。通过跨模态操作符的频谱分析，我们确定了一个同轴的子空间，在该子空间中两种模态对齐良好，以及每种模态特有的各向异性方向。我们证明，这种对齐的子空间可以直接从投影器权重中获得，去除各向异性方向可以提高同模态对齐。我们在同模态检索和分类基准上的实验表明，我们的无训练方法减少了同模态对齐不良，大大降低了延迟，并在多个预训练的CLIP-like模型上优于现有方法。代码可在：https://github.com/simomagi/IsoCLIP公开获取。

Summary / 总结

The paper addresses the issue of intra-modal misalignment in CLIP models when used for image-to-image tasks. By analyzing the projectors in CLIP, the authors identify an inter-modal operator responsible for aligning modalities and an intra-modal operator that only normalizes embeddings. They propose IsoCLIP, a method that decomposes the projectors to directly obtain an aligned subspace and remove anisotropic directions, which improves intra-modal alignment and reduces latency. Experiments show that IsoCLIP outperforms existing methods on intra-modal retrieval and classification benchmarks across multiple CLIP-like models.

本文研究了CLIP模型在进行图像到图像检索等内在模态任务时出现的内在模态对齐问题。通过分析CLIP中的投影器，作者发现一个负责模态对齐的跨模态操作符和一个仅进行嵌入归一化但不促进内在模态对齐的内在模态操作符。他们提出IsoCLIP方法，通过从投影器中去除各向异性方向来提高内在模态对齐，从而降低延迟并在多个预训练CLIP模型基准测试中表现出色。

From Plausibility to Verifiability: Risk-Controlled Generative OCR for Vision-Language Models

Authors: Weile Gong, Yiping Zuo, Zijian Lu, Xin He, Weibei Fan, Chen Dai

First: 2026-03-20T09:28:09+00:00 · Latest: 2026-03-20T09:28:09+00:00

Comments: 10 pages, 5 figures, 5 tables

Abs · PDF · Code1 · Code2

Abstract

Modern vision-language models (VLMs) can act as generative OCR engines, yet open-ended decoding can expose rare but consequential failures. We identify a core deployment misalignment in generative OCR. Autoregressive decoding favors semantic plausibility, whereas OCR requires outputs that are visually grounded and geometrically verifiable. This mismatch produces severe errors, especially over-generation and unsupported substitutions, creating deployment risk even when benchmark accuracy remains high. We therefore formulate frozen VLM OCR as a selective accept/abstain problem and propose a model-agnostic Geometric Risk Controller. The controller probes multiple structured views of the same input, applies lightweight structural screening, and accepts a transcription only when cross-view consensus and stability satisfy predefined criteria, yielding a small family of operating points. Experiments on frozen VLM backbones and standard OCR benchmarks show consistent reductions in extreme-error risk and catastrophic over-generation at predictable coverage costs. Reliable deployment of generative OCR with frozen VLMs benefits from explicit system-level risk control rather than unconstrained generation.
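The selective accept/abstain idea in the abstract above can be sketched with standard-library tools. This is an illustrative stand-in, not the paper's Geometric Risk Controller: the length-ratio screen, the pairwise agreement test based on `difflib.SequenceMatcher`, the thresholds, and the name `accept_or_abstain` are all assumptions. The input is a list of transcriptions that a frozen VLM produced for several views of the same image.

```python
from collections import Counter
from difflib import SequenceMatcher
from itertools import combinations

def accept_or_abstain(transcriptions, min_agreement=0.9, max_len_ratio=1.5):
    """Return a transcription if the views agree, otherwise None (abstain)."""
    texts = [t.strip() for t in transcriptions if t.strip()]
    if len(texts) < 2:
        return None  # not enough evidence for a consensus check
    # Lightweight structural screen: a much longer output in one view
    # is treated as a sign of over-generation.
    lengths = [len(t) for t in texts]
    if max(lengths) > max_len_ratio * min(lengths):
        return None
    # Cross-view consensus: every pair of views must be near-identical.
    for a, b in combinations(texts, 2):
        if SequenceMatcher(None, a, b).ratio() < min_agreement:
            return None
    # Accept the most frequent transcription among the agreeing views.
    return Counter(texts).most_common(1)[0][0]

views_ok = ["INVOICE 2041", "INVOICE 2041", "INVOICE 2041"]
views_bad = ["INVOICE 2041",
             "INVOICE 2041 total due thank you for your business",
             "INVOICE 2041"]
print(accept_or_abstain(views_ok))   # accepted transcription
print(accept_or_abstain(views_bad))  # None: one view over-generated
```

Tightening `min_agreement` or `max_len_ratio` trades coverage for lower risk, which corresponds to moving between operating points.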

中文标题/摘要

标题：从可行性到可验证性：风险控制生成OCR技术在视觉语言模型中的应用

现代视觉语言模型（VLMs）可以作为生成OCR引擎使用，但开放式的解码可能会暴露一些罕见但后果严重的失败。我们识别出生成OCR中的一个核心部署不匹配：自回归解码倾向于语义上的可行性，而OCR需要输出具有视觉依据且几何上可验证的内容。这种不匹配导致了严重的错误，尤其是在过度生成和不支持的替换方面，即使基准准确性保持较高，也会带来部署风险。因此，我们将冻结的VLM OCR视为一种选择性接受/弃权问题，并提出了一种模型无关的几何风险控制器。控制器对同一输入的多个结构化视图进行探查，应用轻量级的结构筛选，并在跨视图一致性和稳定性满足预定义标准时才接受转录，从而产生一组操作点。在冻结的VLM主干和标准OCR基准上的实验显示，一致地减少了极端错误风险和灾难性过度生成，并且在可预测的覆盖成本下实现了这些改进。使用冻结的VLM进行生成OCR的可靠部署需要明确的系统级风险控制，而不是无约束的生成。

Summary / 总结

The study addresses the risk associated with open-ended decoding in vision-language models (VLMs) used as generative OCR engines, where semantic plausibility may conflict with visual grounding and geometric verifiability. To mitigate this, the authors propose a Geometric Risk Controller that probes multiple structured views of the same input and accepts a transcription only when cross-view consensus and stability criteria are met. Experiments on frozen VLM backbones and standard OCR benchmarks show consistent reductions in extreme-error risk and catastrophic over-generation at predictable coverage costs. The work argues that reliable deployment of generative OCR with frozen VLMs benefits from explicit system-level risk control rather than unconstrained generation.

研究关注视觉语言模型（VLMs）生成OCR时由于自回归解码倾向于语义合理性与需要视觉可验证输出之间的不匹配所导致的罕见但严重的错误风险。提出了一种几何风险控制器，根据跨视图的一致性和稳定性选择性地接受或拒绝转录，从而减少极端错误和过度生成，同时保持基准准确率。实验表明，在标准OCR基准上可以一致地减少风险并保持最小的覆盖率影响。

EagleVision: A Dual-Stage Framework with BEV-grounding-based Chain-of-Thought for Spatial Intelligence

Authors: Jiaxu Wan, Xu Wang, Mengwei Xie, Hang Zhang, Mu Xu, Yang Han, Hong Zhang, Ding Yuan, Yifan Yang

Venue: CVPR 2026

First: 2025-12-17T07:51:36+00:00 · Latest: 2026-03-20T08:25:26+00:00

Comments: Accepted by CVPR 2026

Abs · PDF · Code1 · Code2

Abstract

Video-based spatial reasoning -- such as estimating distances, judging directions, or understanding layouts from multiple views -- requires selecting informative frames and, when needed, actively seeking additional viewpoints during inference. Existing multimodal large language models (MLLMs) consume a fixed set of uniformly sampled frames and cannot request new views once reasoning begins, often missing the geometric cues necessary for reliable spatial judgments. We present EagleVision, a dual-stage framework that combines geometry-aware frame selection with active, Bird's-Eye-View (BEV)-grounded reasoning. In the first stage (macro perception), a semantics-perspective-fusion determinantal point process (SPF-DPP) selects a compact set of keyframes that jointly maximize semantic relevance and viewpoint diversity under a fixed token budget. In the second stage (micro verification), the model performs iterative spatial Chain-of-Thought: at each step it can either reason in text or predict a pose on the BEV plane to retrieve the nearest real frame, forming a closed-loop hypothesize-look-verify cycle. The querying policy is trained purely via reinforcement learning with a spatial grounding reward, requiring no human-annotated reasoning traces. On VSI-Bench and SQA3D, EagleVision achieves state-of-the-art performance among open-source vision-language models.
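The first-stage keyframe selection in the abstract above relies on a determinantal point process (DPP). The sketch below shows generic greedy MAP inference for a quality-diversity DPP, not the paper's SPF-DPP: the kernel construction, the naive log-determinant search, and the names `greedy_dpp`, `quality`, and `features` are assumptions. Here `quality` stands in for semantic relevance and `features` for viewpoint descriptors.

```python
import numpy as np

def greedy_dpp(quality, features, k):
    """Greedily add the item that most increases log det of the kernel submatrix."""
    quality = np.asarray(quality, dtype=float)
    feats = np.asarray(features, dtype=float)
    feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    # L[i, j] = q_i * cos_sim(i, j) * q_j: high quality and low redundancy
    # both increase the determinant of a selected subset.
    kernel = np.outer(quality, quality) * (feats @ feats.T)
    selected = []
    for _ in range(min(k, len(quality))):
        best, best_val = None, -np.inf
        for j in range(len(quality)):
            if j in selected:
                continue
            idx = selected + [j]
            sign, logdet = np.linalg.slogdet(kernel[np.ix_(idx, idx)])
            if sign > 0 and logdet > best_val:
                best, best_val = j, logdet
        if best is None:
            break  # every remaining item is redundant with the selection
        selected.append(best)
    return selected

# Frames 0 and 1 share almost the same viewpoint; frame 2 looks elsewhere.
quality = [1.0, 0.95, 0.6]
features = [[1.0, 0.0], [1.0, 0.05], [0.0, 1.0]]
print(greedy_dpp(quality, features, k=2))  # -> [0, 2]
```

The second pick skips the higher-quality frame 1 because it is nearly collinear with frame 0, which is the relevance-plus-diversity behaviour the abstract attributes to the keyframe stage.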

中文标题/摘要

标题：EagleVision：基于BEV定位的两阶段框架及其链式推理方法以增强空间智能

基于视频的空间推理——如估算距离、判断方向或从多视角理解布局——需要选择信息丰富的帧，并在必要时在推理过程中主动寻求新的视角。现有的多模态大型语言模型（MLLMs）消耗固定数量的均匀采样帧，推理开始后无法请求新视角，往往无法获得可靠的几何线索。我们提出了EagleVision，这是一种结合几何感知帧选择与主动、鸟瞰图（BEV）定位推理的两阶段框架。在第一阶段（宏观感知），一种语义视角融合确定性点过程（SPF-DPP）选择一组关键帧，以在固定标记预算下同时最大化语义相关性和视角多样性。在第二阶段（微观验证），模型进行迭代的空间链式推理：在每一步，它可以进行文本推理或预测BEV平面上的姿态以检索最近的真实帧，形成一个假设-查看-验证的闭环。查询策略仅通过强化学习并使用空间定位奖励进行训练，无需人工标注的推理痕迹。在VSI-Bench和SQA3D上，EagleVision实现了开源视觉-语言模型中的最佳性能。

Summary / 总结

EagleVision is a dual-stage framework that combines geometry-aware frame selection and active, BEV-grounded reasoning for video-based spatial reasoning. In the first stage, a compact set of keyframes is selected to maximize semantic relevance and viewpoint diversity. In the second stage, the model performs iterative spatial Chain-of-Thought, either reasoning in text or predicting a pose on the BEV plane to retrieve the nearest real frame, forming a closed-loop hypothesize-look-verify cycle. EagleVision achieves state-of-the-art performance on VSI-Bench and SQA3D among open-source vision-language models.

EagleVision 是一种结合几何感知帧选择和主动、BEV 定位推理的两阶段框架，用于基于视频的空间推理。第一阶段选择一组关键帧以最大化语义相关性和视角多样性。第二阶段，模型执行迭代的空间链式推理，既可以进行文本推理，也可以预测 BEV 平面上的姿态以检索最近的真实帧，形成一个假设-查看-验证的闭环。EagleVision 在 VSI-Bench 和 SQA3D 上实现了开源视觉-语言模型中的最佳性能。
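
The first-stage selection can be illustrated with a generic greedy determinantal point process (DPP) routine. This is a minimal sketch under stated assumptions, not the paper's SPF-DPP: the kernel `L[i][j] = relevance[i] * sim(view_i, view_j) * relevance[j]`, the cosine similarity over hypothetical viewpoint vectors, and the function names `det` and `greedy_dpp` are all illustrative choices, and the token budget is modeled simply as a number of frames.

```python
import math

def det(m):
    """Determinant via Gaussian elimination with partial pivoting."""
    n = len(m)
    a = [row[:] for row in m]
    d = 1.0
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(a[r][c]))
        if abs(a[p][c]) < 1e-12:
            return 0.0
        if p != c:
            a[c], a[p] = a[p], a[c]
            d = -d
        d *= a[c][c]
        for r in range(c + 1, n):
            f = a[r][c] / a[c][c]
            for k in range(c, n):
                a[r][k] -= f * a[c][k]
    return d

def greedy_dpp(relevance, views, budget):
    """Greedily add the frame that most increases det(L_S), up to `budget` frames.

    relevance: per-frame semantic relevance scores (assumed given).
    views: per-frame viewpoint vectors (hypothetical pose descriptors).
    """
    def sim(u, v):
        dot = sum(x * y for x, y in zip(u, v))
        return dot / (math.hypot(*u) * math.hypot(*v))

    n = len(relevance)
    L = [[relevance[i] * sim(views[i], views[j]) * relevance[j]
          for j in range(n)] for i in range(n)]
    chosen = []
    while len(chosen) < budget:
        best, best_det = None, 0.0
        for i in range(n):
            if i in chosen:
                continue
            idx = chosen + [i]
            d = det([[L[a][b] for b in idx] for a in idx])
            if d > best_det:
                best, best_det = i, d
        if best is None:  # no remaining frame adds volume
            break
        chosen.append(best)
    return chosen

# Frames 0 and 1 are both relevant but look from almost the same direction;
# frame 2 is less relevant but offers a new viewpoint.
selected = greedy_dpp([0.9, 0.85, 0.5],
                      [(1.0, 0.0), (0.99, 0.1), (0.0, 1.0)],
                      budget=2)
```

In this toy case the second pick is the less relevant frame with the new viewpoint rather than the near-duplicate, which is the relevance/diversity trade-off a DPP is meant to capture.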

VIRO: Robust and Efficient Neuro-Symbolic Reasoning with Verification for Referring Expression Comprehension

Authors: Hyejin Park, Junhyuk Kwon, Suha Kwak, Jungseul Ok

Venue: CVPR 2026

First: 2026-01-19T07:21:19+00:00 · Latest: 2026-03-20T08:14:05+00:00

Comments: Accepted to CVPR 2026

Abs · PDF · Code1 · Code2

Abstract

Referring Expression Comprehension (REC) aims to localize the image region corresponding to a natural language query. Recent neuro-symbolic REC approaches leverage large language models (LLMs) and vision-language models (VLMs) to perform compositional reasoning, decomposing queries into structured programs and executing them step-by-step. While such approaches achieve interpretable reasoning and strong zero-shot generalization, they assume that intermediate reasoning steps are accurate. However, this assumption causes cascading errors: false detections and invalid relations propagate through the reasoning chain, yielding high-confidence false positives even when no target is present in the image. To address this limitation, we introduce Verification-Integrated Reasoning Operators (VIRO), a neuro-symbolic framework that embeds lightweight operator-level verifiers within reasoning steps. Each operator executes and validates its output, such as object existence or spatial relationships, allowing the system to robustly handle no-target cases through verification-aware abstention. Our framework achieves state-of-the-art performance, reaching 61.1% balanced accuracy across target-present and no-target settings, and demonstrates generalization to real-world egocentric data. VIRO also shows high reliability with a program failure rate of at most 0.3%, efficient per-query runtime, and scalability through decoupled program generation and execution.

中文标题/摘要

标题：VIRO：具有验证的鲁棒高效神经符号推理以理解引用表达

引用表达理解（REC）旨在将自然语言查询对应到图像区域。最近的神经符号REC方法利用大型语言模型（LLMs）和视觉语言模型（VLMs）进行组合推理，将查询分解为结构化程序并逐步执行。虽然这些方法实现了可解释的推理和强大的零样本泛化，但它们假设中间推理步骤是准确的。然而，这种假设会导致级联错误：错误检测和无效关系通过推理链传播，即使图像中没有目标，也会产生高置信度的假阳性。为解决这一局限性，我们引入了验证集成推理操作符（VIRO），这是一种嵌入轻量级验证器于推理步骤中的神经符号框架。每个操作符执行并验证其输出，如对象存在或空间关系，从而使系统能够通过验证感知的弃权来稳健地处理无目标情况。我们的框架实现了最先进的性能，跨目标存在和无目标设置的平衡准确率达到61.1%，并展示了对真实世界第一人称数据的泛化能力。VIRO还展示了高可靠性，程序失败率不超过0.3%，每次查询运行高效，并通过解耦程序生成和执行实现可扩展性。

Summary / 总结

The paper introduces VIRO, a neuro-symbolic framework for Referring Expression Comprehension that integrates lightweight verifiers into reasoning steps to handle no-target cases robustly. This approach improves accuracy and reliability, achieving 61.1% balanced accuracy and a program failure rate of at most 0.3%, while maintaining efficient runtime and scalability.

该论文提出了VIRO，一种将轻量级验证器集成到推理步骤中的神经符号指代表达理解框架，以稳健地处理无目标情况。该框架在目标存在和无目标设置下达到61.1%的平衡准确率，程序失败率不超过0.3%，同时保持高效的运行时间和可扩展性。
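
The idea of operator-level verification with abstention can be sketched as a small program executor. This is a toy illustration only: the operators `find` and `leftmost`, the verifier `keep_confident`, and the scene format are hypothetical and are not taken from the VIRO code base.

```python
def find(label):
    """Operator: return all detections carrying the given label."""
    def op(objs):
        return [o for o in objs if o["label"] == label]
    return op

def leftmost(objs):
    """Operator: keep the detection with the smallest x coordinate."""
    return [min(objs, key=lambda o: o["x"])]

def keep_confident(threshold):
    """Verifier: keep only outputs whose score passes the threshold."""
    def verify(objs):
        return [o for o in objs if o["score"] >= threshold]
    return verify

def run_program(steps, scene):
    """Execute (operator, verifier) pairs; abstain (None) if a step verifies nothing."""
    state = scene
    for operator, verify in steps:
        state = verify(operator(state))
        if not state:
            return None  # verification-aware abstention: no target found
    return state

scene = [
    {"label": "dog", "x": 40, "score": 0.9},
    {"label": "dog", "x": 10, "score": 0.2},  # low-confidence false detection
    {"label": "cat", "x": 5, "score": 0.8},
]
dog_program = [(find("dog"), keep_confident(0.5)), (leftmost, keep_confident(0.5))]
bird_program = [(find("bird"), keep_confident(0.5)), (leftmost, keep_confident(0.5))]

dog_result = run_program(dog_program, scene)
bird_result = run_program(bird_program, scene)
```

Because each step is checked before the next one runs, the low-confidence detection never reaches `leftmost`, and the query with no target returns `None` instead of a confident but wrong box.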

WorldAgents: Can Foundation Image Models be Agents for 3D World Models?

Authors: Ziya Erkoç, Angela Dai, Matthias Nießner

Venue: www

First: 2026-03-20T07:22:41+00:00 · Latest: 2026-03-20T07:22:41+00:00

Comments: Webpage: https://ziyaerkoc.com/worldagents/ Video: https://www.youtube.com/watch?v=Mj2FqqhurdI

Abs · PDF · Code1 · Code2

Abstract

Given the remarkable ability of 2D foundation image models to generate high-fidelity outputs, we investigate a fundamental question: do 2D foundation image models inherently possess 3D world model capabilities? To answer this, we systematically evaluate multiple state-of-the-art image generation models and Vision-Language Models (VLMs) on the task of 3D world synthesis. To harness and benchmark their potential implicit 3D capability, we propose an agentic framing to facilitate 3D world generation. Our approach employs a multi-agent architecture: a VLM-based director that formulates prompts to guide image synthesis, a generator that synthesizes new image views, and a VLM-backed two-step verifier that evaluates and selectively curates generated frames from both 2D image and 3D reconstruction space. Crucially, we demonstrate that our agentic approach provides coherent and robust 3D reconstruction, producing output scenes that can be explored by rendering novel views. Through extensive experiments across various foundation models, we demonstrate that 2D models do indeed encapsulate a grasp of 3D worlds. By exploiting this understanding, our method successfully synthesizes expansive, realistic, and 3D-consistent worlds.

中文标题/摘要

标题：WorldAgents：基础图像模型能否成为3D世界模型的代理？

鉴于2D基础图像模型生成高保真输出的卓越能力，我们探讨了一个基本问题：2D基础图像模型是否天生具备3D世界模型的能力？为回答这一问题，我们系统地评估了多种最先进的图像生成模型和视觉-语言模型（VLMs）在3D世界合成任务上的表现。为了利用和评估它们潜在的隐含3D能力，我们提出了一种代理框架，以促进3D世界的生成。我们的方法采用多代理架构：基于VLM的导演制定提示以引导图像合成，生成器生成新的图像视图，以及基于VLM的两步验证器评估并有选择地筛选生成的帧，从2D图像和3D重建空间中。关键的是，我们证明我们的代理方法提供了连贯且稳健的3D重建，生成的输出场景可以通过渲染新视角进行探索。通过在各种基础模型上进行广泛的实验，我们证明2D模型确实掌握了3D世界的理解。通过利用这种理解，我们的方法成功地合成了广阔、真实且3D一致的世界。

Summary / 总结

The study investigates whether 2D foundation image models inherently possess 3D world modeling capabilities by evaluating multiple state-of-the-art models on 3D world synthesis tasks. It introduces an agentic approach using a multi-agent system, including a director, a generator, and a verifier, to facilitate 3D world generation. The results show that 2D models can produce coherent and robust 3D reconstructions, enabling the synthesis of expansive, realistic, and 3D-consistent worlds.

研究探讨了2D基础图像模型是否天然具备3D世界建模能力，通过评估最先进的图像生成模型和VLM在3D世界合成任务中的表现。提出了一种多智能体框架，包括导演、生成器和验证器，以指导和评估3D世界生成。研究显示2D模型能够生成连贯且稳健的3D重建，从而合成出广阔、逼真且3D一致的世界。

Vision-Language Attribute Disentanglement and Reinforcement for Lifelong Person Re-Identification

Authors: Kunlun Xu, Haotong Cheng, Jiangmeng Li, Xu Zou, Jiahuan Zhou

Venue: CVPR 2026

First: 2026-03-20T06:23:05+00:00 · Latest: 2026-03-20T06:23:05+00:00

Comments: Accepted by CVPR 2026

Abs · PDF · Code1 · Code2 · Code3

Abstract

Lifelong person re-identification (LReID) aims to learn from varying domains to obtain a unified person retrieval model. Existing LReID approaches typically focus on learning from scratch or a visual classification-pretrained model, while the Vision-Language Model (VLM) has shown generalizable knowledge in a variety of tasks. Although existing methods can be directly adapted to the VLM, since they only consider global-aware learning, the fine-grained attribute knowledge is underleveraged, leading to limited acquisition and anti-forgetting capacity. To address this problem, we introduce a novel VLM-driven LReID approach named Vision-Language Attribute Disentanglement and Reinforcement (VLADR). Our key idea is to explicitly model the universally shared human attributes to improve inter-domain knowledge transfer, thereby effectively utilizing historical knowledge to reinforce new knowledge learning and alleviate forgetting. Specifically, VLADR includes a Multi-grain Text Attribute Disentanglement mechanism that mines the global and diverse local text attributes of an image. Then, an Inter-domain Cross-modal Attribute Reinforcement scheme is developed, which introduces cross-modal attribute alignment to guide visual attribute extraction and adopts inter-domain attribute alignment to achieve fine-grained knowledge transfer. Experimental results demonstrate that our VLADR outperforms the state-of-the-art methods by 1.9%-2.2% and 2.1%-2.5% on anti-forgetting and generalization capacity, respectively. Our source code is available at https://github.com/zhoujiahuan1991/CVPR2026-VLADR

中文标题/摘要

标题：终身人员再识别中的视觉-语言属性解耦与强化

终身人员再识别（LReID）旨在通过学习不同领域来获得统一的人员检索模型。现有LReID方法通常从头开始学习或基于视觉分类预训练模型，而视觉-语言模型（VLM）在多种任务中展示了可泛化的知识。尽管现有方法可以直接适应VLM，但由于它们仅考虑全局感知学习，因此细粒度的属性知识被低估，导致获取能力和抗遗忘能力有限。为了解决这一问题，我们提出了一种新的VLM驱动的LReID方法，称为视觉-语言属性解耦与强化（VLADR）。我们的核心思想是明确建模普遍共享的人类属性，以提高跨域知识迁移，从而有效利用历史知识强化新知识学习并减轻遗忘。具体而言，VLADR包括一种多粒度文本属性解耦机制，用于挖掘图像的全局和多样化的局部文本属性。然后，开发了一种跨域跨模态属性强化方案，该方案通过引入跨模态属性对齐来指导视觉属性提取，并通过跨域属性对齐实现细粒度知识迁移。实验结果表明，我们的VLADR在抗遗忘能力和泛化能力上分别优于最新方法1.9%-2.2%和2.1%-2.5%。我们的源代码可在https://github.com/zhoujiahuan1991/CVPR2026-VLADR获取

Summary / 总结

The research aims to improve lifelong person re-identification by leveraging the generalizable knowledge from Vision-Language Models (VLMs). The proposed method, Vision-Language Attribute Disentanglement and Reinforcement (VLADR), introduces a Multi-grain Text Attribute Disentanglement mechanism to model universally shared human attributes and an Inter-domain Cross-modal Attribute Reinforcement scheme to enhance cross-modal attribute alignment and inter-domain knowledge transfer. Experimental results show that VLADR outperforms existing methods by 1.9%-2.2% and 2.1%-2.5% in anti-forgetting and generalization capacity respectively.

论文提出了VLADR，一种新颖的视觉-语言属性解耦和强化方法，用于终身人员再识别。该方法利用视觉-语言模型明确建模通用的人类属性，提高跨域知识迁移并缓解遗忘问题。实验结果显示，VLADR在抗遗忘和泛化能力方面分别优于现有方法1.9%-2.2%和2.1%-2.5%。

ZOO-Prune: Training-Free Token Pruning via Zeroth-Order Gradient Estimation in Vision-Language Models

Authors: Youngeun Kim, Youjia Zhang, Huiling Liu, Aecheon Jung, Sunwoo Lee, Sungeun Hong

First: 2025-09-29T14:20:05+00:00 · Latest: 2026-03-20T05:52:07+00:00

Abs · PDF · Code1 · Code2

Abstract

Large Vision-Language Models (VLMs) enable strong multimodal reasoning but incur heavy inference costs from redundant visual tokens. Token pruning alleviates this issue, yet existing approaches face limitations. Attention-based methods rely on raw attention scores, which are often unstable across layers and heads and can lead to redundant selections. Diversity-based methods improve robustness by selecting tokens far apart in feature space, but risk dropping regions needed for accurate prediction. We propose ZOO-Prune, a training-free framework built on the intuition that highly sensitive tokens have a stronger influence on the model's output and capture complementary visual cues rather than redundant ones. To achieve this, we estimate token sensitivity using zeroth-order perturbations at the lightweight projection layer. This measures how small random perturbations affect the projected features and enables efficient approximation of each token's influence without backpropagation. Extensive experiments across multiple VLMs and benchmarks show that ZOO-Prune consistently outperforms prior methods while pruning up to 94.4% of tokens without sacrificing accuracy. Our method also improves efficiency, reaching up to 2.30x faster end-to-end inference compared to the baseline.

中文标题/摘要

标题：ZOO-Prune: 无需训练的视觉-语言模型中零阶梯度估计的令牌剪枝

大型视觉-语言模型（VLMs）能够实现强大的多模态推理，但冗余的视觉令牌会导致沉重的推理成本。令牌剪枝可以缓解这一问题，但现有方法存在局限性。基于注意力的方法依赖于原始的注意力分数，这些分数在不同层和头之间往往不稳定，可能导致冗余选择。基于多样性的方法通过在特征空间中选择相距较远的令牌来提高鲁棒性，但可能会舍弃对于准确预测必要的区域。我们提出了ZOO-Prune，这是一种无需训练的框架，其基于直觉，即高度敏感的令牌对模型输出有更强的影响，并捕捉互补的视觉线索而非冗余的线索。为了实现这一点，我们使用轻量级投影层的零阶扰动来估计令牌的敏感性。这种方法通过测量小的随机扰动如何影响投影特征来衡量每个令牌的影响，并在无需反向传播的情况下进行高效近似。在多个VLMs和基准上的广泛实验表明，ZOO-Prune在剪枝高达94.4%的令牌的同时，仍能保持准确率，并且始终优于先前的方法。我们的方法还提高了效率，与基线相比，端到端推理速度可提高至2.30倍。

Summary / 总结

ZOO-Prune is a training-free token pruning method for Vision-Language Models (VLMs) that uses zeroth-order gradient estimation at the lightweight projection layer to identify highly sensitive visual tokens, which are retained while the remaining tokens are pruned. This approach avoids the instability and redundancy issues of attention-based methods and the risk of dropping necessary regions in diversity-based methods. Experiments show that ZOO-Prune prunes up to 94.4% of tokens without accuracy loss and improves end-to-end inference speed by up to 2.30x compared to the baseline.

ZOO-Prune 是一种无需训练的 Vision-Language 模型（VLM）剪枝方法，通过零阶梯度估计来识别并剪枝冗余的视觉标记。该方法通过轻量级扰动估计标记的敏感性，避免了反向传播的需求。实验表明，ZOO-Prune 可以剪枝高达 94.4% 的标记而不损失准确性，并且相比基线方法可将端到端推理速度提高至最多 2.30 倍。
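
Zeroth-order sensitivity estimation can be sketched with central finite differences along random directions, which needs only forward passes. The following is a minimal sketch under assumptions: the toy `project` layer (`tanh(Wx)`), the perturbation size, the number of directions, and the plain top-k selection are illustrative stand-ins, not the authors' exact scoring or selection rule.

```python
import math
import random

def project(token, weights):
    """Toy nonlinear projection layer tanh(W x); a hypothetical stand-in."""
    return [math.tanh(sum(w * x for w, x in zip(row, token))) for row in weights]

def sensitivity(token, weights, eps=1e-3, n_dirs=8, rng=None):
    """Average output change per unit of random input perturbation.

    Uses central differences, so no backpropagation is required.
    """
    rng = rng or random.Random(0)
    total = 0.0
    for _ in range(n_dirs):
        u = [rng.gauss(0.0, 1.0) for _ in token]
        plus = project([x + eps * d for x, d in zip(token, u)], weights)
        minus = project([x - eps * d for x, d in zip(token, u)], weights)
        total += math.dist(plus, minus) / (2 * eps)
    return total / n_dirs

def keep_top_k(tokens, weights, k):
    """Return the indices of the k most sensitive tokens, in original order."""
    rng = random.Random(0)
    scores = [sensitivity(t, weights, rng=rng) for t in tokens]
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])

identity = [[1.0, 0.0], [0.0, 1.0]]
tokens = [[5.0, 5.0], [0.0, 0.0], [4.0, -4.0]]  # tokens 0 and 2 saturate tanh
kept = keep_top_k(tokens, identity, k=1)
```

Only the token at the origin, where the toy projection still responds to small changes, survives; the saturated tokens barely move the projected features and are dropped.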

FastMMoE: Accelerating Multimodal Large Language Models through Dynamic Expert Activation and Routing-Aware Token Pruning

Authors: Guoyang Xia, Yifeng Ding, Fengfa Li, Lei Ren, Wei Chen, Fangxiang Feng, Xiaojie Wang

First: 2025-11-22T02:25:00+00:00 · Latest: 2026-03-20T05:12:24+00:00

Abs · PDF · Code1 · Code2

Abstract

Multimodal large language models (MLLMs) have achieved impressive performance, but high-resolution visual inputs result in long sequences of visual tokens and substantial inference latency. Reducing redundant visual tokens is critical to ease computational/memory burdens while preserving performance, enabling MLLM deployment in resource-constrained or latency-sensitive scenarios. Current visual token pruning methods mainly rely on attention-based redundancy analysis and are tailored to dense architectures. We propose Fast Multimodal Mixture-of-Experts (FastMMoE), a training-free acceleration framework for mixture-of-experts (MoE) based MLLMs, developed from a routing analysis perspective. FastMMoE combines two complementary strategies: (i) expert activation reduction for visual tokens to minimize unnecessary expert computation; and (ii) routing-aware token pruning that leverages similarity in routing probability distributions to identify and remove highly redundant visual tokens. Experiments on large-scale MoE-MLLMs such as DeepSeek-VL2 and InternVL3.5 demonstrate that FastMMoE can reduce FLOPs by up to 55.0% while retaining approximately 95.5% of the original performance, consistently outperforming dense-model pruning baselines including FastV and SparseVLM across multiple retention rates.

中文标题/摘要

标题：FastMMoE：通过动态专家激活和路径感知的标记剪枝加速多模态大型语言模型

多模态大型语言模型（MLLMs）已经取得了令人印象深刻的性能，但高分辨率的视觉输入导致了大量的视觉标记序列和显著的推理延迟。减少冗余的视觉标记对于减轻计算/内存负担同时保持性能至关重要，从而使得MLLM能够在资源受限或延迟敏感的场景中部署。当前的视觉标记剪枝方法主要依赖于基于注意力的冗余分析，并且针对密集架构。我们提出了Fast Multimodal Mixture-of-Experts（FastMMoE），这是一种基于混合专家（MoE）的MLLM的无训练加速框架，从路径分析的角度开发。FastMMoE 结合了两种互补策略：(i) 为了最小化不必要的专家计算，减少视觉标记的专家激活；(ii) 路径感知的标记剪枝，利用路由概率分布的相似性来识别并移除高度冗余的视觉标记。在大规模MoE-MLLMs如DeepSeek-VL2和InternVL3.5上的实验表明，FastMMoE 可以将FLOPs减少高达55.0%，同时保留约95.5%的原始性能，并且在多个保留率下始终优于包括FastV和SparseVLM在内的密集模型剪枝基线。

Summary / 总结

FastMMoE is a training-free acceleration framework for mixture-of-experts (MoE) based multimodal large language models (MLLMs) that reduces redundant visual tokens to ease computational and memory burdens while preserving performance. It combines expert activation reduction, which minimizes unnecessary expert computation for visual tokens, with routing-aware token pruning, which uses similarity in routing probability distributions to identify and remove highly redundant visual tokens. Experiments on large-scale MoE-MLLMs such as DeepSeek-VL2 and InternVL3.5 show that FastMMoE can reduce FLOPs by up to 55.0% while retaining approximately 95.5% of the original performance, outperforming dense-model pruning baselines such as FastV and SparseVLM.

FastMMoE 是一种无需训练的加速框架，用于减少多模态大型语言模型（MLLMs）的推理延迟，通过最小化不必要的专家计算和修剪冗余的视觉标记。它结合了专家激活减少和路由感知标记修剪，实现了高达55.0%的FLOPs减少，同时保持了原始性能的95.5%，并优于如FastV和SparseVLM等密集模型修剪方法。
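
Routing-aware pruning can be illustrated by treating each token's router output as a probability distribution over experts and dropping tokens whose distribution nearly duplicates one already kept. This is a minimal sketch, assuming cosine similarity and a greedy first-come rule with a hypothetical threshold; the abstract does not specify the similarity measure or the selection procedure.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of router logits."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cosine(p, q):
    """Cosine similarity between two routing distributions."""
    dot = sum(a * b for a, b in zip(p, q))
    return dot / (math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q)))

def prune_by_routing(router_logits, threshold=0.98):
    """Keep a token only if its routing distribution differs from all kept tokens."""
    probs = [softmax(l) for l in router_logits]
    kept = []
    for i, p in enumerate(probs):
        if all(cosine(p, probs[j]) < threshold for j in kept):
            kept.append(i)
    return kept

# Tokens 0 and 1 are routed almost identically; tokens 2 and 3 prefer other experts.
logits = [[2.0, 0.0, 0.0], [2.1, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 2.0]]
kept_tokens = prune_by_routing(logits)
```

Token 1 is removed because the router treats it almost exactly like token 0, while the tokens sent to different experts are preserved.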

ParallelVLM: Lossless Video-LLM Acceleration with Visual Alignment Aware Parallel Speculative Decoding

Authors: Quan Kong, Yuhao Shen, Yicheng Ji, Huan Li, Cong Wang

First: 2026-03-20T03:30:32+00:00 · Latest: 2026-03-20T03:30:32+00:00

Abs · PDF · Code1 · Code2

Abstract

Although current Video-LLMs achieve impressive performance in video understanding tasks, their autoregressive decoding efficiency remains constrained by the massive number of video tokens. Visual token pruning can partially ease this bottleneck, yet existing approaches still suffer from information loss and yield only modest acceleration in decoding. In this paper, we propose ParallelVLM, a training-free draft-then-verify speculative decoding framework that overcomes both mutual waiting and limited speedup-ratio problems between draft and target models in long-video settings. ParallelVLM features two parallelized stages that maximize hardware utilization and incorporate an Unbiased Verifier-Guided Pruning strategy to better align the draft and target models by eliminating the positional bias in attention-guided pruning. Extensive experiments demonstrate that ParallelVLM effectively expands the draft window by 1.6-1.8x with high accepted lengths, and accelerates various video understanding benchmarks by 3.36x on LLaVA-Onevision-72B and 2.42x on Qwen2.5-VL-32B compared with vanilla autoregressive decoding.

中文标题/摘要

标题：ParallelVLM：具有视觉对齐感知并行推测解码的无损视频-LLM加速

尽管当前的视频-LLMs在视频理解任务中取得了令人印象深刻的性能，但它们的自回归解码效率仍然受到大量视频标记数量的限制。视觉标记剪枝可以部分缓解这一瓶颈，但现有方法仍然存在信息丢失的问题，并且只能在解码方面取得微小的加速。在本文中，我们提出了一种无需训练的先草稿后验证的推测解码框架ParallelVLM，该框架克服了长视频设置中草稿模型和目标模型之间相互等待和速度提升比例有限的问题。ParallelVLM 包含两个并行化阶段，最大化硬件利用率，并采用无偏验证器引导剪枝策略，通过消除基于注意力的剪枝中的位置偏差，更好地对齐草稿模型和目标模型。广泛的实验表明，ParallelVLM 有效扩展了草稿窗口1.6至1.8倍，且在高接受长度下加速了各种视频理解基准测试，LLaVA-Onevision-72B 上加速了3.36倍，Qwen2.5-VL-32B 上加速了2.42倍，与传统的自回归解码相比。

Summary / 总结

ParallelVLM is a speculative decoding framework designed to accelerate Video-LLMs by addressing the limitations of autoregressive decoding. It uses two parallel stages to enhance hardware utilization and includes an Unbiased Verifier-Guided Pruning strategy to align the draft and target models. Experiments show that ParallelVLM expands the draft window by 1.6 to 1.8 times and accelerates video understanding benchmarks by 3.36 times on LLaVA-Onevision-72B and 2.42 times on Qwen2.5-VL-32B compared to traditional autoregressive decoding.

ParallelVLM 是一种无需训练的推测解码框架，旨在通过解决自回归解码的限制来加速 Video-LLMs。它使用两个并行阶段和无偏验证器引导剪枝策略来对齐草稿模型和目标模型，消除位置偏差。实验表明，与传统的自回归解码相比，ParallelVLM 将草稿窗口扩展了 1.6 到 1.8 倍，并将视频理解基准在 LLaVA-Onevision-72B 上加速了 3.36 倍，在 Qwen2.5-VL-32B 上加速了 2.42 倍。
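
The draft-then-verify loop behind speculative decoding can be sketched with two toy next-token functions. This is a generic greedy variant written for illustration, not ParallelVLM's parallelized pipeline or its pruning strategy; `target_next`, `draft_next`, and the window size are hypothetical. In a real system the target model scores the whole draft window in one forward pass, which the per-token calls below only simulate.

```python
def greedy_decode(next_token, prompt, max_new):
    """Plain autoregressive decoding: one model call per generated token."""
    out = list(prompt)
    for _ in range(max_new):
        out.append(next_token(out))
    return out[len(prompt):]

def speculative_decode(target_next, draft_next, prompt, max_new, window=4):
    """Draft `window` tokens cheaply, then keep the prefix the target agrees with.

    Returns (generated tokens, number of verification rounds).
    """
    out = list(prompt)
    rounds = 0
    while len(out) - len(prompt) < max_new:
        # Draft stage: the small model proposes a window of tokens.
        ctx, proposal = list(out), []
        for _ in range(window):
            token = draft_next(ctx)
            proposal.append(token)
            ctx.append(token)
        # Verify stage: always append the target's own token, stop at a mismatch.
        rounds += 1
        for token in proposal:
            expected = target_next(out)
            out.append(expected)
            if expected != token or len(out) - len(prompt) >= max_new:
                break
    return out[len(prompt):], rounds

def target_next(ctx):
    """Toy target model: counts modulo 5."""
    return (ctx[-1] + 1) % 5

def draft_next(ctx):
    """Toy draft model: agrees with the target except after token 3."""
    return 0 if ctx[-1] == 3 else (ctx[-1] + 1) % 5

tokens, rounds = speculative_decode(target_next, draft_next, [0], max_new=7)
baseline = greedy_decode(target_next, [0], max_new=7)
```

The output matches plain greedy decoding of the target token for token, which is the sense in which this kind of acceleration is lossless, while the seven tokens here need only two verification rounds.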

FB-CLIP: Fine-Grained Zero-Shot Anomaly Detection with Foreground-Background Disentanglement

Authors: Ming Hu, Yongsheng Huo, Mingyu Dou, Jianfu Yin, Peng Zhao, Yao Wang, Cong Hu, Bingliang Hu, Quan Wang

First: 2026-03-20T03:25:56+00:00 · Latest: 2026-03-20T03:25:56+00:00

Abs · PDF · Code1 · Code2

Abstract

Fine-grained anomaly detection is crucial in industrial and medical applications, but labeled anomalies are often scarce, making zero-shot detection challenging. While vision-language models like CLIP offer promising solutions, they struggle with foreground-background feature entanglement and coarse textual semantics. We propose FB-CLIP, a framework that enhances anomaly localization via multi-strategy textual representations and foreground-background separation. In the textual modality, it combines End-of-Text features, global-pooled representations, and attention-weighted token features for richer semantic cues. In the visual modality, multi-view soft separation along identity, semantic, and spatial dimensions, together with background suppression, reduces interference and improves discriminability. Semantic Consistency Regularization (SCR) aligns image features with normal and abnormal textual prototypes, suppressing uncertain matches and enlarging semantic gaps. Experiments show that FB-CLIP effectively distinguishes anomalies from complex backgrounds, achieving accurate fine-grained anomaly detection and localization under zero-shot settings.

中文标题/摘要

标题：FB-CLIP：细粒度零样本异常检测的前景-背景解耦

细粒度异常检测在工业和医疗应用中至关重要，但由于标注的异常样本稀缺，零样本检测极具挑战性。尽管像CLIP这样的视觉-语言模型提供了有希望的解决方案，但它们在前景-背景特征纠缠和粗粒度文本语义方面存在困难。我们提出FB-CLIP框架，通过多策略文本表示和前景-背景分离增强异常定位。在文本模态中，它结合了End-of-Text特征、全局池化表示和注意力加权的标记特征，以获得更丰富的语义线索。在视觉模态中，多视角软分离沿着身份、语义和空间维度进行，并结合背景抑制，减少了干扰并提高了可区分性。语义一致性正则化（SCR）将图像特征与正常和异常的文本原型对齐，抑制不确定匹配并扩大语义差距。实验表明，FB-CLIP在零样本设置下能够有效地区分异常与复杂背景，实现准确的细粒度异常检测和定位。

Summary / 总结

The research aims to address the challenge of fine-grained zero-shot anomaly detection in scenarios where labeled anomalies are scarce. FB-CLIP, a proposed framework, enhances anomaly localization by disentangling foreground and background features and using multi-strategy textual representations. The method combines End-of-Text features, global-pooled representations, and attention-weighted token features for richer semantic cues. In the visual modality, it employs multi-view soft separation and background suppression to reduce interference and improve discriminability. The framework also uses Semantic Consistency Regularization to align image features with normal and abnormal textual prototypes. Experimental results demonstrate that FB-CLIP can effectively distinguish anomalies from complex backgrounds, achieving accurate fine-grained anomaly detection and localization under zero-shot settings.

FB-CLIP 通过分离前景和背景特征并使用增强的文本表示来实现细粒度的零样本异常检测。它结合了多种文本特征，并在视觉模态中采用多视角软分离和背景抑制。该框架还包括语义一致性正则化，将图像特征与文本原型对齐。实验表明，FB-CLIP 可以在复杂背景下准确检测和定位异常，实现零样本设置下的精确细粒度异常检测和定位。

K-GMRF: Kinetic Gauss-Markov Random Field for First-Principles Covariance Tracking on Lie Groups

Authors: ZhiMing Li

First: 2026-03-20T03:16:36+00:00 · Latest: 2026-03-20T03:16:36+00:00

Comments: 33 pages, 13 figures

Abs · PDF · Code1 · Code2

Abstract

Tracking non-stationary covariance matrices is fundamental to vision yet hindered by existing estimators that either neglect manifold constraints or rely on first-order updates, incurring inevitable phase lag during rapid evolution. We propose K-GMRF, an online, training-free framework for covariance tracking that reformulates the problem as forced rigid-body motion on Lie groups. Derived from the Euler-Poincaré equations, our method interprets observations as torques driving a latent angular velocity, propagated via a structure-preserving symplectic integrator. We theoretically prove that this second-order dynamics achieves zero steady-state error under constant rotation, strictly superior to the proportional lag of first-order baselines. Validation across three domains demonstrates robust tracking fidelity: (i) on synthetic ellipses, K-GMRF reduces angular error by 30x compared to Riemannian EMA while maintaining stability at high speeds; (ii) on SO(3) stabilization with 20% dropout, it decreases geodesic error from 29.4° to 9.9°; and (iii) on OTB motion-blur sequences, it improves IoU from 0.55 to 0.74 on BlurCar2 with a 96% success rate. As a fully differentiable symplectic module, K-GMRF provides a plug-and-play geometric prior for data-constrained scenarios and an interpretable layer within modern deep architectures.

中文标题/摘要

标题：K-GMRF：用于李群上第一性原理协方差跟踪的动力学高斯-马尔可夫随机场

协方差矩阵的非平稳跟踪是视觉中的基本问题，但现有的估计器要么忽视流形约束，要么依赖于一阶更新，导致在快速演变过程中不可避免地产生相位滞后。我们提出K-GMRF，这是一种无需训练的在线框架，将协方差跟踪问题重新表述为李群上的强制刚体运动。我们的方法源自欧拉-庞加莱方程，将观测值解释为驱动潜在角速度的力矩，并通过结构保持的辛积分器传播。我们理论上证明，在恒定旋转下，这种二阶动力学实现了零稳态误差，严格优于一阶基线的成比例滞后。在三个领域的验证表明，K-GMRF具有稳健的跟踪精度：(i) 在合成椭圆上，与黎曼EMA相比，K-GMRF将角误差减少了30倍，同时在高速情况下保持稳定性；(ii) 在SO(3)稳定化实验中，20%的数据丢失后，它将测地线误差从29.4°降低到9.9°；(iii) 在OTB运动模糊序列中，它将BlurCar2的IoU从0.55提高到0.74，成功率高达96%。作为完全可微分的辛模块，K-GMRF为数据受限场景提供了一个即插即用的几何先验，并作为现代深度架构中的可解释层。

Summary / 总结

K-GMRF is an online, training-free framework for covariance tracking on Lie groups, addressing the limitations of existing estimators by reformulating the problem as forced rigid-body motion. It interprets observations as torques driving a latent angular velocity, which is propagated with a structure-preserving symplectic integrator, and achieves zero steady-state error under constant rotation. Experiments show that K-GMRF reduces angular error by 30x compared to Riemannian EMA on synthetic ellipses, lowers geodesic error from 29.4° to 9.9° on SO(3) stabilization with 20% dropout, and improves IoU from 0.55 to 0.74 on the OTB BlurCar2 motion-blur sequence.

K-GMRF 是一种在线、无需训练的协方差跟踪框架，通过将问题重新表述为在李群上的刚体运动来解决现有方法的局限性。它使用欧拉-庞加莱方程将观测解释为驱动潜在角速度的力矩，并在三个领域进行了验证，显示出显著的跟踪精度提升。具体来说，它在合成椭圆上将角误差减少了30倍，在SO(3)稳定化中将测地线误差从29.4°降低到9.9°，并在BlurCar2上将IoU从0.55提高到0.74，成功率高达96%。
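
The claim that second-order dynamics removes the steady-state lag of first-order updates under constant rotation can be checked on a scalar analog. The sketch below is not the paper's Lie-group method or its symplectic integrator: it compares an exponential moving average with a simple position-plus-velocity (alpha-beta) tracker on an unwrapped angle that grows at a constant rate, and the gains `alpha` and `beta` are arbitrary illustrative values.

```python
def track(observations, alpha=0.5, beta=0.1):
    """Run a first-order EMA and a second-order tracker side by side.

    Returns (first_order_estimate, second_order_estimate, velocity_estimate).
    """
    first = 0.0
    second, velocity = 0.0, 0.0
    for z in observations:
        # First-order update: move a fixed fraction toward the observation.
        first += alpha * (z - first)
        # Second-order update: predict with the latent velocity, then correct
        # both the angle and the velocity using the prediction residual.
        predicted = second + velocity
        residual = z - predicted
        second = predicted + alpha * residual
        velocity += beta * residual
    return first, second, velocity

omega = 0.1  # constant rotation rate in radians per step
observations = [omega * t for t in range(1, 501)]
first, second, velocity = track(observations)

lag_first = observations[-1] - first    # settles at (1 - alpha) * omega / alpha
lag_second = observations[-1] - second  # decays to zero
```

For the first-order tracker the error obeys `e = (1 - alpha) * (e + omega)` at steady state, giving a lag proportional to the rotation rate, whereas the velocity state in the second-order tracker absorbs the constant drift.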

Tinted Frames: Question Framing Blinds Vision-Language Models

Authors: Wan-Cyuan Fan, Jiayun Luo, Declan Kutscher, Leonid Sigal, Ritwik Gupta

First: 2026-03-19T17:53:09+00:00 · Latest: 2026-03-20T02:52:21+00:00

Comments: Preprint. Project page: https://davidhalladay.github.io/tinted_frames_demo/

Abs · PDF · Code1 · Code2 · Project1

Abstract

Vision-Language Models (VLMs) have been shown to be blind, often underutilizing their visual inputs even on tasks that require visual reasoning. In this work, we demonstrate that VLMs are selectively blind. They modulate the amount of attention applied to visual inputs based on linguistic framing even when alternative framings demand identical visual reasoning. Using visual attention as a probe, we quantify how framing alters both the amount and distribution of attention over the image. Constrained framings, such as multiple choice and yes/no, induce substantially lower attention to image context compared to open-ended framings, reduce focus on task-relevant regions, and shift attention towards uninformative tokens. We further demonstrate that this attention misallocation is the principal cause of degraded accuracy and cross-framing inconsistency. Building on this mechanistic insight, we introduce a lightweight prompt-tuning method using learnable tokens that encourages the robust, visually grounded attention patterns observed in open-ended settings, improving both visual grounding and performance across framings.

中文标题/摘要

标题：着色框：问题框架使视觉语言模型失明

视觉语言模型（VLMs）已被证明是失明的，即使在需要视觉推理的任务中，它们也经常未能充分利用视觉输入。在本研究中，我们展示了VLMs是选择性失明的。它们根据语言框架调整对视觉输入的注意力程度，即使存在其他框架要求相同的视觉推理。通过使用视觉注意力作为探针，我们量化了框架如何改变对图像的关注量及其分布。受限的框架，如多项选择和是/否，相比开放式框架，显著降低了对图像上下文的关注，减少了对任务相关区域的关注，并将注意力转移到了无信息性标记上。我们进一步证明，这种注意力分配不当是导致准确度下降和跨框架不一致的主要原因。基于这一机制洞察，我们引入了一种轻量级的提示调优方法，使用可学习标记来鼓励在开放式设置中观察到的稳健、视觉接地的注意力模式，从而提高视觉接地并改善不同框架下的性能。

Summary / 总结

This study explores why Vision-Language Models (VLMs) underutilize visual inputs, especially in tasks requiring visual reasoning. By analyzing visual attention patterns, the researchers found that VLMs adjust their attention based on the linguistic framing of questions, even when different framings require the same visual reasoning. The study shows that constrained framings like multiple choice or yes/no lead to less attention on image context and more focus on uninformative tokens, causing performance degradation. The researchers propose a lightweight prompt-tuning method to improve visual grounding and performance across different framings.

研究探讨了为什么视觉语言模型（VLMs）在需要视觉推理的任务中对视觉输入是选择性失明的。通过使用视觉注意力作为探针，研究发现，当给定如多项选择或是非题等受限框架时，VLMs会减少对图像的注意力，相比之下，开放性问题则能更好地引导视觉注意力。这种注意力分配的偏差导致了较低的准确性和不同框架之间的不一致性。研究还提出了一种使用可学习标记的提示调优方法，以鼓励在开放性问题设置中观察到的稳健且视觉导向的注意力模式，从而提高跨框架的性能。

MagicSeg: Open-World Segmentation Pretraining via Counterfactural Diffusion-Based Auto-Generation

Authors: Kaixin Cai, Pengzhen Ren, Jianhua Han, Yi Zhu, Hang Xu, Jianzhuang Liu, Xiaodan Liang

First: 2026-03-20T02:37:38+00:00 · Latest: 2026-03-20T02:37:38+00:00

Abs · PDF · Code1 · Code2 · Code3

Abstract

Open-world semantic segmentation presently relies significantly on extensive image-text pair datasets, which often suffer from a lack of fine-grained pixel annotations on sufficient categories. The acquisition of such data is rendered economically prohibitive due to the substantial investments of both human labor and time. In light of the formidable image generation capabilities of diffusion models, we introduce a novel diffusion model-driven pipeline for automatically generating datasets tailored to the needs of open-world semantic segmentation, named "MagicSeg". Our MagicSeg initiates from class labels and proceeds to generate high-fidelity textual descriptions, which in turn serve as guidance for the diffusion model to generate images. Rather than only generating positive samples for each label, our process encompasses the simultaneous generation of corresponding negative images, designed to serve as paired counterfactual samples for contrastive training. Then, to provide a self-supervised signal for open-world segmentation pretraining, our MagicSeg integrates an open-vocabulary detection model and an interactive segmentation model to extract object masks as precise segmentation labels from images based on the provided category labels. By applying our dataset to the contrastive language-image pretraining model with the pseudo mask supervision and the auxiliary counterfactual contrastive training, the downstream model obtains strong performance on open-world semantic segmentation. We evaluate our model on PASCAL VOC, PASCAL Context, and COCO, achieving SOTA with performance of 62.9%, 26.7%, and 40.2%, respectively, demonstrating our dataset's effectiveness in enhancing open-world semantic segmentation capabilities. Project website: https://github.com/ckxhp/magicseg.

中文标题/摘要

标题：MagicSeg：基于反事实扩散自动生成的开放世界分割预训练

开放世界语义分割目前很大程度上依赖于大量的图像-文本对数据集，这些数据集往往缺乏对足够类别的精细像素注释。获取此类数据由于大量的人工劳动和时间投入而变得经济上不可行。鉴于扩散模型强大的图像生成能力，我们提出了一种基于扩散模型的新颖管道，用于自动生成适应开放世界语义分割需求的数据集，命名为“MagicSeg”。我们的MagicSeg从类别标签开始，生成高保真的文本描述，这些描述作为指导，帮助扩散模型生成图像。我们的过程不仅生成每个标签的正样本，还同时生成相应的负图像，作为对比训练的反事实样本对。为了为开放世界分割预训练提供自我监督信号，我们的MagicSeg结合了开放式词汇检测模型和交互式分割模型，根据提供的类别标签从图像中提取对象掩码作为精确的分割标签。通过将我们的数据集应用于带有伪掩码监督和辅助反事实对比训练的对比语言-图像预训练模型，下游模型在开放世界语义分割上表现出色。我们在PASCAL VOC、PASCAL Context和COCO上评估了我们的模型，分别取得了62.9%、26.7%和40.2%的性能，证明了我们的数据集在提高开放世界语义分割能力方面的有效性。项目网站：https://github.com/ckxhp/magicseg.

Summary / 总结

MagicSeg is a novel pipeline for open-world semantic segmentation pretraining that uses counterfactual diffusion-based auto-generation. It starts with class labels to generate high-fidelity textual descriptions, which guide the generation of images, including both positive and negative samples for contrastive training. The generated dataset is then used for contrastive language-image pretraining with pseudo mask supervision and auxiliary counterfactual contrastive training, leading to strong performance on open-world semantic segmentation tasks. The model achieves SOTA results on PASCAL VOC, PASCAL Context, and COCO with scores of 62.9%, 26.7%, and 40.2%, respectively.

MagicSeg 是一种使用反事实扩散驱动的自动生成方法来实现开放世界语义分割预训练的新管道。它从类别标签开始生成高保真文本描述，这些描述指导扩散模型生成图像。这个过程包括生成正样本和负样本进行对比训练。MagicSeg 结合了开放词汇检测模型和交互式分割模型，从图像中提取精确的分割标签，提供自监督信号进行预训练。该模型在 PASCAL VOC、PASCAL Context 和 COCO 上分别取得了 62.9%、26.7% 和 40.2% 的最佳性能，证明了其在增强开放世界语义分割能力方面的有效性。

CurveStream: Boosting Streaming Video Understanding in MLLMs via Curvature-Aware Hierarchical Visual Memory Management

Authors: Chao Wang, Xudong Tan, Jianjian Cao, Kangcong Li, Tao Chen

First: 2026-03-20T02:28:57+00:00 · Latest: 2026-03-20T02:28:57+00:00

Abs · PDF · Code1 · Code2 · Code3

Abstract

Multimodal Large Language Models have achieved significant success in offline video understanding, yet their application to streaming videos is severely limited by the linear explosion of visual tokens, which often leads to Out-of-Memory (OOM) errors or catastrophic forgetting. Existing visual retention and memory management methods typically rely on uniform sampling, low-level physical metrics, or passive cache eviction. However, these strategies often lack intrinsic semantic awareness, potentially disrupting contextual coherence and blurring transient yet critical semantic transitions. To address these limitations, we propose CurveStream, a training-free, curvature-aware hierarchical visual memory management framework. Our approach is motivated by the key observation that high-curvature regions along continuous feature trajectories closely align with critical global semantic transitions. Based on this geometric insight, CurveStream evaluates real-time semantic intensity via a Curvature Score and integrates an online K-Sigma dynamic threshold to adaptively route frames into clear and fuzzy memory states under a strict token budget. Evaluations across diverse temporal scales confirm that this lightweight framework, CurveStream, consistently yields absolute performance gains of over 10% (e.g., 10.69% on StreamingBench and 13.58% on OVOBench) over respective baselines, establishing new state-of-the-art results for streaming video perception. The code will be released at https://github.com/streamingvideos/CurveStream.

中文标题/摘要

标题：CurveStream：通过曲率感知的分层视觉记忆管理提升MLLMs对流媒体视频的理解

多模态大型语言模型在离线视频理解方面取得了显著成功，但在流媒体视频的应用中却受到视觉标记线性爆炸的严重限制，这通常会导致内存溢出（OOM）错误或灾难性遗忘。现有的视觉保留和内存管理方法通常依赖于均匀采样、低级物理指标或被动缓存淘汰。然而，这些策略往往缺乏内在语义意识，可能会破坏上下文连贯性并模糊短暂但关键的语义过渡。为了解决这些限制，我们提出了CurveStream，这是一种无需训练、曲率感知的分层视觉记忆管理框架。我们的方法受到这样一个关键观察的启发：连续特征轨迹上的高曲率区域与关键全局语义过渡紧密对齐。基于这一几何洞察，CurveStream 通过曲率评分实时评估语义强度，并结合在线 K-Sigma 动态阈值，在严格标记预算下将帧自适应地路由到清晰和模糊的记忆状态。在不同时间尺度上的评估证实，这个轻量级框架 CurveStream 相比各自的基线始终取得超过 10% 的绝对性能提升（例如在 StreamingBench 上提升 10.69%，在 OVOBench 上提升 13.58%），建立了流媒体视频感知的新最先进的结果。代码将在 https://github.com/streamingvideos/CurveStream 发布。

Summary / 总结

CurveStream is a training-free, curvature-aware hierarchical visual memory management framework designed to enhance the streaming video understanding capabilities of Multimodal Large Language Models. Motivated by the need to manage the linear explosion of visual tokens and avoid OOM errors or catastrophic forgetting, CurveStream evaluates real-time semantic intensity using a Curvature Score and uses an online K-Sigma dynamic threshold to route frames into clear and fuzzy memory states under a strict token budget. Experiments show absolute gains of over 10% over the respective baselines (e.g., 10.69% on StreamingBench and 13.58% on OVOBench), setting new state-of-the-art results for streaming video perception.

CurveStream 是一种曲率感知的分层视觉记忆管理框架，旨在提升多模态大型语言模型在流式视频理解中的表现。该方法受到管理视觉令牌爆炸而不引发 OOM 错误或灾难性遗忘的需求驱动，通过实时评估语义强度的曲率得分和集成在线 K-Sigma 动态阈值来适应性地管理记忆状态。实验表明，CurveStream 在各种基准测试中始终比现有方法高出超过 10%，并建立了流式视频感知的新最佳结果。
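
The routing idea in the CurveStream abstract can be illustrated with a minimal sketch. The abstract does not define the Curvature Score or the K-Sigma rule, so everything below is an assumption made for illustration: curvature is taken as the turning angle between consecutive displacements of per-frame feature vectors, the threshold is a running mean plus k standard deviations of past scores, and the names `curvature_score`, `KSigmaRouter`, and `route_frames` are hypothetical. The token budget and the clear/fuzzy memory stores themselves are not modelled.

```python
import math


def curvature_score(prev, cur, nxt):
    """Turning angle (radians) between two consecutive feature displacements.

    Assumed stand-in for the paper's Curvature Score, which the abstract
    does not specify.
    """
    d1 = [b - a for a, b in zip(prev, cur)]
    d2 = [b - a for a, b in zip(cur, nxt)]
    n1 = math.sqrt(sum(x * x for x in d1))
    n2 = math.sqrt(sum(x * x for x in d2))
    if n1 == 0.0 or n2 == 0.0:
        return 0.0  # no movement in feature space: treat as flat
    cos = sum(x * y for x, y in zip(d1, d2)) / (n1 * n2)
    return math.acos(max(-1.0, min(1.0, cos)))


class KSigmaRouter:
    """Online mean/variance (Welford) with a mean + k*sigma threshold."""

    def __init__(self, k=2.0, warmup=3):
        self.k, self.warmup = k, warmup
        self.n, self.mean, self.m2 = 0, 0.0, 0.0

    def route(self, score):
        # Decide using statistics of past scores only, then update them.
        if self.n >= self.warmup:
            std = math.sqrt(self.m2 / self.n)
            label = "clear" if score > self.mean + self.k * std else "fuzzy"
        else:
            label = "fuzzy"  # not enough history yet
        self.n += 1
        delta = score - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (score - self.mean)
        return label


def route_frames(features, k=2.0, warmup=3):
    """Label every interior frame of a feature trajectory."""
    router = KSigmaRouter(k, warmup)
    labels = []
    for i in range(1, len(features) - 1):
        s = curvature_score(features[i - 1], features[i], features[i + 1])
        labels.append(router.route(s))
    return labels
```

On a trajectory that moves in a straight line and then turns sharply, only the frame at the turn exceeds the running threshold, which matches the intuition that high-curvature points mark semantic transitions.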

RobotArena $\infty$: Scalable Robot Benchmarking via Real-to-Sim Translation

Authors: Yash Jangir, Yidi Zhang, Pang-Chi Lo, Kashu Yamazaki, Chenyu Zhang, Kuan-Hsun Tu, Tsung-Wei Ke, Lei Ke, Yonatan Bisk, Katerina Fragkiadaki

First: 2025-10-27T17:41:38+00:00 · Latest: 2026-03-20T00:27:12+00:00

Comments: Website: https://robotarenainf.github.io

Abs · PDF · Code1 · Code2 · Project1

Abstract

The pursuit of robot generalists, agents capable of performing diverse tasks across diverse environments, demands rigorous and scalable evaluation. Yet real-world testing of robot policies remains fundamentally constrained: it is labor-intensive, slow, unsafe at scale, and difficult to reproduce. As policies expand in scope and complexity, these barriers only intensify, since defining "success" in robotics often hinges on nuanced human judgments of execution quality. We introduce RobotArena Infinity, a new benchmarking framework that overcomes these challenges by shifting vision-language-action (VLA) evaluation into large-scale simulated environments augmented with online human feedback. Leveraging advances in vision-language models, 2D-to-3D generative modeling, and differentiable rendering, our approach automatically converts video demonstrations from widely used robot datasets into simulated counterparts. Within these digital twins, we assess VLA policies using both automated vision-language-model-guided scoring and scalable human preference judgments collected from crowdworkers, transforming human involvement from tedious scene setup, resetting, and safety supervision into lightweight preference comparisons. To measure robustness, we systematically perturb simulated environments along multiple axes, including textures and object placements, stress-testing policy generalization under controlled variation. The result is a continuously evolving, reproducible, and scalable benchmark for real-world-trained robot manipulation policies, addressing a critical missing capability in today's robotics landscape.

中文标题/摘要

标题：RobotArena $\infty$：通过实到模拟转换实现可扩展的机器人基准测试

机器人通才，即能够在多种环境中执行多种任务的代理，需要严格的可扩展评估。然而，机器人策略的现实世界测试仍然受到根本限制：它劳动密集型，速度慢，大规模不安全，难以复制。随着策略的范围和复杂性扩大，这些障碍只会加剧，因为机器人成功往往依赖于执行质量的微妙的人类判断。我们引入了RobotArena Infinity，这是一种新的基准测试框架，通过将视觉-语言-动作（VLA）评估转移到增强有在线人类反馈的大规模模拟环境中来克服这些挑战。利用视觉-语言模型、2D到3D生成建模和可微渲染的最新进展，我们的方法自动将广泛使用的机器人数据集中的视频演示转换为模拟对应物。在这些数字孪生中，我们使用自动化的视觉-语言模型指导评分和从众包收集的可扩展的人类偏好判断来评估VLA策略，将人类参与从繁琐的场景设置、重置和安全监督转变为轻量级的偏好比较。为了衡量鲁棒性，我们系统地沿多个轴线扰动模拟环境，包括纹理和物体放置，对控制变化下的策略泛化进行压力测试。结果是一个不断演进、可复制和可扩展的基准测试，用于现实世界训练的机器人操作策略，解决了当今机器人领域的一个关键缺失能力。

Summary / 总结

RobotArena $\infty$ addresses the need for scalable evaluation of robot generalists by shifting vision-language-action (VLA) evaluation into large-scale simulated environments augmented with online human feedback. It uses advances in vision-language models, 2D-to-3D generative modeling, and differentiable rendering to convert video demonstrations from widely used robot datasets into simulated counterparts. VLA policies are assessed through automated vision-language-model-guided scoring and crowdsourced human preference judgments, and robustness is measured by perturbing the simulated environments along axes such as textures and object placements. The result is a continuously evolving, reproducible, and scalable benchmark for real-world-trained robot manipulation policies.

RobotArena $\infty$ 通过利用大规模模拟环境和在线人类反馈来评估能够在多种环境中执行各种任务的机器人策略。该框架使用视觉语言模型和可微渲染技术将现实世界的视频演示转换为模拟场景，并通过自动评分和人类偏好判断来评估策略。研究还系统地对模拟环境进行扰动以测试策略的鲁棒性。主要发现包括一个不断进化的、可重复的和可扩展的基准，用于评估现实世界训练的机器人操作策略。

Pedestrian Crossing Intent Prediction via Psychological Features and Transformer Fusion

Authors: Sima Ashayer, Hoang H. Nguyen, Yu Liang, Mina Sartipi

First: 2026-03-20T00:19:34+00:00 · Latest: 2026-03-20T00:19:34+00:00

Comments: Accepted to IEEE Intelligent Vehicles Symposium (IV) 2026. 8 pages, 3 figures

Abs · PDF · Code1 · Code2

Abstract

Pedestrian intention prediction needs to be accurate for autonomous vehicles to navigate safely in urban environments. We present a lightweight, socially informed architecture for pedestrian intention prediction. It fuses four behavioral streams (attention, position, situation, and interaction) using highway encoders, a compact 4-token Transformer, and global self-attention pooling. To quantify uncertainty, we incorporate two complementary heads: a variational bottleneck whose KL divergence captures epistemic uncertainty, and a Mahalanobis distance detector that identifies distributional shift. Together, these components yield calibrated probabilities and actionable risk scores without compromising efficiency. On the PSI 1.0 benchmark, our model outperforms recent vision language models by achieving 0.9 F1, 0.94 AUC-ROC, and 0.78 MCC by using only structured, interpretable features. On the more diverse PSI 2.0 dataset, where, to the best of our knowledge, no prior results exist, we establish a strong initial baseline of 0.78 F1 and 0.79 AUC-ROC. Selective prediction based on Mahalanobis scores increases test accuracy by up to 0.4 percentage points at 80% coverage. Qualitative attention heatmaps further show how the model shifts its cross-stream focus under ambiguity. The proposed approach is modality-agnostic, easy to integrate with vision language pipelines, and suitable for risk-aware intent prediction on resource-constrained platforms.

中文标题/摘要

标题：基于心理特征和变换器融合的行人过街意图预测

行人意图预测对于自动驾驶车辆在城市环境中安全导航至关重要。我们提出了一种轻量级、社会导向的行人意图预测架构。该架构通过 highway 编码器、紧凑的4令牌变换器和全局自注意力池化融合了四种行为流（注意力、位置、情境和互动）。为了量化不确定性，我们引入了两个互补的头部：一个变分瓶颈，其KL散度捕捉认知不确定性；以及一个马氏距离检测器，用于识别分布偏移。这些组件共同提供了校准的概率和可操作的风险评分，同时不牺牲效率。在PSI 1.0基准测试中，我们的模型仅使用结构化、可解释的特征，优于最近的视觉语言模型，实现了0.9 F1、0.94 AUC-ROC和0.78 MCC。在更多样化的PSI 2.0数据集中（据我们所知尚无先前结果），我们建立了强有力的初始基准，F1得分为0.78，AUC-ROC为0.79。基于马氏距离得分的选择性预测在80%覆盖率下将测试准确率最多提高了0.4个百分点。定性的注意力热图进一步展示了模型在不确定性下如何在跨流中转移其焦点。所提出的方法是模态无关的，易于与视觉语言管道集成，并适用于资源受限平台的风险感知意图预测。

Summary / 总结

The research aims to improve pedestrian intention prediction for autonomous vehicles in urban settings. It proposes a lightweight model that fuses four behavioral streams (attention, position, situation, and interaction) using highway encoders and a compact 4-token Transformer, reaching 0.9 F1, 0.94 AUC-ROC, and 0.78 MCC on the PSI 1.0 benchmark. On the more diverse PSI 2.0 dataset, it establishes an initial baseline of 0.78 F1 and 0.79 AUC-ROC. Uncertainty is quantified with a variational bottleneck and a Mahalanobis distance detector; selective prediction based on the Mahalanobis scores raises test accuracy by up to 0.4 percentage points at 80% coverage and provides actionable risk scores.

研究旨在提高自动驾驶车辆在城市环境中的行人意图预测。提出了一种轻量级架构，通过 highway 编码器和紧凑型Transformer融合四种行为流，并通过变分瓶颈和马氏距离检测器量化不确定性。该模型在PSI 1.0基准上表现出色，并在更具多样性的PSI 2.0数据集上建立了强大的基线，通过选择性预测提高测试准确率，并提供可操作的风险评分。
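
Selective prediction with Mahalanobis scores, as mentioned in the abstract above, can be sketched roughly as follows. This is not the authors' implementation: the feature space, the covariance estimate, and the rule of keeping the 80% of samples with the lowest distance are assumptions, and `mean_and_cov`, `solve`, `mahalanobis`, and `select` are hypothetical helper names. The sketch assumes a non-singular covariance matrix.

```python
import math


def mean_and_cov(rows):
    """Sample mean and unbiased covariance of a list of feature vectors."""
    n, d = len(rows), len(rows[0])
    mu = [sum(r[j] for r in rows) / n for j in range(d)]
    cov = [[sum((r[i] - mu[i]) * (r[j] - mu[j]) for r in rows) / (n - 1)
            for j in range(d)] for i in range(d)]
    return mu, cov


def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def mahalanobis(x, mu, cov):
    """sqrt((x - mu)^T cov^{-1} (x - mu)), without forming the inverse."""
    diff = [a - b for a, b in zip(x, mu)]
    z = solve(cov, diff)
    return math.sqrt(sum(a * b for a, b in zip(diff, z)))


def select(scores, coverage=0.8):
    """Indices of the `coverage` fraction of samples with the lowest scores."""
    keep = int(round(coverage * len(scores)))
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    return sorted(order[:keep])
```

In use, the mean and covariance would be fitted on training features, each test sample scored with `mahalanobis`, and accuracy reported only on the indices returned by `select`; the remaining samples are treated as likely distribution shift and deferred.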

dinov3.seg: Open-Vocabulary Semantic Segmentation with DINOv3

Authors: Saikat Dutta, Biplab Banerjee, Hamid Rezatofighi

First: 2026-03-19T23:57:28+00:00 · Latest: 2026-03-19T23:57:28+00:00

Abs · PDF · Code1 · Code2

Abstract

Open-Vocabulary Semantic Segmentation (OVSS) assigns pixel-level labels from an open set of text-defined categories, demanding reliable generalization to unseen classes at inference. Although modern vision-language models (VLMs) support strong open-vocabulary recognition, their representations learned through global contrastive objectives remain suboptimal for dense prediction, prompting many OVSS methods to depend on limited adaptation or refinement of image-text similarity maps. This, in turn, restricts spatial precision and robustness in complex, cluttered scenes. We introduce dinov3.seg, extending dinov3.txt into a dedicated framework for OVSS. Our contributions are four-fold. First, we design a task-specific architecture tailored to this backbone, systematically adapting established design principles from prior open-vocabulary segmentation work. Second, we jointly leverage text embeddings aligned with both the global [CLS] token and local patch-level visual features from a ViT-based encoder, effectively combining semantic discrimination with fine-grained spatial locality. Third, unlike prior approaches that rely primarily on post hoc similarity refinement, we perform early refinement of visual representations prior to image-text interaction, followed by late refinement of the resulting image-text correlation features, enabling more accurate and robust dense predictions in cluttered scenes. Finally, we propose a high-resolution local-global inference strategy based on sliding-window aggregation, which preserves spatial detail while maintaining global context. We conduct extensive experiments on five widely adopted OVSS benchmarks to evaluate our approach. The results demonstrate its effectiveness and robustness, consistently outperforming current state-of-the-art methods.

中文标题/摘要

标题：dinov3.seg: 使用DINOv3的开放词汇语义分割

开放词汇语义分割（OVSS）将像素级标签分配给开放集合中的文本定义类别，在推理时需要可靠地泛化到未见过的类别。尽管现代视觉-语言模型（VLMs）支持强大的开放词汇识别，但通过全局对比目标学习到的表示对于密集预测仍然不够理想，促使许多OVSS方法依赖于有限的适应或图像-文本相似性图的细化。这反过来限制了在复杂、杂乱场景中的空间精度和鲁棒性。我们引入了dinov3.seg，将dinov3.txt扩展为一个专门的OVSS框架。我们的贡献有四个方面。首先，我们设计了一个针对此主干的任务特定架构，系统地适应了先前开放词汇分割工作中确立的设计原则。其次，我们联合利用与全局[CLS]标记和局部补丁级视觉特征对齐的文本嵌入，有效地结合了语义区分与细粒度的空间局部性。第三，不同于依赖于主要的后处理相似性细化的先前方法，我们在图像-文本交互之前对视觉表示进行早期细化，随后对结果的图像-文本相关特征进行晚期细化，从而在杂乱场景中实现更准确和鲁棒的密集预测。最后，我们提出了一种基于滑动窗口聚合的高分辨率局部-全局推理策略，该策略保留了空间细节同时保持了全局上下文。我们在五个广泛采用的OVSS基准上进行了广泛的实验以评估我们的方法。结果表明其有效性和鲁棒性，始终优于当前最先进的方法。

Summary / 总结

The paper introduces dinov3.seg, an extension of dinov3.txt for Open-Vocabulary Semantic Segmentation (OVSS), addressing limited spatial precision and robustness in complex scenes. It designs a task-specific architecture, jointly leverages text embeddings aligned with the global [CLS] token and local patch-level visual features, refines visual representations before image-text interaction and the resulting image-text correlation features afterwards, and adds a high-resolution local-global inference strategy based on sliding-window aggregation. Experiments on five OVSS benchmarks show that dinov3.seg consistently outperforms current state-of-the-art methods.

研究旨在通过解决现有方法依赖有限的图像-文本相似性映射适应或精炼的局限性，提高开放词汇语义分割的效果。方法引入了dinov3.seg，将dinov3.txt扩展为一个专门用于开放词汇语义分割的框架。关键发现表明，通过早期和后期视觉表示的精炼以及高分辨率的局部-全局推理策略，该方法在杂乱场景中优于当前最先进的方法，展示了更好的空间精度和鲁棒性。
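
Sliding-window aggregation, which the dinov3.seg abstract names as the basis of its inference strategy, is commonly implemented by averaging per-pixel scores over overlapping crops. The sketch below shows only that generic averaging step under stated assumptions; it does not reproduce the paper's local-global combination, and `sliding_window_aggregate` and the `predict` callback are hypothetical names.

```python
def sliding_window_aggregate(image, window, stride, predict):
    """Average per-pixel scores from overlapping square windows.

    `image` is a list of rows; `predict` maps a crop (list of rows) to a
    score map of the same size. Both are illustrative assumptions.
    """
    if not 0 < stride <= window:
        raise ValueError("stride must be in (0, window] so every pixel is covered")
    h, w = len(image), len(image[0])
    total = [[0.0] * w for _ in range(h)]
    count = [[0] * w for _ in range(h)]

    def starts(size):
        # Window origins along one axis; the last window is clamped to the border.
        last = max(size - window, 0)
        pos = list(range(0, last + 1, stride))
        if pos[-1] != last:
            pos.append(last)
        return pos

    for top in starts(h):
        for left in starts(w):
            crop = [row[left:left + window] for row in image[top:top + window]]
            scores = predict(crop)
            # Accumulate window scores back at their image coordinates.
            for i, srow in enumerate(scores):
                for j, s in enumerate(srow):
                    total[top + i][left + j] += s
                    count[top + i][left + j] += 1
    return [[t / c for t, c in zip(tr, cr)] for tr, cr in zip(total, count)]
```

With a real model, `predict` would return class scores for each crop at high resolution; averaging the overlaps smooths seams between neighbouring windows.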

Superclass-Guided Representation Disentanglement for Spurious Correlation Mitigation

Authors: Chenruo Liu, Hongjun Liu, Zeyu Lai, Yiqiu Shen, Chen Zhao, Qi Lei

First: 2025-08-12T02:16:04+00:00 · Latest: 2026-03-19T23:10:54+00:00

Abs · PDF · Code1 · Code2

Abstract

To enhance group robustness to spurious correlations, prior work often relies on auxiliary group annotations and assumes identical sets of groups across training and test domains. To overcome these limitations, we propose to leverage superclasses -- categories that lie higher in the semantic hierarchy than the task's actual labels -- as a more intrinsic signal than group labels for discerning spurious correlations. Our model incorporates superclass guidance from a pretrained vision-language model via gradient-based attention alignment, and then integrates feature disentanglement with a theoretically supported minimax-optimal feature-usage strategy. As a result, our approach attains robustness to more complex group structures and spurious correlations, without the need to annotate any training samples. Experiments across diverse domain generalization tasks show that our method significantly outperforms strong baselines and goes well beyond the vision-language model's guidance, with clear improvements in both quantitative metrics and qualitative visualizations.

中文标题/摘要

标题：基于超类的表示解缠以减轻虚假相关性

为了增强对虚假相关性的群体鲁棒性，先前的工作通常依赖于辅助群体注释，并假设训练和测试领域中的群体集是相同的。为了克服这些限制，我们提出利用超类——比任务实际标签更高层次的类别——作为比群体标签更内在的信号来区分虚假相关性。我们的模型通过基于梯度的注意力对齐从预训练的视觉-语言模型中获取超类指导，并结合具有理论支持的最小-最大最优特征利用策略进行特征解缠。因此，我们的方法能够在无需标注任何训练样本的情况下，实现对更复杂的群体结构和虚假相关性的鲁棒性。跨多种领域泛化任务的实验表明，我们的方法显著优于强大的基线方法，并远远超越了视觉-语言模型的指导，在定量指标和定性可视化方面均有所改进。

Summary / 总结

The research aims to improve group robustness against spurious correlations by using superclasses, categories higher in the semantic hierarchy than the task labels, instead of group labels. The method draws superclass guidance from a pretrained vision-language model through gradient-based attention alignment, then combines feature disentanglement with a minimax-optimal feature-usage strategy. Experiments show that the approach significantly outperforms strong baselines across diverse domain generalization tasks, with robustness to more complex group structures and spurious correlations and no need to annotate training samples.

研究旨在通过使用更高层次的类别（即超类）而非组标签来提高对伪相关性的组鲁棒性。方法包括使用预训练的视觉-语言模型进行梯度基注意力对齐，并结合特征分离与理论上支持的最小-最大最优特征使用策略。实验结果显示，该方法在各种领域泛化任务中显著优于强基线，实现了对复杂组结构和伪相关性的更好鲁棒性，且无需对任何训练样本进行标注。

ReXInTheWild: A Unified Benchmark for Medical Photograph Understanding

Authors: Oishi Banerjee, Sung Eun Kim, Alexandra N. Willauer, Julius M. Kernbach, Abeer Rihan Alomaish, Reema Abdulwahab S. Alghamdi, Hassan Rayhan Alomaish, Mohammed Baharoon, Xiaoman Zhang, Julian Nicolas Acosta, Christine Zhou, Pranav Rajpurkar

First: 2026-03-19T22:54:28+00:00 · Latest: 2026-03-19T22:54:28+00:00

Comments: 11 pages, 4 figures

Abs · PDF · Code1 · Code2

Abstract

Everyday photographs taken with ordinary cameras are already widely used in telemedicine and other online health conversations, yet no comprehensive benchmark evaluates whether vision-language models can interpret their medical content. Analyzing these images requires both fine-grained natural image understanding and domain-specific medical reasoning, a combination that challenges both general-purpose and specialized models. We introduce ReXInTheWild, a benchmark of 955 clinician-verified multiple-choice questions spanning seven clinical topics across 484 photographs sourced from the biomedical literature. When evaluated on ReXInTheWild, leading multimodal large language models show substantial performance variation: Gemini-3 achieves 78% accuracy, followed by Claude Opus 4.5 (72%) and GPT-5 (68%), while the medical specialist model MedGemma reaches only 37%. A systematic error analysis also reveals four categories of common errors, ranging from low-level geometric errors to high-level reasoning failures and requiring different mitigation strategies. ReXInTheWild provides a challenging, clinically grounded benchmark at the intersection of natural image understanding and medical reasoning. The dataset is available on HuggingFace.

中文标题/摘要

标题：ReXInTheWild: 一种统一的医学照片理解基准

日常使用普通相机拍摄的照片已经在远程医疗和其他在线健康对话中广泛使用，但尚未有全面的基准来评估视觉-语言模型是否能够解释其医学内容。分析这些图像需要精细的自然图像理解和特定领域的医学推理，这既挑战了通用模型，也挑战了专门模型。我们引入了ReXInTheWild，这是一个包含484张照片的基准，这些照片来自生物医学文献，涵盖七个临床主题的955个由临床医生验证的多项选择题。在ReXInTheWild上评估时，领先的多模态大型语言模型表现出显著的性能差异：Gemini-3达到78%的准确率，其次是Claude Opus 4.5（72%）和GPT-5（68%），而医学专科模型MedGemma仅达到37%。系统性错误分析还揭示了四种常见错误类别，从低级几何错误到高级推理失败，需要不同的缓解策略。ReXInTheWild提供了一个具有挑战性的、临床导向的基准，结合了自然图像理解和医学推理。数据集可在HuggingFace上获取。

Summary / 总结

ReXInTheWild evaluates whether vision-language models can interpret the medical content of everyday photographs, which are already widely used in telemedicine. The benchmark consists of 955 clinician-verified multiple-choice questions over 484 photographs from the biomedical literature, covering seven clinical topics. Leading multimodal models vary substantially: Gemini-3 reaches 78% accuracy, Claude Opus 4.5 72%, and GPT-5 68%, while the medical specialist model MedGemma reaches only 37%, highlighting the difficulty of combining natural image understanding with medical reasoning. A systematic error analysis identifies four categories of common errors, ranging from low-level geometric errors to high-level reasoning failures, which call for different mitigation strategies.

这项工作的动机是评估视觉-语言模型在解读日常拍摄的照片中的医学内容方面的能力，这些照片在远程医疗中越来越常用。作者引入了ReXInTheWild基准，包含955个由临床专家验证的问题，覆盖484张照片中的七个临床主题。关键发现表明，领先的多模态模型如Gemini-3和Claude Opus 4.5的表现优于医学专家模型MedGemma，Gemini-3的准确率为78%。研究还指出了四种类型的错误，表明需要在几何和推理能力方面改进这些模型。

Gastric-X: A Multimodal Multi-Phase Benchmark Dataset for Advancing Vision-Language Models in Gastric Cancer Analysis

Authors: Sheng Lu, Hao Chen, Rui Yin, Juyan Ba, Yu Zhang, Yuanzhe Li

First: 2026-03-19T22:47:27+00:00 · Latest: 2026-03-19T22:47:27+00:00

Comments: Computer Vision and Pattern Recognition 2026

Abs · PDF · Code1 · Code2

Abstract

Recent vision-language models (VLMs) have shown strong generalization and multimodal reasoning abilities in natural domains. However, their application to medical diagnosis remains limited by the lack of comprehensive and structured datasets that capture real clinical workflows. To advance the development of VLMs for clinical applications, particularly in gastric cancer, we introduce Gastric-X, a large-scale multimodal benchmark for gastric cancer analysis providing 1.7K cases. Each case in Gastric-X includes paired resting and dynamic CT scans, endoscopic image, a set of structured biochemical indicators, expert-authored diagnostic notes, and bounding box annotations of tumor regions, reflecting realistic clinical conditions. We systematically examine the capability of recent VLMs on five core tasks: Visual Question Answering (VQA), report generation, cross-modal retrieval, disease classification, and lesion localization. These tasks simulate critical stages of clinical workflow, from visual understanding and reasoning to multimodal decision support. Through this evaluation, we aim not only to assess model performance but also to probe the nature of VLM understanding: Can current VLMs meaningfully correlate biochemical signals with spatial tumor features and textual reports? We envision Gastric-X as a step toward aligning machine intelligence with the cognitive and evidential reasoning processes of physicians, and as a resource to inspire the development of next-generation medical VLMs.

中文标题/摘要

标题：Gastric-X：胃癌分析的多模态多阶段基准数据集，以促进视觉-语言模型的发展

近年来，视觉-语言模型（VLMs）在自然领域展示了强大的泛化能力和多模态推理能力。然而，它们在医疗诊断中的应用受限于缺乏能够捕捉真实临床工作流程的全面且结构化的数据集。为了促进VLMs在临床应用中的发展，特别是在胃癌领域，我们引入了Gastric-X，这是一个大规模的多模态基准数据集，提供了1700个病例。每个病例包含静息和动态CT扫描配对、内窥镜图像、一系列结构化的生化指标、专家撰写的诊断笔记以及肿瘤区域的边界框注释，反映了现实的临床条件。我们系统地考察了最近的VLMs在五个核心任务上的能力：视觉问答（VQA）、报告生成、跨模态检索、疾病分类和病灶定位。这些任务模拟了临床工作流程的关键阶段，从视觉理解与推理到多模态决策支持。通过这种评估，我们不仅旨在评估模型性能，还旨在探究VLM的理解本质：当前的VLMs能否有意义地将生化信号与空间肿瘤特征和文本报告联系起来？我们设想Gastric-X是使机器智能与医生的认知和证据推理过程相一致的一步，并作为激励下一代医疗VLMs发展的资源。

Summary / 总结

Gastric-X is a large multimodal dataset for gastric cancer analysis, including 1.7K cases with CT scans, endoscopic images, biochemical indicators, and diagnostic notes. It evaluates recent vision-language models on tasks like VQA, report generation, and lesion localization, aiming to assess their ability to correlate biochemical signals with spatial tumor features and textual reports. This benchmark seeks to advance VLMs for clinical applications in gastric cancer diagnosis.

Gastric-X 是一个包含 1.7K 例胃癌病例的大规模多模态数据集，每个病例包含配对的 CT 扫描、内镜图像、生物化学指标、诊断笔记和肿瘤标注。该数据集旨在评估视觉-语言模型（VLMs）在视觉问答、报告生成、跨模态检索、疾病分类和病灶定位等任务上的性能。研究还探讨当前的 VLMs 能否有意义地将生物化学信号与空间肿瘤特征和文本报告联系起来，以推动其在临床决策支持中的应用。

Instruction-Free Tuning of Large Vision Language Models for Medical Instruction Following

Authors: Myeongkyun Kang, Soopil Kim, Xiaoxiao Li, Sang Hyun Park

First: 2026-03-19T21:23:56+00:00 · Latest: 2026-03-19T21:23:56+00:00

Abs · PDF · Code1 · Code2

Abstract

Large vision language models (LVLMs) have demonstrated impressive performance across a wide range of tasks. These capabilities largely stem from visual instruction tuning, which fine-tunes models on datasets consisting of curated image-instruction-output triplets. However, in the medical domain, constructing large-scale, high-quality instruction datasets is particularly challenging due to the need for specialized expert knowledge. To address this issue, we propose an instruction-free tuning approach that reduces reliance on handcrafted instructions, leveraging only image-description pairs for fine-tuning. Specifically, we introduce a momentum proxy instruction as a replacement for curated text instructions, which preserves the instruction-following capability of the pre-trained LVLM while promoting updates to parameters that remain valid during inference. Consequently, the fine-tuned LVLM can flexibly respond to domain-specific instructions, even though explicit instructions are absent during fine-tuning. Additionally, we incorporate a response shuffling strategy to mitigate the model's over-reliance on previous words, facilitating more effective fine-tuning. Our approach achieves state-of-the-art accuracy on multiple-choice visual question answering tasks across SKINCON, WBCAtt, CBIS, and MIMIC-CXR datasets, significantly enhancing the fine-tuning efficiency of LVLMs in medical domains.

中文标题/摘要

标题：无指令调优大型视觉语言模型以实现医学指令遵循

大型视觉语言模型（LVLM）在广泛的任务中展现了令人印象深刻的性能。这些能力主要源自视觉指令调优，即在由精心策划的图像-指令-输出三元组构成的数据集上对模型进行微调。然而，在医学领域，构建大规模、高质量的指令数据集特别具有挑战性，因为需要专门的专家知识。为了解决这一问题，我们提出了一种无指令调优方法，减少了对手工指令的依赖，仅利用图像-描述对进行微调。具体来说，我们引入了一个动量代理指令作为精心策划的文本指令的替代品，这保留了预训练的LVLM的指令遵循能力，同时促进了在推理期间仍然有效的参数的更新。因此，微调后的LVLM可以在没有明确指令的情况下灵活响应特定领域的指令。此外，我们还引入了响应洗牌策略来减轻模型对先前词语的过度依赖，从而促进更有效的微调。我们的方法在SKINCON、WBCAtt、CBIS和MIMIC-CXR数据集上的多项选择视觉问答任务中达到了最先进的准确率，显著提高了LVLM在医学领域的微调效率。

Summary / 总结

The research aims to address the challenge of creating large-scale, high-quality instruction datasets in the medical domain by proposing an instruction-free tuning approach for large vision language models (LVLMs). This method leverages image-description pairs and introduces a momentum proxy instruction to preserve the instruction-following capability while promoting valid parameter updates. The approach also includes a response shuffling strategy to improve fine-tuning. Experimental results show that the fine-tuned LVLM achieves state-of-the-art accuracy on multiple-choice visual question answering tasks across various medical datasets, enhancing the efficiency of LVLMs in medical domains.

研究旨在通过提出一种无指令调优方法来解决医学领域构建大规模高质量指令数据集的挑战，该方法利用图像描述对并引入动量代理指令来保持指令跟随能力的同时促进有效参数更新。此外，该方法还采用了响应打乱策略以提高调优效率。实验结果显示，调优后的LVLM在SKINCON、WBCAtt、CBIS和MIMIC-CXR等多个医学数据集上的多项选择视觉问答任务中达到了最先进的准确率，增强了模型对医学指令的适应性，而无需明确的调优指令。

LoFi: Location-Aware Fine-Grained Representation Learning for Chest X-ray

Authors: Myeongkyun Kang, Yanting Yang, Xiaoxiao Li

First: 2026-03-19T20:24:38+00:00 · Latest: 2026-03-19T20:24:38+00:00

Abs · PDF · Code1 · Code2

Abstract

Fine-grained representation learning is crucial for retrieval and phrase grounding in chest X-rays, where clinically relevant findings are often spatially confined. However, the lack of region-level supervision in contrastive models and the limited ability of large vision language models to capture fine-grained representations in external validation lead to suboptimal performance on these tasks. To address these limitations, we propose Location-aware Fine-grained representation learning (LoFi), which jointly optimizes sigmoid, captioning, and location-aware captioning losses using a lightweight large language model. The location-aware captioning loss enables region-level supervision through grounding and dense captioning objectives, thereby facilitating fine-grained representation learning. Building upon these representations, we integrate a fine-grained encoder into retrieval-based in-context learning to enhance chest X-ray grounding across diverse settings. Extensive experiments demonstrate that our method achieves superior retrieval and phrase grounding performance on MIMIC-CXR and PadChest-GR.

中文标题/摘要

标题：LoFi：基于位置的细粒度表示学习方法在胸部X光片中的应用

在胸部X光片中，细粒度表示学习对于检索和短语定位至关重要，因为临床相关的发现通常在空间上是局限的。然而，对比模型中缺乏区域级别的监督，以及大型视觉语言模型在外部验证中捕捉细粒度表示的能力有限，导致这些任务上的表现不佳。为了解决这些限制，我们提出了基于位置的细粒度表示学习（LoFi），该方法使用轻量级大型语言模型联合优化Sigmoid、captioning和位置感知captioning损失。位置感知captioning损失通过定位和密集captioning目标提供区域级别的监督，从而促进细粒度表示学习。基于这些表示，我们将细粒度编码器集成到基于检索的上下文学习中，以增强胸部X光片在不同环境下的定位。广泛的实验表明，我们的方法在MIMIC-CXR和PadChest-GR上实现了更好的检索和短语定位性能。

Summary / 总结

The paper addresses the challenge of fine-grained representation learning in chest X-rays, where clinically relevant findings are spatially confined. It proposes LoFi, which jointly optimizes sigmoid, captioning, and location-aware captioning losses using a lightweight large language model. This approach enables region-level supervision and improves fine-grained representation learning. Experiments show that LoFi outperforms existing methods on retrieval and phrase grounding tasks on MIMIC-CXR and PadChest-GR datasets.

LoFi 方法通过联合优化 Sigmoid、captioning 和位置感知 captioning 损失，使用轻量级的大语言模型来解决胸部 X 光片中细粒度表示学习的局限性。这种方法实现了区域级别的监督，并在 MIMIC-CXR 和 PadChest-GR 数据集上提高了检索和短语定位的性能。
