MuRF: Unlocking the Multi-Scale Potential of Vision Foundation Models
Authors: Bocheng Zou, Mu Cai, Mark Stanley, Dingfu Lu, Yong Jae Lee
First: 2026-03-26T17:59:58+00:00 · Latest: 2026-03-26T17:59:58+00:00
Abstract
Vision Foundation Models (VFMs) have become the cornerstone of modern computer vision, offering robust representations across a wide array of tasks. While recent advances allow these models to handle varying input sizes during training, inference typically remains restricted to a single, fixed scale. This prevalent single-scale paradigm overlooks a fundamental property of visual perception: varying resolutions offer complementary inductive biases, where low-resolution views excel at global semantic recognition and high-resolution views are essential for fine-grained refinement. In this work, we propose Multi-Resolution Fusion (MuRF), a simple yet universally effective strategy to harness this synergy at inference time. Instead of relying on a single view, MuRF constructs a unified representation by processing an image at multiple resolutions through a frozen VFM and fusing the resulting features. The universality of MuRF is its most compelling attribute. It is not tied to a specific architecture, serving instead as a fundamental, training-free enhancement to visual representation. We empirically validate this by applying MuRF to a broad spectrum of critical computer vision tasks across multiple distinct VFM families: primarily DINOv2, while also demonstrating successful generalization to contrastive models like SigLIP2.
中文标题/摘要
标题:MuRF:解锁视觉基础模型的多尺度潜力
视觉基础模型(VFMs)已成为现代计算机视觉的基石,提供跨多种任务的稳健表示。尽管最近的进步允许这些模型在训练过程中处理不同大小的输入,但在推理时通常仍局限于单一固定尺度。这一普遍的单尺度范式忽视了视觉感知的一个基本特性:不同分辨率提供互补的归纳偏置,低分辨率视图在全局语义识别方面表现出色,而高分辨率视图对于细粒度的精炼至关重要。在本文中,我们提出了一种多分辨率融合(MuRF)策略,这是一种简单而普遍有效的策略,在推理时利用这种协同作用。MuRF 不依赖单一视图,而是通过冻结的 VFM 对图像在多个分辨率下进行处理并融合结果特征来构建统一表示。MuRF 的普适性是其最引人注目的特点。它不依赖于特定架构,而是作为视觉表示的一种基本、无需训练的增强。我们通过将其应用于多种不同的 VFM 家族中的关键计算机视觉任务来实证验证这一点,主要使用 DINOv2,同时也展示了其成功泛化到对比模型如 SigLIP2。
Summary / 总结
The work addresses the limitation of single-scale inference in Vision Foundation Models (VFMs) by proposing Multi-Resolution Fusion (MuRF), which processes images at multiple resolutions and fuses the resulting features to enhance representation. MuRF is applied to various computer vision tasks using different VFM families, showing improved performance without requiring additional training.
该研究针对视觉基础模型(VFMs)在单尺度推理中的局限性,提出了多分辨率融合(MuRF)方法,该方法在多个分辨率下处理图像并融合所得特征以增强表示能力。MuRF被应用于多个不同VFM家族的多种计算机视觉任务,显示出改进的效果且无需额外训练。
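A minimal sketch of the multi-resolution idea described in the MuRF abstract, under stated assumptions: the abstract does not say how features are fused, so channel-wise concatenation after nearest-neighbour upsampling is an assumption here, and `toy_encoder`, `resize_nearest`, and `murf_style_fusion` are hypothetical names, not the authors' code. The frozen VFM is replaced by a toy mean-pooling encoder so the example runs with the standard library only.

```python
from typing import Callable, List, Sequence

Grid = List[List[List[float]]]  # indexed as [row][col][channel]


def resize_nearest(grid: Grid, out_h: int, out_w: int) -> Grid:
    """Nearest-neighbour resize of an [H][W][C] grid to [out_h][out_w][C]."""
    in_h, in_w = len(grid), len(grid[0])
    return [
        [list(grid[r * in_h // out_h][c * in_w // out_w]) for c in range(out_w)]
        for r in range(out_h)
    ]


def toy_encoder(image: Grid, patch: int = 2) -> Grid:
    """Hypothetical stand-in for a frozen VFM: mean-pools each patch x patch block."""
    h, w, ch = len(image), len(image[0]), len(image[0][0])
    out = []
    for r in range(0, h - patch + 1, patch):
        row = []
        for c in range(0, w - patch + 1, patch):
            cell = [
                sum(image[r + dr][c + dc][k] for dr in range(patch) for dc in range(patch))
                / patch**2
                for k in range(ch)
            ]
            row.append(cell)
        out.append(row)
    return out


def murf_style_fusion(
    image: Grid, encoder: Callable[[Grid], Grid], sizes: Sequence[int]
) -> Grid:
    """Encode the image at each resolution, align all feature maps to the
    largest feature grid, and concatenate them along the channel axis
    (concatenation is an assumption of this sketch)."""
    feature_maps = [encoder(resize_nearest(image, s, s)) for s in sizes]
    out_h = max(len(f) for f in feature_maps)
    out_w = max(len(f[0]) for f in feature_maps)
    aligned = [resize_nearest(f, out_h, out_w) for f in feature_maps]
    return [
        [sum((a[r][c] for a in aligned), []) for c in range(out_w)]
        for r in range(out_h)
    ]


# 8x8 single-channel toy image whose pixel value is row + column.
image = [[[float(r + c)] for c in range(8)] for r in range(8)]
fused = murf_style_fusion(image, toy_encoder, sizes=[4, 8])
```

Each cell of `fused` holds one channel from the low-resolution pass followed by one from the high-resolution pass, so a downstream head would see both views at every spatial location.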
Seeing to Ground: Visual Attention for Hallucination-Resilient MDLLMs
Authors: Vishal Narnaware, Animesh Gupta, Kevin Zhai, Zhenyi Wang, Mubarak Shah
First: 2026-03-26T17:53:49+00:00 · Latest: 2026-03-26T17:53:49+00:00
Abstract
Multimodal Diffusion Large Language Models (MDLLMs) achieve high-concurrency generation through parallel masked decoding, yet the architectures remain prone to multimodal hallucinations. This structural vulnerability stems from an algorithmic flaw: the decoder ranks candidate tokens based on textual likelihood without verifying localized visual support. We establish that this language-only ranking induces an objective mismatch, where language probability mass acts as a misspecified proxy for the intended multimodal task. Consequently, we reinterpret hallucination as a localized optimization error, a phenomenon where the decoder exploits language shortcuts to maximize a proxy score at the expense of visual grounding. To address this objective mismatch, we introduce VISAGE, a training-free decoding framework that calibrates the objective at inference time. VISAGE estimates the proxy discrepancy by quantifying the spatial entropy of cross-attention distributions. By enforcing a localization consensus across attention heads, the method penalizes spatially uniform distributions and re-ranks token commitments to favor visually grounded outcomes. We provide an analytical stability guarantee establishing that VISAGE maintains a bounded objective loss under estimation error. Evaluations across hallucination-sensitive and general-purpose benchmarks demonstrate the robustness of the framework, yielding relative gains of 8.59% on MMMU-val and 7.75% on HallusionBench.
中文标题/摘要
标题:视觉关注以应对幻觉:视觉注意力在幻觉鲁棒性MDLLMs中的应用
多模态扩散大型语言模型(MDLLMs)通过并行遮蔽解码实现高并发生成,但架构仍易受到多模态幻觉的影响。这种结构上的脆弱性源于算法缺陷:解码器根据文本可能性对候选词进行排序,而未验证局部视觉支持。我们证明这种仅语言的排序导致了目标不匹配,其中语言概率质量充当了对多模态任务的不恰当代理。因此,我们将幻觉重新解释为局部优化错误,即解码器利用语言捷径以最大化代理分数,而牺牲了视觉接地。为解决这种目标不匹配,我们引入了VISAGE,这是一种无需训练的解码框架,在推理时校准目标。VISAGE通过量化跨注意力分布的空间熵来估计代理差异。通过在注意力头之间强制执行定位共识,该方法惩罚空间均匀分布并重新排序词元承诺,以有利于视觉接地的结果。我们提供了一个分析性稳定性保证,表明在估计误差下VISAGE保持有界的目标损失。在幻觉敏感和通用基准上的评估表明该框架的鲁棒性,分别在MMMU-val上获得8.59%的相对增益,在HallusionBench上获得7.75%的相对增益。
Summary / 总结
The paper addresses the issue of multimodal hallucinations in MDLLMs by introducing VISAGE, a training-free decoding framework that calibrates the objective at inference time. VISAGE uses spatial entropy of cross-attention distributions to estimate the proxy discrepancy and penalizes spatially uniform distributions to favor visually grounded outcomes. Evaluations show that VISAGE improves robustness, achieving relative gains of 8.59% on MMMU-val and 7.75% on HallusionBench.
论文通过引入VISAGE解码框架来解决MDLLMs中的多模态幻觉问题,该框架无需训练,在推理时校准目标以更好地与视觉接地对齐。VISAGE通过量化交叉注意力分布的空间熵来估计代理偏差,并惩罚空间均匀分布以偏向视觉接地的结果。实验表明,VISAGE提高了鲁棒性,在MMMU-val上取得了8.59%的相对增益,在HallusionBench上取得了7.75%的相对增益。
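To make the entropy-based re-ranking in the VISAGE abstract concrete, here is a rough sketch under stated assumptions. The abstract only says that spatial entropy of cross-attention is quantified and that spatially uniform distributions are penalized; the specific score (log-probability minus a penalty times the mean normalized entropy over heads) and the names `normalized_entropy` and `rerank` are assumptions for illustration, not the paper's formulation.

```python
import math
from typing import Dict, List


def normalized_entropy(weights: List[float]) -> float:
    """Shannon entropy of one attention map over image patches, scaled to [0, 1].
    A value near 1 means attention is spread uniformly (poorly localized)."""
    total = sum(weights)
    probs = [w / total for w in weights]
    h = -sum(p * math.log(p) for p in probs if p > 0)
    return h / math.log(len(probs)) if len(probs) > 1 else 0.0


def rerank(candidates: Dict[str, dict], penalty: float = 1.0) -> List[str]:
    """Order candidate tokens by text log-probability minus an entropy penalty.
    Averaging entropy over heads is an assumed stand-in for a consensus rule."""
    scored = {}
    for token, info in candidates.items():
        heads = info["heads"]
        mean_entropy = sum(normalized_entropy(h) for h in heads) / len(heads)
        scored[token] = info["logprob"] - penalty * mean_entropy
    return sorted(scored, key=scored.get, reverse=True)


# Hypothetical candidates: "dog" is more likely under the language prior but
# its attention is uniform; "cat" is less likely but sharply localized.
candidates = {
    "dog": {"logprob": -0.5, "heads": [[1, 1, 1, 1], [1, 1, 1, 1]]},
    "cat": {"logprob": -0.9, "heads": [[0.97, 0.01, 0.01, 0.01], [0.9, 0.05, 0.03, 0.02]]},
}
ranking = rerank(candidates)
```

With the penalty enabled, the localized candidate moves ahead of the one favored by the language prior alone; setting `penalty=0.0` recovers plain likelihood ranking.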
Colon-Bench: An Agentic Workflow for Scalable Dense Lesion Annotation in Full-Procedure Colonoscopy Videos
Authors: Abdullah Hamdi, Changchun Yang, Xin Gao
First: 2026-03-26T16:58:43+00:00 · Latest: 2026-03-26T16:58:43+00:00
Comments: preprint
Abstract
Early screening via colonoscopy is critical for colon cancer prevention, yet developing robust AI systems for this domain is hindered by the lack of densely annotated, long-sequence video datasets. Existing datasets predominantly focus on single-class polyp detection and lack the rich spatial, temporal, and linguistic annotations required to evaluate modern Multimodal Large Language Models (MLLMs). To address this critical gap, we introduce Colon-Bench, generated via a novel multi-stage agentic workflow. Our pipeline seamlessly integrates temporal proposals, bounding-box tracking, AI-driven visual confirmation, and human-in-the-loop review to scalably annotate full-procedure videos. The resulting verified benchmark is unprecedented in scope, encompassing 528 videos, 14 distinct lesion categories (including polyps, ulcers, and bleeding), over 300,000 bounding boxes, 213,000 segmentation masks, and 133,000 words of clinical descriptions. We utilize Colon-Bench to rigorously evaluate state-of-the-art MLLMs across lesion classification, Open-Vocabulary Video Object Segmentation (OV-VOS), and video Visual Question Answering (VQA). The MLLM results demonstrate surprisingly high localization performance in medical domains compared to SAM-3. Finally, we analyze common VQA errors from MLLMs to introduce a novel "colon-skill" prompting strategy, improving zero-shot MLLM performance by up to 9.7% across most MLLMs. The dataset and the code are available at https://abdullahamdi.com/colon-bench .
中文标题/摘要
标题:Colon-Bench:一种代理驱动的工作流,用于全程序结肠镜检查视频中可扩展密集病灶注释
结肠镜检查早期筛查对于结肠癌预防至关重要,但开发该领域的稳健AI系统受到缺乏密集标注、长序列视频数据集的阻碍。现有数据集主要集中在单类息肉检测,缺乏用于评估现代多模态大型语言模型(MLLMs)所需的丰富空间、时间和语言注释。为解决这一关键缺口,我们引入了Colon-Bench,通过一种新颖的多阶段代理驱动工作流生成。我们的流水线无缝集成时间提议、边界框跟踪、AI驱动的视觉确认和人工在环审查,以可扩展的方式标注全程序视频。最终验证基准覆盖了528个视频、14个不同的病灶类别(包括息肉、溃疡和出血)、超过30万个边界框、21.3万个分割掩码和13.3万个临床描述词。我们利用Colon-Bench严格评估了最先进的MLLMs在病灶分类、开放词汇视频对象分割(OV-VOS)和视频视觉问答(VQA)方面的表现。MLLM结果在医学领域显示出令人惊讶的高定位性能,优于SAM-3。最后,我们分析了MLLMs常见的VQA错误,提出了新的“结肠技能”提示策略,提高了大多数MLLMs的零样本性能最多9.7%。数据集和代码可在https://abdullahamdi.com/colon-bench 获取。
Summary / 总结
The paper introduces Colon-Bench, a densely annotated benchmark of full-procedure colonoscopy videos that addresses the lack of such datasets in the field. A multi-stage agentic workflow annotates 528 videos across 14 lesion categories with over 300,000 bounding boxes, 213,000 segmentation masks, and 133,000 words of clinical descriptions. Colon-Bench is used to evaluate state-of-the-art Multimodal Large Language Models (MLLMs) on lesion classification, Open-Vocabulary Video Object Segmentation (OV-VOS), and video Visual Question Answering (VQA). The MLLMs show surprisingly high localization performance compared to SAM-3, and a novel "colon-skill" prompting strategy improves zero-shot MLLM performance by up to 9.7% across most MLLMs.
研究旨在解决缺乏全面密集标注的结肠镜视频数据集的问题,以开发用于结肠癌预防的稳健 AI 系统。研究引入了 Colon-Bench,通过多阶段代理工作流生成,包括时间提议、边界框跟踪、AI 驱动的视觉确认和人工在环审查。基准数据集包含 528 个视频、14 种病变类别和大量标注,使 MLLMs 在病变分类、OV-VOS 和 VQA 方面得到严格评估。研究发现 MLLMs 在定位性能方面表现出色,并提出了一种新的“结肠技能”提示策略,可将零样本性能提高多达 9.7%。
MedGRPO: Multi-Task Reinforcement Learning for Heterogeneous Medical Video Understanding
Authors: Yuhao Su, Anwesa Choudhuri, Zhongpai Gao, Benjamin Planche, Van Nguyen Nguyen, Meng Zheng, Yuhan Shen, Arun Innanje, Terrence Chen, Ehsan Elhamifar, Ziyan Wu
Venue: CVPR 2026
First: 2025-12-06T22:27:59+00:00 · Latest: 2026-03-26T16:13:10+00:00
Comments: Accepted at CVPR 2026
Abstract
Large vision-language models struggle with medical video understanding, where spatial precision, temporal reasoning, and clinical semantics are critical. To address this, we first introduce MedVidBench, a large-scale benchmark of 531,850 video-instruction pairs across 8 medical sources spanning video, segment, and frame-level tasks, curated through a rigorous quality assurance pipeline with expert-guided prompting and dual-model validation. While supervised fine-tuning on MedVidBench yields noticeable gains, standard Reinforcement Learning (RL) fails due to imbalanced reward scales across datasets, which destabilizes optimization and leads to training collapse. To overcome this, we introduce MedGRPO, a novel RL framework for balanced multi-dataset training with two key innovations: (1) cross-dataset reward normalization that maps each dataset's median performance to a common reward value, ensuring fair optimization regardless of difficulty, and (2) a medical LLM judge that evaluates caption quality on five clinical dimensions through comparative similarity scoring. Supervised fine-tuning Qwen2.5-VL-7B on MedVidBench substantially outperforms GPT-4.1 and Gemini-2.5-Flash across all tasks, demonstrating MedVidBench's efficacy, while our MedGRPO framework further improves upon the SFT baseline across grounding and captioning tasks. Our work establishes a foundational benchmark and robust training methodology for advancing vision-language models in medical domains. Our project website is available at https://yuhaosu.github.io/MedGRPO/.
中文标题/摘要
标题:MedGRPO:多任务强化学习在异质医疗视频理解中的应用
大型视觉-语言模型在医疗视频理解方面存在困难,因为需要精确的空间定位、时间推理和临床语义。为了解决这一问题,我们首先引入了MedVidBench,这是一个包含531,850个视频指令对的大规模基准数据集,覆盖8个医疗来源,包括视频、片段和帧级任务,通过严格的质控流程,结合专家引导的提示和双模型验证进行编目。虽然在MedVidBench上进行监督微调可以取得显著的改进,但标准的强化学习(RL)由于不同数据集之间的奖励尺度不平衡,导致优化不稳定并导致训练崩溃。为克服这一问题,我们引入了MedGRPO,这是一种新的RL框架,用于平衡多数据集训练,包含两个关键创新:(1)跨数据集奖励归一化,将每个数据集的中位性能映射到一个共同的奖励值,确保无论难度如何都能公平优化,以及(2)医疗LLM裁判,通过比较相似度评分评估字幕质量的五个临床维度。在MedVidBench上对Qwen2.5-VL-7B进行监督微调,在所有任务上显著优于GPT-4.1和Gemini-2.5-Flash,证明了MedVidBench的有效性,而我们的MedGRPO框架进一步提高了SFT基线在定位和字幕任务上的表现。我们的工作为推进视觉-语言模型在医疗领域的应用奠定了基础,并建立了稳健的训练方法。我们的项目网站可在https://yuhaosu.github.io/MedGRPO/访问。
Summary / 总结
The paper introduces MedVidBench, a large-scale benchmark for medical video understanding, and MedGRPO, a novel reinforcement learning framework. MedVidBench includes 531,850 video-instruction pairs from 8 medical sources, while MedGRPO uses cross-dataset reward normalization and a medical LLM judge to stabilize training and improve performance. Supervised fine-tuning of Qwen2.5-VL-7B on MedVidBench outperforms GPT-4.1 and Gemini-2.5-Flash, and MedGRPO further enhances grounding and captioning tasks, demonstrating the framework's effectiveness in medical domains.
研究通过引入MedVidBench大规模基准和MedGRPO新型RL框架,解决了医学视频理解的挑战。MedVidBench包含来自8个医学来源的531,850个视频-指令对,MedGRPO使用跨数据集奖励归一化和医学LLM裁判来平衡多数据集训练。Qwen2.5-VL-7B的监督微调在MedVidBench上优于GPT-4.1和Gemini-2.5-Flash,并且MedGRPO进一步提升了在语义定位和字幕生成任务上的表现,证明了所提方法在医学领域的有效性。
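The cross-dataset reward normalization mentioned in the MedGRPO abstract can be illustrated with a small sketch. The abstract states only that each dataset's median performance is mapped to a common reward value; the piecewise-linear mapping below, the target value of 0.5, the assumption that raw rewards lie in [0, 1], and the name `normalize_rewards` are all hypothetical choices made for this example.

```python
from statistics import median
from typing import Dict, List


def normalize_rewards(
    raw: Dict[str, List[float]], target: float = 0.5
) -> Dict[str, List[float]]:
    """Rescale rewards per dataset so that each dataset's median lands on `target`.
    Assumes raw rewards are in [0, 1]; the piecewise-linear form is an assumption."""
    out = {}
    for name, rewards in raw.items():
        m = median(rewards)
        scaled = []
        for r in rewards:
            if r <= m:
                # Map [0, m] linearly onto [0, target].
                scaled.append(target * r / m if m > 0 else target)
            else:
                # Map (m, 1] linearly onto (target, 1].
                scaled.append(target + (1 - target) * (r - m) / (1 - m))
        out[name] = scaled
    return out


# Hypothetical raw rewards: an easy dataset with high scores, a hard one with low scores.
raw = {"easy": [0.8, 0.9, 0.95], "hard": [0.1, 0.2, 0.4]}
normalized = normalize_rewards(raw)
```

After normalization a median-quality response earns the same reward on both datasets, so neither the easy nor the hard dataset dominates the policy gradient simply because of its raw score range.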
Graph-of-Mark: Promote Spatial Reasoning in Multimodal Language Models with Graph-Based Visual Prompting
Authors: Giacomo Frisoni, Lorenzo Molfetta, Mattia Buzzoni, Gianluca Moro
Venue: AAAI 2026
First: 2026-03-02T09:41:26+00:00 · Latest: 2026-03-26T14:20:43+00:00
Comments: Please cite the definitive, copyrighted, and peer-reviewed version of this article published in AAAI 2026, edited by Sven Koenig et al., AAAI Press, Vol. 40, No. 36, Technical Track, pp. 30726-30734, 2026. DOI: https://doi.org/10.1609/aaai.v40i36.40329
Abstract
Recent advances in training-free visual prompting, such as Set-of-Mark, have emerged as a promising direction for enhancing the grounding capabilities of multimodal language models (MLMs). These techniques operate by partitioning the input image into object regions and annotating them with marks, predominantly boxes with numeric identifiers, before feeding the augmented image to the MLM. However, these approaches treat marked objects as isolated entities, failing to capture the relationships between them. On these premises, we propose Graph-of-Mark (GoM), the first pixel-level visual prompting technique that overlays scene graphs onto the input image for spatial reasoning tasks. We evaluate GoM across 3 open-source MLMs and 4 different datasets, conducting extensive ablations on drawn components and investigating the impact of auxiliary graph descriptions in the text prompt. Our results demonstrate that GoM consistently improves the zero-shot capability of MLMs in interpreting object positions and relative directions, improving base accuracy in visual question answering and localization up to 11 percentage points.
中文标题/摘要
标题:Graph-of-Mark:通过基于图的视觉提示促进多模态语言模型的空间推理
近年来,无需训练的视觉提示技术,如Set-of-Mark,已成为增强多模态语言模型(MLM)定位能力的有前途的方向。这些技术通过将输入图像分割成对象区域并在其上标注标记(主要为带有数字标识的框)来工作,然后将增强后的图像输入到MLM中。然而,这些方法将标记的对象视为孤立的实体,未能捕捉它们之间的关系。基于此,我们提出了Graph-of-Mark(GoM),这是第一个在输入图像上叠加场景图的像素级视觉提示技术,用于空间推理任务。我们在3个开源MLM和4个不同数据集上评估了GoM,并对绘制组件进行了广泛的消融分析,研究了文本提示中辅助图描述的影响。我们的结果表明,GoM持续提升了MLM在解释对象位置和相对方向方面的零样本能力,在视觉问答和定位方面的基础准确率最多提高了11个百分点。
Summary / 总结
The research aims to enhance the spatial reasoning abilities of multimodal language models by proposing Graph-of-Mark (GoM), a pixel-level visual prompting technique that incorporates scene graphs into the input image. Evaluations across three open-source MLMs and four datasets show that GoM significantly improves the models' zero-shot performance in understanding object positions and relative directions, with up to 11 percentage point increases in accuracy for visual question answering and localization tasks.
研究旨在通过引入Graph-of-Mark (GoM) 技术,利用场景图捕捉物体之间的关系,提升多模态语言模型的空间推理能力。研究在四个数据集上评估了三种开源多模态语言模型的表现,结果显示在涉及物体位置和相对方向的零样本任务中,GoM 可以显著提高准确性,视觉问答和定位任务的准确率最高可提升11个百分点。
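As a rough illustration of the scene-graph idea in the Graph-of-Mark abstract, the sketch below derives directional edges between marked objects and renders them as an auxiliary text description. The abstract does not specify how relations are extracted or drawn, so comparing box centers, the four relation labels, and the names `spatial_edges` and `describe` are assumptions of this example; overlaying the graph on the image itself is omitted.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max), y grows downward
Edge = Tuple[int, str, int]


def center(box: Box) -> Tuple[float, float]:
    """Center point of a bounding box."""
    return ((box[0] + box[2]) / 2, (box[1] + box[3]) / 2)


def spatial_edges(boxes: Dict[int, Box]) -> List[Edge]:
    """One directed edge per object pair, labelled by the dominant axis of
    displacement between box centers (a simplifying assumption)."""
    edges = []
    ids = sorted(boxes)
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            (ax, ay), (bx, by) = center(boxes[a]), center(boxes[b])
            dx, dy = bx - ax, by - ay
            if abs(dx) >= abs(dy):
                relation = "left of" if dx > 0 else "right of"
            else:
                relation = "above" if dy > 0 else "below"
            edges.append((a, relation, b))
    return edges


def describe(edges: List[Edge]) -> str:
    """Render the graph as a text snippet that could accompany the marked image."""
    return "; ".join(f"[{a}] is {rel} [{b}]" for a, rel, b in edges)


# Hypothetical marks: numeric identifiers mapped to detected boxes.
boxes = {1: (0, 0, 10, 10), 2: (20, 0, 30, 10), 3: (0, 30, 10, 40)}
edges = spatial_edges(boxes)
text = describe(edges)
```

The resulting edges are what distinguishes this style of prompt from isolated marks: each numeric identifier is tied to its neighbours by an explicit relation rather than left for the model to infer.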
GridVAD: Open-Set Video Anomaly Detection via Spatial Reasoning over Stratified Frame Grids
Authors: Mohamed Eltahir, Ahmed O. Ibrahim, Obada Siralkhatim, Tabarak Abdallah, Sondos Mohamed
First: 2026-03-26T14:08:41+00:00 · Latest: 2026-03-26T14:08:41+00:00
Abstract
Vision-Language Models (VLMs) are powerful open-set reasoners, yet their direct use as anomaly detectors in video surveillance is fragile: without calibrated anomaly priors, they alternate between missed detections and hallucinated false alarms. We argue the problem is not the VLM itself but how it is used. VLMs should function as anomaly proposers, generating open-set candidate descriptions that are then grounded and tracked by purpose-built spatial and temporal modules. We instantiate this propose-ground-propagate principle in GridVAD, a training-free pipeline that produces pixel-level anomaly masks without any domain-specific training. A VLM reasons over stratified grid representations of video clips to generate natural-language anomaly proposals. Self-Consistency Consolidation (SCC) filters hallucinations by retaining only proposals that recur across multiple independent samplings. Grounding DINO anchors each surviving proposal to a bounding box, and SAM2 propagates it as a dense mask through the anomaly interval. The per-clip VLM budget is fixed at M+1 calls regardless of video length, where M can be set according to the proposals needed. On UCSD Ped2, GridVAD achieves the highest Pixel-AUROC (77.59) among all compared methods, surpassing even the partially fine-tuned TAO (75.11), and outperforms other zero-shot approaches on object-level RBDC by over 5x. Ablations reveal that SCC provides a controllable precision-recall tradeoff: filtering improves all pixel-level metrics at a modest cost in object-level recall. Efficiency experiments show GridVAD is 2.7x more call-efficient than uniform per-frame VLM querying while additionally producing dense segmentation masks. Code and qualitative video results are available at https://gridvad.github.io.
中文标题/摘要
标题:GridVAD:通过分层帧网格的空间推理实现开放集视频异常检测
视觉-语言模型(VLMs)是强大的开放集推理器,但在视频监控中直接用作异常检测器却很脆弱:没有校准的异常先验,它们会在漏检和虚假警报之间交替。我们认为问题不在于VLM本身,而在于其使用方式。VLM应该作为异常提议者,生成开放集候选描述,然后由专门构建的空间和时间模块进行接地和跟踪。我们通过GridVAD这一无需训练的管道实例化了这一提议-接地-传播原则,该管道在没有任何领域特定训练的情况下生成像素级异常掩码。VLM对视频片段的分层网格表示进行推理,生成自然语言异常提议。自我一致性聚合(SCC)通过仅保留在多次独立采样中重复出现的提议来过滤幻觉。Grounding DINO将每个幸存的提议锚定到一个边界框,SAM2将其作为密集掩码在异常区间内传播。每段视频的VLM预算固定为M+1次调用,无论视频长度如何,M可以根据所需的提议数量进行设置。在UCSD Ped2上,GridVAD在所有比较方法中实现了最高的像素-AUROC(77.59),甚至超过了部分微调的TAO(75.11),在对象级RBDC上也比其他零样本方法高出5倍以上。消融实验表明,SCC提供了可控制的精确度-召回率权衡:过滤可以改善所有像素级别指标,同时在对象级别召回率上付出适度的代价。效率实验表明,GridVAD比均匀的每帧VLM查询效率高2.7倍,同时还能生成密集分割掩码。代码和定性视频结果可在https://gridvad.github.io 获取。
Summary / 总结
GridVAD is a training-free pipeline for open-set video anomaly detection in which a Vision-Language Model (VLM) generates anomaly proposals that are then grounded and tracked by spatial and temporal modules. It achieves the highest Pixel-AUROC (77.59) on UCSD Ped2 among all compared methods, surpassing even the partially fine-tuned TAO (75.11), and outperforms other zero-shot approaches by over 5x on object-level RBDC. Ablations show that Self-Consistency Consolidation (SCC) provides a controllable precision-recall tradeoff, improving pixel-level metrics at a modest cost in object-level recall, and efficiency experiments show GridVAD is 2.7x more call-efficient than uniform per-frame VLM querying while also producing dense segmentation masks.
GridVAD 是一种无需训练的开放集视频异常检测管道,使用视觉-语言模型 (VLM) 生成异常提议,这些提议随后由空间和时间模块进行定位和跟踪。该方法在 UCSD Ped2 上实现了最高的像素 AUROC (77.59),超过了部分微调的 TAO (75.11),并在对象级别的 RBDC 上比其他零样本方法高出 5 倍以上。消融实验表明,自我一致性聚合 (SCC) 提供了可控的精确度-召回率权衡,而效率实验显示 GridVAD 比均匀的每帧 VLM 查询效率高 2.7 倍,同时生成密集的分割掩码。
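The Self-Consistency Consolidation step described for GridVAD, which keeps only proposals that recur across independent samplings, might look roughly like the following. This is a sketch under assumptions: proposals are matched by normalized exact string here, whereas free-form VLM outputs would likely need some semantic matching, and `consolidate` and `min_votes` are hypothetical names rather than the authors' interface.

```python
from collections import Counter
from typing import List


def consolidate(samples: List[List[str]], min_votes: int) -> List[str]:
    """Keep proposals that appear in at least `min_votes` independent samplings.
    Matching by lowercased exact string is a simplifying assumption."""
    votes: Counter = Counter()
    for proposals in samples:
        # Use a set so that a proposal repeated within one sampling counts once.
        votes.update({p.strip().lower() for p in proposals})
    return sorted(p for p, n in votes.items() if n >= min_votes)


# Hypothetical outputs of three independent VLM samplings over the same clip.
samples = [
    ["person riding bicycle", "cart on walkway"],
    ["Person riding bicycle"],
    ["person riding bicycle", "dog running"],
]
kept = consolidate(samples, min_votes=2)
```

Raising `min_votes` trades recall for precision, which is one way to read the controllable precision-recall tradeoff reported in the ablations.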
Mario: Multimodal Graph Reasoning with Large Language Models
Authors: Yuanfu Sun, Kang Li, Pengkang Guo, Jiajin Liu, Qiaoyu Tan
Venue: CVPR 2026
First: 2026-03-05T13:49:41+00:00 · Latest: 2026-03-26T13:51:37+00:00
Comments: CVPR 2026
Abstract
Recent advances in large language models (LLMs) have opened new avenues for multimodal reasoning. Yet, most existing methods still rely on pretrained vision-language models (VLMs) to encode image-text pairs in isolation, ignoring the relational structure that real-world multimodal data naturally form. This motivates reasoning on multimodal graphs (MMGs), where each node has textual and visual attributes and edges provide structural cues. Enabling LLM-based reasoning on such heterogeneous multimodal signals while preserving graph topology introduces two key challenges: resolving weak cross-modal consistency and handling heterogeneous modality preference. To address this, we propose Mario, a unified framework that simultaneously resolves the two above challenges and enables effective LLM-based reasoning over MMGs. Mario consists of two innovative stages. Firstly, a graph-conditioned VLM design that jointly refines textual and visual features through fine-grained cross-modal contrastive learning guided by graph topology. Secondly, a modality-adaptive graph instruction tuning mechanism that organizes aligned multimodal features into graph-aware instruction views and employs a learnable router to surface, for each node and its neighborhood, the most informative modality configuration to the LLM. Extensive experiments across diverse MMG benchmarks demonstrate that Mario consistently outperforms state-of-the-art graph models in both supervised and zero-shot scenarios for node classification and link prediction. The code will be made available at https://github.com/sunyuanfu/Mario.
中文标题/摘要
标题:马里奥:大规模语言模型的多模态图推理
大规模语言模型(LLMs)的最新进展为多模态推理开辟了新的途径。然而,大多数现有方法仍然依赖预训练的视觉-语言模型(VLMs)来孤立地编码图像-文本对,忽略了真实世界多模态数据自然形成的关联结构。这促使我们在多模态图(MMGs)上进行推理,其中每个节点具有文本和视觉属性,边提供结构线索。在保持图拓扑的同时,使基于LLM的多模态异构信号推理引入了两个关键挑战:解决弱跨模态一致性并处理异构模态偏好。为了解决这些问题,我们提出了一种统一框架Mario,该框架同时解决了上述两个挑战,并使基于LLM的MMGs推理变得有效。Mario由两个创新阶段组成。首先,一种图条件下的VLM设计,通过由图拓扑引导的细粒度跨模态对比学习联合精炼文本和视觉特征。其次,一种模态自适应图指令调优机制,将对齐的多模态特征组织成图意识指令视图,并使用可学习的路由器为每个节点及其邻域呈现最相关信息模态配置给LLM。在各种多模态图基准上的广泛实验表明,Mario在节点分类和链接预测的监督和零样本场景中均优于最先进的图模型。代码将在https://github.com/sunyuanfu/Mario上提供。
Summary / 总结
The research aims to leverage large language models (LLMs) for multimodal reasoning by addressing the limitations of existing methods that rely on pretrained vision-language models (VLMs) to encode image-text pairs in isolation. Mario, a unified framework, is proposed to resolve weak cross-modal consistency and handle heterogeneous modality preference on multimodal graphs (MMGs). Mario consists of two stages: a graph-conditioned VLM for cross-modal refinement and a modality-adaptive graph instruction tuning mechanism for effective LLM-based reasoning. Experiments show that Mario outperforms state-of-the-art graph models in node classification and link prediction across various MMG benchmarks.
论文提出了Mario,一种使用大型语言模型进行多模态图推理的统一框架。通过设计图条件下的视觉-语言模型和模态自适应图指令调优机制,解决弱跨模态一致性和异质模态偏好带来的挑战。Mario在各种多模态图基准上的节点分类和链接预测任务中均优于现有最佳图模型。
Cross-Model Disagreement as a Label-Free Correctness Signal
Authors: Matt Gorbett, Suman Jana
First: 2026-03-26T13:46:22+00:00 · Latest: 2026-03-26T13:46:22+00:00
Abstract
Detecting when a language model is wrong without ground truth labels is a fundamental challenge for safe deployment. Existing approaches rely on a model's own uncertainty -- such as token entropy or confidence scores -- but these signals fail critically on the most dangerous failure mode: confident errors, where a model is wrong but certain. In this work we introduce cross-model disagreement as a correctness indicator -- a simple, training-free signal that can be dropped into existing production systems, routing pipelines, and deployment monitoring infrastructure without modification. Given a model's generated answer, cross-model disagreement computes how surprised or uncertain a second verifier model is when reading that answer via a single forward pass. No generation from the verifying model is required, and no correctness labels are needed. We instantiate this principle as Cross-Model Perplexity (CMP), which measures the verifying model's surprise at the generating model's answer tokens, and Cross-Model Entropy (CME), which measures the verifying model's uncertainty at those positions. Both CMP and CME outperform within-model uncertainty baselines across benchmarks spanning reasoning, retrieval, and mathematical problem solving (MMLU, TriviaQA, and GSM8K). On MMLU, CMP achieves a mean AUROC of 0.75 against a within-model entropy baseline of 0.59. These results establish cross-model disagreement as a practical, training-free approach to label-free correctness estimation, with direct applications in deployment monitoring, model routing, selective prediction, data filtering, and scalable oversight of production language model systems.
中文标题/摘要
标题:跨模型分歧作为无标签正确性信号
在没有真实标签的情况下检测语言模型的错误是安全部署中的一个基本挑战。现有方法依赖于模型自身的不确定性,如标记熵或置信度分数,但这些信号在最危险的失败模式上失效:自信错误,即模型虽然错误但非常确定。在本文中,我们引入跨模型分歧作为正确性指标——这是一种简单且无需训练的信号,无需修改即可直接插入现有的生产系统、路由管道和部署监控基础设施中。给定模型生成的答案,跨模型分歧通过单次前向传播计算第二个验证模型在读取该答案时的惊讶程度或不确定性。无需从验证模型生成内容,也不需要正确性标签。我们通过跨模型困惑度(CMP)和跨模型熵(CME)来实现这一原则,CMP衡量验证模型对生成模型答案标记的惊讶程度,CME衡量验证模型在这些位置的不确定性。CMP和CME在涵盖推理、检索和数学问题解决(MMLU、TriviaQA和GSM8K)的基准测试中均优于模型内部不确定性基线。在MMLU上,CMP的平均AUROC为0.75,而模型内部熵基线为0.59。这些结果确立了跨模型分歧作为实用的、无需训练的无标签正确性估计方法,可直接应用于部署监控、模型路由、选择性预测、数据过滤和生产语言模型系统的可扩展监督。
Summary / 总结
This study addresses the challenge of identifying when a language model is incorrect without ground truth labels. It introduces cross-model disagreement as a new correctness indicator, specifically Cross-Model Perplexity (CMP) and Cross-Model Entropy (CME), which measure a verifying model's surprise and uncertainty towards a generating model's answer. These methods outperform within-model uncertainty baselines across various benchmarks, demonstrating their effectiveness in deployment monitoring and model routing.
该研究解决了在没有真实标签的情况下识别语言模型错误的问题。它引入了跨模型分歧作为新的正确性指标,具体包括跨模型困惑度 (CMP) 和跨模型熵 (CME),这些方法衡量验证模型对生成模型答案的惊讶程度和不确定性。这些方法在各种基准测试中优于模型内部的不确定性基线,展示了它们在部署监控和模型路由中的有效性。
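The two signals named in the abstract, Cross-Model Perplexity (CMP) and Cross-Model Entropy (CME), can be sketched from the verifier's per-position next-token distributions. The code assumes those distributions have already been obtained from a single forward pass of the verifying model over the prompt plus the generated answer; averaging over answer positions and the function names are assumptions of this sketch, and the toy two-token vocabulary is purely illustrative.

```python
import math
from typing import Sequence


def cross_model_perplexity(
    dists: Sequence[Sequence[float]], answer_ids: Sequence[int]
) -> float:
    """Perplexity of the verifier on the generator's answer tokens.
    dists[i] is the verifier's distribution at answer position i."""
    nll = [-math.log(d[t]) for d, t in zip(dists, answer_ids)]
    return math.exp(sum(nll) / len(nll))


def cross_model_entropy(dists: Sequence[Sequence[float]]) -> float:
    """Mean Shannon entropy of the verifier's distributions at the answer positions."""
    ents = [-sum(p * math.log(p) for p in d if p > 0) for d in dists]
    return sum(ents) / len(ents)


# Hypothetical verifier distributions over a two-token vocabulary
# at two answer positions, and the token ids the generator emitted.
verifier_dists = [[0.5, 0.5], [0.25, 0.75]]
answer_ids = [0, 1]

cmp_score = cross_model_perplexity(verifier_dists, answer_ids)
cme_score = cross_model_entropy(verifier_dists)
```

Higher values of either score indicate that the verifier finds the answer more surprising or is less certain at those positions, which is the condition under which an answer would be flagged as likely incorrect.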
HiSpatial: Taming Hierarchical 3D Spatial Understanding in Vision-Language Models
Authors: Huizhi Liang, Yichao Shen, Yu Deng, Sicheng Xu, Zhiyuan Feng, Tong Zhang, Yaobo Liang, Jiaolong Yang
Venue: CVPR 2026
First: 2026-03-26T13:08:12+00:00 · Latest: 2026-03-26T13:08:12+00:00
Comments: Accepted by CVPR 2026. Project page: https://microsoft.github.io/HiSpatial
Abstract
Achieving human-like spatial intelligence for vision-language models (VLMs) requires inferring 3D structures from 2D observations, recognizing object properties and relations in 3D space, and performing high-level spatial reasoning. In this paper, we propose a principled hierarchical framework that decomposes the learning of 3D spatial understanding in VLMs into four progressively complex levels, from geometric perception to abstract spatial reasoning. Guided by this framework, we construct an automated pipeline that processes approximately 5M images with over 45M objects to generate 3D spatial VQA pairs across diverse tasks and scenes for VLM supervised fine-tuning. We also develop an RGB-D VLM incorporating metric-scale point maps as auxiliary inputs to further enhance spatial understanding. Extensive experiments demonstrate that our approach achieves state-of-the-art performance on multiple spatial understanding and reasoning benchmarks, surpassing specialized spatial models and large proprietary systems such as Gemini-2.5-pro and GPT-5. Moreover, our analysis reveals clear dependencies among hierarchical task levels, offering new insights into how multi-level task design facilitates the emergence of 3D spatial intelligence.
中文标题/摘要
标题:HiSpatial:视觉语言模型中层次化三维空间理解的框架
为使视觉语言模型(VLMs)具备类人的空间智能,需要从二维观察中推断三维结构,识别三维空间中的物体属性和关系,并进行高级空间推理。本文提出了一种原理性的层次化框架,将VLMs中三维空间理解的学习分解为从几何感知到抽象空间推理的四个逐步复杂层次。基于此框架,我们构建了一个自动化流水线,处理约500万张图像,包含超过4500万个物体,生成跨多种任务和场景的三维空间VQA配对,用于监督微调VLMs。我们还开发了一个RGB-D VLM,结合度量尺度点云作为辅助输入,进一步增强空间理解。大量实验表明,我们的方法在多个空间理解和推理基准测试中达到了最先进的性能,超越了专门的空间模型和大型专有系统,如Gemini-2.5-pro和GPT-5。此外,我们的分析揭示了层次任务级别之间的清晰依赖关系,为多层次任务设计如何促进三维空间智能的涌现提供了新的见解。
Summary / 总结
The research aims to enhance vision-language models' spatial intelligence by decomposing 3D spatial understanding into four progressively complex levels, from geometric perception to abstract spatial reasoning. Guided by this hierarchical framework, it constructs an automated pipeline that generates 3D spatial VQA pairs from approximately 5 million images containing over 45 million objects. The approach outperforms specialized spatial models and large proprietary systems on multiple benchmarks, highlighting the importance of hierarchical task design in developing 3D spatial intelligence.
研究旨在通过从几何感知到抽象推理逐步解决3D空间理解问题,以提升视觉语言模型的空间智能。提出了一个分层框架,并构建了一个自动化管道,从500万张图像中生成3D空间VQA对。该方法在多个基准测试中超越了专门的空间模型和大型专有系统,强调了分层任务设计对发展3D空间智能的重要性。
Shape and Substance: Dual-Layer Side-Channel Attacks on Local Vision-Language Models
Authors: Eyal Hadad, Mordechai Guri
First: 2026-03-26T12:53:49+00:00 · Latest: 2026-03-26T12:53:49+00:00
Comments: 13 pages, 8 figures
Abstract
On-device Vision-Language Models (VLMs) promise data privacy via local execution. However, we show that the architectural shift toward Dynamic High-Resolution preprocessing (e.g., AnyRes) introduces an inherent algorithmic side-channel. Unlike static models, dynamic preprocessing decomposes images into a variable number of patches based on their aspect ratio, creating workload-dependent inputs. We demonstrate a dual-layer attack framework against local VLMs. In Tier 1, an unprivileged attacker can exploit significant execution-time variations using standard unprivileged OS metrics to reliably fingerprint the input's geometry. In Tier 2, by profiling Last-Level Cache (LLC) contention, the attacker can resolve semantic ambiguity within identical geometries, distinguishing between visually dense (e.g., medical X-rays) and sparse (e.g., text documents) content. By evaluating state-of-the-art models such as LLaVA-NeXT and Qwen2-VL, we show that combining these signals enables reliable inference of privacy-sensitive contexts. Finally, we analyze the security engineering trade-offs of mitigating this vulnerability, reveal substantial performance overhead with constant-work padding, and propose practical design recommendations for secure Edge AI deployments.
中文标题/摘要
标题:形状与实质:面向本地视觉-语言模型的双层侧信道攻击
设备端视觉-语言模型(VLMs)通过本地执行承诺了数据隐私。然而,我们展示了向动态高分辨率预处理(例如AnyRes)的架构转变引入了固有的算法侧信道。与静态模型不同,动态预处理会根据图像的长宽比将图像分解为不同数量的块,从而产生工作负载依赖的输入。我们展示了一种针对本地VLMs的双层攻击框架。在第一层中,未授权的攻击者可以利用标准的未授权操作系统指标来可靠地指纹输入的几何形状。在第二层中,通过分析最后级缓存(LLC)争用,攻击者可以解决相同几何形状内的语义模糊性,区分视觉密集(例如,医学X光片)和稀疏(例如,文本文档)内容。通过评估最先进的模型如LLaVA-NeXT和Qwen2-VL,我们展示了结合这些信号可以可靠地推断出隐私敏感的上下文。最后,我们分析了缓解这一漏洞的安全工程权衡,揭示了使用恒定工作量填充带来的显著性能开销,并提出了安全边缘AI部署的实用设计建议。
Summary / 总结
The research addresses a security vulnerability in on-device Vision-Language Models (VLMs) caused by dynamic preprocessing, which decomposes images into a variable number of patches based on aspect ratio. The study introduces a dual-layer attack framework: Tier 1 uses standard unprivileged OS metrics to fingerprint the input's geometry from execution-time variations, and Tier 2 profiles Last-Level Cache (LLC) contention to resolve semantic ambiguity within identical geometries. Evaluations on models like LLaVA-NeXT and Qwen2-VL show that combining these signals can reliably infer privacy-sensitive contexts. The study also reports substantial performance overhead for constant-work padding as a mitigation and suggests practical design recommendations for secure Edge AI deployments.
研究旨在揭示由于动态预处理导致的设备端视觉-语言模型(VLMs)安全漏洞,动态预处理引入了工作负载依赖的输入。研究采用了一种双层攻击框架:第一层使用标准的操作系统指标来指纹识别输入的几何形状,第二层通过分析LLC争用情况来解决语义上的模糊性。对LLaVA-NeXT和Qwen2-VL等模型的评估表明,结合这些信号可以可靠地推断出隐私敏感的上下文。研究还讨论了缓解这种漏洞的性能开销,并提出了针对安全边缘AI部署的实用设计建议。
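To show why aspect-ratio-dependent preprocessing makes the workload input-dependent, here is a simplified sketch of dynamic tiling. It is not the exact rule used by LLaVA-NeXT, Qwen2-VL, or any specific AnyRes implementation: the candidate grids, the log-ratio matching criterion, the extra overview tile, and the names `pick_grid` and `tile_count` are all assumptions chosen for illustration.

```python
import math
from typing import List, Tuple

# Hypothetical candidate tilings as (columns, rows).
CANDIDATE_GRIDS: List[Tuple[int, int]] = [(1, 2), (2, 1), (2, 2), (1, 3), (3, 1)]


def pick_grid(width: int, height: int, grids: List[Tuple[int, int]]) -> Tuple[int, int]:
    """Choose the tiling whose aspect ratio is closest to the image's (in log space)."""
    ratio = math.log(width / height)
    return min(grids, key=lambda g: abs(ratio - math.log(g[0] / g[1])))


def tile_count(width: int, height: int, grids: List[Tuple[int, int]] = CANDIDATE_GRIDS) -> int:
    """Number of tiles the vision encoder must process for this image.
    The +1 models a downscaled overview tile, which is an assumption here."""
    cols, rows = pick_grid(width, height, grids)
    return cols * rows + 1


square = tile_count(1000, 1000)    # matches the 2x2 grid
wide = tile_count(3000, 1000)      # matches the 3x1 grid
portrait = tile_count(1000, 2000)  # matches the 1x2 grid
```

Because the number of tiles, and therefore encoder work, differs by image shape, an observer who can only measure execution time could in principle distinguish these three inputs, which is the geometry leak the Tier 1 attack relies on.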
Self-Calibrated CLIP for Training-Free Open-Vocabulary Segmentation
Authors: Sule Bai, Yong Liu, Yifei Han, Haoji Zhang, Yansong Tang, Jie Zhou, Jiwen Lu
First: 2024-11-24T15:14:05+00:00 · Latest: 2026-03-26T12:46:40+00:00
Comments: Accepted by IEEE TIP
Abstract
Recent advancements in pre-trained vision-language models like CLIP have enabled the task of open-vocabulary segmentation. CLIP demonstrates impressive zero-shot capabilities in various downstream tasks that require holistic image understanding. However, due to the image-level contrastive learning and fully global feature interaction, ViT-based CLIP struggles to capture local details, resulting in poor performance in segmentation tasks. Our analysis of ViT-based CLIP reveals that anomaly tokens emerge during the forward process, attracting disproportionate attention from normal patch tokens and thereby diminishing spatial awareness. To address this issue, we propose Self-Calibrated CLIP (SC-CLIP), a training-free method that calibrates CLIP to generate finer representations while preserving its original generalization ability, without introducing new parameters or relying on additional backbones. Specifically, we mitigate the negative impact of anomaly tokens from two complementary perspectives. First, we explicitly identify the anomaly tokens and replace them based on local context. Second, we reduce their influence on normal tokens by enhancing feature discriminability and attention correlation, leveraging the inherent semantic consistency within CLIP's mid-level features. In addition, we introduce a two-pass strategy that effectively integrates multi-level features to enrich local details under the training-free setting. Together, these strategies enhance CLIP's feature representations with improved granularity and semantic coherence. Experimental results demonstrate the effectiveness of SC-CLIP, achieving state-of-the-art results across all datasets and surpassing previous methods by 9.5%. Notably, SC-CLIP boosts the performance of vanilla CLIP ViT-L/14 by 6.8 times. Our source code is available at https://github.com/SuleBai/SC-CLIP.
中文标题/摘要
标题:自校准CLIP用于无需训练的开放式词汇分割
预训练的视觉-语言模型如CLIP的最新进展使开放式词汇分割任务成为可能。CLIP在各种需要整体图像理解的下游任务中展示了令人印象深刻的零样本能力。然而,由于基于图像级别的对比学习和完全全局特征交互,基于ViT的CLIP难以捕捉局部细节,导致在分割任务中的表现不佳。我们对基于ViT的CLIP的分析表明,在前向过程中会出现异常令牌,这些异常令牌会吸引正常补丁令牌不成比例的注意力,从而削弱空间意识。为了解决这一问题,我们提出了一种无需训练的方法——自校准CLIP(SC-CLIP),该方法在不引入新参数或依赖额外骨干网络的情况下,校准CLIP以生成更精细的表示,同时保持其原始的泛化能力。具体而言,我们从两个互补的角度减轻了异常令牌的负面影响。首先,我们明确识别异常令牌并基于局部上下文进行替换。其次,我们通过增强特征可区分性和注意力相关性来减少其对正常令牌的影响,利用CLIP中间层特征内的固有语义一致性。此外,我们引入了一种两阶段策略,有效地在无需训练的情况下整合多级特征,以丰富局部细节。这些策略共同提高了CLIP的特征表示的粒度和语义连贯性。实验结果表明,SC-CLIP的有效性,其在所有数据集上的表现均达到最佳,并且比以前的方法高出9.5%。值得注意的是,SC-CLIP将vanilla CLIP ViT-L/14的性能提升了6.8倍。我们的源代码可在https://github.com/SuleBai/SC-CLIP/获取。
Summary / 总结
The research aims to improve CLIP's open-vocabulary segmentation by addressing its weakness in capturing local details. The method, Self-Calibrated CLIP (SC-CLIP), is training-free: it identifies anomaly tokens and replaces them using local context, reduces their influence on normal tokens by enhancing feature discriminability and attention correlation, and uses a two-pass strategy to integrate multi-level features. Experiments show that SC-CLIP achieves state-of-the-art results across all datasets, surpassing previous methods by 9.5% and improving the performance of vanilla CLIP ViT-L/14 by a factor of 6.8.
论文针对CLIP在分割任务中难以捕捉局部细节的问题,提出了一种无需训练的Self-Calibrated CLIP (SC-CLIP) 方法,通过识别和替换异常标记以及增强特征可区分性来生成更精细的表示。实验结果显示,SC-CLIP 达到了最先进的效果,超越了之前的方法9.5%,并且将 vanilla CLIP ViT-L/14 的性能提升了6.8倍。
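Illustrative sketch (not the authors' code): the "identify anomaly tokens and replace them based on local context" step can be pictured on a small patch grid. The abstract does not state how anomalies are detected, so the norm-based outlier rule, the `ratio` parameter, and the 8-neighbour averaging below are all assumptions made for illustration.

```python
import math
from statistics import median


def repair_anomaly_tokens(grid, ratio=3.0):
    """grid: H x W nested list of patch feature vectors (lists of floats).

    Assumed criterion: a token is anomalous if its L2 norm exceeds
    ratio * median norm. Each flagged token is replaced by the mean of its
    non-flagged 8-neighbours. Returns (repaired_grid, flagged_positions).
    """
    h, w = len(grid), len(grid[0])
    norms = [[math.sqrt(sum(x * x for x in grid[i][j])) for j in range(w)]
             for i in range(h)]
    med = median(n for row in norms for n in row)
    flagged = {(i, j) for i in range(h) for j in range(w)
               if norms[i][j] > ratio * med}

    # Copy so the input grid is left untouched.
    repaired = [[list(vec) for vec in row] for row in grid]
    for i, j in flagged:
        neighbours = [grid[a][b]
                      for a in range(max(0, i - 1), min(h, i + 2))
                      for b in range(max(0, j - 1), min(w, j + 2))
                      if (a, b) != (i, j) and (a, b) not in flagged]
        if neighbours:  # keep the token as-is if every neighbour is flagged too
            dim = len(grid[i][j])
            repaired[i][j] = [sum(v[d] for v in neighbours) / len(neighbours)
                              for d in range(dim)]
    return repaired, flagged
```

In a real ViT the grid would come from reshaping the patch tokens of one layer; the sketch only shows the replace-from-local-context idea, not the attention or feature calibration the abstract also describes.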
FusionLog: Cross-System Log-based Anomaly Detection via Fusion of General and Proprietary Knowledge
Authors: Xinlong Zhao, Tong Jia, Minghua He, Xixuan Yang, Ying Li
First: 2025-11-08T06:30:50+00:00 · Latest: 2026-03-26T11:47:39+00:00
Comments: 12 pages, 5 figures, and 2 tables
Abstract
Log-based anomaly detection is critical for ensuring the stability and reliability of web systems. One of the key problems in this task is the lack of sufficient labeled logs, which limits the rapid deployment in new systems. Existing works usually leverage large-scale labeled logs from a mature web system and a small amount of labeled logs from a new system, using transfer learning to extract and generalize general knowledge across both domains. However, these methods focus solely on the transfer of general knowledge and neglect the disparity and potential mismatch between such knowledge and the proprietary knowledge of target system, thus constraining performance. To address this limitation, we propose FusionLog, a novel zero-label cross-system log-based anomaly detection method that effectively achieves the fusion of general and proprietary knowledge, enabling cross-system generalization without any labeled target logs. Specifically, we first design a training-free router based on semantic similarity that dynamically partitions unlabeled target logs into 'general logs' and 'proprietary logs.' For general logs, FusionLog employs a small model based on system-agnostic representation meta-learning for direct training and inference, inheriting the general anomaly patterns shared between the source and target systems. For proprietary logs, we iteratively generate pseudo-labels and fine-tune the small model using multi-round collaborative knowledge distillation and fusion based on large language model (LLM) and small model (SM) to enhance its capability to recognize anomaly patterns specific to the target system. Experimental results on three public log datasets from different systems show that FusionLog achieves over 90% F1-score under a fully zero-label setting, significantly outperforming state-of-the-art cross-system log-based anomaly detection methods.
中文标题/摘要
标题:FusionLog:通过融合通用和专有知识的跨系统日志异常检测方法
基于日志的异常检测对于确保网络系统的稳定性和可靠性至关重要。这一任务中的一个关键问题是缺乏足够的标记日志,这限制了其在新系统中的快速部署。现有工作通常利用成熟网络系统的大规模标记日志和新系统的小规模标记日志,通过迁移学习提取和泛化跨两个领域的通用知识。然而,这些方法仅专注于通用知识的迁移,而忽视了此类知识与目标系统专有知识之间的差异和潜在不匹配,从而限制了性能。为了解决这一局限,我们提出了一种名为FusionLog的新型零标签跨系统日志异常检测方法,该方法有效地实现了通用和专有知识的融合,能够在没有目标系统任何标记日志的情况下实现跨系统的泛化。具体而言,我们首先基于语义相似性设计了一个无需训练的路由器,动态地将未标记的目标日志划分为“通用日志”和“专有日志”。对于通用日志,FusionLog 使用基于系统无关表示元学习的小型模型进行直接训练和推理,继承了源系统和目标系统之间共享的通用异常模式。对于专有日志,我们通过多轮基于大型语言模型(LLM)和小型模型(SM)的协作知识蒸馏和融合迭代生成伪标签并微调小型模型,以增强其识别目标系统特定异常模式的能力。在三个不同系统的公开日志数据集上的实验结果显示,在完全零标签设置下,FusionLog 的 F1 分数超过 90%,显著优于最先进的跨系统日志异常检测方法。
Summary / 总结
FusionLog is a novel zero-label cross-system log-based anomaly detection method that integrates general and proprietary knowledge to enhance performance. It uses a training-free router based on semantic similarity to partition logs into general and proprietary categories. For general logs, it employs a small model for direct training and inference, while for proprietary logs, it iteratively generates pseudo-labels and fine-tunes the model using collaborative knowledge distillation. Experiments on three public log datasets show that FusionLog achieves over 90% F1-score, outperforming existing methods.
FusionLog 是一种新型的零标签跨系统日志异常检测方法,通过融合通用和专有知识来提高性能,无需目标系统的标注日志。它使用基于语义相似性的无训练路由器将日志划分为通用和专有部分。对于通用日志,它使用小型模型进行直接训练和推理。对于专有日志,它通过迭代生成伪标签并结合大型语言模型和小型模型进行多轮协作知识蒸馏和融合进行微调。实验结果表明,FusionLog 在三个不同系统的公共日志数据集上实现了超过 90% 的 F1 分数,显著优于现有方法。
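Illustrative sketch (not the authors' code): a training-free router based on semantic similarity could look roughly like the following. It assumes log messages have already been embedded as vectors by some sentence encoder; the function names, the max-cosine rule, and the `threshold` value are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def route_logs(target_embs, source_embs, threshold=0.8):
    """Split target log indices into (general, proprietary).

    A target log is treated as 'general' when its best cosine similarity to
    any source-system log reaches the threshold (assumed rule); otherwise it
    is treated as 'proprietary'.
    """
    general, proprietary = [], []
    for idx, emb in enumerate(target_embs):
        best = max(cosine(emb, s) for s in source_embs)
        (general if best >= threshold else proprietary).append(idx)
    return general, proprietary
```

Under this reading, the general indices would go to the meta-learned small model, while the proprietary ones would enter the LLM and small-model distillation loop described in the abstract.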
Gastric-X: A Multimodal Multi-Phase Benchmark Dataset for Advancing Vision-Language Models in Gastric Cancer Analysis
Authors: Sheng Lu, Hao Chen, Rui Yin, Juyan Ba, Yu Zhang, Yuanzhe Li
First: 2026-03-19T22:47:27+00:00 · Latest: 2026-03-26T11:12:17+00:00
Comments: Computer Vision and Pattern Recognition 2026
Abstract
Recent vision-language models (VLMs) have shown strong generalization and multimodal reasoning abilities in natural domains. However, their application to medical diagnosis remains limited by the lack of comprehensive and structured datasets that capture real clinical workflows. To advance the development of VLMs for clinical applications, particularly in gastric cancer, we introduce Gastric-X, a large-scale multimodal benchmark for gastric cancer analysis providing 1.7K cases. Each case in Gastric-X includes paired resting and dynamic CT scans, endoscopic image, a set of structured biochemical indicators, expert-authored diagnostic notes, and bounding box annotations of tumor regions, reflecting realistic clinical conditions. We systematically examine the capability of recent VLMs on five core tasks: Visual Question Answering (VQA), report generation, cross-modal retrieval, disease classification, and lesion localization. These tasks simulate critical stages of clinical workflow, from visual understanding and reasoning to multimodal decision support. Through this evaluation, we aim not only to assess model performance but also to probe the nature of VLM understanding: Can current VLMs meaningfully correlate biochemical signals with spatial tumor features and textual reports? We envision Gastric-X as a step toward aligning machine intelligence with the cognitive and evidential reasoning processes of physicians, and as a resource to inspire the development of next-generation medical VLMs.
中文标题/摘要
标题:Gastric-X:胃癌分析的多模态多阶段基准数据集,以促进视觉-语言模型的发展
近年来,视觉-语言模型(VLMs)在自然领域展示了强大的泛化能力和多模态推理能力。然而,它们在医疗诊断中的应用受限于缺乏能够捕捉真实临床工作流程的全面和结构化的数据集。为了促进VLMs在临床应用中的发展,特别是在胃癌领域,我们引入了Gastric-X,这是一个大规模的多模态基准数据集,提供了1700个病例。每个病例包含配对的静止和动态CT扫描、内镜图像、一系列结构化的生化指标、专家撰写的诊断笔记以及肿瘤区域的边界框注释,反映了现实的临床条件。我们系统地考察了最近的VLMs在五个核心任务上的能力:视觉问答(VQA)、报告生成、跨模态检索、疾病分类和病灶定位。这些任务模拟了临床工作流程的关键阶段,从视觉理解与推理到多模态决策支持。通过这种评估,我们不仅旨在评估模型性能,还旨在探究VLM的理解本质:当前的VLMs能否有意义地将生化信号与空间肿瘤特征和文本报告联系起来?我们设想Gastric-X是使机器智能与医生的认知和证据推理过程相一致的一步,并作为开发下一代医疗VLMs的资源。
Summary / 总结
Gastric-X is a large multimodal dataset for gastric cancer analysis, including CT scans, endoscopic images, biochemical indicators, and expert notes. It evaluates recent vision-language models on tasks like VQA, report generation, and disease classification, aiming to assess their ability to correlate biochemical signals with tumor features and textual reports. The dataset reflects real clinical conditions and is intended to advance the development of medical VLMs.
Gastric-X 是一个用于胃癌分析的大规模多模态基准数据集(1.7K 个病例),包含 CT 扫描、内窥镜图像、生化指标、专家诊断笔记和肿瘤区域边界框标注。该工作在 VQA、报告生成、跨模态检索、疾病分类和病灶定位五个核心任务上评估近期的 VLMs,旨在考察当前 VLMs 能否将生化信号与空间肿瘤特征和文本报告有意义地关联起来,并推动下一代医疗 VLMs 的发展。
DAGverse: Building Document-Grounded Semantic DAGs from Scientific Papers
Authors: Shu Wan, Saketh Vishnubhatla, Iskander Kushbay, Tom Heffernan, Aaron Belikoff, Raha Moraffah, Huan Liu
First: 2026-03-26T10:33:12+00:00 · Latest: 2026-03-26T10:33:12+00:00
Abstract
Directed Acyclic Graphs (DAGs) are widely used to represent structured knowledge in scientific and technical domains. However, datasets for real-world DAGs remain scarce because constructing them typically requires expert interpretation of domain documents. We study Doc2SemDAG construction: recovering a preferred semantic DAG from a document together with the cited evidence and context that explain it. This problem is challenging because a document may admit multiple plausible abstractions, the intended structure is often implicit, and the supporting evidence is scattered across prose, equations, captions, and figures. To address these challenges, we leverage scientific papers containing explicit DAG figures as a natural source of supervision. In this setting, the DAG figure provides the DAG structure, while the accompanying text provides context and explanation. We introduce DAGverse, a framework for constructing document-grounded semantic DAGs from online scientific papers. Its core component, DAGverse-Pipeline, is a semi-automatic system designed to produce high-precision semantic DAG examples through figure classification, graph reconstruction, semantic grounding, and validation. As a case study, we test the framework for causal DAGs and release DAGverse-1, a dataset of 108 expert-validated semantic DAGs with graph-level, node-level, and edge-level evidence. Experiments show that DAGverse-Pipeline outperforms existing Vision-Language Models on DAG classification and annotation. DAGverse provides a foundation for document-grounded DAG benchmarks and opens new directions for studying structured reasoning grounded in real-world evidence.
中文标题/摘要
标题:DAGverse:从科学论文构建文档导向的语义DAG
有向无环图(DAGs)在科学和技术领域广泛用于表示结构化知识。然而,由于构建它们通常需要专家对领域文档的解释,因此现实世界中的DAG数据集仍然稀缺。我们研究了Doc2SemDAG构建:从文档中恢复出一个优选的语义DAG,同时包含解释它的引证证据和上下文。这个问题具有挑战性,因为一个文档可能允许多种合理的抽象,意图结构往往隐含,支持的证据分散在文字、方程、标题和图表中。为了解决这些挑战,我们利用包含明确DAG图的科学论文作为自然的监督来源。在这种情况下,DAG图提供了DAG结构,而伴随的文本提供了上下文和解释。我们引入了DAGverse,一种从在线科学论文中构建文档导向的语义DAG的框架。其核心组件DAGverse-Pipeline是一种半自动系统,旨在通过图分类、图重建、语义接地和验证来生成高精度的语义DAG示例。作为案例研究,我们测试了该框架在因果DAG上的应用,并发布了包含108个专家验证的语义DAG及其图级、节点级和边级证据的DAGverse-1数据集。实验表明,DAGverse-Pipeline在DAG分类和注释上优于现有视觉-语言模型。DAGverse为文档导向的DAG基准提供了基础,并为基于实际证据的结构化推理研究开辟了新方向。
Summary / 总结
The research aims to build document-grounded semantic Directed Acyclic Graphs (DAGs) from scientific papers, addressing the challenge of expert interpretation required for constructing such graphs. The method involves a semi-automatic system, DAGverse-Pipeline, which uses figure classification, graph reconstruction, semantic grounding, and validation to produce high-precision semantic DAG examples. The key experimental finding is that DAGverse-Pipeline outperforms existing Vision-Language Models in DAG classification and annotation, demonstrating its effectiveness in creating document-grounded semantic DAGs.
研究旨在通过科学论文构建文档导向的语义有向无环图(DAG),解决构建此类图所需的专家解释问题。方法是利用包含明确DAG图的科学论文作为监督,其中图提供了结构,而附带的文本提供了上下文。关键发现是,提出的DAGverse-Pipeline在DAG分类和注释上优于现有视觉-语言模型,展示了该框架在创建高精度语义DAG示例方面的有效性。
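Illustrative sketch (not the authors' code): the abstract lists "validation" as a pipeline stage without detailing it. One automatic check that plausibly belongs there is confirming that a reconstructed edge list really is acyclic. The helper below uses Kahn's algorithm; the function name and the example edges are hypothetical.

```python
from collections import defaultdict, deque


def topological_order(edges):
    """Return a topological order of the nodes, or None if the graph has a cycle.

    edges: iterable of (parent, child) pairs recovered from a figure.
    """
    successors = defaultdict(list)
    indegree = defaultdict(int)
    nodes = set()
    for parent, child in edges:
        successors[parent].append(child)
        indegree[child] += 1
        nodes.update((parent, child))

    # Start from nodes with no incoming edges (sorted for a stable result).
    queue = deque(sorted(n for n in nodes if indegree[n] == 0))
    order = []
    while queue:
        node = queue.popleft()
        order.append(node)
        for nxt in successors[node]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                queue.append(nxt)

    # Nodes on a cycle never reach indegree 0, so the order comes up short.
    return order if len(order) == len(nodes) else None
```

A reconstruction that returns None here would be sent back for correction before semantic grounding, under the assumption that the pipeline rejects cyclic graphs.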
Activation Matters: Test-time Activated Negative Labels for OOD Detection with Vision-Language Models
Authors: Yabin Zhang, Maya Varma, Yunhe Gao, Jean-Benoit Delbrouck, Jiaming Liu, Chong Wang, Curtis Langlotz
Venue: CVPR 2026
First: 2026-03-26T09:53:04+00:00 · Latest: 2026-03-26T09:53:04+00:00
Comments: CVPR 2026 main track, Codes are available at https://github.com/YBZh/OpenOOD-VLM
Abstract
Out-of-distribution (OOD) detection aims to identify samples that deviate from in-distribution (ID). One popular pipeline addresses this by introducing negative labels distant from ID classes and detecting OOD based on their distance to these labels. However, such labels may present poor activation on OOD samples, failing to capture the OOD characteristics. To address this, we propose Test-time Activated Negative Labels (TANL) by dynamically evaluating activation levels across the corpus dataset and mining candidate labels with high activation responses during the testing process. Specifically, TANL identifies high-confidence test images online and accumulates their assignment probabilities over the corpus to construct a label activation metric. Such a metric leverages historical test samples to adaptively align with the test distribution, enabling the selection of distribution-adaptive activated negative labels. By further exploring the activation information within the current testing batch, we introduce a more fine-grained, batch-adaptive variant. To fully utilize label activation knowledge, we propose an activation-aware score function that emphasizes negative labels with stronger activations, boosting performance and enhancing its robustness to the label number. Our TANL is training-free, test-efficient, and grounded in theoretical justification. Experiments on diverse backbones and wide task settings validate its effectiveness. Notably, on the large-scale ImageNet benchmark, TANL significantly reduces the FPR95 from 17.5% to 9.8%. Codes are available at https://github.com/YBZh/OpenOOD-VLM.
中文标题/摘要
标题:激活至关重要:测试时激活的负标签用于视觉-语言模型的OOD检测
离群分布(OOD)检测旨在识别与分布内(ID)样本不同的样本。一种流行的管道通过引入远离ID类别的负标签,并基于这些标签与OOD样本的距离来检测OOD。然而,这些标签在OOD样本上的激活可能较差,无法捕捉OOD特征。为解决这一问题,我们提出了测试时激活的负标签(TANL),通过动态评估语料库数据集中的激活水平并在测试过程中挖掘具有高激活响应的候选标签。具体而言,TANL在线识别高置信度测试图像,并通过语料库累积其分配概率来构建标签激活度量。这种度量利用历史测试样本自适应地与测试分布对齐,从而选择分布适应的激活负标签。通过进一步探索当前测试批次内的激活信息,我们引入了一种更细粒度的、批次适应的变体。为了充分利用标签激活知识,我们提出了一种激活感知评分函数,强调具有更强激活的负标签,从而提高性能并增强其对标签数量的鲁棒性。我们的TANL无需训练,测试高效,并基于理论依据。在多种骨干网络和广泛的任务设置下,实验验证了其有效性。值得注意的是,在大规模ImageNet基准测试中,TANL将FPR95从17.5%显著降低到9.8%。代码可在https://github.com/YBZh/OpenOOD-VLM获取。
Summary / 总结
The paper addresses the challenge of out-of-distribution (OOD) detection by proposing TANL, which dynamically selects activated negative labels during testing. TANL evaluates activation levels across the corpus and selects labels with high activation responses for OOD detection. Experiments show TANL reduces FPR95 from 17.5% to 9.8% on ImageNet, demonstrating its effectiveness. The method is training-free and test-efficient, with theoretical justification.
论文提出了一种名为Test-time Activated Negative Labels (TANL)的方法,通过在测试时动态选择基于激活水平的负标签来解决分布外(OOD)检测问题。TANL识别高置信度的测试图像,并累积它们的分配概率来构建标签激活度量,从而选择与测试分布相适应的负标签。实验结果显示,TANL在ImageNet基准上将FPR95从17.5%显著降低到9.8%,证明了其有效性和对使用的标签数量的鲁棒性。该方法无需训练,测试高效,并具有理论依据和跨多种骨干网络和任务的实际验证。
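Illustrative sketch (not the authors' code): the label activation metric described in the abstract, which accumulates the assignment probabilities of high-confidence test images over the corpus labels, can be pictured as a running sum per label. The class name, the confidence threshold, and the plain top-k selection are assumptions; the paper's activation-aware score function is not reproduced here.

```python
class LabelActivationTracker:
    """Accumulates per-label activation from confident test samples."""

    def __init__(self, num_labels, conf_threshold=0.9):
        self.activation = [0.0] * num_labels  # running sum per corpus label
        self.conf_threshold = conf_threshold  # assumed confidence gate
        self.count = 0                        # number of accepted samples

    def update(self, probs, confidence):
        """Add one sample's assignment probabilities if it is confident enough.

        probs: probabilities over the corpus labels for this test image.
        Returns True when the sample was accumulated.
        """
        if confidence < self.conf_threshold:
            return False
        for i, p in enumerate(probs):
            self.activation[i] += p
        self.count += 1
        return True

    def top_k(self, k):
        """Indices of the k labels with the highest accumulated activation."""
        ranked = sorted(range(len(self.activation)),
                        key=lambda i: self.activation[i], reverse=True)
        return ranked[:k]
```

Because the sums grow as test samples stream in, the selected labels drift toward whatever the test distribution activates, which is the adaptivity the abstract emphasizes.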
Training-free Detection and 6D Pose Estimation of Unseen Surgical Instruments
Authors: Jonas Hein, Lilian Calvet, Matthias Seibold, Siyu Tang, Marc Pollefeys, Philipp Fürnstahl
First: 2026-03-26T09:28:19+00:00 · Latest: 2026-03-26T09:28:19+00:00
Comments: Accepted at IJCARS: IPCAI 2026
Abstract
Purpose: Accurate detection and 6D pose estimation of surgical instruments are crucial for many computer-assisted interventions. However, supervised methods lack flexibility for new or unseen tools and require extensive annotated data. This work introduces a training-free pipeline for accurate multi-view 6D pose estimation of unseen surgical instruments, which only requires a textured CAD model as prior knowledge. Methods: Our pipeline consists of two main stages. First, for detection, we generate object mask proposals in each view and score their similarity to rendered templates using a pre-trained feature extractor. Detections are matched across views, triangulated into 3D instance candidates, and filtered using multi-view geometric consistency. Second, for pose estimation, a set of pose hypotheses is iteratively refined and scored using feature-metric scores with cross-view attention. The best hypothesis undergoes a final refinement using a novel multi-view, occlusion-aware contour registration, which minimizes reprojection errors of unoccluded contour points. Results: The proposed method was rigorously evaluated on real-world surgical data from the MVPSP dataset. The method achieves millimeter-accurate pose estimates that are on par with supervised methods under controlled conditions, while maintaining full generalization to unseen instruments. These results demonstrate the feasibility of training-free, marker-less detection and tracking in surgical scenes, and highlight the unique challenges in surgical environments. Conclusion: We present a novel and flexible pipeline that effectively combines state-of-the-art foundational models, multi-view geometry, and contour-based refinement for high-accuracy 6D pose estimation of surgical instruments without task-specific training. This approach enables robust instrument tracking and scene understanding in dynamic clinical environments.
中文标题/摘要
标题:无需训练的检测与未见手术器械的6D姿态估计
目的:准确检测和6D姿态估计对手术器械至关重要,但监督方法缺乏对新或未见工具的灵活性,并需要大量标注数据。本研究提出了一种无需训练的管道,用于准确估计未见手术器械的多视角6D姿态,仅需使用纹理化的CAD模型作为先验知识。方法:我们的管道包括两个主要阶段。首先,对于检测,我们在每个视图中生成对象掩码提案,并使用预训练的特征提取器计算其与渲染模板的相似度得分。检测结果在视图间匹配,三角化为3D实例候选,并通过多视角几何一致性进行过滤。其次,对于姿态估计,一组姿态假设通过跨视图注意力和特征度量得分迭代优化和评分。最佳假设通过一种新颖的多视角、考虑遮挡的轮廓注册进行最终优化,该方法最小化未遮挡轮廓点的再投影误差。结果:所提出的方法在MVPSP数据集的现实手术数据上进行了严格评估。在受控条件下,该方法实现了毫米级准确的姿态估计,同时保持了对未见器械的完全泛化能力。这些结果表明,在手术场景中实现无需训练、无标记的检测和跟踪的可行性,并突显了手术环境中的独特挑战。结论:我们提出了一种新颖且灵活的管道,有效结合了最先进的基础模型、多视角几何和基于轮廓的优化,以实现高精度的手术器械6D姿态估计,无需针对特定任务的训练。该方法使在动态临床环境中实现稳健的器械跟踪和场景理解成为可能。
Summary / 总结
The research aims to develop a training-free method for accurate detection and 6D pose estimation of unseen surgical instruments, crucial for computer-assisted interventions. The method uses a textured CAD model and a two-stage pipeline: first, object mask proposals are scored and matched across views to triangulate 3D instance candidates, and then pose hypotheses are refined and scored using feature-metric scores and cross-view attention. The final pose is refined using a multi-view, occlusion-aware contour registration. The method achieves millimeter-accurate pose estimates comparable to supervised methods while maintaining generalization to unseen instruments.
研究旨在无需大量训练数据的情况下实现未见过的手术器械的准确检测和6D姿态估计。方法使用预训练的特征提取器生成对象掩码提案,并根据渲染模板的相似性进行评分。然后在不同视图之间匹配检测结果,使用特征度量分数进行姿态假设的迭代细化,并应用多视图、遮挡感知的轮廓注册进行最终细化。结果表明,所提出的方法在手术环境中实现了毫米级准确的姿态估计,与监督方法相当,同时能够完全泛化到未见过的器械,展示了其在手术环境中的可行性。
Free-Lunch Long Video Generation via Layer-Adaptive O.O.D Correction
Authors: Jiahao Tian, Chenxi Song, Wei Cheng, Chi Zhang
Venue: CVPR 2026
First: 2026-03-26T09:12:14+00:00 · Latest: 2026-03-26T09:12:14+00:00
Comments: Accepted to CVPR 2026. Code: https://github.com/Westlake-AGI-Lab/FreeLOC
Abstract
Generating long videos using pre-trained video diffusion models, which are typically trained on short clips, presents a significant challenge. Directly applying these models for long-video inference often leads to a notable degradation in visual quality. This paper identifies that this issue primarily stems from two out-of-distribution (O.O.D) problems: frame-level relative position O.O.D and context-length O.O.D. To address these challenges, we propose FreeLOC, a novel training-free, layer-adaptive framework that introduces two core techniques: Video-based Relative Position Re-encoding (VRPR) for frame-level relative position O.O.D, a multi-granularity strategy that hierarchically re-encodes temporal relative positions to align with the model's pre-trained distribution, and Tiered Sparse Attention (TSA) for context-length O.O.D, which preserves both local detail and long-range dependencies by structuring attention density across different temporal scales. Crucially, we introduce a layer-adaptive probing mechanism that identifies the sensitivity of each transformer layer to these O.O.D issues, allowing for the selective and efficient application of our methods. Extensive experiments demonstrate that our approach significantly outperforms existing training-free methods, achieving state-of-the-art results in both temporal consistency and visual quality. Code is available at https://github.com/Westlake-AGI-Lab/FreeLOC.
中文标题/摘要
标题:基于层自适应O.O.D校正的免费午餐长视频生成
使用预训练的视频扩散模型生成长视频,这些模型通常是在短片段上训练的,这提出了一个重大挑战。直接将这些模型应用于长视频推理通常会导致视觉质量显著下降。本文指出,这一问题主要源于两个分布外(O.O.D)问题:帧级相对位置O.O.D和上下文长度O.O.D。为了解决这些挑战,我们提出了一种名为FreeLOC的新型无训练框架,该框架引入了两种核心技术:基于视频的相对位置重新编码(VRPR)用于帧级相对位置O.O.D,这是一种多粒度策略,通过分层重新编码时间相对位置以与模型的预训练分布对齐;以及分层稀疏注意(TSA)用于上下文长度O.O.D,通过在不同时间尺度上结构化注意密度来保留局部细节和长程依赖。关键地,我们引入了一种层自适应探针机制,以识别每个变压器层对这些O.O.D问题的敏感性,从而允许选择性和高效地应用我们的方法。大量实验表明,我们的方法在时间和视觉质量方面均显著优于现有无训练方法,达到了最先进的水平。代码可在https://github.com/Westlake-AGI-Lab/FreeLOC获取。
Summary / 总结
This paper addresses the challenge of generating long videos using pre-trained video diffusion models, which often suffer from visual quality degradation. It identifies two main out-of-distribution issues: frame-level relative position O.O.D and context-length O.O.D. The proposed FreeLOC framework introduces VRPR for re-encoding temporal relative positions and TSA for preserving both local detail and long-range dependencies. A layer-adaptive probing mechanism is used to selectively apply these techniques. Experiments show that FreeLOC outperforms existing methods in both temporal consistency and visual quality.
该论文解决了使用预训练视频扩散模型生成高质量长视频的挑战,这些模型通常是在短片段上训练的。它指出了两个主要的分布外(O.O.D)问题:帧级相对位置O.O.D和上下文长度O.O.D。为了解决这些问题,作者提出了FreeLOC,这是一种无需训练的框架,包括Video-based Relative Position Re-encoding (VRPR) 和 Tiered Sparse Attention (TSA)。VRPR重新编码了时间相对位置,而TSA则保留了局部细节和长程依赖性。层自适应探针机制选择性地将这些技术应用于每个变压器层。实验表明,FreeLOC在时间一致性和视觉质量方面显著优于现有方法。
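Illustrative sketch (not the authors' VRPR): the idea of re-encoding temporal relative positions so they stay within the range seen during pre-training can be shown with a simple two-granularity mapping. Nearby frame offsets are kept exact, distant offsets are grouped coarsely, and everything is clamped to the largest trained offset. The parameters `local`, `group`, and `max_trained` are hypothetical.

```python
def reencode_offset(offset, local=8, group=4, max_trained=15):
    """Map a frame-level relative offset into the assumed trained range.

    Offsets with |offset| <= local are kept exact (fine granularity).
    Larger offsets are bucketed in steps of `group` frames (coarse
    granularity) and clamped to +/- max_trained.
    """
    sign = -1 if offset < 0 else 1
    dist = abs(offset)
    if dist <= local:
        return offset
    # Ceil-divide the distance beyond the local window into coarse buckets.
    coarse = local + (dist - local + group - 1) // group
    return sign * min(coarse, max_trained)
```

For example, with the defaults, offsets 9 through 12 share the re-encoded value 9, while an offset of 100 is clamped to 15, so no attention layer sees a relative position outside the assumed pre-training range.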
Diagnose, Correct, and Learn from Manipulation Failures via Visual Symbols
Authors: Xianchao Zeng, Xinyu Zhou, Youcheng Li, Jiayou Shi, Tianle Li, Liangming Chen, Lei Ren, Yong-Lu Li
Venue: CVPR 2026
First: 2025-12-02T14:02:42+00:00 · Latest: 2026-03-26T09:06:05+00:00
Comments: Accepted by CVPR 2026. Project Website: https://x1nyuzhou.github.io/vifailback.github.io/
Abstract
Vision-Language-Action (VLA) models have recently achieved remarkable progress in robotic manipulation, yet they remain limited in failure diagnosis and learning from failures. Additionally, existing failure datasets are mostly generated programmatically in simulation, which limits their generalization to the real world. In light of these, we introduce ViFailback, a framework designed to diagnose robotic manipulation failures and provide both textual and visual correction guidance. Our framework utilizes explicit visual symbols to enhance annotation efficiency. We further release the ViFailback dataset, a large-scale collection of 58,126 Visual Question Answering (VQA) pairs along with their corresponding 5,202 real-world manipulation trajectories. Based on the dataset, we establish ViFailback-Bench, a benchmark of 11 fine-grained VQA tasks designed to assess the failure diagnosis and correction abilities of Vision-Language Models (VLMs), featuring ViFailback-Bench Lite for closed-ended and ViFailback-Bench Hard for open-ended evaluation. To demonstrate the effectiveness of our framework, we built the ViFailback-8B VLM, which not only achieves significant overall performance improvement on ViFailback-Bench but also generates visual symbols for corrective action guidance. Finally, by integrating ViFailback-8B with a VLA model, we conduct real-world robotic experiments demonstrating its ability to assist the VLA model in recovering from failures. Project Website: https://x1nyuzhou.github.io/vifailback.github.io/
中文标题/摘要
标题:通过视觉符号诊断、纠正和学习操作失败
视觉-语言-行动(VLA)模型在机器人操作方面取得了显著进展,但在故障诊断和从故障中学习方面仍然有限。此外,现有的故障数据集大多是在模拟中通过编程生成的,这限制了它们在现实世界中的泛化能力。鉴于此,我们提出了ViFailback框架,旨在诊断机器人操作故障并提供文本和视觉纠正指导。我们的框架利用显式的视觉符号以提高注释效率。我们还发布了ViFailback数据集,这是一个包含58,126个视觉问答(VQA)对及其对应的5,202条真实世界操作轨迹的大规模集合。基于该数据集,我们建立了ViFailback-Bench基准,这是一个包含11个细粒度VQA任务的基准,旨在评估视觉语言模型(VLM)的故障诊断和纠正能力,包括ViFailback-Bench Lite用于封闭式评估和ViFailback-Bench Hard用于开放式评估。为了证明我们框架的有效性,我们构建了ViFailback-8B VLM,它不仅在ViFailback-Bench上实现了显著的整体性能提升,还生成了视觉符号以提供纠正行动指导。最后,通过将ViFailback-8B与VLA模型集成,我们进行了现实世界的机器人实验,展示了其帮助VLA模型从故障中恢复的能力。项目网站:https://x1nyuzhou.github.io/vifailback.github.io/
Summary / 总结
The paper introduces ViFailback, a framework for diagnosing robotic manipulation failures and providing textual and visual correction guidance, using explicit visual symbols to make annotation more efficient. It releases a large-scale dataset of 58,126 VQA pairs with 5,202 corresponding real-world manipulation trajectories, and ViFailback-Bench, a benchmark of 11 fine-grained VQA tasks that assesses the failure diagnosis and correction abilities of VLMs. The ViFailback-8B VLM achieves a significant overall performance improvement on the benchmark and generates visual symbols for corrective action guidance. Integrated with a VLA model, it helps that model recover from failures in real-world robotic experiments.
论文提出了ViFailback框架,用于诊断机器人操作失败并提供纠正指导。该框架基于包含58,126个VQA对和5,202个真实世界操作轨迹的大规模数据集。框架在包含11个细粒度任务的ViFailback-Bench基准上评估Vision-Language模型(VLM),并展示了ViFailback-8B VLM在整体性能上的显著提升,并生成了纠正动作的视觉符号。实验证明,将ViFailback-8B与VLA模型结合可以协助从失败中恢复。
RobustVisRAG: Causality-Aware Vision-Based Retrieval-Augmented Generation under Visual Degradations
Authors: I-Hsiang Chen, Yu-Wei Liu, Tse-Yu Wu, Yu-Chien Chiang, Jen-Chien Yang, Wei-Ting Chen
First: 2026-02-25T15:27:57+00:00 · Latest: 2026-03-26T08:59:16+00:00
Comments: Accepted by CVPR2026; Project Page: https://robustvisrag.github.io
Abstract
Vision-based Retrieval-Augmented Generation (VisRAG) leverages vision-language models (VLMs) to jointly retrieve relevant visual documents and generate grounded answers based on multimodal evidence. However, existing VisRAG models degrade in performance when visual inputs suffer from distortions such as blur, noise, low light, or shadow, where semantic and degradation factors become entangled within pretrained visual encoders, leading to errors in both retrieval and generation stages. To address this limitation, we introduce RobustVisRAG, a causality-guided dual-path framework that improves VisRAG robustness while preserving efficiency and zero-shot generalization. RobustVisRAG uses a non-causal path to capture degradation signals through unidirectional attention and a causal path to learn purified semantics guided by these signals. Together with the proposed Non-Causal Distortion Modeling and Causal Semantic Alignment objectives, the framework enforces a clear separation between semantics and degradations, enabling stable retrieval and generation under challenging visual conditions. To evaluate robustness under realistic conditions, we introduce the Distortion-VisRAG dataset, a large-scale benchmark containing both synthetic and real-world degraded documents across seven domains, with 12 synthetic and 5 real distortion types that comprehensively reflect practical visual degradations. Experimental results show that RobustVisRAG improves retrieval, generation, and end-to-end performance by 7.35%, 6.35%, and 12.40%, respectively, on real-world degradations, while maintaining comparable accuracy on clean inputs.
中文标题/摘要
标题:RobustVisRAG:视觉降级条件下的因果关系感知视觉检索增强生成
基于视觉的检索增强生成(VisRAG)利用视觉语言模型(VLMs)联合检索相关视觉文档,并基于多模态证据生成基于地面的答案。然而,现有的VisRAG模型在视觉输入遭受模糊、噪声、低光照或阴影等失真时性能会下降,因为语义和失真因素在预训练的视觉编码器中交织在一起,导致检索和生成阶段出现错误。为了解决这一局限性,我们提出了RobustVisRAG,这是一种因果关系引导的双路径框架,可以提高VisRAG的鲁棒性,同时保持效率和零样本泛化能力。RobustVisRAG使用非因果路径通过单向注意力捕捉失真信号,并使用这些信号学习因果路径中的净化语义。通过提出的非因果失真建模和因果语义对齐目标,该框架确保语义和失真之间的清晰分离,从而在具有挑战性的视觉条件下实现稳定的检索和生成。为了在现实条件下评估鲁棒性,我们引入了Distortion-VisRAG数据集,这是一个包含合成和真实世界降级文档的大规模基准,覆盖七个领域,包含12种合成和5种真实失真类型,全面反映了实际的视觉降级。实验结果表明,RobustVisRAG在真实世界降级条件下分别提高了检索、生成和端到端性能7.35%、6.35%和12.40%,同时在干净输入上保持了相当的准确性。
Summary / 总结
RobustVisRAG is a causality-guided dual-path framework that enhances the robustness of Vision-based Retrieval-Augmented Generation (VisRAG) models under visual degradations. It uses a non-causal path to capture degradation signals and a causal path to learn purified semantics, improving retrieval and generation performance by 7.35% and 6.35%, respectively, on real-world degradations. The framework also maintains comparable accuracy on clean inputs. The Distortion-VisRAG dataset, containing both synthetic and real-world degraded documents, is introduced to evaluate robustness under realistic conditions.
RobustVisRAG 是一个因果引导的双路径框架,旨在增强视觉检索增强生成(VisRAG)模型在视觉退化条件下的鲁棒性。它通过非因果路径捕捉退化信号,并通过因果路径学习净化的语义,分别在真实世界退化条件下提高检索和生成性能 7.35% 和 6.35%。该框架在干净输入上也保持了相当的准确性。为了评估其鲁棒性,引入了 Distortion-VisRAG 数据集,其中包括七个领域内的合成和真实世界退化文档。
DiffuGuard: How Intrinsic Safety is Lost and Found in Diffusion Large Language Models
Authors: Zherui Li, Zheng Nie, Zhenhong Zhou, Yue Liu, Yitong Zhang, Yu Cheng, Qingsong Wen, Kun Wang, Yufei Guo, Jiaheng Zhang
First: 2025-09-29T05:17:10+00:00 · Latest: 2026-03-26T08:06:49+00:00
Comments: Accepted by ICLR2026
Abstract
The rapid advancement of Diffusion Large Language Models (dLLMs) introduces unprecedented vulnerabilities that are fundamentally distinct from Autoregressive LLMs, stemming from their iterative and parallel generation mechanisms. In this paper, we conduct an in-depth analysis of dLLM vulnerabilities to jailbreak attacks across two distinct dimensions: intra-step and inter-step dynamics. Experimental results reveal a harmful bias inherent in the standard greedy remasking strategy and identify a critical phenomenon we term Denoising-path Dependence, where the safety of early-stage tokens decisively influences the final output. These findings also indicate that while current decoding strategies constitute a significant vulnerability, dLLMs possess a substantial intrinsic safety potential. To unlock this potential, we propose DiffuGuard, a training-free defense framework that addresses vulnerabilities through a dual-stage approach: Stochastic Annealing Remasking dynamically introduces controlled randomness to mitigate greedy selection bias, while Block-level Audit and Repair exploits internal model representations for autonomous risk detection and guided correction. Comprehensive experiments on four dLLMs demonstrate DiffuGuard's exceptional effectiveness, reducing Attack Success Rate against six diverse jailbreak methods from 47.9% to 14.7% while preserving model utility and efficiency. Our code is available at: https://github.com/niez233/DiffuGuard.
中文标题/摘要
标题:DiffuGuard:扩散大语言模型中固有安全性的丧失与恢复
扩散大语言模型(dLLMs)的迅速发展引入了前所未有的漏洞,这些漏洞从根本上不同于自回归大语言模型,源于它们的迭代和并行生成机制。在本文中,我们对dLLM在监禁攻击下的漏洞进行了深入分析,从两个维度进行:单步内和跨步动态。实验结果揭示了标准贪婪去噪策略中固有的有害偏差,并识别出我们称之为去噪路径依赖的关键现象,早期阶段的令牌安全性对最终输出有决定性影响。这些发现还表明,尽管当前的解码策略构成了一个重大漏洞,但dLLMs仍然具有巨大的固有安全性潜力。为了释放这种潜力,我们提出了DiffuGuard,这是一种无需训练的防御框架,通过双重方法来解决漏洞:随机退火去噪动态引入可控的随机性以缓解贪婪选择偏差,而块级审计和修复利用内部模型表示进行自主风险检测和引导修正。在四个dLLMs上的全面实验表明,DiffuGuard具有卓越的效果,将六种不同监禁攻击方法的攻击成功率从47.9%降低到14.7%,同时保持模型的实用性和效率。我们的代码可在:https://github.com/niez233/DiffuGuard/ 获取。
Summary / 总结
This paper investigates the vulnerabilities of Diffusion Large Language Models (dLLMs) to jailbreak attacks, identifying a harmful bias in the standard greedy remasking strategy and a phenomenon called Denoising-path Dependence. It proposes DiffuGuard, a training-free defense framework that uses Stochastic Annealing Remasking and Block-level Audit and Repair to enhance safety. Experiments show that DiffuGuard reduces the Attack Success Rate from 47.9% to 14.7% across four dLLMs, maintaining model utility and efficiency.
本文研究了扩散大型语言模型(dLLMs)对劫持攻击的漏洞,发现了贪婪重掩码中的有害偏差以及一种称为去噪路径依赖的现象。为解决这些问题,作者提出了一个无需训练的防御框架DiffuGuard,包括随机退火重掩码和块级审计与修复。实验表明,DiffuGuard将攻击成功率从47.9%降低到14.7%,同时保持了模型的实用性和效率。
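Illustrative sketch (not the authors' code): in masked-diffusion decoding, each step typically keeps the most confident predictions and remasks the rest. The abstract describes Stochastic Annealing Remasking as adding controlled randomness to that greedy choice. The version below perturbs confidences with noise whose scale decays linearly over steps; the function name, the uniform noise, the linear schedule, and `alpha0` are all assumptions.

```python
import random


def select_positions_to_keep(confidences, k, step, total_steps,
                             alpha0=0.5, rng=None):
    """Pick k token positions to keep unmasked at this decoding step.

    Greedy remasking would keep the k highest confidences. Here each
    confidence is perturbed by alpha * U(0, 1), where alpha anneals from
    alpha0 at step 0 to 0 at the final step (assumed schedule), so early
    steps explore more and late steps become greedy.
    """
    if rng is None:
        rng = random.Random()
    alpha = alpha0 * (1.0 - step / total_steps)
    scored = [(c + alpha * rng.random(), i) for i, c in enumerate(confidences)]
    scored.sort(reverse=True)
    return sorted(i for _, i in scored[:k])
```

Concentrating the randomness in early steps matches the abstract's observation that early-stage tokens decisively influence the final output, although the exact schedule used in the paper may differ.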
Learning to Rank Caption Chains for Video-Text Alignment
Authors: Ansel Blume, Burak Uzkent, Shalini Chaudhuri, Garin Kessler
First: 2026-03-26T08:04:57+00:00 · Latest: 2026-03-26T08:04:57+00:00
Abstract
Direct preference optimization (DPO) is an effective technique to train language models to generate preferred over dispreferred responses. However, this binary "winner-takes-all" approach is suboptimal for vision-language models whose response quality is highly dependent on visual content. In particular, a response may still be faithful to the visual inputs even if it is less preferable than an alternative. The standard Bradley-Terry DPO formulation lacks this nuance, upweighting winning responses without sufficient regard for whether the "losing" response still maintains high visual fidelity. In this work, we investigate ranking optimization as an alternative that more precisely situates responses' faithfulness to visual inputs. We focus on video-text alignment using detailed video captions, proposing a method to generate challenging, totally ordered caption chains at scale through repeated caption degradation. Our results show ranking optimization outperforms binary DPO for long-form content generation and assessment, and importantly, we find that these approaches require finetuning of the vision encoder to be effective, challenging the view of DPO as purely a language-reweighting process.
中文标题/摘要
标题:学习排序视频-文本对齐的字幕链
直接偏好优化(DPO)是一种有效的技术,用于训练语言模型生成更优选而非次优选的响应。然而,这种二元的“赢家通吃”方法对于响应质量高度依赖于视觉内容的视觉语言模型来说是次优的。特别是,即使一个响应不如替代响应优选,它也可能忠实于视觉输入。标准的Bradley-Terry DPO公式缺乏这种细微差别,过度强调获胜响应,而没有充分考虑“失败”的响应是否仍然保持了高视觉保真度。在本文中,我们研究排序优化作为一种替代方法,更精确地定位响应对视觉输入的忠实度。我们专注于使用详细的视频字幕进行视频-文本对齐,提出了一种方法,通过反复降级字幕生成具有挑战性的、完全排序的字幕链。我们的结果表明,排序优化在长文本生成和评估中优于二元DPO,并且重要的是,我们发现这些方法需要对视觉编码器进行微调才能有效,挑战了DPO仅为语言权重调整过程的观点。
Summary / 总结
This work addresses the limitations of direct preference optimization (DPO) in vision-language models by proposing ranking optimization, which better captures the faithfulness of responses to visual inputs. The method generates challenging caption chains through repeated degradation and finetunes the vision encoder. Results show that ranking optimization outperforms binary DPO for long-form content generation and assessment, highlighting the importance of finetuning the vision encoder for effective performance.
本文通过提出排名优化方法解决了直接偏好优化(DPO)在视觉-语言模型中的局限性,该方法更好地捕捉了响应对视觉输入的忠实度。该方法通过反复降解生成具有挑战性的标题链,并对视觉编码器进行微调。结果表明,排名优化在长文本生成和评估中优于二元DPO,强调了有效性能中对视觉编码器进行微调的重要性。
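The contrast between binary DPO and ranking over a caption chain can be illustrated with a small sketch. The abstract does not state which ranking objective is used, so the Plackett-Luce listwise loss below is an assumption chosen for illustration, and the scores stand in for hypothetical implicit rewards of each caption.

```python
import math


def log_sigmoid(x: float) -> float:
    # Numerically stable log(sigmoid(x)).
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def pairwise_bt_loss(score_win: float, score_lose: float) -> float:
    """Bradley-Terry loss for one (winner, loser) pair, as in binary DPO."""
    return -log_sigmoid(score_win - score_lose)


def plackett_luce_loss(scores_best_to_worst: list[float]) -> float:
    """Negative log-likelihood of a full ordering under Plackett-Luce.

    Assumption: used here as one possible ranking objective, not
    necessarily the one in the paper.
    """
    loss = 0.0
    for k in range(len(scores_best_to_worst) - 1):
        tail = scores_best_to_worst[k:]
        m = max(tail)
        # log-sum-exp over the captions still unranked at step k
        log_norm = m + math.log(sum(math.exp(s - m) for s in tail))
        loss -= scores_best_to_worst[k] - log_norm
    return loss
```

With a chain of length two the listwise loss reduces to the pairwise Bradley-Terry loss; longer chains add one term per degradation step, so intermediate captions contribute a graded signal rather than being treated as plain losers.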
CoIn3D: Revisiting Configuration-Invariant Multi-Camera 3D Object Detection
Authors: Zhaonian Kuang, Rui Ding, Haotian Wang, Xinhu Zheng, Meng Yang, Gang Hua
Venue: CVPR 2026
First: 2026-03-05T10:49:46+00:00 · Latest: 2026-03-26T07:52:38+00:00
Comments: Accepted to CVPR 2026 main track
Abstract
Multi-camera 3D object detection (MC3D) has attracted increasing attention with the growing deployment of multi-sensor physical agents, such as robots and autonomous vehicles. However, MC3D models still struggle to generalize to unseen platforms with new multi-camera configurations. Current solutions simply employ a meta-camera for unified representation but lack comprehensive consideration. In this paper, we revisit this issue and identify that the devil lies in spatial prior discrepancies across source and target configurations, including different intrinsics, extrinsics, and array layouts. To address this, we propose CoIn3D, a generalizable MC3D framework that enables strong transferability from source configurations to unseen target ones. CoIn3D explicitly incorporates all identified spatial priors into both feature embedding and image observation through spatial-aware feature modulation (SFM) and camera-aware data augmentation (CDA), respectively. SFM enriches feature space by integrating four spatial representations: focal length, ground depth, ground gradient, and Plücker coordinates. CDA improves observation diversity under various configurations via a training-free dynamic novel-view image synthesis scheme. Extensive experiments demonstrate that CoIn3D achieves strong cross-configuration performance on landmark datasets such as NuScenes, Waymo, and Lyft, under three dominant MC3D paradigms represented by BEVDepth, BEVFormer, and PETR.
中文标题/摘要
标题:CoIn3D: 重新审视配置不变的多相机3D物体检测
多相机3D物体检测(MC3D)随着多传感器物理代理(如机器人和自动驾驶车辆)的部署越来越多而受到越来越多的关注。然而,MC3D模型仍然难以在具有新多相机配置的未见过的平台上泛化。当前的解决方案只是使用一个元相机进行统一表示,但缺乏全面的考虑。在本文中,我们重新审视了这一问题,并发现问题在于源配置和目标配置之间的空间先验差异,包括不同的内参、外参和阵列布局。为了解决这一问题,我们提出了CoIn3D,这是一种通用的MC3D框架,能够从源配置高效地转移到未见过的目标配置。CoIn3D通过空间感知特征调制(SFM)和相机感知数据增强(CDA)将所有识别的空间先验显式地整合到特征嵌入和图像观察中。SFM通过整合焦距、地面深度、地面梯度和Plücker坐标等四种空间表示来丰富特征空间。CDA通过一种无需训练的动态新颖视角图像合成方案来在各种配置下提高观察多样性。广泛的实验表明,CoIn3D在NuScenes、Waymo和Lyft等地标数据集上,在BEVDepth、BEVFormer和PETR等三种主导的MC3D范式下,实现了强大的跨配置性能。
Summary / 总结
CoIn3D revisits the challenge of multi-camera 3D object detection across different camera configurations and proposes a framework that addresses spatial prior discrepancies through spatial-aware feature modulation and camera-aware data augmentation. Experiments show that CoIn3D achieves strong cross-configuration performance on landmark datasets such as NuScenes, Waymo, and Lyft under three MC3D paradigms represented by BEVDepth, BEVFormer, and PETR.
CoIn3D通过提出一个通用框架解决了多相机3D物体检测(MC3D)在不同相机配置之间转移的挑战。该框架通过空间感知特征调制和相机感知数据增强将空间先验同时融入特征嵌入和图像观察中。实验结果表明,CoIn3D在NuScenes、Waymo和Lyft等地标数据集的各种MC3D范式下优于现有方法。
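Of the four spatial representations listed for SFM, the Plücker coordinate has a standard closed form. The sketch below shows only that textbook definition for a single camera ray; how CoIn3D computes or injects it into features is not described in the abstract, and the function name is hypothetical.

```python
import math


def plucker_ray(origin, direction):
    """Return (d, m) for a ray: unit direction d and moment m = origin x d."""
    norm = math.sqrt(sum(c * c for c in direction))
    d = tuple(c / norm for c in direction)
    ox, oy, oz = origin
    dx, dy, dz = d
    # Cross product origin x d; it is the same for every point on the ray.
    m = (oy * dz - oz * dy, oz * dx - ox * dz, ox * dy - oy * dx)
    return d, m
```

Because the moment does not change when the origin slides along the ray, the six numbers describe the ray itself rather than a particular camera center, which is one reason the representation suits comparisons across camera layouts.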
Visual Attention Drifts, but Anchors Hold: Mitigating Hallucination in Multimodal Large Language Models via Cross-Layer Visual Anchors
Authors: Chengxu Yang, Jingling Yuan, Chuang Hu, Jiawei Jiang
First: 2026-03-26T06:49:21+00:00 · Latest: 2026-03-26T06:49:21+00:00
Abstract
Multimodal Large Language Models often suffer from object hallucination. While existing research utilizes attention enhancement and visual retracing, we find these works lack sufficient interpretability regarding attention drift in the final model stages. In this paper, we investigate the layer-wise evolution of visual features and discover that hallucination stems from deep-layer attention regressing toward the initial visual noise of early layers. We observe that output reliability depends on acquiring visual anchors at intermediate layers rather than final layers. Based on these insights, we propose CLVA (Cross-Layer Visual Anchors), a training-free method that reinforces critical mid-layer features while suppressing regressive noise. This approach effectively pulls deep-layer attention back to the correct visual regions by utilizing essential anchors captured from attention dynamics. We evaluate our method across diverse architectures and benchmarks, demonstrating outstanding performance without a significant increase in computational time or GPU memory.
中文标题/摘要
标题:视觉注意力漂移但锚点稳固:通过跨层视觉锚点减轻多模态大型语言模型中的幻觉
多模态大型语言模型经常遭受物体幻觉的问题。虽然现有研究利用了注意力增强和视觉重溯,但我们发现这些工作在最终模型阶段缺乏足够的注意力漂移可解释性。在本文中,我们研究了各层视觉特征的演变,并发现幻觉源于深层注意力向早期层的初始视觉噪声回归。我们观察到,输出可靠性取决于在中间层而非最终层获取视觉锚点。基于这些见解,我们提出了CLVA(跨层视觉锚点),这是一种无需训练的方法,可以强化关键中间层特征并抑制回归噪声。该方法通过利用从注意力动态中捕获的关键锚点,有效将深层注意力拉回到正确的视觉区域。我们跨多种架构和基准评估了该方法,证明了其出色的性能,且未显著增加计算时间和GPU内存。
Summary / 总结
The research addresses object hallucination in multimodal large language models by focusing on attention drift in deep layers. It proposes CLVA, a training-free method that uses cross-layer visual anchors to reinforce critical mid-layer features and suppress regressive noise, thereby pulling deep-layer attention back to the correct visual regions. Experiments across various architectures and benchmarks show strong performance without a significant increase in computational time or GPU memory.
论文通过研究视觉特征在各层中的演变,解决了多模态大型语言模型中的对象幻觉问题,发现幻觉源于深层层注意力向早期层的初始视觉噪声回归。作者提出了一种名为CLVA的训练免费方法,利用注意力动态中捕获的关键锚点,强化中间层的关键特征并抑制回归噪声,将深层层的注意力拉回到正确的视觉区域。实验表明,CLVA在不同架构和基准上的性能出色,且计算时间和GPU内存消耗较少。
Sparse Visual Thought Circuits in Vision-Language Models
Authors: Yunpeng Zhou
First: 2026-03-26T06:24:36+00:00 · Latest: 2026-03-26T06:24:36+00:00
Abstract
Sparse autoencoders (SAEs) improve interpretability in multimodal models, but it remains unclear whether SAE features form modular, composable units for reasoning, an assumption underlying many intervention-based steering methods. We test this modularity hypothesis and find it often fails: intervening on a task-selective feature set can modestly improve reasoning accuracy, while intervening on the union of two such sets reliably induces output drift (large unintended changes in predictions) and degrades accuracy, even under norm-matched perturbations. This non-modular circuit interference is consistent with shared internal pathways where feature unions amplify activation shifts. We develop a reproducible causal pipeline to localize and test these sparse visual thought circuits in Qwen3-VL-8B. On a controlled synthetic benchmark with seven task types and three difficulty levels, linear probes identify a mid-decoder locus for task-type information. We train SAEs at this layer, construct task-selective sets via an explicit rule, and perform inference-time scaling and ablation while quantifying accuracy and drift. Our findings, validated with bootstrapped subsamples and permutation controls and replicated across multiple VLM families and five diverse datasets, clarify the boundaries of SAE feature composability and provide a rigorous diagnostic framework for more reliable VLM control.
中文标题/摘要
标题:视觉语言模型中的稀疏视觉思维电路
稀疏自编码器(SAEs)在多模态模型中提高了可解释性,但尚不清楚SAE特征是否形成模块化、可组合的推理单元——这是许多基于干预的方法背后的假设。我们测试了这种模块性假设,并发现它经常失败:干预任务选择性特征集可以适度提高推理准确性,而干预两个此类集合并集通常会导致输出漂移(预测中的大量意外变化)并降低准确性,即使在匹配范例扰动的情况下也是如此。这种非模块化电路干扰与共享的内部路径一致,其中特征并集放大了激活变化。我们开发了一个可重复的因果管道来定位并测试Qwen3-VL-8B中的这些稀疏视觉思维电路。在具有七种任务类型和三种难度级别的受控合成基准测试中,线性探针确定了任务类型信息的中间解码器位置。我们在该层训练SAEs,通过显式规则构建任务选择性集,并在推理时间进行缩放和消融,同时量化准确性和漂移。我们的发现——通过自助子样本和置换控制验证,并在多个VLM家族和五个不同数据集上重复——澄清了SAE特征组合性的边界,并提供了一个严格的诊断框架,以实现更可靠的VLM控制。
Summary / 总结
The study investigates the modularity hypothesis in sparse autoencoders (SAEs) within vision-language models, finding that task-selective feature sets can modestly improve reasoning accuracy but combining them often leads to large unintended changes and decreased accuracy. The research develops a causal pipeline to identify and test these sparse visual thought circuits, validating the findings with bootstrapped subsamples and permutation controls across various vision-language models and datasets.
研究探讨了视觉语言模型中稀疏自编码器(SAE)特征是否形成可用于推理的模块化、可组合单元。研究发现,干预任务选择性特征集可以适度提高推理准确性,但结合这些特征集往往会导致显著的意外变化和准确性下降。研究开发了一种因果管道来识别和测试这些稀疏视觉思维电路,并通过分层抽样和排列控制验证了这些发现,覆盖了多种视觉语言模型和五个不同数据集,为更可靠的视觉语言模型控制提供了框架。
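The phrase "norm-matched perturbations" can be made concrete with a short sketch. It assumes, purely for illustration, that each feature-set intervention is an additive vector in activation space and that matching means rescaling the union's vector to the L2 norm of a single-set intervention; the abstract does not spell out the exact procedure, and all names below are hypothetical.

```python
import math


def l2_norm(v):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in v))


def add(u, v):
    """Element-wise sum of two equally sized vectors."""
    return [a + b for a, b in zip(u, v)]


def rescale_to_norm(delta, target_norm):
    """Keep the direction of delta but force its L2 norm to target_norm."""
    n = l2_norm(delta)
    if n == 0.0:
        return list(delta)
    return [x * target_norm / n for x in delta]


# Hypothetical steering vectors for two task-selective feature sets.
delta_a = [3.0, 4.0]
delta_b = [1.0, 0.0]
union = add(delta_a, delta_b)
union_matched = rescale_to_norm(union, l2_norm(delta_a))
```

Under this control, any extra drift from the union cannot be attributed to a larger perturbation magnitude alone, which is the point the abstract makes when it says the effect persists under norm matching.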
GIFT: Global Irreplaceability Frame Targeting for Efficient Video Understanding
Authors: Junpeng Ma, Sashuai Zhou, Guanghao Li, Xin Gao, Yue Cao, Hengyu Zeng, Yuxiang Yan, Zhibin Wang, Jun Song, Bo Zheng, Shanghang Zhang, Jian Pu
First: 2026-03-26T06:21:41+00:00 · Latest: 2026-03-26T06:21:41+00:00
Comments: 11 pages, 3 figures
Abstract
Video Large Language Models (VLMs) have achieved remarkable success in video understanding, but the significant computational cost from processing dense frames severely limits their practical application. Existing methods alleviate this by selecting keyframes, but their greedy decision-making, combined with a decoupled evaluation of relevance and diversity, often falls into local optima and results in erroneously selecting irrelevant noise frames. To address these challenges, we propose GIFT: Global Irreplaceability Frame Targeting, a novel training-free framework that selects frames by assessing their intrinsic irreplaceability. Specifically, we first introduce Directed Diversity to quantify a frame's uniqueness conditioned on relevance, which allows us to formulate a unified irreplaceability score. Subsequently, our Budget-Aware Refinement strategy employs an adaptive iterative process that first secures a core set of frames with the highest irreplaceability, and then shifts its priority to building crucial temporal context around these selections as the budget expands. Extensive experiments demonstrate that GIFT achieves a maximum average improvement of 12.5% across long-form video benchmarks on LLaVA-Video-7B compared to uniform sampling.
中文标题/摘要
标题:GIFT:面向高效视频理解的全局不可替代性帧选取
视频大型语言模型(VLMs)在视频理解方面取得了显著的成功,但由于处理密集帧的巨大计算成本严重限制了其实际应用。现有方法通过选择关键帧来缓解这一问题,但其贪婪的决策过程以及相关性和多样性评估的脱钩,往往陷入局部最优,导致错误地选择了无关的噪声帧。为了解决这些挑战,我们提出了GIFT:全局不可替代性框架目标定位,这是一种无需训练的新颖框架,通过评估帧的内在不可替代性来选择帧。具体而言,我们首先引入定向多样性来量化在相关性条件下的帧的独特性,这使我们能够制定统一的不可替代性评分。随后,我们的预算感知精炼策略采用自适应迭代过程,首先确保具有最高不可替代性的核心帧集,然后随着预算的扩大,优先构建这些选择周围的至关重要的时间上下文。广泛的实验表明,与均匀采样相比,GIFT在LLaVA-Video-7B的长格式视频基准上实现了最高12.5%的平均改进。
Summary / 总结
The research aims to improve the efficiency of video understanding by addressing the high computational cost of processing dense frames in Video Large Language Models (VLMs). GIFT, a training-free framework, selects frames based on their intrinsic irreplaceability, using Directed Diversity to quantify uniqueness and a Budget-Aware Refinement strategy to iteratively secure and expand a core set of frames. Experiments show that GIFT outperforms uniform sampling by up to 12.5% on long-form video benchmarks with LLaVA-Video-7B.
研究旨在通过解决视频大型语言模型(VLMs)处理密集帧时的高计算成本问题,提高视频理解的效率。GIFT 是一个无需训练的框架,基于帧的内在不可替代性进行选择,使用定向多样性来量化独特性,并采用预算感知细化策略逐步确保核心帧并构建时间上下文。实验表明,GIFT 在长视频基准上的性能比均匀采样提高了最多 12.5%。
Enhancing Cross-View UAV Geolocalization via LVLM-Driven Relational Modeling
Authors: Bowen Liu, Pengyue Jia, Wanyu Wang, Derong Xu, Jiawei Cheng, Jiancheng Dong, Xiao Han, Zimo Zhao, Chao Zhang, Bowen Yu, Fangyu Hong, Xiangyu Zhao
First: 2026-03-09T07:57:29+00:00 · Latest: 2026-03-26T05:49:31+00:00
Abstract
The primary objective of cross-view UAV geolocalization is to identify the exact spatial coordinates of drone-captured imagery by aligning it with extensive, geo-referenced satellite databases. Current approaches typically extract features independently from each perspective and rely on basic heuristics to compute similarity, thereby failing to explicitly capture the essential interactions between different views. To address this limitation, we introduce a novel, plug-and-play ranking architecture designed to explicitly perform joint relational modeling for improved UAV-to-satellite image matching. By harnessing the capabilities of a Large Vision-Language Model (LVLM), our framework effectively learns the deep visual-semantic correlations linking UAV and satellite imagery. Furthermore, we present a novel relational-aware loss function to optimize the training phase. By employing soft labels, this loss provides fine-grained supervision that avoids overly penalizing near-positive matches, ultimately boosting both the model's discriminative power and training stability. Comprehensive evaluations across various baseline architectures and standard benchmarks reveal that the proposed method substantially boosts the retrieval accuracy of existing models, yielding superior performance even under highly demanding conditions.
中文标题/摘要
标题:通过LVLM驱动的关系建模增强跨视角无人机地理定位
跨视角无人机地理定位的主要目标是通过将无人机捕获的图像与广泛的地理参考卫星数据库对齐,来确定其精确的空间坐标。当前的方法通常独立从每个视角提取特征,并依赖基本的启发式方法来计算相似性,从而未能明确捕捉不同视角之间的关键交互。为了解决这一局限性,我们提出了一种新颖的即插即用排名架构,旨在显式地进行联合关系建模,以提高无人机与卫星图像的匹配效果。通过利用大型视觉-语言模型(LVLM)的能力,我们的框架有效地学习了无人机和卫星图像之间的深层视觉-语义关联。此外,我们还提出了一种新的关系感知损失函数来优化训练阶段。通过使用软标签,该损失提供了细粒度的监督,避免了对近似正匹配过度惩罚,最终提高了模型的辨别能力和训练稳定性。在各种基线架构和标准基准上的全面评估表明,所提出的方法显著提高了现有模型的检索准确性,在苛刻条件下也表现出更优的性能。
Summary / 总结
The research aims to enhance cross-view UAV geolocalization by addressing the limitations of current approaches that fail to capture interactions between different views. The method introduces a ranking architecture that uses a Large Vision-Language Model (LVLM) for joint relational modeling, along with a novel relational-aware loss function. The experimental results show significant improvements in retrieval accuracy compared to baseline models, even under challenging conditions.
研究旨在通过解决当前方法无法捕捉不同视图之间交互的问题来提升无人机跨视角地理定位。方法引入了一个使用大型视觉-语言模型(LVLM)进行联合关系建模的排名架构,并提出了一种新型的关系感知损失函数来优化训练。实验表明,这种方法在各种基线模型和标准基准上显著提高了检索准确性,即使在苛刻条件下也是如此。
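The effect of soft labels on near-positive matches can be seen in a generic soft-target cross-entropy. This is a minimal sketch, not the paper's relational-aware loss: the way soft labels are derived is not given in the abstract, so the target distributions below are hypothetical.

```python
import math


def soft_label_cross_entropy(logits, soft_targets):
    """Cross-entropy between softmax(logits) and a target distribution."""
    m = max(logits)
    # Stable log of the softmax normalizer.
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return -sum(t * (x - log_z) for x, t in zip(logits, soft_targets))


# Candidate 0 is the true match, candidate 1 a near-positive, candidate 2 a negative.
logits = [1.0, 2.0, 0.0]       # the model ranks the near-positive first
hard = [1.0, 0.0, 0.0]         # one-hot supervision
soft = [0.8, 0.2, 0.0]         # hypothetical soft labels
hard_loss = soft_label_cross_entropy(logits, hard)
soft_loss = soft_label_cross_entropy(logits, soft)
```

Here the soft-label loss is smaller than the one-hot loss for the same prediction, since part of the target mass sits on the near-positive candidate that the model preferred.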
Which Way Does Time Flow? A Psychophysics-Grounded Evaluation for Vision-Language Models
Authors: Shiho Matta, Lis Kanashiro Pereira, Peitao Han, Fei Cheng, Shigeru Kitazawa
First: 2025-10-30T08:21:50+00:00 · Latest: 2026-03-26T05:39:15+00:00
Comments: 12 pages
Abstract
Modern vision-language models (VLMs) excel at many multimodal tasks, yet their grasp of temporal information in video remains weak and has not been adequately evaluated. We probe this gap with a deceptively simple but revealing challenge: judging the arrow of time (AoT), that is, whether a short clip is played forward or backward. We introduce AoT-PsyPhyBENCH, a psychophysically validated benchmark that tests whether VLMs can infer temporal direction in natural videos using the same stimuli and behavioral baselines established for humans. Our comprehensive evaluation of open-weight and proprietary, reasoning and non-reasoning VLMs reveals that most models perform near chance, and even the best model lags far behind human accuracy on physically irreversible processes (e.g., free fall, diffusion/explosion) and causal manual actions (division/addition) that humans recognize almost instantly. These results highlight a fundamental gap in current multimodal systems: while they capture rich visual-semantic correlations, they lack the inductive biases required for temporal continuity and causal understanding. We release the code and data for AoT-PsyPhyBENCH to encourage further progress in the physical and temporal reasoning capabilities of VLMs.
中文标题/摘要
标题:时间流动的方向如何?基于心理物理学的视觉-语言模型评估
现代视觉-语言模型(VLMs)在许多多模态任务中表现出色,但在视频中的时间信息理解方面仍然较弱且未得到充分评估。我们通过一个看似简单但揭示性强的挑战——判断时间箭头(AoT)——来探索这一差距:判断短片段是正向播放还是反向播放。我们引入了AoT-PsyPhyBENCH,这是一个通过与人类相同的刺激和行为基线进行心理物理学验证的基准测试,测试VLMs是否能够从自然视频中推断出时间方向。我们对开放权重和专有、推理和非推理VLMs的全面评估显示,大多数模型的表现接近随机猜测,而最好的模型在物理不可逆过程(如自由落体、扩散/爆炸)和因果手动操作(如分割/加法)上的表现也远远落后于人类的准确度,这些过程人类几乎可以瞬间识别。这些结果突显了当前多模态系统中的一个根本性差距:虽然它们捕捉了丰富的视觉-语义关联,但缺乏用于时间连续性和因果理解的归纳偏置。我们发布了AoT-PsyPhyBENCH的代码和数据,以鼓励进一步提高VLMs在物理和时间推理能力方面的发展。
Summary / 总结
The research evaluates the ability of vision-language models (VLMs) to understand temporal information in videos by introducing AoT-PsyPhyBENCH, a benchmark that tests models' ability to infer the direction of time in natural videos. The evaluation shows that most VLMs perform near chance, with even the best model lagging significantly behind human accuracy in recognizing physically irreversible processes and causal manual actions. This highlights a fundamental gap in current VLMs' temporal reasoning capabilities despite their strong performance in visual-semantic correlation tasks.
研究通过引入AoT-PsyPhyBENCH基准测试,评估了视觉-语言模型(VLMs)在理解视频中时间方向信息方面的能力。评估结果显示,大多数VLMs的表现接近随机猜测,特别是在物理不可逆过程和因果手动动作方面,表明它们在时间推理能力方面存在显著差距。结果表明,VLMs需要更好的时间连续性和因果理解的归纳偏置。研究发布了AoT-PsyPhyBENCH的代码和数据,以促进该领域的进一步研究。
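Whether a model "performs near chance" on a forward/backward judgment can be checked with an exact binomial test against 50% accuracy. The sketch below is a generic statistical check, not the benchmark's own evaluation code, and the function name is hypothetical.

```python
from math import comb


def binomial_two_sided_p(correct: int, total: int, p: float = 0.5) -> float:
    """Exact two-sided binomial p-value.

    Sums the probabilities of all outcomes that are no more likely
    than the observed number of correct answers under chance rate p.
    """
    pmf = [comb(total, k) * p**k * (1 - p) ** (total - k) for k in range(total + 1)]
    observed = pmf[correct]
    # Small relative tolerance so ties are counted despite rounding.
    return min(1.0, sum(q for q in pmf if q <= observed * (1 + 1e-9)))
```

For example, `binomial_two_sided_p(55, 100)` gives a p-value for 55 correct out of 100 clips; a large value means that accuracy is statistically indistinguishable from guessing.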
Mechanistically Interpreting Compression in Vision-Language Models
Authors: Veeraraju Elluru, Arth Singh, Roberto Aguero, Ajay Agarwal, Debojyoti Das, Hreetam Paul
First: 2026-03-26T05:10:32+00:00 · Latest: 2026-03-26T05:10:32+00:00
Comments: 15 pages, 7 figures, 12 tables
Abstract
Compressed vision-language models (VLMs) are widely used to reduce memory and compute costs, making them a suitable choice for real-world deployment. However, compressing these models raises concerns about whether internal computations and safety behaviors are preserved. In this work, we use causal circuit analysis and crosscoder-based feature comparisons to examine how pruning and quantization fundamentally change the internals across representative VLMs. We observe that pruning generally keeps circuit structure intact but rotates and attenuates internal features, while quantization modifies the circuits at a higher level yet leaves the surviving features better aligned. Leveraging this insight, we also introduce VLMSafe-420, a novel benchmark that pairs harmful inputs with matched benign counterfactuals across various safety categories. Our findings show that pruning causes a sharp drop in genuine refusal behavior, suggesting that the choice of compression has safety implications.
中文标题/摘要
标题:从机制上解读视觉-语言模型的压缩
压缩视觉-语言模型(VLMs)广泛用于降低内存和计算成本,使其成为现实部署的理想选择。然而,压缩这些模型会引发对其内部计算和安全行为是否得以保留的担忧。在本研究中,我们使用因果电路分析和跨编译器特征比较来探讨剪枝和量化如何从根本上改变代表性VLMs的内部结构。我们观察到,剪枝通常保持电路结构不变,但会旋转和减弱内部特征,而量化则在更高层次上修改电路,但保留的特征更对齐。利用这一见解,我们还引入了VLMSafe-420这一新型基准,该基准将有害输入与各种安全类别中的匹配良性反事实配对。我们的研究发现剪枝会导致真实拒绝行为的急剧下降,表明压缩选择具有安全影响。
Summary / 总结
This study investigates how compression techniques affect the internal computations and safety behaviors of vision-language models (VLMs). Using causal circuit analysis and crosscoder-based feature comparisons, the researchers find that pruning generally keeps the circuit structure intact but rotates and attenuates internal features, while quantization modifies the circuits at a higher level yet leaves the surviving features better aligned. The study introduces VLMSafe-420, a benchmark that pairs harmful inputs with matched benign counterfactuals, and finds that pruning causes a sharp drop in genuine refusal behavior, indicating that the choice of compression has safety implications.
这项研究通过因果电路分析和交叉编译器特征比较,探讨了压缩技术如何影响视觉-语言模型(VLMs)。研究发现,剪枝保持了电路结构但改变了内部特征,而量化则在更高层次上改变了电路但保留了更好的对齐特征。研究还引入了VLMSafe-420基准,评估压缩下的安全行为,结果显示剪枝减少了真实的拒绝行为,表明压缩选择的安全影响。
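For readers unfamiliar with the two compression families compared here, the sketch below shows the simplest textbook variant of each on a flat weight list. The abstract does not say which pruning or quantization algorithms were studied, so unstructured magnitude pruning and symmetric uniform quantization are assumptions picked for illustration.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the given fraction of weights with the smallest magnitude."""
    k = int(len(weights) * sparsity)
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    drop = set(order[:k])
    return [0.0 if i in drop else w for i, w in enumerate(weights)]


def quantize_symmetric(weights, bits=8):
    """Round weights to a symmetric uniform grid with 2**(bits-1)-1 positive levels."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(w) for w in weights) / qmax or 1.0  # avoid a zero scale
    return [round(w / scale) * scale for w in weights]
```

Pruning removes some weights entirely and leaves the rest exact, while quantization keeps every weight but perturbs each one by at most half a grid step, which gives some intuition for why the two could alter internal features in different ways.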
CARE: Training-Free Controllable Restoration for Medical Images via Dual-Latent Steering
Authors: Xu Liu
First: 2026-03-26T04:43:28+00:00 · Latest: 2026-03-26T04:43:28+00:00
Abstract
Medical image restoration is essential for improving the usability of noisy, incomplete, and artifact-corrupted clinical scans, yet existing methods often rely on task-specific retraining and offer limited control over the trade-off between faithful reconstruction and prior-driven enhancement. This lack of controllability is especially problematic in clinical settings, where overly aggressive restoration may introduce hallucinated details or alter diagnostically important structures. In this work, we propose CARE, a training-free controllable restoration framework for real-world medical images that explicitly balances structure preservation and prior-guided refinement during inference. CARE uses a dual-latent restoration strategy, in which one branch enforces data fidelity and anatomical consistency while the other leverages a generative prior to recover missing or degraded information. A risk-aware adaptive controller dynamically adjusts the contribution of each branch based on restoration uncertainty and local structural reliability, enabling conservative or enhancement-focused restoration modes without additional model training. We evaluate CARE on noisy and incomplete medical imaging scenarios and show that it achieves strong restoration quality while better preserving clinically relevant structures and reducing the risk of implausible reconstructions. The proposed approach offers a practical step toward safer, more controllable, and more deployment-ready medical image restoration.
中文标题/摘要
标题:CARE: 无需训练的可控医学图像恢复框架通过双潜空间引导
医学图像恢复对于提高噪声、不完整和伪影污染的临床扫描的可用性至关重要,但现有方法通常依赖于特定任务的重新训练,并且在保真重建和先验驱动增强之间的权衡控制有限。在临床环境中,这种缺乏可控性尤其成问题,因为过于激进的恢复可能会引入虚构的细节或改变诊断上重要的结构。在本文中,我们提出了一种无需训练的可控恢复框架CARE,该框架在推理过程中明确平衡结构保存和先验引导的细化。CARE 使用一种双潜空间恢复策略,其中一个分支强制数据保真度和解剖一致性,而另一个分支利用生成先验来恢复缺失或退化的信息。一种基于风险的自适应控制器根据恢复不确定性及局部结构可靠性动态调整每个分支的贡献,从而在无需额外模型训练的情况下实现保守或增强导向的恢复模式。我们在噪声和不完整的医学成像场景中评估了CARE,并展示了其在保持临床相关结构的同时,实现了高质量的恢复效果,降低了不合理的重建风险。所提出的方法为更安全、更可控和更易于部署的医学图像恢复提供了一种实用的步骤。
Summary / 总结
CARE is a training-free controllable restoration framework for medical images that balances structure preservation and prior-guided refinement. It uses a dual-latent strategy with one branch ensuring data fidelity and anatomical consistency, and another leveraging a generative prior to recover missing information. A risk-aware controller dynamically adjusts the restoration process, allowing for conservative or enhancement-focused modes. CARE demonstrates strong restoration quality while preserving clinically relevant structures and reducing the risk of implausible reconstructions.
CARE 是一个无需训练的医学图像恢复框架,能够平衡结构保真和先验驱动的细化。它采用双潜空间策略,一个分支确保数据保真和解剖一致性,另一个分支利用生成先验恢复缺失或降级的信息。一个风险感知自适应控制器根据恢复不确定性和局部结构可靠性动态调整每个分支的贡献,允许保守或增强导向的恢复模式。CARE 在实现较高恢复质量的同时,更好地保留了临床相关结构,并降低了不合理重建的风险。
Interpretable Zero-shot Referring Expression Comprehension with Query-driven Scene Graphs
Authors: Yike Wu, Necva Bolucu, Stephen Wan, Dadong Wang, Jiahao Xia, Jian Zhang
Venue: MM
First: 2026-03-26T04:05:30+00:00 · Latest: 2026-03-26T04:05:30+00:00
Comments: Accepted by T-MM
Abstract
Zero-shot referring expression comprehension (REC) aims to locate target objects in images given natural language queries without relying on task-specific training data, demanding strong visual understanding capabilities. Existing Vision-Language Models (VLMs), such as CLIP, commonly address zero-shot REC by directly measuring feature similarities between textual queries and image regions. However, these methods struggle to capture fine-grained visual details and understand complex object relationships. Meanwhile, although Large Language Models (LLMs) excel at high-level semantic reasoning, their inability to directly abstract visual features into textual semantics limits their application in REC tasks. To overcome these limitations, we propose SGREC, an interpretable zero-shot REC method leveraging query-driven scene graphs as structured intermediaries. Specifically, we first employ a VLM to construct a query-driven scene graph that explicitly encodes spatial relationships, descriptive captions, and object interactions relevant to the given query. By leveraging this scene graph, we bridge the gap between low-level image regions and the higher-level semantic understanding required by LLMs. Finally, an LLM infers the target object from the structured textual representation provided by the scene graph, responding with detailed explanations for its decisions that ensure interpretability in the inference process. Extensive experiments show that SGREC achieves top-1 accuracy on most zero-shot REC benchmarks, including RefCOCO val (66.78%), RefCOCO+ testB (53.43%), and RefCOCOg val (73.28%), highlighting its strong visual scene understanding.
中文标题/摘要
标题:基于查询驱动场景图的可解释零样本引用表达理解
零样本引用表达理解(REC)旨在在不依赖特定任务训练数据的情况下,根据自然语言查询在图像中定位目标对象,要求具备强大的视觉理解能力。现有的视觉-语言模型(VLMs),如CLIP,通常通过直接测量文本查询和图像区域之间的特征相似性来解决零样本REC问题。然而,这些方法难以捕捉细微的视觉细节并理解复杂的对象关系。与此同时,大型语言模型(LLMs)在高层次语义推理方面表现出色,但它们无法直接将视觉特征抽象为文本语义,限制了它们在REC任务中的应用。为克服这些限制,我们提出了一种基于查询驱动场景图的可解释零样本REC方法——SGREC。具体而言,我们首先使用VLM构建一个查询驱动的场景图,明确编码与给定查询相关的空间关系、描述性说明和对象交互。通过利用这个场景图,我们弥合了低级图像区域与LLMs所需的高度语义理解之间的差距。最后,LLM从场景图提供的结构化文本表示中推断出目标对象,并以详细的解释回应,确保推理过程中的可解释性。大量实验表明,SGREC在包括RefCOCO val(66.78%)、RefCOCO+ testB(53.43%)和RefCOCOg val(73.28%)在内的大多数零样本REC基准测试中实现了最高的准确率,突显了其强大的视觉场景理解能力。
Summary / 总结
The paper addresses zero-shot referring expression comprehension (REC) by proposing SGREC, which uses query-driven scene graphs to bridge the gap between low-level image features and high-level semantic understanding. It constructs scene graphs using a Vision-Language Model (VLM) to encode spatial relationships and object interactions, then uses a Large Language Model (LLM) to infer the target object with detailed explanations. SGREC achieves top-1 accuracy on several REC benchmarks, demonstrating strong visual scene understanding capabilities.
论文提出SGREC方法,通过使用查询驱动的场景图来弥合低级图像特征与高级语义理解之间的差距。首先使用视觉-语言模型(VLM)构建场景图,编码空间关系和对象交互,然后使用大型语言模型(LLM)从这种结构化的文本表示中推断目标对象。实验表明,SGREC在各种零样本引用表达理解基准测试中表现出色,获得较高的top-1准确率。