arXiv Paper Digest

2026-04-28 04:29
Snapshot: 20260428_0429
FlowAnchor: Stabilizing the Editing Signal for Inversion-Free Video Editing
Authors: Ze Chen, Lan Chen, Yuanhang Li, Qi Mao
First: 2026-04-24T14:17:11+00:00 · Latest: 2026-04-24T14:17:11+00:00
Comments: Under review
Abstract
We propose FlowAnchor, a training-free framework for stable and efficient inversion-free, flow-based video editing. Inversion-free editing methods have recently shown impressive efficiency and structure preservation in images by directly steering the sampling trajectory with an editing signal. However, extending this paradigm to videos remains challenging, often failing in multi-object scenes or with increased frame counts. We identify the root cause as the instability of the editing signal in high-dimensional video latent spaces, which arises from imprecise spatial localization and length-induced magnitude attenuation. To overcome this challenge, FlowAnchor explicitly anchors both where to edit and how strongly to edit. It introduces Spatial-aware Attention Refinement, which enforces consistent alignment between textual guidance and spatial regions, and Adaptive Magnitude Modulation, which adaptively preserves sufficient editing strength. Together, these mechanisms stabilize the editing signal and guide the flow-based evolution toward the desired target distribution. Extensive experiments demonstrate that FlowAnchor achieves more faithful, temporally coherent, and computationally efficient video editing across challenging multi-object and fast-motion scenarios. The project page is available at https://cuc-mipg.github.io/FlowAnchor.github.io/.
Cross-Stage Coherence in Hierarchical Driving VQA: Explicit Baselines and Learned Gated Context Projectors
Authors: Gautam Kumar Jain, Carsten Markgraf, Julian Stähler
First: 2026-04-24T13:54:48+00:00 · Latest: 2026-04-24T13:54:48+00:00
Comments: 16 pages, 8 figures, 8 tables, preprint
Abstract
Graph Visual Question Answering (GVQA) for autonomous driving organizes reasoning into ordered stages, namely Perception, Prediction, and Planning, where planning decisions should remain consistent with the model's own perception. We present a comparative study of cross-stage context passing on DriveLM-nuScenes using two complementary mechanisms. The explicit variant evaluates three prompt-based conditioning strategies on a domain-adapted 4B VLM (Mini-InternVL2-4B-DA-DriveLM) without additional training, reducing NLI contradiction by up to 42.6% and establishing a strong zero-training baseline. The implicit variant introduces gated context projectors, which extract a hidden-state vector from one stage and inject a normalized, gated projection into the next stage's input embeddings. These projectors are jointly trained with stage-specific QLoRA adapters on a general-purpose 8B VLM (InternVL3-8B-Instruct) while updating only approximately 0.5% of parameters. The implicit variant achieves a statistically significant 34% reduction in planning-stage NLI contradiction (bootstrap 95% CIs, p < 0.05) and increases cross-stage entailment by 50%, evaluated with a multilingual NLI classifier to account for mixed-language outputs. Planning language quality also improves (CIDEr +30.3%), but lexical overlap and structural consistency degrade due to the absence of driving-domain pretraining. Since the two variants use different base models, we present them as complementary case studies: explicit context passing provides a strong training-free baseline for surface consistency, while implicit gated projection delivers significant planning-stage semantic gains, suggesting domain adaptation as a plausible next ingredient for full-spectrum improvement.
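The gated context projector described in the abstract above (extract a hidden-state vector, project it, normalize it, gate it, inject it into the next stage's input embeddings) can be pictured with a minimal sketch. The abstract gives no implementation details, so everything here is an assumption for illustration: the function name `gated_context_injection`, the single linear projection, the L2 normalisation, the scalar sigmoid gate, and the choice to add the result to every token embedding.

```python
import math


def gated_context_injection(hidden, weight, gate_logit, embeddings):
    """Project a hidden-state vector, L2-normalise it, gate it, and add it
    to each input embedding of the next stage (all design choices assumed)."""
    # Linear projection: one output value per row of the weight matrix.
    projected = [sum(w * h for w, h in zip(row, hidden)) for row in weight]
    # L2 normalisation keeps the injected vector at unit scale.
    norm = math.sqrt(sum(p * p for p in projected)) or 1.0
    normalised = [p / norm for p in projected]
    # A scalar sigmoid gate in (0, 1) controls how much context is injected.
    gate = 1.0 / (1.0 + math.exp(-gate_logit))
    return [[e + gate * c for e, c in zip(token, normalised)]
            for token in embeddings]


hidden = [1.0, 2.0, 3.0]                       # hidden state from the previous stage
weight = [[0.5, 0.0, 0.0], [0.0, 0.5, 0.0]]    # hypothetical 3 -> 2 projection
embeddings = [[0.0, 0.0], [1.0, 1.0]]          # next-stage input embeddings

closed = gated_context_injection(hidden, weight, -30.0, embeddings)  # gate ~ 0
opened = gated_context_injection(hidden, weight, 30.0, embeddings)   # gate ~ 1
```

With the gate nearly closed the embeddings pass through essentially unchanged, which is one reason a learned gate is a convenient way to add context without disturbing a pretrained model at initialisation.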
ReLIC-SGG: Relation Lattice Completion for Open-Vocabulary Scene Graph Generation
Authors: Amir Hosseini, Sara Farahani, Xinyi Li, Suiyang Guang
First: 2026-04-24T13:36:41+00:00 · Latest: 2026-04-24T13:36:41+00:00
Abstract
Open-vocabulary scene graph generation (SGG) aims to describe visual scenes with flexible relation phrases beyond a fixed predicate set. Existing methods usually treat annotated triplets as positives and all unannotated object-pair relations as negatives. However, scene graph annotations are inherently incomplete: many valid relations are missing, and the same interaction can be described at different granularities, e.g., "on", "standing on", "resting on", and "supported by". This issue becomes more severe in open-vocabulary SGG due to the much larger relation space. We propose ReLIC-SGG, a relation-incompleteness-aware framework that treats unannotated relations as latent variables rather than definite negatives. ReLIC-SGG builds a semantic relation lattice to model similarity, entailment, and contradiction among open-vocabulary predicates, and uses it to infer missing positive relations from visual-language compatibility, graph context, and semantic consistency. A positive-unlabeled graph learning objective further reduces false-negative supervision, while lattice-guided decoding produces compact and semantically consistent scene graphs. Experiments on conventional, open-vocabulary, and panoptic SGG benchmarks show that ReLIC-SGG improves rare and unseen predicate recognition and better recovers missing relations.
Summary / 总结
The paper addresses the challenge of open-vocabulary scene graph generation by proposing ReLIC-SGG, which treats unannotated relations as latent variables rather than negatives. It builds a semantic relation lattice to model relations and uses this lattice to infer missing positive relations from visual-language compatibility, graph context, and semantic consistency. The method improves rare and unseen predicate recognition and better recovers missing relations compared to existing methods.
论文提出ReLIC-SGG框架,将未标注的关系视为潜在变量而非否定项。该框架构建了一个语义关系格来建模关系,并利用此格从视觉-语言兼容性、图上下文和语义一致性中推断缺失的正关系。该方法在稀有和未见过的谓词识别以及恢复缺失关系方面优于现有方法。
Contrastive Semantic Projection: Faithful Neuron Labeling with Contrastive Examples
Authors: Oussama Bouanani, Jim Berend, Wojciech Samek, Sebastian Lapuschkin, Maximilian Dreyer
First: 2026-04-24T11:55:50+00:00 · Latest: 2026-04-24T11:55:50+00:00
Abstract
Neuron labeling assigns textual descriptions to internal units of deep networks. Existing approaches typically rely on highly activating examples, often yielding broad or misleading labels by focusing on dominant but incidental visual factors. Prior work such as FALCON introduced contrastive examples (inputs that are semantically similar to activating examples but elicit low activations) to sharpen explanations, but it primarily addresses subspace-level interpretability rather than scalable neuron-level labeling. We revisit contrastive explanations for neuron-level labeling in two stages: (1) candidate label generation with vision language models (VLMs) and (2) label assignment with CLIP-like encoders. First, we show that providing contrastive image sets to VLMs yields candidate labels that are more specific and more faithful. Second, we introduce Contrastive Semantic Projection (CSP), an extension of SemanticLens that incorporates contrastive examples directly into its CLIP-based scoring and selection pipeline. Across extensive experiments and a case study on melanoma detection, contrastive labeling improves both faithfulness and semantic granularity over state-of-the-art baselines. Our results demonstrate that contrastive examples are a simple yet powerful and currently underutilized component of neuron labeling and analysis pipelines.
Summary / 总结
The paper addresses the issue of broad or misleading labels in neuron labeling by proposing a method that uses contrastive examples. It involves two stages: generating candidate labels with vision language models and assigning labels with CLIP-like encoders that incorporate contrastive examples. The method yields more specific and faithful labels compared to existing approaches, as demonstrated in extensive experiments and a case study on melanoma detection.
论文通过使用对比样本来解决神经元标签过于宽泛或误导的问题,提出了一种方法,包括两个阶段:使用视觉语言模型生成候选标签和使用包含对比样例的CLIP类似编码器分配标签。该方法在广泛的实验和黑色素瘤检测案例研究中,相比现有方法提供了更具体和忠实的标签。
Towards Adaptive Continual Model Merging via Manifold-Aware Expert Evolution
Authors: Haiyun Qiu, Xingyu Wu, Kay Chen Tan
First: 2026-04-24T11:35:53+00:00 · Latest: 2026-04-24T11:35:53+00:00
Abstract
Continual Model Merging (CMM) sequentially integrates task-specific models into a unified architecture without intensive retraining. However, existing CMM methods are hindered by a fundamental saturation-redundancy dilemma: backbone-centric approaches face parameter saturation and representation interference within fixed capacities, whereas Mixture-of-Experts (MoE) variants resort to indiscriminate expansion, incurring expert redundancy and a routing bottleneck reliant on additional data-driven optimization. To resolve these challenges, we propose MADE-IT (Manifold-Aware Dynamic Expert Evolution and Implicit rouTing), an adaptive CMM method that orchestrates expert management and activation by grounding intrinsic expert representations in manifold geometry. We introduce a projection-based subspace affinity metric coupled with a distribution-aware adaptive threshold mechanism to guide autonomous expert evolution, harmonizing diversity with architectural parsimony. Furthermore, to bypass parameterized gating networks, we design a data-free and training-free implicit routing mechanism that activates experts via feature-subspace alignment. Extensive experiments demonstrate that MADE-IT consistently outperforms strong baselines in accuracy and robustness across long-horizon and shuffled task sequences, while significantly pruning redundant experts, particularly within generic modules and early layers.
Summary / 总结
The paper addresses the challenges in Continual Model Merging (CMM) by proposing MADE-IT, an adaptive method that manages and activates experts based on manifold geometry. It uses a projection-based metric to guide expert evolution and an implicit routing mechanism for feature-subspace alignment, avoiding the need for additional data-driven optimization. Experimental results show that MADE-IT outperforms strong baselines in accuracy and robustness, and prunes redundant experts effectively, especially in generic modules and early layers.
论文提出了一种基于流形几何管理专家的持续模型合并方法MADE-IT,以避免参数饱和和专家冗余。它引入了基于投影的亲和度度量和自适应阈值机制来实现自主专家进化,并设计了基于特征子空间对齐的隐式路由机制。实验结果表明,MADE-IT在准确性和鲁棒性方面优于强基线,并有效修剪了冗余专家。
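A projection-based subspace affinity of the kind the MADE-IT abstract mentions can be sketched as follows. This is not the authors' metric: it is the standard normalised squared Frobenius norm of the product of two orthonormal bases (the mean squared cosine of the principal angles), and the names `subspace_affinity` and `evolve` are hypothetical. The paper describes a distribution-aware adaptive threshold; a fixed threshold is swapped in here purely to keep the sketch short.

```python
def subspace_affinity(basis_a, basis_b):
    """Affinity in [0, 1] between two subspaces, each given as a list of
    orthonormal basis vectors: sum of squared pairwise dot products,
    divided by the smaller subspace dimension."""
    total = 0.0
    for u in basis_a:
        for v in basis_b:
            dot = sum(x * y for x, y in zip(u, v))
            total += dot * dot
    return total / min(len(basis_a), len(basis_b))


def evolve(experts, new_basis, threshold):
    """Reuse the most similar expert if it is close enough, otherwise
    append a new expert. Returns the index of the expert that was used."""
    scores = [subspace_affinity(basis, new_basis) for basis in experts]
    if scores and max(scores) >= threshold:
        return scores.index(max(scores))
    experts.append(new_basis)
    return len(experts) - 1


xy_plane = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
same_plane = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0]]   # same subspace, other basis
z_axis = [[0.0, 0.0, 1.0]]                        # orthogonal subspace

experts = [xy_plane]
reused = evolve(experts, same_plane, threshold=0.8)   # matches expert 0
added = evolve(experts, z_axis, threshold=0.8)        # becomes expert 1
```

Because the score depends only on the spanned subspace and not on the particular basis, two task models that occupy the same directions are treated as redundant even when their raw parameters differ.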
Graph-to-Vision: Multi-graph Understanding and Reasoning using Vision-Language Models
Authors: Qihang Ai, Ruizhou Li, Menghui Wang, Haiyun Jiang
First: 2025-03-27T12:20:37+00:00 · Latest: 2026-04-24T09:18:36+00:00
Comments: 26 pages, 23 figures
Abstract
Recent advances in Vision-Language Models (VLMs) have shown promising capabilities in interpreting visualized graph data, offering a new perspective for graph-structured reasoning beyond traditional Graph Neural Networks (GNNs). However, existing studies focus primarily on single-graph reasoning, leaving the critical challenge of multi-graph joint reasoning underexplored. In this work, we introduce the first comprehensive benchmark designed to evaluate and enhance the multi-graph reasoning abilities of VLMs. Our benchmark covers four common graph types (knowledge graphs, flowcharts, mind maps, and route maps) and supports both homogeneous and heterogeneous graph groupings with tasks of increasing complexity. We evaluate several state-of-the-art VLMs under a multi-dimensional scoring framework that assesses graph parsing, reasoning consistency, and instruction-following accuracy. Additionally, we fine-tune multiple open-source models and observe consistent improvements, confirming the effectiveness of our dataset. This work provides a principled step toward advancing multi-graph understanding and reveals new opportunities for cross-modal graph intelligence.
OmniOVCD: Streamlining Open-Vocabulary Change Detection with SAM 3
Authors: Xu Zhang, Danyang Li, Yingjie Xia, Xiaohang Dong, Hualong Yu, Jianye Wang, Qicheng Li
First: 2026-01-20T12:25:41+00:00 · Latest: 2026-04-24T08:12:11+00:00
Abstract
Change Detection (CD) is a fundamental task in remote sensing that monitors the evolution of land cover over time. Open-Vocabulary Change Detection (OVCD) builds on it with a new requirement: reducing the reliance on predefined categories. Existing training-free OVCD methods mostly use CLIP to identify categories and need extra models such as DINO to extract features. However, combining different models often causes feature-matching problems and makes the system unstable. The recently introduced Segment Anything Model 3 (SAM 3) integrates segmentation and identification capabilities within one promptable model, which offers new possibilities for the OVCD task. In this paper, we propose OmniOVCD, a standalone framework designed for OVCD. By leveraging the decoupled output heads of SAM 3, we propose a Synergistic Fusion to Instance Decoupling (SFID) strategy. SFID first fuses the semantic, instance, and presence outputs of SAM 3 to construct land-cover masks, and then decomposes them into individual instance masks for change comparison. This design preserves high accuracy in category recognition and maintains instance-level consistency across images. As a result, the model can generate accurate change masks. Experiments on four public benchmarks (LEVIR-CD, WHU-CD, S2Looking, and SECOND) demonstrate SOTA performance, achieving IoU scores of 67.2, 66.5, 24.5, and 27.1 (class-average), respectively, surpassing all previous methods. The code is available at https://github.com/Erxucomeon/OmniOVCD.
Summary / 总结
OmniOVCD is a framework designed for open-vocabulary change detection (OVCD) in remote sensing. It leverages the Segment Anything Model 3 (SAM 3) to integrate segmentation and identification capabilities. The key method is a Synergistic Fusion to Instance Decoupling (SFID) strategy, which fuses semantic, instance, and presence outputs to construct land-cover masks and then decomposes them for change comparison. Experiments on four benchmarks show that OmniOVCD outperforms previous methods, achieving state-of-the-art IoU scores of 67.2, 66.5, 24.5, and 27.1 (class-average).
OmniOVCD 是一个用于开放词汇变化检测 (OVCD) 的框架,利用 Segment Anything Model 3 (SAM 3) 结合了分割和识别能力。关键方法是协同融合到实例解耦 (SFID) 策略,将语义、实例和存在输出融合以构建土地覆盖掩码,然后分解以进行变化比较。在四个基准上的实验表明,OmniOVCD 超过了先前的方法,分别实现了 67.2、66.5、24.5 和 27.1(类别平均)的最高 IoU 分数。
PreMoE: Proactive Inference for Efficient Mixture-of-Experts
Authors: Zehua Pei, Ying Zhang, Hui-Ling Zhen, Tao Yuan, Xianzhi Yu, Zhenhua Dong, Sinno Jialin Pan, Mingxuan Yuan, Bei Yu
First: 2025-05-23T08:59:16+00:00 · Latest: 2026-04-24T08:03:55+00:00
Abstract
Mixture-of-Experts (MoE) models offer dynamic computation, but are typically deployed as static full-capacity models, missing opportunities for deployment-specific specialization. We introduce PreMoE, a training-free framework that proactively compiles sparse MoE variants for targeted deployment scenarios. At its core is Predicted Expert Utility (PEU), a robust metric for estimating expert importance from router logits through high-confidence threshold filtering and logit transformation, which together stabilize utility estimation under aggressive sparsity. Using PEU scores computed on a small calibration set, PreMoE produces domain-aware expert rankings that can be used to compile either domain-specific specialists or high-efficiency multi-domain generalists, without any retraining. Across MoE models ranging from 30B to 718B parameters, PreMoE achieves up to 50% sparsity with nearly no performance loss. It further exposes a practical deployment trade-off: specialists maximize in-domain efficiency, while synthesized generalists retain broader cross-domain capability at the same sparsity budget.
中文标题/摘要
标题:PreMoE:主动推理以提高混合专家模型的效率
混合专家(MoE)模型提供了动态计算,但通常作为静态全容量模型部署,错过了针对特定部署场景的专业化机会。我们提出了PreMoE,这是一种无需训练的框架,可以主动编译针对特定部署场景的稀疏MoE变体。其核心是预测专家效用(PEU),这是一种通过高置信度阈值过滤和logit转换来估计专家重要性的稳健指标,两者共同在激进稀疏化下稳定效用估计。使用在小校准集上计算的PEU分数,PreMoE生成领域感知的专家排名,可以用于编译领域特定专家或高效率多领域通用专家,而无需任何重新训练。在从300亿到7180亿参数的MoE模型中,PreMoE在几乎无性能损失的情况下实现了高达50%的稀疏化。此外,它还揭示了一种实际部署权衡:专家在领域内效率最大化,而合成的通用专家在相同的稀疏预算下保留了更广泛的跨领域能力。
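To make the idea of ranking experts from router logits on a calibration set more concrete, here is a rough sketch. It is not the PEU definition from the paper, which the abstract does not spell out. The assumptions are: softmax stands in for the "logit transformation", tokens whose top routing probability falls below a cutoff are dropped as the "high-confidence threshold filtering", and utility is the sum of the remaining probabilities. The names `expert_utility` and `keep_experts` are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one token's router logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expert_utility(router_logits, confidence=0.5):
    """Accumulate per-expert routing probability over calibration tokens,
    skipping tokens whose routing decision is not confident (assumed rule)."""
    utility = [0.0] * len(router_logits[0])
    for logits in router_logits:
        probs = softmax(logits)
        if max(probs) < confidence:
            continue  # low-confidence token: ignored in the estimate
        for index, p in enumerate(probs):
            utility[index] += p
    return utility


def keep_experts(utility, sparsity):
    """Indices of the experts retained at the given sparsity level."""
    n_keep = max(1, round(len(utility) * (1.0 - sparsity)))
    ranked = sorted(range(len(utility)), key=lambda i: utility[i], reverse=True)
    return sorted(ranked[:n_keep])


# Four calibration tokens routed over four experts (made-up numbers).
calibration = [
    [4.0, 0.0, 0.0, 0.0],
    [0.0, 3.0, 0.0, 0.0],
    [0.1, 0.0, 0.2, 0.1],   # nearly uniform routing, filtered out
    [5.0, 1.0, 0.0, 0.0],
]
kept = keep_experts(expert_utility(calibration), sparsity=0.5)
```

Running the same ranking on calibration sets from different domains would yield different retained experts, which is the specialist-versus-generalist trade-off the abstract describes.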
DocPrune: Efficient Document Question Answering via Background, Question, and Comprehension-aware Token Pruning
Authors: Joonmyung Choi, Sanghyeok Lee, Jongha Kim, Sehyung Kim, Dohwan Ko, Jihyung Kil, Hyunwoo J. Kim
Venue: CVPR 2026
First: 2026-04-24T06:51:58+00:00 · Latest: 2026-04-24T06:51:58+00:00
Comments: CVPR 2026
Abstract
Recent advances in vision-language models have demonstrated remarkable performance across diverse multi-modal tasks, including document question answering that leverages structured visual cues from text, tables, and figures. However, unlike natural images, document images contain large backgrounds and only sparse supporting evidence, leading to the inefficient consumption of substantial computational resources, especially for long documents. We observe that existing token-reduction methods for natural images and videos fall short in utilizing the structural sparsity unique to documents. To address this, we propose DocPrune, a training-free and progressive document token pruning framework designed for efficient long-document understanding. The proposed method preserves only the essential tokens for the task while removing unnecessary ones, such as background or question-irrelevant tokens. Moreover, it automatically selects the appropriate layers to initiate token pruning based on the model's level of comprehension. Our experiments on the M3DocRAG show that DocPrune improves throughput by 3.0x and 3.3x in the encoder and decoder, respectively, while boosting the F1 score by +1.0, achieving both higher accuracy and efficiency without any additional training.
Summary / 总结
DocPrune is a training-free, progressive token pruning framework for efficient document question answering. Unlike existing token-reduction methods designed for natural images and videos, it exploits the structural sparsity of documents: it preserves the tokens essential to the task, removes background and question-irrelevant tokens, and automatically selects the layers at which to start pruning based on the model's level of comprehension. On M3DocRAG, DocPrune improves throughput by 3.0x in the encoder and 3.3x in the decoder while raising the F1 score by 1.0, achieving higher accuracy and efficiency without any additional training.
DocPrune 是一种无需训练且渐进的文档标记剪枝框架,选择性地移除背景和与问题无关的标记,保留关键标记。它在编码器和解码器中分别提高了3.0倍和3.3倍的吞吐量,并将F1分数提高了+1.0,从而在无需额外训练的情况下同时提高准确性和效率。
CAGE-SGG: Counterfactual Active Graph Evidence for Open-Vocabulary Scene Graph Generation
Authors: Suiyang Guang, Chenyu Liu, Ruohan Zhang, Siyuan Chen
First: 2026-04-24T06:34:45+00:00 · Latest: 2026-04-24T06:34:45+00:00
Abstract
Open-vocabulary scene graph generation (SGG) aims to describe visual scenes with flexible and fine-grained relation phrases beyond a fixed predicate vocabulary. While recent vision-language models greatly expand the semantic coverage of SGG, they also introduce a critical reliability issue: predicted relations may be driven by language priors or object co-occurrence rather than grounded visual evidence. In this paper, we propose an evidence-grounded open-vocabulary SGG framework based on counterfactual relation verification. Instead of directly accepting plausible relation proposals, our method verifies whether each candidate relation is supported by relation-specific visual, geometric, and contextual evidence. Specifically, we first generate open-vocabulary relation candidates with a vision-language proposer, then decompose predicate phrases into soft evidence bases such as support, contact, containment, depth, motion, and state. A relation-conditioned evidence encoder extracts predicate-relevant cues, while a counterfactual verifier tests whether the relation score decreases when necessary evidence is removed and remains stable under irrelevant perturbations. We further introduce contradiction-aware predicate learning and graph-level preference optimization to improve fine-grained discrimination and global graph consistency. Experiments on conventional, open-vocabulary, and panoptic SGG benchmarks show that our method consistently improves standard recall-based metrics, unseen predicate generalization, and counterfactual grounding quality. These results demonstrate that moving from relation generation to relation verification leads to more reliable, interpretable, and evidence-grounded scene graphs.
CharTide: Data-Centric Chart-to-Code Generation via Tri-Perspective Tuning and Inquiry-Driven Evolution
Authors: Xiangxi Zheng, Kuang He, Jiayi Hu, Ping Yu, Rui Yan, Yuan Yao, Peng Hou, Anxiang Zeng, Alex Jinpeng Wang
First: 2026-04-24T03:39:51+00:00 · Latest: 2026-04-24T03:39:51+00:00
Abstract
Chart-to-code generation demands strict visual precision and syntactic correctness from Vision-Language Models (VLMs). However, existing approaches are fundamentally constrained by data-centric limitations: despite the availability of growing chart-to-code datasets, simply scaling homogeneous chart-code pairs conflates visual perception with program logic, preventing models from fully leveraging the richness of multimodal supervision. We present CharTide, a novel data-centric framework that systematically redesigns both training and alignment data for chart-to-code generation. First, we construct a 2M-sample dataset via a Tri-Perspective Tuning strategy, explicitly decoupling training into visual perception, pure-text code logic, and modality fusion streams, enabling a 7B model to surpass specialized baselines using only supervised data. Second, we reformulate alignment as a data verification problem rather than a heuristic scoring task. To this end, we introduce an Inquiry-Driven RL framework grounded in the principle of information invariance: a downstream model should yield consistent answers to identical visual queries across both original and generated charts. Moving beyond rigid rule matching or VLM scoring, we employ a frozen Inspector to objectively verify generated charts through atomic QA tasks, providing verifiable reward signals based on answer accuracy. Experiments on ChartMimic, Plot2Code, and ChartX show that CharTide-7B/8B significantly outperforms open-source baselines, surpasses GPT-4o, and is competitive with GPT-5.
中文标题/摘要
标题:CharTide: 通过三视角调优和问题驱动演化实现数据为中心的图表到代码生成
图表到代码生成要求视觉精度和语法正确性严格一致。然而,现有方法在数据为中心的限制下根本受限:尽管存在不断增长的图表到代码数据集,简单地扩大同质图表代码对会混淆视觉感知与程序逻辑,阻止模型充分利用多模态监督的丰富性。我们提出了CharTide,一种新颖的数据为中心框架,系统地重新设计了图表到代码生成的训练和对齐数据。首先,我们通过三视角调优策略构建了一个200万样本的数据集,明确将训练拆分为视觉感知、纯文本代码逻辑和模态融合流,使7B模型仅使用监督数据就能超越专门的基础模型。其次,我们将对齐重新表述为数据验证问题,而非启发式评分任务。为此,我们引入了一种基于信息不变性原则的问题驱动RL框架:下游模型应对相同视觉查询在原始图表和生成图表中给出一致的答案。超越僵硬的规则匹配或VLM评分,我们采用一个冻结的检查员通过原子问答任务客观验证生成的图表,基于答案准确性提供可验证的奖励信号。在ChartMimic、Plot2Code和ChartX上的实验表明,CharTide-7B/8B显著优于开源基础模型,超越GPT-4o,并与GPT-5竞争。
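The information-invariance principle in the CharTide abstract (an inspector should give the same answers on the generated chart as on the original) suggests a reward of roughly the following shape. This is a simplified sketch under stated assumptions, not the paper's reward: the inspector is a hypothetical callable, charts are stood in for by plain dictionaries, and the reward is taken to be plain answer accuracy over the atomic questions.

```python
def inquiry_reward(questions, inspector, generated_chart):
    """Fraction of atomic questions for which the inspector's answer on the
    generated chart equals the reference answer from the original chart."""
    if not questions:
        return 0.0
    correct = sum(
        1 for question, reference in questions
        if inspector(generated_chart, question) == reference
    )
    return correct / len(questions)


def toy_inspector(chart, question):
    """Stand-in for a frozen VLM inspector: looks the answer up in a dict."""
    return chart.get(question)


# Reference answers obtained from the original chart (made-up values).
questions = [("title", "Sales"), ("bar_count", 5), ("x_label", "Year")]
# Properties of the chart rendered from the generated code (made-up values).
generated = {"title": "Sales", "bar_count": 4, "x_label": "Year"}

reward = inquiry_reward(questions, toy_inspector, generated)
```

Because the reward is computed from answer matches rather than from a holistic score, each point of reward can be traced back to a specific question that the generated chart did or did not preserve.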
Test-Time Matching: Unlocking Compositional Reasoning in Multimodal Models
Authors: Yinglun Zhu, Jiancheng Zhang, Fuzhi Tang
Venue: ICLR 2026
First: 2025-10-09T00:00:49+00:00 · Latest: 2026-04-24T03:12:09+00:00
Comments: To appear at ICLR 2026; extended results to generative multimodal models
Abstract
Frontier AI models have achieved remarkable progress, yet recent studies suggest they struggle with compositional reasoning, often performing at or below random chance on established benchmarks. We revisit this problem and show that widely used evaluation metrics systematically underestimate model capability. To correct this artifact, we introduce a group matching score that more faithfully evaluates model capability. Moreover, correctness under the new metric can be translated into correctness under existing metrics via a simple overfitting step. This adjustment enables SigLIP-B16 to surpass all previous results and GPT-4.1 to yield the first result surpassing estimated human performance on Winoground. Building on this insight, we propose Test-Time Matching (TTM), an iterative, self-improving algorithm that further bootstraps model performance without any external supervision. TTM delivers additional, non-trivial improvements: for example, TTM enables SigLIP-B16 to surpass GPT-4.1 on MMVP-VLM, establishing a new state of the art. TTM also extends beyond contrastive vision-language models, yielding clear gains on a generative multimodal model across benchmarks. Importantly, TTM remains broadly effective even on benchmarks without metric-induced effects or group structures, achieving relative gains up to 85.7% on challenging datasets such as WhatsUp. Across 16 dataset variants spanning diverse setups, our experiments demonstrate that TTM consistently improves model performance and advances the frontier of compositional reasoning.
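The claim that common metrics underestimate capability can be illustrated with a small sketch contrasting a strict pairwise group score with a matching-based score. The abstract does not define the group matching score, so the version below is one plausible reading and should be treated as an assumption: a group counts as correct when the ground-truth image-to-caption assignment has the highest total similarity among all assignments. The function names are hypothetical.

```python
from itertools import permutations


def group_score(sim):
    """Strict score: each image must prefer its own caption over every other
    caption, and each caption must prefer its own image over every other."""
    k = len(sim)
    rows_ok = all(sim[i][i] > sim[i][j]
                  for i in range(k) for j in range(k) if j != i)
    cols_ok = all(sim[j][j] > sim[i][j]
                  for i in range(k) for j in range(k) if j != i)
    return rows_ok and cols_ok


def group_matching_score(sim):
    """Matching-based score: the identity assignment must have the largest
    total similarity among all one-to-one assignments (assumed definition)."""
    k = len(sim)
    best = max(permutations(range(k)),
               key=lambda perm: sum(sim[i][perm[i]] for i in range(k)))
    return best == tuple(range(k))


# sim[i][j] = similarity of image i and caption j (made-up values).
# Image 1 scores high with both captions, which breaks the strict test.
sim = [[0.60, 0.50],
       [0.70, 0.65]]

strict = group_score(sim)             # image 1 prefers caption 0: fails
matched = group_matching_score(sim)   # 0.60 + 0.65 beats 0.50 + 0.70: passes
```

The example group fails the strict test because one image has uniformly inflated scores, yet the correct pairing is still the best joint assignment, which is the kind of hidden capability a matching-based metric can surface.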
Sum-of-Checks: Structured Reasoning for Surgical Safety with Large Vision-Language Models
Authors: Weiqiu You, Cassandra Goldberg, Amin Madani, Daniel A. Hashimoto, Eric Wong
First: 2026-04-24T02:07:23+00:00 · Latest: 2026-04-24T02:07:23+00:00
Comments: IPCAI 2026 short communication
Abstract
Purpose: Accurate assessment of the Critical View of Safety (CVS) during laparoscopic cholecystectomy is essential to prevent bile duct injury, a complication associated with significant morbidity and mortality. While large vision-language models (LVLMs) offer flexible reasoning, their predictions remain difficult to audit and unreliable on safety-critical surgical tasks. Methods: We introduce Sum-of-Checks, a framework that decomposes each CVS criterion into expert-defined reasoning checks reflecting clinically relevant visual evidence. Given a laparoscopic frame, an LVLM evaluates each check, producing a binary judgment and justification. Criterion-level scores are computed via fixed, weighted aggregation of check outcomes. We evaluate on the Endoscapes2023 benchmark using three frontier LVLMs, comparing against direct prompting, chain-of-thought, and sub-question decomposition, each with and without few-shot examples. Results: Sum-of-Checks improves average frame-level mean average precision by 12-14% relative to the best baseline across all three models and criteria. Analysis of individual checks reveals that LVLMs are reliable on observational checks (e.g., visibility, tool obstruction) but show substantial variability on decision-critical anatomical evidence. Conclusion: Structuring surgical reasoning into expert-aligned verification checks improves both accuracy and transparency of LVLM-based CVS assessment, demonstrating that explicitly separating evidence elicitation from decision-making is critical for reliable and auditable surgical AI systems. Code is available at https://github.com/BrachioLab/SumOfChecks.
Summary / 总结
The purpose of the study is to improve the accuracy and transparency of assessing the Critical View of Safety (CVS) during laparoscopic cholecystectomy using large vision-language models (LVLMs). The authors introduce Sum-of-Checks, a framework that breaks down each CVS criterion into expert-defined reasoning checks. The LVLM evaluates each check, producing a binary judgment and justification, which are then aggregated to compute criterion-level scores. The results show that Sum-of-Checks improves average frame-level mean average precision by 12-14% compared to the best baseline across three frontier LVLMs. The study also highlights that LVLMs are more reliable on observational checks but show variability on decision-critical anatomical evidence.
该研究旨在通过大型视觉-语言模型(LVLM)提高腹腔镜胆囊切除术中关键视野安全性(CVS)评估的准确性和透明度。作者引入了Sum-of-Checks框架,将每个CVS标准分解为专家定义的推理检查。LVLM评估每个检查,产生二元判断和解释。结果表明,Sum-of-Checks在三个LVLM中将平均帧级平均精度提高了12-14%,优于最佳基线。研究还指出,LVLM在观察性检查上更可靠,但在决策性解剖证据上表现出较大的变异性。
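The fixed, weighted aggregation of binary check outcomes can be sketched in a few lines. The check names and weights below are invented for illustration and are not the expert-defined checks from the paper; dividing by the total weight so that the score lies in [0, 1] is likewise an assumption.

```python
def criterion_score(check_results, weights):
    """Weighted fraction of passed checks for one criterion.

    check_results maps a check name to the model's binary judgment;
    weights maps the same names to fixed, hand-set weights."""
    total = sum(weights.values())
    passed = sum(w for name, w in weights.items() if check_results[name])
    return passed / total


# Hypothetical checks for a single criterion on one frame.
weights = {
    "structure_visible": 1.0,
    "no_tool_obstruction": 1.0,
    "only_two_structures_enter": 2.0,   # decision-critical, weighted higher
}
judgments = {
    "structure_visible": True,
    "no_tool_obstruction": True,
    "only_two_structures_enter": False,
}

score = criterion_score(judgments, weights)
```

Keeping the aggregation fixed rather than learned means every score can be audited by reading off which checks passed, which is the transparency argument made in the abstract.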
FLARE-BO: Fused Luminance and Adaptive Retinex Enhancement via Bayesian Optimisation for Low-Light Robotic Vision
Authors: Nathan Shankar, Pawel Ladosz, Hujun Yin
First: 2026-04-23T21:51:22+00:00 · Latest: 2026-04-23T21:51:22+00:00
Comments: 7 pages, 2 tables and 4 figures
Abstract
Reliable visual perception under low illumination remains a core challenge for autonomous robotic systems, where degraded image quality directly compromises navigation, inspection, and various operations. A recent training-free approach showed that Bayesian optimisation with Gaussian Processes can adaptively select brightness, contrast, and denoising parameters on a per-image basis, achieving competitive enhancement without any learned model. However, that framework is limited to three parameters, applies no illumination decomposition or white balance correction, and relies on Non-Local Means (NLM) denoising, which tends to over-smooth edges under noisy conditions. This paper proposes FLARE-BO (Fused Luminance and Adaptive Retinex Enhancement via Bayesian Optimisation), an extended framework that jointly optimises eight parameters spanning gamma correction, LIME-style illumination normalisation, chrominance denoising, bilateral filtering, NLM denoising, Grey-World automatic white balance, and adaptive post-smoothing. The search engine employs unit-hypercube parameter normalisation, objective standardisation, Sobol quasi-random initialisation, and Log Expected Improvement acquisition for principled exploration of the expanded space. The proposed method is benchmarked on the Low Light paired dataset (LOL), and the results show marked improvements over existing methods that were not specifically trained on this dataset.
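Two of the enhancement stages named in the abstract, gamma correction and Grey-World white balance, are standard operations and can be sketched in plain Python. This shows the textbook form of each step only: the order of stages, the parameter ranges, and how the Bayesian optimiser tunes them are not given in the abstract and are not reproduced here. Pixels are assumed to be RGB tuples with values in [0, 1], and the convention out = in ** (1 / gamma) is an assumption.

```python
def gamma_correct(pixels, gamma):
    """Apply out = in ** (1 / gamma) per channel; gamma > 1 brightens."""
    return [tuple(c ** (1.0 / gamma) for c in px) for px in pixels]


def grey_world(pixels):
    """Grey-World white balance: scale each channel so that its mean equals
    the mean over all three channels, clipping results to 1.0."""
    n = len(pixels)
    means = [sum(px[ch] for px in pixels) / n for ch in range(3)]
    grey = sum(means) / 3.0
    gains = [grey / m if m > 0 else 1.0 for m in means]
    return [tuple(min(1.0, px[ch] * gains[ch]) for ch in range(3))
            for px in pixels]


# A tiny dark image with a red cast (made-up values).
image = [(0.04, 0.02, 0.01), (0.16, 0.08, 0.04)]

brightened = gamma_correct(image, gamma=2.0)   # square root of each value
balanced = grey_world(image)                   # channel means equalised
```

In an optimisation loop of the kind the abstract describes, a parameter such as `gamma` would be one coordinate of the search space and an image-quality objective would be evaluated after the full pipeline has run.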
Decomposed Attention Fusion in MLLMs for Training-Free Video Reasoning Segmentation
Authors: Su Ho Han, Jeongseok Hyun, Pilhyeon Lee, Minho Shim, Dongyoon Wee, Seon Joo Kim
Venue: ICLR 2026
First: 2025-10-22T13:42:59+00:00 · Latest: 2026-04-23T17:59:56+00:00
Comments: Accepted to ICLR 2026. Code is available at https://github.com/HYUNJS/DecAF
Abstract
Multimodal large language models (MLLMs) demonstrate strong video understanding by attending to visual tokens relevant to textual queries. To directly adapt this for localization in a training-free manner, we cast video reasoning segmentation as a video QA task and extract attention maps via a rollout mechanism. However, raw attention maps are noisy and poorly aligned with object regions. We propose Decomposed Attention Fusion (DecAF), which refines these maps through two mechanisms: (1) contrastive object-background fusion and (2) complementary video-frame fusion. This method suppresses irrelevant activations and enhances object-focused cues, enabling direct conversion of attention maps into coarse segmentation masks. In addition, we introduce attention-guided SAM2 prompting for obtaining fine-grained masks. Unlike existing methods that jointly train MLLMs with SAM, our method operates entirely without retraining. DecAF outperforms training-free methods and achieves performance comparable to training-based methods on both referring and reasoning VOS benchmarks.
中文标题/摘要
标题:MLLMs中分解注意力融合在训练-free 视频推理分割中的应用
多模态大型语言模型(MLLMs)通过关注与文本查询相关的视觉标记来展示强大的视频理解能力。为了以训练-free 的方式直接适应这一能力进行定位,我们将视频推理分割重新定义为视频问答任务,并通过展开机制提取注意力图。然而,原始的注意力图是嘈杂的,并且与对象区域对齐不良。我们提出了分解注意力融合(DecAF),通过两种机制对这些图进行细化:(1)对比对象-背景融合和(2)互补视频帧融合。该方法抑制了无关激活,并增强了对象聚焦的线索,使注意力图可以直接转换为粗略的分割掩码。此外,我们引入了注意力引导的SAM2提示,以获取细粒度掩码。与现有方法联合训练MLLMs和SAM不同,我们的方法完全不需要重新训练。DecAF在训练-free 方法中表现出色,并在引用和推理VOS基准上达到了与训练基线方法相当的性能。
Summary / 总结
The research aims to leverage MLLMs for training-free video reasoning segmentation by refining noisy attention maps through Decomposed Attention Fusion (DecAF). DecAF uses contrastive object-background fusion and complementary video-frame fusion to suppress irrelevant activations and enhance object-focused cues, directly converting attention maps into coarse segmentation masks. The method outperforms existing training-free approaches and achieves performance comparable to training-based methods on referring and reasoning VOS benchmarks.
研究旨在通过分解注意力融合(DecAF)方法,利用MLLMs实现无需训练的视频推理分割。DecAF 通过对比对象背景融合和互补视频帧融合来抑制无关激活并增强对象聚焦线索,直接将注意力图转换为粗略的分割掩码。该方法优于现有无需训练的方法,并在引用和推理VOS基准上达到了与训练基线方法相当的性能。
When Prompts Override Vision: Prompt-Induced Hallucinations in LVLMs
Authors: Pegah Khayatan, Jayneel Parekh, Arnaud Dapogny, Mustafa Shukor, Alasdair Newson, Matthieu Cord
First: 2026-04-23T17:54:36+00:00 · Latest: 2026-04-23T17:54:36+00:00
Abstract
Despite impressive progress in capabilities of large vision-language models (LVLMs), these systems remain vulnerable to hallucinations, i.e., outputs that are not grounded in the visual input. Prior work has attributed hallucinations in LVLMs to factors such as limitations of the vision backbone or the dominance of the language component, yet the relative importance of these factors remains unclear. To resolve this ambiguity, we propose HalluScope, a benchmark to better understand the extent to which different factors induce hallucinations. Our analysis indicates that hallucinations largely stem from excessive reliance on textual priors and background knowledge, especially information introduced through textual instructions. To mitigate hallucinations induced by textual instruction priors, we propose HalluVL-DPO, a framework for fine-tuning off-the-shelf LVLMs towards more visually grounded responses. HalluVL-DPO leverages preference optimization using a curated training dataset that we construct, guiding the model to prefer grounded responses over hallucinated ones. We demonstrate that our optimized model effectively mitigates the targeted hallucination failure mode, while preserving or improving performance on other hallucination benchmarks and visual capability evaluations. To support reproducibility and further research, we will publicly release our evaluation benchmark, preference training dataset, and code at https://pegah-kh.github.io/projects/prompts-override-vision/.
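The abstract above describes preference optimization that favors grounded responses over hallucinated ones, and the name HalluVL-DPO suggests a DPO-style objective. The sketch below shows the standard per-pair DPO loss under that assumption; the function name, argument names, and the default `beta` are illustrative and not taken from the paper, which may use a different or modified objective.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one preference pair.

    Inputs are summed token log-probabilities of the chosen (grounded) and
    rejected (hallucinated) responses under the policy being trained and
    under a frozen reference model.
    """
    # How much more the policy prefers the chosen response than the reference does.
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)), computed in a numerically stable way.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

The loss falls as the policy raises the likelihood of the grounded response relative to the hallucinated one, measured against the reference model.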
Counterfactual Segmentation Reasoning: Diagnosing and Mitigating Pixel-Grounding Hallucination
Authors: Xinzhuo Li, Adheesh Juvekar, Jiaxun Zhang, Xingyou Liu, Muntasir Wahed, Kiet A. Nguyen, Yifan Shen, Tianjiao Yu, Ismini Lourentzou
First: 2025-06-26T17:59:12+00:00 · Latest: 2026-04-23T17:42:55+00:00
Comments: Project webpage: https://plan-lab.github.io/hallusegbench/
Abstract
Segmentation Vision-Language Models (VLMs) have significantly advanced grounded visual understanding, yet they remain prone to pixel-grounding hallucinations, producing masks for incorrect objects or for objects that are entirely absent. Existing evaluations rely almost entirely on text- or label-based perturbations, which check only whether the predicted mask matches the queried label. Such evaluations overlook the spatial footprint and severity of hallucination and therefore fail to reveal vision-driven hallucinations, which are more challenging and more prevalent. To address this gap, we formalize the task of Counterfactual Segmentation Reasoning (CSR), where a model must segment the referenced object in the factual image and abstain in its counterfactual counterpart. To support this task, we curate HalluSegBench, the first large-scale benchmark to diagnose referring and reasoning expression segmentation hallucinations using controlled visual counterfactuals, alongside new evaluation metrics that measure hallucination severity and disentangle vision- and language-driven failure modes. We further introduce RobustSeg, a segmentation VLM trained with counterfactual fine-tuning (CFT) to learn when to segment and when to abstain. Experimental results confirm RobustSeg reduces hallucinations by 30%, while improving segmentation performance on FP-RefCOCO(+/g).
Summary / 总结
The research addresses the issue of pixel-grounding hallucinations in Segmentation Vision-Language Models (VLMs) by introducing Counterfactual Segmentation Reasoning (CSR) and a new benchmark called HalluSegBench. The method involves training models to segment the correct object in the factual image and abstain in the counterfactual image. Experimental results show that the proposed RobustSeg model reduces hallucinations by 30% and improves segmentation performance on FP-RefCOCO(+/g).
研究通过引入Counterfactual Segmentation Reasoning (CSR) 和新的基准HalluSegBench,解决了Segmentation Vision-Language Models (VLMs) 中的像素定位幻觉问题。该方法要求模型在事实图像中正确分割目标物体,在反事实图像中则不进行分割。实验结果表明,提出的RobustSeg模型将幻觉减少了30%,并在FP-RefCOCO(+/g)上提高了分割性能。
Fake or Real, Can Robots Tell? Evaluating VLM Robustness to Domain Shift in Single-View Robotic Scene Understanding
Authors: Federico Tavella, Amber Drinkwater, Angelo Cangelosi
First: 2025-06-24T12:45:09+00:00 · Latest: 2026-04-23T17:05:26+00:00
Abstract
Robotic scene understanding increasingly relies on Vision-Language Models (VLMs) to generate natural language descriptions of the environment. In this work, we systematically evaluate single-view object captioning for tabletop scenes captured by a robotic manipulator, introducing a controlled physical domain shift that contrasts real-world tools with geometrically similar 3D-printed counterparts that differ in texture, colour, and material. We benchmark a suite of state-of-the-art, locally deployable VLMs across multiple metrics to assess semantic alignment and factual grounding. Our results demonstrate that while VLMs describe common real-world objects effectively, performance degrades markedly on 3D-printed items despite their structurally familiar forms. We further expose critical vulnerabilities in standard evaluation metrics, showing that some fail to detect domain shifts entirely or reward fluent but factually incorrect captions. These findings highlight the limitations of deploying foundation models for embodied agents and the need for more robust architectures and evaluation protocols in physical robotic applications.
From Codebooks to VLMs: Evaluating Automated Visual Discourse Analysis for Climate Change on Social Media
Authors: Katharina Prasse, Steffen Jung, Isaac Bravo, Stefanie Walter, Patrick Knab, Christian Bartelt, Margret Keuper
First: 2026-04-23T15:44:14+00:00 · Latest: 2026-04-23T15:44:14+00:00
Abstract
Social media platforms have become primary arenas for climate communication, generating millions of images and posts that - if systematically analysed - can reveal which communication strategies mobilise public concern and which fall flat. We aim to facilitate such research by analysing how computer vision methods can be used for social media discourse analysis. This analysis includes application-based taxonomy design, model selection, prompt engineering, and validation. We benchmark six promptable vision-language models and 15 zero-shot CLIP-like models on two datasets from X (formerly Twitter) - a 1,038-image expert-annotated set and a larger corpus of over 1.2 million images, with 50,000 labels manually validated - spanning five annotation dimensions: animal content, climate change consequences, climate action, image setting, and image type. Among the models benchmarked, Gemini-3.1-flash-lite outperforms all others across all super-categories and both datasets, while the gap to open-weight models of moderate size remains relatively small. Beyond instance-level metrics, we advocate for distributional evaluation: VLM predictions can reliably recover population level trends even when per-image accuracy is moderate, making them a viable starting point for discourse analysis at scale. We find that chain-of-thought reasoning reduces rather than improves performance, and that annotation dimension specific prompt design improves performance. We release tweet IDs and labels along with our code at https://github.com/KathPra/Codebooks2VLMs.git.
Summary / 总结
The study evaluates how computer vision methods can be used to analyze climate change discourse on social media. Six promptable vision-language models and 15 zero-shot CLIP-like models were benchmarked on two datasets from X (formerly Twitter), with Gemini-3.1-flash-lite outperforming all others while the gap to moderately sized open-weight models stays relatively small. The authors advocate distributional evaluation: VLM predictions can reliably recover population-level trends even when per-image accuracy is moderate. Prompts designed per annotation dimension improve performance, whereas chain-of-thought reasoning reduces it. Tweet IDs, labels, and code are released.
研究旨在评估计算机视觉方法如何用于分析社交媒体上的气候变化话语。六个可提示的视觉语言模型和十五个零样本CLIP模型被用于两个来自X(以前的Twitter)的数据集,结果表明Gemini-3.1-flash-lite在所有超类别和两个数据集上表现最佳。研究指出,分布式评估比单个图像的准确性更可靠,适用于大规模话语分析。提示设计的特定性可以提高性能,而链式推理反而会降低性能。研究通过发布推特ID和标签以及代码提供了有价值的研究资源。
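To make the distinction between instance-level and distributional evaluation concrete, here is a small sketch that compares per-image accuracy with how closely the predicted label shares match the true ones. The toy labels are invented, and total variation distance is an assumed choice of metric; the paper's actual distributional measure is not stated in the abstract.

```python
from collections import Counter


def label_shares(labels):
    """Return the fraction of items carrying each label."""
    counts = Counter(labels)
    return {label: count / len(labels) for label, count in counts.items()}


def total_variation(p, q):
    """Total variation distance between two label distributions (0 = identical)."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


def accuracy(truth, preds):
    """Fraction of items whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(truth, preds)) / len(truth)


# Invented toy data: many individual errors, yet identical label shares.
truth = ["animal", "protest", "animal", "chart", "protest", "animal"]
preds = ["protest", "animal", "animal", "chart", "animal", "protest"]

print(accuracy(truth, preds))
print(total_variation(label_shares(truth), label_shares(preds)))
```

Errors that cancel out across classes lower accuracy without distorting the aggregate shares, which is the situation in which population-level trends remain recoverable.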
MaskDiME: Adaptive Masked Diffusion for Precise and Efficient Visual Counterfactual Explanations
Authors: Changlu Guo, Anders Nymark Christensen, Anders Bjorholm Dahl, Morten Rieger Hannemose
First: 2026-02-21T10:53:50+00:00 · Latest: 2026-04-23T15:15:48+00:00
Comments: Accepted by CVPR2026
Abstract
Visual counterfactual explanations aim to reveal the minimal semantic modifications that can alter a model's prediction, providing causal and interpretable insights into deep neural networks. However, existing diffusion-based counterfactual generation methods are often computationally expensive, slow to sample, and imprecise in localizing the modified regions. To address these limitations, we propose MaskDiME, a simple, fast, yet effective diffusion framework that unifies semantic consistency and spatial precision through localized sampling. Our approach adaptively focuses on decision-relevant regions to achieve localized and semantically consistent counterfactual generation while preserving high image fidelity. Our training-free framework, MaskDiME, performs inference over 30x faster than the baseline and achieves comparable or state-of-the-art performance across five benchmark datasets spanning diverse visual domains, establishing a practical and generalizable solution for efficient counterfactual explanation.
中文标题/摘要
标题:MaskDiME:自适应掩码扩散以实现精确高效的视觉反事实解释
视觉反事实解释旨在揭示能够改变模型预测的最小语义修改,为深度神经网络提供因果和可解释的洞察。然而,现有的基于扩散的反事实生成方法通常计算成本高、采样速度慢且在局部修改区域定位方面不够精确。为了解决这些局限性,我们提出了一种名为MaskDiME的简单、快速且有效的扩散框架,通过局部采样统一语义一致性和空间精度。我们的方法适应性地关注决策相关区域,以实现局部和语义一致的反事实生成,同时保持高图像保真度。我们的无需训练框架MaskDiME在推理速度上比基线快30倍,并在五个涵盖不同视觉领域的基准数据集上实现了可比或最先进的性能,为高效的反事实解释提供了一种实用且可泛化的解决方案。
Summary / 总结
MaskDiME is designed to generate precise and efficient visual counterfactual explanations by addressing the computational inefficiency and imprecision of existing methods. It uses a localized sampling approach to focus on decision-relevant regions, ensuring both semantic consistency and spatial precision. Experiments show that the training-free MaskDiME runs inference over 30 times faster than the baseline while achieving comparable or state-of-the-art performance across five benchmark datasets.
MaskDiME旨在通过解决现有方法的计算效率低和精度差问题,生成精确且高效的视觉反事实解释。它采用局部采样方法,专注于决策相关区域,确保语义一致性和空间精度。实验表明,MaskDiME比基线快30倍,同时在五个涵盖不同视觉领域的基准数据集上达到了可比或最先进的性能。
Ramen: Robust Test-Time Adaptation of Vision-Language Models with Active Sample Selection
Authors: Wenxuan Bao, Yanjun Zhao, Xiyuan Yang, Jingrui He
Venue: CVPR 2026
First: 2026-04-23T14:33:27+00:00 · Latest: 2026-04-23T14:33:27+00:00
Comments: Accepted by CVPR 2026 (Findings Track)
Abstract
Pretrained vision-language models such as CLIP exhibit strong zero-shot generalization but remain sensitive to distribution shifts. Test-time adaptation adapts models during inference without access to source data or target labels, offering a practical way to handle such shifts. However, existing methods typically assume that test samples come from a single, consistent domain, while in practice, test data often include samples from mixed domains with distinct characteristics. Consequently, their performance degrades under mixed-domain settings. To address this, we present Ramen, a framework for robust test-time adaptation through active sample selection. For each incoming test sample, Ramen retrieves a customized batch of relevant samples from previously seen data based on two criteria: domain consistency, which ensures that adaptation focuses on data from similar domains, and prediction balance, which mitigates adaptation bias caused by skewed predictions. To improve efficiency, Ramen employs an embedding-gradient cache that stores the embeddings and sample-level gradients of past test images. The stored embeddings are used to retrieve relevant samples, and the corresponding gradients are aggregated for model updates, eliminating the need for any additional forward or backward passes. Our theoretical analysis provides insight into why the proposed adaptation mechanism is effective under mixed-domain shifts. Experiments on multiple image corruption and domain-shift benchmarks demonstrate that Ramen achieves strong and consistent performance, offering robust and efficient adaptation in complex mixed-domain scenarios. Our code is available at https://github.com/baowenxuan/Ramen.
Summary / 总结
Ramen is a framework for robust test-time adaptation of vision-language models, addressing the issue of performance degradation under mixed-domain settings. It uses active sample selection based on domain consistency and prediction balance to adapt models during inference. Ramen employs an embedding-gradient cache to store and reuse information, enhancing efficiency. Experiments show that Ramen performs strongly and consistently across various benchmarks, offering robust and efficient adaptation in complex mixed-domain scenarios.
Ramen 是一种针对混合域设置下性能下降问题的视觉-语言模型测试时自适应框架。它通过基于领域一致性和预测平衡的主动样本选择来适应模型。Ramen 使用嵌入-梯度缓存高效检索和更新相关样本,实现了多种基准测试中的强大且一致的性能。
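The embedding-gradient cache described in the Ramen abstract can be pictured with the minimal sketch below: past samples are stored as (embedding, gradient) pairs, the nearest ones to a new sample are retrieved by cosine similarity, and their stored gradients are averaged without any new forward or backward pass. The class and method names are hypothetical, cosine similarity stands in for the domain-consistency criterion, and the prediction-balance criterion is omitted entirely, so this is not the authors' implementation.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


class EmbeddingGradientCache:
    """Hypothetical cache of (embedding, per-sample gradient) pairs."""

    def __init__(self):
        self.entries = []

    def add(self, embedding, gradient):
        """Store one past test sample's embedding and its gradient."""
        self.entries.append((list(embedding), list(gradient)))

    def aggregate(self, query, k):
        """Average the stored gradients of the k entries most similar to `query`."""
        ranked = sorted(self.entries,
                        key=lambda entry: cosine(query, entry[0]),
                        reverse=True)[:k]
        if not ranked:
            return []
        dim = len(ranked[0][1])
        return [sum(grad[i] for _, grad in ranked) / len(ranked) for i in range(dim)]
```

A caller would apply the aggregated gradient as a single update step for the incoming sample; the learning rate and which parameters are updated are left unspecified here.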
Causal Disentanglement for Full-Reference Image Quality Assessment
Authors: Zhen Zhang, Jielei Chu, Tian Zhang, Weide Liu, Fengmao Lv, Tianrui Li, Jun Cheng, Yuming Fang
First: 2026-04-23T13:18:13+00:00 · Latest: 2026-04-23T13:18:13+00:00
Abstract
Existing deep network-based full-reference image quality assessment (FR-IQA) models typically work by performing pairwise comparisons of deep features from the reference and distorted images. In this paper, we approach this problem from a different perspective and propose a novel FR-IQA paradigm based on causal inference and decoupled representation learning. Unlike typical feature comparison-based FR-IQA models, our approach formulates degradation estimation as a causal disentanglement process guided by intervention on latent representations. We first decouple degradation and content representations by exploiting the content invariance between the reference and distorted images. Second, inspired by the human visual masking effect, we design a masking module to model the causal relationship between image content and degradation features, thereby extracting content-influenced degradation features from distorted images. Finally, quality scores are predicted from these degradation features using either supervised regression or label-free dimensionality reduction. Extensive experiments demonstrate that our method achieves highly competitive performance on standard IQA benchmarks across fully supervised, few-label, and label-free settings. Furthermore, we evaluate the approach on diverse non-standard natural image domains with scarce data, including underwater, radiographic, medical, neutron, and screen-content images. Benefiting from its ability to perform scenario-specific training and prediction without labeled IQA data, our method exhibits superior cross-domain generalization compared to existing training-free FR-IQA models.
中文标题/摘要
标题:因果分离在全参考图像质量评估中的应用
现有的基于深度网络的全参考图像质量评估(FR-IQA)模型通常通过比较参考图像和失真图像的深度特征来进行成对比较。本文从不同角度出发,提出了一种基于因果推理和解耦表示学习的新型FR-IQA范式。与典型的基于特征比较的FR-IQA模型不同,我们的方法将退化估计公式化为由潜在表示干预引导的因果分离过程。首先,通过利用参考图像和失真图像之间的内容不变性,我们解耦退化表示和内容表示。其次,受到人类视觉掩蔽效应的启发,我们设计了一个掩蔽模块来建模图像内容和退化特征之间的因果关系,从而从失真图像中提取受内容影响的退化特征。最后,我们使用监督回归或无标签降维从这些退化特征预测质量评分。大量实验表明,我们的方法在标准图像质量评估基准上实现了高度竞争力的性能,涵盖全监督、少量标签和无标签设置。此外,我们还在包括水下、放射学、医学、中子和屏幕内容图像在内的多种非标准自然图像领域进行了评估,这些领域数据稀缺。得益于其能够在没有标注图像质量评估数据的情况下进行场景特定的训练和预测的能力,我们的方法在跨域泛化方面优于现有的无训练FR-IQA模型。
Summary / 总结
This paper proposes a novel FR-IQA paradigm based on causal inference and decoupled representation learning. It decouples degradation and content representations by exploiting content invariance and designs a masking module to model the causal relationship between image content and degradation features. The method predicts quality scores using supervised regression or label-free dimensionality reduction. Experiments show highly competitive performance on standard IQA benchmarks and superior cross-domain generalization for non-standard natural image domains.
本文提出了一种基于因果推理和解耦表示学习的FR-IQA新范式。通过利用内容不变性来解耦退化和内容表示,并设计了一个遮罩模块来建模图像内容和退化特征之间的因果关系。该方法使用监督回归或无标签降维来预测质量评分。实验表明,该方法在标准IQA基准上表现出色,并且在非标准自然图像域(如水下、放射学、医学、中子和屏幕内容图像)中具有优越的跨域泛化能力。
LiveVLM: Efficient Online Video Understanding via Streaming-Oriented KV Cache and Retrieval
Authors: Zhenyu Ning, Guangda Liu, Qihao Jin, Chengwei Li, Wenchao Ding, Minyi Guo, Jieru Zhao
Venue: 63rd ACM/IEEE Design Automation Conference (DAC '26), July 2026
First: 2025-05-21T08:47:15+00:00 · Latest: 2026-04-23T12:54:38+00:00
Comments: Accepted by DAC'26
Abstract
Recent developments in Video Large Language Models (Video LLMs) have enabled models to process hour-long videos and exhibit exceptional performance. Nonetheless, the Key-Value (KV) cache expands linearly over time, leading to substantial memory overhead and response delay, both critical challenges in various real-world online applications such as Deepseek services, autonomous driving, and robotics. To mitigate these issues, we propose LiveVLM, a training-free and query-agnostic framework specifically designed for online video understanding and real-time interaction. LiveVLM employs a Vision Sink Bucketing (VSB) mechanism to process video streams in real time, retain long-term video details and eliminate redundant KVs. This mechanism utilizes vision-to-vision attention scores as the metric and seeks to maximize the coverage of contextual information during compression. Noting that KV cache compressed in a query-agnostic manner inevitably retains irrelevant information for specific queries, LiveVLM incorporates a Position-agnostic KV Retrieval (PaR) mechanism to reduce interference from redundant context. The key point of PaR lies in decoupling positional embeddings to enhance the similarity between key tensors, thereby supporting efficient retrieval at the granularity of pages. Extensive experiments demonstrate that LiveVLM enables the foundation LLaVA-OneVision model to achieve state-of-the-art accuracy among both training-free query-agnostic methods and training-based online models.
Summary / 总结
LiveVLM is a training-free and query-agnostic framework designed to address memory overhead and response delay in online video understanding. It uses a Vision Sink Bucketing (VSB) mechanism to process video streams in real time while compressing the KV cache, and a Position-agnostic KV Retrieval (PaR) mechanism to reduce interference from context that is irrelevant to a given query. Experiments show that LiveVLM enables the LLaVA-OneVision model to achieve state-of-the-art accuracy among both training-free query-agnostic methods and training-based online models.
LiveVLM 是一个无需训练且查询无关的框架,旨在解决在线视频理解中的内存开销和响应延迟问题。它使用 Vision Sink Bucketing (VSB) 机制实时处理视频流,并使用 Position-agnostic KV Retrieval (PaR) 机制减少无关信息。实验表明,LiveVLM 提升了 LLaVA-OneVision 模型的准确性,实现了在训练无关方法和训练基方法中的最佳性能。
Reasoning on the Manifold: Bidirectional Consistency for Self-Verification in Diffusion Language Models
Authors: Jiaoyang Ruan, Xin Gao, Yinda Chen, Hengyu Zeng, Liang Du, Guanghao Li, Jie Fu, Jian Pu
First: 2026-04-17T10:17:16+00:00 · Latest: 2026-04-23T12:41:25+00:00
Comments: 30 pages, 5 figures
Abstract
While Diffusion Large Language Models (dLLMs) offer structural advantages for global planning, efficiently verifying that they arrive at correct answers via valid reasoning traces remains a critical challenge. In this work, we propose a geometric perspective: Reasoning on the Manifold. We hypothesize that valid generation trajectories reside as stable attractors on the high-density manifold of the learned distribution, whereas invalid paths exhibit off-manifold drift. To operationalize this, we introduce Bidirectional Manifold Consistency (BMC), a training-free, unsupervised metric that quantifies the stability of the generated sequence through a forward-masking and backward-reconstruction cycle. Empirically, we demonstrate BMC's versatility across the full reasoning lifecycle: (1) in Diagnosis, it serves as a robust discriminator of solution validity without ground truth answer; (2) in Inference, it enables rejection resampling to effectively concentrate computational resources on complex reasoning tasks; and (3) in Alignment, it functions as a dense geometric reward that transforms sparse outcome supervision into fine-grained guidance, empowering models to self-evolve beyond standard baselines. Our results establish intrinsic geometric stability as a robust indicator of correctness for dLLMs.
Summary / 总结
This paper addresses the challenge of verifying the correctness of answers generated by Diffusion Large Language Models (dLLMs) through a geometric perspective called Reasoning on the Manifold. It introduces Bidirectional Manifold Consistency (BMC), a training-free, unsupervised metric that quantifies the stability of a generated sequence through a forward-masking and backward-reconstruction cycle. The study shows that BMC can be used in diagnosis (judging solution validity without ground-truth answers), inference (rejection resampling), and alignment (as a dense reward), supporting intrinsic geometric stability as an indicator of correctness for dLLMs.
本文通过几何视角“Reasoning on the Manifold”解决了验证扩散大型语言模型(dLLMs)生成答案正确性的挑战。提出了双向流形一致性(Bidirectional Manifold Consistency, BMC),这是一种无监督的度量方法,通过前向掩码和后向重建比较来评估生成序列的稳定性。研究显示BMC可以在诊断、推理和对齐中使用,证明了其在各种推理任务中确保dLLMs正确性的有效性。
Process Supervision via Verbal Critique Improves Reasoning in Large Language Models
Authors: Hao-Yuan Chen
First: 2026-04-23T12:36:12+00:00 · Latest: 2026-04-23T12:36:12+00:00
Abstract
Inference-time scaling for LLM reasoning has focused on three axes: chain depth, sample breadth, and learned step-scorers (PRMs). We introduce a fourth axis, granularity of external verbal supervision, via Verbal Process Supervision (VPS), a training-free framework that uses structured natural-language critique from a stronger supervisor to guide an iterative generate-critique-refine loop up to a round budget R. Across GPQA Diamond, AIME 2025, and LiveCodeBench V6 (covering both closed and open models), VPS yields three key results. First, on GPQA Diamond, GPT-5.4 (High) | GPT-5.4 (Low) reaches 94.9% at R=4, surpassing the 94.1% state of the art without gradient updates. Second, on AIME 2025, VPS enables strong weak-actor rescue, boosting scores from 11.7-26.7% to 63.3-90.0% (up to +63.3 points). Third, at matched compute, VPS outperforms Reflexion by +8.5 to +12.1 points and Self-Consistency@5 by +5.0 pp (GPQA) and +8.3 pp (LiveCodeBench), isolating critique granularity as the key driver. Performance scales with the supervisor-actor capability gap (Pearson r=0.90) and degrades when errors are not linguistically expressible (e.g., code synthesis), motivating hybrid verbal-executable methods. These results establish critique granularity as a new axis of inference-time scaling.
Summary / 总结
The paper introduces Verbal Process Supervision (VPS), a framework that uses structured natural-language critique to guide an iterative generate-critique-refine loop in large language models. Across various benchmarks, VPS significantly improves model performance. On GPQA Diamond, GPT-5.4 (High) | GPT-5.4 (Low) reached 94.9% at R=4, surpassing the state of the art. On AIME 2025, VPS boosted scores from 11.7-26.7% to 63.3-90.0%, and at matched compute, VPS outperformed Reflexion and Self-Consistency@5 by 8.5 to 12.1 points and 5.0 to 8.3 points, respectively. Performance scales with the supervisor-actor capability gap and degrades when errors are not linguistically expressible, suggesting the need for hybrid verbal-executable methods.
论文引入了Verbal Process Supervision (VPS)框架,该框架通过结构化的自然语言批评来引导生成-批评-修正的迭代循环。在各种基准测试中,VPS显著提高了模型性能。在GPQA Diamond上,GPT-5.4 (High) | GPT-5.4 (Low)在R=4时达到了94.9%,超过了当前最佳水平。在AIME 2025上,VPS将得分从11.7-26.7%提升到63.3-90.0%。在匹配计算资源的情况下,VPS分别比Reflexion和Self-Consistency@5高出8.5到12.1分和5.0到8.3分。性能随着监督者-执行者能力差距的增大而提升,并在错误无法用语言表达时下降,这表明需要结合语言执行的方法。
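The generate-critique-refine loop with a round budget R described in the VPS abstract can be outlined as below. This is a schematic sketch, not the authors' code: `actor` and `supervisor` are hypothetical callables standing in for the weaker and stronger models, and stopping early when the supervisor returns no critique is an assumption.

```python
def verbal_process_supervision(problem, actor, supervisor, rounds):
    """Run a generate-critique-refine loop for at most `rounds` critique rounds.

    actor(problem, previous_answer, critique) -> answer
    supervisor(problem, answer) -> critique text, or None if nothing to fix
    """
    # Initial generation: no previous answer and no critique yet.
    answer = actor(problem, None, None)
    for _ in range(rounds):
        critique = supervisor(problem, answer)
        if critique is None:
            # Assumed stopping rule: the supervisor has no remaining objection.
            break
        # Refinement step conditioned on the supervisor's verbal critique.
        answer = actor(problem, answer, critique)
    return answer


# Toy stand-ins for the two models: the "problem" is a target integer.
def toy_actor(problem, previous, critique):
    if previous is None:
        return 0
    return previous + 1 if critique == "too low" else previous - 1


def toy_supervisor(problem, answer):
    if answer == problem:
        return None
    return "too low" if answer < problem else "too high"


print(verbal_process_supervision(3, toy_actor, toy_supervisor, rounds=4))
```

In a real setting both callables would wrap model calls, and the critique would be structured natural language rather than a two-word hint.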
Musical Score Understanding Benchmark: Evaluating Large Language Models' Comprehension of Complete Musical Scores
Authors: Congren Dai, Yue Yang, Krinos Li, Huichi Zhou, Shijie Liang, Bo Zhang, Enyang Liu, Ge Jin, Hongran An, Haosen Zhang, Peiyuan Jing, Kinhei Lee, Zhenxuan Zhang, Xiaobing Li, Maosong Sun
Venue: ACL 2026
First: 2025-11-24T06:40:38+00:00 · Latest: 2026-04-23T11:52:26+00:00
Comments: Accepted to ACL 2026 Main Conference
Abstract
Understanding complete musical scores entails integrated reasoning over pitch, rhythm, harmony, and large-scale structure, yet the ability of Large Language Models and Vision-Language Models to interpret full musical notation remains insufficiently examined. We introduce Musical Score Understanding Benchmark (MSU-Bench), a human-curated benchmark for score-level musical understanding across textual (ABC notation) and visual (PDF) modalities. MSU-Bench contains 1,800 generative question-answer pairs from works by Bach, Beethoven, Chopin, Debussy, and others, organised into four levels of increasing difficulty, ranging from onset information to texture and form. Evaluations of more than fifteen state-of-the-art models, in both zero-shot and fine-tuned settings, reveal pronounced modality gaps, unstable level-wise performance, and challenges in maintaining multilevel correctness. Fine-tuning substantially improves results across modalities while preserving general knowledge, positioning MSU-Bench as a robust foundation for future research in multimodal reasoning. The benchmark and code are available at https://github.com/Congren-Dai/MSU-Bench.
中文标题/摘要
标题:音乐谱理解基准:评估大型语言模型对完整音乐谱的理解能力
理解完整的音乐谱需要综合推理音高、节奏、和声和大尺度结构,然而大型语言模型和视觉-语言模型对完整音乐记谱符号的解释能力仍缺乏充分的考察。我们引入了音乐谱理解基准(MSU-Bench),这是一个由人类编纂的基准,涵盖了文本(ABC 符号)和视觉(PDF)模态下的谱级音乐理解。MSU-Bench 包含来自巴赫、贝多芬、肖邦、德彪西等作曲家的1,800个生成性问题-答案对,按难度分为四个级别,从起始信息到织体和结构。超过十五个最先进的模型在零样本和微调设置下的评估显示了模态间的显著差距、不稳定级别的表现以及多级正确性的挑战。微调在模态间显著提高了结果,同时保留了通用知识,使MSU-Bench 成为未来多模态推理研究的坚实基础。基准和代码可在 https://github.com/Congren-Dai/MSU-Bench 获取。
Component-Based Out-of-Distribution Detection
Authors: Wenrui Liu, Hong Chang, Ruibing Hou, Shiguang Shan, Xilin Chen
First: 2026-04-23T11:19:39+00:00 · Latest: 2026-04-23T11:19:39+00:00
Abstract
Out-of-Distribution (OOD) detection requires sensitivity to subtle shifts without overreacting to natural In-Distribution (ID) diversity. However, from the viewpoint of detection granularity, global representations inevitably suppress local OOD cues, while patch-based methods are unstable due to entangled spurious correlations and noise. Moreover, neither is effective at detecting compositional OODs composed of valid ID components. Inspired by recognition-by-components theory, we present a training-free Component-Based OOD Detection (CoOD) framework that addresses the existing limitations by decomposing inputs into functional components. To instantiate CoOD, we derive Component Shift Score (CSS) to detect local appearance shifts, and Compositional Consistency Score (CCS) to identify cross-component compositional inconsistencies. Empirically, CoOD achieves consistent improvements on both coarse- and fine-grained OOD detection.
中文标题/摘要
标题:基于组件的离分布检测
离分布(OOD)检测需要对细微变化敏感而不对自然在分布(ID)多样性过度反应。然而,从检测粒度的角度来看,全局表示不可避免地抑制了局部OOD线索,而基于补丁的方法由于纠缠的虚假相关性和噪声不稳定。而且,它们在检测由有效ID组件组成的组合OOD方面也不有效。受组件识别理论的启发,我们提出了一种无需训练的基于组件的OOD检测(CoOD)框架,通过将输入分解为功能组件来解决现有局限性。为了实现CoOD,我们推导出组件偏移分数(CSS)来检测局部外观变化,并使用组成一致性分数(CCS)来识别跨组件的组成不一致性。实验上,CoOD在粗粒度和细粒度OOD检测中均实现了持续改进。
Seeing Isn't Believing: Uncovering Blind Spots in Evaluator Vision-Language Models
Authors: Mohammed Safi Ur Rahman Khan, Sanjay Suryanarayanan, Tushar Anand, Mitesh M. Khapra
First: 2026-04-23T10:36:50+00:00 · Latest: 2026-04-23T10:36:50+00:00
Abstract
Large Vision-Language Models (VLMs) are increasingly used to evaluate outputs of other models, for image-to-text (I2T) tasks such as visual question answering, and text-to-image (T2I) generation tasks. Despite this growing reliance, the reliability of these Evaluator VLMs remains underexplored. In this work, we systematically evaluate the reliability of Evaluator VLMs across both I2T and T2I tasks. We introduce targeted perturbations that degrade output quality along key error dimensions, including object hallucinations, spatial reasoning, factual grounding, and visual fidelity. These perturbations test whether Evaluator VLMs can reliably account for these quality-degrading errors in their evaluations. Using a comprehensive benchmark of over 4000 perturbed instances spanning 40 perturbation dimensions, we evaluate 4 prominent VLMs using single-answer scoring, pairwise comparison, and reference-guided paradigms. Our findings reveal that current VLM evaluators exhibit substantial blind spots: they often fail to detect perturbed outputs, with failure rates exceeding 50% in some cases; they struggle particularly with fine-grained compositional and spatial errors; and they are often insensitive to hallucinated content that contradicts the input image. Pairwise comparison proves more reliable, though failure rates persist. These results highlight the unreliable nature of current Evaluator VLMs and urge caution in their deployment for benchmarking and development decisions. Code and data have been made publicly available.
中文标题/摘要
标题:看而不信:揭示评估者视觉-语言模型的盲点
大型视觉-语言模型(VLMs)越来越多地用于评估其他模型的输出,特别是在图像到文本(I2T)任务如视觉问答和文本到图像(T2I)生成任务中。尽管依赖程度不断增加,但这些评估者VLMs的可靠性仍鲜有研究。在本研究中,我们系统地评估了评估者VLMs在I2T和T2I任务中的可靠性。我们引入了针对性的扰动,这些扰动在关键错误维度上降低了输出质量,包括物体幻觉、空间推理、事实基础和视觉保真度。这些扰动测试了评估者VLMs是否能够可靠地在其评估中考虑到这些质量降低的错误。使用涵盖4000多个扰动实例和40个扰动维度的综合基准,我们使用单答案评分、成对比较和参考引导的方法评估了4个主要的VLMs。我们的研究发现表明,当前的VLM评估器存在显著的盲点:它们经常无法检测到扰动输出,在某些情况下超过50%;特别难以处理细粒度的组合和空间错误;并且对与输入图像相矛盾的幻觉内容往往不够敏感。成对比较证明更可靠,但失败率仍然存在。这些结果突显了当前评估者VLMs的不可靠性,并要求在基准测试和开发决策中谨慎使用。代码和数据已公开。
Summary / 总结
This work evaluates the reliability of Evaluator Vision-Language Models (VLMs) in image-to-text (I2T) and text-to-image (T2I) tasks by introducing targeted perturbations that degrade output quality. Using a comprehensive benchmark of over 4000 perturbed instances, the study finds that current VLM evaluators often fail to detect perturbed outputs, especially for fine-grained compositional and spatial errors, and are insensitive to hallucinated content. Pairwise comparison proves more reliable but still has failure rates. The results suggest that current Evaluator VLMs have significant blind spots and should be used with caution in benchmarking and development decisions.
研究通过引入针对性的扰动来评估Vision-Language模型(VLM)在图像到文本(I2T)和文本到图像(T2I)任务中的可靠性,这些扰动会降低输出质量。使用超过4000个扰动实例的综合基准,研究发现当前的VLM评估器经常无法检测到扰动输出,特别是在细粒度的组合和空间错误方面表现不佳,并且对与输入图像矛盾的虚构内容反应迟钝。成对比较虽然更可靠,但仍存在失败率。研究结果表明,当前的VLM评估器存在显著的盲点,应在基准测试和开发决策中谨慎使用。
PosterForest: Hierarchical Multi-Agent Collaboration for Scientific Poster Generation
Authors: Jiho Choi, Seojeong Park, Seongjong Song, Hyunjung Shim
Venue: ACL 2026
First: 2025-08-29T15:36:06+00:00 · Latest: 2026-04-23T09:35:10+00:00
Comments: ACL 2026
Abstract
Automating scientific poster generation requires hierarchical document understanding and coherent content-layout planning. Existing methods often rely on flat summarization or optimize content and layout separately. As a result, they often suffer from information loss, weak logical flow, and poor visual balance. We present PosterForest, a training-free framework for scientific poster generation. Our method introduces the Poster Tree, a structured intermediate representation that captures document hierarchy and visual-textual semantics across multiple levels. Building on this representation, content and layout agents perform hierarchical reasoning and recursive refinement, progressively optimizing the poster from global organization to local composition. This joint optimization improves semantic coherence, logical flow, and visual harmony. Experiments show that PosterForest outperforms prior methods in both automatic and human evaluations, without additional training or domain-specific supervision.
中文标题/摘要
标题:PosterForest:科学海报生成的分层多智能体协作
自动化科学海报生成需要分层文档理解和连贯的内容-布局规划。现有方法通常依赖于平面总结或分别优化内容和布局,因此往往存在信息丢失、逻辑流程弱和视觉平衡差的问题。我们提出了PosterForest,一种无需训练的科学海报生成框架。我们的方法引入了Poster树,这是一种结构化的中间表示,能够捕捉多个层次上的文档层次和视觉-文本语义。基于这种表示,内容和布局代理进行分层推理和递归细化,逐步从全局组织到局部组成优化海报。这种联合优化提高了语义连贯性、逻辑流程和视觉和谐性。实验表明,PosterForest在自动和人工评估中均优于先前的方法,无需额外训练或领域特定监督。
Summary / 总结
The research aims to automate scientific poster generation, which requires hierarchical document understanding and coherent content-layout planning. The method introduces the Poster Tree, a structured intermediate representation that captures document hierarchy and visual-textual semantics across multiple levels. Content and layout agents use this representation for hierarchical reasoning and recursive refinement, optimizing the poster from global organization to local composition. Experiments show that PosterForest outperforms prior methods in both automatic and human evaluations, without additional training or domain-specific supervision.
研究旨在提高科学海报的层次理解和连贯规划。PosterForest 使用 Poster 树作为中间结构化表示,捕捉文档层次和视觉-文本语义。内容和布局代理执行层次推理和递归细化,从全局组织到局部组成优化海报。该方法在自动和人工评估中均优于先前的方法,无需额外训练或领域特定监督。
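To make the idea of a hierarchical intermediate representation concrete, here is a minimal sketch of a tree whose nodes mirror a paper's sections, with layout area split top-down from global organization to local composition. It is an illustration only, not PosterForest's implementation: the `PosterNode` class, the `weight` heuristic (word count plus a fixed bonus per figure), and the proportional `allocate` rule are all assumptions, and the agent-based recursive refinement described in the abstract is not modeled.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class PosterNode:
    """Hypothetical Poster Tree node: one section of the source paper."""
    title: str
    text: str = ""
    n_figures: int = 0
    children: List["PosterNode"] = field(default_factory=list)

    def weight(self) -> float:
        # Assumed content weight: word count plus a fixed bonus per figure,
        # accumulated over the whole subtree.
        own = len(self.text.split()) + 50 * self.n_figures
        return own + sum(child.weight() for child in self.children)


def allocate(node: PosterNode, area: float, depth: int = 0) -> List[Tuple[int, str, float]]:
    """Split `area` top-down; each child gets a share proportional to its weight."""
    rows = [(depth, node.title, area)]
    total = sum(child.weight() for child in node.children)
    for child in node.children:
        share = area * child.weight() / total if total else 0.0
        rows.extend(allocate(child, share, depth + 1))
    return rows


if __name__ == "__main__":
    paper = PosterNode("Paper", children=[
        PosterNode("Method", text="tree agents refinement", n_figures=1),
        PosterNode("Results", text="automatic and human evaluation"),
    ])
    for depth, title, area in allocate(paper, 1.0):
        print("  " * depth + f"{title}: {area:.2f}")
```

A proportional split is only one possible rule; it is used here because it keeps sibling areas summing to the parent's area at every level of the tree.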
Instance-level Visual Active Tracking with Occlusion-Aware Planning
Authors: Haowei Sun, Kai Zhou, Hao Gao, Shiteng Zhang, Jinwu Hu, Xutao Wen, Qixiang Ye, Mingkui Tan
Venue: CVPR 2026 Poster
First: 2026-04-23T09:11:50+00:00 · Latest: 2026-04-23T09:11:50+00:00
Comments: CVPR 2026 Poster
Abstract
Visual Active Tracking (VAT) aims to control cameras to follow a target in 3D space, which is critical for applications like drone navigation and security surveillance. However, it faces two key bottlenecks in real-world deployment: confusion from visually similar distractors caused by insufficient instance-level discrimination and severe failure under occlusions due to the absence of active planning. To address these, we propose OA-VAT, a unified pipeline with three complementary modules. First, a training-free Instance-Aware Offline Prototype Initialization aggregates multi-view augmented features via DINOv3 to construct discriminative instance prototypes, mitigating distractor confusion. Second, an Online Prototype Enhancement Tracker enhances prototypes online and integrates a confidence-aware Kalman filter for stable tracking under appearance and motion changes. Third, an Occlusion-Aware Trajectory Planner, trained on our new Planning-20k dataset, uses conditional diffusion to generate obstacle-avoiding paths for occlusion recovery. Experiments demonstrate OA-VAT achieves 0.93 average SR on UnrealCV (+2.2% vs. SOTA TrackVLA), 90.8% average CAR on real-world datasets (+12.1% vs. SOTA GC-VAT), and 81.6% TSR on a DJI Tello drone. Running at 35 FPS on an RTX 3090, it delivers robust, real-time performance for practical deployment.
中文标题/摘要
标题:基于实例的视觉主动跟踪与遮挡感知规划
视觉主动跟踪(VAT)旨在控制相机在三维空间内跟随目标,对于无人机导航和安全监控等应用至关重要。然而,实际部署中面临两个关键瓶颈:由于实例级区分不足导致的视觉相似干扰物混淆,以及由于缺乏主动规划而导致的严重遮挡失效。为了解决这些问题,我们提出了OA-VAT,这是一种统一的管道,包含三个互补模块。首先,一种无需训练的实例感知离线原型初始化模块通过DINOv3聚合多视角增强特征,构建区分性实例原型,减轻干扰物混淆。其次,一种在线原型增强跟踪器在线增强原型,并结合一种基于置信度的卡尔曼滤波器,以应对外观和运动变化下的稳定跟踪。第三,一种遮挡感知轨迹规划器,基于我们新构建的Planning-20k数据集进行训练,使用条件扩散生成避障路径,以恢复遮挡。实验表明,OA-VAT在UnrealCV上实现了0.93的平均SR(比SOTA TrackVLA高2.2%),在真实世界数据集上实现了90.8%的平均CAR(比SOTA GC-VAT高12.1%),在DJI Tello无人机上实现了81.6%的TSR。在RTX 3090上运行速度为35 FPS,实现了稳健的实时性能,适用于实际部署。
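The abstract mentions a confidence-aware Kalman filter without giving its formulation. One common way to make a Kalman filter confidence-aware is to inflate the measurement noise when the detector's confidence is low, so the filter leans on its prediction instead of an unreliable observation. The sketch below shows that idea for a single coordinate. It is an assumption-laden illustration, not OA-VAT's method: the class name `ConfidenceKalman1D`, the random-walk motion model, and the rule `r = r0 / conf` are all hypothetical choices.

```python
from dataclasses import dataclass


@dataclass
class ConfidenceKalman1D:
    """Scalar Kalman filter with a random-walk motion model (illustrative only)."""
    x: float = 0.0    # state estimate, e.g. target centre along one image axis
    p: float = 1.0    # variance of the estimate
    q: float = 0.01   # process noise variance (assumed value)
    r0: float = 0.1   # measurement noise variance at confidence 1.0 (assumed value)

    def step(self, z: float, conf: float) -> float:
        # Predict: the state is carried over and its uncertainty grows.
        self.p += self.q
        # Confidence-aware update: low confidence inflates the measurement noise,
        # which shrinks the Kalman gain and damps the correction.
        r = self.r0 / max(conf, 1e-6)
        k = self.p / (self.p + r)
        self.x += k * (z - self.x)
        self.p *= 1.0 - k
        return self.x


if __name__ == "__main__":
    confident, doubtful = ConfidenceKalman1D(), ConfidenceKalman1D()
    print(confident.step(10.0, conf=1.0))   # moves most of the way to the measurement
    print(doubtful.step(10.0, conf=0.01))   # barely moves
```

With the same measurement, the high-confidence filter moves most of the way toward it while the low-confidence filter stays close to its prior, which is the behaviour wanted during partial occlusion or when a distractor is detected.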