CoME-VL: Scaling Complementary Multi-Encoder Vision-Language Learning
Authors: Ankan Deria, Komal Kumar, Xilin He, Imran Razzak, Hisham Cholakkal, Fahad Shahbaz Khan, Salman Khan
First: 2026-04-03T17:59:51+00:00 · Latest: 2026-04-03T17:59:51+00:00
Comments: 16 pages, 10 figures, 5 tables
Abstract
Recent vision-language models (VLMs) typically rely on a single vision encoder trained with contrastive image-text objectives, such as CLIP-style pretraining. While contrastive encoders are effective for cross-modal alignment and retrieval, self-supervised visual encoders often capture richer dense semantics and exhibit stronger robustness on recognition and understanding tasks. In this work, we investigate how to scale the fusion of these complementary visual representations for vision-language modeling. We propose CoME-VL: Complementary Multi-Encoder Vision-Language, a modular fusion framework that integrates a contrastively trained vision encoder with a self-supervised DINO encoder. Our approach performs representation-level fusion by (i) entropy-guided multi-layer aggregation with orthogonality-constrained projections to reduce redundancy, and (ii) RoPE-enhanced cross-attention to align heterogeneous token grids and produce compact fused visual tokens. The fused tokens can be injected into a decoder-only LLM with minimal changes to standard VLM pipelines. Extensive experiments across diverse vision-language benchmarks demonstrate that CoME-VL consistently outperforms single-encoder baselines. In particular, we observe an average improvement of 4.9% on visual understanding tasks and 5.4% on grounding tasks. Our method achieves state-of-the-art performance on RefCOCO for detection while improving over the baseline by a large margin. Finally, we conduct ablation studies on layer merging, non-redundant feature mixing, and fusion capacity to evaluate how complementary contrastive and self-supervised signals affect VLM performance.
中文标题/摘要
标题:CoME-VL:扩展互补多编码器视觉-语言学习
近期的视觉-语言模型(VLMs)通常依赖于通过对比图像-文本目标训练的单个视觉编码器,例如CLIP风格的预训练。虽然对比编码器在跨模态对齐和检索方面非常有效,但自监督视觉编码器通常能够捕捉到更丰富的密集语义,并在识别和理解任务上表现出更强的鲁棒性。在本文中,我们研究如何扩展这些互补视觉表示的融合以用于视觉-语言建模。我们提出了CoME-VL:互补多编码器视觉-语言,这是一种模块化融合框架,将对比训练的视觉编码器与自监督的DINO编码器集成在一起。我们的方法通过(i)通过熵引导的多层聚合和正交约束投影来减少冗余性进行表示级融合,以及(ii)通过RoPE增强的交叉注意力来对齐异构的标记网格并生成紧凑的融合视觉标记。融合的标记可以注入到解码器仅的LLM中,对标准VLM管道进行最小的修改。在多种视觉-语言基准上的广泛实验表明,CoME-VL始终优于单编码器基线。特别是,我们在视觉理解任务上观察到平均改进4.9%,在定位任务上观察到5.4%的改进。我们的方法在RefCOCO上的检测任务上达到了最先进的性能,同时显著优于基线。最后,我们对层合并、非冗余特征混合和融合容量进行了消融研究,以评估互补对比和自监督信号如何影响VLM性能。
Summary / 总结
CoME-VL is a modular fusion framework that combines a contrastively trained vision encoder with a self-supervised DINO encoder for vision-language modeling. It fuses the two representations through entropy-guided multi-layer aggregation with orthogonality-constrained projections and RoPE-enhanced cross-attention, then injects the compact fused tokens into a decoder-only LLM. Experiments show that CoME-VL outperforms single-encoder baselines, with average improvements of 4.9% on visual understanding tasks and 5.4% on grounding tasks, and achieves state-of-the-art detection performance on RefCOCO.
该研究针对单编码器视觉-语言模型的局限性,提出了CoME-VL框架,该框架结合了对比训练的视觉编码器和自监督的DINO编码器。方法使用熵引导的多层聚合和RoPE增强的交叉注意力来融合互补的视觉表示,然后注入到解码器语言模型中。实验表明,CoME-VL在视觉理解任务上比单编码器基线高出4.9%,在定位任务上高出5.4%,并在RefCOCO检测任务上达到了最先进的性能,显著改进了基线。
The Compression Gap: Why Discrete Tokenization Limits Vision-Language-Action Model Scaling
Authors: Takuya Shiba
First: 2026-04-03T17:06:31+00:00 · Latest: 2026-04-03T17:06:31+00:00
Comments: 11 pages, 1 figure
Abstract
Scaling Vision-Language-Action (VLA) models by upgrading the vision encoder is expected to improve downstream manipulation performance--as it does in vision-language modeling. We show that this expectation fails when actions are represented as discrete tokens, and explain why through an information-theoretic principle we call the Compression Gap: in any visuomotor pipeline, scaling behavior is governed by the location of the tightest information bottleneck. When actions are continuous (e.g., Diffusion Policy), the vision encoder is the binding constraint, and upgrading it directly improves performance. When actions are discretized through a fixed-capacity codebook (e.g., OAT), the codebook becomes the binding constraint, and encoder improvements cannot propagate past it--regardless of how rich the upstream representation is. We validate this principle on the LIBERO benchmark with three lines of evidence: a factorial experiment showing that encoder upgrades improve Diffusion Policy by over 21 percentage points while OAT gains are substantially attenuated across model scales; an encoder quality gradient across four encoders confirming that Diffusion Policy tracks encoder quality monotonically while OAT remains flat; and a codebook size experiment demonstrating that relaxing codebook capacity partially recovers encoder sensitivity, providing causal evidence for the bottleneck hypothesis. Our findings reveal that scaling in Physical AI requires identifying where information bottlenecks lie in the pipeline, rather than uniformly increasing model or data size.
中文标题/摘要
标题:压缩差距:离散标记为何限制视觉-语言-行动模型的扩展
通过升级视觉编码器扩展视觉-语言-行动(VLA)模型有望提高下游操作性能——正如在视觉-语言建模中所见。我们表明,当动作表示为离散标记时,这一期望会失败,并通过我们称为压缩差距的信息论原理进行解释:在任何视觉-运动管道中,行为的扩展由最紧的信息瓶颈位置决定。当动作是连续的(例如,扩散策略)时,视觉编码器是绑定约束,直接升级它会提高性能。当动作通过固定容量的代码簿离散化(例如,OAT)时,代码簿成为绑定约束,编码器的改进无法在其后传播——无论上游表示多么丰富。我们在LIBERO基准上通过三条证据验证了这一原理:一项因子实验显示,编码器升级使扩散策略提高了超过21个百分点,而OAT在不同模型规模上的收益显著减弱;四种编码器的质量梯度确认扩散策略随编码器质量单调增加,而OAT保持不变;以及代码簿大小实验表明,放松代码簿容量部分恢复了编码器的敏感性,为瓶颈假设提供了因果证据。我们的发现揭示了在物理人工智能中,扩展需要识别管道中的信息瓶颈位置,而不是均匀增加模型或数据规模。
Summary / 总结
The research explores why upgrading the vision encoder in Vision-Language-Action models fails to improve manipulation performance when actions are represented as discrete tokens. It introduces the Compression Gap, an information-theoretic principle stating that the tightest information bottleneck in a visuomotor pipeline governs scaling behavior. Experiments on the LIBERO benchmark show that encoder upgrades improve Diffusion Policy (continuous actions) by over 21 percentage points, while gains for OAT (discrete actions through a fixed-capacity codebook) are substantially attenuated across model scales. Across four encoders, Diffusion Policy tracks encoder quality monotonically while OAT remains flat, and enlarging the codebook partially recovers encoder sensitivity, supporting the bottleneck hypothesis.
研究探讨了为何在动作以离散标记表示时,升级视觉编码器不能提高视觉-语言-动作模型的操纵性能。研究引入了压缩差距的概念,指出在视觉-运动管道中,最紧的信息瓶颈决定了行为。通过实验验证了这一观点,实验显示,编码器升级显著提高了连续动作的表现,但对离散动作的影响有限。关键发现包括因子实验、编码器质量梯度和码本大小实验,提供了码本作为约束的证据。
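The bottleneck argument above can be illustrated with a small arithmetic sketch. It is not the paper's formal derivation: the stage names, the capacity numbers, and the helpers `codebook_capacity_bits` and `binding_stage` are hypothetical. The only fact used is that a sequence of tokens drawn from a codebook of size K can carry at most log2(K) bits per token.

```python
import math


def codebook_capacity_bits(num_tokens: int, codebook_size: int) -> float:
    """Upper bound on the bits carried by num_tokens discrete tokens."""
    return num_tokens * math.log2(codebook_size)


def binding_stage(capacities: dict) -> str:
    """Return the stage with the smallest capacity (the tightest bottleneck)."""
    return min(capacities, key=capacities.get)


# Hypothetical capacities in bits; none of these numbers come from the paper.
pipeline = {
    "vision_encoder": 4096.0,
    "action_codebook": codebook_capacity_bits(num_tokens=8, codebook_size=256),
}

# Upgrading the encoder leaves the binding stage unchanged.
upgraded = dict(pipeline, vision_encoder=8192.0)

# Enlarging the codebook relaxes the bound at the binding stage.
relaxed = dict(
    pipeline,
    action_codebook=codebook_capacity_bits(num_tokens=8, codebook_size=1024),
)
```

In this toy setting the codebook stays the binding stage after the encoder upgrade, which mirrors the claim that encoder improvements cannot propagate past a fixed-capacity codebook.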
EffiMiniVLM: A Compact Dual-Encoder Regression Framework
Authors: Yin-Loon Khor, Yi-Jie Wong, Yan Chai Hum
First: 2026-04-03T16:48:59+00:00 · Latest: 2026-04-03T16:48:59+00:00
Abstract
Predicting product quality from multimodal item information is critical in cold-start scenarios, where user interaction history is unavailable and predictions must rely on images and textual metadata. However, existing vision-language models typically depend on large architectures and/or extensive external datasets, resulting in high computational cost. To address this, we propose EffiMiniVLM, a compact dual-encoder vision-language regression framework that integrates an EfficientNet-B0 image encoder and a MiniLM-based text encoder with a lightweight regression head. To improve training sample efficiency, we introduce a weighted Huber loss that leverages rating counts to emphasize more reliable samples, yielding consistent performance gains. Trained using only 20% of the Amazon Reviews 2023 dataset, the proposed model contains 27.7M parameters and requires 6.8 GFLOPs, yet achieves a CES score of 0.40 with the lowest resource cost in the benchmark. Despite its small size, it remains competitive with significantly larger models, achieving comparable performance while being approximately 4x to 8x more resource-efficient than other top-5 methods and being the only approach that does not use external datasets. Further analysis shows that scaling the data to 40% alone allows our model to overtake other methods, which use larger models and datasets, highlighting strong scalability despite the model's compact design.
中文标题/摘要
标题:EffiMiniVLM:一种紧凑的双编码器回归框架
从多模态商品信息预测产品质量在冷启动场景中至关重要,此时用户交互历史不可用,预测必须依赖于图像和文本元数据。然而,现有的视觉-语言模型通常依赖于大型架构和/或大量外部数据集,导致高计算成本。为了解决这一问题,我们提出了一种紧凑的双编码器视觉-语言回归框架EffiMiniVLM,该框架结合了EfficientNet-B0图像编码器、MiniLM为基础的文本编码器和一个轻量级的回归头。为了提高训练样本效率,我们引入了一种加权Huber损失,利用评分计数来强调更可靠的样本,从而获得一致的性能提升。该模型仅使用Amazon Reviews 2023数据集的20%进行训练,包含27.7M参数,需要6.8 GFLOPs,但在基准测试中仍能获得CES得分为0.40,资源成本最低。尽管其规模较小,但其性能与显著更大的模型相当,资源效率比其他顶级方法高约4到8倍,并且是唯一不使用外部数据集的方法。进一步的分析表明,将数据扩展到40%即可使我们的模型超越其他使用更大模型和数据集的方法,尽管模型设计紧凑,但显示出强大的可扩展性。
Summary / 总结
EffiMiniVLM is a compact dual-encoder regression framework that pairs an EfficientNet-B0 image encoder and a MiniLM-based text encoder with a lightweight regression head to predict product quality from multimodal item information. It introduces a weighted Huber loss that uses rating counts to emphasize more reliable samples. Trained on only 20% of the Amazon Reviews 2023 dataset, with 27.7M parameters and 6.8 GFLOPs, EffiMiniVLM achieves a CES score of 0.40 at the lowest resource cost in the benchmark and is approximately 4x to 8x more resource-efficient than other top-5 methods. Scaling the training data to 40% allows it to overtake methods that use larger models and datasets.
EffiMiniVLM 是一个紧凑的双编码器回归框架,使用 EfficientNet-B0 图像编码器和基于 MiniLM 的文本编码器以及轻量级回归头来从多模态商品信息中预测商品质量。它引入了加权 Huber 损失以提高训练样本效率。尽管参数量仅为 27.7M,且计算量仅为 6.8 GFLOPs,EffiMiniVLM 仍能实现 0.40 的 CES 分数,表现出色,资源效率远高于大型模型。进一步分析表明,将数据扩展到 40% 即可使该模型超越使用更大模型和数据集的方法,展示了强大的可扩展性。
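A minimal sketch of a count-weighted Huber loss, to make the idea above concrete. The abstract does not give the weighting function, so the `log1p` weighting and the normalization used here are assumptions, and the function names are hypothetical rather than the authors' implementation.

```python
import math


def huber(residual: float, delta: float = 1.0) -> float:
    """Standard Huber loss: quadratic near zero, linear in the tails."""
    a = abs(residual)
    if a <= delta:
        return 0.5 * a * a
    return delta * (a - 0.5 * delta)


def weighted_huber_loss(preds, targets, rating_counts, delta: float = 1.0) -> float:
    """Huber loss averaged with weights that grow with the rating count.

    Assumption: weight = log1p(count), so items with many ratings
    (more reliable labels) contribute more to the loss.
    """
    weights = [math.log1p(c) for c in rating_counts]
    total = sum(weights)
    if total == 0:
        raise ValueError("at least one sample needs a positive rating count")
    losses = [huber(p - t, delta) for p, t in zip(preds, targets)]
    return sum(w * l for w, l in zip(weights, losses)) / total


# Two hypothetical items: one well-fit, one poorly fit.
equal = weighted_huber_loss([4.0, 3.0], [4.5, 1.0], [10, 10])
skewed = weighted_huber_loss([4.0, 3.0], [4.5, 1.0], [1000, 1])
```

With equal counts the result reduces to the plain mean Huber loss; when the well-fit item has far more ratings, the poorly fit low-count item is discounted and the loss drops.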
Chart-RL: Policy Optimization Reinforcement Learning for Enhanced Visual Reasoning in Chart Question Answering with Vision Language Models
Authors: Yunfei Bai, Amit Dhanda, Shekhar Jain
Venue: KDD 2026
First: 2026-04-03T16:28:03+00:00 · Latest: 2026-04-03T16:28:03+00:00
Comments: In Proceedings of the 32nd ACM-SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2026)
Abstract
Recent advancements in Vision Language Models (VLMs) have demonstrated progress toward true intelligence requiring robust reasoning capabilities. Beyond pattern recognition, linguistic reasoning must integrate with visual comprehension, particularly for Chart Question Answering (CQA) tasks involving complex data visualizations. Current VLMs face significant limitations in CQA, including imprecise numerical extraction, difficulty interpreting implicit visual relationships, and inadequate attention mechanisms for capturing spatial relationships in charts. In this work, we address these challenges by presenting Chart-RL, a novel reinforcement learning framework that enhances VLMs' chart understanding through feedback-driven policy optimization of visual perception and logical inference. Our key innovation is a comprehensive framework that integrates policy-optimization-based Reinforcement Learning (RL) with adaptive reward functions and demonstrates superior performance compared to baseline foundation models and competitive results against larger state-of-the-art architectures. We also integrate Parameter-Efficient Fine-Tuning through Low-Rank Adaptation (LoRA) into the RL framework, which requires only a single-GPU configuration while preserving performance. We conducted extensive benchmarking across open-source, proprietary, and state-of-the-art closed-source models using the ChartQAPro dataset. The RL fine-tuned Qwen3-VL-4B-Instruct model achieved an answer accuracy of 0.634, surpassing the 0.580 accuracy of the Qwen3-VL-8B-Instruct foundation model despite having half the parameter count, while also reducing inference latency from 31 seconds to 9 seconds.
中文标题/摘要
标题:Chart-RL:通过强化学习策略优化增强图表问答中的视觉推理能力
近期视觉语言模型(VLMs)的进步展示了向真正智能迈进的进展,需要具备强大的推理能力。除了模式识别,语言推理必须与视觉理解相结合,特别是在涉及复杂数据可视化的图表问答(CQA)任务中。当前的VLMs在CQA任务中面临诸多限制,包括不精确的数值提取、难以解释隐含的视觉关系以及不足以捕捉图表空间关系的注意力机制。为应对这些挑战,我们提出了Chart-RL,这是一种新颖的强化学习框架,通过反馈驱动的策略优化视觉感知和逻辑推理来增强VLMs的图表理解能力。我们的主要创新是一个综合框架,结合了基于策略优化的强化学习(RL)以及自适应奖励函数,其性能优于基线基础模型,并且在与更大规模的最新架构的比较中具有竞争力。我们还在RL框架中集成了通过低秩适应(LoRA)进行的参数高效微调,仅需单个GPU配置即可保持性能。我们使用ChartQAPro数据集对开源、专有和最新闭源模型进行了广泛的基准测试。RL微调的Qwen3-VL-4B-Instruct模型的答题准确率达到0.634,超过了Qwen3-VL-8B-Instruct基础模型的0.580,尽管其参数量仅为后者的一半,同时将推理延迟从31秒降低到9秒。
Summary / 总结
Chart-RL is a reinforcement learning framework designed to enhance visual reasoning in Chart Question Answering tasks by optimizing policy through feedback-driven learning. It integrates RL techniques and adaptive reward functions to improve VLMs' performance in numerical extraction, visual relationship interpretation, and spatial attention. The model, fine-tuned with Parameter-Efficient Fine-Tuning through Low-Rank Adaptation (LoRA), achieved an answer accuracy of 0.634 on the ChartQAPro dataset, surpassing the accuracy of a larger foundation model while significantly reducing inference latency.
Chart-RL 是一种强化学习框架,旨在通过结合策略优化技术和自适应奖励函数来增强图表问答任务中的视觉推理能力。该框架解决了现有视觉语言模型(VLMs)中存在的问题,如数值提取不精确和注意力机制不足。该框架包括通过低秩适应(LoRA)进行的参数高效微调,相比基线模型和更大规模的先进架构,实现了更好的性能。Qwen3-VL-4B-Instruct 模型在 ChartQAPro 数据集上经过 RL 微调后,达到了 0.634 的答案准确率,参数量仅为 Qwen3-VL-8B-Instruct 模型的一半,并且显著减少了推理延迟。
CoDA: Exploring Chain-of-Distribution Attacks and Post-Hoc Token-Space Repair for Medical Vision-Language Models
Authors: Xiang Chen, Fangfang Yang, Chunlei Meng, Yuxian Dong, Ang Li, Yiwei Wei, Jiahuan Long, Jiujiang Guo, Chengyin Hu
First: 2026-03-19T07:00:44+00:00 · Latest: 2026-04-03T16:16:51+00:00
Abstract
Medical vision-language models (MVLMs) are increasingly used as perceptual backbones in radiology pipelines and as the visual front end of multimodal assistants, yet their reliability under real clinical workflows remains underexplored. Prior robustness evaluations often assume clean, curated inputs or study isolated corruptions, overlooking routine acquisition, reconstruction, display, and delivery operations that preserve clinical readability while shifting image statistics. To address this gap, we propose CoDA, a chain-of-distribution framework that constructs clinically plausible pipeline shifts by composing acquisition-like shading, reconstruction and display remapping, and delivery and export degradations. Under masked structural-similarity constraints, CoDA jointly optimizes stage compositions and parameters to induce failures while preserving visual plausibility. Across brain MRI, chest X-ray, and abdominal CT, CoDA substantially degrades the zero-shot performance of CLIP-style MVLMs, with chained compositions consistently more damaging than any single stage. We also evaluate multimodal large language models (MLLMs) as technical-authenticity auditors of imaging realism and quality rather than pathology. Proprietary multimodal models show degraded auditing reliability and persistent high-confidence errors on CoDA-shifted samples, while the medical-specific MLLMs we test exhibit clear deficiencies in medical image quality auditing. Finally, we introduce a post-hoc repair strategy based on teacher-guided token-space adaptation with patch-level alignment, which improves accuracy on archived CoDA outputs. Overall, our findings characterize a clinically grounded threat surface for MVLM deployment and show that lightweight alignment improves robustness in deployment.
中文标题/摘要
标题:CoDA:探索链式分布攻击及后验词元空间修复技术在医疗视觉语言模型中的应用
医疗视觉-语言模型(MVLMs)在放射学管道和多模态助手的视觉前端中越来越被用作感知骨干,但它们在实际临床工作流程中的可靠性尚未得到充分探索。先前的鲁棒性评估通常假设干净、经过整理的输入或研究孤立的破坏,忽视了保留临床可读性的同时改变图像统计的常规获取、重建、显示和交付操作。为解决这一差距,我们提出了CoDA,一种链式分布框架,通过组合获取样式的阴影、重建和显示映射以及交付和导出降级来构建临床合理的管道变化。在遮罩结构相似性约束下,CoDA 联合优化阶段组合和参数以诱导故障同时保持视觉合理性。在脑部MRI、胸部X光和腹部CT中,CoDA 显著降低了CLIP风格的MVLMs的零样本性能,链式组合比任何单一阶段的破坏更为严重。我们还评估了多模态大型语言模型(MLLMs)作为影像真实性和质量的技术认证者,而不是病理学。我们测试的专有多模态模型在CoDA变化样本上的审计可靠性降低,并且持续存在高置信度错误,而我们测试的针对医学图像质量审计的医疗特定MLLMs表现出明显的缺陷。最后,我们引入了一种基于教师引导的词元空间后验修复策略,基于块级对齐,该策略提高了CoDA输出的准确性。总体而言,我们的研究结果描述了MVLM部署的临床基础威胁面,并表明轻量级对齐可以提高部署的鲁棒性。
Summary / 总结
The paper proposes CoDA, a chain-of-distribution framework for evaluating the robustness of medical vision-language models (MVLMs) under realistic clinical workflows that include acquisition, reconstruction, display, and delivery operations. CoDA substantially degrades the zero-shot performance of CLIP-style MVLMs, and chained compositions are consistently more damaging than any single stage. On CoDA-shifted samples, proprietary multimodal models show degraded auditing reliability and persistent high-confidence errors, while the medical-specific MLLMs tested exhibit clear deficiencies in medical image quality auditing. A post-hoc repair strategy based on teacher-guided token-space adaptation with patch-level alignment improves accuracy on archived CoDA outputs.
研究引入了CoDA框架,通过模拟实际临床工作流程中的变化来评估医疗视觉-语言模型(MVLM)的鲁棒性。通过组合获取过程中的阴影、重建和显示映射以及交付降级,CoDA在保持视觉可信度的同时诱导失败。实验表明,链式组合比任何单一阶段都更严重地损害了MVLM的性能,且专有的多模态模型在审计成像真实性和质量方面表现出可靠性下降。提出了一种基于教师引导的标记空间适应的后处理修复策略,结合块级(patch-level)对齐,以提高已存档CoDA输出上的准确性。
Revealing Physical-World Semantic Vulnerabilities: Universal Adversarial Patches for Infrared Vision-Language Models
Authors: Chengyin Hu, Yuxian Dong, Yikun Guo, Xiang Chen, Junqi Wu, Jiahuan Long, Yiwei Wei, Tingsong Jiang, Wen Yao
First: 2026-04-03T15:42:55+00:00 · Latest: 2026-04-03T15:42:55+00:00
Abstract
Infrared vision-language models (IR-VLMs) have emerged as a promising paradigm for multimodal perception in low-visibility environments, yet their robustness to adversarial attacks remains largely unexplored. Existing adversarial patch methods are mainly designed for RGB-based models in closed-set settings and are not readily applicable to the open-ended semantic understanding and physical deployment requirements of infrared VLMs. To bridge this gap, we propose Universal Curved-Grid Patch (UCGP), a universal physical adversarial patch framework for IR-VLMs. UCGP integrates Curved-Grid Mesh (CGM) parameterization for continuous, low-frequency, and deployable patch generation with a unified representation-driven objective that promotes subspace departure, topology disruption, and stealth. To improve robustness under real-world deployment and domain shift, we further incorporate Meta Differential Evolution and EOT-augmented TPS deformation modeling. Rather than manipulating labels or prompts, UCGP directly disrupts the visual representation space, weakening cross-modal semantic alignment. Extensive experiments demonstrate that UCGP consistently compromises semantic understanding across diverse IR-VLM architectures while maintaining cross-model transferability, cross-dataset generalization, real-world physical effectiveness, and robustness against defenses. These findings reveal a previously overlooked robustness vulnerability in current infrared multimodal systems.
中文标题/摘要
标题:揭示物理世界语义漏洞:红外视觉语言模型的通用对抗性补丁
红外视觉语言模型(IR-VLMs)已成为低能见度环境中多模态感知的一种有前途的范式,但它们对对抗性攻击的鲁棒性尚未得到充分探索。现有的对抗性补丁方法主要针对基于RGB的模型,在封闭集设置中设计,不适用于红外VLMs开放的语义理解和物理部署要求。为弥合这一差距,我们提出了通用曲面网格补丁(UCGP),这是一种针对IR-VLMs的通用物理对抗性补丁框架。UCGP结合了曲面网格参数化(CGM)以生成连续、低频和可部署的补丁,并采用统一表示驱动的目标,促进子空间偏离、拓扑破坏和隐蔽性。为了提高在实际部署和领域转移下的鲁棒性,我们进一步引入了元微分进化和EOT增强的TPS变形建模。UCGP不操纵标签或提示,而是直接破坏视觉表示空间,削弱跨模态语义对齐。大量实验表明,UCGP在多种不同的IR-VLM架构中一致地削弱了语义理解能力,同时保持了跨模型可转移性、跨数据集泛化能力和实际物理效果的鲁棒性,以及对防御措施的鲁棒性。这些发现揭示了当前红外多模态系统中一个之前未被注意到的鲁棒性漏洞。
Summary / 总结
This paper addresses the largely unexplored robustness of infrared vision-language models (IR-VLMs) to adversarial attacks. The authors propose Universal Curved-Grid Patch (UCGP), a framework that combines Curved-Grid Mesh (CGM) parameterization with a unified representation-driven objective to generate continuous, low-frequency, deployable adversarial patches. Rather than manipulating labels or prompts, UCGP directly disrupts the visual representation space, weakening cross-modal semantic alignment; Meta Differential Evolution and EOT-augmented TPS deformation modeling make the patch more robust to real-world deployment and domain shift. Experiments show that UCGP consistently degrades semantic understanding across diverse IR-VLM architectures while maintaining cross-model transferability, cross-dataset generalization, physical effectiveness, and robustness against defenses, revealing a previously overlooked vulnerability in current infrared multimodal systems.
研究旨在探索红外视觉语言模型(IR-VLMs)在对抗攻击下的鲁棒性,因为现有方法不适用于IR-VLMs的开放语义理解和物理部署需求。研究提出了一种名为Universal Curved-Grid Patch (UCGP)的框架,该框架使用曲面网格参数化和统一的表示驱动目标来生成可部署的对抗性贴图。UCGP还结合了元差分进化和EOT增强的TPS变形建模,以增强在真实世界条件下的鲁棒性。实验表明,UCGP在各种IR-VLM架构中一致地削弱了语义理解能力,同时保持了迁移性和泛化能力,并且对防御措施具有鲁棒性。
Can VLMs Truly Forget? Benchmarking Training-Free Visual Concept Unlearning
Authors: Zhangyun Tan, Zeliang Zhang, Susan Liang, Yolo Yunlong Tang, Lisha Chen, Chenliang Xu
First: 2026-04-03T15:36:00+00:00 · Latest: 2026-04-03T15:36:00+00:00
Abstract
VLMs trained on web-scale data retain sensitive and copyrighted visual concepts that deployment may require removing. Training-based unlearning methods share a structural flaw: fine-tuning on a narrow forget set degrades general capabilities before unlearning begins, making it impossible to attribute subsequent performance drops to the unlearning procedure itself. Training-free approaches sidestep this by suppressing concepts through prompts or system instructions, but no rigorous benchmark exists for evaluating them on visual tasks.
We introduce VLM-UnBench, the first benchmark for training-free visual concept unlearning in VLMs. It covers four forgetting levels, 7 source datasets, and 11 concept axes, and pairs a three-level probe taxonomy with five evaluation conditions to separate genuine forgetting from instruction compliance. Across 8 evaluation settings and 13 VLM configurations, realistic unlearning prompts leave forget accuracy near the no-instruction baseline; meaningful reductions appear only under oracle conditions that disclose the target concept to the model. Object and scene concepts are the most resistant to suppression, and stronger instruction-tuned models remain capable despite explicit forget instructions. These results expose a clear gap between prompt-level suppression and true visual concept erasure.
中文标题/摘要
标题:VLMs真能遗忘吗?免训练视觉概念去学习基准测试
在大规模网络数据上训练的VLM保留了敏感和受版权保护的视觉概念,部署时可能需要移除。基于训练的去学习方法存在结构缺陷:在狭窄的遗忘集上微调会在去学习开始之前就降低一般能力,使得后续性能下降无法归因于去学习过程本身。免训练方法通过提示或系统指令抑制概念,从而回避了这一问题,但目前缺乏在视觉任务上评估它们的严格基准。我们引入了VLM-UnBench,这是首个针对VLM中免训练视觉概念去学习的基准测试。它涵盖了四个遗忘级别、7个源数据集和11个概念轴,并结合三级探针分类和五种评估条件,以区分真正的遗忘和指令遵从。在8种评估设置和13种VLM配置下,现实的去学习提示使遗忘准确率接近无指令基线;只有在向模型披露目标概念的oracle条件下,才出现有意义的下降。对象和场景概念最难以抑制,即使有明确的遗忘指令,更强的指令调优模型仍保持能力。这些结果揭示了提示级抑制与真正视觉概念擦除之间明显的差距。
Summary / 总结
The study addresses the issue of sensitive and copyrighted visual concepts retained by VLMs and introduces VLM-UnBench, a benchmark for evaluating training-free visual concept unlearning. The method involves using prompts or system instructions to suppress concepts without fine-tuning. Key findings show that realistic unlearning prompts do not significantly reduce forget accuracy, and only under oracle conditions do meaningful reductions occur. Object and scene concepts are particularly resistant to suppression, indicating a gap between prompt-level suppression and true visual concept erasure.
论文引入了VLM-UnBench,这是一个用于评估VLM中免训练视觉概念去学习的基准。它针对基于训练的去学习方法的局限性,并通过涵盖多种遗忘级别、源数据集和概念轴的综合设置来评估抑制技术的有效性。研究发现,现实的去学习提示并不会显著降低遗忘准确率,只有在披露目标概念的oracle条件下才会出现有意义的下降。物体和场景概念特别难以抑制,表明提示级抑制与真正视觉概念擦除之间存在差距。
QVAD: A Question-Centric Agentic Framework for Efficient and Training-Free Video Anomaly Detection
Authors: Lokman Bekit, Hamza Karim, Nghia T Nguyen, Yasin Yilmaz
First: 2026-04-03T13:48:34+00:00 · Latest: 2026-04-03T13:48:34+00:00
Abstract
Video Anomaly Detection (VAD) is a fundamental challenge in computer vision, particularly due to the open-set nature of anomalies. While recent training-free approaches utilizing Vision-Language Models (VLMs) have shown promise, they typically rely on massive, resource-intensive foundation models to compensate for the ambiguity of static prompts. We argue that the bottleneck in VAD is not necessarily model capacity, but rather the static nature of inquiry. We propose QVAD, a question-centric agentic framework that treats VLM-LLM interaction as a dynamic dialogue. By iteratively refining queries based on visual context, our LLM agent guides smaller VLMs to produce high-fidelity captions and precise semantic reasoning without parameter updates. This "prompt-updating" mechanism effectively unlocks the latent capabilities of lightweight models, enabling state-of-the-art performance on UCF-Crime, XD-Violence, and UBNormal using a fraction of the parameters required by competing methods. We further demonstrate exceptional generalizability on the single-scene ComplexVAD dataset. Crucially, QVAD achieves high inference speeds with minimal memory footprints, making advanced VAD capabilities deployable on resource-constrained edge devices.
中文标题/摘要
标题:QVAD:一种以问题为中心的代理框架,用于高效的无训练视频异常检测
视频异常检测(VAD)是计算机视觉中的一个基本挑战,尤其是由于异常的开放集性质。虽然最近的无训练方法利用视觉-语言模型(VLMs)显示出前景,但它们通常依赖于大规模、资源密集的基础模型来弥补静态提示的模糊性。我们认为,VAD 的瓶颈不一定是模型容量,而是询问的静态性质。我们提出了 QVAD,一种以问题为中心的代理框架,将 VLM-LLM 交互视为动态对话。通过基于视觉上下文迭代细化查询,我们的 LLM 代理引导较小的 VLM 生成高保真度的字幕和精确的语义推理,而不进行参数更新。这种“提示更新”机制有效地解锁了轻量级模型的潜在能力,使用竞争对手方法所需参数的一小部分,在 UCF-Crime、XD-Violence 和 UBNormal 上实现了最先进的性能。我们进一步在单场景 ComplexVAD 数据集上展示了出色的泛化能力。最关键的是,QVAD 实现了高速推理和最小的内存占用,使高级 VAD 能力能够在资源受限的边缘设备上部署。
Summary / 总结
QVAD is a question-centric agentic framework for video anomaly detection that enhances the performance of lightweight models by iteratively refining queries based on visual context. This approach, which treats VLM-LLM interaction as a dynamic dialogue, enables high-fidelity captioning and precise semantic reasoning without parameter updates. QVAD achieves state-of-the-art performance on UCF-Crime, XD-Violence, and UBNormal datasets using fewer parameters than competing methods and demonstrates strong generalizability on the ComplexVAD dataset. Additionally, QVAD offers high inference speeds and minimal memory usage, making it suitable for deployment on resource-constrained devices.
QVAD 是一种基于问题的代理框架,通过基于视觉上下文迭代细化查询来增强较小模型的表现。该方法将 VLM-LLM 交互视为动态对话,使 LLM 代理能够引导模型生成高保真度的字幕和精确的语义推理,而无需参数更新。QVAD 在 UCF-Crime、XD-Violence 和 UBNormal 数据集上实现了最先进的性能,使用比竞争方法更少的参数,并在 ComplexVAD 数据集上展示了强大的泛化能力。此外,QVAD 提供了高速推理和极小的内存占用,使其适用于资源受限的边缘设备部署。
Semantic Richness or Geometric Reasoning? The Fragility of VLM's Visual Invariance
Authors: Jason Qiu, Zachary Meurer, Xavier Thomas, Deepti Ghadiyaram
First: 2026-04-02T10:02:49+00:00 · Latest: 2026-04-03T11:47:24+00:00
Abstract
This work investigates the fundamental fragility of state-of-the-art Vision-Language Models (VLMs) under basic geometric transformations. While modern VLMs excel at semantic tasks such as recognizing objects in canonical orientations and describing complex scenes, they exhibit systematic failures at a more fundamental level: lack of robust spatial invariance and equivariance required to reliably determine object identity under simple rotations, scaling, and identity transformations. We demonstrate this limitation through a systematic evaluation across diverse visual domains, including symbolic sketches, natural photographs, and abstract art. Performance drops sharply as semantic content becomes sparse, and this behavior is observed across architectures, model capacities, and prompting strategies. Overall, our results reveal a systematic gap between semantic understanding and spatial reasoning in current VLMs, highlighting the need for stronger geometric grounding in future multimodal systems.
中文标题/摘要
标题:语义丰富性或几何推理?VLM视觉不变性的脆弱性
本研究探讨了最先进的视觉-语言模型(VLMs)在基本几何变换下的根本脆弱性。尽管现代VLMs在识别处于标准方向的对象和描述复杂场景等语义任务上表现出色,但在更基本的层面上,它们表现出系统性的失败:缺乏可靠的确定物体身份所需的稳健的空间不变性和协变性。我们通过在包括符号草图、自然照片和抽象艺术在内的多种视觉领域进行系统评估,展示了这一局限性。随着语义内容的稀疏,性能急剧下降,这种行为在不同架构、模型容量和提示策略中均被观察到。总体而言,我们的结果揭示了当前VLMs在语义理解和空间推理之间的系统性差距,突显了未来多模态系统中需要更强的几何基础。
Summary / 总结
This work examines the fundamental limitations of state-of-the-art Vision-Language Models (VLMs) in maintaining spatial invariance and equivariance under basic geometric transformations. Despite their success in semantic tasks, VLMs show significant performance drops when dealing with sparse semantic content, indicating a gap between their semantic understanding and spatial reasoning capabilities. The study evaluates VLMs across various visual domains and finds that performance deteriorates sharply as semantic content decreases, suggesting the need for improved geometric reasoning in VLMs.
这项研究考察了最先进的视觉-语言模型(VLMs)在基本几何变换下的脆弱性,发现尽管VLMs在语义任务上表现良好,但在基本的空间不变性和协变性方面却存在困难,特别是在语义内容稀疏时。研究在各种视觉领域评估了VLMs,并发现简单变换下的性能急剧下降,表明当前VLMs在语义理解和空间推理之间存在差距。
Not All Frames Deserve Full Computation: Accelerating Autoregressive Video Generation via Selective Computation and Predictive Extrapolation
Authors: Hanshuai Cui, Zhiqing Tang, Zhi Yao, Fanshuai Meng, Weijia Jia, Wei Zhao
First: 2026-04-03T11:34:47+00:00 · Latest: 2026-04-03T11:34:47+00:00
Abstract
Autoregressive (AR) video diffusion models enable long-form video generation but remain expensive due to repeated multi-step denoising. Existing training-free acceleration methods rely on binary cache-or-recompute decisions, overlooking intermediate cases where direct reuse is too coarse yet full recomputation is unnecessary. Moreover, asynchronous AR schedules assign different noise levels to co-generated frames, yet existing methods process the entire valid interval uniformly. To address these AR-specific inefficiencies, we present SCOPE, a training-free framework for efficient AR video diffusion. SCOPE introduces a tri-modal scheduler over cache, predict, and recompute, where prediction via noise-level Taylor extrapolation fills the gap between reuse and recomputation with explicit stability controls backed by error propagation analysis. It further introduces selective computation that restricts execution to the active frame interval. On MAGI-1 and SkyReels-V2, SCOPE achieves up to 4.73x speedup while maintaining quality comparable to the original output, outperforming all training-free baselines.
中文标题/摘要
标题:并非所有帧都值得进行完整计算:通过选择性计算和预测外推加速自回归视频生成
自回归(AR)视频扩散模型能够实现长视频生成,但由于重复的多步去噪操作,成本仍然很高。现有的无需训练加速方法依赖于二元缓存或重新计算的决策,忽视了直接重用过于粗略而完全重新计算又不必要的中间情况。此外,异步AR调度为同时生成的帧分配不同的噪声级别,但现有方法对整个有效区间进行均匀处理。为解决这些AR特有的低效问题,我们提出了SCOPE,这是一种无需训练的高效AR视频扩散框架。SCOPE引入了缓存、预测和重新计算的三模调度器,其中通过噪声级别泰勒外推进行预测填补了重用和重新计算之间的空白,并通过误差传播分析提供显式的稳定性控制。它还引入了选择性计算,限制执行仅在活动帧区间内进行。在MAGI-1和SkyReels-V2上,SCOPE实现了最高4.73倍的加速,同时保持与原始输出相当的质量,优于所有无需训练的基线方法。
Summary / 总结
The paper addresses the inefficiency of autoregressive (AR) video diffusion models by proposing SCOPE, a training-free framework built around a tri-modal scheduler that chooses among cache, predict, and recompute. The predict mode uses noise-level Taylor extrapolation to fill the gap between direct reuse and full recomputation, with explicit stability controls backed by error propagation analysis. SCOPE also restricts computation to the active frame interval. On MAGI-1 and SkyReels-V2 it achieves up to 4.73x speedup while maintaining quality comparable to the original output, outperforming all training-free baselines.
论文提出了一种名为SCOPE的无需训练的框架,通过引入缓存、预测和重新计算的三模调度器来解决自回归(AR)视频扩散模型的低效问题。该调度器使用噪声级别泰勒外推进行预测,提供了一种介于重用和重新计算之间的中间方案。SCOPE还限制计算仅在活动帧区间内进行,从而在MAGI-1和SkyReels-V2上实现了最高4.73倍的加速,同时保持与原始输出相当的质量,超越了所有无需训练的基线方法。
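The cache/predict/recompute idea above can be sketched as follows. This is an illustration only: the abstract does not specify the decision criterion or the extrapolation order, so the first-order finite-difference form, the `change` score, and both thresholds are assumptions, and the function names are hypothetical.

```python
def taylor_extrapolate(f_prev, f_curr, s_prev, s_curr, s_next):
    """First-order Taylor extrapolation of a feature vector in noise level.

    f(s_next) is approximated by f(s_curr) + f'(s_curr) * (s_next - s_curr),
    with f' estimated by a finite difference over the two cached steps.
    """
    if s_curr == s_prev:
        raise ValueError("cached noise levels must differ")
    step = s_next - s_curr
    slopes = [(c - p) / (s_curr - s_prev) for p, c in zip(f_prev, f_curr)]
    return [c + m * step for c, m in zip(f_curr, slopes)]


def choose_mode(change: float, reuse_tol: float = 0.01, predict_tol: float = 0.1) -> str:
    """Pick one of three modes from a (hypothetical) feature-change score."""
    if change < reuse_tol:
        return "cache"      # output barely moves: reuse the cached result
    if change < predict_tol:
        return "predict"    # moderate drift: extrapolate instead of recomputing
    return "recompute"      # large drift: run the full denoising step


# Cached features at noise levels 1.0 and 0.5, extrapolated to 0.0.
predicted = taylor_extrapolate([1.0, 2.0], [2.0, 4.0], 1.0, 0.5, 0.0)
```

A real scheduler would also need the stability controls the abstract mentions, for example a cap on how many consecutive steps may be predicted before a forced recompute.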
Collaborative Multi-Mode Pruning for Vision-Language Models
Authors: Zimeng Wu, Yunhong Wang, Donghao Wang, Jiaxin Chen
First: 2026-04-03T10:44:23+00:00 · Latest: 2026-04-03T10:44:23+00:00
Comments: CVPR2026 Accepted
Abstract
Vision-Language Models (VLMs) have advanced rapidly within the unified Transformer architecture, yet their deployment on resource-constrained devices remains challenging due to high computational complexity. While pruning has emerged as an effective technique for compressing VLMs, existing approaches predominantly focus on a single mode by pruning either parameters or tokens, failing to fully explore the inherent redundancy in each mode, which leads to substantial performance degradation at high pruning ratios. To address these limitations, we propose Collaborative Multi-Mode Pruning (CoMP), a novel framework tailored for VLMs that performs joint parameter and token pruning. Specifically, we first design a Collaborative Importance Metric (CIM) that investigates the mutual interference between the coupled parameters and tokens. It incorporates the distinct significance of tokens into the computation of parameter importance scores, while simultaneously mitigating the effect of pruned parameters on token importance scores. Moreover, we develop a Multi-Mode Pruning Strategy (MPS) that decomposes the overall pruning process into a sequence of pruning stages; in each stage we estimate the priority of different pruning modes based on their pruning cost and adaptively shift to the optimal one. Additionally, MPS integrates the historical cost and random exploration in order to achieve a stable pruning process and avoid local optima. Extensive experiments across various vision-language tasks and models demonstrate that our method effectively improves performance under high pruning ratios compared to state-of-the-art approaches. The source code is available at https://github.com/Wuzimeng/CoMP.git.
中文标题/摘要
标题:协作多模式剪枝方法用于视觉-语言模型
视觉-语言模型(VLMs)在统一的Transformer架构中取得了快速进展,但由于其在资源受限设备上的部署面临高计算复杂度的挑战,因此其应用仍然具有挑战性。尽管剪枝已成为压缩VLMs的有效技术,但现有的方法主要集中在单一模式上,通过剪枝参数或标记来压缩模型,忽视了在每个模式中固有的冗余性,导致在高剪枝比例下性能大幅下降。为了解决上述限制,我们提出了一种名为协作多模式剪枝(CoMP)的新框架,该框架专门针对VLMs,通过联合剪枝参数和标记来实现。具体而言,我们首先设计了一种协作重要性度量(CIM),以研究耦合参数和标记之间的相互干扰。CIM将标记的独特重要性纳入参数重要性得分的计算中,同时减轻剪枝参数对标记重要性得分的影响。此外,我们开发了一种多模式剪枝策略(MPS),将整体剪枝过程分解为一系列剪枝阶段,在每个阶段中,根据其剪枝成本估计不同剪枝模式的优先级,并根据其剪枝成本适应性地转向最优模式。此外,MPS 结合了历史成本和随机探索,以实现稳定的剪枝过程并避免局部最优。在各种视觉-语言任务和模型上的广泛实验表明,我们的方法在高剪枝比例下有效促进了性能,优于最先进的方法。源代码可在 https://github.com/Wuzimeng/CoMP.git 获取。
Summary / 总结
The research aims to address the high computational complexity of Vision-Language Models (VLMs) for deployment on resource-constrained devices by proposing Collaborative Multi-Mode Pruning (CoMP). CoMP jointly prunes parameters and tokens, incorporating a Collaborative Importance Metric (CIM) to assess the mutual interference between parameters and tokens, and a Multi-Mode Pruning Strategy (MPS) to adaptively shift between parameter and token pruning stages. Experiments show that CoMP outperforms state-of-the-art approaches at high pruning ratios across various vision-language tasks and models.
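The research addresses the high computational complexity of Vision-Language Models (VLMs) on resource-constrained devices. The authors propose Collaborative Multi-Mode Pruning (CoMP), which prunes parameters and tokens jointly to explore the inherent redundancy in both. The method introduces a Collaborative Importance Metric (CIM) and a Multi-Mode Pruning Strategy (MPS) to improve performance at high pruning ratios. Experiments show that CoMP outperforms existing methods across various vision-language tasks and models.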
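The stage-wise mode selection described for MPS can be illustrated with a minimal sketch. It assumes a simple epsilon-greedy rule over an exponentially smoothed cost per mode; the function names, the `momentum` and `epsilon` parameters, and the cost callback are hypothetical and are not taken from the CoMP code base.

```python
import random


def choose_mode(avg_cost, epsilon, rng):
    """Pick the mode with the lowest smoothed cost; explore at random with probability epsilon."""
    modes = sorted(avg_cost)  # fixed order so ties resolve deterministically
    if rng.random() < epsilon:
        return rng.choice(modes)
    return min(modes, key=avg_cost.get)


def update_cost(avg_cost, mode, observed, momentum=0.7):
    """Blend the newly observed cost into the historical average for that mode."""
    avg_cost[mode] = momentum * avg_cost[mode] + (1.0 - momentum) * observed


def run_schedule(observe_cost, stages, epsilon=0.1, seed=0):
    """Run a sequence of pruning stages, returning the chosen modes and final smoothed costs.

    observe_cost(mode, stage) is a hypothetical callback that returns the
    pruning cost (for example, a loss increase) of applying `mode` at `stage`.
    """
    rng = random.Random(seed)
    avg_cost = {"parameter": 0.0, "token": 0.0}
    history = []
    for stage in range(stages):
        mode = choose_mode(avg_cost, epsilon, rng)
        update_cost(avg_cost, mode, observe_cost(mode, stage))
        history.append(mode)
    return history, avg_cost
```

Smoothing the cost keeps a single noisy stage from flipping the schedule, and the random exploration term lets a mode that looked expensive early be retried later, which is the role the abstract assigns to historical cost and random exploration.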
Adaptive Guidance for Retrieval-Augmented Masked Diffusion Models
Authors: Jaemin Kim, Jong Chul Ye
First: 2026-03-18T12:54:50+00:00 · Latest: 2026-04-03T10:32:08+00:00
Abstract
Retrieval-Augmented Generation (RAG) improves factual grounding by incorporating external knowledge into language model generation. However, when retrieved context is noisy, unreliable, or inconsistent with the model's parametric knowledge, it introduces retrieval-prior conflicts that can degrade generation quality. While this problem has been studied in autoregressive language models, it remains largely unexplored in diffusion-based language models, where the iterative denoising process introduces unique challenges for integrating retrieved context. In this work, we propose Adaptive Retrieval-Augmented Masked Diffusion (ARAM), a training-free adaptive guidance framework for Masked Diffusion Models (MDMs) in RAG settings. ARAM dynamically calibrates the guidance scale during denoising according to the Signal-to-Noise Ratio (SNR) of the distributional shift induced by retrieved context. Intuitively, the model strengthens guidance when the retrieved context provides reliable corrective evidence and suppresses it when the contextual signal is noisy or non-supportive. Extensive experiments on multiple knowledge-intensive QA benchmarks show that ARAM improves overall QA performance over competitive RAG baselines.
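The idea of scaling guidance by the SNR of the context-induced shift can be sketched as follows. This is a toy illustration only: the SNR proxy (squared peak of the logit shift over the mean squared remainder) and the saturating map from SNR to guidance scale are assumptions made for the example, not the definitions used in ARAM.

```python
def shift_snr(shift, eps=1e-8):
    """Hypothetical SNR proxy: squared peak of the shift over the mean square of the rest."""
    peak_index = max(range(len(shift)), key=lambda i: abs(shift[i]))
    signal = shift[peak_index] ** 2
    rest = [s * s for i, s in enumerate(shift) if i != peak_index]
    noise = sum(rest) / len(rest) if rest else 0.0
    return signal / (noise + eps)


def adaptive_scale(snr, max_scale=2.0):
    """Map SNR to a guidance scale in [0, max_scale): higher SNR gives stronger guidance."""
    return max_scale * snr / (1.0 + snr)


def guided_logits(without_ctx, with_ctx, max_scale=2.0):
    """Classifier-free-guidance style mix whose scale depends on the shift's SNR."""
    shift = [c - u for c, u in zip(with_ctx, without_ctx)]
    scale = adaptive_scale(shift_snr(shift), max_scale)
    return [u + scale * s for u, s in zip(without_ctx, shift)], scale
```

A sharply peaked shift (the context points clearly at one token) yields a scale near `max_scale`, while a diffuse shift yields a smaller scale, matching the intuition of strengthening guidance for reliable evidence and suppressing it otherwise.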
From Virtual Environments to Real-World Trials: Emerging Trends in Autonomous Driving
Authors: A. Humnabadkar, A. Sikdar, B. Cave, H. Zhang, N. Bessis, A. Behera
First: 2026-03-18T13:32:26+00:00 · Latest: 2026-04-03T10:04:30+00:00
Comments: Accepted manuscript - Transactions on Intelligent Transportation Systems
Abstract
Autonomous driving technologies have achieved significant advances in recent years, yet their real-world deployment remains constrained by data scarcity, safety requirements, and the need for generalization across diverse environments. In response, synthetic data and virtual environments have emerged as powerful enablers, offering scalable, controllable, and richly annotated scenarios for training and evaluation. This survey presents a comprehensive review of recent developments at the intersection of autonomous driving, simulation technologies, and synthetic datasets. We organize the landscape across three core dimensions: (i) the use of synthetic data for perception and planning, (ii) digital twin-based simulation for system validation, and (iii) domain adaptation strategies bridging synthetic and real-world data. We also highlight the role of vision-language models and simulation realism in enhancing scene understanding and generalization. A detailed taxonomy of datasets, tools, and simulation platforms is provided, alongside an analysis of trends in benchmark design. Finally, we discuss critical challenges and open research directions, including Sim2Real transfer, scalable safety validation, cooperative autonomy, and simulation-driven policy learning, that must be addressed to accelerate the path toward safe, generalizable, and globally deployable autonomous driving systems.
中文标题/摘要
标题:从虚拟环境到现实世界试验:自主驾驶新兴趋势
近年来,自主驾驶技术取得了显著进展,但其在现实世界的部署仍受到数据稀缺性、安全要求以及跨不同环境泛化的限制。为应对这些挑战,合成数据和虚拟环境已成为强大的助力,提供了可扩展、可控且注释丰富的场景,用于训练和评估。本文综述了自主驾驶、模拟技术和合成数据集交叉领域的最新进展。我们从三个核心维度组织了这一景观:(i) 使用合成数据进行感知和规划,(ii) 基于数字孪生的模拟用于系统验证,(iii) 跨合成和现实世界数据的领域适应策略。我们还强调了视觉语言模型和模拟现实性在增强场景理解和泛化方面的作用。我们提供了数据集、工具和模拟平台的详细分类,并分析了基准设计的趋势。最后,我们讨论了必须解决的关键挑战和开放研究方向,包括Sim2Real迁移、可扩展的安全验证、协同自主以及基于模拟的策略学习,以加速实现安全、泛化和全球部署的自主驾驶系统。
Summary / 总结
This paper reviews recent advancements in autonomous driving, focusing on the use of synthetic data and virtual environments to address data scarcity and safety concerns. It covers three main areas: synthetic data for perception and planning, digital twin-based simulation for system validation, and domain adaptation strategies. Key findings include the importance of vision-language models and simulation realism in enhancing scene understanding and generalization, and the need for Sim2Real transfer and scalable safety validation to achieve safe and globally deployable autonomous driving systems.
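The paper examines the use of synthetic data and virtual environments to address challenges in real-world deployment of autonomous driving, such as data scarcity and safety requirements. It reviews recent progress in synthetic data for perception and planning, digital-twin-based simulation for system validation, and domain adaptation between synthetic and real data. Key findings include the importance of vision-language models and simulation realism for scene understanding and generalization, and the need for Sim2Real transfer and scalable safety validation to achieve safe, generalizable autonomous driving systems.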
When Negation Is a Geometry Problem in Vision-Language Models
Authors: Fawaz Sammani, Tzoulio Chamiti, Paul Gavrikov, Nikos Deligiannis
Venue: CVPR
First: 2026-03-20T23:06:23+00:00 · Latest: 2026-04-03T09:27:25+00:00
Comments: Accepted to CVPR (Multimodal Algorithmic Reasoning Workshop) 2026
Abstract
Joint Vision-Language Embedding models such as CLIP typically fail at understanding negation in text queries, for example, failing to distinguish "no" in the query: "a plain blue shirt with no logos". Prior work has largely addressed this limitation through data-centric approaches, fine-tuning CLIP on large-scale synthetic negation datasets. However, these efforts are commonly evaluated using retrieval-based metrics that cannot reliably reflect whether negation is actually understood. In this paper, we identify two key limitations of such evaluation metrics and investigate an alternative evaluation framework based on Multimodal LLMs-as-a-judge, which typically excel at understanding simple yes/no questions about image content, providing a fair evaluation of negation understanding in CLIP models. We then ask whether there already exists a direction in the CLIP embedding space associated with negation. We find evidence that such a direction exists, and show that it can be manipulated through test-time intervention via representation engineering to steer CLIP toward negation-aware behavior without any fine-tuning. Finally, we test negation understanding on non-common image-text samples to evaluate generalization under distribution shifts. Code is at https://github.com/fawazsammani/negation-steering
中文标题/摘要
标题:当否定成为视觉-语言模型中的几何问题
联合视觉-语言嵌入模型如CLIP通常在理解文本查询中的否定方面表现不佳,例如,在查询“一件没有logo的蓝色衬衫”中无法区分“no”。先前的工作主要通过数据导向的方法,对大规模合成否定数据集进行CLIP的微调来解决这一限制。然而,这些努力通常使用检索型评估指标,无法可靠地反映是否真正理解了否定。在本文中,我们识别了此类评估指标的两个关键局限性,并探讨了一种基于多模态LLM作为评判者的替代评估框架,这些模型通常擅长理解关于图像内容的简单是/否问题,从而提供对CLIP模型中否定理解的公平评估。然后我们询问在CLIP嵌入空间中是否已经存在与否定相关的方向。我们发现这种方向确实存在,并通过测试时干预和表示工程对其进行操纵,使其引导CLIP朝向否定感知的行为,而无需任何微调。最后,我们在非常见图像-文本样本上测试否定理解,以评估在分布转移下的泛化能力。代码见https://github.com/fawazsammani/negation-steering
Summary / 总结
This paper addresses the challenge of negation understanding in joint vision-language models like CLIP, which often fail to correctly interpret negation in text queries. The authors propose an alternative evaluation framework using Multimodal LLMs-as-a-judge to assess negation understanding more accurately than retrieval-based metrics. They find that a direction in the CLIP embedding space associated with negation exists and can be manipulated through test-time intervention to improve negation understanding without fine-tuning. The study also evaluates the model's generalization to non-common image-text samples under distribution shifts.
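This paper examines how vision-language models such as CLIP struggle with negation in text queries, often failing to interpret it correctly. The authors propose a new evaluation framework that uses multimodal LLMs as judges to assess negation understanding more accurately than retrieval-based metrics. They find that a direction associated with negation exists in the CLIP embedding space and can be manipulated through test-time intervention to improve negation understanding without fine-tuning. The study also evaluates generalization on non-common image-text samples, particularly under distribution shift.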
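Test-time steering along an embedding direction can be sketched with plain vectors. The example assumes the direction is estimated as a difference of means between embeddings of negated and affirmative captions, which is a common representation-engineering recipe and not necessarily the procedure used in the paper; all function names and the `alpha` strength are hypothetical.

```python
import math


def normalize(v):
    """Scale a vector to unit length (assumes the vector is non-zero)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def negation_direction(negated, affirmative):
    """Unit difference-of-means direction from affirmative to negated embeddings."""
    diff = [n - a for n, a in zip(mean_vector(negated), mean_vector(affirmative))]
    return normalize(diff)


def steer(embedding, direction, alpha):
    """Shift an embedding along the direction by alpha, then renormalize."""
    return normalize([e + alpha * d for e, d in zip(embedding, direction)])


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    return sum(a * b for a, b in zip(normalize(u), normalize(v)))
```

With a real model, `negated` and `affirmative` would hold text embeddings of paired captions, and `steer` would be applied to the query embedding before retrieval; no weights are updated at any point.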
FLEX: A Largescale Multimodal, Multiview Dataset for Learning Structured Representations for Fitness Action Quality Assessment
Authors: Hao Yin, Lijun Gu, Paritosh Parmar, Lin Xu, Tianxiao Guo, Xiujin Liu, Weiwei Fu, Yang Zhang, Tianyou Zheng
First: 2025-06-02T01:44:02+00:00 · Latest: 2026-04-03T09:25:47+00:00
Comments: Dataset and code are available at https://github.com/HaoYin116/FLEX. Project page: https://haoyin116.github.io/FLEX_Dataset
Abstract
Action Quality Assessment (AQA) -- the task of quantifying how well an action is performed -- has great potential for detecting errors in gym weight training, where accurate feedback is critical to prevent injuries and maximize gains. Existing AQA datasets, however, are limited to single-view competitive sports and RGB video, lacking multimodal signals and professional assessment of fitness actions. We introduce FLEX, the first large-scale, multimodal, multiview dataset for fitness AQA that incorporates surface electromyography (sEMG). FLEX contains over 7,500 multiview recordings of 20 weight-loaded exercises performed by 38 subjects of diverse skill levels, with synchronized RGB video, 3D pose, sEMG, and physiological signals. Expert annotations are organized into a Fitness Knowledge Graph (FKG) linking actions, key steps, error types, and feedback, supporting a compositional scoring function for interpretable quality assessment. FLEX enables multimodal fusion, cross-modal prediction -- including the novel Video→EMG task -- and biomechanically oriented representation learning. Building on the FKG, we further introduce FLEX-VideoQA, a structured question-answering benchmark with hierarchical queries that drive cross-modal reasoning in vision-language models. Baseline experiments demonstrate that multimodal inputs, multiview video, and fine-grained annotations significantly enhance AQA performance. FLEX thus advances AQA toward richer multimodal settings and provides a foundation for AI-powered fitness assessment and coaching. Dataset and code are available at https://github.com/HaoYin116/FLEX. Project page: https://haoyin116.github.io/FLEX_Dataset.
中文标题/摘要
标题:FLEX:用于健身动作质量评估的大规模多模态多视角数据集
动作质量评估(AQA)——量化动作执行质量的任务——在健身房举重训练中具有巨大潜力,准确的反馈对于预防受伤和最大化收益至关重要。现有的AQA数据集仅限于单视角竞技运动和RGB视频,缺乏多模态信号和健身动作的专业评估。我们介绍了FLEX,这是首个用于健身AQA的大规模多模态多视角数据集,包含表面肌电图(sEMG)。FLEX包含7,500多个20种负重练习的多视角记录,由38名不同技能水平的受试者完成,配有同步的RGB视频、3D姿态、sEMG和生理信号。专家注释组织成健身知识图谱(FKG),链接动作、关键步骤、错误类型和反馈,支持组合评分函数以实现可解释的质量评估。FLEX使多模态融合、跨模态预测——包括新颖的Video$\rightarrow$EMG任务——和生物力学导向的表示学习成为可能。基于FKG,我们进一步引入了FLEX-VideoQA,这是一个结构化问答基准,具有层次查询,驱动视觉-语言模型中的跨模态推理。基线实验表明,多模态输入、多视角视频和细粒度注释显著提高了AQA性能。FLEX因此推动了AQA向更丰富的多模态环境发展,并为基于AI的健身评估和指导提供了基础。数据集和代码可在https://github.com/HaoYin116/FLEX 获取。项目页面链接https://haoyin116.github.io/FLEX_Dataset
Summary / 总结
The research aims to improve Action Quality Assessment (AQA) in fitness training by developing a comprehensive dataset, FLEX, which includes multimodal signals like surface electromyography (sEMG) and 3D pose, along with expert annotations in a Fitness Knowledge Graph (FKG). The dataset contains over 7,500 recordings of 20 weight-loaded exercises performed by 38 subjects, enabling multimodal fusion and cross-modal prediction tasks. Experimental results show that multimodal inputs and fine-grained annotations significantly enhance AQA performance, advancing the field towards richer multimodal settings for fitness assessment and coaching.
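The research aims to improve Action Quality Assessment (AQA) in fitness training by developing FLEX, a large-scale multimodal dataset that includes surface electromyography (sEMG) and 3D pose data. The dataset contains over 7,500 recordings of 20 weight-loaded exercises performed by 38 subjects of varying skill levels, with synchronized RGB video and physiological signals. Expert annotations are organized into a Fitness Knowledge Graph (FKG) that supports interpretable quality assessment. Experiments show that multimodal inputs and fine-grained annotations significantly improve AQA performance, providing a richer multimodal setting for fitness assessment and coaching.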
Toward an Artificial General Teacher: Procedural Geometry Data Generation and Visual Grounding with Vision-Language Models
Authors: Hai Nguyen-Truong, Alper Balbay, Tunga Bayrak
First: 2026-04-03T09:10:21+00:00 · Latest: 2026-04-03T09:10:21+00:00
Comments: 12 pages, 7 figures
Abstract
We study visual explanation in geometry education as a Referring Image Segmentation (RIS) problem: given a diagram and a natural language description, the task is to produce a pixel-level mask for the referred geometric element. However, existing RIS models trained on natural image benchmarks such as RefCOCO fail catastrophically on geometric diagrams due to the fundamental domain shift between photographic scenes and abstract, textureless schematics. To address the absence of suitable training data, we present a fully automated procedural data engine that generates over 200,000 synthetic geometry diagrams with pixel-perfect segmentation masks and linguistically diverse referring expressions, requiring zero manual annotation. We further propose domain-specific fine-tuning of vision-language models (VLMs), demonstrating that a fine-tuned Florence-2 achieves 49% IoU and 85% Buffered IoU (BIoU), compared to <1% IoU in zero-shot settings. We introduce Buffered IoU, a geometry-aware evaluation metric that accounts for thin-structure localization, and show that it better reflects true segmentation quality than standard IoU. Our results establish a foundation for building Artificial General Teachers (AGTs) capable of providing visually grounded, step-by-step explanations of geometry problems.
中文标题/摘要
标题:通向通用人工教师:程序化几何数据生成与视觉语义结合
我们研究了几何教育中的视觉解释问题,将其视为引用图像分割(RIS)问题:给定一个图形和自然语言描述,任务是生成所指几何元素的像素级掩码。然而,现有在自然图像基准数据集如RefCOCO上训练的RIS模型在几何图形上表现极差,因为摄影场景和抽象的、无纹理的示意图之间存在根本性的领域差异。为了解决缺乏合适训练数据的问题,我们提出了一种全自动的程序化数据生成引擎,生成了超过200,000个带有像素级分割掩码和语言多样引用表达式的合成几何图形,无需人工标注。我们进一步提出针对视觉-语言模型(VLM)的领域特定微调,证明微调后的Florence-2在零样本设置下实现了49%的IoU和85%的缓冲IoU(BIoU),而未微调模型的IoU不到1%。我们引入了缓冲IoU,这是一种几何感知的评估指标,考虑了细结构定位,表明它比标准IoU更能反映真实的分割质量。我们的结果为构建能够提供视觉接地、逐步解释几何问题的通用人工教师(AGTs)奠定了基础。
Summary / 总结
This study aims to address the challenge of visual explanation in geometry education by treating it as a Referring Image Segmentation (RIS) problem. To overcome the domain shift between natural images and geometric diagrams, the authors developed a fully automated procedural data generation engine to create over 200,000 synthetic geometry diagrams with pixel-perfect segmentation masks and diverse referring expressions. They also fine-tuned vision-language models on this dataset, achieving 49% Intersection over Union (IoU) and 85% Buffered IoU (BIoU) compared to less than 1% in zero-shot settings. The study introduces Buffered IoU as a more accurate evaluation metric for geometry-aware segmentation. These findings lay the groundwork for developing Artificial General Teachers that can provide step-by-step, visually grounded explanations of geometry problems.
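The study formulates visual explanation in geometry education as a Referring Image Segmentation (RIS) problem. To address the domain gap between natural images and geometric diagrams, the authors developed a procedural data engine that generates over 200,000 synthetic geometry diagrams with pixel-level segmentation masks and diverse referring expressions. Fine-tuning vision-language models on this dataset yields 49% IoU and 85% Buffered IoU (BIoU), versus less than 1% IoU in zero-shot settings. They also introduce BIoU as a more accurate geometry-aware segmentation metric. These results lay the groundwork for Artificial General Teachers that provide step-by-step, visually grounded explanations.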
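To see why a tolerance-aware metric matters for thin structures, the sketch below compares standard IoU with a buffered variant. The abstract does not give the exact definition of Buffered IoU, so this example assumes one plausible reading: both masks are dilated by a small square buffer before IoU is computed. Masks are represented as sets of `(row, col)` pixels, and image bounds are ignored for brevity.

```python
def dilate(pixels, radius):
    """Grow a pixel set by `radius` in every direction (square neighbourhood)."""
    offsets = range(-radius, radius + 1)
    return {(r + dr, c + dc) for (r, c) in pixels for dr in offsets for dc in offsets}


def iou(a, b):
    """Standard intersection over union of two pixel sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def buffered_iou(pred, target, radius=1):
    """Assumed buffered IoU: IoU of the two masks after dilating each by `radius`."""
    return iou(dilate(pred, radius), dilate(target, radius))


# A one-pixel-wide horizontal line, and a prediction shifted down by one pixel.
target_line = {(5, c) for c in range(10)}
predicted_line = {(6, c) for c in range(10)}
```

For these two lines, standard IoU is zero even though the prediction is off by a single pixel, while the buffered variant gives partial credit. This is the kind of thin-structure behaviour that a geometry-aware metric is meant to capture.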
Progressive Video Condensation with MLLM Agent for Long-form Video Understanding
Authors: Yufei Yin, Yuchen Xing, Qianke Meng, Minghao Chen, Yan Yang, Zhou Yu
Venue: ICME 2026
First: 2026-04-03T09:00:38+00:00 · Latest: 2026-04-03T09:00:38+00:00
Comments: Accepted to ICME 2026
Abstract
Understanding long videos requires extracting query-relevant information from long sequences under tight compute budgets. Existing text-then-LLM pipelines lose fine-grained visual cues, while video-based multimodal large language models (MLLMs) can keep visual details but are too frame-hungry and computationally expensive. In this work, we aim to harness MLLMs for efficient video understanding. We propose ProVCA, a progressive video condensation agent that iteratively locates key video frames at multiple granularities. ProVCA first adopts a segment localization module to identify the video segment relevant to the query, then a snippet selection module to select important snippets based on similarity, and finally a keyframe refinement module to pinpoint specific keyframes in those snippets. By progressively narrowing the scope from coarse segments to fine frames, ProVCA identifies a small set of keyframes for MLLM-based reasoning. ProVCA achieves state-of-the-art zero-shot accuracies of 69.3% on EgoSchema, 80.5% on NExT-QA, and 77.7% on IntentQA, while using fewer frames than previous training-free methods.
中文标题/摘要
标题:利用MLLM代理进行长视频理解的渐进式视频凝练
理解长视频需要在严格的计算预算下从长序列中提取查询相关的信息。现有的文本-然后-LLM流水线会丢失细粒度的视觉线索,而基于视频的多模态大型语言模型(MLLMs)可以保留视觉细节但计算成本高昂且帧需求量大。在本文中,我们旨在利用MLLMs进行高效的视频理解。我们提出了ProVCA,这是一种渐进式视频凝练代理,能够迭代地在多个粒度级别定位关键视频帧。ProVCA首先采用一个片段定位模块来识别与查询相关的视频片段,然后采用一个片段选择模块根据相似性选择重要片段,最后采用一个关键帧细化模块在这些片段中确定特定的关键帧。通过从粗粒度片段逐步缩小到细粒度帧的范围,ProVCA能够识别出一组关键帧用于基于MLLM的推理。ProVCA在EgoSchema上实现了69.3%的零样本准确率,在NExT-QA上实现了80.5%的准确率,在IntentQA上实现了77.7%的准确率,同时使用的帧数比之前的无训练方法更少。
Summary / 总结
The research aims to efficiently understand long videos by using MLLMs while keeping visual details. ProVCA, a progressive video condensation agent, iteratively locates key frames at multiple granularities. It first identifies relevant video segments, then selects important snippets based on similarity, and finally refines keyframes. ProVCA achieves state-of-the-art zero-shot accuracies on EgoSchema, NExT-QA, and IntentQA, using fewer frames than previous methods.
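The study proposes ProVCA, a progressive video condensation agent for efficient long-video understanding. ProVCA progressively identifies key video frames at multiple granularities using a segment localization module, a snippet selection module, and a keyframe refinement module. The method achieves state-of-the-art zero-shot accuracy on EgoSchema, NExT-QA, and IntentQA while using fewer frames than previous methods.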
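The coarse-to-fine narrowing from segments to snippets to keyframes can be sketched over a list of per-frame relevance scores. In ProVCA these decisions are made by dedicated modules and an MLLM; here a precomputed score per frame (for example, a query-frame similarity) stands in for all three, and the function and parameter names are hypothetical.

```python
def chunks(indices, size):
    """Split a list of frame indices into consecutive chunks of at most `size`."""
    return [indices[i:i + size] for i in range(0, len(indices), size)]


def mean_score(scores, indices):
    """Average relevance of the given frames."""
    return sum(scores[i] for i in indices) / len(indices)


def condense(scores, segment_len, snippet_len, top_snippets):
    """Return keyframe indices chosen by segment -> snippet -> frame narrowing."""
    frames = list(range(len(scores)))
    # Stage 1: keep the single most relevant coarse segment.
    segment = max(chunks(frames, segment_len), key=lambda s: mean_score(scores, s))
    # Stage 2: keep the top-scoring snippets inside that segment.
    ranked = sorted(chunks(segment, snippet_len),
                    key=lambda s: mean_score(scores, s), reverse=True)
    # Stage 3: keep the best frame of each selected snippet.
    return sorted(max(s, key=lambda i: scores[i]) for s in ranked[:top_snippets])


example_scores = [0.1] * 8 + [0.2, 0.9, 0.3, 0.1, 0.8, 0.7, 0.2, 0.1]
keyframes = condense(example_scores, segment_len=8, snippet_len=2, top_snippets=2)
```

On the 16-frame toy input, only two frames survive for downstream reasoning, which illustrates how the frame budget shrinks at each stage.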
InstructTable: Improving Table Structure Recognition Through Instructions
Authors: Boming Chen, Zining Wang, Zhentao Guo, Jianqiang Liu, Chen Duan, Yu Gu, Kai Zhou, Pengfei Yan
First: 2026-04-03T08:44:45+00:00 · Latest: 2026-04-03T08:44:45+00:00
Comments: 2026 IEEE/CVF Conference on Computer Vision and Pattern Recognition- FINDINGS Track (CVPRF)
Abstract
Table structure recognition (TSR) holds widespread practical importance by parsing tabular images into structured representations, yet encounters significant challenges when processing complex layouts involving merged or empty cells. Traditional visual-centric models rely exclusively on visual information while lacking crucial semantic support, thereby impeding accurate structural recognition in complex scenarios. Vision-language models leverage contextual semantics to enhance comprehension; however, these approaches underemphasize the modeling of visual structural information. To address these limitations, this paper introduces InstructTable, an instruction-guided multi-stage training TSR framework. Meticulously designed table instruction pre-training directs attention toward fine-grained structural patterns, enhancing comprehension of complex tables. Complementary TSR fine-tuning preserves robust visual information modeling, maintaining high-precision table parsing across diverse scenarios. Furthermore, we introduce Table Mix Expand (TME), an innovative template-free method for synthesizing large-scale authentic tabular data. Leveraging TME, we construct the Balanced Complex Dense Synthetic Tables (BCDSTab) benchmark, comprising 900 complex table images synthesized through our method to serve as a rigorous benchmark. Extensive experiments on multiple public datasets (FinTabNet, PubTabNet, MUSTARD) and BCDSTab demonstrate that InstructTable achieves state-of-the-art performance in TSR tasks. Ablation studies further confirm the positive impact of the proposed tabular-data-specific instructions and synthetic data.
中文标题/摘要
标题:InstructTable:通过指令提高表格结构识别
表格结构识别(TSR)通过解析表格图像为结构化表示具有广泛的实用价值,但在处理包含合并或空单元格的复杂布局时面临重大挑战。传统基于视觉的模型仅依赖视觉信息,缺乏关键的语义支持,从而在复杂场景中妨碍准确的结构识别。视觉-语言模型利用上下文语义增强理解,但这些方法在建模视觉结构信息方面有所不足。为解决这些限制,本文提出了InstructTable,这是一种指令引导的多阶段训练TSR框架。精心设计的表格指令预训练引导注意力关注细粒度的结构模式,增强对复杂表格的理解。TSR微调补充了视觉信息建模,确保在各种场景中保持高精度的表格解析。此外,我们引入了Table Mix Expand(TME),这是一种无模板的大规模真实表格数据合成方法。利用TME,我们构建了Balanced Complex Dense Synthetic Tables(BCDSTab)基准,包含900张通过我们方法合成的复杂表格图像,作为严格的基准。在多个公开数据集(FinTabNet、PubTabNet、MUSTARD)和BCDSTab上的广泛实验表明,InstructTable在TSR任务中达到了最先进的性能。消融研究进一步证实了所提表格数据特定指令和合成数据的积极影响。
Summary / 总结
InstructTable is a multi-stage training framework for table structure recognition that uses instruction guidance to improve the handling of complex layouts with merged or empty cells. It combines table instruction pre-training with TSR fine-tuning to enhance semantic understanding and maintain visual information modeling. The paper introduces Table Mix Expand (TME), a template-free method for generating large-scale synthetic tabular data, and constructs the BCDSTab benchmark. Experiments show that InstructTable outperforms existing methods on multiple datasets, including FinTabNet, PubTabNet, and MUSTARD, as well as the BCDSTab benchmark.
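InstructTable is a multi-stage training framework that uses instruction guidance to improve table structure recognition on complex layouts containing merged or empty cells. It combines table instruction pre-training with TSR fine-tuning to strengthen semantic understanding while preserving visual information modeling. The paper introduces Table Mix Expand (TME), a template-free method for synthesizing large-scale tabular data, and builds the BCDSTab benchmark. Experiments show that InstructTable outperforms existing methods on multiple datasets, including FinTabNet, PubTabNet, and MUSTARD, as well as on BCDSTab.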
Diffusion Models as Dataset Distillation Priors
Authors: Duo Su, Huyu Wu, Huanran Chen, Yiming Shi, Yuzhu Wang, Xi Ye, Jun Zhu
First: 2025-10-20T11:04:09+00:00 · Latest: 2026-04-03T07:59:32+00:00
Abstract
Dataset distillation aims to synthesize compact yet informative datasets from large ones. A significant challenge in this field is achieving a trifecta of diversity, generalization, and representativeness in a single distilled dataset. Although recent generative dataset distillation methods adopt powerful diffusion models as their foundation models, the inherent representativeness prior in diffusion models is overlooked. Consequently, these approaches often necessitate the integration of external constraints to enhance data quality. To address this, we propose Diffusion As Priors (DAP), which formalizes representativeness by quantifying the similarity between synthetic and real data in feature space using a Mercer kernel. We then introduce this prior as guidance to steer the reverse diffusion process, enhancing the representativeness of distilled samples without any retraining. Extensive experiments on large-scale datasets, such as ImageNet-1K and its subsets, demonstrate that DAP outperforms state-of-the-art methods in generating high-fidelity datasets while achieving superior cross-architecture generalization. Our work not only establishes a theoretical connection between diffusion priors and the objectives of dataset distillation but also provides a practical, training-free framework for improving the quality of the distilled dataset.
中文标题/摘要
标题:扩散模型作为数据集蒸馏先验
数据集蒸馏旨在从大型数据集中合成紧凑且信息丰富的数据集。该领域的一个重大挑战是在单一蒸馏数据集中同时实现多样性、泛化能力和代表性。尽管最近的生成数据集蒸馏方法采用强大的扩散模型作为基础模型,但扩散模型固有的代表性先验被忽视了。因此,这些方法通常需要结合外部约束来提高数据质量。为了解决这个问题,我们提出了扩散作为先验(DAP),通过使用Mercer核量化合成数据和真实数据在特征空间中的相似性来正式化代表性。然后,我们将这种先验作为指导,引导逆向扩散过程,从而在无需重新训练的情况下增强蒸馏样本的代表性。在ImageNet-1K及其子集等大规模数据集上的广泛实验表明,DAP在生成高保真数据集方面优于最先进的方法,并且在跨架构泛化方面表现更优。我们的工作不仅建立了扩散先验与数据集蒸馏目标之间的理论联系,还提供了一种无需训练的实用框架,以提高蒸馏数据集的质量。
Summary / 总结
The paper addresses the challenge of synthesizing compact yet informative datasets from large ones, focusing on achieving diversity, generalization, and representativeness. It proposes Diffusion As Priors (DAP), which uses a Mercer kernel to quantify the similarity between synthetic and real data, guiding the reverse diffusion process to enhance representativeness. Experiments show that DAP outperforms existing methods in generating high-fidelity datasets with better cross-architecture generalization.
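The paper proposes Diffusion As Priors (DAP), which uses a Mercer kernel to quantify the similarity between synthetic and real data and guides the reverse diffusion process to enhance representativeness. Experiments show that DAP outperforms existing methods at generating high-quality datasets and generalizes better across architectures.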
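Kernel-based representativeness guidance can be illustrated on raw feature vectors. The sketch assumes an RBF kernel (one example of a Mercer kernel) and a plain gradient-ascent step on the mean kernel similarity; the actual method applies its guidance inside the reverse diffusion process, which is not modelled here, and the names and the `step` and `sigma` parameters are hypothetical.

```python
import math


def rbf(x, y, sigma=1.0):
    """RBF kernel k(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    squared = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared / (2.0 * sigma ** 2))


def representativeness(x, reals, sigma=1.0):
    """Mean kernel similarity between a synthetic feature and the real features."""
    return sum(rbf(x, r, sigma) for r in reals) / len(reals)


def guidance_gradient(x, reals, sigma=1.0):
    """Gradient of representativeness with respect to x."""
    grad = [0.0] * len(x)
    for r in reals:
        k = rbf(x, r, sigma)
        for i, (a, b) in enumerate(zip(x, r)):
            grad[i] += k * (b - a) / sigma ** 2
    return [g / len(reals) for g in grad]


def guided_step(x, reals, step=0.5, sigma=1.0):
    """Move x a small step in the direction that increases representativeness."""
    gradient = guidance_gradient(x, reals, sigma)
    return [a + step * g for a, g in zip(x, gradient)]
```

Because the guidance only needs a gradient of the kernel score, it can be added at sampling time without retraining the generator, which is the training-free property the abstract emphasizes.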
MMPhysVideo: Scaling Physical Plausibility in Video Generation via Joint Multimodal Modeling
Authors: Shubo Lin, Xuanyang Zhang, Wei Cheng, Weiming Hu, Gang Yu, Jin Gao
First: 2026-04-03T07:32:24+00:00 · Latest: 2026-04-03T07:32:24+00:00
Comments: Project Page: https://shubolin028.github.io/MMPhysVideo-Page
Abstract
Despite advancements in generating visually stunning content, video diffusion models (VDMs) often yield physically inconsistent results due to pixel-only reconstruction. To address this, we propose MMPhysVideo, the first framework to scale physical plausibility in video generation through joint multimodal modeling. We recast perceptual cues, specifically semantics, geometry, and spatio-temporal trajectory, into a unified pseudo-RGB format, enabling VDMs to directly capture complex physical dynamics. To mitigate cross-modal interference, we propose a Bidirectionally Controlled Teacher architecture, which utilizes parallel branches to fully decouple RGB and perception processing and adopts two zero-initialized control links to gradually learn pixel-wise consistency. For inference efficiency, the teacher's physical prior is distilled into a single-stream student model via representation alignment. Furthermore, we present MMPhysPipe, a scalable data curation and annotation pipeline tailored for constructing physics-rich multimodal datasets. MMPhysPipe employs a vision-language model (VLM) guided by a chain-of-visual-evidence rule to pinpoint physical subjects, enabling expert models to extract multi-granular perceptual information. Without additional inference costs, MMPhysVideo consistently improves physical plausibility and visual quality over advanced models across various benchmarks and achieves state-of-the-art performance compared to existing methods.
中文标题/摘要
标题:MMPhysVideo:通过联合多模态建模扩展视频生成中的物理合理性
尽管在生成视觉惊艳的内容方面取得了进展,但由于仅基于像素的重建,视频扩散模型(VDMs)往往会产生物理上不一致的结果。为了解决这一问题,我们提出了MMPhysVideo,这是第一个通过联合多模态建模扩展视频生成中物理合理性的框架。我们重新定义了感知线索,特别是语义、几何和时空轨迹,将其统一为伪RGB格式,使VDMs能够直接捕捉复杂的物理动态。为了减轻跨模态干扰,我们提出了一种双向控制教师架构,该架构利用并行分支完全解耦RGB和感知处理,并采用两个零初始化的控制链接逐步学习像素级的一致性。为了提高推理效率,教师的物理先验通过表示对齐被提炼到单流学生模型中。此外,我们还提出了MMPhysPipe,这是一种针对构建富含物理信息的多模态数据集进行扩展的数据采集和标注流水线。MMPhysPipe利用由视觉证据链规则引导的视觉语言模型(VLM)来确定物理主题,使专家模型能够提取多粒度的感知信息。无需额外的推理成本,MMPhysVideo在各种基准测试中一致地提高了物理合理性和视觉质量,并在与现有方法的比较中达到了最先进的性能。
Summary / 总结
MMPhysVideo addresses the issue of physical inconsistency in video generation by proposing a joint multimodal modeling framework. It converts perceptual cues into a unified pseudo-RGB format and uses a Bidirectionally Controlled Teacher architecture to mitigate cross-modal interference, resulting in improved physical plausibility and visual quality across various benchmarks compared to existing methods.
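MMPhysVideo addresses physical inconsistency in video generation with a joint multimodal modeling framework. It converts perceptual cues into a unified pseudo-RGB format and uses a Bidirectionally Controlled Teacher architecture to mitigate cross-modal interference, achieving better physical plausibility and visual quality than existing methods across various benchmarks.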
PaveBench: A Versatile Benchmark for Pavement Distress Perception and Interactive Vision-Language Analysis
Authors: Dexiang Li, Zhenning Che, Haijun Zhang, Dongliang Zhou, Zhao Zhang, Yahong Han
First: 2026-04-03T07:15:41+00:00 · Latest: 2026-04-03T07:15:41+00:00
Abstract
Pavement condition assessment is essential for road safety and maintenance. Existing research has made significant progress. However, most studies focus on conventional computer vision tasks such as classification, detection, and segmentation. In real-world applications, pavement inspection requires more than visual recognition. It also requires quantitative analysis, explanation, and interactive decision support. Current datasets are limited. They focus on unimodal perception. They lack support for multi-turn interaction and fact-grounded reasoning. They also do not connect perception with vision-language analysis. To address these limitations, we introduce PaveBench, a large-scale benchmark for pavement distress perception and interactive vision-language analysis on real-world highway inspection images. PaveBench supports four core tasks: classification, object detection, semantic segmentation, and vision-language question answering. It provides unified task definitions and evaluation protocols. On the visual side, PaveBench provides large-scale annotations and includes a curated hard-distractor subset for robustness evaluation. It contains a large collection of real-world pavement images. On the multimodal side, we introduce PaveVQA, a real-image question answering (QA) dataset that supports single-turn, multi-turn, and expert-corrected interactions. It covers recognition, localization, quantitative estimation, and maintenance reasoning. We evaluate several state-of-the-art methods and provide a detailed analysis. We also present a simple and effective agent-augmented visual question answering framework that integrates domain-specific models as tools alongside vision-language models. The dataset is available at: https://huggingface.co/datasets/MML-Group/PaveBench.
中文标题/摘要
标题:PaveBench:一种多功能的路面病害感知及交互式视觉-语言分析基准
路面状况评估对于道路安全和维护至关重要。现有研究取得了显著进展,但大多数研究集中在传统的计算机视觉任务,如分类、检测和分割上。在实际应用中,路面检查不仅需要视觉识别,还需要定量分析、解释和交互式决策支持。当前的数据集有限,主要关注单一模态感知,缺乏多轮交互和基于事实的推理支持,也不将感知与视觉-语言分析联系起来。为了解决这些限制,我们引入了PaveBench,这是一个针对实际高速公路检查图像的路面病害感知及交互式视觉-语言分析的大规模基准。PaveBench 支持四个核心任务:分类、对象检测、语义分割和视觉-语言问答。它提供了统一的任务定义和评估协议。在视觉方面,PaveBench 提供了大规模的注释,并包括一个精心挑选的具有挑战性的干扰子集,用于鲁棒性评估。它包含了大量的实际路面图像。在多模态方面,我们引入了PaveVQA,这是一个基于真实图像的问答(QA)数据集,支持单轮、多轮和专家校正的交互。它涵盖了识别、定位、定量估计和维护推理。我们评估了几种最先进的方法,并进行了详细的分析。我们还提出了一种简单而有效的基于代理的视觉问答框架,该框架将领域特定模型作为工具与视觉-语言模型集成。数据集可在以下网址获取:https://huggingface.co/datasets/MML-Group/PaveBench。
Summary / 总结
PaveBench is a new benchmark for pavement distress perception and interactive vision-language analysis that addresses the limitations of existing datasets. It supports four core tasks (classification, object detection, semantic segmentation, and vision-language question answering) with large-scale annotations, unified task definitions, and unified evaluation protocols. The dataset includes real-world pavement images and a curated hard-distractor subset for robustness evaluation. It also introduces PaveVQA, a real-image question answering dataset that supports single-turn, multi-turn, and expert-corrected interactions and covers recognition, localization, quantitative estimation, and maintenance reasoning. The authors evaluate several state-of-the-art methods and present a simple agent-augmented framework that integrates domain-specific models as tools alongside vision-language models. The dataset is publicly available.
Boosting Document Parsing Efficiency and Performance with Coarse-to-Fine Visual Processing
Authors: Cheng Cui, Ting Sun, Suyin Liang, Tingquan Gao, Zelun Zhang, Jiaxuan Liu, Xueqing Wang, Changda Zhou, Hongen Liu, Manhui Lin, Yue Zhang, Yubo Zhang, Jing Zhang, Jun Zhang, Xing Wei, Yi Liu, Dianhai Yu, Yanjun Ma
First: 2026-03-25T14:08:56+00:00 · Latest: 2026-04-03T06:49:01+00:00
Comments: Accepted by CVPR2026
Abstract
Document parsing is a fine-grained task where image resolution significantly impacts performance. While advanced research leveraging vision-language models benefits from high-resolution input to boost model performance, this often leads to a quadratic increase in the number of vision tokens and significantly raises computational costs. We attribute this inefficiency to substantial redundancy in the visual regions of document images, such as background. To tackle this, we propose PaddleOCR-VL, a novel coarse-to-fine architecture that focuses on semantically relevant regions while suppressing redundant ones, thereby improving both efficiency and performance. Specifically, we introduce a lightweight Valid Region Focus Module (VRFM) which leverages localization and contextual relationship prediction capabilities to identify valid vision tokens. Subsequently, we design and train a compact yet powerful 0.9B vision-language model (PaddleOCR-VL-0.9B) to perform detailed recognition, guided by VRFM outputs to avoid direct processing of the entire large image. Extensive experiments demonstrate that PaddleOCR-VL achieves state-of-the-art performance in both page-level parsing and element-level recognition. It significantly outperforms existing solutions, exhibits strong competitiveness against top-tier VLMs, and delivers fast inference while utilizing substantially fewer vision tokens and parameters, highlighting the effectiveness of targeted coarse-to-fine parsing for accurate and efficient document understanding. The source code and models are publicly available at https://github.com/PaddlePaddle/PaddleOCR.
中文标题/摘要
标题:利用粗到细视觉处理提升文档解析效率和性能
文档解析是一项精细的任务,图像分辨率对性能有重大影响。虽然利用视觉-语言模型的先进研究可以从高分辨率输入中受益,从而提升模型性能,但这通常会导致视觉标记数量的平方级增长,并显著增加计算成本。我们归因于这种低效率是由于文档图像中大量冗余的视觉区域,如背景。为了解决这个问题,我们提出了PaddleOCR-VL,这是一种新颖的粗到细架构,专注于语义相关区域,同时抑制冗余区域,从而提高效率和性能。具体来说,我们引入了一个轻量级的有效区域聚焦模块(VRFM),利用定位和上下文关系预测能力来识别有效的视觉标记。随后,我们设计并训练了一个紧凑而强大的0.9B视觉-语言模型(PaddleOCR-VL-0.9B),在VRFM输出的引导下进行详细识别,避免直接处理整个大图像。广泛的实验表明,PaddleOCR-VL 在页面级解析和元素级识别方面均达到了最先进的性能。它显著优于现有解决方案,表现出与顶级视觉语言模型的强大竞争力,并提供快速推理,同时使用大量较少的视觉标记和参数,突显了针对粗到细解析的准确和高效文档理解的有效性。源代码和模型可在https://github.com/PaddlePaddle/PaddleOCR/ 公开获取。
Summary / 总结
The research aims to improve the efficiency and performance of document parsing by addressing the issue of high-resolution input leading to increased computational costs. The proposed PaddleOCR-VL uses a coarse-to-fine architecture with a Valid Region Focus Module (VRFM) to identify and process only semantically relevant regions, reducing the number of vision tokens and parameters. Experiments show that PaddleOCR-VL outperforms existing solutions in both page-level parsing and element-level recognition, while maintaining fast inference speeds and lower computational costs.
研究旨在通过解决高分辨率输入导致的高计算成本问题,提升文档解析的效率和性能。提出的PaddleOCR-VL采用粗细粒度架构,结合Valid Region Focus Module (VRFM)来识别并处理仅具有语义相关性的区域,减少视觉令牌和参数的数量。实验表明,PaddleOCR-VL在页面级解析和元素级识别方面均达到最先进的性能,超越现有解决方案,并在与顶级视觉语言模型的竞争中表现出强大的竞争力,同时保持快速推理速度。
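Illustrative note (not from the paper): the abstract's point about quadratic token growth can be checked with simple arithmetic. The sketch below assumes a ViT-style encoder that splits the image into square patches; the patch size of 14 and the `valid_mask` grid are hypothetical stand-ins, not PaddleOCR-VL's actual configuration or VRFM output.

```python
def vision_tokens(height: int, width: int, patch: int = 14) -> int:
    """Number of patch tokens for an image (ceiling division per side).

    The patch size of 14 is an assumption for illustration only.
    """
    rows = -(-height // patch)
    cols = -(-width // patch)
    return rows * cols


def kept_tokens(valid_mask) -> int:
    """Count tokens that remain when only patches flagged as valid are processed."""
    return sum(1 for row in valid_mask for flag in row if flag)


base = vision_tokens(896, 896)       # 64 x 64 patches
doubled = vision_tokens(1792, 1792)  # 128 x 128 patches
print(base, doubled, doubled // base)

# Hypothetical 4 x 4 patch grid where only a text block in the top middle is valid.
valid_mask = [
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
print(kept_tokens(valid_mask), "of", len(valid_mask) * len(valid_mask[0]))
```

Doubling each side of the image quadruples the token count, which is why discarding background patches before recognition can pay off on high-resolution pages.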
EnsemHalDet: Robust VLM Hallucination Detection via Ensemble of Internal State Detectors
Authors: Ryuhei Miyazato, Shunsuke Kitada, Kei Harada
First: 2026-04-03T06:48:27+00:00 · Latest: 2026-04-03T06:48:27+00:00
Abstract
Vision-Language Models (VLMs) excel at multimodal tasks, but they remain vulnerable to hallucinations that are factually incorrect or ungrounded in the input image. Recent work suggests that hallucination detection using internal representations is more efficient and accurate than approaches that rely solely on model outputs. However, existing internal-representation-based methods typically rely on a single representation or detector, limiting their ability to capture diverse hallucination signals. In this paper, we propose EnsemHalDet, an ensemble-based hallucination detection framework that leverages multiple internal representations of VLMs, including attention outputs and hidden states. EnsemHalDet trains independent detectors for each representation and combines them through ensemble learning. Experimental results across multiple VQA datasets and VLMs show that EnsemHalDet consistently outperforms prior methods and single-detector models in terms of AUC. These results demonstrate that ensembling diverse internal signals significantly improves robustness in multimodal hallucination detection.
中文标题/摘要
标题:EnsemHalDet:基于内部状态检测器集成的鲁棒视觉-语言模型幻觉检测
视觉-语言模型(VLMs)在多模态任务中表现出色,但它们仍然容易受到与输入图像事实不符或与输入图像无关的幻觉的影响。近期研究表明,使用内部表示进行幻觉检测比仅依赖模型输出的方法更高效和准确。然而,现有的基于内部表示的方法通常依赖单一的表示或检测器,限制了它们捕捉多样化幻觉信号的能力。在本文中,我们提出了一种基于集成的幻觉检测框架EnsemHalDet,该框架利用了VLMs的多种内部表示,包括注意力输出和隐藏状态。EnsemHalDet为每种表示训练独立的检测器,并通过集成学习将它们结合起来。在多个VQA数据集和VLM上的实验结果表明,EnsemHalDet在AUC方面始终优于先前的方法和单一检测器模型。这些结果表明,集成多样化的内部信号显著提高了多模态幻觉检测的鲁棒性。
Summary / 总结
EnsemHalDet is a robust hallucination detection framework for Vision-Language Models (VLMs) that uses an ensemble of internal state detectors to improve accuracy and robustness. It leverages multiple internal representations such as attention outputs and hidden states, trains independent detectors for each, and combines them through ensemble learning. Experiments across various VQA datasets and VLMs show that EnsemHalDet outperforms previous methods and single-detector models in terms of AUC, highlighting the effectiveness of ensembling diverse internal signals.
研究旨在通过利用多种内部表示来提高Vision-Language模型(VLM)中幻觉检测的鲁棒性。EnsemHalDet是一种基于集成的框架,使用VLM的注意力输出和隐藏状态来训练独立的检测器,并通过集成学习进行组合。研究结果显示,EnsemHalDet在多个VQA数据集和VLM上的AUC指标上优于先前的方法和单一检测器模型,表明结合多样化的内部信号可以增强多模态幻觉检测的鲁棒性。
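Illustrative note (not from the paper): a minimal sketch of score-level ensembling evaluated with AUC. The abstract does not say how the detectors are combined, so plain averaging is an assumption here, and the scores and labels are invented toy data chosen so that the two detectors make different mistakes.

```python
from statistics import mean


def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def ensemble_scores(per_detector_scores):
    """Average the hallucination scores of all detectors, sample by sample (assumed rule)."""
    return [mean(sample) for sample in zip(*per_detector_scores)]


# Hypothetical outputs: label 1 marks a hallucinated answer.
labels = [1, 1, 1, 0, 0, 0]
attn_scores = [0.9, 0.4, 0.7, 0.5, 0.2, 0.1]    # detector on attention outputs
hidden_scores = [0.6, 0.8, 0.3, 0.2, 0.5, 0.1]  # detector on hidden states

combined = ensemble_scores([attn_scores, hidden_scores])
print(auc(attn_scores, labels), auc(hidden_scores, labels), auc(combined, labels))
```

On this toy data each detector misranks one pair, while the averaged score separates the classes, which is the intuition behind combining diverse internal signals.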
DeCo-DETR: Decoupled Cognition DETR for efficient Open-Vocabulary Object Detection
Authors: Siheng Wang, Yanshu Li, Bohan Hu, Zhengdao Li, Haibo Zhan, Linshan Li, Weiming Liu, Ruizhi Qian, Guangxin Wu, Hao Zhang, Jifeng Shen, Piotr Koniusz, Zhengtao Yao, Junhao Dong, Qiang Sun
Venue: ICLR 2026
First: 2026-04-03T05:56:29+00:00 · Latest: 2026-04-03T05:56:29+00:00
Comments: Accepted at ICLR 2026
Abstract
Open-vocabulary Object Detection (OVOD) enables models to recognize objects beyond predefined categories, but existing approaches remain limited in practical deployment. On the one hand, multimodal designs often incur substantial computational overhead due to their reliance on text encoders at inference time. On the other hand, tightly coupled training objectives introduce a trade-off between closed-set detection accuracy and open-world generalization. Thus, we propose Decoupled Cognition DETR (DeCo-DETR), a vision-centric framework that addresses these challenges through a unified decoupling paradigm. Instead of depending on online text encoding, DeCo-DETR constructs a hierarchical semantic prototype space from region-level descriptions generated by pre-trained LVLMs and aligned via CLIP, enabling efficient and reusable semantic representation. Building upon this representation, the framework further disentangles semantic reasoning from localization through a decoupled training strategy, which separates alignment and detection into parallel optimization streams. Extensive experiments on standard OVOD benchmarks demonstrate that DeCo-DETR achieves competitive zero-shot detection performance while significantly improving inference efficiency. These results highlight the effectiveness of decoupling semantic cognition from detection, offering a practical direction for scalable OVOD systems.
中文标题/摘要
标题:DeCo-DETR:解耦认知DETR用于高效的开放词汇目标检测
开放词汇目标检测(OVOD)使模型能够识别超出预定义类别的对象,但现有方法在实际部署中仍受到限制。一方面,多模态设计往往由于依赖于推理时的文本编码而产生巨大的计算开销。另一方面,紧密耦合的训练目标在封闭集检测准确性和开放世界泛化之间引入了权衡。因此,我们提出了解耦认知DETR(DeCo-DETR),这是一种以视觉为中心的框架,通过统一的解耦范式来解决这些挑战。DeCo-DETR 不依赖于在线文本编码,而是从预训练的LVLM生成的区域级描述中构建层次语义原型空间,并通过CLIP对齐,从而实现高效且可重用的语义表示。在此基础上,该框架进一步通过解耦训练策略将语义推理与定位分离,将对齐和检测分离为并行优化流。在标准OVOD基准上的广泛实验表明,DeCo-DETR 在实现竞争力的零样本检测性能的同时,显著提高了推理效率。这些结果突显了将语义认知与检测解耦的有效性,为可扩展的OVOD系统提供了实用的方向。
Summary / 总结
DeCo-DETR is designed to address the challenges of Open-vocabulary Object Detection by decoupling semantic cognition from detection. It constructs a hierarchical semantic prototype space using region-level descriptions from pre-trained LVLMs and aligns them via CLIP, reducing computational overhead. The framework further disentangles semantic reasoning from localization through a decoupled training strategy, improving inference efficiency while maintaining competitive zero-shot detection performance.
DeCo-DETR 通过将语义认知与检测解耦来提高开放词汇对象检测(OVOD)的效率和实用性。它使用预训练的语言-视觉模型生成的区域级描述构建层次语义原型空间,并通过CLIP进行对齐,避免了在线文本编码的需求。这种方法实现了高效的语义表示并提高了推理效率。实验表明,DeCo-DETR 在保持零样本检测性能的同时,显著提升了效率,优于现有方法。
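Illustrative note (not from the paper): the idea of replacing online text encoding with a reusable prototype space can be sketched as a nearest-prototype lookup. The flat dictionary, the three-dimensional vectors, and the function names below are hypothetical; the paper describes a hierarchical prototype space whose construction is not detailed in the abstract.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def classify(region_embedding, prototypes):
    """Return the label whose cached prototype is most similar to the region embedding."""
    return max(prototypes, key=lambda label: cosine(region_embedding, prototypes[label]))


# Hypothetical prototypes computed once offline, so no text encoder runs at inference time.
prototypes = {
    "cat": [1.0, 0.0, 0.0],
    "dog": [0.0, 1.0, 0.0],
    "bicycle": [0.0, 0.0, 1.0],
}

print(classify([0.9, 0.2, 0.1], prototypes))
```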
THOM: Generating Physically Plausible Hand-Object Meshes From Text
Authors: Uyoung Jeong, Yihalem Yimolal Tiruneh, Hyung Jin Chang, Seungryul Baek, Kwang In Kim
Venue: CVPR
First: 2026-04-03T05:17:12+00:00 · Latest: 2026-04-03T05:17:12+00:00
Comments: accepted to CVPR Findings 2026
Abstract
The generation of 3D hand-object interactions (HOIs) from text is crucial for dexterous robotic grasping and VR/AR content generation, requiring both high visual fidelity and physical plausibility. However, extracting meshes from text-generated Gaussians is an ill-posed problem, and physics-based optimization on the resulting erroneous meshes poses further challenges. To address these issues, we introduce THOM, a training-free framework that generates photorealistic, physically plausible 3D HOI meshes without the need for a template object mesh. THOM employs a two-stage pipeline, initially generating the hand and object Gaussians, followed by physics-based HOI optimization. Our new mesh extraction method and vertex-to-Gaussian mapping explicitly assign Gaussian elements to mesh vertices, allowing topology-aware regularization. Furthermore, we improve the physical plausibility of interactions by VLM-guided translation refinement and contact-aware optimization. Comprehensive experiments demonstrate that THOM consistently surpasses state-of-the-art methods in terms of text alignment, visual realism, and interaction plausibility.
中文标题/摘要
标题:THOM:从文本生成物理上可信的手-物体网格
从文本生成3D手-物体交互(HOIs)对于灵巧的机器人抓取和VR/AR内容生成至关重要,需要高度的视觉真实性和物理可信度。然而,从文本生成的高斯分布中提取网格以及在错误网格上进行物理优化的问题是不明确的。为了解决这些问题,我们引入了THOM,这是一种无需训练的框架,可以在无需模板物体网格的情况下生成逼真的、物理上可信的3D HOI网格。THOM采用两阶段管道,首先生成手和物体的高斯分布,然后进行基于物理的HOI优化。我们提出了一种新的网格提取方法和顶点到高斯的映射,明确地将高斯元素分配给网格顶点,允许拓扑感知正则化。此外,我们通过VLM指导的翻译细化和接触感知优化来提高交互的物理可信度。全面的实验表明,THOM在文本对齐、视觉真实性和交互可信度方面始终优于最先进的方法。
Summary / 总结
THOM is a training-free framework that generates photorealistic and physically plausible 3D hand-object interaction meshes from text. It uses a two-stage pipeline to first generate hand and object Gaussians and then optimize the HOI using physics-based methods. THOM improves physical plausibility through VLM-guided translation refinement and contact-aware optimization. Experiments show that THOM outperforms existing methods in text alignment, visual realism, and interaction plausibility.
THOM 是一个无需训练的框架,可以从文本生成逼真的、物理上可信的 3D 手物交互网格。它使用两阶段管道首先生成手和物体的高斯分布,然后使用基于物理的方法进行优化。THOM 通过 VLM 引导的平移细化和接触感知优化来提高物理可信度。实验表明,THOM 在文本对齐、视觉真实性和交互可信度方面优于现有方法。
Attention at Rest Stays at Rest: Breaking Visual Inertia for Cognitive Hallucination Mitigation
Authors: Boyang Gong, Yu Zheng, Fanye Kong, Jie Zhou, Jiwen Lu
First: 2026-04-02T12:51:07+00:00 · Latest: 2026-04-03T05:00:08+00:00
Abstract
Like a body at rest that stays at rest, we find that visual attention in multimodal large language models (MLLMs) exhibits pronounced inertia, remaining largely static once settled during early decoding steps and failing to support the compositional understanding required for cognitive inference. While existing hallucination mitigation methods mainly target perceptual hallucinations concerning object existence or attributes, they remain inadequate for such cognitive hallucinations that require inter-object relational deduction. Through token-wise attention analysis, we identify this visual inertia as a key factor: attention to semantically critical regions remains persistently focused and fails to dynamically support relational inference. We thereby propose a training-free Inertia-aware Visual Excitation (IVE) method that breaks this inertial pattern by modeling cognitive inference as the dynamic responsiveness of visual attention. Specifically, IVE selects visual tokens that are dynamically emerging relative to historical attention trends while distinguishing tokens exhibiting inertial behavior. To further facilitate compositional inference, IVE introduces an inertia-aware penalty that discourages over-concentration and limits the persistence of attention within localized regions. Extensive experiments show that IVE is effective across various base MLLMs and multiple hallucination benchmarks, particularly for cognitive hallucinations.
中文标题/摘要
标题:注意力静止则保持静止:打破视觉惯性以减轻认知幻觉
如同静止的物体保持静止,我们发现多模态大型语言模型(MLLMs)中的视觉注意力表现出明显的惯性,在早期解码步骤中一旦稳定下来就基本保持不变,无法支持认知推理所需的组合理解。现有的幻觉缓解方法主要针对与物体存在或属性相关的感知幻觉,但对于需要物体间关系推理的认知幻觉则显得力不从心。通过词元级别的注意力分析,我们发现这种视觉惯性是关键因素:对语义关键区域的注意力保持持续聚焦,无法动态支持关系推理。因此,我们提出了一种无需训练的感知意识视觉激发(IVE)方法,通过将认知推理建模为视觉注意力的动态响应来打破这种惯性模式。具体而言,IVE 选择相对于历史注意力趋势动态出现的视觉词元,同时区分表现出惯性行为的词元。为了进一步促进组合推理,IVE 引入了一种感知意识惩罚,以防止过度集中并限制注意力在局部区域内的持久性。广泛的实验表明,IVE 在各种基础 MLLMs 和多个幻觉基准测试中都有效,特别是在处理认知幻觉方面。
Summary / 总结
The paper addresses the issue of visual inertia in multimodal large language models (MLLMs), where attention remains static and fails to support relational inference, leading to cognitive hallucinations. It introduces an Inertia-aware Visual Excitation (IVE) method that models cognitive inference as dynamic responsiveness of visual attention, selecting tokens that are dynamically emerging and discouraging over-concentration to mitigate hallucinations. Experiments demonstrate IVE's effectiveness across different MLLMs and hallucination benchmarks, especially for cognitive hallucinations.
研究关注多模态大型语言模型(MLLMs)中视觉惯性问题,即注意力保持静态,无法支持需要组成性推理的认知推理。提出了一种名为Inertia-aware Visual Excitation (IVE)的方法,通过动态调整注意力打破惯性模式。实验表明IVE在各种MLLMs和幻觉基准测试中有效,尤其是在认知幻觉方面。
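Illustrative note (not from the paper): one way to read "dynamically emerging relative to historical attention trends" is to compare each visual token's current attention with a running average over earlier decoding steps. The exponential moving average, the `decay` and `ratio` values, and both function names are assumptions made for this sketch and are not IVE's actual selection rule or penalty.

```python
def update_trend(trend, attention, decay=0.8):
    """Exponential moving average of per-token attention across decoding steps (assumed)."""
    return [decay * t + (1 - decay) * a for t, a in zip(trend, attention)]


def emerging_tokens(trend, attention, ratio=1.5):
    """Indices of tokens whose current attention exceeds their historical trend by `ratio`."""
    return [i for i, (t, a) in enumerate(zip(trend, attention)) if a > ratio * t]


# Hypothetical attention over three visual tokens.
trend = [0.7, 0.2, 0.1]      # token 0 has dominated so far (inertial behavior)
attention = [0.6, 0.1, 0.3]  # token 2 is gaining attention at the current step

print(emerging_tokens(trend, attention))
print(update_trend(trend, attention))
```

In this toy case token 0 keeps most of the attention mass but is not flagged, because it is not rising relative to its own history; only token 2 is treated as emerging.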
Video Understanding: Through A Temporal Lens
Authors: Thong Thanh Nguyen
First: 2026-01-31T12:01:09+00:00 · Latest: 2026-04-03T01:32:26+00:00
Comments: PhD Thesis, NUS, 2025
Abstract
This thesis explores the central question of how to leverage temporal relations among video elements to advance video understanding. Addressing the limitations of existing methods, the work presents a five-fold contribution: (1) an automatic annotation framework that utilizes large vision-language models and a noise-robust contrastive learning objective with a subtractive angular margin; (2) a parameter-efficient fine-tuning strategy using "recurrent adapters" to capture temporal dynamics in low-data regimes; (3) the integration of State Space Layers (SSL) for efficient long-form video modeling, supported by the introduction of two new long-term benchmarks for egocentric and feature-length content; (4) a novel contrastive learning framework designed to explicitly model fine-grained relations between motions and video moments; and (5) a comprehensive empirical study on Large Vision-Language Models (LVLMs) that identifies the visual-language interface as a bottleneck for temporal reasoning, leading to a new "temporal-oriented recipe" for upscaled video understanding. Collectively, these contributions demonstrate that explicit temporal modeling significantly enhances a model's ability to represent and reason about the fluid nature of video content.
中文标题/摘要
标题:视频理解:通过时间视角
该论文探讨了如何利用视频元素之间的时序关系来推进视频理解。针对现有方法的局限性,该研究提出了五个方面的贡献:(1) 一种自动注释框架,利用大规模视觉-语言模型和具有减量角余量的鲁棒对比学习目标;(2) 一种参数高效的微调策略,使用“递归适配器”来捕捉低数据环境下的时序动态;(3) 将状态空间层(SSL)集成到高效长视频建模中,通过引入两种新的长期基准来支持自观和特征长度内容;(4) 一种新颖的对比学习框架,旨在明确建模动作与视频时刻之间的细粒度关系;(5) 对大规模视觉-语言模型(LVLM)的全面实证研究,发现视觉-语言接口是时间推理的瓶颈,并提出了一种新的“时间导向食谱”以提升视频理解。这些贡献共同表明,显式的时间建模显著增强了模型对视频内容流动性的表示和推理能力。
Summary / 总结
This thesis aims to improve video understanding by leveraging temporal relations among video elements. It introduces an automatic annotation framework using large vision-language models and a noise-robust contrastive learning objective. The work also presents a parameter-efficient fine-tuning strategy with recurrent adapters for capturing temporal dynamics in low-data regimes. Additionally, it integrates State Space Layers for efficient long-form video modeling, introduces two new long-term benchmarks, and develops a novel contrastive learning framework to model fine-grained relations between motions and video moments. An empirical study of Large Vision-Language Models identifies the visual-language interface as a bottleneck for temporal reasoning and motivates a temporal-oriented training recipe. Together, the contributions indicate that explicit temporal modeling enhances a model's ability to represent and reason about video content.
该论文旨在通过利用视频元素之间的时序关系来提升视频理解能力。它引入了一个使用大型视觉-语言模型的自动注释框架,并采用了一种噪声鲁棒的对比学习目标。此外,还提出了一种参数高效的微调策略‘递归适配器’,并整合了状态空间层以实现高效长视频建模。关键发现包括视觉-语言接口是时间推理的瓶颈,并开发了一种‘时间导向的食谱’以增强视频理解。这些贡献共同表明,显式的时间建模显著提高了模型对视频内容的表示和推理能力。
OSCAR: Orchestrated Self-verification and Cross-path Refinement
Authors: Yash Shah, Abhijit Chakraborty, Naresh Kumar Devulapally, Vishnu Lokhande, Vivek Gupta
First: 2026-04-02T05:02:22+00:00 · Latest: 2026-04-03T01:28:15+00:00
Abstract
Diffusion language models (DLMs) expose their denoising trajectories, offering a natural handle for inference-time control; accordingly, an ideal hallucination mitigation framework should intervene during generation using this model-native signal rather than relying on an externally trained hallucination classifier. Toward this, we formulate commitment uncertainty localization: given a denoising trajectory, identify token positions whose cross-chain entropy exceeds an unsupervised threshold before factually unreliable commitments propagate into self-consistent but incorrect outputs. We introduce a suite of trajectory-level assessments, including a cross-chain divergence-at-hallucination (CDH) metric, for principled comparison of localization methods. We also introduce OSCAR, a training-free inference-time framework operationalizing this formulation. OSCAR runs N parallel denoising chains with randomized reveal orders, computes cross-chain Shannon entropy to detect high-uncertainty positions, and then performs targeted remasking conditioned on retrieved evidence. Ablations confirm that localization and correction contribute complementary gains, robust across N in {4, 8, 16}. On TriviaQA, HotpotQA, RAGTruth, and CommonsenseQA using LLaDA-8B and Dream-7B, OSCAR enhances generation quality by significantly reducing hallucinated content and improving factual accuracy through uncertainty-guided remasking, which also facilitates more effective integration of retrieved evidence. Its native entropy-based uncertainty signal surpasses that of specialized trained detectors, highlighting an inherent capacity of diffusion language models to identify factual uncertainty that is not present in the sequential token commitment structure of autoregressive models.
中文标题/摘要
标题:OSCAR:协调自我验证和跨路径校准
扩散语言模型(DLMs)暴露了它们的去噪轨迹,提供了一种自然的推理时控制手段;因此,理想的幻觉缓解框架应该利用这种模型固有的信号在生成过程中进行干预,而不是依赖于外部训练的幻觉分类器。为此,我们提出了承诺不确定性定位:给定一个去噪轨迹,识别出跨链熵超过无监督阈值的标记位置,以防止事实不可靠的承诺传播到自洽但错误的输出中。我们引入了一系列轨迹级别的评估,包括跨链幻觉时的发散度(CDH)度量,用于原则性地比较定位方法。我们还引入了OSCAR,这是一种无需训练的推理时框架,实现了这一形式化。OSCAR运行N条并行的去噪链,以随机揭示顺序计算跨链香农熵以检测高不确定性位置,然后根据检索到的证据进行有针对性的遮盖。消融实验表明,定位和修正提供了互补的增益,且在N取{4, 8, 16}时表现稳健。在使用LLaDA-8B和Dream-7B的TriviaQA、HotpotQA、RAGTruth和CommonsenseQA上,OSCAR通过不确定性引导的遮盖显著减少了生成的幻觉内容,提高了事实准确性,从而促进了检索证据的更有效整合。其固有的基于熵的不确定性信号超过了专门训练的检测器,突显了扩散语言模型识别事实不确定性的能力,而这种能力在自回归模型的顺序标记承诺结构中是不存在的。
Summary / 总结
The research aims to mitigate hallucinations in diffusion language models by leveraging their denoising trajectories. The method involves identifying token positions with high cross-chain entropy to prevent unreliable commitments and introduces a cross-chain divergence-at-hallucination (CDH) metric for comparison. OSCAR, a training-free framework, runs parallel denoising chains, detects high-uncertainty positions, and performs targeted remasking. Experiments on TriviaQA, HotpotQA, RAGTruth, and CommonsenseQA show that OSCAR reduces hallucinated content and improves factual accuracy, outperforming specialized detectors in identifying factual uncertainty.
研究旨在通过利用扩散语言模型的去噪轨迹来减轻幻觉。方法包括识别具有高跨链熵的令牌位置以防止不可靠的承诺,并引入跨链幻觉离散度(CDH)指标进行比较。OSCAR是一种无需训练的框架,运行并行去噪链,检测高不确定性位置,并进行基于检索证据的目标化遮盖。实验表明,OSCAR减少了幻觉内容并提高了事实准确性,其基于熵的事实不确定性信号优于专门训练的检测器。
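Illustrative note (not from the paper): the localization step described above, computing cross-chain Shannon entropy per token position and flagging positions above a threshold, can be sketched as follows. The example chains and the fixed threshold of 1.0 bit are hypothetical; the paper uses an unsupervised threshold whose derivation is not given in the abstract.

```python
import math
from collections import Counter


def position_entropy(tokens):
    """Shannon entropy in bits of the tokens proposed at one position across chains."""
    counts = Counter(tokens)
    total = len(tokens)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def uncertain_positions(chains, threshold):
    """Positions whose cross-chain entropy exceeds the threshold (candidates for remasking)."""
    return [
        i for i, column in enumerate(zip(*chains))
        if position_entropy(column) > threshold
    ]


# Hypothetical outputs of N = 4 parallel denoising chains for the same prompt.
chains = [
    ["Paris", "is", "in", "France"],
    ["Paris", "is", "in", "France"],
    ["Lyon", "is", "in", "France"],
    ["Rome", "is", "in", "Italy"],
]

print(uncertain_positions(chains, threshold=1.0))
```

Position 0 has entropy 1.5 bits because the chains disagree three ways, while position 3 stays below the threshold at roughly 0.81 bits, so only the first token would be remasked in this sketch.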
Moondream Segmentation: From Words to Masks
Authors: Ethan Reid
First: 2026-04-03T00:09:14+00:00 · Latest: 2026-04-03T00:09:14+00:00
Comments: Demo: https://moondream.ai/me/playground
Abstract
We present Moondream Segmentation, a referring image segmentation extension of Moondream 3, a vision-language model. Given an image and a referring expression, the model autoregressively decodes a vector path and iteratively refines the rasterized mask into a final detailed mask. We introduce a reinforcement learning stage that resolves ambiguity in the supervised signal by directly optimizing mask quality. Rollouts from this stage produce coarse-to-ground-truth targets for the refiner. To mitigate evaluation noise from polygon annotations, we release RefCOCO-M, a cleaned RefCOCO validation split with boundary-accurate masks. Moondream Segmentation achieves a cIoU of 80.2% on RefCOCO (val) and 62.6% mIoU on LVIS (val).
中文标题/摘要
标题:Moondream 分割:从文字到掩码
我们介绍了 Moondream 分割,这是 Moondream 3 的一个视觉语言模型的图像引用分割扩展。给定一张图片和一个引用表达式,模型自回归地解码一个向量路径,并迭代地细化矢量化掩码以生成最终的详细掩码。我们引入了一种强化学习阶段,通过直接优化掩码质量来解决监督信号中的歧义。该阶段的展开生成了细化器的粗到精确的目标。为了减轻多边形注释带来的评估噪声,我们发布了 RefCOCO-M,这是一个带有边界准确掩码的清洁版 RefCOCO 验证集。Moondream 分割在 RefCOCO(验证集)上达到了 cIoU 80.2%,在 LVIS(验证集)上达到了 62.6% 的 mIoU。
Summary / 总结
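Moondream Segmentation is an extension of Moondream 3, a vision-language model, for referring image segmentation. It autoregressively decodes a vector path and iteratively refines the rasterized mask into a detailed final mask. A reinforcement learning stage resolves ambiguity in the supervised signal by directly optimizing mask quality, and rollouts from this stage provide coarse-to-ground-truth targets for the refiner. The authors also release RefCOCO-M, a cleaned RefCOCO validation split with boundary-accurate masks. The model achieves a cIoU of 80.2% on RefCOCO (val) and 62.6% mIoU on LVIS (val).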
Moondream Segmentation 是 Moondream 3 视觉语言模型的扩展,用于参考图像分割。它自回归地解码向量路径并逐步细化为详细的掩码。通过生成粗到精细的目标,强化学习阶段直接优化掩码质量。该模型在 RefCOCO (val) 上达到 cIoU 80.2%,在 LVIS (val) 上达到 62.6% 的 mIoU。
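Illustrative note (not from the paper): the two reported metrics aggregate differently. As commonly defined in referring segmentation, cIoU divides the summed intersection by the summed union over all samples, while mIoU averages per-sample IoU; the abstract does not spell out its exact definitions, so treat this as the conventional reading. Masks are flat lists of 0/1 and the data is invented.

```python
def inter_union(pred, gt):
    """Pixel counts of intersection and union for two flat binary masks."""
    inter = sum(1 for p, g in zip(pred, gt) if p and g)
    union = sum(1 for p, g in zip(pred, gt) if p or g)
    return inter, union


def miou(pairs):
    """Mean of per-sample IoU; every sample counts equally regardless of object size."""
    ious = []
    for pred, gt in pairs:
        inter, union = inter_union(pred, gt)
        ious.append(inter / union if union else 1.0)
    return sum(ious) / len(ious)


def ciou(pairs):
    """Cumulative IoU: total intersection over total union, so large objects dominate."""
    total_inter = total_union = 0
    for pred, gt in pairs:
        inter, union = inter_union(pred, gt)
        total_inter += inter
        total_union += union
    return total_inter / total_union if total_union else 1.0


# Hypothetical samples: a missed small object and a perfectly segmented large one.
small = ([1, 0, 0, 0, 0, 0, 0, 0], [0, 1, 0, 0, 0, 0, 0, 0])
large = ([1] * 8, [1] * 8)

print(miou([small, large]), ciou([small, large]))
```

The missed small object halves mIoU but costs cIoU much less, which is why the two numbers are not directly comparable across datasets.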
WSVD: Weighted Low-Rank Approximation for Fast and Efficient Execution of Low-Precision Vision-Language Models
Authors: Haiyu Wang, Yutong Wang, Jack Jiang, Sai Qian Zhang
First: 2026-04-02T22:49:57+00:00 · Latest: 2026-04-02T22:49:57+00:00
Abstract
Singular Value Decomposition (SVD) has become an important technique for reducing the computational burden of Vision Language Models (VLMs), which play a central role in tasks such as image captioning and visual question answering. Although multiple prior works have proposed efficient SVD variants to enable low-rank operations, we find that in practice it remains difficult to achieve substantial latency reduction during model execution. To address this limitation, we introduce a new computational pattern and apply SVD at a finer granularity, enabling real and measurable improvements in execution latency. Furthermore, recognizing that weight elements differ in their relative importance, we adaptively allocate relative importance to each element during the SVD process to better preserve accuracy, then extend this framework with quantization applied to both weights and activations, resulting in a highly efficient VLM. Collectively, we introduce Weighted SVD (WSVD), which outperforms other approaches by achieving over 1.8× decoding speedup while preserving accuracy. We open source our code at: https://github.com/SAI-Lab-NYU/WSVD
中文标题/摘要
标题:WSVD:加权低秩逼近以实现低精度视觉语言模型快速高效执行
奇异值分解(SVD)已成为减少视觉语言模型(VLMs)计算负担的重要技术,VLMs 在图像字幕和视觉问答等任务中发挥着核心作用。尽管已有多种先前工作提出了高效的 SVD 变体以实现低秩操作,但在实践中我们发现仍难以在模型执行过程中实现显著的延迟减少。为解决这一局限,我们引入了一种新的计算模式,并在更细粒度上应用 SVD,从而实现了实际且可测量的执行延迟改进。此外,鉴于权重元素的重要性不同,我们在 SVD 过程中自适应地分配相对重要性给每个元素,以更好地保持准确性,然后在此框架中应用量化,对权重和激活值都进行量化,从而实现高度高效的 VLM。总体而言,我们引入了加权 SVD(WSVD),该方法在保持准确性的同时实现了超过 1.8 倍的解码速度提升。我们已开源代码:https://github.com/SAI-Lab-NYU/WSVD
Summary / 总结
The research aims to reduce the computational burden of Vision Language Models (VLMs) by proposing Weighted SVD (WSVD), which applies SVD at a finer granularity and adaptively allocates importance to weight elements. This method achieves over 1.8 times decoding speedup while maintaining accuracy. The approach also includes quantization for both weights and activations, leading to highly efficient VLMs.
该论文提出了加权SVD(WSVD)方法,通过在更细粒度上应用SVD并在SVD过程中适配性地分配权重来提高低精度视觉-语言模型(VLM)的执行效率。该方法在保持准确性的前提下实现了超过1.8倍的解码速度提升。该方法结合了对权重和激活进行量化,从而显著减少了延迟。
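Illustrative note (not from the paper): the basic arithmetic behind low-rank compression is that an m × n weight matrix replaced by two rank-r factors needs r(m + n) multiply-accumulates per input vector instead of mn. The layer size and rank below are hypothetical, and the sketch covers neither WSVD's element weighting nor its quantization.

```python
def dense_cost(m: int, n: int) -> int:
    """Multiply-accumulates for one matrix-vector product with an m x n weight."""
    return m * n


def low_rank_cost(m: int, n: int, r: int) -> int:
    """Multiply-accumulates when the weight is factored as (m x r) times (r x n)."""
    return r * (m + n)


def break_even_rank(m: int, n: int) -> int:
    """Largest rank r for which the factored form is no more expensive than the dense one."""
    return (m * n) // (m + n)


m = n = 4096  # hypothetical projection layer size
r = 512       # hypothetical retained rank

print(dense_cost(m, n), low_rank_cost(m, n, r), break_even_rank(m, n))
```

A fourfold drop in operation count at rank 512 does not guarantee a fourfold drop in latency, which matches the abstract's observation that measured speedups from SVD are hard to obtain in practice.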