World2VLM: Distilling World Model Imagination into VLMs for Dynamic Spatial Reasoning
Authors: Wanyue Zhang, Wenxiang Wu, Wang Xu, Jiaxin Luo, Helu Zhi, Yibin Huang, Shuo Ren, Zitao Liu, Jiajun Zhang
First: 2026-04-29T17:48:01+00:00 · Latest: 2026-04-29T17:48:01+00:00
Comments: The code is available at https://github.com/WanyueZhang-ai/World2VLM. The dataset is available at https://huggingface.co/datasets/WanyueZhang/World2VLM
Abstract
Vision-language models (VLMs) have shown strong performance on static visual understanding, yet they still struggle with dynamic spatial reasoning that requires imagining how scenes evolve under egocentric motion. Recent efforts address this limitation either by scaling spatial supervision with synthetic data or by coupling VLMs with world models at inference time. However, the former often lacks explicit modeling of motion-conditioned state transitions, while the latter incurs substantial computational overhead. In this work, we propose World2VLM, a training framework that distills spatial imagination from a generative world model into a vision-language model. Given an initial observation and a parameterized camera trajectory, we use a view-consistent world model to synthesize geometrically aligned future views and derive structured supervision for both forward (action-to-outcome) and inverse (outcome-to-action) spatial reasoning. We post-train the VLM with a two-stage recipe on a compact dataset generated by this pipeline and evaluate it on multiple spatial reasoning benchmarks. World2VLM delivers consistent improvements over the base model across diverse benchmarks, including SAT-Real, SAT-Synthesized, VSI-Bench, and MindCube. It also outperforms the test-time world-model-coupled methods while eliminating the need for expensive inference-time generation. Our results suggest that world models can serve not only as inference-time tools, but also as effective training-time teachers, enabling VLMs to internalize spatial imagination in a scalable and efficient manner.
中文标题/摘要
标题:World2VLM:将世界模型的想象能力提炼到VLM中以实现动态空间推理
视觉-语言模型(VLMs)在静态视觉理解方面表现出色,但在处理需要想象场景在主观运动下如何演变的动态空间推理方面仍然存在困难。最近的努力通过使用合成数据扩展空间监督或在推理时将VLM与世界模型耦合来解决这一局限性。然而,前者往往缺乏对运动条件下的状态转换的显式建模,而后者则会带来巨大的计算开销。在本文中,我们提出了一种训练框架World2VLM,该框架将生成世界模型中的空间想象能力提炼到视觉-语言模型中。给定初始观察和参数化的摄像机轨迹,我们使用视图一致的世界模型来合成几何对齐的未来视图,并从中推导出结构化的监督,用于前向(动作-结果)和逆向(结果-动作)空间推理。我们使用此管道生成的紧凑数据集对VLM进行两阶段的后训练,并在多个空间推理基准上进行评估。World2VLM在多个基准上(包括SAT-Real、SAT-Synthesized、VSI-Bench和MindCube)相对于基模型提供了持续的改进。它还优于测试时与世界模型耦合的方法,同时消除了昂贵的推理时生成的需要。我们的结果表明,世界模型不仅可以作为推理时的工具,还可以作为有效的训练时教师,使VLM能够以可扩展和高效的方式内化空间想象。
Summary / 总结
World2VLM is a training framework that distills spatial imagination from a generative world model into a vision-language model to improve dynamic spatial reasoning. Given an initial observation and a parameterized camera trajectory, a view-consistent world model synthesizes future views, from which structured supervision for forward (action-to-outcome) and inverse (outcome-to-action) reasoning is derived. The post-trained model improves over the base model on SAT-Real, SAT-Synthesized, VSI-Bench, and MindCube, and outperforms test-time world-model-coupled methods without requiring inference-time generation, suggesting that world models can serve as effective training-time teachers.
World2VLM 提出了一种将生成的世界模型集成到视觉语言模型中的训练框架,以增强动态空间推理能力。通过使用视图一致的世界模型合成未来视图并推导结构化的监督信息,该方法在 SAT-Real、SAT-Synthesized、VSI-Bench 和 MindCube 等多个基准测试中提升了基模型的表现。这种方法减少了与推理时世界模型耦合相比的计算开销,并展示了世界模型可以有效训练 VLMs 进行空间想象。
Value-Guided Iterative Refinement and the DIQ-H Benchmark for Evaluating VLM Robustness
Authors: Hanwen Wan, Zexin Lin, Yixuan Deng, Xiaoqiang Ji
First: 2025-12-03T17:22:29+00:00 · Latest: 2026-04-29T16:22:49+00:00
Abstract
Vision-Language Models (VLMs) are essential for embodied AI and safety-critical applications, such as robotics and autonomous systems. However, existing benchmarks primarily focus on static or curated visual inputs, neglecting the challenges posed by adversarial conditions, value misalignment, and error propagation in continuous deployment. Current benchmarks either overlook the impact of real-world perturbations, or fail to account for the cumulative effect of inconsistent reasoning over time. To address these gaps, we introduce the Degraded Image Quality Leading to Hallucinations (DIQ-H) benchmark, the first to evaluate VLMs under adversarial visual conditions in continuous sequences. DIQ-H simulates real-world stressors including motion blur, sensor noise, and compression artifacts, and measures how these corruptions lead to persistent errors and misaligned outputs across time. The benchmark explicitly models error propagation and its effect on long-term value consistency. To enhance scalability and reduce costs for safety-critical evaluation, we propose the Value-Guided Iterative Refinement (VGIR) framework, which automates the generation of high-quality, ethically aligned ground truth annotations. VGIR leverages lightweight VLMs to detect and refine value misalignment, improving accuracy from 72.2% to 83.3%, representing a 15.3% relative improvement. The DIQ-H benchmark and VGIR framework provide a robust platform for embodied AI safety assessment, revealing vulnerabilities in error recovery, ethical consistency, and temporal value alignment.
Summary / 总结
The paper introduces the DIQ-H benchmark to evaluate VLMs under adversarial visual conditions in continuous sequences, addressing the limitations of existing benchmarks. It also proposes the Value-Guided Iterative Refinement (VGIR) framework to automate the generation of high-quality, ethically aligned ground truth annotations, improving accuracy from 72.2% to 83.3%. The DIQ-H benchmark and VGIR framework help assess the robustness of VLMs in safety-critical applications, highlighting issues in error recovery, ethical consistency, and temporal value alignment.
论文提出了DIQ-H基准,以评估VLM在连续序列中的对抗视觉条件下的表现,解决了现有基准的局限性。同时,提出了基于价值引导的迭代细化(VIR)框架,以自动化生成高质量、伦理对齐的地面真值注释,准确率从72.2%提高到83.3%。DIQ-H基准和VIR框架有助于评估嵌入式AI的安全性,并揭示了错误恢复和时间价值对齐方面的漏洞。
Random Cloud: Finding Minimal Neural Architectures Without Training
Authors: Javier Gil Blázquez
First: 2026-04-29T15:57:01+00:00 · Latest: 2026-04-29T15:57:01+00:00
Abstract
I propose the Random Cloud method, a training-free approach to neural architecture search that discovers minimal feedforward network topologies through stochastic exploration and progressive structural reduction. Unlike post-training pruning methods that require a full train-prune-retrain cycle, this method evaluates randomly initialized networks without backpropagation, progressively reduces their topology, and only trains the best minimal candidate at the end. I evaluate on 7 classification benchmarks against magnitude pruning and random pruning baselines. The Random Cloud matches or outperforms both baselines in 6 of 7 datasets, achieving statistically significant improvements on Sonar (+4.9 pp accuracy, p = 0.017 vs magnitude pruning) with 87% parameter reduction. Crucially, the method is faster than both pruning baselines in 4 of 5 datasets (0.67-0.94× the cost of full training), since it avoids training the full-size network entirely.
ViCrop-Det: Spatial Attention Entropy Guided Cropping for Training-Free Small-Object Detection
Authors: Hui Wang, Hongze Li, Wei Chen, Xiaojin Zhang
First: 2026-04-29T15:35:03+00:00 · Latest: 2026-04-29T15:35:03+00:00
Abstract
Transformer-based architectures have established a dominant paradigm in global semantic perception; however, they remain fundamentally constrained by the profound spatial heterogeneity inherent in natural images. Specifically, the imposition of a uniform global receptive field across regions of varying information density inevitably leads to local feature degradation, particularly in dense conflict zones populated by microscopic targets. To address this mechanistic limitation, we propose ViCrop-Det, a training-free inference framework that introduces adaptive spatial trust region shrinkage. Inspired by the use of attention entropy in anomaly segmentation, ViCrop-Det leverages the detection decoder's cross-attention distribution as an endogenous probe. By utilizing Spatial Attention Entropy (SAE) to heuristically evaluate local spatial ambiguity, the framework executes dynamic spatial routing, allocating a fixed computational budget exclusively to regions exhibiting both high target saliency and high cognitive uncertainty. By shrinking the spatial trust region and injecting high-frequency localized observations, ViCrop-Det actively resolves spatial ambiguity and recovers fine-grained features without requiring architectural modifications. Extensive evaluations on VisDrone and DOTA-v1.5 demonstrate that ViCrop-Det yields competitive performance enhancements, consistently adding +1-3 mAP@50 to RT-DETR-R50 and Deformable DETR with a marginal 20-23% latency overhead. On MS COCO, AP_S improves while AP_M/AP_L remains stable, indicating precise fine-scale refinement without compromising the global spatial prior. Under compute-matched settings, our adaptive routing strategy comprehensively surpasses uniform slicing baselines, achieving a highly optimized accuracy-speed trade-off.
Summary / 总结
ViCrop-Det is a training-free inference framework that addresses the spatial heterogeneity issue in natural images by introducing adaptive spatial trust region shrinkage. It uses Spatial Attention Entropy (SAE) to dynamically route computation to regions with high target saliency and cognitive uncertainty, thereby enhancing fine-grained feature recovery. Evaluations on VisDrone and DOTA-v1.5 show that ViCrop-Det improves performance by +1-3 mAP@50 with a 20-23% latency overhead, and on MS COCO, it achieves precise fine-scale refinement without compromising the global spatial prior.
ViCrop-Det 是一种无需训练的推理框架,通过引入自适应的空间信任区域收缩来解决基于变换器的架构中的空间异质性问题。它使用空间注意力熵(SAE)动态分配计算资源给具有高目标显著性和认知不确定性区域,增强细粒度特征的恢复。实验结果表明,ViCrop-Det 在 VisDrone 和 DOTA-v1.5 上将 mAP@50 提高了 1-3%,并带有 20-23% 的延迟开销。在 MS COCO 上,它实现了精细尺度的精确细化,而不会牺牲全局空间先验。
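As a rough illustration of the entropy signal mentioned in the ViCrop-Det entry above, the sketch below computes the Shannon entropy of a single attention map flattened over spatial locations. It only shows the general idea that diffuse attention yields high entropy and peaked attention yields low entropy. The function name `attention_entropy` and the normalization step are assumptions made for illustration, not the paper's definition of SAE.

```python
import math

def attention_entropy(weights):
    """Shannon entropy (in nats) of a non-negative attention map given as a flat list.

    Hypothetical helper: the exact SAE formulation used by ViCrop-Det is not
    specified in the abstract, so this is only the textbook entropy.
    """
    total = sum(weights)
    if total <= 0:
        raise ValueError("attention weights must have a positive sum")
    probs = [w / total for w in weights]
    # Cells with zero probability contribute nothing to the entropy.
    return -sum(p * math.log(p) for p in probs if p > 0)

peaked = [0.97, 0.01, 0.01, 0.01]    # attention concentrated on one location
diffuse = [0.25, 0.25, 0.25, 0.25]   # attention spread evenly (maximally ambiguous)
```

Under this reading, a region whose attention map looks like `diffuse` would be a stronger candidate for cropping than one that looks like `peaked`, provided it also shows high target saliency.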
MemOVCD: Training-Free Open-Vocabulary Change Detection via Cross-Temporal Memory Reasoning and Global-Local Adaptive Rectification
Authors: Zuzheng Kuang, Honghao Chang, Boqiang Liang, Haoqian Wang, Lijun He, Fan Li, Haixia Bi
First: 2026-04-29T15:05:37+00:00 · Latest: 2026-04-29T15:05:37+00:00
Abstract
Open-vocabulary change detection aims to identify semantic changes in bi-temporal remote sensing images without predefined categories. Recent methods combine foundation models such as SAM, DINO and CLIP, but typically process each timestamp independently or interact only at the final comparison stage. Such paradigms suffer from insufficient temporal coupling during semantic reasoning, which limits their ability to distinguish genuine semantic changes from non-semantic appearance discrepancies. In addition, patch-dominant inference on high-resolution images often weakens global semantic continuity and produces fragmented change regions. To address these issues, we propose MemOVCD, a training-free open-vocabulary change detection framework based on cross-temporal memory reasoning and global-local adaptive rectification. Specifically, we reformulate bi-temporal change detection as a two-frame tracking problem and introduce weighted bidirectional propagation to aggregate semantic evidence from both temporal directions. To stabilize memory propagation across large temporal gaps, we construct histogram-aligned transition frames to smooth abrupt appearance changes. Moreover, a global-local adaptive rectification strategy adaptively fuses local and global-view predictions, improving spatial consistency while preserving fine-grained details. Experiments on five benchmarks demonstrate that MemOVCD achieves favorable performance on two change detection tasks, validating its effectiveness and generalization under diverse open-vocabulary settings.
中文标题/摘要
标题:MemOVCD:基于跨时间记忆推理和全局-局部自适应校正的无训练开放词汇变化检测
开放词汇变化检测旨在识别双时相遥感图像中的语义变化,无需预定义类别。最近的方法结合了如SAM、DINO和CLIP等基础模型,但通常独立处理每个时间戳或仅在最终比较阶段进行交互。这些范式在语义推理过程中缺乏充分的时间耦合,限制了它们区分真正的语义变化与非语义外观差异的能力。此外,高分辨率图像上的块主导推理往往会削弱全局语义连续性并产生碎片化的变化区域。为了解决这些问题,我们提出了一种基于跨时间记忆推理和全局-局部自适应校正的无训练开放词汇变化检测框架MemOVCD。具体而言,我们将双时相变化检测重新表述为两帧跟踪问题,并引入加权双向传播以从两个时间方向聚合语义证据。为了在大时间间隔内稳定记忆传播,我们构建了直方图对齐的过渡帧以平滑突然的外观变化。此外,全局-局部自适应校正策略自适应地融合局部和全局视图预测,提高了空间一致性同时保留了细粒度的细节。在五个基准上的实验表明,MemOVCD在两种变化检测任务中均取得了良好的性能,验证了其在多种开放词汇设置下的有效性和泛化能力。
Summary / 总结
MemOVCD is a training-free open-vocabulary change detection framework that addresses the insufficient temporal coupling and fragmented change regions of existing methods. It reformulates bi-temporal change detection as a two-frame tracking problem, uses weighted bidirectional propagation to aggregate semantic evidence from both temporal directions, stabilizes memory propagation with histogram-aligned transition frames, and fuses local and global-view predictions through global-local adaptive rectification. Experiments on five benchmarks show that MemOVCD achieves favorable performance on two change detection tasks.
MemOVCD 是一个无需训练的开放词汇变化检测框架,解决了现有方法中时间耦合不足和变化区域碎片化的问题。它通过跨时间记忆推理和全局-局部自适应校正来从两个时间戳中聚合语义证据,并稳定长时间间隔的记忆传播。实验表明,MemOVCD 在两个变化检测任务上的表现优于现有方法,并在多种基准测试中验证了其有效性与泛化能力。
Contrastive Semantic Projection: Faithful Neuron Labeling with Contrastive Examples
Authors: Oussama Bouanani, Jim Berend, Wojciech Samek, Sebastian Lapuschkin, Maximilian Dreyer
First: 2026-04-24T11:55:50+00:00 · Latest: 2026-04-29T14:23:54+00:00
Abstract
Neuron labeling assigns textual descriptions to internal units of deep networks. Existing approaches typically rely on highly activating examples, often yielding broad or misleading labels by focusing on dominant but incidental visual factors. Prior work such as FALCON introduced contrastive examples -- inputs that are semantically similar to activating examples but elicit low activations -- to sharpen explanations, but it primarily addresses subspace-level interpretability rather than scalable neuron-level labeling. We revisit contrastive explanations for neuron-level labeling in two stages: (1) candidate label generation with vision language models (VLMs) and (2) label assignment with CLIP-like encoders. First, we show that providing contrastive image sets to VLMs yields candidate labels that are more specific and more faithful. Second, we introduce Contrastive Semantic Projection (CSP), an extension of SemanticLens that incorporates contrastive examples directly into its CLIP-based scoring and selection pipeline. Across extensive experiments and a case study on melanoma detection, contrastive labeling improves both faithfulness and semantic granularity over state-of-the-art baselines. Our results demonstrate that contrastive examples are a simple yet powerful and currently underutilized component of neuron labeling and analysis pipelines.
中文标题/摘要
标题:对比语义投影:通过对比样例进行忠实神经元标签标注
神经元标签为深度网络的内部单元分配文本描述。现有方法通常依赖于高度激活的样例,往往会因为关注主导但偶然的视觉因素而产生宽泛或误导性的标签。先前的工作如FALCON引入了对比样例——与激活样例在语义上相似但激活较低的输入——以使解释更加精确,但主要解决的是子空间级别的可解释性而非可扩展的神经元级标签标注。我们重新审视了对比解释在神经元级标签标注中的应用,分为两个阶段:(1) 使用视觉语言模型(VLMs)生成候选标签,(2) 使用CLIP类似编码器进行标签分配。首先,我们展示了向VLMs提供对比图像集可以产生更具体、更忠实的候选标签。其次,我们引入了对比语义投影(CSP),这是一种扩展的语义镜头(SemanticLens),直接将对比样例纳入其基于CLIP的评分和选择流程中。在广泛的实验和黑色素瘤检测案例研究中,对比标签在忠实度和语义粒度上都优于最先进的基线方法。我们的结果表明,对比样例是神经元标签和分析管道中简单而强大且目前被低估的组成部分。
Summary / 总结
The paper addresses the issue of broad or misleading labels in neuron labeling by proposing a method that uses contrastive examples. It involves two stages: generating candidate labels with vision language models and assigning labels with CLIP-like encoders. The method, Contrastive Semantic Projection (CSP), incorporates contrastive examples to improve label specificity and faithfulness. Experiments show that contrastive labeling outperforms existing methods in terms of faithfulness and semantic granularity for neuron-level labeling in tasks like melanoma detection.
论文通过使用对比样本解决神经元标签过于宽泛或误导的问题,提出了一种方法,包括两个阶段:使用视觉语言模型生成候选标签和使用CLIP类似编码器分配标签。方法Contrastive Semantic Projection (CSP) 直接将对比样本纳入其CLIP基线评分和选择流程中,以提高标签的特异性和真实性。实验表明,对比标签在神经元级标签分配中优于现有方法,在如黑色素瘤检测任务中表现出更高的真实性和语义粒度。
A Multimodal Depth-Aware Method For Embodied Reference Understanding
Authors: Fevziye Irem Eyiokur, Dogucan Yaman, Hazım Kemal Ekenel, Alexander Waibel
Venue: ICASSP 2026
First: 2025-10-09T14:32:21+00:00 · Latest: 2026-04-29T14:10:18+00:00
Comments: Accepted by ICASSP 2026
Abstract
Embodied Reference Understanding requires identifying a target object in a visual scene based on both language instructions and pointing cues. While prior works have shown progress in open-vocabulary object detection, they often fail in ambiguous scenarios where multiple candidate objects exist in the scene. To address these challenges, we propose a novel ERU framework that jointly leverages LLM-based data augmentation, depth-map modality, and a depth-aware decision module. This design enables robust integration of linguistic and embodied cues, improving disambiguation in complex or cluttered environments. Experimental results on two datasets demonstrate that our approach significantly outperforms existing baselines, achieving more accurate and reliable referent detection.
中文标题/摘要
标题:一种多模态深度感知方法用于体态参考理解
体态参考理解需要根据语言指令和指示手势在视觉场景中识别目标物体。尽管先前的工作在开放词汇对象检测方面取得了进展,但在存在多个候选物体的模糊场景中,它们往往无法有效工作。为了解决这些挑战,我们提出了一种新颖的ERU框架,该框架联合利用基于LLM的数据增强、深度图模态和深度感知决策模块。这种设计能够稳健地整合语言和体态线索,提高在复杂或杂乱环境中消歧的效果。在两个数据集上的实验结果表明,我们的方法显著优于现有基线,实现了更准确和可靠的指代检测。
Summary / 总结
The research aims to improve embodied reference understanding by addressing ambiguities in visual scenes. The proposed ERU framework integrates LLM-based data augmentation, depth-map modality, and a depth-aware decision module to robustly combine linguistic and embodied cues. Experiments show that this approach outperforms existing methods, leading to more accurate and reliable referent detection in complex environments.
研究旨在通过解决视觉场景中的歧义性来提升体态参考理解。提出的ERU框架结合了基于LLM的数据增强、深度图模态和深度感知决策模块,以稳健地结合语言和体态线索。实验表明,该方法优于现有方法,能够在复杂环境中实现更准确和可靠的指代检测。
When to Vote, When to Rewrite: Disagreement-Guided Strategy Routing for Test-Time Scaling
Authors: Zhimin Lin, Yixin Ji, Jinpeng Li, Yu Luo, Dong Li, Junhua Fang, Juntao Li, Min Zhang
First: 2026-04-29T13:11:39+00:00 · Latest: 2026-04-29T13:11:39+00:00
Abstract
Large Reasoning Models (LRMs) achieve strong performance on mathematical reasoning tasks but remain unreliable on challenging instances. Existing test-time scaling methods, such as repeated sampling, self-correction, and tree search, improve performance at the cost of increased computation, yet often exhibit diminishing returns on hard problems. We observe that output disagreement is strongly correlated with instance difficulty and prediction correctness, providing a useful signal for guiding instance-level strategy selection at test time. Based on this insight, we propose a training-free framework that formulates test-time scaling as an instance-level routing problem, rather than allocating more computation within a single strategy, dynamically selecting among different scaling strategies based on output disagreement. The framework applies lightweight resolution for consistent cases, majority voting for moderate disagreement, and rewriting-based reformulation for highly ambiguous instances. Experiments on seven mathematical benchmarks and three models show that our method improves accuracy by 3% - 7% while reducing sampling cost compared to existing approaches.
Summary / 总结
This study proposes a training-free framework that treats test-time scaling for large reasoning models as an instance-level routing problem: rather than allocating more computation within a single strategy, it dynamically selects among scaling strategies based on output disagreement. The authors observe that output disagreement is strongly correlated with instance difficulty and prediction correctness, and use it to apply lightweight resolution to consistent cases, majority voting to moderate disagreement, and rewriting-based reformulation to highly ambiguous instances. On seven mathematical benchmarks and three models, the method improves accuracy by 3% to 7% while reducing sampling cost compared to existing approaches.
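The three-way routing described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the disagreement measure (share of samples that differ from the most common answer), the thresholds `low` and `high`, and the strategy names are all assumptions chosen for the example.

```python
from collections import Counter

def disagreement(answers):
    """Fraction of sampled answers that differ from the most common one (assumed metric)."""
    top_count = Counter(answers).most_common(1)[0][1]
    return 1.0 - top_count / len(answers)

def route(answers, low=0.2, high=0.6):
    """Map a set of sampled answers to a strategy name.

    The thresholds are illustrative placeholders, not values from the paper.
    """
    d = disagreement(answers)
    if d <= low:
        return "lightweight_resolution"   # samples largely agree
    if d <= high:
        return "majority_vote"            # moderate disagreement
    return "rewrite_and_resample"         # highly ambiguous instance
```

For example, five identical samples would be resolved cheaply, while five mutually different answers would trigger the rewriting branch.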
OVGGT: O(1) Constant-Cost Streaming Visual Geometry Transformer
Authors: Si-Yu Lu, Po-Ting Chen, Hui-Che Hsu, Sin-Ye Jhong, Wen-Huang Cheng, Yung-Yao Chen
First: 2026-03-06T06:44:17+00:00 · Latest: 2026-04-29T12:58:34+00:00
Comments: Project page: https://vaisr.github.io/OVGGT/ Code: https://github.com/VAISR/OVGGT
Abstract
Reconstructing 3D geometry from streaming video requires continuous inference under bounded resources. Recent geometric foundation models achieve impressive reconstruction quality through all-to-all attention, yet their quadratic cost confines them to short, offline sequences. Causal-attention variants such as StreamVGGT enable single-pass streaming but accumulate an ever-growing KV cache, exhausting GPU memory within hundreds of frames and precluding the long-horizon deployment that motivates streaming inference in the first place. We present OVGGT, a training-free framework that bounds both memory and compute to a fixed budget regardless of sequence length. Our approach combines Self-Selective Caching, which leverages FFN residual magnitudes to compress the KV cache while remaining fully compatible with FlashAttention, with Dynamic Anchor Protection, which shields coordinate-critical tokens from eviction to suppress geometric drift over extended trajectories. Extensive experiments on indoor, outdoor, and ultra-long sequence benchmarks demonstrate that OVGGT processes arbitrarily long videos within a constant VRAM envelope while achieving state-of-the-art 3D geometric accuracy.
Summary / 总结
OVGGT addresses the challenge of reconstructing 3D geometry from streaming video by bounding both memory and compute costs to a fixed budget. It combines Self-Selective Caching to compress the KV cache and Dynamic Anchor Protection to shield critical tokens, ensuring geometric accuracy over long sequences. Experiments show OVGGT processes arbitrarily long videos with constant VRAM usage while achieving state-of-the-art 3D geometric accuracy.
OVGGT通过将内存和计算成本限制在一个固定预算内来解决从流式视频中重建3D几何结构的挑战。它结合了Self-Selective Caching来压缩KV缓存和Dynamic Anchor Protection来保护关键令牌,确保长时间序列中的几何准确性。实验表明,OVGGT能够在恒定的显存环境下处理任意长度的视频,同时达到最先进的3D几何精度。
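To make the idea of a fixed-budget cache with protected tokens concrete, here is a minimal sketch of score-based eviction. It is an assumption-laden toy: `score` merely stands in for whatever importance signal is used (the abstract mentions FFN residual magnitudes), the `anchors` set stands in for coordinate-critical tokens, and `evict_to_budget` is a hypothetical name, not part of the OVGGT code base.

```python
def evict_to_budget(tokens, budget, anchors=frozenset()):
    """Keep at most `budget` tokens: every anchor, then the highest-scoring others.

    tokens:  list of (token_id, score) pairs with unique ids, in temporal order.
    anchors: ids that must never be evicted (illustrative stand-in for protected tokens).
    """
    protected = [t for t in tokens if t[0] in anchors]
    if len(protected) > budget:
        raise ValueError("budget is smaller than the number of protected anchors")
    others = sorted((t for t in tokens if t[0] not in anchors),
                    key=lambda t: t[1], reverse=True)
    kept_ids = {t[0] for t in protected + others[:budget - len(protected)]}
    # Return the survivors in their original temporal order.
    return [t for t in tokens if t[0] in kept_ids]
```

Calling this after each new frame keeps the cache size constant regardless of sequence length, which is the property the abstract emphasizes; how the real method selects and scores tokens is not reproduced here.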
SynSur: An end-to-end generative pipeline for synthetic industrial surface defect generation and detection
Authors: Paul Julius Kühn, Mika Pommeranz, Arjan Kuijper, Saptarshi Neil Sinha
First: 2026-04-29T12:57:32+00:00 · Latest: 2026-04-29T12:57:32+00:00
Abstract
Learning-based industrial defect detection is often limited not by model capacity, but by the scarcity of labeled defect data: defects are rare, annotations are expensive, and collecting balanced training sets is slow. We present an end-to-end pipeline for synthetic defect generation and annotation that combines Vision-Language-Model-based prompts, LoRA-adapted diffusion, mask-guided inpainting, and sample filtering with automatic label derivation, and we demonstrate the potential of augmenting real data with realistic synthetic samples to overcome data scarcity. The evaluation is conducted on BSData, a challenging dataset of pitting defects on ball screw drives, and then on a subset of the Mobile phone screen surface defect segmentation dataset (MSD) to test cross-domain transfer. Beyond downstream detector performance, we analyze key stages of the pipeline, including prompt construction, LoRA selection, and sample filtering with DreamSim and CLIPScore, to understand which synthetic samples are both realistic and useful. Experiments with YOLOv26, YOLOX, and LW-DETR show that synthetic-only training does not replace real data. When combined with real data, synthetic defects can preserve performance and yield modest gains in selected BSData training regimes. The MSD transfer study shows that the overall pipeline structure carries over to a second industrial inspection domain, while also highlighting the importance of domain-specific adaptation and annotation-quality control. Overall, the paper provides an end-to-end assessment of diffusion-based industrial defect synthesis and shows that its strongest value lies in strengthening scarce real datasets rather than substituting for them.
DenseStep2M: A Scalable, Training-Free Pipeline for Dense Instructional Video Annotation
Authors: Mingji Ge, Qirui Chen, Zeqian Li, Weidi Xie
First: 2026-04-29T11:51:35+00:00 · Latest: 2026-04-29T11:51:35+00:00
Abstract
Long-term video understanding requires interpreting complex temporal events and reasoning over procedural activities. While instructional video corpora, like HowTo100M, offer rich resources for model training, they present significant challenges, including noisy ASR transcripts and inconsistent temporal alignments between narration and visual content. In this work, we introduce an automated, training-free pipeline to extract high-quality procedural annotations from in-the-wild instructional videos. Our approach segments videos into coherent shots, filters poorly aligned content, and leverages state-of-the-art multimodal and large language models (Qwen2.5-VL and DeepSeek-R1) to generate structured, temporally grounded procedural steps.
This pipeline yields DenseStep2M, a large-scale dataset comprising approximately 100K videos and 2M detailed instructional steps, designed to support comprehensive long-form video understanding. To rigorously evaluate our pipeline, we curate DenseCaption100, a benchmark of high-quality, human-written captions. Evaluations demonstrate strong alignment between our auto-generated steps and human annotations. Furthermore, we validate the utility of DenseStep2M across three core downstream tasks: dense video captioning, procedural step grounding, and cross-modal retrieval. Models fine-tuned on DenseStep2M achieve substantial gains in captioning quality and temporal localization, while exhibiting robust zero-shot generalization across egocentric, exocentric, and mixed-perspective domains. These results underscore the effectiveness of DenseStep2M in facilitating advanced multimodal alignment and long-term activity reasoning. Our dataset is available at https://huggingface.co/datasets/mingjige/DenseStep2M.
中文标题/摘要
标题:DenseStep2M:一种可扩展、无需训练的密集指令视频标注流水线
长期视频理解需要解释复杂的时序事件并推理程序性活动。虽然像HowTo100M这样的指令视频语料库为模型训练提供了丰富的资源,但它们也带来了显著的挑战,包括嘈杂的ASR转录和叙述与视觉内容之间不一致的时序对齐。在本文中,我们介绍了一种自动化的、无需训练的流水线,用于从野外的指令视频中提取高质量的程序性注释。我们的方法将视频分割成连贯的镜头,过滤掉对齐不良的内容,并利用最先进的多模态和大型语言模型(Qwen2.5-VL和DeepSeek-R1)生成结构化、时序定位的程序性步骤。
该流水线生成了包含约10万段视频和200万详细指令步骤的DenseStep2M大规模数据集,旨在支持全面的长视频理解。为了严格评估我们的流水线,我们编纂了DenseCaption100,这是一个高质量的人工撰写的字幕基准。评估结果表明,我们自动生成的步骤与人工注释高度一致。此外,我们还验证了DenseStep2M在三个核心下游任务中的实用性:密集视频字幕生成、程序性步骤定位和跨模态检索。在DenseStep2M上微调的模型在字幕质量和时序定位方面取得了显著提升,并在第一人称、第三人称和混合视角领域表现出强大的零样本泛化能力。这些结果突显了DenseStep2M在促进高级多模态对齐和长期活动推理方面的有效性。我们的数据集可在https://huggingface.co/datasets/mingjige/DenseStep2M/ 获取。
Summary / 总结
The research addresses long-term video understanding by developing a training-free pipeline that automatically annotates instructional videos with detailed procedural steps. The method segments videos into coherent shots, filters poorly aligned content, and uses state-of-the-art multimodal and large language models (Qwen2.5-VL and DeepSeek-R1) to generate structured, temporally grounded annotations. The pipeline produces DenseStep2M, a large-scale dataset of approximately 100K videos and 2M steps whose auto-generated steps align strongly with human annotations. Its utility is validated on dense video captioning, procedural step grounding, and cross-modal retrieval: models fine-tuned on DenseStep2M achieve substantial gains in captioning quality and temporal localization and show robust zero-shot generalization across egocentric, exocentric, and mixed-perspective domains.
研究旨在通过开发一个无需训练的自动标注流水线来解决长期视频理解的挑战,该流水线能够从野外的指令视频中提取高质量的程序性注释。方法将视频分割成连贯的镜头,过滤掉对齐不佳的内容,并使用最先进的多模态和大语言模型生成结构化的注释。该流水线生成了包含100K视频和2M步骤的DenseStep2M大规模数据集,显著提高了密集视频标注、程序性步骤定位和跨模态检索等任务的表现,展示了与人工注释的强烈对齐以及在不同视角下的鲁棒零样本泛化能力。
ViTaPEs: Visuotactile Position Encodings for Cross-Modal Alignment in Multimodal Transformers
Authors: Fotios Lygerakis, Ozan Özdenizci, Elmar Rückert
First: 2025-05-26T14:19:29+00:00 · Latest: 2026-04-29T11:05:46+00:00
Abstract
Tactile sensing provides essential local information that is complementary to visual perception, such as texture, compliance, and force. Despite recent advances in visuotactile representation learning, challenges remain in fusing these modalities and generalizing across tasks and environments without heavy reliance on pre-trained vision-language models. Moreover, existing methods do not study positional encodings, thereby overlooking the multi-stage spatial reasoning needed to capture fine-grained visuotactile correlations. We introduce ViTaPEs, a transformer-based architecture for learning task-agnostic visuotactile representations from paired vision and tactile inputs. Our key idea is a two-stage positional injection: local (modality-specific) positional encodings are added within each stream, and a global positional encoding is added on the joint token sequence immediately before attention, providing a shared positional vocabulary at the stage where cross-modal interaction occurs. We make the positional injection points explicit and conduct controlled ablations that isolate their effect before a token-wise nonlinearity versus immediately before self-attention. Experiments on multiple large-scale real-world datasets show that ViTaPEs not only surpasses state-of-the-art baselines across various recognition tasks but also demonstrates zero-shot generalization to unseen, out-of-domain scenarios. We further demonstrate the transfer-learning strength of ViTaPEs in a robotic grasping task, where it outperforms state-of-the-art baselines in predicting grasp success. Project page: https://sites.google.com/view/vitapes
Summary / 总结
The research aims to improve the fusion of visual and tactile data for better multimodal understanding, addressing the overlooked role of positional encodings in cross-modal alignment. The method introduces ViTaPEs, a transformer-based architecture with a two-stage positional injection: local, modality-specific encodings within each stream and a global encoding on the joint token sequence immediately before attention. Experiments show that ViTaPEs outperforms existing methods on various recognition tasks, demonstrates zero-shot generalization to unseen, out-of-domain scenarios, and outperforms state-of-the-art baselines at predicting grasp success in a robotic grasping task.
研究旨在通过融合视觉和触觉数据来提高多模态理解,解决跨模态对齐中的位置编码挑战。方法引入了ViTaPEs,这是一种基于变压器的架构,采用两阶段位置编码来捕捉细粒度的视觉-触觉关联。实验表明,ViTaPEs在各种识别任务上优于现有方法,并展示了在未见过的场景中的零样本泛化能力,在机器人抓取任务中预测抓取成功率方面表现更优。
Progressive Semantic Communication for Efficient Edge-Cloud Vision-Language Models
Authors: Cyril Shih-Huan Hsu, Wig Yuan-Cheng Cheng, Chrysa Papagianni
First: 2026-04-29T10:16:06+00:00 · Latest: 2026-04-29T10:16:06+00:00
Comments: Under review. Extended version with additional figures and appendices
Abstract
Deploying Vision-Language Models (VLMs) on edge devices remains challenging due to their substantial computational and memory demands, which exceed the capabilities of resource-constrained embedded platforms. Conversely, fully offloading inference to the cloud is often impractical in bandwidth-limited environments, where transmitting raw visual data introduces substantial latency overhead. While recent edge-cloud collaborative architectures attempt to partition VLM workloads across devices, they typically rely on transmitting fixed-size representations, lacking adaptability to dynamic network conditions and failing to fully exploit semantic redundancy. In this paper, we propose a progressive semantic communication framework for edge-cloud VLM inference, using a Meta AutoEncoder that compresses visual tokens into adaptive, progressively refinable representations, enabling plug-and-play deployment with off-the-shelf VLMs without additional fine-tuning. This design allows flexible transmission at different information levels, providing a controllable trade-off between communication cost and semantic fidelity. We implement a full end-to-end edge-cloud system comprising an embedded NXP i.MX95 platform and a GPU server, communicating over bandwidth-constrained networks. Experimental results show that, at 1 Mbps uplink, the proposed progressive scheme significantly reduces network latency compared to full-edge and full-cloud solutions, while maintaining high semantic consistency even under high compression. The implementation code will be released upon publication at https://github.com/open-ep/ProSemComVLM.
Summary
This paper addresses the challenge of deploying Vision-Language Models (VLMs) on edge devices by proposing a progressive semantic communication framework. The framework uses a Meta AutoEncoder to compress visual tokens into adaptive, progressively refinable representations, allowing flexible transmission and reducing latency. Experiments show that the proposed method significantly reduces network latency compared to full-edge and full-cloud solutions while maintaining high semantic consistency even under high compression.
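The trade-off between communication cost and semantic fidelity described above can be illustrated with a toy sender that decides how many refinement levels fit a latency budget. This is only a sketch under stated assumptions: the function name, the per-level payload sizes, and the latency budget are hypothetical, and it does not reproduce the paper's Meta AutoEncoder.

```python
def levels_within_budget(level_bytes, uplink_bps, latency_budget_s):
    """Count how many leading refinement levels can be sent within a latency budget.

    level_bytes: payload size of each refinement level, coarsest first (hypothetical values).
    Returns (number_of_levels_sent, bytes_used).
    """
    budget_bytes = uplink_bps / 8 * latency_budget_s  # bits per second -> bytes available
    sent, used = 0, 0
    for size in level_bytes:
        if used + size > budget_bytes:
            break  # the next level would exceed the budget, so stop refining
        used += size
        sent += 1
    return sent, used


# Hypothetical payload sizes for three refinement levels, sent over a 1 Mbps uplink.
levels = [20_000, 40_000, 80_000]
n_levels, used = levels_within_budget(levels, uplink_bps=1_000_000, latency_budget_s=0.5)
print(n_levels, used)
```

Because levels are ordered coarsest first, the receiver always gets a usable representation, and extra bandwidth only adds detail.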
Delta Score Matters! Spatial Adaptive Multi Guidance in Diffusion Models
Authors: Haosen Li, Wenshuo Chen, Lei Wang, Shaofeng Liang, Bowen Tian, Soning Lai, Yutao Yue
First: 2026-04-29T10:08:08+00:00 · Latest: 2026-04-29T10:08:08+00:00
Abstract
Diffusion models have achieved remarkable success in synthesizing complex static and temporal visuals, a breakthrough largely driven by Classifier-Free Guidance (CFG). However, despite its pivotal role in aligning generated content with textual prompts, standard CFG relies on a globally uniform scalar. This homogeneous amplification traps models in a well-documented "detail-artifact dilemma": low guidance scales fail to inject intricate semantics, while high scales inevitably cause structural degradation, color over-saturation, and temporal inconsistencies in videos. In this paper, we expose the physical root of this flaw through the lens of differential geometry. By analyzing Tweedie's Formula, we reveal that CFG intrinsically performs a tangential linear extrapolation. Because the natural data manifold is highly curved, this uniform linear step introduces a severe orthogonal deviation. To keep the generation trajectory safely bounded, we formulate a theoretical upper bound for spatial and adaptive guidance. Based on these geometric insights, we propose Spatial Adaptive Multi Guidance (SAMG), a training-free and virtually zero-cost sampling algorithm. SAMG dynamically computes point-wise conditional guidance energy, applying a conservative minimum scale to high-energy boundary regions to preserve delicate micro-textures, while deploying an aggressive maximum scale in low-energy regions to maximize semantic injection. Extensive experiments across diverse image (SD 1.5, SDXL, SD3.5 Medium) and video (CogVideoX, ModelScope) architectures demonstrate that SAMG effectively resolves the detail-artifact dilemma, achieving superior semantic alignment, structural integrity, and temporal smoothness without any computational overhead.
Summary
This paper addresses the "detail-artifact dilemma" in diffusion models by proposing Spatial Adaptive Multi Guidance (SAMG), a training-free sampling algorithm. It analyzes the globally uniform scale used in Classifier-Free Guidance (CFG) and shows, via Tweedie's formula, that CFG performs a tangential linear extrapolation that deviates from the highly curved natural data manifold. SAMG computes point-wise conditional guidance energy and adapts the guidance scale spatially, applying a conservative minimum scale in high-energy boundary regions and an aggressive maximum scale in low-energy regions. Experiments on image and video models show that SAMG resolves the dilemma, improving semantic alignment, structural integrity, and temporal smoothness at virtually no additional computational cost.
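The contrast between a single global CFG scale and a spatially varying one can be sketched in a few lines of NumPy. The abstract only states that high-energy regions receive a minimum scale and low-energy regions a maximum scale; the energy definition (squared norm of the conditional minus unconditional prediction), the linear ramp between the two scales, and the values of `w_min` and `w_max` below are assumptions made for illustration, not the paper's formulation.

```python
import numpy as np


def spatially_adaptive_cfg(eps_uncond, eps_cond, w_min=3.0, w_max=9.0):
    """Apply a per-pixel guidance scale instead of one global scalar.

    eps_uncond, eps_cond: noise predictions of shape (C, H, W).
    Returns the guided prediction and the (H, W) scale map.
    """
    delta = eps_cond - eps_uncond
    energy = np.sum(delta ** 2, axis=0)  # assumed point-wise guidance energy, shape (H, W)
    span = energy.max() - energy.min()
    norm = (energy - energy.min()) / span if span > 0 else np.zeros_like(energy)
    # Assumed linear ramp: highest energy -> w_min, lowest energy -> w_max.
    scale = w_max - (w_max - w_min) * norm
    guided = eps_uncond + scale[None] * delta  # standard CFG update with a spatial scale
    return guided, scale


eps_u = np.zeros((1, 2, 2))
eps_c = np.array([[[0.0, 1.0], [2.0, 3.0]]])
guided, scale = spatially_adaptive_cfg(eps_u, eps_c)
print(scale)
```

Setting `w_min == w_max` recovers ordinary CFG with a uniform scale, which makes the sketch easy to compare against a baseline sampler.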
A Multistage Extraction Pipeline for Long Scanned Financial Documents: An Empirical Study in Industrial KYC Workflows
Authors: Yuxuan Han, Yuanxing Zhang, Yushuo Wang, Yichao Jin, Kenneth Zhu Ke, Jingyuan Zhao
First: 2026-04-29T09:19:16+00:00 · Latest: 2026-04-29T09:19:16+00:00
Abstract
Structured information extraction from long, multilingual scanned financial documents is a core requirement in industrial KYC and compliance workflows. These documents are typically non-machine-readable, noisy, and visually heterogeneous. They usually span dozens of pages while containing only sparse task-relevant information. Although recent vision-language models achieve strong benchmark performance, directly applying them end-to-end to full financial reports often leads to unreliable extraction under real-world conditions. We present a multistage extraction framework that integrates image preprocessing, multilingual OCR, hybrid page-level retrieval, and compact VLM-based structured extraction. The design separates page localization from multimodal reasoning, enabling more accurate extraction from complex multipage documents. We evaluated the framework on 120 production KYC documents comprising about 3000 multilingual scanned pages. Across multiple OCR-VLM combinations, the proposed pipeline consistently outperforms direct PDF-to-VLM baselines, improving field-level accuracy by up to 31.9 percentage points. The best configuration, PaddleOCR with MiniCPM2.6, achieves 87.27% accuracy. Ablation studies show that page-level retrieval is the dominant factor in performance improvements, particularly for complex financial statements and non-English documents.
Delineating Knowledge Boundaries for Honest Large Vision-Language Models
Authors: Junru Song, Yimeng Hu, Yijing Chen, Huining Li, Qian Li, Lizhen Cui, Yuntao Du
First: 2026-04-29T08:29:44+00:00 · Latest: 2026-04-29T08:29:44+00:00
Abstract
Large Vision-Language Models (VLMs) have achieved remarkable multimodal performance yet remain prone to factual hallucinations, particularly in long-tail or specialized domains. Moreover, current models exhibit a weak capacity to refuse queries that exceed their parametric knowledge. In this paper, we propose a systematic framework to enhance the refusal capability of VLMs when facing such unknown questions. We first curate a model-specific "Visual-Idk" (Visual-I don't know) dataset, leveraging multi-sample consistency probing to distinguish between known and unknown facts. We then align the model using supervised fine-tuning followed by preference-aware optimization (e.g., DPO, ORPO) to effectively delineate its knowledge boundaries. Results on the Visual-Idk dataset show our method improves the Truthful Rate from 57.9% to 67.3%. Internal probing further demonstrates that the model genuinely recognizes its boundaries instead of just memorizing refusal patterns. Our framework further generalizes to out-of-distribution medical and perceptual domains, providing a robust path toward more trustworthy and prudent visual assistants.
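Multi-sample consistency probing can be pictured as sampling several answers to the same question and checking how often they agree with the reference. The sketch below is a minimal illustration: the exact-match comparison, the 0.5 threshold, and the function name are assumptions, since the abstract does not specify how consistency is scored.

```python
def label_known_unknown(sampled_answers, reference, threshold=0.5):
    """Label a question 'known' if enough sampled answers match the reference.

    sampled_answers: answers drawn from the model for one question (e.g. at temperature > 0).
    threshold: assumed minimum match rate for a fact to count as known.
    Returns (label, match_rate).
    """
    def normalize(text):
        return text.strip().lower()

    hits = sum(normalize(a) == normalize(reference) for a in sampled_answers)
    rate = hits / len(sampled_answers)
    return ("known" if rate >= threshold else "unknown"), rate


print(label_known_unknown(["Paris", "paris ", "Lyon", "Paris"], "Paris"))
```

Questions labeled "unknown" would then be paired with a refusal target when building the model-specific dataset, while "known" questions keep their factual answer.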
Decoupled Prototype Matching with Vision Foundation Models for Few-Shot Industrial Object Detection
Authors: Hari Prasanth S. M., Nilusha Jayawickrama, Risto Ojala
First: 2026-04-29T08:16:01+00:00 · Latest: 2026-04-29T08:16:01+00:00
Comments: This article has been submitted to the Journal of Intelligent Manufacturing and is currently under review
Abstract
Industrial object detection systems typically rely on large annotated datasets, which are expensive to collect and challenging to maintain in industrial scenarios where the inventory of objects changes frequently. This work addresses the challenge of few-shot object detection in such industrial scenarios, where only a limited number of labeled samples are available for newly introduced objects. We present a detection framework that leverages vision foundation models to recognize objects with minimal supervision. The method constructs class prototypes from a small set of reference samples by extracting feature representations. For a given query scene during inference, object regions are generated using a segmentation model, and feature embeddings are extracted and matched with class prototypes using similarity matching. We evaluate the detection method on three established industrial datasets from the Benchmark for 6D Object Pose Estimation, following the official 2D object detection evaluation protocol. We demonstrate competitive detection performance, improving AP by 6.9% compared to the state-of-the-art training-free detection methods. Furthermore, the presented method is able to onboard new objects using only a few reference images, without requiring any CAD models or large annotated datasets. These properties make the approach well-suited for real-world industrial applications.
Summary
This work aims to address the challenge of few-shot object detection in industrial settings where only limited labeled samples are available. The proposed method uses vision foundation models to construct class prototypes from a small set of reference samples and matches feature embeddings with these prototypes for detection. The method achieves competitive performance, improving AP by 6.9% compared to state-of-the-art training-free methods and can onboard new objects with just a few reference images without requiring CAD models or large annotated datasets.
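Prototype construction and similarity matching, as described above, can be sketched with plain NumPy. The choices here are assumptions for illustration: prototypes are the mean of L2-normalized reference embeddings, matching uses cosine similarity, and `min_sim` is a hypothetical rejection threshold. The embeddings themselves would come from a vision foundation model, which is not shown.

```python
import numpy as np


def l2_normalize(x):
    """Scale vectors along the last axis to unit length."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def build_prototypes(ref_embeddings):
    """ref_embeddings: dict class_name -> array-like (n_refs, d). Returns (names, prototypes)."""
    names = sorted(ref_embeddings)
    protos = np.stack([
        l2_normalize(l2_normalize(np.asarray(ref_embeddings[n], dtype=float)).mean(axis=0))
        for n in names
    ])
    return names, protos


def match_regions(region_embeddings, names, protos, min_sim=0.5):
    """Assign each region the most similar class, or None if below min_sim."""
    sims = l2_normalize(np.asarray(region_embeddings, dtype=float)) @ protos.T
    best = sims.argmax(axis=1)
    return [
        (names[j] if sims[i, j] >= min_sim else None, float(sims[i, j]))
        for i, j in enumerate(best)
    ]


refs = {"bolt": [[1.0, 0.0], [0.9, 0.1]], "nut": [[0.0, 1.0]]}
names, protos = build_prototypes(refs)
print(match_regions([[1.0, 0.05], [0.0, 2.0], [-1.0, -1.0]], names, protos))
```

Onboarding a new object in this sketch only requires adding its reference embeddings to the dictionary and rebuilding the prototype matrix; no detector weights change.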
Topology-Aware Representation Alignment for Semi-Supervised Vision-Language Learning
Authors: Junwon You, Mihyun Jang, Sangwoo Mo, Jae-Hun Jung
First: 2026-04-29T07:30:33+00:00 · Latest: 2026-04-29T07:30:33+00:00
Comments: 30 pages, 10 figures, 24 tables
Abstract
Vision-language models have shown strong performance, but they often generalize poorly to specialized domains. While semi-supervised vision-language learning mitigates this limitation by leveraging a small set of labeled image-text pairs together with abundant unlabeled images, existing methods remain fundamentally pairwise and fail to model the global structure of multimodal representation manifolds. Existing topology-based alignment methods rely on persistence diagram matching, which neither guarantees geometric alignment nor utilizes the image-text pairing information central to vision-language learning. We propose Topology-Aware Multimodal Representation Alignment (ToMA), a framework that uses persistent homology to identify topologically salient edges and aligns them across modalities through available cross-modal correspondences. ToMA leverages both H_0-death edges and lightweight H_1-birth edges, allowing it to capture both connectivity and cycle structure without constructing 2-simplices. Experiments show that ToMA yields stable gains, with clear improvements on remote sensing and modest but consistent benefits on fashion retrieval. Additional analysis shows that ToMA is more stable than alternative topology-based objectives and that lightweight H_1-birth edges provide useful higher-order structural signals.
Summary
The research aims to improve the generalization of vision-language models to specialized domains by addressing the limitations of existing semi-supervised learning methods. ToMA, a new framework, uses persistent homology to align topologically salient edges across image and text modalities, leveraging both H_0-death and H_1-birth edges without constructing 2-simplices. Experiments demonstrate stable performance gains on remote sensing tasks and modest improvements on fashion retrieval, with ToMA showing greater stability compared to other topology-based methods and highlighting the utility of lightweight H_1-birth edges for higher-order structural signals.
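For a Vietoris-Rips filtration on a point cloud, the edges at which H_0 classes die are exactly the edges of a minimum spanning tree, so they can be found with Kruskal's algorithm and no simplices beyond edges. The sketch below computes only those edges for one modality's embeddings; the function name is hypothetical, and the cross-modal alignment loss and the H_1-birth edges used by ToMA are not reproduced here.

```python
import numpy as np


def h0_death_edges(points):
    """Return the MST edges (i, j, length) of a point cloud, in order of increasing length.

    Each returned edge merges two connected components, i.e. one H_0 class dies at that scale.
    """
    points = np.asarray(points, dtype=float)
    n = len(points)
    dist = np.linalg.norm(points[:, None] - points[None, :], axis=-1)
    edges = sorted((dist[i, j], i, j) for i in range(n) for j in range(i + 1, n))

    parent = list(range(n))

    def find(a):
        # Union-find root lookup with path halving.
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    out = []
    for d, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:  # edge connects two separate components
            parent[ri] = rj
            out.append((i, j, float(d)))
    return out


print(h0_death_edges([[0.0, 0.0], [1.0, 0.0], [5.0, 0.0]]))
```

An alignment objective could then compare the lengths of these edges with the distances between the paired points in the other modality, though how ToMA weights or selects them is not stated in the abstract.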
Causal Disentanglement for Full-Reference Image Quality Assessment
Authors: Zhen Zhang, Jielei Chu, Tian Zhang, Fengmao Lv, Tianrui Li
First: 2026-04-23T13:18:13+00:00 · Latest: 2026-04-29T07:26:38+00:00
Abstract
Existing deep network-based full-reference image quality assessment (FR-IQA) models typically work by performing pairwise comparisons of deep features from the reference and distorted images. In this paper, we approach this problem from a different perspective and propose a novel FR-IQA paradigm based on causal inference and decoupled representation learning. Unlike typical feature comparison-based FR-IQA models, our approach formulates degradation estimation as a causal disentanglement process guided by intervention on latent representations. We first decouple degradation and content representations by exploiting the content invariance between the reference and distorted images. Second, inspired by the human visual masking effect, we design a masking module to model the causal relationship between image content and degradation features, thereby extracting content-influenced degradation features from distorted images. Finally, quality scores are predicted from these degradation features using either supervised regression or label-free dimensionality reduction. Extensive experiments demonstrate that our method achieves highly competitive performance on standard IQA benchmarks across fully supervised, few-label, and label-free settings. Furthermore, we evaluate the approach on diverse non-standard natural image domains with scarce data, including underwater, radiographic, medical, neutron, and screen-content images. Benefiting from its ability to perform scenario-specific training and prediction without labeled IQA data, our method exhibits superior cross-domain generalization compared to existing training-free FR-IQA models.
Summary
This paper proposes a novel full-reference image quality assessment (FR-IQA) method based on causal inference and decoupled representation learning. Unlike traditional feature comparison methods, it formulates degradation estimation as a causal disentanglement process. The method first decouples degradation and content representations, then uses a masking module to extract content-influenced degradation features, and finally predicts quality scores. Experiments show that the proposed method performs well across various settings and diverse image domains, demonstrating superior cross-domain generalization compared to existing models.
Beyond Fixed Formulas: Data-Driven Linear Predictor for Efficient Diffusion Models
Authors: Zhirong Shen, Rui Huang, Jiacheng Liu, Chang Zou, Peiliang Cai, Shikang Zheng, Zhengyi Shi, Liang Feng, Linfeng Zhang
Venue: CVPR 2026
First: 2026-04-29T07:22:16+00:00 · Latest: 2026-04-29T07:22:16+00:00
Comments: Accepted by CVPR 2026
Abstract
To address the high sampling cost of Diffusion Transformers (DiTs), feature caching offers a training-free acceleration method. However, existing methods rely on hand-crafted forecasting formulas that fail under aggressive skipping. We propose L2P (Learnable Linear Predictor), a simple data-driven caching framework that replaces fixed coefficients with learnable per-timestep weights. Rapidly trained in ~20 seconds on a single GPU, L2P accurately reconstructs current features from past trajectories. L2P significantly outperforms existing baselines: it achieves a 4.55x FLOPs reduction and 4.15x latency speedup on FLUX.1-dev, and maintains high visual fidelity under up to 7.18x acceleration on Qwen-Image models, where prior methods show noticeable quality degradation. Our results show learning linear predictors is highly effective for efficient DiT inference. Code is available at https://github.com/Aredstone/L2P-Cache.
Summary
The research aims to reduce the high sampling cost of Diffusion Transformers by proposing L2P (Learnable Linear Predictor), which replaces fixed coefficients with learnable per-timestep weights. L2P is trained quickly in about 20 seconds on a single GPU and significantly outperforms existing methods, achieving a 4.55x FLOPs reduction and 4.15x latency speedup on FLUX.1-dev, while maintaining high visual fidelity even under up to 7.18x acceleration on Qwen-Image models, where previous methods show quality degradation.
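The idea of replacing fixed forecasting coefficients with learned ones can be shown on a toy feature trajectory. In this sketch the weights for a single timestep are fitted by closed-form least squares, which is a stand-in chosen for brevity; the abstract says only that L2P learns per-timestep weights in about 20 seconds, not how they are optimized. Function names and the toy data are hypothetical.

```python
import numpy as np


def fit_step_weights(past, target):
    """Fit weights w so that target is approximated by sum_k w[k] * past[k].

    past: (K, N) array of K cached past features, each flattened to length N.
    target: (N,) feature actually computed at the current timestep.
    """
    w, *_ = np.linalg.lstsq(past.T, target, rcond=None)
    return w


def predict_feature(past, w):
    """Reconstruct the current feature from cached ones instead of running the block."""
    return w @ past


# Toy trajectory that is linear in the timestep: f(t) = a + b * t.
a = np.array([1.0, 2.0, 3.0, 4.0])
b = np.array([0.5, -1.0, 2.0, 0.0])
f1, f2, f3 = a + b, a + 2 * b, a + 3 * b
w = fit_step_weights(np.stack([f2, f1]), f3)
print(w)
```

On this linear toy trajectory the fitted weights coincide with the classic fixed extrapolation 2·f(t−1) − f(t−2); on real diffusion features the learned weights would be free to differ per timestep.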
VLM Judges Can Rank but Cannot Score: Task-Dependent Uncertainty in Multimodal Evaluation
Authors: Divake Kumar, Sina Tayebati, Devashri Naik, Ranganath Krishnan, Amit Ranjan Trivedi
First: 2026-04-28T05:30:18+00:00 · Latest: 2026-04-29T07:06:15+00:00
Abstract
Vision-language models (VLMs) are increasingly used as automated judges for multimodal systems, yet their scores provide no indication of reliability. We study this problem through conformal prediction, a distribution-free framework that converts a judge's point score into a calibrated prediction interval using only score-token log-probabilities, with no retraining. We present the first systematic analysis of conformal prediction for VLM-as-a-Judge across 3 judges and 14 visual task categories. Our results show that evaluation uncertainty is strongly task-dependent: intervals cover ~40% of the score range for aesthetics and natural images but expand to ~70% for chart and mathematical reasoning, yielding a quantitative reliability map for multimodal evaluation. We further identify a failure mode not captured by standard evaluation metrics, ranking-scoring decoupling, where judges achieve high ranking correlation while producing wide, uninformative intervals, correctly ordering responses but failing to assign reliable absolute scores. Finally, we show that interval width is driven primarily by task difficulty and annotation quality, i.e., the same judge and method yield 4.5x narrower intervals on a clean, multi-annotator captioning benchmark. Code: https://github.com/divake/VLM-Judge-Uncertainty
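Turning a judge's point score into a calibrated interval can be illustrated with standard split conformal prediction. Note the substitution: the paper derives its intervals from score-token log-probabilities, whereas this sketch uses the simpler absolute-residual nonconformity score on a held-out calibration set. The 1 to 10 score range and the function names are assumptions.

```python
import numpy as np


def conformal_halfwidth(cal_pred, cal_true, alpha=0.1):
    """Split conformal: half-width q such that |true - pred| <= q with coverage >= 1 - alpha.

    cal_pred, cal_true: judge scores and human scores on a calibration set.
    """
    scores = np.abs(np.asarray(cal_pred, dtype=float) - np.asarray(cal_true, dtype=float))
    n = len(scores)
    k = int(np.ceil((n + 1) * (1 - alpha)))  # finite-sample corrected rank
    if k > n:
        return float("inf")  # too few calibration points for the requested coverage
    return float(np.sort(scores)[k - 1])


def score_interval(pred, q, lo=1.0, hi=10.0):
    """Clip the interval [pred - q, pred + q] to the assumed score range."""
    return max(lo, pred - q), min(hi, pred + q)


q = conformal_halfwidth([5] * 9, [5, 6, 4, 7, 3, 5, 6, 4, 8], alpha=0.25)
print(q, score_interval(6.0, q))
```

The ratio of interval width to the score range gives the kind of per-task reliability figure the abstract reports, and a judge can rank responses well while this width remains large.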
Efficient, VRAM-Constrained xLM Inference on Clients
Authors: Aditya Ukarande, Deep Shekhar, Marc Blackstein, Ram Rangan
First: 2026-04-29T06:35:35+00:00 · Latest: 2026-04-29T06:35:35+00:00
Comments: Accepted at MLSys 2026 (Industry Track). 17 pages, 7 figures, 9 tables. Code and artifacts available at: https://github.com/deepshnv/pipeshard-mlsys26-ae
Abstract
To usher in the next round of client AI innovation, there is an urgent need to enable efficient, lossless inference of high-accuracy large language models (LLMs) and vision language models (VLMs), jointly referred to as xLMs, on client systems. To address this, we present pipelined sharding, a novel, benchmark-profile-guided CPU-GPU hybrid scheduling technique to achieve efficient, VRAM-constrained inference for both dense and mixture-of-experts (MoE) LLMs. Using a combination of model sharding at the sub-layer level, CPU offloading, pipelined copy-compute, and prioritized tensor placement in VRAM, it optimizes both time-to-first-token (TTFT) and tokens per second (TPS) metrics, while flexibly adapting to system and inference conditions. For efficient, high-accuracy VLM inference, we combine pipelined sharding with a llama.cpp implementation of three well-understood prior ideas (jointly called VLMOpt), namely, vision tensor CPU offloading, flash attention, and vision and language model VRAM overlap avoidance.
These enhancements are targeted at improving client xLM inference in future releases of two important NVIDIA products - the In-Game Inferencing software development kit (IGI SDK) and the Cosmos-Reason1 (CR1) physical AI reasoning VLM. Highlights from our rigorous evaluation spanning multiple models and client systems include: for interactive use, TTFT improves by up to 6.7x and TPS by up to 30x for LLMs, and CR1 inference's VRAM demand is down by 10x, while in batched mode, throughput improves by up to 8.2x, all compared to their respective aggressive baselines. This paper is accepted at the 9th MLSys Conference (Industry Track), 2026. Code and artifact available at: https://github.com/deepshnv/pipeshard-mlsys26-ae
DSIPA: Detecting LLM-Generated Texts via Sentiment-Invariant Patterns Divergence Analysis
Authors: Siyuan Li, Aodu Wulianghai, Guangyan Li, Xi Lin, Qinghua Mao, Yuliang Chen, Jun Wu, Jianhua Li
First: 2026-04-29T06:22:08+00:00 · Latest: 2026-04-29T06:22:08+00:00
Abstract
The rapid advancement of large language models (LLMs) presents new security challenges, particularly in detecting machine-generated text used for misinformation, impersonation, and content forgery. Most existing detection approaches struggle with robustness against adversarial perturbation, paraphrasing attacks, and domain shifts, often requiring restrictive access to model parameters or large labeled datasets. To address this, we propose DSIPA, a novel training-free framework that detects LLM-generated content by quantifying sentiment distributional stability under controlled stylistic variation. It is based on the observation that LLMs typically exhibit more emotionally consistent outputs, while human-written texts display greater affective variation. Our framework operates in a zero-shot, black-box manner, leveraging two unsupervised metrics, sentiment distribution consistency and sentiment distribution preservation, to capture these intrinsic behavioral asymmetries without the need for parameter updates or probability access. Extensive experiments are conducted on state-of-the-art proprietary and open-source models, including GPT-5.2, Gemini-1.5-pro, Claude-3, and LLaMa-3.3. Evaluations on five domains, such as news articles, programming code, student essays, academic papers, and community comments, demonstrate that DSIPA improves F1 detection scores by up to 49.89% over baseline methods. The framework exhibits superior generalizability across domains and strong resilience to adversarial conditions, providing a robust and interpretable behavioral signal for secure content identification in the evolving LLM landscape.
Summary
DSIPA is a training-free framework that detects LLM-generated text by analyzing sentiment distributional stability under controlled stylistic variation. It leverages sentiment distribution consistency and preservation to identify the more emotionally consistent outputs of LLMs compared to human-written texts. Experiments on various domains show that DSIPA improves F1 detection scores by up to 49.89% over baseline methods, demonstrating its robustness and generalizability across different LLMs and domains.
FLARE: Fully Integration of Vision-Language Representations for Deep Cross-Modal Understanding
Authors: Zheng Liu, Mengjie Liu, Jingzhou Chen, Jingwei Xu, Bin Cui, Conghui He, Wentao Zhang
First: 2025-04-14T06:33:29+00:00 · Latest: 2026-04-29T06:12:36+00:00
Abstract
We introduce FLARE, a family of vision language models (VLMs) with a full vision-language alignment and integration paradigm. Unlike existing approaches that rely on single MLP projectors for modality alignment and defer cross-modal interaction to LLM decoding, FLARE achieves deep, dynamic integration throughout the pipeline. Our key contributions include: (1) Text-Guided Vision Encoding that incorporates textual information during vision encoding to achieve pixel-level alignment; (2) Context-Aware Alignment Decoding that aggregates visual features conditioned on textual context during decoding for query-level integration; (3) Dual-Semantic Mapping Loss to supervise feature mapping from both modalities and enable modality-level bridging; and (4) Text-Driven VQA Synthesis that leverages high-quality text to generate VQA pairs and synthesize corresponding images, enabling data-level optimization. We train FLARE at 3B and 8B scales under both fixed and dynamic resolution settings, demonstrating that our full-modality alignment significantly outperforms existing methods while maintaining strong generalizability. FLARE 3B surpasses Cambrian-1 8B and Florence-VL 8B using only 630 vision tokens. Ablation studies reveal that FLARE achieves superior performance over existing methods with minimal computational cost. Even without dynamic resolution, FLARE outperforms LLaVA-NeXT, validating the effectiveness of our approach. We release our code, model weights, and dataset at https://github.com/starriver030515/FLARE.
Summary
FLARE is a vision-language model that integrates vision and language deeply throughout the pipeline, achieving pixel-level alignment and query-level integration. It introduces Text-Guided Vision Encoding, Context-Aware Alignment Decoding, Dual-Semantic Mapping Loss, and Text-Driven VQA Synthesis. FLARE outperforms existing methods at 3B and 8B scales, demonstrating strong generalizability and minimal computational cost. Even without dynamic resolution, FLARE surpasses LLaVA-NeXT, validating its effectiveness.
ChartVerse: Scaling Chart Reasoning via Reliable Programmatic Synthesis from Scratch
Authors: Zheng Liu, Honglin Lin, Chonghan Qin, Xiaoyang Wang, Xin Gao, Yu Li, Mengzhang Cai, Yun Zhu, Zhanping Zhong, Qizhi Pei, Zhuoshi Pan, Xiaoran Shang, Bin Cui, Conghui He, Wentao Zhang, Lijun Wu
First: 2026-01-20T05:11:44+00:00 · Latest: 2026-04-29T05:49:25+00:00
Comments: 29 pages
Abstract
Chart reasoning is a critical capability for Vision Language Models (VLMs). However, the development of open-source models is severely hindered by the lack of high-quality training data. Existing datasets suffer from a dual challenge: synthetic charts are often simplistic and repetitive, while the associated QA pairs are prone to hallucinations and lack the reasoning depth required for complex tasks. To bridge this gap, we propose ChartVerse, a scalable framework designed to synthesize complex charts and reliable reasoning data from scratch. (1) To address the bottleneck of simple patterns, we first introduce Rollout Posterior Entropy (RPE), a novel metric that quantifies chart complexity. Guided by RPE, we develop a complexity-aware chart coder to autonomously synthesize diverse, high-complexity charts via executable programs. (2) To guarantee reasoning rigor, we develop truth-anchored inverse QA synthesis. Diverging from standard generation, we adopt an answer-first paradigm: we extract deterministic answers directly from the source code, generate questions conditional on these anchors, and enforce strict consistency verification. To further elevate difficulty and reasoning depth, we filter samples based on model fail-rate and distill high-quality Chain-of-Thought (CoT) reasoning. We curate ChartVerse-SFT-600K and ChartVerse-RL-40K using Qwen3-VL-30B-A3B-Thinking as the teacher. Experimental results demonstrate that ChartVerse-8B achieves state-of-the-art performance, notably surpassing its teacher and rivaling the stronger Qwen3-VL-32B-Thinking. We release our code, model weights, and datasets at https://chartverse.github.io.
VLN-Cache: Enabling Token Caching for VLN Models with Visual/Semantic Dynamics Awareness
Authors: Zihao Zheng, Zhihao Mao, Xingyue Zhou, Jiayu Chen, Maoliang Li, Xinhao Sun, Hailong Zou, Zhaobo Zhang, Xuanzhe Liu, Donggang Cao, Hong Mei, Xiang Chen
First: 2026-03-07T07:30:35+00:00 · Latest: 2026-04-29T05:35:06+00:00
Abstract
Vision-and-Language Navigation (VLN) increasingly relies on large vision-language models, but their inference cost conflicts with real-time deployment. Token caching is a promising training-free strategy that avoids redundant computation by reusing stable visual tokens across frames. However, existing methods assume a static camera and fixed semantic focus, assumptions that VLN fundamentally violates. We identify two failure modes: (1) visual dynamics, where viewpoint shift displaces token positions across frames, causing position-wise matching to pair misaligned content; (2) semantic dynamics, where token relevance shifts across task stages as navigation progresses, making cached states stale. We propose VLN-Cache, a visual-dynamic-aware and semantic-dynamic-aware caching framework that introduces view-aligned remapping to recover geometric correspondences and a task-relevance saliency filter to veto reuse at semantic transitions. A layer-adaptive entropy policy further balances the per-layer reuse budget. Experiments on the R2R-CE simulation benchmark show up to 1.52x speedup while maintaining competitive navigation success rates.
Summary
The research addresses the challenge of high inference costs in Vision-and-Language Navigation (VLN) models by proposing VLN-Cache, a caching framework that accounts for visual and semantic dynamics. It introduces view-aligned remapping to handle viewpoint shifts and a task-relevance saliency filter to manage semantic transitions. The study demonstrates up to 1.52x speedup in inference time with comparable navigation success rates on the R2R-CE benchmark.
CoFL: Continuous Flow Fields for Language-Conditioned Navigation
Authors: Haokun Liu, Zhaoqi Ma, Yicheng Chen, Masaki Kitagawa, Wentao Zhang, Zicen Xiong, Jinjie Li, Moju Zhao
First: 2026-03-03T11:02:55+00:00 · Latest: 2026-04-29T04:47:16+00:00
Comments: 18 pages, 13 figures
Abstract
Existing language-conditioned navigation systems typically rely on modular pipelines or trajectory generators, but the latter use each scene--instruction annotation mainly to supervise one start-conditioned rollout. To address these limitations, we present CoFL, an end-to-end policy that maps a bird's-eye view (BEV) observation and a language instruction to a continuous flow field for navigation. CoFL reformulates navigation as workspace-conditioned field learning rather than start-conditioned trajectory prediction: it learns local motion vectors at arbitrary BEV locations, turning each scene--instruction annotation into dense spatial control supervision. Trajectories are generated from any start by numerical integration of the predicted field, enabling simple real-time rollout and closed-loop recovery. To enable large-scale training and evaluation, we build a dataset of over 500k BEV image--instruction pairs, each procedurally annotated with a flow field and a trajectory derived from semantic maps built on Matterport3D and ScanNet. Evaluating on strictly unseen scenes, CoFL significantly outperforms modular Vision-Language Model (VLM)-based planners and trajectory generation policies in both navigation precision and safety, while maintaining real-time inference. Finally, we deploy CoFL zero-shot in real-world experiments with BEV observations across multiple layouts, maintaining feasible closed-loop control and a high success rate.
中文标题/摘要
标题:CoFL:连续流场用于语言条件导航
现有的语言条件导航系统通常依赖于模块化管道或轨迹生成器,但后者主要利用每个场景-指令注释监督一次起始条件下的展开。为了解决这些限制,我们提出了CoFL,这是一种端到端的策略,将鸟瞰图(BEV)观察和语言指令映射到一个连续的流场以进行导航。CoFL将导航重新表述为工作空间条件下的场学习,而不是起始条件下的轨迹预测:它在BEV的任意位置学习局部运动向量,将每个场景-指令注释转化为密集的空间控制监督。轨迹通过预测场的数值积分从任何起始点生成,这使得实时展开和闭环恢复变得简单。为了实现大规模训练和评估,我们构建了一个包含超过50万张BEV图像-指令对的数据集,每个数据对都通过Matterport3D和ScanNet上的语义地图程序化注释了一个流场和轨迹。在严格未见过的场景上评估,CoFL在导航精度和安全性方面显著优于基于模块化视觉-语言模型(VLM)的规划者和轨迹生成策略,同时保持实时推理。最后,我们在多个布局的BEV观察中零样本部署CoFL,保持了可行的闭环控制和高成功率。
Summary / 总结
CoFL is an end-to-end policy that maps BEV observations and language instructions to a continuous flow field for navigation, addressing limitations of modular pipelines and trajectory generators. It learns local motion vectors at arbitrary locations, providing dense spatial control supervision and enabling real-time trajectory generation through numerical integration. CoFL outperforms VLM-based planners and trajectory generation policies in navigation precision and safety, and maintains real-time inference even on unseen scenes. It also demonstrates feasibility in real-world experiments with multiple layouts.
CoFL 是一个端到端的策略,将鸟瞰图观察和语言指令映射到连续的流场以实现导航,解决了模块化管道和轨迹生成器的局限性。它在任意位置学习局部运动向量,提供密集的空间控制监督,并通过数值积分生成实时轨迹。CoFL 在导航精度和安全性方面优于基于视觉-语言模型的规划器和轨迹生成策略,并且即使在未见过的场景中也能保持实时推理。此外,它还在多个布局的现实世界实验中展示了可行性。
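The CoFL abstract states that trajectories are generated from any start by numerical integration of the predicted field. The sketch below illustrates that idea with forward-Euler integration. It is not the authors' implementation: `toy_field` is a hypothetical stand-in for the learned field (it simply points toward an assumed goal), and `GOAL`, `step`, `tol`, and `max_steps` are illustrative values.

```python
import math

# Assumed goal position in BEV coordinates (illustrative only).
GOAL = (4.0, 3.0)


def toy_field(x, y):
    """Hypothetical stand-in for a predicted flow field.

    Returns a unit motion vector pointing from (x, y) toward GOAL.
    A learned model would instead predict this vector from the BEV
    observation and the language instruction.
    """
    dx, dy = GOAL[0] - x, GOAL[1] - y
    norm = math.hypot(dx, dy)
    if norm < 1e-9:
        return 0.0, 0.0
    return dx / norm, dy / norm


def rollout(field, start, step=0.1, tol=0.1, max_steps=200):
    """Integrate the field from `start` with forward-Euler steps.

    Stops once the position is within `tol` of GOAL or after
    `max_steps` steps, and returns the visited positions.
    """
    x, y = start
    path = [(x, y)]
    for _ in range(max_steps):
        if math.hypot(GOAL[0] - x, GOAL[1] - y) <= tol:
            break
        vx, vy = field(x, y)
        x, y = x + step * vx, y + step * vy
        path.append((x, y))
    return path


# The same field yields a trajectory from any start position.
path_a = rollout(toy_field, start=(0.0, 0.0))
path_b = rollout(toy_field, start=(8.0, -1.0))
```

Because the field is defined at every location, restarting the rollout from a perturbed position needs no replanning, which is one plausible reading of the closed-loop recovery the abstract mentions.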
Tell Model Where to Look: Mitigating Hallucinations in MLLMs by Vision-Guided Attention
Authors: Jianfei Zhao, Feng Zhang, Xin Sun, Chong Feng, Zhixing Tan
Venue: CVPR 2026
First: 2025-11-25T07:58:57+00:00 · Latest: 2026-04-29T04:41:31+00:00
Comments: CVPR 2026
Abstract
Visual attention serves as the primary mechanism through which MLLMs interpret visual information; however, its limited localization capability often leads to hallucinations. We observe that although MLLMs can accurately extract visual semantics from visual tokens, they fail to fully leverage this advantage during subsequent inference. To address this limitation, we propose Vision-Guided Attention (VGA), a training-free method that first constructs precise visual grounding by exploiting the semantic content of visual tokens, and then uses this grounding to guide the model's focus toward relevant visual regions. In image captioning, VGA further refines this guidance dynamically during generation by suppressing regions that have already been described. In VGA, each token undergoes only a single forward pass, introducing a negligible latency overhead. In addition, VGA is fully compatible with efficient attention implementations such as FlashAttention. Extensive experiments across diverse MLLMs and multiple hallucination benchmarks demonstrate that VGA achieves state-of-the-art dehallucination performance. Further analysis confirms that explicit visual guidance plays a crucial role in enhancing the visual understanding capabilities of MLLMs.
中文标题/摘要
标题:让模型知道该看哪里:通过视觉引导注意力减轻MLLM幻觉
视觉注意力是MLLMs解释视觉信息的主要机制;然而,其有限的定位能力常常导致幻觉。我们观察到,尽管MLLMs可以从视觉标记中准确提取视觉语义,但在后续推理中却未能充分利用这一优势。为解决这一局限,我们提出了视觉引导注意力(VGA),这是一种无需训练的方法,首先通过利用视觉标记的语义内容构建精确的视觉接地,然后使用这种接地引导模型的注意力聚焦于相关视觉区域。在图像字幕生成中,VGA进一步在生成过程中动态细化这一指导,通过抑制已描述的区域。在VGA中,每个标记仅经历一次前向传递,引入了可忽略的延迟开销。此外,VGA完全兼容高效的注意力实现,如FlashAttention。广泛的实验表明,VGA在减轻幻觉方面达到了最先进的性能。进一步的分析证实,明确的视觉指导在增强MLLMs的视觉理解能力方面起着关键作用。
Summary / 总结
The research aims to mitigate hallucinations in multimodal large language models (MLLMs) by improving how they localize relevant visual content. Vision-Guided Attention (VGA) is proposed as a training-free method that constructs precise visual grounding using the semantic content of visual tokens and guides the model's focus to relevant visual regions. In image captioning, VGA dynamically refines this guidance by suppressing regions that have already been described. Experiments show VGA achieves state-of-the-art dehallucination performance across various MLLMs and benchmarks, confirming the importance of explicit visual guidance for visual understanding in MLLMs.
研究旨在通过增强视觉注意力能力来减轻多模态语言模型(MLLMs)中的幻觉现象。提出了一种名为Vision-Guided Attention (VGA) 的无训练方法,利用视觉标记的语义内容构建精确的视觉接地,并引导模型聚焦到相关视觉区域。在图像描述中,VGA通过抑制已描述的区域动态细化这种指导。实验表明,VGA在各种MLLMs和基准测试中的去幻觉性能达到最新水平,进一步分析证实了显式视觉指导对提升MLLMs的视觉理解能力至关重要。
CheXthought: A global multimodal dataset of clinical chain-of-thought reasoning and visual attention for chest X-ray interpretation
Authors: Sonali Sharma, Jin Long, George Shih, Sarah Eid, Christian Bluethgen, Francine L. Jacobson, Emily B. Tsai, Global Radiology Consortium, Ahmed M. Alaa, Curtis P. Langlotz
First: 2026-04-29T04:33:43+00:00 · Latest: 2026-04-29T04:33:43+00:00
Comments: 51 pages, 7 figures, 10 tables
Abstract
Chest X-ray interpretation is one of the most frequently performed diagnostic tasks in medicine and a primary target for AI development, yet current vision-language models are primarily trained on datasets of paired images and reports, not the cognitive processes and visual attention that underlie clinical reasoning. Here, we present CheXthought, a global, multimodal resource containing 103,592 chain-of-thought reasoning traces and 6,609,082 synchronized visual attention annotations across 50,312 multi-read chest X-rays from 501 radiologists in 71 countries. Our analysis reveals clinical reasoning patterns in how experts deploy distinct visual search strategies, integrate clinical context, and communicate uncertainty. We demonstrate the clinical utility of CheXthought across four dimensions. First, CheXthought reasoning significantly outperforms state-of-the-art vision-language model chain-of-thought in factual accuracy and spatial grounding. Second, visual attention data used as an inference-time hint recovers missed findings and significantly reduces hallucinations. Third, models trained on CheXthought data achieve significantly stronger pathology classification, visual faithfulness, temporal reasoning and uncertainty communication. Fourth, leveraging CheXthought's multi-reader annotations, we predict both human-human and human-AI disagreement directly from an image, enabling transparent communication of case difficulty, uncertainty and model reliability. These findings establish CheXthought as a resource for advancing multimodal clinical reasoning and the development of more transparent, interpretable vision-language models.
Summary / 总结
CheXthought is a global multimodal dataset containing 103,592 chain-of-thought reasoning traces and 6,609,082 synchronized visual attention annotations across 50,312 multi-read chest X-rays from 501 radiologists in 71 countries. Analysis of the dataset reveals how experts deploy visual search strategies, integrate clinical context, and communicate uncertainty. Its reasoning traces outperform state-of-the-art vision-language model chain-of-thought in factual accuracy and spatial grounding, its visual attention data recovers missed findings and reduces hallucinations when used as an inference-time hint, and models trained on it improve in pathology classification, visual faithfulness, temporal reasoning, and uncertainty communication. Its multi-reader annotations also enable prediction of human-human and human-AI disagreement directly from an image.
CheXthought 是一个包含 103,592 条临床推理链和 6,609,082 个视觉注意力标注的多模态数据集,涉及来自 71 个国家的 501 名放射科医生的 50,312 张胸部 X 光片。该数据集揭示了临床推理模式,并展示了在事实准确性、减少幻觉、病理分类增强以及透明沟通不确定性方面的改进。基于 CheXthought 训练的模型在多个临床维度上优于最先进的视觉-语言模型。
FA-Seg: A Fast and Accurate Diffusion-Based Method for Open-Vocabulary Segmentation
Authors: Huy Che, Vinh-Tiep Nguyen
Venue: Neurocomputing 660 (2026) 131844
First: 2025-06-29T16:41:41+00:00 · Latest: 2026-04-29T03:47:47+00:00
Abstract
Open-vocabulary semantic segmentation (OVSS) aims to segment objects from arbitrary text categories without requiring densely annotated datasets. Although contrastive learning based models enable zero-shot segmentation, they often lose fine spatial precision at pixel level, due to global representation bias. In contrast, diffusion-based models naturally encode fine-grained spatial features via attention mechanisms that capture both global context and local details. However, they often face challenges in balancing the computation costs and the quality of the segmentation mask. In this work, we present FA-Seg, a Fast and Accurate training-free framework for open-vocabulary segmentation based on diffusion models. FA-Seg performs segmentation using only a (1+1)-step from a pretrained diffusion model. Moreover, instead of running multiple times for different classes, FA-Seg performs segmentation for all classes at once. To further enhance the segmentation quality, FA-Seg introduces three key components: (i) a dual-prompt mechanism for discriminative, class-aware attention extraction, (ii) a Hierarchical Attention Refinement Method (HARD) that enhances semantic precision via multi-resolution attention fusion, and (iii) a Test-Time Flipping (TTF) scheme designed to improve spatial consistency. Extensive experiments show that FA-Seg achieves state-of-the-art training-free performance, obtaining 43.8% average mIoU across PASCAL VOC, PASCAL Context, and COCO Object benchmarks while maintaining superior inference efficiency. Our results demonstrate that FA-Seg provides a strong foundation for extendability, bridging the gap between segmentation quality and inference efficiency. The source code is available at https://github.com/chequanghuy/FA-Seg.
中文标题/摘要
标题:FA-Seg:一种基于扩散模型的快速准确开放词汇分割方法
开放词汇语义分割(OVSS)旨在从任意文本类别中分割对象,无需密集标注数据集。尽管基于对比学习的模型能够实现零样本分割,但它们往往在像素级失去精细的空间精度,由于全局表示偏差。相比之下,基于扩散模型自然通过注意力机制编码细粒度的空间特征,捕捉全局上下文和局部细节。然而,它们往往面临在计算成本和分割掩码质量之间平衡的挑战。在本文中,我们提出了FA-Seg,一种基于扩散模型的无需训练的快速准确开放词汇分割框架。FA-Seg仅使用预训练扩散模型的(1+1)步进行分割。此外,FA-Seg一次对所有类别进行分割,而不是多次运行。为了进一步提高分割质量,FA-Seg引入了三个关键组件:(i)一种判别性、类别感知的注意力提取的双提示机制,(ii)一种多分辨率注意力融合的层次注意力精炼方法(HARD),以及(iii)一种测试时翻转(TTF)方案,旨在提高空间一致性。大量实验表明,FA-Seg在PASCAL VOC、PASCAL Context和COCO Object基准测试中实现了最先进的无需训练性能,平均mIoU为43.8%,同时保持了优越的推理效率。我们的结果表明,FA-Seg为可扩展性提供了坚实的基础,弥合了分割质量和推理效率之间的差距。源代码可在https://github.com/chequanghuy/FA-Seg/ 获取。
Summary / 总结
FA-Seg is a training-free, diffusion-based framework for open-vocabulary semantic segmentation. The method uses only a (1+1)-step process with a pretrained diffusion model and introduces three components: a dual-prompt mechanism for discriminative, class-aware attention extraction, a Hierarchical Attention Refinement Method (HARD) for enhancing semantic precision, and a Test-Time Flipping (TTF) scheme for improving spatial consistency. The method achieves state-of-the-art training-free performance on the PASCAL VOC, PASCAL Context, and COCO Object benchmarks while maintaining efficient inference.
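The FA-Seg abstract names a Test-Time Flipping (TTF) scheme but does not describe its details. As a rough illustration of the general idea behind flip-based test-time augmentation, the sketch below averages a score map computed on an image with the score map computed on its horizontal mirror, flipped back into the original frame. This is generic flip averaging under stated assumptions, not necessarily what FA-Seg does; `toy_predict`, `hflip`, and `flip_average` are hypothetical names, and `toy_predict` is a deliberately left-biased stand-in for a real segmentation model.

```python
def hflip(grid):
    """Mirror a 2D grid (list of rows) horizontally."""
    return [row[::-1] for row in grid]


def toy_predict(image):
    """Hypothetical, deliberately asymmetric score map.

    Each pixel's score is its own value plus half of its left
    neighbour's value, so responses leak to the right.
    """
    scores = []
    for row in image:
        out = []
        for j, value in enumerate(row):
            left = row[j - 1] if j > 0 else 0.0
            out.append(value + 0.5 * left)
        scores.append(out)
    return scores


def flip_average(predict, image):
    """Average the direct prediction with the flipped-back prediction
    obtained from the horizontally mirrored image."""
    direct = predict(image)
    flipped_back = hflip(predict(hflip(image)))
    return [
        [(a + b) / 2 for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(direct, flipped_back)
    ]


image = [[0.0, 1.0, 0.0]]
direct = toy_predict(image)                 # [[0.0, 1.0, 0.5]]
fused = flip_average(toy_predict, image)    # [[0.25, 1.0, 0.25]]
```

In this toy case the direct prediction leaks only to the right of the bright pixel, while the fused map is symmetric around it, which shows how averaging over a flip can cancel a directional bias.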