Omni-Attribute: Open-vocabulary Attribute Encoder for Visual Concept Personalization
Authors: Tsai-Shien Chen, Aliaksandr Siarohin, Gordon Guocheng Qian, Kuan-Chieh Jackson Wang, Egor Nemchinov, Moayed Haji-Ali, Riza Alp Guler, Willi Menapace, Ivan Skorokhodov, Anil Kag, Jun-Yan Zhu, Sergey Tulyakov
Venue: CVPR 2026
First: 2025-12-11T18:59:56+00:00 · Latest: 2026-04-30T17:59:43+00:00
Comments: CVPR 2026. Project page: https://snap-research.github.io/omni-attribute
Abstract
Visual concept personalization aims to transfer only specific image attributes, such as identity, expression, lighting, and style, into unseen contexts. However, existing methods rely on holistic embeddings from general-purpose image encoders, which entangle multiple visual factors and make it difficult to isolate a single attribute. This often leads to information leakage and incoherent synthesis. To address this limitation, we introduce Omni-Attribute, the first open-vocabulary image attribute encoder designed to learn high-fidelity, attribute-specific representations. Our approach jointly designs the data and model: (i) we curate semantically linked image pairs annotated with positive and negative attributes to explicitly teach the encoder what to preserve or suppress; and (ii) we adopt a dual-objective training paradigm that balances generative fidelity with contrastive disentanglement. The resulting embeddings prove effective for open-vocabulary attribute retrieval, personalization, and compositional generation, achieving state-of-the-art performance across multiple benchmarks.
Defending Quantum Classifiers against Adversarial Perturbations through Quantum Autoencoders
Authors: Emma Andrews, Sahan Sanjaya, Prabhat Mishra
First: 2026-04-30T17:56:40+00:00 · Latest: 2026-04-30T17:56:40+00:00
Abstract
Machine learning models can learn from data samples to carry out various tasks efficiently. When data samples are adversarially manipulated, such as by inserting carefully crafted noise, the model can make mistakes. Quantum machine learning models are also vulnerable to such adversarial attacks, especially in image classification using variational quantum classifiers. While there are promising defenses against these adversarial perturbations, such as training with adversarial samples, they face practical limitations. For example, they are not applicable in scenarios where training with adversarial samples is either not possible or can overfit the models to one type of attack. In this paper, we propose an adversarial training-free defense framework that utilizes a quantum autoencoder to purify the adversarial samples through reconstruction. Moreover, our defense framework provides a confidence metric to identify potentially adversarial samples that cannot be purified by the quantum autoencoder. Extensive evaluation demonstrates that our defense framework can significantly outperform the state of the art in prediction accuracy (by up to 68%) under adversarial attacks.
Summary / 总结
The paper addresses the vulnerability of quantum classifiers to adversarial attacks by proposing a defense framework that uses a quantum autoencoder for sample purification without requiring adversarial training. The method reconstructs adversarial samples to remove noise and provides a confidence metric to identify samples that cannot be purified. Experimental results show that this approach improves prediction accuracy by up to 68% under adversarial attacks compared to state-of-the-art methods.
论文提出了一种无需对抗训练的防御框架,利用量子自编码器对受到干扰的样本进行净化,以抵御量子分类器的对抗攻击。该方法通过重建样本去除噪声,并提供一个置信度指标来识别无法净化的样本。实验结果表明,这种方法在对抗攻击下将预测准确性显著提高,最高可达68%,优于现有最先进的方法。
PhyCo: Learning Controllable Physical Priors for Generative Motion
Authors: Sriram Narayanan, Ziyu Jiang, Srinivasa Narasimhan, Manmohan Chandraker
Venue: CVPR 2026
First: 2026-04-30T17:53:03+00:00 · Latest: 2026-04-30T17:53:03+00:00
Comments: CVPR 2026. Project Page: https://phyco-video.github.io/
Abstract
Modern video diffusion models excel at appearance synthesis but still struggle with physical consistency: objects drift, collisions lack realistic rebound, and material responses seldom match their underlying properties. We present PhyCo, a framework that introduces continuous, interpretable, and physically grounded control into video generation. Our approach integrates three key components: (i) a large-scale dataset of over 100K photorealistic simulation videos where friction, restitution, deformation, and force are systematically varied across diverse scenarios; (ii) physics-supervised fine-tuning of a pretrained diffusion model using a ControlNet conditioned on pixel-aligned physical property maps; and (iii) VLM-guided reward optimization, where a fine-tuned vision-language model evaluates generated videos with targeted physics queries and provides differentiable feedback. This combination enables a generative model to produce physically consistent and controllable outputs through variations in physical attributes, without any simulator or geometry reconstruction at inference. On the Physics-IQ benchmark, PhyCo significantly improves physical realism over strong baselines, and human studies confirm clearer and more faithful control over physical attributes. Our results demonstrate a scalable path toward physically consistent, controllable generative video models that generalize beyond synthetic training environments.
中文标题/摘要
标题:PhyCo:学习可控的物理先验以生成运动
现代视频扩散模型在外观合成方面表现出色,但在物理一致性方面仍然存在问题:物体漂移,碰撞缺乏真实的反弹,材料响应很少符合其内在属性。我们提出了PhyCo框架,该框架将连续、可解释和物理基础的控制引入视频生成。我们的方法结合了三个关键组件:(i) 包含超过10万段逼真模拟视频的大规模数据集,其中摩擦、弹性、变形和力在多种场景中系统地变化;(ii) 使用与像素对齐的物理属性图进行条件控制的预训练扩散模型的物理监督微调;以及(iii) VLM指导的奖励优化,其中微调的视觉语言模型使用针对物理查询的生成视频进行评估并提供可微反馈。这种组合使生成模型能够通过物理属性的变化生成物理一致且可控的输出,而无需在推理时使用任何模拟器或几何重建。在Physics-IQ基准测试中,PhyCo在物理现实性方面显著优于强大的基线,而人类研究也证实了对物理属性的更清晰和更忠实的控制。我们的结果表明了一条通向可扩展的、物理一致且可控的生成视频模型的道路,这些模型可以超越合成训练环境。
Summary / 总结
PhyCo is a framework that introduces physical control into video generation by combining a large-scale dataset of photorealistic simulations with physics-supervised fine-tuning of a pretrained diffusion model and VLM-guided reward optimization. The approach significantly enhances physical realism in generated videos, outperforming strong baselines on the Physics-IQ benchmark, and human studies confirm clearer and more faithful control over physical attributes.
PhyCo 是一个框架,通过结合大规模的逼真模拟数据集、预训练扩散模型的物理监督微调以及基于 VLM 的奖励优化,将物理控制引入视频生成中。该方法显著提高了生成视频中的物理真实性,在 Physics-IQ 基准上超越了强基线,并且人类研究证实了对物理属性的更清晰和更忠实的控制。
FreeOcc: Training-Free Embodied Open-Vocabulary Occupancy Prediction
Authors: Zeyu Jiang, Changqing Zhou, Xingxing Zuo, Changhao Chen
Venue: RSS 2026
First: 2026-04-30T17:05:56+00:00 · Latest: 2026-04-30T17:05:56+00:00
Comments: RSS 2026
Abstract
Existing learning-based occupancy prediction methods rely on large-scale 3D annotations and generalize poorly across environments. We present FreeOcc, a training-free framework for open-vocabulary occupancy prediction from monocular or RGB-D sequences. Unlike prior approaches that require voxel-level supervision and ground-truth camera poses, FreeOcc operates without 3D annotations, pose ground truth, or any learning stage. FreeOcc incrementally builds a globally consistent occupancy map via a four-layer pipeline: a SLAM backbone estimates poses and sparse geometry; a geometrically consistent Gaussian update constructs dense 3D Gaussian maps; open-vocabulary semantics from off-the-shelf vision-language models are associated with Gaussian primitives; and a probabilistic Gaussian-to-occupancy projection produces dense voxel occupancy. Despite being entirely training-free and pose-agnostic, FreeOcc achieves over 2× improvements in IoU and mIoU on EmbodiedOcc-ScanNet compared to prior self-supervised methods. We further introduce ReplicaOcc, a benchmark for indoor open-vocabulary occupancy prediction, and show that FreeOcc transfers zero-shot to novel environments, substantially outperforming both supervised and self-supervised baselines. Project page: https://the-masses.github.io/freeocc-web/.
中文标题/摘要
标题:FreeOcc:无需训练的开放词汇占用预测
现有的基于学习的占用预测方法依赖于大规模的3D注释,并且在不同环境中泛化能力较差。我们提出了FreeOcc,这是一种无需训练的框架,可以从单目或RGB-D序列中进行开放词汇的占用预测。与之前需要体素级监督和真值相机姿态的方法不同,FreeOcc无需3D注释、姿态真值或任何学习阶段。FreeOcc通过四层流水线逐步构建全局一致的占用图:SLAM骨干估计姿态和稀疏几何;几何一致的高斯更新构建密集的3D高斯图;来自现成的视觉语言模型的开放词汇语义与高斯原语关联;概率高斯到占用的投影生成密集体素占用。尽管完全无需训练且姿态无关,FreeOcc在EmbodiedOcc-ScanNet上的IoU和mIoU相比之前的自监督方法提高了超过2倍。我们还引入了ReplicaOcc,一个室内开放词汇占用预测基准,并展示了FreeOcc可以在新环境中零样本迁移,显著优于监督和自监督基线。项目页面:https://the-masses.github.io/freeocc-web/
Summary / 总结
FreeOcc is a training-free framework for open-vocabulary occupancy prediction using monocular or RGB-D sequences. It builds a globally consistent occupancy map through a four-layer pipeline without relying on 3D annotations or pose ground truth. FreeOcc significantly improves IoU and mIoU by over 2 times compared to previous self-supervised methods on EmbodiedOcc-ScanNet. It also demonstrates zero-shot transfer to novel environments, outperforming both supervised and self-supervised baselines on the newly introduced ReplicaOcc benchmark.
FreeOcc 是一个无需训练的框架,用于从单目或 RGB-D 序列进行开放词汇量占用预测。它通过四层管道构建全局一致的占用图,无需 3D 注释或地面真值相机姿态。FreeOcc 在 EmbodiedOcc-ScanNet 上将 IoU 和 mIoU 提高了超过 2 倍,相比之前的自监督方法。此外,它在 ReplicaOcc 基准上展示了零样本迁移能力,优于监督和自监督基线方法。
Auto-FlexSwitch: Efficient Dynamic Model Merging via Learnable Task Vector Compression
Authors: Junqi Gao, Dazhi Zhang, Zhichang Guo, Biqing Qi, Yi Ran, Wangmeng Zuo
First: 2026-04-30T16:58:05+00:00 · Latest: 2026-04-30T16:58:05+00:00
Abstract
Model merging has attracted attention as an effective path toward multi-task adaptation by integrating knowledge from multiple task-specific models. Among existing approaches, dynamic merging mitigates performance degradation caused by conflicting parameter updates across tasks by flexibly combining task-specific parameters at inference time, thereby maintaining high performance. However, these methods require storing independent parameters for each task, resulting in prohibitive storage overhead. To address this issue, we first experimentally demonstrate that the fine-tuned weight increments (referred to as task vectors) exhibit an impulse-like activation pattern and high robustness to low-bit representations. Driven by this insight, we propose T-Switch, which decomposes task vectors into three compact components: a binary sparse mask, a sign vector, and a scalar scaling factor, achieving high-fidelity approximation at high compression ratios. Building on this, we develop Auto-Switch, a training-free merging scheme that automatically assembles task vectors through feature similarity retrieval. Furthermore, to transform task vector sparsification and quantization from static rules to adaptive learning, we propose FlexSwitch, a learnable framework that jointly optimizes the compression strategy for each model unit via Learnable Gating Sparsification (LGS) and Bit-width Adaptive Selection (BAS), while employing the Sparsity-Aware Storage Strategy (SASS) to select the optimal storage encoding structure. Finally, by incorporating a K-Nearest Neighbor (KNN) inference scheme with a learnable low-rank metric, we present Auto-FlexSwitch, a dynamic model merging approach that supports highly efficient task vector compression.
Summary / 总结
The paper addresses the challenge of high storage overhead in dynamic model merging by proposing Auto-FlexSwitch, which uses learnable task vector compression. It first analyzes the properties of task vectors and proposes T-Switch, decomposing them into a binary sparse mask, a sign vector, and a scalar scaling factor. Auto-Switch then automatically assembles task vectors through feature similarity retrieval. FlexSwitch further introduces learnable gating sparsification and bit-width adaptive selection to adaptively optimize compression strategies. Finally, Auto-FlexSwitch incorporates a KNN inference scheme with a learnable low-rank metric to support efficient task vector compression.
该论文通过提出Auto-FlexSwitch来解决动态模型合并中的高存储开销问题,该方法利用可学习的任务向量压缩。首先分析了任务向量的特性,并提出了T-Switch,将其分解为二进制稀疏掩码、符号向量和标量缩放因子。然后,Auto-Switch通过特征相似性检索自动组装任务向量。FlexSwitch进一步引入了可学习的门控稀疏化和位宽自适应选择来适应性优化压缩策略。最后,Auto-FlexSwitch结合了带有可学习低秩度量的KNN推理方案,以支持高效的任务向量压缩。
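The three-part decomposition described for T-Switch (binary sparse mask, sign vector, scalar scaling factor) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the authors' procedure: the function names, the top-magnitude selection rule, the `keep_ratio` parameter, and the choice of the mean kept magnitude as the scale are all hypothetical.

```python
def compress_task_vector(task_vector, keep_ratio=0.25):
    """Approximate a task vector by (binary mask, sign vector, scalar scale).

    Hypothetical rule: keep the largest-magnitude entries. The abstract does
    not state how the mask is chosen, so this is only an illustration.
    """
    n = len(task_vector)
    k = max(1, int(round(keep_ratio * n)))
    # Indices of the k entries with the largest absolute value.
    top = sorted(range(n), key=lambda i: abs(task_vector[i]), reverse=True)[:k]
    keep = set(top)
    mask = [1 if i in keep else 0 for i in range(n)]
    signs = [1 if v >= 0 else -1 for v in task_vector]
    # Assumed scale: mean magnitude of the kept entries, which minimizes the
    # squared error on those entries once the mask and signs are fixed.
    scale = sum(abs(task_vector[i]) for i in top) / k
    return mask, signs, scale


def decompress_task_vector(mask, signs, scale):
    """Rebuild the approximate task vector as scale * mask * sign."""
    return [scale * m * s for m, s in zip(mask, signs)]


if __name__ == "__main__":
    tv = [0.9, -0.05, 0.02, -1.1, 0.01, 0.0, 0.03, -0.04]
    mask, signs, scale = compress_task_vector(tv, keep_ratio=0.25)
    print(decompress_task_vector(mask, signs, scale))
```

Under this scheme each parameter costs at most two bits (mask and sign) plus one shared scalar, which is the kind of storage saving the abstract attributes to low-bit task vectors.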
Understanding and Improving Length Generalization in Hierarchical Sparse Attention Models
Authors: Jiaqi Leng, Xiang Hu, Junxiong Wang, Jianguo Li, Wei Wu, Yucheng Lu
Venue: ICLR 2026
First: 2025-10-20T06:17:57+00:00 · Latest: 2026-04-30T15:40:24+00:00
Comments: ICLR 2026 camera-ready version
Abstract
Effectively processing long contexts is a critical challenge for language models. While standard Transformers are limited by quadratic complexity and poor length extrapolation, alternative architectures like sliding window attention and state space models sacrifice the ability to effectively utilize the full context due to their fixed-size memory. Chunk-based sparse attention has emerged as a promising paradigm for extreme length generalization, yet the key architectural principles underpinning its success are not yet fully understood. In this work, we present a systematic dissection of these models to identify the core components driving their performance. Through a unified framework and comprehensive ablation studies, we demonstrate that a combination of three design principles is critical: (1) an expressive, non-linear Chunk Encoder with a dedicated CLS token to produce representations for retrieval; (2) a Bypassing Residual Path to stably integrate retrieved global information without it being overridden by the local residual stream; and (3) enforced selection sparsity during pre-training to bridge the train-test distribution gap. We provide a theoretical motivation for intra-chunk information processing and landmark generation. By combining these principles, we establish a new state-of-the-art for training-free length extrapolation, successfully generalizing models trained on a 4K context to 32 million tokens on RULER and BABILong. Our findings provide a clear and empirically-grounded set of design principles for developing future, highly-capable long-context language models.
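The retrieval step in chunk-based sparse attention, where each chunk is summarized by a landmark representation and only a few chunks are selected per query, can be sketched as follows. This is a simplified illustration under stated assumptions: the mean-pooled landmark stands in for the learned non-linear Chunk Encoder with a CLS token, dot-product scoring is assumed, and all function names are hypothetical.

```python
def chunk_tokens(tokens, chunk_size):
    """Split a sequence of token vectors into consecutive fixed-size chunks."""
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]


def mean_landmark(chunk):
    """Stand-in landmark: the mean of the chunk's token vectors.

    The paper describes a learned, non-linear chunk encoder; mean pooling is
    used here only to keep the sketch self-contained.
    """
    dim = len(chunk[0])
    return [sum(tok[d] for tok in chunk) / len(chunk) for d in range(dim)]


def select_chunks(query, landmarks, top_k):
    """Return indices (in sequence order) of the top_k best-matching chunks."""
    scores = [sum(q * l for q, l in zip(query, lm)) for lm in landmarks]
    ranked = sorted(range(len(landmarks)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:top_k])


if __name__ == "__main__":
    tokens = [[1, 0], [1, 0], [0, 1], [0, 1], [-1, 0], [-1, 0]]
    chunks = chunk_tokens(tokens, chunk_size=2)
    landmarks = [mean_landmark(c) for c in chunks]
    print(select_chunks([1.0, 0.5], landmarks, top_k=2))
```

Keeping `top_k` fixed during training as well as at test time is one plausible reading of the "enforced selection sparsity" principle, since the model then sees the same number of retrieved chunks regardless of context length.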
Benchmarking Large Vision-Language Models on Fine-Grained Image Tasks: A Comprehensive Evaluation
Authors: Hong-Tao Yu, Yuxin Peng, Serge Belongie, Xiu-Shen Wei
Venue: ICLR 2026
First: 2025-04-21T09:30:41+00:00 · Latest: 2026-04-30T15:12:07+00:00
Comments: Accepted to ICLR 2026
Abstract
Recent advancements in Large Vision-Language Models (LVLMs) have demonstrated remarkable multimodal perception capabilities, garnering significant attention. While numerous evaluation studies have emerged, assessing LVLMs both holistically and on specialized tasks, fine-grained image tasks, which are fundamental to computer vision, remain largely unexplored. To fill this gap, we introduce a comprehensive fine-grained evaluation benchmark, i.e., FG-BMK, comprising 1.01 million questions and 0.33 million images. Our evaluation systematically examines LVLMs from both human-oriented and machine-oriented perspectives, focusing on their semantic recognition and fine-grained feature representation capabilities. Through extensive experiments on twelve representative LVLMs/VLMs, we uncover key findings regarding the influence of training paradigms, modality alignment, perturbation susceptibility, and fine-grained category reasoning on task performance. This work provides critical insights into the limitations of current LVLMs and offers guidance for future data construction and model design in the development of more advanced LVLMs. Our code is open-source and available at https://github.com/SEU-VIPGroup/FG-BMK.
Summary / 总结
This study aims to evaluate Large Vision-Language Models (LVLMs) on fine-grained image tasks, which are crucial for computer vision. It introduces FG-BMK, a benchmark with 1.01 million questions and 0.33 million images, to comprehensively assess LVLMs. Experiments on twelve representative LVLMs reveal that training paradigms, modality alignment, perturbation susceptibility, and fine-grained category reasoning significantly impact performance. This work highlights the limitations of current LVLMs and provides guidance for future model development.
该研究旨在评估大型视觉-语言模型(LVLMs)在细粒度图像任务上的表现,这些任务对于计算机视觉至关重要。研究人员开发了一个全面的基准FG-BMK,包含101万道问题和33万张图像。通过十二个LVLMs的实验,他们发现训练范式、模态对齐、扰动敏感性和细粒度类别推理对任务性能有显著影响。这项工作揭示了当前LVLMs的局限性,并为未来的数据构建和模型设计提供了指导。
TransVLM: A Vision-Language Framework and Benchmark for Detecting Any Shot Transitions
Authors: Ce Chen, Yi Ren, Yuanming Li, Viktor Goriachko, Zhenhui Ye, Zujin Guo, Zhibin Hong, Mingming Gong
Venue: www
First: 2026-04-30T15:05:06+00:00 · Latest: 2026-04-30T15:05:06+00:00
Comments: This work has been deployed to production. For more related research, please visit HeyGen Research (https://www.heygen.com/research) and HeyGen Avatar-V (https://www.heygen.com/research/avatar-v-model). Project page: https://chence17.github.io/TransVLM/
Abstract
Traditional Shot Boundary Detection (SBD) inherently struggles with complex transitions by formulating the task around isolated cut points, frequently yielding corrupted video shots. We address this fundamental limitation by formalizing the Shot Transition Detection (STD) task. Rather than searching for ambiguous points, STD explicitly detects the continuous temporal segments of transitions. To tackle this, we propose TransVLM, a Vision-Language Model (VLM) framework for STD. Unlike regular VLMs that predominantly rely on spatial semantics and struggle with fine-grained inter-shot dynamics, our method explicitly injects optical flow as a critical motion prior at the input stage. Through a simple yet effective feature-fusion strategy, TransVLM directly processes concatenated color and motion representations, significantly enhancing its temporal awareness without incurring any additional visual token overhead on the language backbone. To overcome the severe class imbalance in public data, we design a scalable data engine to synthesize diverse transition videos for robust training, alongside a comprehensive benchmark for STD. Extensive experiments demonstrate that TransVLM achieves superior overall performance, outperforming traditional heuristic methods, specialized spatiotemporal networks, and top-tier VLMs.
中文标题/摘要
标题:TransVLM:一种检测任意镜头过渡的视觉-语言框架和基准
传统的镜头边界检测(SBD)固有地难以处理复杂的过渡,因为它围绕孤立的剪辑点来定义任务,经常导致视频剪辑被破坏。我们通过正式化镜头过渡检测(STD)任务来解决这一根本限制。不同于常规地在剪辑点上寻找模糊的点,STD 明确地检测过渡的连续时间段。为了解决这个问题,我们提出了 TransVLM,一种用于 STD 的视觉-语言模型(VLM)框架。与主要依赖空间语义的常规 VLM 不同,我们的方法在输入阶段显式地注入了光学流作为关键的运动先验。通过一种简单而有效的特征融合策略,TransVLM 直接处理了颜色和运动的拼接表示,显著增强了其时间意识,而不会在语言骨干上增加任何额外的视觉标记开销。为了克服公共数据中严重的类别不平衡,我们设计了一个可扩展的数据引擎来合成多样化的过渡视频以进行稳健训练,并且还提供了一个全面的 STD 基准。广泛的实验表明,TransVLM 达到了优越的整体性能,超越了传统的启发式方法、专门的空间-时间网络以及顶级的 VLM。这项工作已部署到生产环境中。如需更多相关研究,请访问 HeyGen 研究(https://www.heygen.com/research)和 HeyGen Avatar-V(https://www.heygen.com/research/avatar-v-model)。项目页面:https://chence17.github.io/TransVLM/
Summary / 总结
The paper addresses the limitations of traditional Shot Boundary Detection (SBD) in handling complex transitions by introducing the Shot Transition Detection (STD) task. It proposes TransVLM, a Vision-Language Model that incorporates optical flow as a motion prior to enhance temporal awareness. Experiments show that TransVLM outperforms existing methods in detecting various types of transitions, demonstrating superior overall performance and robustness. This work has been deployed in production and includes a comprehensive benchmark for STD.
论文通过引入Shot Transition Detection (STD)任务来解决传统Shot Boundary Detection (SBD)在处理复杂过渡方面的局限性。提出了一种Vision-Language Model TransVLM,该模型在输入阶段注入了光学流作为运动先验,以增强其时间感知能力。实验表明,TransVLM在检测各种类型的过渡方面表现出色,整体性能优于现有方法,并且包括了STD的全面基准。
FineState-Bench: Benchmarking State-Conditioned Grounding for Fine-grained GUI State Setting
Authors: Fengxian Ji, Jingpu Yang, Zirui Song, Yuanxi Wang, Zhexuan Cui, Yuke Li, Qian Jiang, Xiuying Chen
First: 2026-04-30T15:03:56+00:00 · Latest: 2026-04-30T15:03:56+00:00
Abstract
Despite the rapid progress of large vision-language models (LVLMs), fine-grained, state-conditioned GUI interaction remains challenging. Current evaluations offer limited coverage, imprecise target-state definitions, and an overreliance on final-task success, obscuring where and why agents fail. To address this gap, we introduce FineState-Bench, a benchmark that evaluates whether an agent can correctly ground an instruction to the intended UI control and reach the exact target state. FineState-Bench comprises 2,209 instances across desktop, web, and mobile platforms, spanning four interaction families and 23 UI component types, with each instance explicitly specifying an exact target state for fine-grained state setting. We further propose FineState-Metrics, a four-stage diagnostic pipeline with stage-wise success rates: Localization Success Rate (SR@Loc), Interaction Success Rate (SR@Int), Exact State Success Rate at Locate (ES-SR@Loc), and Exact State Success Rate at Interact (ES-SR@Int), and a plug-and-play Visual Diagnostic Assistant (VDA) that generates a Description and a bounding-box Localization Hint to diagnose the causes of visual grounding failures via controlled with vs. without comparisons. On FineState-Bench, exact goal-state success remains low: ES-SR@Int peaks at 32.8% on Web and 22.8% on average across platforms. With VDA localization hints, Gemini-2.5-Flash gains +14.9 ES-SR@Int points, suggesting substantial headroom from improved visual grounding, yet overall accuracy is still insufficient for reliable fine-grained state-conditioned interaction. GitHub: https://github.com/FengxianJi/FineState-Bench
中文标题/摘要
标题:FineState-Bench:细粒度GUI状态设置中基于状态条件的对接基准测试
尽管大型视觉-语言模型(LVLMs)取得了快速进展,但细粒度的基于状态的GUI交互仍然具有挑战性。当前的评估覆盖范围有限,目标状态定义不够精确,并且过度依赖最终任务的成功,这掩盖了代理失败的具体位置和原因。为了解决这一差距,我们引入了**FineState-Bench**,一个基准测试,评估代理是否能够正确地将指令对接到预期的UI控件并达到精确的目标状态。FineState-Bench 包含了2,209个实例,覆盖了桌面、网络和移动平台,跨越了四种交互家族和23种UI组件类型,每个实例都明确指定了细粒度状态设置的精确目标状态。我们还提出了**FineState-Metrics**,一个四阶段诊断流水线,每个阶段的成功率分别为定位成功率(SR@Loc)、交互成功率(SR@Int)、定位时的精确状态成功率(ES-SR@Loc)和交互时的精确状态成功率(ES-SR@Int),以及一个即插即用的**视觉诊断助手**(VDA),它生成描述和边界框定位提示,通过有控制的有/无比较来诊断视觉对接原因。在FineState-Bench上,精确目标状态的成功率仍然很低:交互时的精确状态成功率(ES-SR@Int)在Web上达到32.8%,平均在所有平台上为22.8%。使用VDA定位提示,Gemini-2.5-Flash获得了+14.9 ES-SR@Int点的提升,表明通过改进视觉对接仍有很大的改进空间,但总体准确性仍然不足以实现可靠的基于状态条件的细粒度交互。[GitHub](https://github.com/FengxianJi/FineState-Bench)
The Effects of Visual Priming on Cooperative Behavior in Vision-Language Models
Authors: Kenneth J. K. Ong
First: 2026-04-30T14:50:48+00:00 · Latest: 2026-04-30T14:50:48+00:00
Abstract
As Vision-Language Models (VLMs) become increasingly integrated into decision-making systems, it is essential to understand how visual inputs influence their behavior. This paper investigates the effects of visual priming on VLMs' cooperative behavior using the Iterated Prisoner's Dilemma (IPD) as a test scenario. We examine whether exposure to images depicting behavioral concepts (kindness/helpfulness vs. aggressiveness/selfishness) and color-coded reward matrices alters VLM decision patterns. Experiments were conducted across multiple state-of-the-art VLMs. We further explore mitigation strategies including prompt modifications, Chain of Thought (CoT) reasoning, and visual token reduction. Results show that VLM behavior can be influenced by both image content and color cues, with varying susceptibility and mitigation effectiveness across models. These findings not only underscore the importance of robust evaluation frameworks for VLM deployment in visually rich and safety-critical environments, but also highlight how architectural and training differences among models may lead to distinct behavioral responses, an area worthy of further investigation.
中文标题/摘要
标题:视觉先兆对视觉语言模型合作行为的影响
随着视觉语言模型(VLMs)越来越多地集成到决策系统中,理解视觉输入如何影响其行为变得至关重要。本文通过迭代囚徒困境(IPD)作为测试场景,研究视觉先兆对VLMs合作行为的影响。我们探讨了暴露于描绘行为概念的图像(友善/乐于助人 vs. 好斗/自私)和颜色编码的奖励矩阵是否改变了VLM的决策模式。实验在多个最先进的VLMs上进行。我们进一步探讨了包括提示修改、链式思考(CoT)推理和视觉标记减少在内的缓解策略。结果表明,VLM的行为可以受到图像内容和颜色提示的影响,不同模型在影响程度和缓解效果上存在差异。这些发现不仅强调了在视觉丰富和安全关键环境中部署VLM时需要稳健的评估框架的重要性,还突显了模型之间架构和训练差异可能导致不同行为反应这一领域值得进一步研究。
Summary / 总结
This paper explores how visual priming affects the cooperative behavior of Vision-Language Models (VLMs) using the Iterated Prisoner's Dilemma. Experiments with multiple state-of-the-art VLMs show that exposure to images and color-coded reward matrices influences VLM decision patterns, with varying effectiveness across models. The study also examines mitigation strategies like prompt modifications, Chain of Thought reasoning, and visual token reduction. Results indicate that both image content and color cues can alter VLM behavior, emphasizing the need for robust evaluation frameworks for VLMs in safety-critical environments.
本文研究视觉启动如何影响视觉语言模型(VLM)的合作行为,使用迭代囚徒困境作为测试场景。实验表明,多种先进的VLM在暴露于图像和颜色编码的奖励矩阵后,其决策模式会发生变化,且不同模型的效果各异。研究还探讨了包括提示修改、链式思考推理和视觉标记减少在内的缓解策略。结果表明,图像内容和颜色提示都能影响VLM的行为,强调了在安全关键环境中部署VLM时需要稳健的评估框架。
Diffusion-OAMP for Joint Image Compression and Wireless Transmission
Authors: Wentao Hou, Yimin Bai, Zelei Luo, Jiadong Hong, Lei Liu
First: 2026-04-30T14:49:31+00:00 · Latest: 2026-04-30T14:49:31+00:00
Comments: 6 pages, 5 figures, 2 tables, submitted for a possible publication
Abstract
Joint image compression and wireless transmission remains relatively underexplored compared to generic image restoration, despite its importance in practical communication systems. We formulate this problem under an equivalent linear model, and propose Diffusion-OAMP, a training-free reconstruction framework that embeds a pre-trained diffusion model into the OAMP algorithm. In Diffusion-OAMP, the OAMP linear estimator produces pseudo-AWGN observations, while the diffusion model serves as a nonlinear estimator under an SNR-matching rule. This framework offers a way to incorporate multiple generative priors into OAMP. Experiments with varying compression ratios and noise levels show that Diffusion-OAMP performs favorably against classic methods in the evaluated settings.
Summary / 总结
The paper addresses the underexplored area of joint image compression and wireless transmission, proposing Diffusion-OAMP, a training-free reconstruction framework. This framework integrates a pre-trained diffusion model into the OAMP algorithm, where the OAMP linear estimator generates pseudo-AWGN observations, and the diffusion model acts as a nonlinear estimator under an SNR-matching rule. Experiments demonstrate that Diffusion-OAMP outperforms classic methods across different compression ratios and noise levels.
论文针对联合图像压缩与无线传输这一相对较少研究的领域,提出了一种无训练的重建框架Diffusion-OAMP。该框架将预训练的扩散模型嵌入到OAMP算法中,其中OAMP线性估计器生成伪AWGN观测值,扩散模型在SNR匹配规则下作为非线性估计器。实验结果显示,Diffusion-OAMP在不同压缩比和噪声水平下优于经典方法。
Dynamic Cluster Data Sampling for Efficient and Long-Tail-Aware Vision-Language Pre-training
Authors: Mingliang Liang, Zhuoran Liu, Arjen P. de Vries, Martha Larson
First: 2026-04-30T14:33:23+00:00 · Latest: 2026-04-30T14:33:23+00:00
Abstract
The computational cost of training a vision-language model (VLM) can be reduced by sampling the training data. Previous work on efficient VLM pre-training has pointed to the importance of semantic data balance, adjusting the distribution of topics in the data to improve VLM accuracy. However, existing efficient pre-training approaches may disproportionately remove rare concepts from the training corpus. As a result, \emph{long-tail concepts} remain insufficiently represented in the training data and are not effectively captured during training. In this work, we introduce a \emph{dynamic cluster-based sampling approach (DynamiCS)} that downsamples large clusters of data and upsamples small ones. The approach is dynamic in that it applies sampling at each epoch. We first show the importance of dynamic sampling for VLM training. Then, we demonstrate the advantage of our cluster-scaling approach, which maintains the relative order of semantic clusters in the data and emphasizes the long-tail. This approach contrasts with current work, which focuses only on flattening the semantic distribution of the data. Our experiments show that DynamiCS reduces the computational cost of VLM training and provides a performance advantage for long-tail concepts.
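The idea of downsampling large clusters, upsampling small ones, keeping the relative order of cluster sizes, and redrawing the sample every epoch can be sketched as below. The power-law budget `size ** alpha` is an assumption chosen because it is monotone in cluster size; the abstract does not give the actual scaling rule, and the function names and the `alpha` parameter are hypothetical.

```python
import random


def cluster_budgets(cluster_sizes, total_budget, alpha=0.5):
    """Per-cluster sample counts proportional to size ** alpha.

    With 0 < alpha < 1 the ordering of clusters by size is preserved while the
    gap between head and tail clusters shrinks (an assumed rule).
    """
    weights = [size ** alpha for size in cluster_sizes]
    norm = sum(weights)
    return [max(1, round(total_budget * w / norm)) for w in weights]


def sample_epoch(clusters, total_budget, alpha=0.5, seed=None):
    """Draw one epoch's training subset; call again each epoch for a new draw."""
    rng = random.Random(seed)
    budgets = cluster_budgets([len(c) for c in clusters], total_budget, alpha)
    batch = []
    for members, budget in zip(clusters, budgets):
        if budget <= len(members):
            # Large cluster: downsample without replacement.
            batch.extend(rng.sample(members, budget))
        else:
            # Small cluster: upsample with replacement.
            batch.extend(rng.choices(members, k=budget))
    return batch


if __name__ == "__main__":
    clusters = [list(range(10000)), list(range(10000, 10100)), [10100]]
    for epoch in range(2):
        subset = sample_epoch(clusters, total_budget=1000, seed=epoch)
        print(epoch, len(subset))
```

Setting `alpha` to 0 would flatten the distribution entirely, which corresponds to the flattening-only strategy the abstract contrasts with, while `alpha` equal to 1 reproduces plain proportional sampling.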
Training-Free Tunnel Defect Inspection and Engineering Interpretation via Visual Recalibration and Entity Reconstruction
Authors: Shipeng Liu, Liang Zhao, Dengfeng Chen, Zhanping Song
First: 2026-04-30T14:31:00+00:00 · Latest: 2026-04-30T14:31:00+00:00
Abstract
Tunnel inspection requires outputs that can support defect localization, measurement, severity grading, and engineering documentation. Existing training-free foundation-model pipelines usually stop at coarse open-vocabulary proposals, which are difficult to use directly in interference-heavy tunnel scenes. We propose a training-free framework TunnelMIND. Specifically, language-guided defect proposals are not treated as final outputs; instead, their spatial support is recalibrated at inference time through dense visual consistency, so that coarse semantic anchors can be transformed into more reliable prompts under tunnel-specific hard negatives. The resulting masks are further reconstructed into structured defect entities with category, location, geometry, severity, and context attributes, which are then mapped to retrieval-grounded explanation and engineering-readable report generation under expert knowledge constraints. On visible, GPR, and road defect tasks, TunnelMIND achieves F1 scores of 0.68, 0.78, and 0.72, respectively. Overall, TunnelMIND shows that training-free tunnel inspection can move beyond coarse localization toward structured defect evidence for engineering assessment.
OmniDrive-R1: Reinforcement-driven Interleaved Multi-modal Chain-of-Thought for Trustworthy Vision-Language Autonomous Driving
Authors: Zhenguo Zhang, Haohan Zheng, Yishen Wang, Le Xu, Tianchen Deng, Xuefeng Chen, Qu Chen, Bo Zhang, Wuxiong Huang
First: 2025-12-16T03:19:28+00:00 · Latest: 2026-04-30T14:06:23+00:00
Abstract
The deployment of Vision-Language Models (VLMs) in safety-critical domains like autonomous driving (AD) is critically hindered by reliability failures, most notably object hallucination. This failure stems from their reliance on ungrounded, text-based Chain-of-Thought (CoT) reasoning. While existing multi-modal CoT approaches attempt mitigation, they suffer from two fundamental flaws: (1) decoupled perception and reasoning stages that prevent end-to-end joint optimization, and (2) reliance on expensive, dense localization labels. We therefore introduce OmniDrive-R1, an end-to-end VLM framework designed for autonomous driving, which unifies perception and reasoning through an interleaved Multi-modal Chain-of-Thought (iMCoT) mechanism. Our core innovation is a reinforcement-driven visual grounding capability, enabling the model to autonomously direct its attention and "zoom in" on critical regions for fine-grained analysis. This capability is enabled by our pure two-stage reinforcement learning training pipeline and Clip-GRPO algorithm. Crucially, Clip-GRPO introduces an annotation-free, process-based grounding reward. This reward not only eliminates the need for dense labels but also circumvents the instability of external tool calls by enforcing real-time cross-modal consistency between the visual focus and the textual reasoning. Extensive experiments on DriveLMM-o1 demonstrate our model's significant improvements. Compared to the baseline Qwen2.5VL-7B, OmniDrive-R1 improves the overall reasoning score from 51.77% to 80.35%, and the final answer accuracy from 37.81% to 73.62%.
中文标题/摘要
标题:OmniDrive-R1:强化驱动的交织多模态链式思考框架以实现可信赖的视觉语言自动驾驶
在自动驾驶(AD)等安全关键领域部署视觉语言模型(VLMs)受到可靠性故障的严重阻碍,尤其是对象幻觉。这种故障源于它们依赖于基于文本的链式思考(CoT)推理。尽管现有的多模态CoT方法试图缓解这一问题,但它们存在两个根本缺陷:(1)感知和推理阶段的分离,这阻碍了端到端联合优化,(2)依赖昂贵的密集定位标签。因此,我们提出了OmniDrive-R1,这是一种为自动驾驶设计的端到端VLM框架,通过交织多模态链式思考(iMCoT)机制统一了感知和推理。我们的核心创新是一种强化驱动的视觉定位能力,使模型能够自主地引导其注意力并“聚焦”在关键区域进行精细分析。这种能力得益于我们纯两阶段强化学习训练管道和Clip-GRPO算法。关键的是,Clip-GRPO引入了一种无需标注的过程导向定位奖励。这种奖励不仅消除了对密集标签的需求,还通过强制实时跨模态一致性来避免外部工具调用的不稳定性。在DriveLMM-o1上的大量实验表明,我们的模型取得了显著改进。与基线Qwen2.5VL-7B相比,OmniDrive-R1的整体推理得分从51.77%提高到80.35%,最终答案准确性从37.81%提高到73.62%。
Summary / 总结
OmniDrive-R1 is an end-to-end Vision-Language Model framework for autonomous driving that integrates perception and reasoning through an interleaved Multi-modal Chain-of-Thought mechanism. It introduces a reinforcement-driven visual grounding capability, which allows the model to autonomously focus on critical regions for fine-grained analysis. This is achieved through a two-stage reinforcement learning training pipeline and the Clip-GRPO algorithm, which provides annotation-free, process-based grounding rewards. Experimental results on DriveLMM-o1 show significant improvements, with reasoning scores increasing from 51.77% to 80.35% and final answer accuracy from 37.81% to 73.62% compared to the baseline Qwen2.5VL-7B.
OmniDrive-R1 是一个端到端的视觉-语言模型框架,用于自动驾驶,它通过交错的多模态链式思考机制将感知和推理结合起来。该框架引入了一种基于强化学习的视觉定位能力,使模型能够自主地聚焦于关键区域进行精细分析。这通过两阶段的强化学习训练管道和 Clip-GRPO 算法实现,该算法提供了无需标注的过程导向定位奖励。实验结果表明,与基线 Qwen2.5VL-7B 相比,推理得分从 51.77% 提高到 80.35%,最终答案准确性从 37.81% 提高到 73.62%。
NeocorRAG: Less Irrelevant Information, More Explicit Evidence, and More Effective Recall via Evidence Chains
Authors: Shiyao Peng, Qianhe Zheng, Zhuodi Hao, Zichen Tang, Rongjin Li, Qing Huang, Jiayu Huang, Jiacheng Liu, Yifan Zhu, Haihong E
Venue: WWW 2026
First: 2026-04-30T13:37:01+00:00 · Latest: 2026-04-30T13:37:01+00:00
Comments: Accepted to WWW 2026
Abstract
Although precise recall is a core objective in Retrieval-Augmented Generation (RAG), a critical oversight persists in the field: improvements in retrieval performance do not consistently translate to commensurate gains in downstream reasoning. To diagnose this gap, we propose the Recall Conversion Rate (RCR), a novel evaluation metric to quantify the contribution of retrieval to reasoning accuracy. Our quantitative analysis of mainstream RAG methods reveals that as Recall@5 improves, the RCR exhibits a near-linear decay. We identify the neglect of retrieval quality in these methods as the underlying cause. In contrast, approaches that focus solely on quality optimization often suffer from inferior recall performance. Both categories lack a comprehensive understanding of retrieval quality optimization, resulting in a trade-off dilemma. To address these challenges, we propose comprehensive retrieval quality optimization criteria and introduce the NeocorRAG framework. This framework achieves holistic retrieval quality optimization by systematically mining and utilizing Evidence Chains. Specifically, NeocorRAG first employs an innovative activated search algorithm to obtain a refined candidate space. Then it ensures precise evidence chain generation through constrained decoding. Finally, the retrieved set of evidence chains guides the retrieval optimization process. Evaluated on benchmarks including HotpotQA, 2WikiMultiHopQA, MuSiQue, and NQ, NeocorRAG achieves SOTA performance on both 3B and 70B parameter models, while consuming less than 20% of tokens used by comparable methods. This study presents an efficient, training-free paradigm for RAG enhancement that effectively optimizes retrieval quality while maintaining high recall. Our code is released at https://github.com/BUPT-Reasoning-Lab/NeocorRAG.
中文标题/摘要
标题:NeocorRAG:减少无关信息,增强明确证据,通过证据链实现更有效的回忆
尽管精确回忆是检索增强生成(RAG)的核心目标,但在该领域中存在一个关键的忽视:检索性能的改进并不总是与下游推理的相应提升保持一致。为诊断这一差距,我们提出了检索转换率(RCR)这一新的评估指标,以量化检索对推理准确性的贡献。对主流RAG方法的定量分析显示,随着Recall@5的提高,RCR几乎呈线性衰减。我们发现这些方法忽视检索质量是其根本原因。相比之下,专注于质量优化的方法往往在检索性能上表现较差。两类方法都缺乏对检索质量优化的全面理解,导致了权衡困境。为解决这些挑战,我们提出了全面的检索质量优化标准,并引入了NeocorRAG框架。该框架通过系统地挖掘和利用证据链实现整体检索质量优化。具体而言,NeocorRAG首先采用创新的激活搜索算法获得精炼的候选空间,然后通过受限解码确保精确的证据链生成。最后,检索出的证据链集合指导检索优化过程。NeocorRAG在HotpotQA、2WikiMultiHopQA、MuSiQue和NQ等基准测试中,无论是3B参数模型还是70B参数模型,都实现了SOTA性能,同时消耗的令牌数量不到同类方法的20%。本研究提出了一种高效的、无需训练的RAG增强范式,能够在保持高召回率的同时有效优化检索质量。我们的代码已发布在https://github.com/BUPT-Reasoning-Lab/NeocorRAG。
Summary / 总结
The paper addresses the gap between improved retrieval performance and downstream reasoning accuracy in Retrieval-Augmented Generation (RAG) by proposing the Recall Conversion Rate (RCR) metric. It identifies that mainstream RAG methods suffer from a trade-off between retrieval quality and recall. To address this, the authors introduce NeocorRAG, which optimizes retrieval quality through Evidence Chains, achieving state-of-the-art performance on benchmarks while using fewer tokens. The study evaluates NeocorRAG on HotpotQA, 2WikiMultiHopQA, MuSiQue, and NQ, demonstrating its effectiveness.
论文通过提出召回转换率(RCR)指标,诊断了检索增强生成(RAG)中检索性能提升与下游推理准确性的不一致问题。主流RAG方法在检索质量和召回率之间存在权衡。为此,作者提出了NeocorRAG框架,通过证据链优化检索质量,实现了在HotpotQA、2WikiMultiHopQA、MuSiQue和NQ等基准上的最佳性能,同时使用较少的token。研究展示了NeocorRAG的有效性。
Hyper-Dimensional Fingerprints as Molecular Representations
Authors: Jonas Teufel, Luca Torresi, André Eberhard, Pascal Friederich
First: 2026-04-30T12:53:58+00:00 · Latest: 2026-04-30T12:53:58+00:00
Comments: Code: https://doi.org/10.5281/zenodo.19373621
Abstract
Computational molecular representations underpin virtual screening, property prediction, and materials discovery. Conventional fingerprints are efficient and deterministic but lose structural information through hash-based compression, particularly at low dimensionalities. Learned representations from graph neural networks recover this expressiveness but require task-specific training and substantial computational resources. Here we introduce hyperdimensional fingerprints (HDF), which replace the learned transformations of message-passing neural networks with algebraic operations on high-dimensional vectors, producing deterministic molecular representations without any training. Across diverse property prediction benchmarks, HDF outperforms conventional fingerprints in the majority of tasks while exhibiting greater consistency across datasets and models. Crucially, HDF embeddings preserve molecular similarity faithfully: at 32 dimensions, distances in HDF space achieve a 0.9 Pearson correlation with graph edit distance, compared to 0.55 for Morgan fingerprints at equivalent size. This structural fidelity persists at low dimensions where hash-based methods degrade, allowing simple nearest-neighbor regression to remain predictive with as few as 64 components. We further demonstrate the practical impact in Bayesian molecular optimization, where HDF-based surrogate models achieve substantially improved sample efficiency in regimes where Morgan fingerprints perform comparably to random search. HDF thus provides a general-purpose, training-free alternative to conventional molecular fingerprints, suggesting that the information loss long accepted as inherent to fixed-length fingerprints is a limitation of the hash-based encoding scheme rather than the fingerprint paradigm itself.
Summary / 总结
This study introduces hyperdimensional fingerprints (HDF) as a new method for molecular representation, addressing the limitations of conventional fingerprints by preserving structural information without task-specific training. HDF outperforms conventional fingerprints in the majority of property prediction benchmarks and maintains high structural fidelity even at low dimensions, allowing simple nearest-neighbor regression to remain predictive. In Bayesian molecular optimization, HDF-based surrogate models show substantially improved sample efficiency in regimes where Morgan fingerprints perform comparably to random search.
该研究引入了超维度指纹(HDF)作为新的分子表示方法,通过在不进行任务特定训练的情况下保留结构信息来解决传统指纹的局限性。HDF在属性预测基准测试中优于传统指纹,并且即使在低维度下也能保持高结构一致性,使得简单的回归方法仍然有效。HDF在贝叶斯分子优化中的代理模型显示出比摩根指纹和随机搜索更高的样本效率。
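The HDF abstract describes replacing the learned transformations of message passing with algebraic operations on high-dimensional vectors. The following is a minimal sketch of that general idea, not the authors' implementation: the choice of bipolar vectors, element-wise multiplication as binding, summation as bundling, the cyclic shift, and all function names (`atom_hypervector`, `encode_molecule`, and so on) are assumptions made for illustration.

```python
import math
import random


def atom_hypervector(symbol, dim, seed=0):
    """Fixed random bipolar vector per atom symbol (assumed codebook scheme)."""
    rng = random.Random(f"{seed}:{symbol}")  # string seed: same symbol, same vector
    return [rng.choice((-1, 1)) for _ in range(dim)]


def bind(a, b):
    """Binding: element-wise product of two hypervectors."""
    return [x * y for x, y in zip(a, b)]


def bundle(vectors):
    """Bundling: element-wise sum of several hypervectors."""
    return [sum(column) for column in zip(*vectors)]


def shift(v, k=1):
    """Cyclic shift, used here so that binding a vector with itself is not trivial."""
    return v[-k:] + v[:-k]


def encode_molecule(atoms, bonds, dim=1024, rounds=2, seed=0):
    """Training-free graph encoding: neighbors are bundled, then bound to each node."""
    neighbors = [[] for _ in atoms]
    for i, j in bonds:
        neighbors[i].append(j)
        neighbors[j].append(i)
    states = [atom_hypervector(s, dim, seed) for s in atoms]
    for _ in range(rounds):
        updated = []
        for i, state in enumerate(states):
            if neighbors[i]:
                message = bundle([states[j] for j in neighbors[i]])
                updated.append(bind(state, shift(message)))
            else:
                updated.append(state)
        states = updated
    return bundle(states)  # sum over atoms gives a fixed-length molecule vector


def cosine(a, b):
    """Cosine similarity between two molecule vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


ethanol = encode_molecule(["C", "C", "O"], [(0, 1), (1, 2)])
dimethyl_ether = encode_molecule(["C", "O", "C"], [(0, 1), (1, 2)])
print(cosine(ethanol, dimethyl_ether))
```

Because every step is a sum or an element-wise product, the result does not depend on atom ordering and involves no trainable parameters, which is the property the abstract emphasizes. Bond types and the specific operations used in the paper are not modeled here.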
Flattery in Motion: Benchmarking and Analyzing Sycophancy in Video-LLMs
Authors: Wenrui Zhou, Mohamed Hendy, Shu Yang, Qingsong Yang, Zikun Guo, Yuyu Luo, Lijie Hu, Di Wang
Venue: ACL 2026
First: 2025-06-08T15:00:21+00:00 · Latest: 2026-04-30T12:27:19+00:00
Comments: 27 Pages, Accepted by ACL 2026 Main Conference
Abstract
As video large language models (Video-LLMs) become increasingly integrated into real-world applications that demand grounded multimodal reasoning, ensuring their factual consistency and reliability is of critical importance. However, sycophancy, the tendency of these models to align with user input even when it contradicts the visual evidence, undermines their trustworthiness in such contexts. Current sycophancy research has largely overlooked its specific manifestations in the video-language domain, resulting in a notable absence of systematic benchmarks and targeted evaluations to understand how Video-LLMs respond under misleading user input. To fill this gap, we propose VISE (Video-LLM Sycophancy Benchmarking and Evaluation), the first benchmark designed to evaluate sycophantic behavior in state-of-the-art Video-LLMs across diverse question formats, prompt biases, and visual reasoning tasks. Specifically, VISE brings linguistic perspectives on sycophancy into the video domain, enabling fine-grained analysis across multiple sycophancy types and interaction patterns. Furthermore, we propose two training-free mitigation strategies that reveal potential paths for reducing sycophantic bias: (i) enhancing visual grounding through interpretable key-frame selection and (ii) steering model behavior away from sycophancy via targeted, inference-time intervention on its internal neural representations. Our code is available at https://anonymous.4open.science/r/VideoSycophancy-567F.
Summary / 总结
This paper addresses sycophancy in Video-LLMs, where models align with user input even when it contradicts the visual evidence. To fill the gap in systematic evaluation, the authors propose VISE, a benchmark for assessing sycophantic behavior in Video-LLMs across question formats, prompt biases, and visual reasoning tasks. The benchmark supports fine-grained analysis across multiple sycophancy types and interaction patterns. The authors also propose two training-free mitigation strategies, enhancing visual grounding through interpretable key-frame selection and steering model behavior through inference-time intervention on internal representations, which indicate potential paths for reducing sycophantic bias.
该论文关注视频大型语言模型(Video-LLMs)中的奉承行为问题,即这些模型可能在视觉证据与用户输入矛盾时仍会与其一致。为填补系统性评估的空白,作者提出了VISE,一个用于评估Video-LLMs在各种问题格式和视觉推理任务中的奉承行为的基准。他们还提出了两种缓解策略:增强视觉定位和在推理过程中引导模型行为。关键发现包括识别不同类型的奉承行为以及所提缓解方法在减少奉承偏见方面的有效性。
Iterative Multimodal Retrieval-Augmented Generation for Medical Question Answering
Authors: Xupeng Chen, Binbin Shi, Chenqian Le, Jiaqi Zhang, Kewen Wang, Ran Gong, Jinhan Zhang, Chihang Wang
First: 2026-04-30T11:16:07+00:00 · Latest: 2026-04-30T11:16:07+00:00
Abstract
Medical retrieval-augmented generation (RAG) systems typically operate on text chunks extracted from biomedical literature, discarding the rich visual content (tables, figures, structured layouts) of original document pages. We propose MED-VRAG, an iterative multimodal RAG framework that retrieves and reasons over PMC document page images instead of OCR'd text. The system pairs ColQwen2.5 patch-level page embeddings with a sharded MapReduce LLM filter, scaling to ~350K pages while keeping Stage-1 retrieval under 30 ms via an offline coarse-to-fine index (C=8 centroids per page, ANN over centroids, exact two-way scoring on the top-R shortlist). A vision-language model (VLM) then iteratively refines its query and accumulates evidence in a memory bank across up to 3 reasoning rounds, with a single iteration costing ~15.9 s and the full three-round pipeline ~47.8 s on 4xA100. Across four medical QA benchmarks (MedQA, MedMCQA, PubMedQA, MMLU-Med), MED-VRAG reaches 78.6% average accuracy. Under controlled comparison with the same Qwen2.5-VL-32B backbone, retrieval contributes a +5.8 point gain over the no-retrieval baseline; we also note a +1.8 point edge over MedRAG + GPT-4 (76.8%), with the caveat that this is a cross-paper rather than head-to-head comparison. Ablations isolate +1.0 from page-image vs text-chunk retrieval, +1.5 from iteration, and +1.0 from the memory bank.
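The coarse-to-fine index mentioned in the MED-VRAG abstract (a few centroids per page, a shortlist, then exact scoring) can be illustrated with a small sketch. This is a hedged illustration under several assumptions and not the paper's system: centroids are computed by averaging contiguous chunks of patches rather than by clustering, the coarse stage is a brute-force scan instead of ANN, and the exact stage uses a one-directional late-interaction (MaxSim) score, since the abstract does not define its "two-way" scoring. All function names are hypothetical.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim(query_vecs, doc_vecs):
    """Late-interaction score: each query vector takes its best match, then sum."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


def chunk_centroids(patches, c=8):
    """Stand-in for clustering: average contiguous chunks into at most c centroids.

    Assumes `patches` is a non-empty list of equal-length vectors.
    """
    size = -(-len(patches) // c)  # ceiling division
    centroids = []
    for start in range(0, len(patches), size):
        chunk = patches[start:start + size]
        centroids.append([sum(col) / len(chunk) for col in zip(*chunk)])
    return centroids


def build_index(pages, c=8):
    """Offline step: map each page id to its small set of centroids."""
    return {page_id: chunk_centroids(patches, c) for page_id, patches in pages.items()}


def search(query_vecs, pages, index, shortlist=2, top_k=1):
    """Coarse scoring on centroids, then exact scoring on the shortlist only."""
    coarse = sorted(index, key=lambda pid: maxsim(query_vecs, index[pid]), reverse=True)
    candidates = coarse[:shortlist]
    exact = sorted(candidates, key=lambda pid: maxsim(query_vecs, pages[pid]), reverse=True)
    return exact[:top_k]


# Toy 2-D "patch embeddings" for three pages.
pages = {
    "a": [[1.0, 0.0], [1.0, 0.0]],
    "b": [[0.0, 1.0], [0.0, 1.0]],
    "c": [[-1.0, 0.0], [-1.0, 0.0]],
}
index = build_index(pages)
print(search([[0.0, 1.0]], pages, index))
```

The design point is that the expensive exact score touches only `shortlist` pages, while the coarse pass touches at most `c` vectors per page; this is the usual reason such an index keeps first-stage latency low.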
Auditing Frontier Vision-Language Models for Trustworthy Medical VQA: Grounding Failures, Format Collapse, and Domain Adaptation
Authors: Xupeng Chen, Binbin Shi, Chenqian Le, Qifu Yin, Lang Lin, Haowei Ni, Ran Gong, Panfeng Li
First: 2026-04-30T11:11:47+00:00 · Latest: 2026-04-30T11:11:47+00:00
Abstract
Deploying vision-language models (VLMs) in clinical settings demands auditable behavior under realistic failure conditions, yet the failure landscape of frontier VLMs on specialized medical inputs is poorly characterized. We audit five recent frontier and grounding-aware VLMs (Gemini 2.5 Pro, GPT-5, o3, GLM-4.5V, Qwen 2.5 VL) on Medical VQA along two trust-relevant axes. Perception: all models localize anatomical and pathological targets poorly (the best model reaches only 0.23 mean IoU and 19.1% Acc@0.5) and exhibit clinically dangerous laterality confusion. Pipeline integration: a self-grounding pipeline, where the same model localizes then answers, degrades VQA accuracy for every model, driven by both inaccurate localization and format-compliance failures under the two-step prompt (parse failure rises to 70%-99% for Gemini and GPT-5 on VQA-RAD). Replacing predicted boxes with ground-truth annotations recovers and improves VQA accuracy, consistent with the failure residing in the perception module rather than in the decomposition itself. These observational findings identify grounding quality as a primary trustworthiness bottleneck in our SLAKE bounding-box setting. As a complementary fine-tuning follow-up, supervised fine-tuning of Qwen 2.5 VL on combined Med-VQA training data attains the highest reported SLAKE open-ended recall (85.5%) among comparable methods, suggesting that the VQA-level gap is tractable with domain adaptation; whether this also closes the perception/trustworthiness bottleneck is left to future work.
Summary / 总结
This study audits five frontier vision-language models (Gemini 2.5 Pro, GPT-5, o3, GLM-4.5V, Qwen 2.5 VL) on medical visual question answering (VQA) to assess their trustworthiness. The models struggle with anatomical and pathological target localization: the best model reaches only 0.23 mean IoU and 19.1% Acc@0.5. A self-grounding pipeline further degrades VQA accuracy due to both localization errors and format-compliance failures. Replacing predicted boxes with ground-truth annotations improves VQA accuracy, indicating the issue lies in perception rather than decomposition. Supervised fine-tuning of Qwen 2.5 VL on combined Med-VQA training data reaches 85.5% SLAKE open-ended recall, suggesting the VQA-level gap is tractable with domain adaptation; whether this closes the perception/trustworthiness bottleneck is left to future work.
该研究审计了五种先进的视觉-语言模型(Gemini 2.5 Pro、GPT-5、o3、GLM-4.5V、Qwen 2.5 VL)在医学视觉问答(VQA)任务中的表现,以评估其可信度。这些模型在解剖学和病理学目标定位方面表现不佳,仅达到0.23的平均IoU和19.1%的Acc@0.5。自我定位管道进一步降低了VQA的准确性,主要是由于定位错误和格式合规性失败。用真实标注替换预测框可以提高VQA准确性,表明问题在于感知模块而非分解本身。监督微调Qwen 2.5 VL在医学VQA数据上的表现提高了开放性召回率到85.5%,表明领域适应具有潜力,但感知/可信度瓶颈问题仍需进一步研究。
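The two localization metrics quoted in this entry, mean IoU and Acc@0.5, are standard and can be computed as in the short sketch below. The box format `(x1, y1, x2, y2)` and the use of `>=` at the threshold are assumptions; the paper's exact conventions, including how unparseable predictions are scored, are not stated in the abstract.

```python
def box_iou(a, b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def grounding_metrics(predicted, ground_truth, threshold=0.5):
    """Return (mean IoU, fraction of samples with IoU at or above the threshold)."""
    ious = [box_iou(p, g) for p, g in zip(predicted, ground_truth)]
    mean_iou = sum(ious) / len(ious)
    acc_at_threshold = sum(iou >= threshold for iou in ious) / len(ious)
    return mean_iou, acc_at_threshold


predicted = [(0, 0, 2, 2), (0, 0, 1, 1)]
ground_truth = [(0, 0, 2, 2), (2, 2, 3, 3)]
print(grounding_metrics(predicted, ground_truth))
```

Read this way, a mean IoU of 0.23 means the predicted and annotated boxes overlap on average by less than a quarter of their union, and an Acc@0.5 of 19.1% means roughly one prediction in five reaches the 0.5 overlap threshold.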
Training-Free Reward-Guided Image Editing via Trajectory Optimal Control
Authors: Jinho Chang, Jaemin Kim, Jong Chul Ye
Venue: ICLR 2026 Poster
First: 2025-09-30T06:34:37+00:00 · Latest: 2026-04-30T11:11:41+00:00
Comments: Poster in ICLR 2026; 22 pages, 9 figures. The code is available at https://github.com/jinhojsk515/ITOC
Abstract
Recent advancements in diffusion and flow-matching models have demonstrated remarkable capabilities in high-fidelity image synthesis. A prominent line of research involves reward guidance, which steers the generation process during inference to align with specific objectives. However, applying this reward-guided approach to image editing, which requires preserving the semantic content of the source image while enhancing a target reward, remains largely unexplored. In this work, we introduce a novel framework for training-free, reward-guided image editing. We formulate the editing process as a trajectory optimal control problem where the reverse process of a diffusion model is treated as a controllable trajectory originating from the source image, and the adjoint states are iteratively updated to steer the editing process. Through extensive experiments across distinct editing tasks, we demonstrate that our approach significantly outperforms existing inversion-based training-free guidance baselines, achieving a superior balance between reward maximization and fidelity to the source image without reward hacking.
中文标题/摘要
标题:无训练奖励引导图像编辑通过轨迹最优控制
近期在扩散和流匹配模型方面的进展展示了在高保真图像合成方面的出色能力。研究的一个主要方向是奖励引导的指导,该方法在推理过程中引导生成过程以与特定目标对齐。然而,将这种奖励引导的方法应用于需要保留源图像的语义内容同时增强目标奖励的图像编辑任务,尚未得到充分探索。在本文中,我们提出了一种新的无训练奖励引导图像编辑框架。我们将编辑过程形式化为一个轨迹最优控制问题,其中扩散模型的逆过程被视为从源图像出发的可控轨迹,并通过迭代更新伴随状态来引导编辑过程。通过在不同编辑任务上的广泛实验,我们证明了我们的方法在奖励最大化和对源图像保真度之间取得了显著优于现有基于反转的无训练指导基线的表现,同时没有出现奖励作弊。
Summary / 总结
This work addresses the challenge of training-free reward-guided image editing by formulating the editing process as a trajectory optimal control problem. The framework uses the reverse process of a diffusion model to generate controllable trajectories from the source image, with adjoint states iteratively updated to align with the target reward. Experiments show that this approach outperforms existing inversion-based methods, balancing reward maximization and source image fidelity effectively.
该研究提出了一种无需训练的奖励引导图像编辑框架。它将编辑过程视为轨迹最优控制问题,使用扩散模型的逆过程作为可控轨迹。该方法通过平衡奖励最大化和源图像保真度,优于现有基于反演的方法,且未出现奖励作弊现象。在多种编辑任务上的实验验证了其有效性。
Improving Calibration in Test-Time Prompt Tuning for Vision-Language Models via Data-Free Flatness-Aware Prompt Pretraining
Authors: Hyeonseo Jang, Jaebyeong Jeon, Joong-Won Hwang, Kibok Lee
Venue: CVPR 2026
First: 2026-04-30T11:01:23+00:00 · Latest: 2026-04-30T11:01:23+00:00
Comments: CVPR 2026
Abstract
Test-time prompt tuning (TPT) has emerged as a promising technique for enhancing the adaptability of vision-language models by optimizing textual prompts using unlabeled test data. However, prior studies have observed that TPT often produces poorly calibrated models, raising concerns about the reliability of their predictions. Recent works address this issue by incorporating additional regularization terms that constrain model outputs, which improve calibration but often degrade performance. In this work, we reveal that these regularization strategies implicitly encourage optimization toward flatter minima, and that the sharpness of the loss landscape around adapted prompts is a key factor governing calibration quality. Motivated by this observation, we introduce Flatness-aware Prompt Pretraining (FPP), a simple yet effective pretraining framework for TPT that initializes prompts within flatter regions of the loss landscape prior to adaptation. We show that simply replacing the initialization in existing TPT pipelines--without modifying any other components--is sufficient to improve both calibration and performance. Notably, FPP requires no labeled data and incurs no additional computational costs during test-time tuning, making it highly practical for real-world deployment. The code is available at: https://github.com/YonseiML/fpp.
Summary / 总结
This work addresses the issue of poorly calibrated models in test-time prompt tuning (TPT) for vision-language models by proposing Flatness-aware Prompt Pretraining (FPP). FPP initializes prompts in flatter regions of the loss landscape, improving both calibration and performance without additional computational costs. This method does not require labeled data and can be easily integrated into existing TPT pipelines.
本文解决了测试时提示调整(TPT)在视觉-语言模型中导致模型校准不佳的问题。它提出了平滑度感知提示预训练(FPP),通过将提示初始化在损失景观的更平滑区域来提高校准,同时不降低性能。实验表明,FPP可以同时提升校准和性能,且不需要额外的标注数据或测试时的计算成本。
WaferSAGE: Large Language Model-Powered Wafer Defect Analysis via Synthetic Data Generation and Rubric-Guided Reinforcement Learning
Authors: Ke Xu
First: 2026-04-30T09:19:26+00:00 · Latest: 2026-04-30T09:19:26+00:00
Comments: 16 pages, 3 figures, 8 tables
Abstract
We present WaferSAGE, a framework for wafer defect visual question answering using small vision-language models. To address data scarcity in semiconductor manufacturing, we propose a three-stage synthesis pipeline incorporating structured rubric generation for precise evaluation. Starting from limited labeled wafer maps, we employ clustering-based cleaning to filter label noise, then generate comprehensive defect descriptions using vision-language models, which are converted into structured evaluation rubrics. These rubrics guide the synthesis of VQA pairs, ensuring coverage across defect type identification, spatial distribution, morphology, and root cause analysis.
Our dual assessment framework aligns rule-based metrics with LLM-Judge scores via Bayesian optimization, enabling reliable automated evaluation. Through curriculum-based reinforcement learning with Group Sequence Policy Optimization (GSPO) and rubric-aligned rewards, our 4B-parameter Qwen3-VL model achieves a 6.493 LLM-Judge score, closely approaching Gemini-3-Flash (7.149) while enabling complete on-premise deployment. We demonstrate that small models with domain-specific training can surpass proprietary large models in specialized industrial visual understanding, offering a viable path for privacy-preserving, cost-effective deployment in semiconductor manufacturing.
Summary / 总结
WaferSAGE is a framework for wafer defect visual question answering using small vision-language models. It addresses data scarcity in semiconductor manufacturing through a three-stage synthesis pipeline involving structured rubric generation. Starting from limited labeled wafer maps, the framework employs clustering-based cleaning, generates comprehensive defect descriptions, and converts them into structured evaluation rubrics. These rubrics guide the synthesis of VQA pairs, ensuring coverage across defect type identification, spatial distribution, morphology, and root cause analysis. The dual assessment framework uses Bayesian optimization to align rule-based metrics with LLM-Judge scores, and through curriculum-based reinforcement learning with Group Sequence Policy Optimization (GSPO), the 4B-parameter Qwen3-VL model achieves a 6.493 LLM-Judge score, approaching Gemini-3-Flash (7.149) while enabling on-premise deployment.
WaferSAGE 是一种使用小型视觉语言模型的晶圆缺陷视觉问答框架,旨在解决半导体制造中的数据稀缺问题。该框架通过包含结构化评估标准生成的三阶段合成管道来实现。从有限的标注晶圆图开始,框架采用聚类基清洗,生成全面的缺陷描述,并将其转换为结构化的评估标准。这些标准指导生成 VQA 对,确保覆盖缺陷类型识别、空间分布、形态和根本原因分析。双评估框架使用贝叶斯优化来对齐基于规则的度量与LLM-裁判评分,并通过基于课程的强化学习与组序列策略优化(GSPO),4B参数的Qwen3-VL 模型实现了6.493 LLM-裁判评分,接近Gemini-3-Flash(7.149),同时支持本地部署。
Test-Time Distillation for Continual Model Adaptation
Authors: Xiao Chen, Jiazhen Huang, Zhiming Liu, Qinting Jiang, Fanding Huang, Jingyan Jiang, Zhi Wang
Venue: CVPR 2026
First: 2025-06-03T09:16:51+00:00 · Latest: 2026-04-30T09:01:20+00:00
Comments: Accepted by CVPR 2026 Findings
Abstract
Deep neural networks often suffer performance degradation upon deployment due to distribution shifts. Continual Test-Time Adaptation (CTTA) aims to address this issue in an unsupervised manner. However, existing methods that rely on self-supervision are prone to an inherent self-referential feedback loop that amplifies initial prediction errors, leading to model drift. We revisit this limitation and propose Test-Time Distillation (TTD), which reframes adaptation as a distillation process guided by a frozen Vision-Language Model (VLM) as an external signal. While promising, we find that direct distillation is fraught with two pitfalls: (1) the Generalist Trap, where the VLM's broad but non-specialized knowledge leads to suboptimal performance on specific tasks and shifts; and (2) the Entropy Bias, where naive model fusion techniques based on entropy fail due to the disparate calibration of heterogeneous models. These pitfalls highlight the need to build a robust supervisory signal and leverage it to guide the target model toward stable adaptation. Hence, we present CoDiRe, a Continual Distillation and Rectification framework for TTD. CoDiRe first constructs a robust blended teacher by dynamically fusing the predictions of the VLM and the target model. Critically, it circumvents the Entropy Bias by leveraging Maximum Softmax Probability (MSP) as a more reliable confidence metric for weighting each model's expertise. Then it applies an Optimal Transport-based rectification to further align predictions with the blended teacher, enabling continuous and stable adaptation. Extensive experiments show that CoDiRe outperforms state-of-the-art baselines, exceeding CoTTA by 10.55% with only 48% of its time cost on ImageNet-C. Project page is publicly available at https://github.com/walawalagoose/TTD.
中文标题/摘要
标题:部署时的蒸馏技术用于持续模型适应
深度神经网络在部署时由于分布偏移往往会遭受性能下降。持续测试时适应(CTTA)旨在以无监督的方式解决这一问题。然而,现有的依赖自我监督的方法容易陷入固有的自我反馈循环,这会放大初始预测错误,导致模型漂移。我们重新审视了这一局限性,并提出了测试时蒸馏(TTD),将其重新定义为由冻结的视觉-语言模型(VLM)作为外部信号引导的蒸馏过程。虽然前景广阔,但我们发现直接蒸馏存在两个陷阱:(1)专家陷阱,VLM 的广泛但非专门化的知识导致特定任务和转移上的次优性能;(2)熵偏差,基于熵的简单模型融合技术由于异构模型的不一致校准而失效。这些陷阱突显了建立稳健的监督信号并利用其引导目标模型实现稳定适应的必要性。因此,我们提出了CoDiRe,一种持续蒸馏和校正框架用于TTD。CoDiRe 首先通过动态融合VLM 和目标模型的预测来构建一个稳健的混合教师。关键的是,它通过利用最大softmax概率(MSP)作为更可靠的置信度度量来规避熵偏差,为每个模型的专业知识分配权重。然后,它应用基于最优传输的校正,进一步使预测与混合教师对齐,从而实现持续和稳定的适应。广泛的实验表明,CoDiRe 在ImageNet-C 上仅花费CoTTA 48%的时间成本时,性能超过了最先进的基线,超过了CoTTA 10.55%。项目页面在 https://github.com/walawalagoose/TTD 公开可用。
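The CoDiRe abstract states that the blended teacher weights each model by its Maximum Softmax Probability (MSP) rather than by entropy. A minimal sketch of such a fusion follows; normalizing the two MSP values into convex weights is an assumption made here for illustration, and the paper's actual weighting rule and its Optimal Transport rectification step are not reproduced.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def msp(probs):
    """Maximum Softmax Probability: the confidence of the top class."""
    return max(probs)


def blend_teacher(vlm_probs, target_probs):
    """Fuse two predictive distributions with MSP-proportional weights (assumed rule)."""
    w_vlm, w_target = msp(vlm_probs), msp(target_probs)
    alpha = w_vlm / (w_vlm + w_target)
    return [alpha * p + (1.0 - alpha) * q for p, q in zip(vlm_probs, target_probs)]


vlm_probs = softmax([2.0, 0.0, 0.0])     # fairly confident frozen VLM
target_probs = softmax([0.1, 0.0, 0.0])  # nearly uniform target model
print(blend_teacher(vlm_probs, target_probs))
```

The blended distribution leans toward whichever model is more confident on the current sample, and it remains a valid probability distribution because the weights are non-negative and sum to one.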
CheXthought: A global multimodal dataset of clinical chain-of-thought reasoning and visual attention for chest X-ray interpretation
Authors: Sonali Sharma, Jin Long, George Shih, Sarah Eid, Christian Bluethgen, Francine L. Jacobson, Emily B. Tsai, Global Radiology Consortium, Ahmed M. Alaa, Curtis P. Langlotz
First: 2026-04-29T04:33:43+00:00 · Latest: 2026-04-30T08:02:58+00:00
Comments: 51 pages, 7 figures, 10 tables
Abstract
Chest X-ray interpretation is one of the most frequently performed diagnostic tasks in medicine and a primary target for AI development, yet current vision-language models are primarily trained on datasets of paired images and reports, not the cognitive processes and visual attention that underlie clinical reasoning. Here, we present CheXthought, a global, multimodal resource containing 103,592 chain-of-thought reasoning traces and 6,609,082 synchronized visual attention annotations across 50,312 multi-read chest X-rays from 501 radiologists in 71 countries. Our analysis reveals clinical reasoning patterns in how experts deploy distinct visual search strategies, integrate clinical context, and communicate uncertainty. We demonstrate the clinical utility of CheXthought across four dimensions. First, CheXthought reasoning significantly outperforms state-of-the-art vision-language model chain-of-thought in factual accuracy and spatial grounding. Second, visual attention data used as an inference-time hint recovers missed findings and significantly reduces hallucinations. Third, vision-language models trained on CheXthought data achieve significantly stronger pathology classification, visual faithfulness, temporal reasoning and uncertainty communication. Fourth, leveraging CheXthought's multi-reader annotations, we predict both human-human and human-AI disagreement directly from an image, enabling transparent communication of case difficulty, uncertainty and model reliability. These findings establish CheXthought as a resource for advancing multimodal clinical reasoning and the development of more transparent, interpretable vision-language models.
中文标题/摘要
标题:CheXthought:全球多模态胸部X光解释临床推理和视觉注意力数据集
胸部X光解释是医学中最常见的诊断任务之一,也是AI开发的主要目标,然而当前的视觉-语言模型主要是在成对的图像和报告数据集上进行训练,而不是临床推理和视觉注意力的认知过程。在此,我们介绍了CheXthought,这是一个包含来自71个国家501名放射科医生的50,312张多读胸部X光片的全球多模态资源,其中包含103,592条临床推理链和6,609,082个同步视觉注意力注释。我们的分析揭示了专家在使用不同的视觉搜索策略、整合临床背景和传达不确定性方面的临床推理模式。我们从四个维度展示了CheXthought的临床应用价值。首先,CheXthought的推理在事实准确性和空间定位方面显著优于最先进的视觉-语言模型的推理链。其次,作为推理时的提示使用的视觉注意力数据可以恢复遗漏的发现并显著减少幻觉。第三,使用CheXthought数据训练的视觉-语言模型在病理分类、视觉忠实度、时间推理和不确定性传达方面表现更佳。第四,利用CheXthought的多读者注释,我们直接从图像中预测人类-人类和人类-AI的分歧,从而实现病例难度、不确定性和模型可靠性的透明沟通。这些发现确立了CheXthought作为促进多模态临床推理和开发更透明、可解释的视觉-语言模型的资源。
EdgeFM: Efficient Edge Inference for Vision-Language Models
Authors: Mengling Deng, Yuanpeng Chen, Sheng Yang, Wei Tao, Wenhai Zhang, Hui Song, Linyuanhao Qin, Kai Zhao, Xiaojun Ye, Shanhui Mo, Jingli Fan, Shuang Zhang, Bei Liu, Tiankun Zhao, Xiangjing An
First: 2026-04-30T06:18:50+00:00 · Latest: 2026-04-30T06:18:50+00:00
Comments: Technical report version
Abstract
Vision-language models (VLMs) have demonstrated strong applicability in edge industrial applications, yet their deployment remains severely constrained by requirements for deterministic low latency and stable execution under resource limitations. Existing frameworks either rely on bloated general-purpose designs or force developers into opaque, hardware-specific closed-source ecosystems, leading to hardware lock-in and poor cross-platform adaptability. Observing that modern AI agents can efficiently search and tune configurations to generate highly optimized low-level kernels for standard LLM operators, we propose EdgeFM, a lightweight, agent-driven VLM/LLM inference framework tailored for cross-platform industrial edge deployment. EdgeFM removes non-essential features to reduce single-request latency, and encapsulates agent-tuned kernel optimizations as a modular library of reusable skills. By allowing direct invocation of these skills rather than waiting for closed-source implementations, it effectively closes the performance gap long dominated by proprietary toolchains. The framework natively supports mainstream platforms including x86 and NVIDIA Orin SoCs, and represents the first end-to-end VLA deployment on the domestic Horizon Journey platform, enhancing cross-platform portability. In most cases, it yields clearly better inference performance than conventional vendor-specific toolchains, achieving up to a 1.49x speedup over TensorRT-Edge-LLM on the NVIDIA Orin platform. Experimental results show that EdgeFM delivers favorable end-to-end inference performance, providing an open-source, production-grade solution for diverse edge industrial scenarios.
RosettaSearch: Multi-Objective Inference-Time Search for Protein Sequence Design
Authors: Meghana Kshirsagar, Allen Nie, Ching-An Cheng, Fanglei Xue, Rahul Dodhia, Juan Lavista Ferres, Kevin K. Yang, Frank DiMaio
First: 2026-04-19T00:20:18+00:00 · Latest: 2026-04-30T05:12:50+00:00
Abstract
We introduce RosettaSearch, an inference-time multi-objective optimization approach for backbone-conditioned protein sequence design. We use large language models (LLMs) as a generative optimizer within a search algorithm capable of controlled exploration and exploitation, using rewards computed from RosettaFold3, a structure prediction model, under a strict computational budget. In a large-scale evaluation, we apply RosettaSearch to 400 suboptimal sequences generated by LigandMPNN (a state-of-the-art model trained for protein sequence design), recovering high-fidelity designs that LigandMPNN's single-pass decoding fails to produce. RosettaSearch's designs show improvements in structural fidelity metrics ranging from 18% to 68%, translating to a 2.5x improvement in design success rate. We observe that these gains in success rate are robust when RosettaSearch-designed sequences are evaluated with an independent structure prediction oracle (Chai-1) and generalize across two distinct LLM families (o4-mini and Gemini-3), with performance scaling consistently with reasoning capability.
We further demonstrate that RosettaSearch improves the sequence fidelity of ProteinMPNN designs for de novo backbones from the Dayhoff atlas, showing that the approach generalizes beyond native protein structures to computationally generated backbones. We also demonstrate a multi-modal extension of RosettaSearch with vision-language models, where images of predicted protein structures are used as feedback to incorporate structural context to guide protein sequence generation. To our knowledge, this is the first large-scale demonstration that LLMs can serve as effective generative optimizers for backbone-conditioned protein sequence design, yielding systematic gains without any model retraining.
Summary / 总结
RosettaSearch is an inference-time multi-objective optimization approach for backbone-conditioned protein sequence design. It uses LLMs as a generative optimizer capable of controlled exploration and exploitation, using rewards computed from RosettaFold3. Under a strict computational budget, RosettaSearch recovers high-fidelity protein designs from 400 suboptimal sequences generated by LigandMPNN, improving structural fidelity metrics by 18% to 68% and yielding a 2.5x improvement in design success rate. The approach also generalizes to computationally generated backbones from the Dayhoff atlas. A multi-modal extension incorporates images of predicted structures as feedback to guide protein sequence generation, demonstrating that LLMs can serve as generative optimizers for backbone-conditioned protein sequence design.
RosettaSearch 是一种在推断时进行多目标优化的蛋白质序列设计方法,使用大型语言模型作为生成优化器嵌入在搜索算法中。它将结构准确度指标提高了18%到68%,使设计成功率提高了2.5倍。该方法在不同的LLM家族中具有普适性,并且对于天然和计算生成的蛋白质骨架也显示出有效性。
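To make the idea of a propose-and-score search under a fixed evaluation budget concrete, here is a minimal, generic sketch. It is not the authors' algorithm: the abstract does not specify the search procedure, so this uses a simple greedy loop with a single scalar reward. The function names `toy_propose` and `toy_reward` are hypothetical stand-ins for the LLM proposer and the structure-prediction reward, respectively.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def toy_reward(seq, target):
    """Hypothetical reward: fraction of positions matching a target string.
    Stands in for a score computed from a structure prediction model."""
    return sum(a == b for a, b in zip(seq, target)) / len(target)

def toy_propose(parent, rng):
    """Hypothetical proposer: mutate one random position.
    Stands in for an LLM call that suggests a revised sequence."""
    i = rng.randrange(len(parent))
    return parent[:i] + rng.choice(AMINO_ACIDS) + parent[i + 1:]

def budgeted_search(start, target, budget, seed=0):
    """Greedy propose-and-score loop that stops after `budget` reward evaluations."""
    rng = random.Random(seed)
    best, best_reward = start, toy_reward(start, target)
    evaluations = 1  # scoring the starting sequence counts against the budget
    while evaluations < budget:
        candidate = toy_propose(best, rng)
        reward = toy_reward(candidate, target)
        evaluations += 1
        if reward > best_reward:  # keep the candidate only if it improves
            best, best_reward = candidate, reward
    return best, best_reward, evaluations

start_seq, target_seq = "AAAAAAAA", "ACDEFGHI"
best, best_reward, used = budgeted_search(start_seq, target_seq, budget=200)
print(best, best_reward, used)
```

A multi-objective variant would replace the scalar comparison with, for example, a weighted sum or a Pareto-dominance check; which of these the paper uses is not stated in the abstract.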
Understanding Adversarial Transferability in Vision-Language Models for Autonomous Driving: A Cross-Architecture Analysis
Authors: David Fernandez, Pedram MohajerAnsari, Amir Salarpour, Mert D. Pese
Venue: SAE Technical Paper 2026-01-0170, SAE WCX 2026
First: 2026-04-30T04:33:38+00:00 · Latest: 2026-04-30T04:33:38+00:00
Comments: 9 pages, 2 figures. Accepted at SAE WCX 2026
Abstract
Vision-language models (VLMs) are increasingly used in autonomous driving because they combine visual perception with language-based reasoning, supporting more interpretable decision-making. Yet their robustness to physical adversarial attacks, especially whether such attacks transfer across different VLM architectures, is not well understood, which poses a practical risk when attackers do not know which model a vehicle uses. We address this gap with a systematic cross-architecture study of adversarial transferability in VLM-based driving, evaluating three representative architectures (Dolphins, OmniDrive, and LeapVAD) using physically realizable patches placed on roadside infrastructure in both crosswalk and highway scenarios. Our transfer-matrix evaluation shows high cross-architecture effectiveness, with transfer rates of 73-91% (mean TR = 0.815 for crosswalk and 0.833 for highway) and sustained frame-level manipulation over 64.7-79.4% of the critical decision window even when patches are not optimized for the target model.
中文标题/摘要
标题:理解视觉-语言模型在自动驾驶中的对抗转移性:跨架构分析
视觉-语言模型(VLMs)在自动驾驶中越来越受欢迎,因为它们结合了视觉感知和基于语言的推理,支持更可解释的决策制定,然而它们对物理对抗攻击的鲁棒性,尤其是这些攻击是否在不同的VLM架构之间转移,尚未得到充分理解,当攻击者不知道车辆使用的是哪个模型时,这会带来实际风险。我们通过在人行横道和高速公路场景中使用物理可实现的补丁放置在路边基础设施上,对基于VLM的驾驶中的对抗转移性进行了系统性的跨架构研究,评估了三种代表性架构(Dolphins、OmniDrive和LeapVAD)。我们的转移矩阵评估显示了高跨架构有效性,转移率为73-91%(人行横道的平均转移率TR = 0.815,高速公路为0.833),即使补丁未针对目标模型进行优化,也能够在关键决策窗口的64.7-79.4%的帧级上维持操纵。
Summary / 总结
This study investigates the transferability of adversarial attacks across different vision-language model architectures in autonomous driving. By evaluating three representative architectures (Dolphins, OmniDrive, and LeapVAD) with physically realizable patches on roadside infrastructure, the research demonstrates high cross-architecture effectiveness with transfer rates of 73-91% and sustained manipulation over 64.7-79.4% of the critical decision window, indicating a significant risk of adversarial attacks transferring between models.
研究探讨了不同视觉-语言模型架构在自动驾驶中对抗攻击的跨架构转移性。通过在路边基础设施上使用物理可实现的贴片评估三个代表性架构(Dolphins、OmniDrive 和 LeapVAD),研究显示了高跨架构有效性,转移率为73-91%,并在关键决策窗口的64.7-79.4%时间内持续操纵,表明攻击在不同模型之间转移的风险显著。
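The transfer-matrix evaluation mentioned in the abstract can be illustrated with a small sketch. It assumes that the transfer rate (TR) for a source/target pair is the fraction of trials in which a patch optimized on the source model also succeeds against the target model, and that the mean TR averages only the off-diagonal (source different from target) entries. Both definitions are assumptions, since the abstract does not spell them out. The counts are made-up placeholders, not the paper's data.

```python
from statistics import mean

def transfer_rates(success_counts, trials):
    """Turn per-pair success counts into rates.
    success_counts[s][t] = successes of a patch optimized on model s against model t."""
    return [[count / trials for count in row] for row in success_counts]

def mean_off_diagonal(rates):
    """Average transfer rate over pairs where source and target differ (assumed definition)."""
    n = len(rates)
    return mean(rates[s][t] for s in range(n) for t in range(n) if s != t)

# Placeholder counts for three hypothetical models, 10 trials per pair.
counts = [
    [10, 8, 6],
    [7, 10, 9],
    [5, 8, 10],
]
rates = transfer_rates(counts, trials=10)
mean_tr = mean_off_diagonal(rates)
print(round(mean_tr, 3))
```

Excluding the diagonal keeps white-box results (patch evaluated on the model it was optimized for) from inflating the cross-architecture figure.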
VeraRetouch: A Lightweight Fully Differentiable Framework for Multi-Task Reasoning Photo Retouching
Authors: Yihong Guo, Youwei Lyu, Jiajun Tang, Yizhuo Zhou, Hongliang Wang, Jinwei Chen, Changqing Zou, Qingnan Fan
First: 2026-04-30T03:39:32+00:00 · Latest: 2026-04-30T03:39:32+00:00
Abstract
Reasoning photo retouching has gained significant traction, requiring models to analyze image defects, give reasoning processes, and execute precise retouching enhancements. However, existing approaches often rely on non-differentiable external software, creating optimization barriers and suffering from high parameter redundancy and limited generalization. To address these challenges, we propose VeraRetouch, a lightweight and fully differentiable framework for multi-task photo retouching. We employ a 0.5B Vision-Language Model (VLM) as the central intelligence to formulate retouching plans based on instructions and scene semantics. Furthermore, we develop a fully differentiable Retouch Renderer that replaces external tools, enabling direct end-to-end pixel-level training through decoupled control latents for lighting, global color, and specific color adjustments. To overcome data scarcity, we introduce AetherRetouch-1M+, the first million-scale dataset for professional retouching, constructed via a new inverse degradation workflow. Furthermore, we propose DAPO-AE, a reinforcement learning post-training strategy that enhances autonomous aesthetic cognition. Extensive experiments demonstrate that VeraRetouch achieves state-of-the-art performance across multiple benchmarks while maintaining a significantly smaller footprint, enabling mobile deployment. Our code and models are publicly available at https://github.com/OpenVeraTeam/VeraRetouch.
中文标题/摘要
标题:VeraRetouch:一种轻量级全可微多任务推理照片修图框架
照片修图推理已获得显著关注,要求模型分析图像缺陷、给出推理过程并执行精确的修图增强。然而,现有方法往往依赖于非可微分的外部软件,造成优化障碍并导致参数冗余和泛化能力有限。为解决这些挑战,我们提出VeraRetouch,一种轻量级全可微多任务照片修图框架。我们采用0.5B视觉-语言模型(VLM)作为中心智能体,基于指令和场景语义制定修图计划。此外,我们开发了一种全可微分的修图渲染器,取代外部工具,通过解耦控制隐变量直接进行端到端像素级训练,用于照明、全局色彩和特定色彩调整。为克服数据稀缺,我们引入了AetherRetouch-1M+,这是首个百万规模的专业修图数据集,通过新的逆退化工作流构建。此外,我们提出了DAPO-AE,一种强化学习后训练策略,增强自主美学认知。大量实验表明,VeraRetouch在多个基准测试中达到最先进的性能,同时保持显著更小的体积,支持移动部署。我们的代码和模型已公开发布在https://github.com/OpenVeraTeam/VeraRetouch。
Summary / 总结
VeraRetouch is a lightweight fully differentiable framework for multi-task photo retouching, addressing the limitations of existing non-differentiable approaches. It uses a 0.5B Vision-Language Model to generate retouching plans and a fully differentiable Retouch Renderer to replace external tools, allowing end-to-end training. VeraRetouch also introduces AetherRetouch-1M+, a million-scale dataset, and DAPO-AE, a reinforcement learning strategy for enhancing aesthetic cognition. Experiments show that VeraRetouch outperforms existing methods while being more compact, suitable for mobile deployment.
VeraRetouch 是一个轻量级且完全可微的多任务照片修复框架,通过使用 0.5B 视觉-语言模型生成修复计划,并用完全可微的修复渲染器替代非可微的外部工具。它还引入了 AetherRetouch-1M+,一个百万规模的数据集,以及 DAPO-AE,一种强化学习后训练策略。实验表明,VeraRetouch 在多个基准上优于现有方法,同时更紧凑,适用于移动部署。
CasLayout: Cascaded 3D Layout Diffusion for Indoor Scene Synthesis with Implicit Relation Modeling
Authors: Yingrui Wu, Youkang Kong, Mingyang Zhao, Weize Quan, Dong-Ming Yan, Yang Liu
First: 2026-04-30T03:18:26+00:00 · Latest: 2026-04-30T03:18:26+00:00
Comments: SIGGRAPH 2026 (Journal Track), Code: https://github.com/YingruiWoo/CasLayout
Abstract
Synthesizing realistic 3D indoor scenes remains challenging due to data scarcity and the difficulty of simultaneously enforcing global architectural constraints and local semantic consistency. Existing approaches often overlook structural boundaries or rely on fully connected relation graphs that introduce redundant generation errors. Inspired by human design cognition, we present CasLayout, a cascaded diffusion framework that decomposes the joint scene generation task into four conditional sub-stages with explicit physical and semantic roles: (1) predicting furniture quantity and categories, (2) refining object sizes and feature embeddings, (3) modeling spatial relationships in a latent space, and (4) generating Oriented Bounding Boxes (OBBs). This decoupled architecture reduces data requirements and enables flexible integration of Large Language Models (LLMs) and Vision Language Models (VLMs) for zero-shot tasks such as image-to-scene generation. To maintain physical validity within complex floor plans, we explicitly model building elements (e.g., walls, doors, and windows) as conditional constraints. Furthermore, to address the high entropy of dense relation graphs, we introduce a sparse relation graph formulation aligned with human spatial descriptions. By encoding these sparse graphs into a compact latent space using a bidirectional Variational Autoencoder (VAE), the proposed framework provides enhanced relational controllability, allowing generated layouts to better respect functional organization. Experiments demonstrate that CasLayout achieves state-of-the-art performance in fidelity and diversity while enabling improved controllability in practical applications.
Summary / 总结
CasLayout is a cascaded diffusion framework designed to synthesize realistic 3D indoor scenes by decomposing the task into four stages: predicting furniture, refining object details, modeling spatial relationships, and generating OBBs. This approach reduces data requirements and enhances controllability, especially for complex floor plans. Experiments show that CasLayout outperforms existing methods in terms of fidelity, diversity, and practical controllability.
CasLayout 是一种分阶段的扩散框架,通过预测家具、细化对象、建模空间关系和生成OBBs来合成逼真的3D室内场景。这种方法减少了数据需求,并允许集成LLMs和VLMs。实验表明,CasLayout在保真度、多样性和可控性方面优于现有方法。
Iterative Definition Refinement for Zero-Shot Classification via LLM-Based Semantic Prototype Optimization
Authors: Naeem Rehmat, Muhammad Saad Saeed, Ijaz Ul Haq, Khalid Malik
Venue: CVPR
First: 2026-04-30T02:25:33+00:00 · Latest: 2026-04-30T02:25:33+00:00
Comments: Accepted at CVPR NeXD Workshop (2026)
Abstract
Web filtering systems rely on accurate web content classification to block cyber threats, prevent data exfiltration, and ensure compliance. However, classification is increasingly difficult due to the dynamic and rapidly evolving nature of the modern web. Embedding-based zero-shot approaches map content and category descriptions into a shared semantic space, enabling label assignment without labeled training data, but remain highly sensitive to definition quality. Poorly specified or ambiguous definitions create semantic overlap in the embedding space, leading to systematic misclassification.
In this paper, we propose a training-free, adaptive iterative definition refinement framework that improves zero-shot web content classification by progressively optimizing category definitions rather than updating model parameters. Using LLMs as feedback-driven definition optimizers, we investigate three refinement strategies, namely example-guided, confusion-aware, and history-aware, each refining class descriptions using structured signals from misclassified instances. Furthermore, we introduce a human-labeled benchmark of 10 URL categories with 1,000 samples per class and evaluate across 13 state-of-the-art embedding foundation models. Results demonstrate that iterative definition refinement consistently improves classification performance across diverse architectures, establishing definition quality as a critical and underexplored factor in embedding-based systems. The dataset is available at https://github.com/naeemrehmat/B2MWT-10C.
Summary / 总结
This paper addresses the challenge of accurate web content classification in dynamic web environments by proposing an iterative definition refinement framework. It uses LLMs to optimize category definitions without retraining models, focusing on three strategies: example-guided, confusion-aware, and history-aware. The framework improves zero-shot classification performance across various embedding models, highlighting the importance of definition quality in such systems. A benchmark dataset of 10 URL categories is provided for evaluation.
论文提出了一种迭代定义精炼框架,用于解决动态网络环境中准确的网页内容分类问题。该框架利用LLM通过基于示例、混淆意识和历史意识的策略优化类别定义,而不重新训练模型。实验结果表明,该方法在各种嵌入模型中都能一致地提高分类性能,突显了定义质量在基于嵌入的系统中的重要性。
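The embedding-based zero-shot setup described in the abstract (content and category definitions mapped into a shared space, with the label taken from the nearest definition) can be sketched as follows. This is a minimal illustration, not the paper's pipeline. The `embed` function is a hypothetical bag-of-words stand-in for a real embedding model, and the category definitions are invented for the example.

```python
import math
from collections import Counter

def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def classify(content, definitions):
    """Assign the label whose definition is most similar to the content."""
    doc = embed(content)
    return max(definitions, key=lambda label: cosine(doc, embed(definitions[label])))

# Invented category definitions; refining these strings is what changes the outcome.
definitions = {
    "news": "articles reporting current events and headlines",
    "shopping": "online stores selling products with prices and carts",
}
page = "breaking headlines and reporting on current events"
predicted = classify(page, definitions)
print(predicted)
```

Because no model parameters are involved, the only way to change a prediction here is to edit the definition text, which is the lever the refinement framework adjusts.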