arXiv Paper Digest

Snapshot: 20260504_0410

Omni-Attribute: Open-vocabulary Attribute Encoder for Visual Concept Personalization

Authors: Tsai-Shien Chen, Aliaksandr Siarohin, Gordon Guocheng Qian, Kuan-Chieh Jackson Wang, Egor Nemchinov, Moayed Haji-Ali, Riza Alp Guler, Willi Menapace, Ivan Skorokhodov, Anil Kag, Jun-Yan Zhu, Sergey Tulyakov

Venue: CVPR 2026

First: 2025-12-11T18:59:56+00:00 · Latest: 2026-04-30T17:59:43+00:00

Comments: CVPR 2026. Project page: https://snap-research.github.io/omni-attribute

Abs · PDF · Code1 · Code2 · Project1

Abstract

Visual concept personalization aims to transfer only specific image attributes, such as identity, expression, lighting, and style, into unseen contexts. However, existing methods rely on holistic embeddings from general-purpose image encoders, which entangle multiple visual factors and make it difficult to isolate a single attribute. This often leads to information leakage and incoherent synthesis. To address this limitation, we introduce Omni-Attribute, the first open-vocabulary image attribute encoder designed to learn high-fidelity, attribute-specific representations. Our approach jointly designs the data and model: (i) we curate semantically linked image pairs annotated with positive and negative attributes to explicitly teach the encoder what to preserve or suppress; and (ii) we adopt a dual-objective training paradigm that balances generative fidelity with contrastive disentanglement. The resulting embeddings prove effective for open-vocabulary attribute retrieval, personalization, and compositional generation, achieving state-of-the-art performance across multiple benchmarks.

Summary / 总结

The research aims to develop a method for transferring specific image attributes like identity and expression into new contexts without entangling other visual factors. The Omni-Attribute encoder is introduced, which uses semantically linked image pairs and a dual-objective training approach to learn attribute-specific representations. The method outperforms existing techniques in open-vocabulary attribute retrieval, personalization, and compositional generation, achieving state-of-the-art performance on multiple benchmarks.

研究旨在开发一种方法，以在不纠缠其他视觉因素的情况下，将特定的图像属性如身份和表情转移到新的上下文中。引入了Omni-Attribute编码器，该编码器使用语义关联的图像对和双重目标训练方法来学习属性特定的表示。该方法在开放词汇量的属性检索、个性化和组合生成方面优于现有技术，在多个基准测试中达到了最先进的性能。

Defending Quantum Classifiers against Adversarial Perturbations through Quantum Autoencoders

Authors: Emma Andrews, Sahan Sanjaya, Prabhat Mishra

First: 2026-04-30T17:56:40+00:00 · Latest: 2026-04-30T17:56:40+00:00

Abs · PDF · Code1 · Code2

Abstract

Machine learning models can learn from data samples to carry out various tasks efficiently. Adversarially manipulated data samples, such as those with carefully crafted noise inserted, can cause the model to make mistakes. Quantum machine learning models are also vulnerable to such adversarial attacks, especially in image classification using variational quantum classifiers. While there are promising defenses against these adversarial perturbations, such as training with adversarial samples, they face practical limitations. For example, they are not applicable in scenarios where training with adversarial samples is either not possible or can overfit the models on one type of attack. In this paper, we propose an adversarial training-free defense framework that utilizes a quantum autoencoder to purify the adversarial samples through reconstruction. Moreover, our defense framework provides a confidence metric to identify potentially adversarial samples that cannot be purified by the quantum autoencoder. Extensive evaluation demonstrates that our defense framework can significantly outperform the state of the art in prediction accuracy (up to 68%) under adversarial attacks.

中文标题/摘要

标题：通过量子自编码器防御量子分类器对抗性扰动

机器学习模型可以从数据样本中学习以高效地执行各种任务。当数据样本被恶意操纵，例如插入精心设计的噪声时，会导致模型出错。量子机器学习模型也容易受到这种对抗性攻击的影响，尤其是在使用变分量子分类器进行图像分类时。虽然有一些有前景的防御对抗性扰动的方法，例如使用对抗样本进行训练，但它们存在实际限制。例如，在无法使用对抗样本进行训练或对抗样本会导致模型过度拟合特定攻击类型的情况下，这些方法不适用。在本文中，我们提出了一种无需对抗训练的防御框架，利用量子自编码器通过重构来净化对抗样本。此外，我们的防御框架提供了一个置信度指标，以识别无法被量子自编码器净化的潜在对抗样本。广泛的评估表明，在对抗攻击下，我们的防御框架在预测准确性方面可以显著优于现有最佳方法（最高可达68%）

Summary / 总结

This paper addresses the vulnerability of quantum classifiers to adversarial attacks by proposing a defense framework that uses a quantum autoencoder to purify adversarial samples through reconstruction. The method does not rely on adversarial training and includes a confidence metric to identify unpurifiable samples. Experimental results show that this approach significantly improves prediction accuracy under adversarial attacks, up to 68% better than state-of-the-art methods.

论文提出了一种无需对抗训练的防御框架，利用量子自编码器对受到对抗扰动的样本进行净化，同时提供一个置信度指标来识别无法净化的样本。实验结果表明，该方法在对抗攻击下的预测准确率显著提高，最高可提升68%，优于现有最佳方法。

PhyCo: Learning Controllable Physical Priors for Generative Motion

Authors: Sriram Narayanan, Ziyu Jiang, Srinivasa Narasimhan, Manmohan Chandraker

Venue: CVPR 2026

First: 2026-04-30T17:53:03+00:00 · Latest: 2026-04-30T17:53:03+00:00

Comments: CVPR 2026. Project Page: https://phyco-video.github.io/

Abs · PDF · Code1 · Code2 · Project1

Abstract

Modern video diffusion models excel at appearance synthesis but still struggle with physical consistency: objects drift, collisions lack realistic rebound, and material responses seldom match their underlying properties. We present PhyCo, a framework that introduces continuous, interpretable, and physically grounded control into video generation. Our approach integrates three key components: (i) a large-scale dataset of over 100K photorealistic simulation videos where friction, restitution, deformation, and force are systematically varied across diverse scenarios; (ii) physics-supervised fine-tuning of a pretrained diffusion model using a ControlNet conditioned on pixel-aligned physical property maps; and (iii) VLM-guided reward optimization, where a fine-tuned vision-language model evaluates generated videos with targeted physics queries and provides differentiable feedback. This combination enables a generative model to produce physically consistent and controllable outputs through variations in physical attributes, without any simulator or geometry reconstruction at inference. On the Physics-IQ benchmark, PhyCo significantly improves physical realism over strong baselines, and human studies confirm clearer and more faithful control over physical attributes. Our results demonstrate a scalable path toward physically consistent, controllable generative video models that generalize beyond synthetic training environments.

中文标题/摘要

标题：PhyCo：学习可控的物理先验以生成运动

现代视频扩散模型在外观合成方面表现出色，但在物理一致性方面仍然存在问题：物体漂移，碰撞缺乏真实的反弹，材料响应很少符合其内在属性。我们提出了PhyCo框架，该框架将连续、可解释和物理基础的控制引入视频生成。我们的方法结合了三个关键组件：(i) 包含超过10万段逼真模拟视频的大规模数据集，其中摩擦、弹性、变形和力在多种场景中系统地变化；(ii) 使用与像素对齐的物理属性图条件下的ControlNet对预训练的扩散模型进行物理监督微调；以及(iii) VLM指导的奖励优化，其中微调的视觉语言模型使用针对物理查询的生成视频进行评估，并提供可微反馈。这种组合使生成模型能够在物理属性变化的情况下生成物理一致且可控的输出，而无需在推理时使用任何模拟器或几何重建。在Physics-IQ基准测试中，PhyCo在物理现实性方面显著优于强大的基线，而人类研究也证实了对物理属性的更清晰和更忠实的控制。我们的结果表明了一条通向可扩展的、物理一致且可控的生成视频模型的道路，这些模型可以超越合成训练环境。

Summary / 总结

PhyCo is a framework that brings continuous, interpretable, and physically grounded control to video generation. It combines a dataset of over 100K photorealistic simulation videos, physics-supervised fine-tuning of a pretrained diffusion model through a ControlNet conditioned on pixel-aligned physical property maps, and VLM-guided reward optimization, in which a fine-tuned vision-language model evaluates generated videos with targeted physics queries. On the Physics-IQ benchmark, the method improves physical realism over strong baselines, and human studies confirm clearer and more faithful control over physical attributes.

FreeOcc: Training-Free Embodied Open-Vocabulary Occupancy Prediction

Authors: Zeyu Jiang, Changqing Zhou, Xingxing Zuo, Changhao Chen

Venue: RSS 2026

First: 2026-04-30T17:05:56+00:00 · Latest: 2026-04-30T17:05:56+00:00

Comments: RSS 2026

Abs · PDF · Code1 · Code2 · Project1

Abstract

Existing learning-based occupancy prediction methods rely on large-scale 3D annotations and generalize poorly across environments. We present FreeOcc, a training-free framework for open-vocabulary occupancy prediction from monocular or RGB-D sequences. Unlike prior approaches that require voxel-level supervision and ground-truth camera poses, FreeOcc operates without 3D annotations, pose ground truth, or any learning stage. FreeOcc incrementally builds a globally consistent occupancy map via a four-layer pipeline: a SLAM backbone estimates poses and sparse geometry; a geometrically consistent Gaussian update constructs dense 3D Gaussian maps; open-vocabulary semantics from off-the-shelf vision-language models are associated with Gaussian primitives; and a probabilistic Gaussian-to-occupancy projection produces dense voxel occupancy. Despite being entirely training-free and pose-agnostic, FreeOcc achieves over 2× improvements in IoU and mIoU on EmbodiedOcc-ScanNet compared to prior self-supervised methods. We further introduce ReplicaOcc, a benchmark for indoor open-vocabulary occupancy prediction, and show that FreeOcc transfers zero-shot to novel environments, substantially outperforming both supervised and self-supervised baselines. Project page: https://the-masses.github.io/freeocc-web/.
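
The last pipeline stage, a probabilistic Gaussian-to-occupancy projection, can be illustrated with a minimal sketch. The abstract does not give the exact formula, so the code below assumes isotropic Gaussians and a "noisy-OR" combination, in which a voxel is free only if every Gaussian leaves it free. The names `occupancy_probability` and `voxelize`, the `(mean, sigma, opacity)` tuple layout, and the 0.5 threshold are all hypothetical.

```python
import math


def occupancy_probability(point, gaussians):
    """Probability that `point` is occupied, given isotropic Gaussians.

    Each Gaussian is a (mean, sigma, opacity) tuple. This noisy-OR form is an
    assumption made for illustration, not the paper's stated formula.
    """
    free = 1.0
    for mean, sigma, opacity in gaussians:
        d2 = sum((p - m) ** 2 for p, m in zip(point, mean))
        # Contribution of this Gaussian decays with squared distance.
        free *= 1.0 - opacity * math.exp(-d2 / (2.0 * sigma ** 2))
    return 1.0 - free


def voxelize(gaussians, grid_shape, voxel_size, threshold=0.5):
    """Return the set of (i, j, k) voxels whose center passes the threshold."""
    occupied = set()
    nx, ny, nz = grid_shape
    for i in range(nx):
        for j in range(ny):
            for k in range(nz):
                center = (
                    (i + 0.5) * voxel_size,
                    (j + 0.5) * voxel_size,
                    (k + 0.5) * voxel_size,
                )
                if occupancy_probability(center, gaussians) >= threshold:
                    occupied.add((i, j, k))
    return occupied
```

A real system would restrict each Gaussian to nearby voxels instead of scanning the full grid; the dense loop here only keeps the sketch short.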

中文标题/摘要

标题：FreeOcc：无需训练的开放词汇占用预测

现有的基于学习的占用预测方法依赖于大规模的3D注释，并且在不同环境中泛化能力较差。我们提出了FreeOcc，这是一种无需训练的框架，可以从单目或RGB-D序列中进行开放词汇的占用预测。与之前需要体素级监督和地面真实相机姿态的方法不同，FreeOcc无需3D注释、姿态地面真实或任何学习阶段。FreeOcc通过四层流水线逐步构建全局一致的占用图：SLAM主干估计姿态和稀疏几何；几何一致的高斯更新构建密集的3D高斯图；来自现成的视觉语言模型的开放词汇语义与高斯原语关联；概率高斯到占用的投影生成密集体素占用。尽管完全无需训练且姿态无关，FreeOcc在EmbodiedOcc-ScanNet上的IoU和mIoU相比之前的自监督方法提高了超过2倍。我们还引入了ReplicaOcc，一个用于室内开放词汇占用预测的基准，并展示了FreeOcc可以零样本地转移到新的环境中，显著优于监督和自监督基线。项目页面：https://the-masses.github.io/freeocc-web/

Summary / 总结

FreeOcc is a training-free framework for open-vocabulary occupancy prediction that uses monocular or RGB-D sequences. It avoids the need for 3D annotations and ground-truth camera poses, instead relying on a SLAM backbone to estimate poses and sparse geometry, followed by a geometrically consistent Gaussian update to construct dense 3D Gaussian maps. FreeOcc then associates open-vocabulary semantics from vision-language models with these Gaussian primitives and projects them into dense voxel occupancy. Despite being entirely training-free and pose-agnostic, FreeOcc significantly improves IoU and mIoU on EmbodiedOcc-ScanNet compared to previous self-supervised methods and outperforms both supervised and self-supervised baselines on the new ReplicaOcc benchmark for indoor environments.

FreeOcc 是一个无需训练的框架，用于从单目或 RGB-D 序列进行开放词汇的占用预测。它通过四层管道构建全局一致的占用图：SLAM 用于姿态和稀疏几何估计，几何一致的高斯更新用于生成密集的 3D 高斯图，使用现成的视觉语言模型关联开放词汇语义与高斯原语，以及概率投影生成密集体素占用。尽管缺乏 3D 注释和学习阶段，FreeOcc 在 EmbodiedOcc-ScanNet 上显著提高了 IoU 和 mIoU，优于之前的自监督方法。它还在新的 ReplicaOcc 室内环境基准测试中超越了监督和自监督基线。

Auto-FlexSwitch: Efficient Dynamic Model Merging via Learnable Task Vector Compression

Authors: Junqi Gao, Dazhi Zhang, Zhichang Guo, Biqing Qi, Yi Ran, Wangmeng Zuo

First: 2026-04-30T16:58:05+00:00 · Latest: 2026-04-30T16:58:05+00:00

Abs · PDF · Code1 · Code2

Abstract

Model merging has attracted attention as an effective path toward multi-task adaptation by integrating knowledge from multiple task-specific models. Among existing approaches, dynamic merging mitigates performance degradation caused by conflicting parameter updates across tasks by flexibly combining task-specific parameters at inference time, thereby maintaining high performance. However, these methods require storing independent parameters for each task, resulting in prohibitive storage overhead. To address this issue, we first experimentally demonstrate that the fine-tuned weight increments (referred to as task vectors) exhibit an impulse-like activation pattern and high robustness to low-bit representations. Driven by this insight, we propose T-Switch, which decomposes task vectors into three compact components: a binary sparse mask, a sign vector, and a scalar scaling factor, achieving high-fidelity approximation at high compression ratios. Building on this, we develop Auto-Switch, a training-free merging scheme that automatically assembles task vectors through feature similarity retrieval. Furthermore, to transform task vector sparsification and quantization from static rules to adaptive learning, we propose FlexSwitch, a learnable framework that jointly optimizes the compression strategy for each model unit via Learnable Gating Sparsification (LGS) and Bit-width Adaptive Selection (BAS), while employing the Sparsity-Aware Storage Strategy (SASS) to select the optimal storage encoding structure. Finally, by incorporating a K-Nearest Neighbor (KNN) inference scheme with a learnable low-rank metric, we present Auto-FlexSwitch, a dynamic model merging approach that supports highly efficient task vector compression.
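
The three-part decomposition attributed to T-Switch (binary sparse mask, sign vector, scalar scaling factor) can be sketched as follows. The abstract does not say how the mask or the scale is chosen, so this sketch assumes a top-k magnitude rule for the mask and the mean kept magnitude for the scale. The names `compress_task_vector`, `decompress_task_vector`, and `keep_ratio` are hypothetical.

```python
def compress_task_vector(task_vector, keep_ratio=0.25):
    """Approximate a task vector as scale * mask * sign.

    Assumptions for illustration: the mask keeps the largest-magnitude
    entries, and the scale is the mean magnitude of the kept entries.
    """
    n = len(task_vector)
    k = max(1, int(n * keep_ratio))
    # Indices sorted by magnitude, largest first.
    order = sorted(range(n), key=lambda i: abs(task_vector[i]), reverse=True)
    kept = set(order[:k])
    mask = [1 if i in kept else 0 for i in range(n)]    # 1 bit per weight
    sign = [1 if v >= 0 else -1 for v in task_vector]   # 1 bit per weight
    scale = sum(abs(task_vector[i]) for i in kept) / k  # one scalar
    return mask, sign, scale


def decompress_task_vector(mask, sign, scale):
    """Rebuild the approximate task vector from its three components."""
    return [scale * m * s for m, s in zip(mask, sign)]
```

Under these assumptions, storage per task drops from one float per weight to roughly two bits per weight plus one scalar, which is the kind of saving the abstract refers to as a high compression ratio.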

中文标题/摘要

标题：Auto-FlexSwitch：通过可学习的任务向量压缩实现高效动态模型合并

模型合并已成为一种有效途径，通过整合多个任务特定模型的知识来实现多任务适应。现有方法中，动态合并通过在推理时灵活组合任务特定参数来减轻因任务间参数更新冲突导致的性能下降，从而保持高性能。然而，这些方法需要为每个任务存储独立的参数，导致存储开销巨大。为解决这一问题，我们首先实验证明，微调后的权重增量（称为任务向量）表现出类似脉冲的激活模式，并且对低比特表示具有高度鲁棒性。基于这一洞察，我们提出了T-Switch，它将任务向量分解为三个紧凑的组件：二进制稀疏掩码、符号向量和标量缩放因子，从而在高压缩比下实现高保真度近似。在此基础上，我们开发了无需训练的合并方案Auto-Switch，该方案通过特征相似性检索自动组装任务向量。为了将任务向量稀疏化和量化从静态规则转变为适应性学习，我们提出了FlexSwitch，这是一种可学习框架，通过Learnable Gating Sparsification (LGS) 和 Bit-width Adaptive Selection (BAS) 共同优化每个模型单元的压缩策略，同时采用Sparsity-Aware Storage Strategy (SASS) 选择最优的存储编码结构。最后，通过结合可学习的低秩度量和K-最近邻（KNN）推理方案，我们提出了支持高效任务向量压缩的Auto-FlexSwitch动态模型合并方法。

Summary / 总结

The paper addresses the challenge of storage overhead in dynamic model merging by proposing Auto-FlexSwitch, which uses learnable task vector compression techniques. It introduces T-Switch, which decomposes task vectors into a binary sparse mask, a sign vector, and a scalar scaling factor, and Auto-Switch, which automatically assembles task vectors through feature similarity retrieval. FlexSwitch further enhances this by optimizing compression strategies via learnable gating sparsification and bit-width adaptive selection, while Auto-FlexSwitch incorporates a KNN inference scheme for efficient task vector compression.

该论文通过提出Auto-FlexSwitch，解决了动态模型合并中的存储开销问题，该方法使用了可学习的任务向量压缩技术。它引入了T-Switch，将任务向量分解为二进制稀疏掩码、符号向量和标量缩放因子，并引入了Auto-Switch，通过特征相似性检索自动组装任务向量。FlexSwitch进一步通过可学习的门控稀疏化和位宽自适应选择优化压缩策略，而Auto-FlexSwitch则结合了KNN推理方案和可学习的低秩度量，以实现高效的任务向量压缩。

Understanding and Improving Length Generalization in Hierarchical Sparse Attention Models

Authors: Jiaqi Leng, Xiang Hu, Junxiong Wang, Jianguo Li, Wei Wu, Yucheng Lu

Venue: ICLR 2026

First: 2025-10-20T06:17:57+00:00 · Latest: 2026-04-30T15:40:24+00:00

Comments: ICLR 2026 camera-ready version

Abs · PDF · Code1 · Code2

Abstract

Effectively processing long contexts is a critical challenge for language models. While standard Transformers are limited by quadratic complexity and poor length extrapolation, alternative architectures like sliding window attention and state space models sacrifice the ability to effectively utilize the full context due to their fixed-size memory. Chunk-based sparse attention has emerged as a promising paradigm for extreme length generalization, yet the key architectural principles underpinning its success are not yet fully understood. In this work, we present a systematic dissection of these models to identify the core components driving their performance. Through a unified framework and comprehensive ablation studies, we demonstrate that a combination of three design principles is critical: (1) an expressive, non-linear Chunk Encoder with a dedicated CLS token to produce representations for retrieval; (2) a Bypassing Residual Path to stably integrate retrieved global information without it being overridden by the local residual stream; and (3) enforced selection sparsity during pre-training to bridge the train-test distribution gap. We provide a theoretical motivation for intra-chunk information processing and landmark generation. By combining these principles, we establish a new state-of-the-art for training-free length extrapolation, successfully generalizing models trained on a 4K context to 32 million tokens on RULER and BABILong. Our findings provide a clear and empirically-grounded set of design principles for developing future, highly-capable long-context language models.
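
The retrieval step in chunk-based sparse attention can be illustrated with a minimal sketch: each chunk is summarized by one landmark vector, and a query attends only to its top-scoring chunks. The sketch assumes mean-pooled landmarks in place of the learned, CLS-based Chunk Encoder described in the abstract, and plain dot-product scoring. The names `chunk_landmarks` and `select_chunks` are hypothetical.

```python
def chunk_landmarks(keys, chunk_size):
    """Summarize each chunk of key vectors as one landmark vector.

    Mean pooling is a stand-in here; the abstract describes a learned,
    non-linear Chunk Encoder with a dedicated CLS token instead.
    """
    landmarks = []
    for start in range(0, len(keys), chunk_size):
        chunk = keys[start:start + chunk_size]
        dim = len(chunk[0])
        landmarks.append(
            [sum(vec[d] for vec in chunk) / len(chunk) for d in range(dim)]
        )
    return landmarks


def select_chunks(query, landmarks, top_k):
    """Return the indices of the top_k chunks by query-landmark dot product."""
    scores = [sum(q * l for q, l in zip(query, lm)) for lm in landmarks]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Sorted output keeps the selected chunks in their original order.
    return sorted(ranked[:top_k])
```

Keeping `top_k` fixed during pre-training, as well as at test time, corresponds to the enforced selection sparsity named as the third design principle: the model never sees more retrieved chunks in training than it will at inference.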

Summary / 总结

This work addresses the challenge of processing long contexts in language models by dissecting chunk-based sparse attention models. It identifies three key design principles: an expressive, non-linear Chunk Encoder with a dedicated CLS token, a Bypassing Residual Path, and enforced selection sparsity during pre-training. Combined, these principles establish a new state of the art for training-free length extrapolation, generalizing models trained on a 4K context to 32 million tokens on RULER and BABILong. The study also provides a theoretical motivation for intra-chunk information processing and landmark generation, offering clear design guidelines for future models.

该研究通过剖析基于块的稀疏注意力模型来解决语言模型处理长上下文的挑战，确定了三个关键设计原则：具有CLS标记的表达性块编码器、旁路残差路径以及预训练期间的强制选择稀疏性。这些原则使模型能够有效泛化到更长的上下文，打破了RULER和BABILong基准的最新记录，且无需训练。研究提供了块内信息处理和地标生成的理论基础，为未来模型的设计提供了明确的指导方针。

Benchmarking Large Vision-Language Models on Fine-Grained Image Tasks: A Comprehensive Evaluation

Authors: Hong-Tao Yu, Yuxin Peng, Serge Belongie, Xiu-Shen Wei

Venue: ICLR 2026

First: 2025-04-21T09:30:41+00:00 · Latest: 2026-04-30T15:12:07+00:00

Comments: Accepted to ICLR 2026

Abs · PDF · Code1 · Code2 · Code3

Abstract

Recent advancements in Large Vision-Language Models (LVLMs) have demonstrated remarkable multimodal perception capabilities, garnering significant attention. While numerous evaluation studies have emerged, assessing LVLMs both holistically and on specialized tasks, fine-grained image tasks, which are fundamental to computer vision, remain largely unexplored. To fill this gap, we introduce a comprehensive fine-grained evaluation benchmark, i.e., FG-BMK, comprising 1.01 million questions and 0.33 million images. Our evaluation systematically examines LVLMs from both human-oriented and machine-oriented perspectives, focusing on their semantic recognition and fine-grained feature representation capabilities. Through extensive experiments on twelve representative LVLMs/VLMs, we uncover key findings regarding the influence of training paradigms, modality alignment, perturbation susceptibility, and fine-grained category reasoning on task performance. This work provides critical insights into the limitations of current LVLMs and offers guidance for future data construction and model design in the development of more advanced LVLMs. Our code is open-source and available at https://github.com/SEU-VIPGroup/FG-BMK.

Summary / 总结

This study introduces FG-BMK, a comprehensive benchmark for evaluating large vision-language models (LVLMs) on fine-grained image tasks, addressing the gap in specialized task assessments. Through experiments on twelve LVLMs, the research identifies key factors such as training paradigms and modality alignment that significantly influence model performance, providing insights for future model development and data construction.

该研究旨在评估大型视觉-语言模型（LVLMs）在细粒度图像任务上的表现，这是一个之前未被充分探索的领域。研究人员开发了FG-BMK，这是一个包含101万个问题和33万张图像的综合基准，从人类和机器两个角度评估LVLMs。通过对十二个LVLMs的实验，他们发现训练范式、模态对齐、扰动敏感性和细粒度类别推理对任务性能有显著影响，揭示了当前LVLMs的局限性，并为未来数据建设和模型设计提供了指导。

TransVLM: A Vision-Language Framework and Benchmark for Detecting Any Shot Transitions

Authors: Ce Chen, Yi Ren, Yuanming Li, Viktor Goriachko, Zhenhui Ye, Zujin Guo, Zhibin Hong, Mingming Gong

Venue: www

First: 2026-04-30T15:05:06+00:00 · Latest: 2026-04-30T15:05:06+00:00

Comments: This work has been deployed to production. For more related research, please visit HeyGen Research (https://www.heygen.com/research) and HeyGen Avatar-V (https://www.heygen.com/research/avatar-v-model). Project page: https://chence17.github.io/TransVLM/

Abs · PDF · Code1 · Code2 · Project1 · Project2

Abstract

Traditional Shot Boundary Detection (SBD) inherently struggles with complex transitions by formulating the task around isolated cut points, frequently yielding corrupted video shots. We address this fundamental limitation by formalizing the Shot Transition Detection (STD) task. Rather than searching for ambiguous points, STD explicitly detects the continuous temporal segments of transitions. To tackle this, we propose TransVLM, a Vision-Language Model (VLM) framework for STD. Unlike regular VLMs that predominantly rely on spatial semantics and struggle with fine-grained inter-shot dynamics, our method explicitly injects optical flow as a critical motion prior at the input stage. Through a simple yet effective feature-fusion strategy, TransVLM directly processes concatenated color and motion representations, significantly enhancing its temporal awareness without incurring any additional visual token overhead on the language backbone. To overcome the severe class imbalance in public data, we design a scalable data engine to synthesize diverse transition videos for robust training, alongside a comprehensive benchmark for STD. Extensive experiments demonstrate that TransVLM achieves superior overall performance, outperforming traditional heuristic methods, specialized spatiotemporal networks, and top-tier VLMs.
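
Why concatenating color and motion need not add visual tokens can be shown with a small sketch. It assumes the fusion happens per pixel along the channel axis before patchification, which the abstract does not confirm; the names `fuse_color_and_flow` and `patchify`, and the 2-channel flow layout, are hypothetical.

```python
def fuse_color_and_flow(rgb, flow):
    """Concatenate per-pixel RGB (3 values) and optical flow (2 values).

    Both inputs are H x W grids of per-pixel lists. Channel-wise fusion at
    the pixel level is an assumption made for illustration.
    """
    return [
        [rgb_px + flow_px for rgb_px, flow_px in zip(rgb_row, flow_row)]
        for rgb_row, flow_row in zip(rgb, flow)
    ]


def patchify(frame, patch):
    """Split an H x W x C frame into flattened, non-overlapping patch tokens."""
    h, w = len(frame), len(frame[0])
    tokens = []
    for top in range(0, h - patch + 1, patch):
        for left in range(0, w - patch + 1, patch):
            token = []
            for y in range(top, top + patch):
                for x in range(left, left + patch):
                    token.extend(frame[y][x])
            tokens.append(token)
    return tokens
```

In this sketch the number of tokens depends only on the frame size and the patch size. Fusing flow into the channels makes each token wider (5 values per pixel instead of 3) but leaves the token count, and therefore the sequence length seen by the language backbone, unchanged.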

中文标题/摘要

标题：TransVLM：一种检测任意镜头过渡的视觉-语言框架和基准

传统的镜头边界检测（SBD）固有地难以处理复杂的过渡，因为它围绕孤立的剪辑点来定义任务，经常导致视频剪辑被破坏。我们通过正式化镜头过渡检测（STD）任务来解决这一根本限制。不同于常规地在剪辑点上寻找模糊的点，STD 明确地检测过渡的连续时间段。为了解决这个问题，我们提出了 TransVLM，一种用于 STD 的视觉-语言模型（VLM）框架。与主要依赖空间语义且难以处理细粒度的跨剪辑动态的常规 VLM 不同，我们的方法在输入阶段显式地注入了光学流作为关键的运动先验。通过一种简单而有效的特征融合策略，TransVLM 直接处理了颜色和运动的拼接表示，显著增强了其时间意识，而不会在语言骨干上增加任何额外的视觉标记开销。为了克服公共数据中严重的类别不平衡，我们设计了一个可扩展的数据引擎来合成多样化的过渡视频以进行稳健训练，并且还提供了一个全面的 STD 基准。广泛的实验表明，TransVLM 在整体性能上表现出色，优于传统的启发式方法、专门的空间-时间网络以及顶级的 VLM。此工作已部署到生产环境中。如需更多相关研究，请访问 HeyGen 研究（https://www.heygen.com/research）和 HeyGen Avatar-V（https://www.heygen.com/research/avatar-v-model）。项目页面：https://chence17.github.io/TransVLM/

FineState-Bench: Benchmarking State-Conditioned Grounding for Fine-grained GUI State Setting

Authors: Fengxian Ji, Jingpu Yang, Zirui Song, Yuanxi Wang, Zhexuan Cui, Yuke Li, Qian Jiang, Xiuying Chen

First: 2026-04-30T15:03:56+00:00 · Latest: 2026-04-30T15:03:56+00:00

Abs · PDF · Code1 · Code2 · Code3

Abstract

Despite the rapid progress of large vision-language models (LVLMs), fine-grained, state-conditioned GUI interaction remains challenging. Current evaluations offer limited coverage, imprecise target-state definitions, and an overreliance on final-task success, obscuring where and why agents fail. To address this gap, we introduce FineState-Bench, a benchmark that evaluates whether an agent can correctly ground an instruction to the intended UI control and reach the exact target state. FineState-Bench comprises 2,209 instances across desktop, web, and mobile platforms, spanning four interaction families and 23 UI component types, with each instance explicitly specifying an exact target state for fine-grained state setting. We further propose FineState-Metrics, a four-stage diagnostic pipeline with stage-wise success rates: Localization Success Rate (SR@Loc), Interaction Success Rate (SR@Int), Exact State Success Rate at Locate (ES-SR@Loc), and Exact State Success Rate at Interact (ES-SR@Int), and a plug-and-play Visual Diagnostic Assistant (VDA) that generates a Description and a bounding-box Localization Hint to diagnose visual-grounding causes of failure via controlled with vs. without comparisons. On FineState-Bench, exact goal-state success remains low: ES-SR@Int peaks at 32.8% on Web and 22.8% on average across platforms. With VDA localization hints, Gemini-2.5-Flash gains +14.9 ES-SR@Int points, suggesting substantial headroom from improved visual grounding, yet overall accuracy is still insufficient for reliable fine-grained state-conditioned interaction. GitHub: https://github.com/FengxianJi/FineState-Bench
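
Stage-wise success rates of this kind can be computed with a few lines of code. The sketch below assumes each evaluated instance is a record with one boolean per stage and that each rate is the percentage of instances passing that stage; the benchmark's actual record format and pass criteria are not given in the abstract, and the names `stage_success_rates` and `gain` are hypothetical.

```python
STAGES = ("SR@Loc", "SR@Int", "ES-SR@Loc", "ES-SR@Int")


def stage_success_rates(records):
    """Percentage of instances that pass each stage.

    `records` is a list of dicts mapping each stage name to True/False.
    This record layout is an assumption made for illustration.
    """
    total = len(records)
    if total == 0:
        return {stage: 0.0 for stage in STAGES}
    return {
        stage: 100.0 * sum(1 for r in records if r[stage]) / total
        for stage in STAGES
    }


def gain(with_hint, without_hint, stage="ES-SR@Int"):
    """Point difference in one stage between two runs on the same instances."""
    return (
        stage_success_rates(with_hint)[stage]
        - stage_success_rates(without_hint)[stage]
    )
```

A with vs. without comparison such as the reported +14.9 ES-SR@Int points would correspond to `gain` applied to two runs of the same model, one with VDA localization hints and one without.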

中文标题/摘要

标题：FineState-Bench：细粒度GUI状态设置中基于状态条件的对接基准测试

尽管大型视觉-语言模型（LVLMs）取得了快速进展，但细粒度的基于状态的GUI交互仍然具有挑战性。当前的评估覆盖范围有限，目标状态定义不够精确，并且过度依赖最终任务的成功，这掩盖了代理失败的具体位置和原因。为了解决这一差距，我们引入了**FineState-Bench**，一个评估代理是否能够正确地将指令对接到目标UI控件并达到精确目标状态的基准测试。FineState-Bench 包含了跨桌面、网络和移动平台的2,209个实例，涵盖了四种交互家族和23种UI组件类型，每个实例都明确指定了细粒度状态设置的精确目标状态。我们还提出了**FineState-Metrics**，一个四阶段诊断流水线，每个阶段的成功率分别为定位成功率（SR@Loc）、交互成功率（SR@Int）、定位时的精确状态成功率（ES-SR@Loc）和交互时的精确状态成功率（ES-SR@Int），以及一个即插即用的**视觉诊断助手**（VDA），它生成描述和边界框定位提示，通过有/无控制比较来诊断视觉对接原因。在FineState-Bench上，精确目标状态的成功率仍然较低：交互时的精确状态成功率（ES-SR@Int）在Web上达到32.8%，平均跨平台为22.8%。使用VDA定位提示，Gemini-2.5-Flash获得了+14.9 ES-SR@Int点的提升，表明通过改进视觉对接仍有很大的改进空间，但总体准确性仍然不足以实现可靠的基于状态条件的细粒度交互。[GitHub](https://github.com/FengxianJi/FineState-Bench)

The Effects of Visual Priming on Cooperative Behavior in Vision-Language Models

Authors: Kenneth J. K. Ong

First: 2026-04-30T14:50:48+00:00 · Latest: 2026-04-30T14:50:48+00:00

Abs · PDF · Code1 · Code2

Abstract

As Vision-Language Models (VLMs) become increasingly integrated into decision-making systems, it is essential to understand how visual inputs influence their behavior. This paper investigates the effects of visual priming on VLMs' cooperative behavior using the Iterated Prisoner's Dilemma (IPD) as a test scenario. We examine whether exposure to images depicting behavioral concepts (kindness/helpfulness vs. aggressiveness/selfishness) and color-coded reward matrices alters VLM decision patterns. Experiments were conducted across multiple state-of-the-art VLMs. We further explore mitigation strategies including prompt modifications, Chain of Thought (CoT) reasoning, and visual token reduction. Results show that VLM behavior can be influenced by both image content and color cues, with varying susceptibility and mitigation effectiveness across models. These findings not only underscore the importance of robust evaluation frameworks for VLM deployment in visually rich and safety-critical environments, but also highlight how architectural and training differences among models may lead to distinct behavioral responses, an area worthy of further investigation.
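
For readers unfamiliar with the test scenario, the Iterated Prisoner's Dilemma and a cooperation-rate measure can be sketched as follows. The payoff values (3/3 for mutual cooperation, 1/1 for mutual defection, 5/0 for unilateral defection) are the conventional textbook ones and are an assumption here; the paper's reward matrices are not given in the abstract. The names `play` and `cooperation_rate` are hypothetical.

```python
# Conventional payoffs (an assumption; the paper's matrices may differ).
# Keys are (move_a, move_b), values are (payoff_a, payoff_b).
PAYOFFS = {
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),
}


def play(moves_a, moves_b):
    """Total payoffs for two move sequences of 'C' (cooperate) / 'D' (defect)."""
    score_a = score_b = 0
    for a, b in zip(moves_a, moves_b):
        pa, pb = PAYOFFS[(a, b)]
        score_a += pa
        score_b += pb
    return score_a, score_b


def cooperation_rate(moves):
    """Fraction of rounds in which the player cooperated."""
    return moves.count("C") / len(moves) if moves else 0.0
```

A priming effect would show up as a shift in `cooperation_rate` between runs that differ only in the image or color cue shown to the model.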

中文标题/摘要

标题：视觉先兆对视觉语言模型合作行为的影响

随着视觉语言模型（VLMs）越来越多地集成到决策系统中，理解视觉输入如何影响其行为变得至关重要。本文通过迭代囚徒困境（IPD）作为测试场景，研究视觉先兆对VLMs合作行为的影响。我们探讨了暴露于描绘行为概念的图像（友善/乐于助人 vs. 好斗/自私）和颜色编码的奖励矩阵是否改变了VLM的决策模式。实验在多个最先进的VLMs上进行。我们进一步探讨了包括提示修改、思维链（CoT）推理和视觉标记减少在内的缓解策略。结果表明，VLM的行为可以受到图像内容和颜色提示的影响，不同模型在影响程度和缓解效果上存在差异。这些发现不仅强调了在视觉丰富和安全关键环境中部署VLM时需要稳健的评估框架的重要性，还突显了模型架构和训练差异可能导致不同行为反应这一领域值得进一步研究。

Summary / 总结

This study investigates the effects of visual priming on cooperative behavior in Vision-Language Models (VLMs), using the Iterated Prisoner's Dilemma as a test scenario. It examines whether images depicting behavioral concepts and color-coded reward matrices alter VLM decision patterns, and explores mitigation strategies including prompt modifications, Chain of Thought reasoning, and visual token reduction. Results show that VLM behavior can be influenced by both image content and color cues, with varying susceptibility and mitigation effectiveness across models.

Diffusion-OAMP for Joint Image Compression and Wireless Transmission

Authors: Wentao Hou, Yimin Bai, Zelei Luo, Jiadong Hong, Lei Liu

First: 2026-04-30T14:49:31+00:00 · Latest: 2026-04-30T14:49:31+00:00

Comments: 6 pages, 5 figures, 2 tables, submitted for a possible publication

Abs · PDF · Code1 · Code2

Abstract

Joint image compression and wireless transmission remains relatively underexplored compared to generic image restoration, despite its importance in practical communication systems. We formulate this problem under an equivalent linear model, and propose Diffusion-OAMP, a training-free reconstruction framework that embeds a pre-trained diffusion model into the OAMP algorithm. In Diffusion-OAMP, the OAMP linear estimator produces pseudo-AWGN observations, while the diffusion model serves as a nonlinear estimator under an SNR-matching rule. This framework offers a way to incorporate multiple generative priors into OAMP. Experiments with varying compression ratios and noise levels show that Diffusion-OAMP performs favorably against classic methods in the evaluated settings.

中文标题/摘要

标题：扩散-OAMP 用于联合图像压缩与无线传输

联合图像压缩与无线传输在实际通信系统中具有重要意义，但与通用图像恢复相比，其研究相对较少。我们在此问题上构建了一个等效的线性模型，并提出了一种名为扩散-OAMP的无训练重建框架，该框架将预训练的扩散模型嵌入到OAMP算法中。在扩散-OAMP中，OAMP线性估计器产生伪AWGN观测值，而扩散模型则在信噪比匹配规则下作为非线性估计器。该框架提供了一种将多个生成先验信息整合到OAMP中的方法。实验结果表明，在不同压缩比和噪声水平下，扩散-OAMP在评估的设置中优于经典方法。

Summary / 总结

The paper addresses the underexplored area of joint image compression and wireless transmission, proposing Diffusion-OAMP, a training-free reconstruction framework. This framework integrates a pre-trained diffusion model into the OAMP algorithm, where the OAMP linear estimator generates pseudo-AWGN observations, and the diffusion model acts as a nonlinear estimator under an SNR-matching rule. Experiments across different compression ratios and noise levels show that Diffusion-OAMP performs favorably against classic methods in the evaluated settings.

论文针对联合图像压缩与无线传输这一相对未被充分探索的领域，提出了Diffusion-OAMP，这是一种无需训练的重建框架。该框架将预训练的扩散模型集成到OAMP算法中，其中OAMP线性估计器生成伪AWGN观测值，而扩散模型则在SNR匹配规则下作为非线性估计器。实验结果表明，Diffusion-OAMP在不同压缩比和噪声水平下优于经典方法。

Dynamic Cluster Data Sampling for Efficient and Long-Tail-Aware Vision-Language Pre-training

Authors: Mingliang Liang, Zhuoran Liu, Arjen P. de Vries, Martha Larson

First: 2026-04-30T14:33:23+00:00 · Latest: 2026-04-30T14:33:23+00:00

Abs · PDF · Code1 · Code2

Abstract

The computational cost of training a vision-language model (VLM) can be reduced by sampling the training data. Previous work on efficient VLM pre-training has pointed to the importance of semantic data balance, adjusting the distribution of topics in the data to improve VLM accuracy. However, existing efficient pre-training approaches may disproportionately remove rare concepts from the training corpus. As a result, \emph{long-tail concepts} remain insufficiently represented in the training data and are not effectively captured during training. In this work, we introduce a \emph{dynamic cluster-based sampling approach (DynamiCS)} that downsamples large clusters of data and upsamples small ones. The approach is dynamic in that it applies sampling at each epoch. We first show the importance of dynamic sampling for VLM training. Then, we demonstrate the advantage of our cluster-scaling approach, which maintains the relative order of semantic clusters in the data and emphasizes the long-tail. This approach contrasts with current work, which focuses only on flattening the semantic distribution of the data. Our experiments show that DynamiCS reduces the computational cost of VLM training and provides a performance advantage for long-tail concepts.
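
The idea of downsampling large clusters and upsampling small ones while keeping their relative order can be sketched as follows. The abstract does not give the scaling rule, so this sketch assumes a power law: each cluster's per-epoch target is proportional to its size raised to an exponent `alpha` between 0 and 1. The names `cluster_targets`, `sample_epoch`, `budget`, and `alpha` are hypothetical.

```python
import random


def cluster_targets(cluster_sizes, budget, alpha=0.5):
    """Per-cluster sample counts proportional to size ** alpha.

    With 0 < alpha < 1, larger clusters still get more samples than smaller
    ones (order is preserved) but their share shrinks. The power law is an
    assumption made for illustration.
    """
    weights = {c: n ** alpha for c, n in cluster_sizes.items()}
    total = sum(weights.values())
    return {c: max(1, round(budget * w / total)) for c, w in weights.items()}


def sample_epoch(clusters, budget, alpha=0.5, rng=None):
    """Draw one epoch's training subset; call again each epoch for a fresh draw."""
    rng = rng or random.Random()
    sizes = {c: len(items) for c, items in clusters.items()}
    targets = cluster_targets(sizes, budget, alpha)
    batch = []
    for c, items in clusters.items():
        t = targets[c]
        if t <= len(items):
            # Downsample: draw without replacement.
            batch.extend(rng.sample(items, t))
        else:
            # Upsample: draw with replacement.
            batch.extend(rng.choices(items, k=t))
    rng.shuffle(batch)
    return batch
```

Because `sample_epoch` redraws at every call, a large cluster contributes different items in different epochs, so more of the data is seen over the whole run than a single static subset would allow.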

Summary / 总结

This paper proposes a dynamic cluster-based sampling approach (DynamiCS) to reduce the computational cost of vision-language model (VLM) training while keeping long-tail concepts adequately represented. The method downsamples large clusters and upsamples small ones, resampling at each epoch. Experiments show that DynamiCS lowers training cost and improves performance on long-tail concepts relative to existing methods that only flatten the semantic distribution.
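
The sampling idea described above (shrink large clusters, grow small ones, keep their relative order, and redraw every epoch) can be illustrated with a minimal sketch. The power-law scaling `n ** alpha`, the `budget` parameter, and both function names are assumptions made for illustration; the abstract does not state the actual scaling rule used by DynamiCS.

```python
import random
from collections import defaultdict

def cluster_scaled_targets(cluster_sizes, alpha=0.5, budget=None):
    """Assumed rule: target size proportional to n**alpha (0 < alpha < 1).

    A monotonic power keeps the relative order of clusters while
    compressing the gap between head and tail clusters.
    """
    scaled = {c: n ** alpha for c, n in cluster_sizes.items()}
    total = sum(scaled.values())
    if budget is None:
        budget = sum(cluster_sizes.values())
    return {c: max(1, round(budget * s / total)) for c, s in scaled.items()}

def sample_epoch(cluster_of, targets, rng):
    """Draw a fresh training subset for one epoch."""
    members = defaultdict(list)
    for idx, c in cluster_of.items():
        members[c].append(idx)
    chosen = []
    for c, idxs in members.items():
        k = targets[c]
        if k <= len(idxs):
            # Large cluster: downsample without replacement.
            chosen.extend(rng.sample(idxs, k))
        else:
            # Small cluster: keep every item, then upsample with replacement.
            chosen.extend(idxs)
            chosen.extend(rng.choices(idxs, k=k - len(idxs)))
    rng.shuffle(chosen)
    return chosen
```

Calling `sample_epoch` once per epoch with a differently seeded `random.Random` gives each epoch a different draw from the large clusters, which is the "dynamic" part; setting `budget` below the dataset size is what would reduce compute in this sketch.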

Training-Free Tunnel Defect Inspection and Engineering Interpretation via Visual Recalibration and Entity Reconstruction

Authors: Shipeng Liu, Liang Zhao, Dengfeng Chen, Zhanping Song

First: 2026-04-30T14:31:00+00:00 · Latest: 2026-04-30T14:31:00+00:00

Abs · PDF · Code1 · Code2

Abstract

Tunnel inspection requires outputs that can support defect localization, measurement, severity grading, and engineering documentation. Existing training-free foundation-model pipelines usually stop at coarse open-vocabulary proposals, which are difficult to use directly in interference-heavy tunnel scenes. We propose a training-free framework TunnelMIND. Specifically, language-guided defect proposals are not treated as final outputs; instead, their spatial support is recalibrated at inference time through dense visual consistency, so that coarse semantic anchors can be transformed into more reliable prompts under tunnel-specific hard negatives. The resulting masks are further reconstructed into structured defect entities with category, location, geometry, severity, and context attributes, which are then mapped to retrieval-grounded explanation and engineering-readable report generation under expert knowledge constraints. On visible, GPR, and road defect tasks, TunnelMIND achieves F1 scores of 0.68, 0.78, and 0.72, respectively. Overall, TunnelMIND shows that training-free tunnel inspection can move beyond coarse localization toward structured defect evidence for engineering assessment.

Summary / 总结

The research aims to improve training-free defect inspection in tunnel environments by proposing TunnelMIND, which recalibrates coarse language-guided defect proposals through dense visual consistency and reconstructs them into structured defect entities with detailed attributes. The framework achieves F1 scores of 0.68, 0.78, and 0.72 for visible, GPR, and road defect tasks, respectively, demonstrating its ability to generate structured defect evidence for engineering assessment.

OmniDrive-R1: Reinforcement-driven Interleaved Multi-modal Chain-of-Thought for Trustworthy Vision-Language Autonomous Driving

Authors: Zhenguo Zhang, Haohan Zheng, Yishen Wang, Le Xu, Tianchen Deng, Xuefeng Chen, Qu Chen, Bo Zhang, Wuxiong Huang

First: 2025-12-16T03:19:28+00:00 · Latest: 2026-04-30T14:06:23+00:00

Abs · PDF · Code1 · Code2

Abstract

The deployment of Vision-Language Models (VLMs) in safety-critical domains like autonomous driving (AD) is critically hindered by reliability failures, most notably object hallucination. This failure stems from their reliance on ungrounded, text-based Chain-of-Thought (CoT) reasoning. While existing multi-modal CoT approaches attempt mitigation, they suffer from two fundamental flaws: (1) decoupled perception and reasoning stages that prevent end-to-end joint optimization, and (2) reliance on expensive, dense localization labels. Thus we introduce OmniDrive-R1, an end-to-end VLM framework designed for autonomous driving, which unifies perception and reasoning through an interleaved Multi-modal Chain-of-Thought (iMCoT) mechanism. Our core innovation is a reinforcement-driven visual grounding capability, enabling the model to autonomously direct its attention and "zoom in" on critical regions for fine-grained analysis. This capability is enabled by our pure two-stage reinforcement learning training pipeline and Clip-GRPO algorithm. Crucially, Clip-GRPO introduces an annotation-free, process-based grounding reward. This reward not only eliminates the need for dense labels but also circumvents the instability of external tool calls by enforcing real-time cross-modal consistency between the visual focus and the textual reasoning. Extensive experiments on DriveLMM-o1 demonstrate our model's significant improvements. Compared to the baseline Qwen2.5VL-7B, OmniDrive-R1 improves the overall reasoning score from 51.77% to 80.35%, and the final answer accuracy from 37.81% to 73.62%.

Summary / 总结

OmniDrive-R1 is an end-to-end Vision-Language Model framework for autonomous driving that integrates perception and reasoning through an interleaved Multi-modal Chain-of-Thought mechanism. It introduces a reinforcement-driven visual grounding capability, allowing the model to focus on critical regions for fine-grained analysis. Experimental results on DriveLMM-o1 show that OmniDrive-R1 significantly improves reasoning scores and final answer accuracy compared to the baseline Qwen2.5VL-7B, with scores increasing from 51.77% to 80.35% and from 37.81% to 73.62%, respectively.

NeocorRAG: Less Irrelevant Information, More Explicit Evidence, and More Effective Recall via Evidence Chains

Authors: Shiyao Peng, Qianhe Zheng, Zhuodi Hao, Zichen Tang, Rongjin Li, Qing Huang, Jiayu Huang, Jiacheng Liu, Yifan Zhu, Haihong E

Venue: WWW 2026

First: 2026-04-30T13:37:01+00:00 · Latest: 2026-04-30T13:37:01+00:00

Comments: Accepted to WWW 2026

Abs · PDF · Code1 · Code2 · Code3

Abstract

Although precise recall is a core objective in Retrieval-Augmented Generation (RAG), a critical oversight persists in the field: improvements in retrieval performance do not consistently translate to commensurate gains in downstream reasoning. To diagnose this gap, we propose the Recall Conversion Rate (RCR), a novel evaluation metric to quantify the contribution of retrieval to reasoning accuracy. Our quantitative analysis of mainstream RAG methods reveals that as Recall@5 improves, the RCR exhibits a near-linear decay. We identify the neglect of retrieval quality in these methods as the underlying cause. In contrast, approaches that focus solely on quality optimization often suffer from inferior recall performance. Both categories lack a comprehensive understanding of retrieval quality optimization, resulting in a trade-off dilemma. To address these challenges, we propose comprehensive retrieval quality optimization criteria and introduce the NeocorRAG framework. This framework achieves holistic retrieval quality optimization by systematically mining and utilizing Evidence Chains. Specifically, NeocorRAG first employs an innovative activated search algorithm to obtain a refined candidate space. Then it ensures precise evidence chain generation through constrained decoding. Finally, the retrieved set of evidence chains guides the retrieval optimization process. Evaluated on benchmarks including HotpotQA, 2WikiMultiHopQA, MuSiQue, and NQ, NeocorRAG achieves SOTA performance on both 3B and 70B parameter models, while consuming less than 20% of tokens used by comparable methods. This study presents an efficient, training-free paradigm for RAG enhancement that effectively optimizes retrieval quality while maintaining high recall. Our code is released at https://github.com/BUPT-Reasoning-Lab/NeocorRAG.

Summary / 总结

The paper addresses the gap between improved retrieval performance and downstream reasoning accuracy in Retrieval-Augmented Generation (RAG) by proposing the Recall Conversion Rate (RCR) as a new evaluation metric. It identifies the neglect of retrieval quality as the underlying issue and introduces NeocorRAG, a framework that optimizes retrieval quality through systematic mining and utilization of Evidence Chains. NeocorRAG achieves state-of-the-art performance on benchmarks while using fewer tokens compared to other methods.

Hyper-Dimensional Fingerprints as Molecular Representations

Authors: Jonas Teufel, Luca Torresi, André Eberhard, Pascal Friederich

First: 2026-04-30T12:53:58+00:00 · Latest: 2026-04-30T12:53:58+00:00

Comments: Code: https://doi.org/10.5281/zenodo.19373621

Abs · PDF · Code1 · Code2

Abstract

Computational molecular representations underpin virtual screening, property prediction, and materials discovery. Conventional fingerprints are efficient and deterministic but lose structural information through hash-based compression, particularly at low dimensionalities. Learned representations from graph neural networks recover this expressiveness but require task-specific training and substantial computational resources. Here we introduce hyperdimensional fingerprints (HDF), which replace the learned transformations of message-passing neural networks with algebraic operations on high-dimensional vectors, producing deterministic molecular representations without any training. Across diverse property prediction benchmarks, HDF outperforms conventional fingerprints in the majority of tasks while exhibiting greater consistency across datasets and models. Crucially, HDF embeddings preserve molecular similarity faithfully: at 32 dimensions, distances in HDF space achieve a 0.9 Pearson correlation with graph edit distance, compared to 0.55 for Morgan fingerprints at equivalent size. This structural fidelity persists at low dimensions where hash-based methods degrade, allowing simple nearest-neighbor regression to remain predictive with as few as 64 components. We further demonstrate the practical impact in Bayesian molecular optimization, where HDF-based surrogate models achieve substantially improved sample efficiency in regimes where Morgan fingerprints perform comparably to random search. HDF thus provides a general-purpose, training-free alternative to conventional molecular fingerprints, suggesting that the information loss long accepted as inherent to fixed-length fingerprints is a limitation of the hash-based encoding scheme rather than the fingerprint paradigm itself.
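
To make the idea of replacing learned message passing with algebraic operations on high-dimensional vectors more concrete, here is a minimal, generic hyperdimensional-computing sketch. It is not the authors' HDF encoder: the bipolar codebook, the use of elementwise multiplication for binding and summation for bundling, and all function names are assumptions chosen for illustration, and bond types are ignored.

```python
import random

def atom_hv(symbol, dim):
    """Deterministic random bipolar hypervector for an atom symbol."""
    rng = random.Random(f"atom:{symbol}")  # string seeds are reproducible
    return [rng.choice((-1, 1)) for _ in range(dim)]

def bind(a, b):
    """Binding: elementwise multiplication."""
    return [x * y for x, y in zip(a, b)]

def bundle(vectors, dim):
    """Bundling: elementwise sum (order-independent)."""
    out = [0] * dim
    for v in vectors:
        for i, x in enumerate(v):
            out[i] += x
    return out

def sign(v):
    """Project back to bipolar values; ties map to +1."""
    return [1 if x >= 0 else -1 for x in v]

def encode(atoms, bonds, dim=1024, depth=2):
    """Training-free molecular embedding via algebraic message passing."""
    nbrs = {i: [] for i in range(len(atoms))}
    for u, v in bonds:
        nbrs[u].append(v)
        nbrs[v].append(u)
    state = [atom_hv(s, dim) for s in atoms]
    graph = bundle(state, dim)
    for _ in range(depth):
        new_state = []
        for i in range(len(atoms)):
            # Message = bundled neighbour states; update = bind with own state.
            msg = sign(bundle([state[j] for j in nbrs[i]], dim))
            new_state.append(bind(state[i], msg))
        state = new_state
        graph = [g + x for g, x in zip(graph, bundle(state, dim))]
    return graph
```

Because every step is a fixed algebraic operation, the same molecule always maps to the same vector and atom ordering does not matter; for example, `encode(["C", "C", "O"], [(0, 1), (1, 2)])` and `encode(["O", "C", "C"], [(0, 1), (1, 2)])` describe the same chain and yield identical embeddings.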

Flattery in Motion: Benchmarking and Analyzing Sycophancy in Video-LLMs

Authors: Wenrui Zhou, Mohamed Hendy, Shu Yang, Qingsong Yang, Zikun Guo, Yuyu Luo, Lijie Hu, Di Wang

Venue: ACL 2026

First: 2025-06-08T15:00:21+00:00 · Latest: 2026-04-30T12:27:19+00:00

Comments: 27 Pages, Accepted by ACL 2026 Main Conference

Abs · PDF · Code1 · Code2

Abstract

As video large language models (Video-LLMs) become increasingly integrated into real-world applications that demand grounded multimodal reasoning, ensuring their factual consistency and reliability is of critical importance. However, sycophancy, the tendency of these models to align with user input even when it contradicts the visual evidence, undermines their trustworthiness in such contexts. Current sycophancy research has largely overlooked its specific manifestations in the video-language domain, resulting in a notable absence of systematic benchmarks and targeted evaluations to understand how Video-LLMs respond under misleading user input. To fill this gap, we propose VISE (Video-LLM Sycophancy Benchmarking and Evaluation), the first benchmark designed to evaluate sycophantic behavior in state-of-the-art Video-LLMs across diverse question formats, prompt biases, and visual reasoning tasks. Specifically, VISE pioneeringly brings linguistic perspectives on sycophancy into the video domain, enabling fine-grained analysis across multiple sycophancy types and interaction patterns. Furthermore, we propose two potential training-free mitigation strategies revealing potential paths for reducing sycophantic bias: (i) enhancing visual grounding through interpretable key-frame selection and (ii) steering model behavior away from sycophancy via targeted, inference-time intervention on its internal neural representations. Our code is available at https://anonymous.4open.science/r/VideoSycophancy-567F.

Summary / 总结

The research aims to address the issue of sycophancy in Video-LLMs, where these models align with user input despite contradicting visual evidence. The study introduces VISE, a benchmark to evaluate sycophantic behavior in Video-LLMs across various question formats and visual reasoning tasks. Key findings include the identification of different sycophancy types and the proposal of two mitigation strategies: enhancing visual grounding and steering model behavior through inference-time interventions on neural representations.

Iterative Multimodal Retrieval-Augmented Generation for Medical Question Answering

Authors: Xupeng Chen, Binbin Shi, Chenqian Le, Jiaqi Zhang, Kewen Wang, Ran Gong, Jinhan Zhang, Chihang Wang

First: 2026-04-30T11:16:07+00:00 · Latest: 2026-04-30T11:16:07+00:00

Abs · PDF · Code1 · Code2

Abstract

Medical retrieval-augmented generation (RAG) systems typically operate on text chunks extracted from biomedical literature, discarding the rich visual content (tables, figures, structured layouts) of original document pages. We propose MED-VRAG, an iterative multimodal RAG framework that retrieves and reasons over PMC document page images instead of OCR'd text. The system pairs ColQwen2.5 patch-level page embeddings with a sharded MapReduce LLM filter, scaling to ~350K pages while keeping Stage-1 retrieval under 30 ms via an offline coarse-to-fine index (C=8 centroids per page, ANN over centroids, exact two-way scoring on the top-R shortlist). A vision-language model (VLM) then iteratively refines its query and accumulates evidence in a memory bank across up to 3 reasoning rounds, with a single iteration costing ~15.9 s and the full three-round pipeline ~47.8 s on 4xA100. Across four medical QA benchmarks (MedQA, MedMCQA, PubMedQA, MMLU-Med), MED-VRAG reaches 78.6% average accuracy. Under controlled comparison with the same Qwen2.5-VL-32B backbone, retrieval contributes a +5.8 point gain over the no-retrieval baseline; we also note a +1.8 point edge over MedRAG + GPT-4 (76.8%), with the caveat that this is a cross-paper rather than head-to-head comparison. Ablations isolate +1.0 from page-image vs text-chunk retrieval, +1.5 from iteration, and +1.0 from the memory bank.

Summary / 总结

The paper proposes MED-VRAG, an iterative multimodal RAG framework that retrieves and reasons over images from PMC document pages instead of text chunks. It uses ColQwen2.5 patch-level page embeddings and a sharded MapReduce LLM filter to scale to 350K pages with efficient retrieval. MED-VRAG achieves 78.6% average accuracy across four medical QA benchmarks, with significant gains from retrieval and iterative reasoning. Ablation studies show that page-image retrieval, iteration, and the memory bank contribute to the model's performance.
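
The coarse-to-fine retrieval described in the abstract (a few centroids per page for a cheap first pass, then exact scoring on a shortlist) can be sketched roughly as follows. This is an illustration under several assumptions: centroids are built by mean-pooling contiguous patch chunks rather than whatever clustering the authors use, the coarse stage is a brute-force scan standing in for ANN search, scoring is standard one-directional late-interaction MaxSim rather than the unspecified "two-way" variant, and all names are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def maxsim(query_vecs, doc_vecs):
    """Late-interaction score: best-matching doc vector per query vector."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)

def chunk_centroids(patches, c=8):
    """Assumed offline step: mean-pool patches into at most c centroids."""
    size = -(-len(patches) // c)  # ceiling division
    cents = []
    for start in range(0, len(patches), size):
        chunk = patches[start:start + size]
        cents.append([sum(col) / len(chunk) for col in zip(*chunk)])
    return cents

def retrieve(query_vecs, pages, index, top_r=2, top_k=1):
    """pages: page id -> patch vectors; index: page id -> centroids."""
    # Stage 1 (coarse): rank every page using only its centroids.
    shortlist = sorted(
        pages, key=lambda pid: maxsim(query_vecs, index[pid]), reverse=True
    )[:top_r]
    # Stage 2 (fine): exact scoring over all patches, shortlist only.
    reranked = sorted(
        shortlist, key=lambda pid: maxsim(query_vecs, pages[pid]), reverse=True
    )
    return reranked[:top_k]
```

The design point is that the expensive exact scoring touches only `top_r` pages, so first-stage latency depends on the number of centroids rather than the number of patches.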

Auditing Frontier Vision-Language Models for Trustworthy Medical VQA: Grounding Failures, Format Collapse, and Domain Adaptation

Authors: Xupeng Chen, Binbin Shi, Chenqian Le, Qifu Yin, Lang Lin, Haowei Ni, Ran Gong, Panfeng Li

First: 2026-04-30T11:11:47+00:00 · Latest: 2026-04-30T11:11:47+00:00

Abs · PDF · Code1 · Code2

Abstract

Deploying vision-language models (VLMs) in clinical settings demands auditable behavior under realistic failure conditions, yet the failure landscape of frontier VLMs on specialized medical inputs is poorly characterized. We audit five recent frontier and grounding-aware VLMs (Gemini 2.5 Pro, GPT-5, o3, GLM-4.5V, Qwen 2.5 VL) on Medical VQA along two trust-relevant axes. Perception: all models localize anatomical and pathological targets poorly -- the best model reaches only 0.23 mean IoU and 19.1% Acc@0.5 -- and exhibit clinically dangerous laterality confusion. Pipeline integration: a self-grounding pipeline, where the same model localizes then answers, degrades VQA accuracy for every model -- driven by both inaccurate localization and format-compliance failures under the two-step prompt (parse failure rises to 70%--99% for Gemini and GPT-5 on VQA-RAD). Replacing predicted boxes with ground-truth annotations recovers and improves VQA accuracy, consistent with the failure residing in the perception module rather than in the decomposition itself. These observational findings identify grounding quality as a primary trustworthiness bottleneck in our SLAKE bounding-box setting. As a complementary fine-tuning follow-up, supervised fine-tuning of Qwen 2.5 VL on combined Med-VQA training data attains the highest reported SLAKE open-ended recall (85.5%) among comparable methods, suggesting that the VQA-level gap is tractable with domain adaptation; whether this also closes the perception/trustworthiness bottleneck is left to future work.

Summary / 总结

This study evaluates the performance of five advanced vision-language models (Gemini 2.5 Pro, GPT-5, o3, GLM-4.5V, Qwen 2.5 VL) on medical visual question answering (VQA) to identify their weaknesses in perception and pipeline integration. The models struggle with anatomical target localization, achieving only 0.23 mean IoU and 19.1% accuracy at IoU 0.5, and exhibit dangerous laterality confusion. Integrating the same model for both localization and answering significantly degrades VQA accuracy, primarily due to inaccurate localization and format compliance issues. Replacing predicted boxes with ground-truth annotations improves VQA accuracy, indicating that the perception module is the main bottleneck. The study also finds that supervised fine-tuning of Qwen 2.5 VL can achieve high VQA accuracy, suggesting potential for domain adaptation, but leaves the trustworthiness bottleneck unresolved.
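
For readers unfamiliar with the two localization metrics quoted above, the following sketch shows how mean IoU and Acc@0.5 are commonly computed for bounding boxes. The `(x1, y1, x2, y2)` corner format, the `>=` threshold convention, and the function names are assumptions; the paper's evaluation code may differ in these details.

```python
def box_iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def grounding_metrics(preds, gts, thresh=0.5):
    """Return (mean IoU, fraction of predictions with IoU >= thresh)."""
    ious = [box_iou(p, g) for p, g in zip(preds, gts)]
    mean_iou = sum(ious) / len(ious)
    acc = sum(iou >= thresh for iou in ious) / len(ious)
    return mean_iou, acc
```

Under this definition, a mean IoU of 0.23 means that predicted and annotated boxes overlap, on average, in less than a quarter of their combined area.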

Training-Free Reward-Guided Image Editing via Trajectory Optimal Control

Authors: Jinho Chang, Jaemin Kim, Jong Chul Ye

Venue: ICLR 2026 Poster

First: 2025-09-30T06:34:37+00:00 · Latest: 2026-04-30T11:11:41+00:00

Comments: Poster in ICLR 2026; 22 pages, 9 figures. The code is available at https://github.com/jinhojsk515/ITOC

Abs · PDF · Code1 · Code2 · Code3

Abstract

Recent advancements in diffusion and flow-matching models have demonstrated remarkable capabilities in high-fidelity image synthesis. A prominent line of research involves reward-guided guidance, which steers the generation process during inference to align with specific objectives. However, leveraging this reward-guided approach to the task of image editing, which requires preserving the semantic content of the source image while enhancing a target reward, is largely unexplored. In this work, we introduce a novel framework for training-free, reward-guided image editing. We formulate the editing process as a trajectory optimal control problem where the reverse process of a diffusion model is treated as a controllable trajectory originating from the source image, and the adjoint states are iteratively updated to steer the editing process. Through extensive experiments across distinct editing tasks, we demonstrate that our approach significantly outperforms existing inversion-based training-free guidance baselines, achieving a superior balance between reward maximization and fidelity to the source image without reward hacking.

Improving Calibration in Test-Time Prompt Tuning for Vision-Language Models via Data-Free Flatness-Aware Prompt Pretraining

Authors: Hyeonseo Jang, Jaebyeong Jeon, Joong-Won Hwang, Kibok Lee

Venue: CVPR 2026

First: 2026-04-30T11:01:23+00:00 · Latest: 2026-04-30T11:01:23+00:00

Comments: CVPR 2026

Abs · PDF · Code1 · Code2 · Code3

Abstract

Test-time prompt tuning (TPT) has emerged as a promising technique for enhancing the adaptability of vision-language models by optimizing textual prompts using unlabeled test data. However, prior studies have observed that TPT often produces poorly calibrated models, raising concerns about the reliability of their predictions. Recent works address this issue by incorporating additional regularization terms that constrain model outputs, which improve calibration but often degrade performance. In this work, we reveal that these regularization strategies implicitly encourage optimization toward flatter minima, and that the sharpness of the loss landscape around adapted prompts is a key factor governing calibration quality. Motivated by this observation, we introduce Flatness-aware Prompt Pretraining (FPP), a simple yet effective pretraining framework for TPT that initializes prompts within flatter regions of the loss landscape prior to adaptation. We show that simply replacing the initialization in existing TPT pipelines--without modifying any other components--is sufficient to improve both calibration and performance. Notably, FPP requires no labeled data and incurs no additional computational costs during test-time tuning, making it highly practical for real-world deployment. The code is available at: https://github.com/YonseiML/fpp.

Summary / 总结

This paper addresses the issue of poorly calibrated models in test-time prompt tuning (TPT) for vision-language models, which is motivated by the need for reliable predictions. The authors propose Flatness-aware Prompt Pretraining (FPP), a simple pretraining framework that initializes prompts in flatter regions of the loss landscape, improving both calibration and performance without additional computational costs. Experiments show that FPP enhances the reliability of predictions without degrading model performance. The method is practical for real-world deployment as it does not require labeled data or additional computational resources during test-time tuning.

WaferSAGE: Large Language Model-Powered Wafer Defect Analysis via Synthetic Data Generation and Rubric-Guided Reinforcement Learning

Authors: Ke Xu

First: 2026-04-30T09:19:26+00:00 · Latest: 2026-04-30T09:19:26+00:00

Comments: 16 pages, 3 figures, 8 tables

Abs · PDF · Code1 · Code2

Abstract

We present WaferSAGE, a framework for wafer defect visual question answering using small vision-language models. To address data scarcity in semiconductor manufacturing, we propose a three-stage synthesis pipeline incorporating structured rubric generation for precise evaluation. Starting from limited labeled wafer maps, we employ clustering-based cleaning to filter label noise, then generate comprehensive defect descriptions using vision-language models, which are converted into structured evaluation rubric criteria. These rubrics guide the synthesis of VQA pairs, ensuring coverage across defect type identification, spatial distribution, morphology, and root cause analysis. Our dual assessment framework aligns rule-based metrics with LLM-Judge scores via Bayesian optimization, enabling reliable automated evaluation. Through curriculum-based reinforcement learning with Group Sequence Policy Optimization (GSPO) and rubric-aligned rewards, our 4B-parameter Qwen3-VL model achieves a 6.493 LLM-Judge score, closely approaching Gemini-3-Flash (7.149) while enabling complete on-premise deployment. We demonstrate that small models with domain-specific training can surpass proprietary large models in specialized industrial visual understanding, offering a viable path for privacy-preserving, cost-effective deployment in semiconductor manufacturing.

Summary / 总结

WaferSAGE is a framework for wafer defect visual question answering using small vision-language models. It addresses data scarcity in semiconductor manufacturing through a three-stage synthesis pipeline involving structured rubric generation. Starting with limited labeled wafer maps, the framework employs clustering-based cleaning, generates comprehensive defect descriptions, and converts them into structured evaluation rubrics. These rubrics guide the synthesis of VQA pairs, ensuring coverage across defect type identification, spatial distribution, morphology, and root cause analysis. The dual assessment framework aligns rule-based metrics with LLM-Judge scores via Bayesian optimization, enabling reliable automated evaluation. Using curriculum-based reinforcement learning with Group Sequence Policy Optimization (GSPO) and rubric-aligned rewards, the 4B-parameter Qwen3-VL model achieves a 6.493 LLM-Judge score, approaching Gemini-3-Flash (7.149) while allowing complete on-premise deployment.

Test-Time Distillation for Continual Model Adaptation

Authors: Xiao Chen, Jiazhen Huang, Zhiming Liu, Qinting Jiang, Fanding Huang, Jingyan Jiang, Zhi Wang

Venue: CVPR 2026

First: 2025-06-03T09:16:51+00:00 · Latest: 2026-04-30T09:01:20+00:00

Comments: Accepted by CVPR 2026 Findings

Abs · PDF · Code1 · Code2 · Code3

Abstract

Deep neural networks often suffer performance degradation upon deployment due to distribution shifts. Continual Test-Time Adaptation (CTTA) aims to address this issue in an unsupervised manner. However, existing methods that rely on self-supervision are prone to an inherent self-referential feedback loop that amplifies initial prediction errors, leading to model drift. We revisit this limitation and propose Test-Time Distillation (TTD), which reframes adaptation as a distillation process guided by a frozen Vision-Language Model (VLM) as an external signal. While promising, we find that direct distillation is fraught with two pitfalls: (1) the Generalist Trap, where the VLM's broad but non-specialized knowledge leads to suboptimal performance on specific tasks and shifts; and (2) the Entropy Bias, where naive model fusion techniques based on entropy fail due to the disparate calibration of heterogeneous models. These pitfalls highlight the need to build a robust supervisory signal and leverage it to guide the target model toward stable adaptation. Hence, we present CoDiRe, a Continual Distillation and Rectification framework for TTD. CoDiRe first constructs a robust blended teacher by dynamically fusing the predictions of the VLM and the target model. Critically, it circumvents the Entropy Bias by leveraging Maximum Softmax Probability (MSP) as a more reliable confidence metric for weighting each model's expertise. Then it applies an Optimal Transport-based rectification to further align predictions with the blended teacher, enabling continuous and stable adaptation. Extensive experiments show that CoDiRe outperforms state-of-the-art baselines, exceeding CoTTA by 10.55% with only 48% of its time cost on ImageNet-C. Project page is publicly available at https://github.com/walawalagoose/TTD.

Summary / 总结

This paper addresses the performance degradation of deep neural networks due to distribution shifts by proposing Test-Time Distillation (TTD) as a continual test-time adaptation method. TTD reframes adaptation as a distillation process guided by a frozen Vision-Language Model (VLM). However, direct distillation faces two challenges: the Generalist Trap and the Entropy Bias. To overcome these, the authors introduce CoDiRe, a framework that constructs a robust blended teacher and uses Maximum Softmax Probability (MSP) for weighting model expertise, followed by Optimal Transport-based rectification. Experiments show that CoDiRe outperforms existing methods, particularly on ImageNet-C, with better performance and lower computational cost.
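
The blended-teacher step, in which Maximum Softmax Probability (MSP) is used as the confidence signal for weighting the VLM against the target model, might look like the sketch below. The normalized-ratio weighting and the function names are assumptions for illustration; the abstract does not give CoDiRe's exact fusion formula, and the Optimal Transport rectification step is omitted entirely.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def blended_teacher(vlm_logits, target_logits):
    """Fuse two predictions, weighting each by its MSP confidence."""
    p_vlm = softmax(vlm_logits)
    p_tgt = softmax(target_logits)
    # MSP: the largest class probability of each model.
    c_vlm, c_tgt = max(p_vlm), max(p_tgt)
    # Assumed weighting: normalize the two confidences to sum to 1.
    w = c_vlm / (c_vlm + c_tgt)
    return [w * a + (1 - w) * b for a, b in zip(p_vlm, p_tgt)]
```

Because the result is a convex combination of two probability vectors, it is itself a valid distribution and can serve directly as a soft target for distillation.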

CheXthought: A global multimodal dataset of clinical chain-of-thought reasoning and visual attention for chest X-ray interpretation

Authors: Sonali Sharma, Jin Long, George Shih, Sarah Eid, Christian Bluethgen, Francine L. Jacobson, Emily B. Tsai, Global Radiology Consortium, Ahmed M. Alaa, Curtis P. Langlotz

First: 2026-04-29T04:33:43+00:00 · Latest: 2026-04-30T08:02:58+00:00

Comments: 51 pages, 7 figures, 10 tables

Abs · PDF · Code1 · Code2

Abstract

Chest X-ray interpretation is one of the most frequently performed diagnostic tasks in medicine and a primary target for AI development, yet current vision-language models are primarily trained on datasets of paired images and reports, not the cognitive processes and visual attention that underlie clinical reasoning. Here, we present CheXthought, a global, multimodal resource containing 103,592 chain-of-thought reasoning traces and 6,609,082 synchronized visual attention annotations across 50,312 multi-read chest X-rays from 501 radiologists in 71 countries. Our analysis reveals clinical reasoning patterns in how experts deploy distinct visual search strategies, integrate clinical context, and communicate uncertainty. We demonstrate the clinical utility of CheXthought across four dimensions. First, CheXthought reasoning significantly outperforms state-of-the-art vision-language model chain-of-thought in factual accuracy and spatial grounding. Second, visual attention data used as an inference-time hint recovers missed findings and significantly reduces hallucinations. Third, vision-language models trained on CheXthought data achieve significantly stronger pathology classification, visual faithfulness, temporal reasoning and uncertainty communication. Fourth, leveraging CheXthought's multi-reader annotations, we predict both human-human and human-AI disagreement directly from an image, enabling transparent communication of case difficulty, uncertainty and model reliability. These findings establish CheXthought as a resource for advancing multimodal clinical reasoning and the development of more transparent, interpretable vision-language models.

EdgeFM: Efficient Edge Inference for Vision-Language Models

Authors: Mengling Deng, Yuanpeng Chen, Sheng Yang, Wei Tao, Wenhai Zhang, Hui Song, Linyuanhao Qin, Kai Zhao, Xiaojun Ye, Shanhui Mo, Jingli Fan, Shuang Zhang, Bei Liu, Tiankun Zhao, Xiangjing An

First: 2026-04-30T06:18:50+00:00 · Latest: 2026-04-30T06:18:50+00:00

Comments: Technique Report version

Abs · PDF · Code1 · Code2

Abstract

Vision-language models (VLMs) have demonstrated strong applicability in edge industrial applications, yet their deployment remains severely constrained by requirements for deterministic low latency and stable execution under resource limitations. Existing frameworks either rely on bloated general-purpose designs or force developers into opaque, hardware-specific closed-source ecosystems, leading to hardware lock-in limitation and poor cross-platform adaptability. Observing that modern AI agents can efficiently search and tune configurations to generate highly optimized low-level kernels for standard LLM operators, we propose EdgeFM, a lightweight, agent-driven VLM/LLM inference framework tailored for cross-platform industrial edge deployment. EdgeFM removes non-essential features to reduce single-request latency, and encapsulates agent-tuned kernel optimizations as a modular library of reusable skills. By allowing direct invocation of these skills rather than waiting for closed-source implementations, it effectively closes the performance gap long dominated by proprietary toolchains. The framework natively supports mainstream platforms including x86 and NVIDIA Orin SoCs, and represents the first end-to-end VLA deployment on the domestic Horizon Journey platform, enhancing cross-platform portability. In most cases, it yields clearly better inference performance than conventional vendor-specific toolchains, achieving up to 1.49 times speedup over TensorRT-Edge-LLM on the NVIDIA Orin platform. Experimental results show that EdgeFM delivers favorable end-to-end inference performance, providing an open-source, production-grade solution for diverse edge industrial scenarios.

Summary / 总结

EdgeFM is an efficient edge inference framework for vision-language models (VLMs), designed to meet the requirements of deterministic low latency and cross-platform adaptability. It reduces single-request latency by removing non-essential features and encapsulates agent-tuned kernel optimizations as a modular library of reusable skills. In most cases EdgeFM outperforms conventional vendor-specific toolchains, achieving up to a 1.49 times speedup over TensorRT-Edge-LLM on the NVIDIA Orin platform and improving cross-platform portability for edge industrial scenarios.

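EdgeFM is an efficient edge inference framework for vision-language models (VLMs) that targets deterministic low latency and cross-platform adaptability. It lowers single-request latency by removing non-essential features and packages agent-tuned kernel optimizations as a library of reusable skills. On the NVIDIA Orin platform, EdgeFM is up to 1.49 times faster than conventional vendor-specific toolchains, and it improves cross-platform portability in edge industrial scenarios.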

RosettaSearch: Multi-Objective Inference-Time Search for Protein Sequence Design

Authors: Meghana Kshirsagar, Allen Nie, Ching-An Cheng, Fanglei Xue, Rahul Dodhia, Juan Lavista Ferres, Kevin K. Yang, Frank DiMaio

First: 2026-04-19T00:20:18+00:00 · Latest: 2026-04-30T05:12:50+00:00

Abs · PDF · Code1 · Code2

Abstract

We introduce RosettaSearch, an inference-time multi-objective optimization approach for backbone conditioned protein sequence design. We use large language models (LLMs) as a generative optimizer within a search algorithm capable of controlled exploration and exploitation, using rewards computed from RosettaFold3, a structure prediction model, under a strict computational budget. In a large-scale evaluation, we apply RosettaSearch to 400 suboptimal sequences generated by LigandMPNN (a state-of-the-art model trained for protein sequence design), recovering high-fidelity designs that LigandMPNN's single-pass decoding fails to produce. RosettaSearch's designs show improvements in structural fidelity metrics ranging between 18% to 68%, translating to a 2.5x improvement in design success rate. We observe that these gains in success rate are robust when RosettaSearch-designed sequences are evaluated with an independent structure prediction oracle (Chai-1) and generalize across two distinct LLM families (o4-mini and Gemini-3), with performance scaling consistently with reasoning capability. We further demonstrate that RosettaSearch improves the sequence fidelity of ProteinMPNN designs for de novo backbones from the Dayhoff atlas, showing that the approach generalizes beyond native protein structures to computationally generated backbones. We also demonstrate a multi-modal extension of RosettaSearch with vision-language models, where images of predicted protein structures are used as feedback to incorporate structural context to guide protein sequence generation. To our knowledge, this is the first large-scale demonstration that LLMs can serve as effective generative optimizers for backbone-conditioned protein sequence design, yielding systematic gains without any model retraining.

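Translated title / abstract

Title: RosettaSearch: A Multi-Objective Inference-Time Optimization Method for Backbone-Conditioned Protein Sequence Design

We introduce RosettaSearch, an inference-time multi-objective optimization method for backbone-conditioned protein sequence design. We use large language models (LLMs) as generative optimizers inside a search algorithm capable of controlled exploration and exploitation, with rewards computed by RosettaFold3, a structure prediction model, under a strict computational budget. In a large-scale evaluation, we apply RosettaSearch to 400 suboptimal sequences generated by LigandMPNN (a state-of-the-art model for protein sequence design) and recover high-fidelity designs that LigandMPNN's single-pass decoding fails to produce. RosettaSearch's designs improve structural fidelity metrics by 18% to 68%, which corresponds to a 2.5x higher design success rate. These gains in success rate remain robust when the RosettaSearch-designed sequences are evaluated with an independent structure prediction oracle (Chai-1), and they generalize across two distinct LLM families (o4-mini and Gemini-3), with performance scaling consistently with reasoning capability. We further show that RosettaSearch improves the sequence fidelity of ProteinMPNN designs for de novo backbones from the Dayhoff atlas, indicating that the method extends beyond native protein structures to computationally generated backbones. We also present a multimodal extension of RosettaSearch that uses vision-language models, in which images of predicted protein structures serve as feedback that brings structural context into protein sequence generation. To our knowledge, this is the first large-scale demonstration that LLMs can act as effective generative optimizers for backbone-conditioned protein sequence design, delivering systematic gains without any model retraining.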

Summary / 总结

RosettaSearch is an inference-time multi-objective optimization approach for protein sequence design, using large language models (LLMs) to generate high-fidelity designs from suboptimal sequences. It improves structural fidelity metrics by 18% to 68%, leading to a 2.5x increase in design success rate. The approach generalizes across different LLM families and is effective for both native and computationally generated protein backbones.

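RosettaSearch is a multi-objective optimization method for protein sequence design that uses large language models (LLMs) at inference time to turn suboptimal sequences into high-quality designs. It improves structural fidelity metrics by 18% to 68%, which raises the design success rate by 2.5x. The method generalizes across different LLM families and works for both native protein structures and computationally generated protein backbones.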
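
The propose, score, and select pattern behind a budgeted inference-time search can be sketched as follows. This is not RosettaSearch itself: a random single-residue mutation stands in for the LLM proposer, two toy objectives stand in for the structure-based rewards, and the equal-weight scalarization, the `keep` pool size, and all function names are hypothetical choices made for illustration.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def propose(seq, rng):
    """Stand-in for the LLM proposer: mutate one random position."""
    i = rng.randrange(len(seq))
    return seq[:i] + rng.choice(AMINO_ACIDS) + seq[i + 1:]

def objectives(seq, reference):
    """Two toy objectives (assumptions, not structure-based rewards)."""
    match = sum(a == b for a, b in zip(seq, reference)) / len(reference)
    low_cys = 1.0 - seq.count("C") / len(seq)
    return match, low_cys

def score(seq, reference):
    """Equal-weight scalarization of the two objectives (an assumption)."""
    return sum(objectives(seq, reference)) / 2.0

def search(start, reference, budget, keep=3, seed=0):
    """Run `budget` proposal evaluations, keeping the `keep` best candidates."""
    rng = random.Random(seed)
    pool = [(score(start, reference), start)]
    for _ in range(budget):
        _, parent = rng.choice(pool)      # exploration: any kept candidate may be the parent
        child = propose(parent, rng)
        pool.append((score(child, reference), child))
        pool = sorted(pool, reverse=True)[:keep]  # exploitation: keep only the top candidates
    return pool[0]

start = "ACCCCCCA"       # hypothetical suboptimal starting sequence
reference = "ACDEFGHA"   # hypothetical reference used only by the toy reward
best_score, best_seq = search(start, reference, budget=200)
print(best_score >= score(start, reference))
```

Because the pool always retains its highest-scoring member, the best score never drops below that of the starting sequence, and the budget caps the number of reward evaluations.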

Understanding Adversarial Transferability in Vision-Language Models for Autonomous Driving: A Cross-Architecture Analysis

Authors: David Fernandez, Pedram MohajerAnsari, Amir Salarpour, Mert D. Pese

Venue: SAE Technical Paper 2026-01-0170, SAE WCX 2026

First: 2026-04-30T04:33:38+00:00 · Latest: 2026-04-30T04:33:38+00:00

Comments: 9 pages, 2 figures. Accepted at SAE WCX 2026

Abs · PDF · Code1 · Code2

Abstract

Vision-language models (VLMs) are increasingly used in autonomous driving because they combine visual perception with language-based reasoning, supporting more interpretable decision-making, yet their robustness to physical adversarial attacks, especially whether such attacks transfer across different VLM architectures, is not well understood and poses a practical risk when attackers do not know which model a vehicle uses. We address this gap with a systematic cross-architecture study of adversarial transferability in VLM-based driving, evaluating three representative architectures (Dolphins, OmniDrive, and LeapVAD) using physically realizable patches placed on roadside infrastructure in both crosswalk and highway scenarios. Our transfer-matrix evaluation shows high cross-architecture effectiveness, with transfer rates of 73-91% (mean TR = 0.815 for crosswalk and 0.833 for highway) and sustained frame-level manipulation over 64.7-79.4% of the critical decision window even when patches are not optimized for the target model.

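Translated title / abstract

Title: Understanding Adversarial Transferability of Vision-Language Models in Autonomous Driving: A Cross-Architecture Analysis

Vision-language models (VLMs) are increasingly used in autonomous driving because they combine visual perception with language-based reasoning and support more interpretable decision-making. Their robustness to physical adversarial attacks, and in particular whether such attacks transfer across different VLM architectures, is not yet well understood, which creates a practical risk when attackers do not know which model a vehicle uses. We conduct a systematic cross-architecture study of adversarial transferability in VLM-based driving, evaluating three representative architectures (Dolphins, OmniDrive, and LeapVAD) with physically realizable patches placed on roadside infrastructure in crosswalk and highway scenarios. Our transfer-matrix evaluation shows high cross-architecture effectiveness, with transfer rates of 73-91% (mean TR = 0.815 for crosswalk and 0.833 for highway) and sustained frame-level manipulation over 64.7-79.4% of the critical decision window, even when the patches are not optimized for the target model.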
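
A transfer-matrix evaluation of this kind can be sketched as below. The definition used here is an assumption, since the abstract does not spell it out: the transfer rate from a source to a target model is taken as the fraction of patches optimized on the source that also succeed on the target, and the mean transfer rate averages the off-diagonal entries. The model names and trial outcomes are made up and do not reproduce the paper's numbers.

```python
def transfer_matrix(outcomes, models):
    """Build matrix[src][tgt] = fraction of successful attack trials.

    outcomes[(src, tgt)] is a list of booleans, one per trial, saying
    whether a patch optimized on `src` succeeded against `tgt`.
    """
    matrix = {}
    for src in models:
        matrix[src] = {}
        for tgt in models:
            trials = outcomes[(src, tgt)]
            matrix[src][tgt] = sum(trials) / len(trials)
    return matrix

def mean_transfer_rate(matrix):
    """Average over off-diagonal entries (source differs from target)."""
    rates = [matrix[s][t] for s in matrix for t in matrix[s] if s != t]
    return sum(rates) / len(rates)

# Hypothetical models and outcomes, four trials per (source, target) pair.
models = ["model_a", "model_b", "model_c"]
outcomes = {(s, t): [True] * 4 for s in models for t in models}
outcomes[("model_a", "model_b")] = [True, True, True, False]
outcomes[("model_a", "model_c")] = [True, True, False, False]
outcomes[("model_b", "model_c")] = [True, True, True, False]
outcomes[("model_c", "model_a")] = [True, False, True, False]
outcomes[("model_c", "model_b")] = [True, True, False, True]

matrix = transfer_matrix(outcomes, models)
mean_tr = mean_transfer_rate(matrix)
print(round(mean_tr, 3))
```

Excluding the diagonal keeps white-box results (a patch evaluated on the model it was optimized for) from inflating the transfer figure.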

VeraRetouch: A Lightweight Fully Differentiable Framework for Multi-Task Reasoning Photo Retouching

Authors: Yihong Guo, Youwei Lyu, Jiajun Tang, Yizhuo Zhou, Hongliang Wang, Jinwei Chen, Changqing Zou, Qingnan Fan

First: 2026-04-30T03:39:32+00:00 · Latest: 2026-04-30T03:39:32+00:00

Abs · PDF · Code1 · Code2 · Code3

Abstract

Reasoning photo retouching has gained significant traction, requiring models to analyze image defects, give reasoning processes, and execute precise retouching enhancements. However, existing approaches often rely on non-differentiable external software, creating optimization barriers and suffering from high parameter redundancy and limited generalization. To address these challenges, we propose VeraRetouch, a lightweight and fully differentiable framework for multi-task photo retouching. We employ a 0.5B Vision-Language Model (VLM) as the central intelligence to formulate retouching plans based on instructions and scene semantics. Furthermore, we develop a fully differentiable Retouch Renderer that replaces external tools, enabling direct end-to-end pixel-level training through decoupled control latents for lighting, global color, and specific color adjustments. To overcome data scarcity, we introduce AetherRetouch-1M+, the first million-scale dataset for professional retouching, constructed via a new inverse degradation workflow. Furthermore, we propose DAPO-AE, a reinforcement learning post-training strategy that enhances autonomous aesthetic cognition. Extensive experiments demonstrate that VeraRetouch achieves state-of-the-art performance across multiple benchmarks while maintaining a significantly smaller footprint, enabling mobile deployment. Our code and models are publicly available at https://github.com/OpenVeraTeam/VeraRetouch.

Summary / 总结

VeraRetouch is a lightweight and fully differentiable framework for multi-task photo retouching, addressing the limitations of existing non-differentiable approaches by using a 0.5B Vision-Language Model to generate retouching plans and a fully differentiable Retouch Renderer to replace external tools. It also introduces AetherRetouch-1M+, a million-scale dataset, and a reinforcement learning post-training strategy, DAPO-AE, to enhance aesthetic cognition. Experimental results show that VeraRetouch outperforms existing methods on multiple benchmarks while being more compact for mobile deployment.

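VeraRetouch is a lightweight, fully differentiable framework for multi-task photo retouching. It addresses the limitations of existing non-differentiable methods by using a 0.5B Vision-Language Model to generate retouching plans and a fully differentiable Retouch Renderer in place of external tools. It also introduces AetherRetouch-1M+, a million-scale dataset, and DAPO-AE, a reinforcement learning post-training strategy that strengthens aesthetic cognition. Experiments show that VeraRetouch performs strongly on multiple benchmarks while being compact enough for mobile deployment.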

CasLayout: Cascaded 3D Layout Diffusion for Indoor Scene Synthesis with Implicit Relation Modeling

Authors: Yingrui Wu, Youkang Kong, Mingyang Zhao, Weize Quan, Dong-Ming Yan, Yang Liu

First: 2026-04-30T03:18:26+00:00 · Latest: 2026-04-30T03:18:26+00:00

Comments: SIGGRAPH 2026 (Journal Track), Code: https://github.com/YingruiWoo/CasLayout

Abs · PDF · Code1 · Code2 · Code3

Abstract

Synthesizing realistic 3D indoor scenes remains challenging due to data scarcity and the difficulty of simultaneously enforcing global architectural constraints and local semantic consistency. Existing approaches often overlook structural boundaries or rely on fully connected relation graphs that introduce redundant generation errors. Inspired by human design cognition, we present CasLayout, a cascaded diffusion framework that decomposes the joint scene generation task into four conditional sub-stages with explicit physical and semantic roles: (1) predicting furniture quantity and categories, (2) refining object sizes and feature embeddings, (3) modeling spatial relationships in a latent space, and (4) generating Oriented Bounding Boxes (OBBs). This decoupled architecture reduces data requirements and enables flexible integration of Large Language Models (LLMs) and Vision Language Models (VLMs) for zero-shot tasks such as image-to-scene generation. To maintain physical validity within complex floor plans, we explicitly model building elements (e.g., walls, doors, and windows) as conditional constraints. Furthermore, to address the high entropy of dense relation graphs, we introduce a sparse relation graph formulation aligned with human spatial descriptions. By encoding these sparse graphs into a compact latent space using a bidirectional Variational Autoencoder (VAE), the proposed framework provides enhanced relational controllability, allowing generated layouts to better respect functional organization. Experiments demonstrate that CasLayout achieves state-of-the-art performance in fidelity and diversity while enabling improved controllability in practical applications.

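Translated title / abstract

Title: CasLayout: Cascaded 3D Layout Diffusion with Implicit Relation Modeling for Indoor Scene Synthesis

Synthesizing realistic 3D indoor scenes remains challenging because data is scarce and because global architectural constraints and local semantic consistency are hard to satisfy at the same time. Existing methods often ignore structural boundaries or rely on fully connected relation graphs, which introduce redundant generation errors. Inspired by human design cognition, we propose CasLayout, a cascaded diffusion framework that decomposes joint scene generation into four conditional sub-stages with explicit physical and semantic roles: (1) predicting furniture quantity and categories, (2) refining object sizes and feature embeddings, (3) modeling spatial relationships in a latent space, and (4) generating Oriented Bounding Boxes (OBBs). This decoupled architecture reduces data requirements and allows flexible integration of Large Language Models (LLMs) and Vision Language Models (VLMs) for zero-shot tasks such as image-to-scene generation. To preserve physical validity within complex floor plans, we explicitly model building elements (such as walls, doors, and windows) as conditional constraints. In addition, to address the high entropy of dense relation graphs, we introduce a sparse relation graph formulation that matches human spatial descriptions. By encoding these sparse graphs into a compact latent space with a bidirectional Variational Autoencoder (VAE), the framework offers stronger relational controllability, so that generated layouts better respect functional organization. Experiments show that CasLayout achieves state-of-the-art fidelity and diversity while improving controllability in practical applications.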

Iterative Definition Refinement for Zero-Shot Classification via LLM-Based Semantic Prototype Optimization

Authors: Naeem Rehmat, Muhammad Saad Saeed, Ijaz Ul Haq, Khalid Malik

Venue: CVPR NeXD Workshop (2026)

First: 2026-04-30T02:25:33+00:00 · Latest: 2026-04-30T02:25:33+00:00

Comments: Accepted at CVPR NeXD Workshop (2026)

Abs · PDF · Code1 · Code2 · Code3

Abstract

Web filtering systems rely on accurate web content classification to block cyber threats, prevent data exfiltration, and ensure compliance. However, classification is increasingly difficult due to the dynamic and rapidly evolving nature of the modern web. Embedding-based zero-shot approaches map content and category descriptions into a shared semantic space, enabling label assignment without labeled training data, but remain highly sensitive to definition quality. Poorly specified or ambiguous definitions create semantic overlap in the embedding space, leading to systematic misclassification. In this paper, we propose a training-free, adaptive iterative definition refinement framework that improves zero-shot web content classification by progressively optimizing category definitions rather than updating model parameters. Using LLMs as feedback-driven definition optimizers, we investigate three refinement strategies namely example-guided, confusion-aware, and history-aware, each refining class descriptions using structured signals from misclassified instances. Furthermore, we introduce a human-labeled benchmark of 10 URL categories with 1,000 samples per class and evaluate across 13 state-of-the-art embedding foundation models. Results demonstrate that iterative definition refinement consistently improves classification performance across diverse architectures, establishing definition quality as a critical and underexplored factor in embedding-based systems. The dataset is available at https://github.com/naeemrehmat/B2MWT-10C.
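
The sensitivity to definition quality described above can be illustrated with a small sketch of embedding-based zero-shot labeling: each item is assigned the category whose definition embedding has the highest cosine similarity. The three-dimensional vectors below are hand-made stand-ins for real embeddings, and the category names and the "vague" versus "refined" definitions are hypothetical; no embedding model or LLM is called.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def classify(content_vec, definition_vecs):
    """Return the label whose definition embedding is closest to the content."""
    return max(definition_vecs, key=lambda label: cosine(content_vec, definition_vecs[label]))

# Hypothetical embedding of a page whose true category is "news".
page = [0.9, 0.1, 0.3]

# A vague "news" definition sits far from the page in this toy space.
vague = {"news": [0.5, 0.5, 0.5], "shopping": [0.8, 0.2, 0.1]}
# A refined "news" definition moves closer to pages of that category.
refined = {"news": [0.9, 0.0, 0.4], "shopping": [0.8, 0.2, 0.1]}

print(classify(page, vague), classify(page, refined))
```

Only the definition vector changes between the two runs, which mirrors the training-free setting: the embedding model stays fixed and the category description is what gets optimized.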
