World-R1: Reinforcing 3D Constraints for Text-to-Video Generation
Authors: Weijie Wang, Xiaoxuan He, Youping Gu, Yifan Yang, Zeyu Zhang, Yefei He, Yanbo Ding, Xirui Hu, Donny Y. Chen, Zhiyuan He, Yuqing Yang, Bohan Zhuang
First: 2026-04-27T17:59:56+00:00 · Latest: 2026-04-27T17:59:56+00:00
Comments: Project Page: https://aka.ms/world-r1, Code: https://github.com/microsoft/World-R1
Abstract
Recent video foundation models demonstrate impressive visual synthesis but frequently suffer from geometric inconsistencies. While existing methods attempt to inject 3D priors via architectural modifications, they often incur high computational costs and limit scalability. We propose World-R1, a framework that aligns video generation with 3D constraints through reinforcement learning. To facilitate this alignment, we introduce a specialized pure text dataset tailored for world simulation. Utilizing Flow-GRPO, we optimize the model using feedback from pre-trained 3D foundation models and vision-language models to enforce structural coherence without altering the underlying architecture. We further employ a periodic decoupled training strategy to balance rigid geometric consistency with dynamic scene fluidity. Extensive evaluations reveal that our approach significantly enhances 3D consistency while preserving the original visual quality of the foundation model, effectively bridging the gap between video generation and scalable world simulation.
中文标题/摘要
标题:World-R1:强化3D约束以实现文本到视频生成
近期的视频基础模型展示了令人印象深刻的视觉合成能力,但经常存在几何不一致的问题。虽然现有方法试图通过架构修改注入3D先验知识,但往往会导致高计算成本并限制可扩展性。我们提出了一种名为World-R1的框架,通过强化学习使视频生成与3D约束保持一致。为了促进这种对齐,我们引入了一个专门用于世界模拟的纯文本数据集。利用Flow-GRPO,我们通过预训练的3D基础模型和视觉-语言模型的反馈来优化模型,以确保结构一致性而不改变基础架构。我们还采用了一种周期性解耦训练策略,以平衡刚性几何一致性与动态场景的流动性。广泛的评估表明,我们的方法显著提高了3D一致性,同时保留了基础模型的原始视觉质量,有效地弥合了视频生成与可扩展世界模拟之间的差距。
Summary / 总结
World-R1 is designed to improve 3D consistency in text-to-video generation by aligning video output with 3D constraints using reinforcement learning. It introduces a specialized text dataset for world simulation and employs Flow-GRPO to optimize the model without changing its architecture. The method also uses a periodic decoupled training strategy to balance rigid geometry and dynamic scenes. Experimental results show that World-R1 enhances 3D consistency while maintaining the original visual quality of the foundation model.
World-R1 通过使用强化学习将视频生成与 3D 约束对齐,以提高文本到视频生成中的 3D 一致性。它引入了一个专门用于世界模拟的文本数据集,并使用 Flow-GRPO 对模型进行优化,而不改变其架构。该方法还采用周期性解耦训练策略来平衡刚性几何和动态场景。实验结果表明,World-R1 在保持基础模型原始视觉质量的同时,显著提高了 3D 一致性。
WildLIFT: Lifting monocular drone video to 3D for species-agnostic wildlife monitoring
Authors: Vandita Shukla, Fabio Remondino, Blair Costelloe, Benjamin Risse
First: 2026-04-27T17:29:22+00:00 · Latest: 2026-04-27T17:29:22+00:00
Abstract
Monocular RGB cameras mounted on drones are widely used for wildlife monitoring, yet most analytical pipelines remain confined to two-dimensional image space, leaving geometric information in video underexploited. We present WildLIFT, a computational framework that integrates three-dimensional scene geometry from monocular drone video with open-vocabulary 2D instance segmentation to enable species-agnostic 3D detection and tracking. Oriented 3D bounding box labels with semantic face information enable quantitative assessment of viewpoint coverage and inter-animal occlusion, producing structured metadata for downstream ecological analyses. We validate the framework on 2,581 manually curated frames comprising over 6,700 3D detections across four large mammal species. WildLIFT maintains high identity consistency in multi-animal scenes and substantially reduces manual 3D annotation effort through keyframe-based refinement. By transforming standard drone footage into structured 3D and viewpoint-aware representations, WildLIFT extends the analytical utility of aerial wildlife datasets for behavioural research and population monitoring.
中文标题/摘要
标题:WildLIFT:将单目无人机视频提升至三维以实现无物种差异的野生动物监测
安装在无人机上的单目RGB相机广泛用于野生动物监测,但大多数分析管道仍局限于二维图像空间,使得视频中的几何信息未得到充分利用。我们提出了一种计算框架WildLIFT,该框架将单目无人机视频中的三维场景几何与开放词汇的二维实例分割相结合,以实现无物种差异的三维检测和跟踪。带有语义面部信息的定向三维边界框标签能够定量评估视角覆盖范围和个体间的遮挡情况,生成下游生态分析所需的结构化元数据。我们在包含超过6,700个三维检测结果的2,581帧手动标注图像中验证了该框架,这些图像来自四种大型哺乳动物物种。WildLIFT在多动物场景中保持了高身份一致性,并通过关键帧优化大幅减少了手动三维标注的工作量。通过将标准无人机视频转换为结构化的三维和视角感知表示,WildLIFT扩展了空中野生动物数据集在行为研究和种群监测中的分析用途。
Training-Free Model Ensemble for Single-Image Super-Resolution via Strong-Branch Compensation
Authors: Gengjia Chang, Xining Ge, Weijun Yuan, Zhan Li, Qiurong Song, Luen Zhu, Shuhong Liu
First: 2026-04-13T14:48:03+00:00 · Latest: 2026-04-27T17:20:38+00:00
Abstract
Single-image super-resolution has progressed from deep convolutional baselines to stronger Transformer and state-space architectures, yet the corresponding performance gains typically come with higher training cost, longer engineering iteration, and heavier deployment burden. In many practical settings, multiple pretrained models with partially complementary behaviors are already available, and the binding constraint is no longer architectural capacity but how effectively their outputs can be combined without additional training. Rather than pursuing further architectural redesign, this paper proposes a training-free output-level ensemble framework. A dual-branch pipeline is constructed in which a Hybrid attention network with TLC inference provides stable main reconstruction, while a MambaIRv2 branch with geometric self-ensemble supplies strong compensation for high-frequency detail recovery. The two branches process the same low-resolution input independently and are fused in the image space via a lightweight weighted combination, without updating any model parameters or introducing an additional trainable module. As our solution to the NTIRE 2026 Image Super-Resolution ($\times 4$) Challenge, the proposed design consistently improves over the base branch and slightly exceeds the pure strong branch in PSNR at the best operating point under a unified DIV2K bicubic $\times 4$ evaluation protocol. Ablation studies confirm that output-level compensation provides a low-overhead and practically accessible upgrade path for existing super-resolution systems.
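The output-level fusion described in this abstract can be illustrated with a minimal sketch. It assumes the "lightweight weighted combination" is a single scalar blend of the two branch outputs; the weight name `alpha`, the helper names, and the toy pixel values are hypothetical and are not taken from the paper.

```python
import math

def fuse(base, strong, alpha):
    """Pixel-wise blend of two same-sized images: (1 - alpha) * base + alpha * strong."""
    return [[(1.0 - alpha) * b + alpha * s for b, s in zip(row_b, row_s)]
            for row_b, row_s in zip(base, strong)]

def psnr(pred, target, peak=1.0):
    """Peak signal-to-noise ratio in dB for images with values in [0, peak]."""
    sq = [(p - t) ** 2 for row_p, row_t in zip(pred, target) for p, t in zip(row_p, row_t)]
    mse = sum(sq) / len(sq)
    return float("inf") if mse == 0 else 10.0 * math.log10(peak ** 2 / mse)

# Toy data (made up): the two branches err in opposite directions,
# so an intermediate weight can beat either branch on its own.
target = [[0.50, 0.20], [0.80, 0.40]]
base = [[0.56, 0.26], [0.86, 0.46]]    # main branch, error +0.06 per pixel
strong = [[0.48, 0.18], [0.78, 0.38]]  # compensation branch, error -0.02 per pixel

# Grid search for the operating point; no model parameters are updated.
candidates = [i / 20 for i in range(21)]
best_alpha = max(candidates, key=lambda a: psnr(fuse(base, strong, a), target))
```

In practice the weight would be selected on a validation set; the toy errors here are constructed so that the best blend lies strictly between the two branches.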
SPEAR-1: Scaling Beyond Robot Demonstrations via 3D Understanding
Authors: Nikolay Nikolov, Giuliano Albanese, Sombit Dey, Aleksandar Yanev, Luc Van Gool, Jan-Nico Zaech, Danda Pani Paudel
First: 2025-11-21T17:09:43+00:00 · Latest: 2026-04-27T17:16:04+00:00
Abstract
Robotic Foundation Models (RFMs) hold great promise as generalist, end-to-end systems for robot control. Yet their ability to generalize across new environments, tasks, and embodiments remains limited. We argue that a major bottleneck lies in their foundations: most RFMs are built by fine-tuning internet-pretrained Vision-Language Models (VLMs). However, these VLMs are trained on 2D image-language tasks and lack the 3D spatial reasoning inherently required for embodied control in the 3D world. Bridging this gap directly with large-scale robotic data is costly and difficult to scale. Instead, we propose to enrich easy-to-collect non-robotic image data with 3D annotations and enhance a pretrained VLM with 3D understanding capabilities. Following this strategy, we train SPEAR-VLM, a 3D-aware VLM that infers object coordinates in 3D space from a single 2D image. Building on SPEAR-VLM, we introduce our main contribution, SPEAR-1: a robotic foundation model that integrates grounded 3D perception with language-instructed embodied control. Trained on $\sim$45M frames from 24 Open X-Embodiment datasets, SPEAR-1 outperforms or matches state-of-the-art models such as $π_0$-FAST and $π_{0.5}$ while using 20$\times$ fewer robot demonstrations. This carefully engineered training strategy unlocks new VLM capabilities and, as a consequence, boosts the reliability of embodied control beyond what is achievable with only robotic data. We make our model weights and 3D-annotated datasets publicly available at https://spear.insait.ai.
中文标题/摘要
标题:SPEAR-1:通过三维理解超越机器人演示的扩展
机器人基础模型(RFMs)作为通用的端到端系统,在机器人控制方面具有巨大潜力。然而,它们在新环境、任务和实体间的泛化能力仍然有限。我们认为,主要瓶颈在于其基础:大多数RFMs是通过微调互联网预训练的视觉-语言模型(VLMs)构建的。然而,这些VLMs是在2D图像-语言任务上进行训练的,缺乏在三维世界中进行实体控制所需的三维空间推理能力。直接通过大规模机器人数据来弥合这一差距成本高昂且难以扩展。相反,我们提出了一种方法,即丰富易于收集的非机器人图像数据的三维注释,并增强预训练的VLM以具备三维理解能力。遵循这一策略,我们训练了SPEAR-VLM,这是一种三维感知的VLM,可以从单张2D图像中推断出物体在三维空间中的坐标。基于SPEAR-VLM,我们引入了我们的主要贡献——SPEAR-1:一种结合了基于语言的实体控制和三维感知的机器人基础模型。SPEAR-1在来自24个Open X-Embodiment数据集的约4500万帧数据上进行训练,其性能优于或匹配π_0-FAST和π_{0.5}等最先进的模型,同时使用了20倍少的机器人演示数据。这种精心设计的训练策略解锁了新的VLM能力,从而在仅使用机器人数据的情况下提升了实体控制的可靠性。我们将在https://spear.insait.ai/上公开我们的模型权重和三维标注的数据集。
Agent-Aided Design for Dynamic CAD Models
Authors: Mitch Adler, Matthew Russo, Michael Cafarella
First: 2026-04-16T16:15:23+00:00 · Latest: 2026-04-27T16:57:52+00:00
Comments: 5 pages, 3 figures, published in CAIS'26
Abstract
In the past year, researchers have created agentic systems that can design real-world CAD-style objects in a training-free setting, a new variety of system that we call Agent-Aided Design. These systems place an agent in a feedback loop in which it generates an assembly of CAD model(s), visualizes the assembly, and then iteratively refines the assembly based on visual and other feedback. Despite rapid progress, a key problem remains: none of these systems can build complex 3D assemblies with moving parts. For example, no existing system can build a piston, a pendulum, or even a pair of scissors. For Agent-Aided Design to make a real impact in industrial manufacturing, we need a system that is capable of generating such 3D assemblies. In this paper we present a prototype of AADvark, an agentic system designed for this task. Unlike previous state-of-the-art systems, AADvark captures dynamic part interactions with one or more degrees of freedom. This design decision allows AADvark to reason directly about assemblies with moving parts and thereby achieve cross-cutting goals, including but not limited to mechanical movements. Unfortunately, current LLMs are imperfect spatial reasoners, a problem that AADvark addresses by incorporating external constraint solver tools with a specialized visual feedback mechanism. We demonstrate that, by modifying the agent's tools (FreeCAD and the assembly solver), we are able to create a strong verification signal which enables our system to build 3D assemblies with movable parts.
Summary / 总结
This paper addresses the challenge of designing complex 3D assemblies with moving parts in Agent-Aided Design systems; no existing system can build objects such as a piston, a pendulum, or a pair of scissors. The authors present a prototype of AADvark, an agentic system that captures dynamic part interactions with one or more degrees of freedom, allowing it to reason directly about assemblies with moving parts. By modifying the agent's tools (FreeCAD and the assembly solver) and combining external constraint solvers with specialized visual feedback, AADvark compensates for the imperfect spatial reasoning of current LLMs and obtains a verification signal strong enough to build 3D assemblies with movable parts.
本文解决了在Agent-Aided Design系统中设计具有移动部件的复杂3D装配件的挑战。作者引入了AADvark,这是一种能够捕捉动态部件交互的系统,使其能够生成具有移动部件的装配件。通过集成外部约束求解器和专门的视觉反馈机制,AADvark克服了当前语言模型的局限性,并成功构建了如活塞、摆锤和剪刀等具有可动部件的3D装配件。
Self-Rewarding Vision-Language Model via Reasoning Decomposition
Authors: Zongxia Li, Wenhao Yu, Chengsong Huang, Zhenwen Liang, Rui Liu, Fuxiao Liu, Jingxi Che, Dian Yu, Jordan Boyd-Graber, Haitao Mi, Dong Yu
First: 2025-08-27T08:01:03+00:00 · Latest: 2026-04-27T16:28:37+00:00
Comments: 16 pages, two figures
Abstract
Vision-Language Models (VLMs) often suffer from two failure modes: visual hallucinations, where they generate content that is inconsistent with the visual input, and language shortcuts, where they skip the visual part and rely only on text priors. These issues arise because most post-training methods for VLMs rely on simple verifiable answer matching and supervise only final outputs, leaving intermediate visual reasoning without explicit guidance. As a result, VLMs receive sparse visual signals and often learn to prioritize language-based reasoning over visual perception. We introduce Vision SR1, a three-stage self-rewarding reinforcement learning method that improves visual reasoning without relying on external visual supervision. Vision SR1 decomposes VLM reasoning into two components, visual reasoning and language reasoning: the model is first prompted to produce self-contained visual descriptions sufficient to answer the question without referring back to the input image, and both visual and language reasoning are then jointly optimized through our multi-reward loss objective. To validate this self-containment, the same VLM is re-prompted to perform language reasoning using only the generated visual reasoning as input, which yields the visual reward. The final reward is computed through a decoupled reward-advantage framework, in which the visual reward and the language reasoning reward each have their advantages calculated separately. Our experiments show that Vision SR1 improves visual reasoning, mitigates visual hallucinations, and reduces reliance on language shortcuts across diverse vision-language tasks. It is also more efficient than methods that rely on external visual reward models, which require additional GPUs to host; Vision SR1 introduces no extra GPU overhead beyond that of standard training.
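The decoupled reward-advantage idea in this abstract can be sketched as follows. The sketch assumes GRPO-style normalization (subtract the group mean, divide by the group standard deviation) applied to each reward stream separately, followed by a weighted sum; the function names, the weights, and the toy rewards are hypothetical, and the paper's exact formulation may differ.

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of sampled responses (GRPO-style assumption)."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def decoupled_advantages(visual_rewards, language_rewards, w_visual=1.0, w_language=1.0):
    """Compute advantages per reward stream first, then combine them per response."""
    a_visual = group_advantages(visual_rewards)
    a_language = group_advantages(language_rewards)
    return [w_visual * v + w_language * l for v, l in zip(a_visual, a_language)]

# Four sampled responses to the same prompt (toy rewards, made up for illustration).
visual = [1.0, 0.0, 1.0, 0.0]    # did the self-contained description suffice?
language = [1.0, 1.0, 0.0, 0.0]  # was the final answer correct?
adv = decoupled_advantages(visual, language)
```

Normalizing each stream before combining keeps one reward from dominating purely because of its scale or variance within the group.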
Majorization-Guided Test-Time Adaptation for Vision-Language Models under Modality-Specific Shift
Authors: Lixian Chen, Mingxuan Huang, Yanhui Chen, Junyi Lin, Yang Shi
First: 2026-04-27T15:29:35+00:00 · Latest: 2026-04-27T15:29:35+00:00
Abstract
Vision-language models transfer well in zero-shot settings, but at deployment the visual and textual branches often shift asymmetrically. Under this condition, entropy-based test-time adaptation can sharpen the fused posterior while increasing error, because an unreliable modality may still dominate fusion. We study this failure mode through a majorization view of multimodal posteriors and cast adaptation as a constrained de-mixing problem on the fused prediction. Based on this view, we propose MG-MTTA, which keeps the backbone frozen and updates only a lightweight gate or adapter. The objective combines fused-posterior entropy minimization with a reliability-aware gate prior built from anchor-based modality consistency and cross-modal conflict. Our analysis gives conditions under which entropy reduction preserves the correct ranking and a threshold that characterizes modality-dominance failure. On the ImageNet-based benchmark, MG-MTTA improves top-1 accuracy from 57.97 to 66.51 under semantics-preserving textual shift and from 21.68 to 26.27 under joint visual-textual shift, while remaining competitive in the visual-only benchmark. These results show that multimodal test-time adaptation should control modality reliability, not just prediction entropy.
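The failure mode described in this abstract, where lower fused entropy coincides with a wrong prediction, can be reproduced with a small numerical sketch. It assumes the fused posterior is a convex mixture of the two branch posteriors controlled by a scalar gate; that fusion form, the names, and the probabilities are illustrative assumptions rather than the MG-MTTA formulation.

```python
import math

def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(x * math.log(x) for x in p if x > 0)

def fuse(p_visual, p_text, gate):
    """Convex mixture of branch posteriors; gate is the weight on the visual branch."""
    return [gate * v + (1.0 - gate) * t for v, t in zip(p_visual, p_text)]

def argmax(p):
    return max(range(len(p)), key=lambda i: p[i])

true_class = 0
p_visual = [0.60, 0.30, 0.10]  # reliable branch: moderately confident and correct
p_text = [0.02, 0.96, 0.02]    # shifted branch: very confident and wrong

trust_visual = fuse(p_visual, p_text, gate=0.9)
trust_text = fuse(p_visual, p_text, gate=0.1)

for name, p in (("gate=0.9", trust_visual), ("gate=0.1", trust_text)):
    print(name, "entropy=%.3f" % entropy(p), "correct=%s" % (argmax(p) == true_class))
```

Minimizing entropy alone would prefer the gate that trusts the shifted text branch, which is why the abstract argues for a reliability-aware prior on the gate.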
A systematic evaluation of vision-language models for observational astronomical reasoning tasks
Authors: Wenke Ren, Hengxiao Guo, Wenwen Zuo, Xiaoman Zhang
First: 2026-04-27T15:11:31+00:00 · Latest: 2026-04-27T15:11:31+00:00
Comments: 24 pages, 5 figures
Abstract
Vision-language models (VLMs) are increasingly proposed as general-purpose tools for scientific data interpretation, yet their reliability on real astronomical observations across diverse modalities remains untested. We present AstroVLBench, a comprehensive benchmark comprising over 4,100 expert-verified instances across five tasks spanning optical imaging, radio interferometry, multi-wavelength photometry, time-domain light curves, and optical spectroscopy. Evaluating six frontier models, we find that performance is strongly modality-dependent: while one model (Gemini 3 Pro) emerges as the most consistently capable across tasks, task-specific strengths vary, and all models substantially underperform domain-specialized methods. Mechanistic ablations reveal that performance depends not only on directing attention to salient visual features but also on grounding those features in physical knowledge. Phenomenological prompts describing what to look for improve accuracy by sharpening model focus, but physical prompts explaining why those features matter perform better overall and yield more balanced classifications with reduced class-specific bias. Consistent with this picture, presenting the underlying one-dimensional measurements directly as numerical tables instead of rendered plots yields up to 13 percentage points improvement. Reasoning quality analysis further demonstrates that, without explicit physical grounding, models may reach correct predictions from phenomenologically plausible cues while providing physically imprecise justifications, establishing that accuracy alone is insufficient for trustworthy scientific deployment. These findings provide the first systematic, multi-modal baselines for VLMs in observational astronomy and identify the specific representation, grounding, and reasoning bottlenecks where current models fail.
Summary / 总结
This study systematically evaluates vision-language models (VLMs) on observational astronomical reasoning tasks using AstroVLBench, a benchmark of over 4,100 expert-verified instances across five tasks. Across six frontier models, performance is strongly modality-dependent: Gemini 3 Pro is the most consistently capable, task-specific strengths vary, and all models substantially underperform domain-specialized methods. Ablations show that performance depends both on directing attention to salient visual features and on grounding those features in physical knowledge; physical prompts perform better overall and reduce class-specific bias, and presenting the underlying one-dimensional measurements as numerical tables rather than rendered plots improves accuracy by up to 13 percentage points. Without explicit physical grounding, models can reach correct predictions from phenomenologically plausible cues while giving physically imprecise justifications, so accuracy alone is insufficient for trustworthy scientific deployment. These findings provide systematic multi-modal baselines for VLMs in observational astronomy and identify the representation, grounding, and reasoning bottlenecks where current models fail.
研究评估了视觉语言模型(VLMs)在解释多种天文观测中的可靠性,使用了包含超过4,100个专家验证实例的AstroVLBench基准,涵盖了五个任务。六个模型的评估显示了强烈的模态依赖性,Gemini 3 Pro表现最为一致。各任务的具体优势不同,所有模型在专业方法面前表现不佳。通过解释物理知识的提示可以提高性能,直接呈现原始数据为数值表格也能显著提升准确性。研究强调,VLMs需要物理接地才能在科学应用中可靠。
Improving Vision-language Models with Perception-centric Process Reward Models
Authors: Yingqian Min, Kun Zhou, Yifan Li, Yuhuan Wu, Han Peng, Yifan Du, Wayne Xin Zhao, Min Yang, Ji-Rong Wen
First: 2026-04-27T15:08:02+00:00 · Latest: 2026-04-27T15:08:02+00:00
Comments: 8 pages
Abstract
Recent advancements in reinforcement learning with verifiable rewards (RLVR) have significantly improved the complex reasoning ability of vision-language models (VLMs). However, RLVR's outcome-level supervision is too coarse to diagnose and correct errors within the reasoning chain. To this end, we propose Perceval, a process reward model (PRM) that enables token-level error grounding: it extracts image-related claims from the response, compares them one by one with the visual evidence in the image, and returns the claims that contain perceptual errors. Perceval is trained with perception-intensive supervised training data. We then integrate Perceval into the RL training process to train the policy models. Specifically, whereas traditional GRPO applies sequence-level advantages, we apply token-level advantages by targeting penalties on hallucinated spans identified by Perceval, thus enabling fine-grained supervision signals. In addition to augmenting the training process, Perceval can also assist VLMs during the inference stage. Using Perceval, we can truncate the erroneous portions of the model's response, and then either have the model regenerate the response directly or induce the model to reflect on its previous output. This process can be repeated multiple times to achieve test-time scaling. Experiments show significant improvements on benchmarks from various domains across multiple reasoning VLMs trained with RL, highlighting the promise of perception-centric supervision as a general-purpose strategy. For test-time scaling, it also demonstrates consistent performance gains over other strategies, such as majority voting. Our code and data will be publicly released at https://github.com/RUCAIBox/Perceval.
Summary / 总结
The research aims to enhance the reasoning ability of vision-language models (VLMs) by addressing the limitations of outcome-level supervision in reinforcement learning with verifiable rewards (RLVR). To achieve this, the authors propose Perceval, a process reward model that provides token-level error grounding, enabling the comparison of image-related claims with visual evidence. Perceval is trained with perception-intensive data and integrated into the RL training process to apply token-level advantages, leading to fine-grained supervision. Experiments show significant improvements across multiple VLMs trained with RL, indicating the effectiveness of perception-centric supervision. Additionally, Perceval aids inference by truncating erroneous responses and allowing for model reflection, demonstrating consistent performance gains over other strategies such as majority voting.
研究旨在通过解决结果级监督在强化学习与可验证奖励(RLVR)中的局限性,提升视觉语言模型(VLMs)的推理能力。为此,作者提出了Perceval,一种过程奖励模型,能够提供标记级错误定位,使图像相关声明与视觉证据进行逐个对比。Perceval通过感知密集型数据进行训练,并集成到RL训练过程中,应用标记级优势,实现精细监督。实验结果显示,Perceval在多个使用RL训练的VLMs上取得了显著改进,表明感知中心监督作为通用策略的有效性。此外,Perceval在推理阶段通过截断错误响应并允许模型反思,展示了与多数投票等其他策略相比的一致性能提升。
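The contrast between sequence-level and token-level advantages in the Perceval entry can be sketched as below. The sketch assumes every token starts from the response's sequence-level advantage and that tokens inside spans flagged as hallucinated receive a fixed penalty; the function name, the half-open span convention, and the penalty value are hypothetical, not the paper's specification.

```python
def token_level_advantages(seq_advantage, num_tokens, hallucinated_spans, penalty=1.0):
    """Broadcast a sequence-level advantage to tokens, penalizing flagged spans.

    hallucinated_spans is a list of half-open (start, end) token index ranges,
    standing in for the output of a process reward model such as Perceval.
    """
    advantages = [seq_advantage] * num_tokens
    for start, end in hallucinated_spans:
        # Clamp to the response length; overlapping spans are penalized once.
        for i in range(max(start, 0), min(end, num_tokens)):
            advantages[i] = seq_advantage - penalty
    return advantages

# A 6-token response with a positive outcome reward but one flagged span.
adv = token_level_advantages(seq_advantage=0.5, num_tokens=6, hallucinated_spans=[(2, 4)])
```

With plain sequence-level advantages, the flagged tokens would be reinforced along with the rest of a correct answer; the span penalty is what makes the signal fine-grained.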
Diffusion Model as a Generalist Segmentation Learner
Authors: Haoxiao Wang, Antao Xiang, Haiyang Sun, Peilin Sun, Changhao Pan, Yifu Chen, Minjie Hong, Weijie Wang, Shuang Chen, Yue Chen, Zhou Zhao
First: 2026-04-27T15:04:13+00:00 · Latest: 2026-04-27T15:04:13+00:00
Abstract
Diffusion models are primarily trained for image synthesis, yet their denoising trajectories encode rich, spatially aligned visual priors. In this paper, we demonstrate that these priors can be utilized for text-conditioned semantic and open-vocabulary segmentation, and this approach can be generalized to various downstream tasks to make a general-purpose diffusion segmentation framework. Concretely, we introduce DiGSeg (Diffusion Models as a Generalist Segmentation Learner), which repurposes a pretrained diffusion model into a unified segmentation framework. Our approach encodes the input image and ground-truth mask into the latent space and concatenates them as conditioning signals for the diffusion U-Net. A parallel CLIP-aligned text pathway injects language features across multiple scales, enabling the model to align textual queries with evolving visual representations. This design transforms an off-the-shelf diffusion backbone into a universal interface that produces structured segmentation masks conditioned on both appearance and arbitrary text prompts. Extensive experiments demonstrate state-of-the-art performance on standard semantic segmentation benchmarks, as well as strong open-vocabulary generalization and cross-domain transfer to medical, remote sensing, and agricultural scenarios, without domain-specific architectural customization. These results indicate that modern diffusion backbones can serve as generalist segmentation learners rather than pure generators, narrowing the gap between visual generation and visual understanding.
中文标题/摘要
标题:扩散模型作为通用分割学习者
扩散模型主要被训练用于图像合成,但它们的去噪轨迹中编码了丰富的、空间对齐的视觉先验。在本文中,我们展示了这些先验可以用于文本条件的语义和开放词汇分割,并且这种方法可以泛化到各种下游任务,以构建一个通用的扩散分割框架。具体而言,我们引入了DiGSeg(扩散模型作为通用分割学习者),该方法将预训练的扩散模型重新用于统一的分割框架。我们的方法将输入图像和地面真值掩码编码到潜在空间中,并将其连接起来作为扩散U-Net的条件信号。一个并行的CLIP对齐文本路径注入多尺度的语言特征,使模型能够将文本查询与不断变化的视觉表示对齐。这种设计将现成的扩散主干网转换为一个通用接口,该接口可以根据外观和任意文本提示生成结构化的分割掩码。广泛的实验表明,该方法在标准语义分割基准测试中表现出最先进的性能,并且在开放词汇泛化和跨域迁移(医疗、遥感和农业场景)方面表现出强大的能力,无需特定领域的架构定制。这些结果表明,现代扩散主干网可以作为通用分割学习者,而不是纯粹的生成器,从而缩小了视觉生成与视觉理解之间的差距。
Summary / 总结
This paper explores the use of diffusion models, originally trained for image synthesis, for text-conditioned semantic and open-vocabulary segmentation. The authors introduce DiGSeg, which repurposes a pretrained diffusion model to generate structured segmentation masks conditioned on both appearance and text prompts. Experiments show that this approach achieves state-of-the-art performance on standard benchmarks and demonstrates strong generalization and transferability across various domains without domain-specific customization.
本文探讨了将最初用于图像合成的扩散模型应用于文本条件下的语义和开放词汇分割。作者引入了DiGSeg,该方法将预训练的扩散模型重新用于生成同时基于外观和文本提示的结构化分割掩码。实验表明,这种方法在标准基准上达到了最先进的性能,并且在各种领域中表现出强大的泛化能力和跨域迁移能力,无需特定领域的架构定制。
ReLIC-SGG: Relation Lattice Completion for Open-Vocabulary Scene Graph Generation
Authors: Amir Hosseini, Sara Farahani, Xinyi Li, Suiyang Guang
First: 2026-04-24T13:36:41+00:00 · Latest: 2026-04-27T14:36:24+00:00
Abstract
Open-vocabulary scene graph generation (SGG) aims to describe visual scenes with flexible relation phrases beyond a fixed predicate set. Existing methods usually treat annotated triplets as positives and all unannotated object-pair relations as negatives. However, scene graph annotations are inherently incomplete: many valid relations are missing, and the same interaction can be described at different granularities, e.g., \textit{on}, \textit{standing on}, \textit{resting on}, and \textit{supported by}. This issue becomes more severe in open-vocabulary SGG due to the much larger relation space. We propose \textbf{ReLIC-SGG}, a relation-incompleteness-aware framework that treats unannotated relations as latent variables rather than definite negatives. ReLIC-SGG builds a semantic relation lattice to model similarity, entailment, and contradiction among open-vocabulary predicates, and uses it to infer missing positive relations from visual-language compatibility, graph context, and semantic consistency. A positive-unlabeled graph learning objective further reduces false-negative supervision, while lattice-guided decoding produces compact and semantically consistent scene graphs. Experiments on conventional, open-vocabulary, and panoptic SGG benchmarks show that ReLIC-SGG improves rare and unseen predicate recognition and better recovers missing relations.
Summary / 总结
The research aims to address the issue of incomplete scene graph annotations in open-vocabulary scene graph generation, where many valid relations are missing and the same interaction can be described at different granularities. The proposed ReLIC-SGG framework treats unannotated relations as latent variables and builds a semantic relation lattice to model similarity, entailment, and contradiction among open-vocabulary predicates. This framework infers missing positive relations from visual-language compatibility, graph context, and semantic consistency, and uses a positive-unlabeled graph learning objective to reduce false-negative supervision. The results show that ReLIC-SGG improves rare and unseen predicate recognition and better recovers missing relations.
研究旨在解决开放词汇场景图生成中注解不完整的问题,其中许多有效的关系缺失,同样的交互可以以不同的粒度描述。提出的ReLIC-SGG框架将未标注的关系视为潜在变量,并构建一个语义关系格来建模开放词汇谓词之间的相似性、蕴含性和矛盾性。该框架通过视觉-语言兼容性、图上下文和语义一致性来推断缺失的正关系,并使用正无标签图学习目标来减少假阴性监督。实验结果表明,ReLIC-SGG提高了罕见和未见过的谓词识别,并更好地恢复了缺失的关系。
GA2-CLIP: Generic Attribute Anchor for Efficient Prompt Tuning in Video-Language Models
Authors: Bin Wang, Ruotong Hu, Wentong Li, Wenqian Wang, Mingliang Gao, Runmin Cong, Wei Zhang, Xudong Jiang
First: 2025-11-27T05:36:47+00:00 · Latest: 2026-04-27T14:33:42+00:00
Comments: Technical Report
Abstract
Visual and textual soft prompt tuning can effectively improve the adaptability of Vision-Language Models (VLMs) in downstream tasks. However, fine-tuning on video tasks impairs the model's generalization ability to unseen classes. Existing methods attempt to mitigate this forgetting effect by regularizing the gap between hand-crafted prompts and soft prompts, but this also weakens the learning ability of soft prompts. To address this challenge, we propose a plug-and-play coupling prompt learning framework to optimize the generalization performance of V-L models in video tasks, with the core motivation of mitigating semantic space narrowing during fine-tuning by introducing an externally supervised prompt. Specifically, for textual prompts, we introduce pre-trained prompts from other datasets as hard prompt tokens. These are concatenated with soft prompt tokens and coupled via a learnable mapping layer. This competitive prompting approach prevents the semantic space from overfitting to supervised categories. In addition, we introduce a set of well-designed irrelevant video sets and negative prompts as generic attribute anchors to maintain the generic relevance of the attributes in the pre-trained semantic space, thus preserving the generalization ability. Experiments on video tasks demonstrate that our method significantly outperforms state-of-the-art prompt tuning approaches across generalization benchmarks, particularly on base-to-new class prediction.
Ramen: Robust Test-Time Adaptation of Vision-Language Models with Active Sample Selection
Authors: Wenxuan Bao, Yanjun Zhao, Xiyuan Yang, Jingrui He
Venue: CVPR 2026
First: 2026-04-23T14:33:27+00:00 · Latest: 2026-04-27T13:48:45+00:00
Comments: Accepted by CVPR 2026 (Findings Track)
Abstract
Pretrained vision-language models such as CLIP exhibit strong zero-shot generalization but remain sensitive to distribution shifts. Test-time adaptation adapts models during inference without access to source data or target labels, offering a practical way to handle such shifts. However, existing methods typically assume that test samples come from a single, consistent domain, while in practice, test data often include samples from mixed domains with distinct characteristics. Consequently, their performance degrades under mixed-domain settings. To address this, we present Ramen, a framework for robust test-time adaptation through active sample selection. For each incoming test sample, Ramen retrieves a customized batch of relevant samples from previously seen data based on two criteria: domain consistency, which ensures that adaptation focuses on data from similar domains, and prediction balance, which mitigates adaptation bias caused by skewed predictions. To improve efficiency, Ramen employs an embedding-gradient cache that stores the embeddings and sample-level gradients of past test images. The stored embeddings are used to retrieve relevant samples, and the corresponding gradients are aggregated for model updates, eliminating the need for any additional forward or backward passes. Our theoretical analysis provides insight into why the proposed adaptation mechanism is effective under mixed-domain shifts. Experiments on multiple image corruption and domain-shift benchmarks demonstrate that Ramen achieves strong and consistent performance, offering robust and efficient adaptation in complex mixed-domain scenarios. Our code is available at https://github.com/baowenxuan/Ramen .
中文标题/摘要
标题:拉面:使用主动样本选择的视觉-语言模型测试时稳健适应
预训练的视觉-语言模型如CLIP表现出强大的零样本泛化能力,但仍然对分布偏移敏感。测试时适应在不访问源数据或目标标签的情况下对模型进行适应,提供了一种处理此类偏移的实用方法。然而,现有方法通常假设测试样本来自单一且一致的领域,而在实践中,测试数据往往包含来自具有不同特征的混合领域的样本。因此,在混合领域设置下其性能会下降。为了解决这一问题,我们提出了拉面框架,用于通过主动样本选择实现稳健的测试时适应。对于每个新的测试样本,拉面会根据两个标准从之前见过的数据中检索一个定制化的样本批次:领域一致性,确保适应集中在相似领域的数据上;预测平衡,减轻由于预测偏差引起的适应偏差。为了提高效率,拉面使用嵌入-梯度缓存来存储过去测试图像的嵌入和样本级梯度。存储的嵌入用于检索相关样本,相应的梯度被聚合以更新模型,从而无需任何额外的前向或反向传递。我们的理论分析提供了为什么提出的适应机制在混合领域偏移下有效的原因。在多个图像损坏和领域偏移基准测试上的实验表明,拉面在复杂混合领域场景中实现了强大且一致的性能,提供了稳健且高效的适应。我们的代码可在https://github.com/baowenxuan/Ramen 获取。
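The embedding-gradient cache described in the Ramen entry can be sketched in a few lines. The sketch assumes domain consistency is approximated by cosine similarity between embeddings, prediction balance by a per-class cap on retrieved samples, and aggregation by a plain mean of cached gradients; the class name, method names, and these three choices are illustrative assumptions, not the authors' implementation.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0

class EmbeddingGradientCache:
    """Stores (embedding, gradient, predicted label) for past test samples."""

    def __init__(self):
        self.entries = []

    def add(self, embedding, gradient, label):
        self.entries.append((embedding, gradient, label))

    def retrieve(self, query, k, per_class_cap):
        """Pick up to k similar entries, with at most per_class_cap per predicted label."""
        ranked = sorted(self.entries, key=lambda e: cosine(query, e[0]), reverse=True)
        picked, counts = [], {}
        for entry in ranked:
            label = entry[2]
            if counts.get(label, 0) < per_class_cap:
                picked.append(entry)
                counts[label] = counts.get(label, 0) + 1
            if len(picked) == k:
                break
        return picked

    def aggregated_gradient(self, query, k, per_class_cap):
        """Average the cached gradients of retrieved entries; no new passes are run."""
        picked = self.retrieve(query, k, per_class_cap)
        if not picked:
            return []
        dim = len(picked[0][1])
        return [sum(entry[1][i] for entry in picked) / len(picked) for i in range(dim)]

cache = EmbeddingGradientCache()
cache.add([1.0, 0.0], [0.2, 0.0], 0)
cache.add([0.9, 0.1], [0.4, 0.0], 0)
cache.add([0.8, 0.2], [0.0, 0.6], 1)
cache.add([0.0, 1.0], [9.0, 9.0], 1)  # sample from a dissimilar domain

picked = cache.retrieve([1.0, 0.0], k=2, per_class_cap=1)
update = cache.aggregated_gradient([1.0, 0.0], k=2, per_class_cap=1)
```

Because gradients are stored at insertion time, the update for a new test sample is a lookup and an average, which is the efficiency argument the abstract makes.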
Zero-to-CAD: Agentic Synthesis of Interpretable CAD Programs at Million-Scale Without Real Data
Authors: Mohammadmehdi Ataei, Farzaneh Askari, Kamal Rahimi Malekshan, Pradeep Kumar Jayaraman
First: 2026-04-27T13:46:41+00:00 · Latest: 2026-04-27T13:46:41+00:00
Abstract
Computer-Aided Design (CAD) models are defined by their construction history: a parametric recipe that encodes design intent. However, existing large-scale 3D datasets predominantly consist of boundary representations (B-Reps) or meshes, stripping away this critical procedural information. To address this scarcity, we introduce Zero-to-CAD, a scalable framework for synthesizing executable CAD construction sequences. We frame synthesis as an agentic search problem: by embedding a large language model (LLM) within a feedback-driven CAD environment, our system iteratively generates, executes, and validates code using tools and documentation lookup to promote geometric validity and operation diversity. This agentic approach enables the synthesis of approximately one million executable, readable, editable CAD sequences, covering a rich vocabulary of operations beyond sketch-and-extrude workflows. We also release a curated subset of 100,000 high-quality models selected for geometric diversity. To demonstrate the dataset's utility, we fine-tune a vision-language model on our synthetic data to reconstruct editable CAD programs from multi-view images, outperforming strong baselines, including GPT-5.2, and effectively bootstrapping sequence generation capabilities without real construction-history training data. Zero-to-CAD bridges the gap between geometric scale and parametric interpretability, offering a vital resource for the next generation of CAD AI.
Summary / 总结
The research aims to address the lack of parametric design information in large 3D datasets by introducing Zero-to-CAD, a scalable framework for synthesizing executable CAD construction sequences. The method embeds a large language model within a feedback-driven CAD environment, enabling iterative generation, execution, and validation of code. Key findings include the synthesis of approximately one million interpretable CAD sequences and the successful fine-tuning of a vision-language model to reconstruct CAD programs from multi-view images, outperforming strong baselines without real construction-history data. This work bridges the gap between geometric scale and parametric interpretability, providing a valuable resource for CAD AI development.
研究旨在通过引入Zero-to-CAD框架解决大型3D数据集中缺乏参数设计信息的问题,该框架能够合成可执行的CAD构造序列。方法包括将大型语言模型嵌入反馈驱动的CAD环境中,实现代码的迭代生成、执行和验证。关键发现包括合成了约一百万个可解释的CAD序列,并成功地对视觉语言模型进行了微调,使其能够从多视角图像中重建CAD程序,超越了强大的基线模型,且无需实际的构造历史训练数据。这项工作填补了几何规模和参数可解释性之间的空白,为CAD AI的发展提供了宝贵的资源。
PhysNote: Self-Knowledge Notes for Evolvable Physical Reasoning in Vision-Language Model
Authors: Sinin Zhang, Yunfei Xie, Yuxuan Cheng, Haoyu Zhang, Tong Zhang
Venue: ICLR 2026
First: 2026-04-27T13:10:52+00:00 · Latest: 2026-04-27T13:10:52+00:00
Comments: 11 pages. Accepted by ICLR 2026 Workshop ES-Reasoning
Abstract
Vision-Language Models (VLMs) have demonstrated strong performance on textbook-style physics problems, yet they frequently fail when confronted with dynamic real-world scenarios that require temporal consistency and causal reasoning across frames. We identify two fundamental challenges underlying these failures: (1) spatio-temporal identity drift, where objects lose their physical identity across successive frames and break causal chains, and (2) volatility of inference-time insights, where a model may occasionally produce correct physical reasoning but never consolidates it for future reuse. To address these challenges, we propose PhysNote, an agentic framework that enables VLMs to externalize and refine physical knowledge through self-generated "Knowledge Notes." PhysNote stabilizes dynamic perception through spatio-temporal canonicalization, organizes self-generated insights into a hierarchical knowledge repository, and drives an iterative reasoning loop that grounds hypotheses in visual evidence before consolidating verified knowledge. Experiments on PhysBench demonstrate that PhysNote achieves 56.68% overall accuracy, a 4.96% improvement over the best multi-agent baseline, with consistent gains across all four physical reasoning domains.
中文标题/摘要
标题:PhysNote:视觉语言模型中可进化物理推理的自我知识笔记
视觉语言模型(VLMs)在解答教科书风格的物理问题上表现出色,但在面对需要时间一致性和因果推理的动态现实场景时却经常失败。我们识别出导致这些失败的两个基本挑战:(1)时空身份漂移,即物体在连续帧中失去其物理身份,破坏因果链;(2)推理时见解的波动性,模型可能偶尔产生正确的物理推理,但从未将其巩固以供未来重用。为解决这些挑战,我们提出了PhysNote,一种代理框架,使VLMs能够通过自我生成的“知识笔记”来外部化和精炼物理知识。PhysNote通过时空规范化稳定动态感知,将自我生成的见解组织成层次化的知识库,并驱动一个迭代推理循环,在视觉证据支持下验证假设,然后巩固验证过的知识。在PhysBench上的实验表明,PhysNote的整体准确率为56.68%,比最佳多代理基线提高了4.96%,在所有四个物理推理领域均表现出一致的改进。
Summary / 总结
The research addresses the limitations of Vision-Language Models (VLMs) in handling dynamic real-world scenarios by proposing PhysNote, an agentic framework that helps VLMs externalize and refine physical knowledge through self-generated 'Knowledge Notes.' The method includes spatio-temporal canonicalization, hierarchical knowledge organization, and iterative reasoning grounded in visual evidence. Experiments show PhysNote improves overall accuracy by 4.96% compared to the best multi-agent baseline on PhysBench, with consistent gains across all physical reasoning domains.
研究针对Vision-Language模型(VLM)在处理动态现实场景时的局限性,提出了PhysNote框架,该框架通过自动生成的‘知识笔记’帮助VLM外部化和精炼物理知识。方法包括时空标准化、知识的分层组织以及基于视觉证据的迭代推理。实验表明,PhysNote在PhysBench上的整体准确率提高了4.96%,并且在所有物理推理领域都取得了持续的改进。
AutoGUI-v2: A Comprehensive Multi-Modal GUI Functionality Understanding Benchmark
Authors: Hongxin Li, Xiping Wang, Jingran Su, Zheng Ju, Yuntao Chen, Qing Li, Zhaoxiang Zhang
First: 2026-04-27T13:06:27+00:00 · Latest: 2026-04-27T13:06:27+00:00
Comments: Technical Report
Abstract
Autonomous agents capable of navigating Graphical User Interfaces (GUIs) hold the potential to revolutionize digital productivity. However, achieving true digital autonomy extends beyond reactive element matching; it necessitates a predictive mental model of interface dynamics and the ability to foresee the "digital world state" resulting from interactions. Despite the perceptual capabilities of modern Vision-Language Models (VLMs), existing benchmarks remain bifurcated (focusing either on black-box task completion or static, shallow grounding), thereby failing to assess whether agents truly comprehend the implicit functionality and transition logic of GUIs. To bridge this gap, we introduce AutoGUI-v2, a comprehensive benchmark designed to evaluate deep GUI functionality understanding and interaction outcome prediction. We construct the benchmark using a novel VLM-human collaborative pipeline that recursively parses multi-platform screenshots into hierarchical functional regions to generate diverse evaluation tasks. Providing 2,753 tasks across six operating systems, AutoGUI-v2 rigorously tests agents on region and element-level semantics, grounding, and dynamic state prediction. Our evaluation reveals a striking dichotomy in VLMs: while open-source models fine-tuned on agent data (e.g., Qwen3-VL) excel at functional grounding, commercial models (e.g., Gemini-2.5-Pro-Thinking) dominate in functionality captioning. Crucially, all models struggle with complex interaction logic of uncommon actions, highlighting that deep functional understanding remains a significant hurdle. By systematically measuring these foundational capabilities, AutoGUI-v2 offers a new lens for advancing the next generation of GUI agents.
中文标题/摘要
标题:AutoGUI-v2:全面的多模态GUI功能理解基准
能够导航图形用户界面(GUI)的自主代理有望彻底改变数字生产力。然而,实现真正的数字自主性不仅限于反应式元素匹配,还需要预测界面动态的先验心智模型以及预见交互后的“数字世界状态”。尽管现代视觉-语言模型(VLM)具有感知能力,但现有基准仍然分裂(要么专注于黑盒任务完成,要么关注静态、浅层的语义关联),未能评估代理是否真正理解GUI的隐含功能和转换逻辑。为弥合这一差距,我们引入了AutoGUI-v2,这是一个全面的基准,旨在评估深度GUI功能理解和交互结果预测。我们使用一种新颖的VLM-人类协作流水线构建基准,该流水线递归地将多平台截图解析为分层的功能区域,以生成多样化的评估任务。AutoGUI-v2提供了跨六个操作系统共计2,753个任务,严格测试代理在区域和元素级语义、语义关联和动态状态预测方面的表现。我们的评估揭示了VLMs之间惊人的二元性:开源模型(如Qwen3-VL)在功能语义关联方面表现出色,而商业模型(如Gemini-2.5-Pro-Thinking)在功能描述方面占优。关键的是,所有模型在不常见动作的复杂交互逻辑方面都表现不佳,突显了深入的功能理解仍然是一个重大障碍。通过系统地测量这些基础能力,AutoGUI-v2为下一代GUI代理的进步提供了一个新的视角。
Summary / 总结
AutoGUI-v2 is a comprehensive benchmark for evaluating autonomous agents' understanding of GUI functionality and interaction outcomes. It uses a VLM-human collaborative pipeline to generate diverse tasks across six operating systems, testing agents on region and element-level semantics, grounding, and dynamic state prediction. The evaluation shows that open-source models are better at functional grounding, while commercial models excel in functionality captioning, but all struggle with complex interaction logic of uncommon actions, indicating a need for deeper functional understanding.
AutoGUI-v2 是一个用于评估 GUI 功能性和交互结果理解能力的基准。它使用 VLM-人类协作管道生成跨六个操作系统多样化的任务。评估结果显示,开源模型在功能定位方面表现更好,而商业模型在功能描述方面占优。然而,所有模型在复杂交互逻辑方面都存在困难,表明 GUI 代理需要更深层次的功能理解。
AD-Relight: Training-Free Banner Relighting via Illumination Translation with Diffusion Priors
Authors: Rameshwar Mishra, A V Subramanyam
First: 2026-04-27T12:30:04+00:00 · Latest: 2026-04-27T12:30:04+00:00
Abstract
The recent surge in content consumption through streaming services has driven a growing demand for personalized content. Personalized advertisements (ads) play a crucial role in enhancing both user engagement and ad effectiveness. A key aspect of ad personalization involves replacing existing regions in a frame with custom, Photoshop-generated banners. However, existing ad-placement pipelines typically rely on simple geometric warping, ignoring the scene's underlying lighting conditions. Similarly, state-of-the-art diffusion-based object insertion and relighting models struggle to accurately relight these newly inserted banners, as they are not trained on ad-banner data, and training such a model for ad banners would require millions of images. This highlights the need for an effective relighting framework that enables seamless integration of custom banners into the original scene. Motivated by this, we present AD-Relight, a novel multi-stage training-free framework that adapts a diffusion-based relighting model at test time to relight newly added Photoshop-generated ad banners. Through extensive evaluation, we demonstrate that AD-Relight outperforms both relighting baselines and existing ad-placement methods based on simple warping. User studies further show that participants consistently prefer the outputs of AD-Relight over those of prior approaches.
中文标题/摘要
标题:AD-Relight:基于照明转换与扩散先验的无需训练的横幅重新照明
流媒体服务内容消费的激增推动了个性化内容需求的增长。个性化广告(广告)在增强用户参与度和广告效果方面发挥着关键作用。广告个性化的一个关键方面是用定制的、Photoshop生成的横幅替换帧中的现有区域。然而,现有的广告投放管道通常依赖于简单的几何变形,忽略了场景的照明条件。同样,基于扩散的物体插入和重新照明模型在准确重新照明新插入的横幅方面也存在问题,因为它们没有针对广告横幅进行训练,而且为广告横幅训练这样的模型需要数百万张图像。这突显了需要一种有效的重新照明框架,以使定制横幅能够无缝集成到原始场景中。受此启发,我们提出了一种新颖的多阶段无需训练框架AD-Relight,在测试时适应基于扩散的重新照明模型,以重新照明新添加的Photoshop生成的广告横幅。通过广泛的评估,我们证明AD-Relight优于现有的重新照明基线和基于简单变形的现有广告投放方法。用户研究进一步表明,参与者一致偏好AD-Relight的输出,而不是先前的方法。
Summary / 总结
AD-Relight is a training-free, multi-stage framework that adapts a diffusion-based relighting model at test time to relight Photoshop-generated ad banners inserted into video frames. It targets two limitations of existing approaches: ad-placement pipelines that rely on simple geometric warping and ignore scene lighting, and diffusion-based relighting models that are not trained on ad-banner data. Experimental results show that AD-Relight outperforms both relighting baselines and warping-based ad-placement methods, and user studies show that participants consistently prefer its outputs over those of prior approaches.
AD-Relight 是一个无需训练的框架,利用基于扩散的光照调整模型在视频帧中替换和调整新插入的 Photoshop 广告横幅的光照。它通过在测试时适应模型来解决现有方法的局限性,无需针对广告横幅数据进行训练即可准确调整光照。实验结果表明,AD-Relight 优于光照调整基线和基于简单几何变形的方法,用户研究进一步表明,参与者一致更偏好 AD-Relight 的输出。
Global Context or Local Detail? Adaptive Visual Grounding for Hallucination Mitigation
Authors: Yubo Jiang, Xin Yang, Abudukelimu Wuerkaixi, Zheming Yuan, Xuxin Cheng, Fengying Xie, Zhiguo Jiang, Cao Liu, Ke Zeng, Haopeng Zhang
Venue: ACL 2025
First: 2026-04-27T12:23:00+00:00 · Latest: 2026-04-27T12:23:00+00:00
Comments: 9 pages, 8 figures, Findings of ACL 2025
Abstract
Vision-Language Models (VLMs) are frequently undermined by object hallucination--generating content that contradicts visual reality--due to an over-reliance on linguistic priors. We introduce Positive-and-Negative Decoding (PND), a training-free inference framework that intervenes directly in the decoding process to enforce visual fidelity. PND is motivated by our key finding of a critical attention deficit in VLMs, where visual features are empirically under-weighted. Our framework corrects this via a dual-path contrast: The positive path amplifies salient visual evidence using multi-layer attention to encourage faithful descriptions, directly counteracting the attention deficit. Simultaneously, the negative path identifies and degrades the core object's features to create a strong counterfactual, which penalizes ungrounded, prior-dominant generation. By contrasting the model's outputs from these two perspectives at each step, PND steers generation towards text that is not just linguistically probable, but visually factual. Extensive experiments on benchmarks like POPE, MME, and CHAIR show that PND achieves state-of-the-art performance with up to 6.5% accuracy improvement, substantially reducing object hallucination while also enhancing descriptive detail--all without requiring any model retraining. The method generalizes effectively across diverse VLM architectures including LLaVA, InstructBLIP, InternVL, and Qwen-VL.
中文标题/摘要
标题:全局背景还是局部细节?自适应视觉定位以减轻幻觉
视觉-语言模型(VLMs)经常因过度依赖语言先验而受到物体幻觉的困扰——生成与视觉现实相矛盾的内容。我们提出了正向和负向解码(PND),这是一种无需训练的推理框架,直接干预解码过程以确保视觉真实性。PND 的动机是我们关于 VLMs 中关键注意力缺陷的关键发现,其中视觉特征在经验上被低估。我们的框架通过双路径对比来纠正这一问题:正路径使用多层注意力放大显著的视觉证据,鼓励忠实描述,直接抵消了注意力缺陷。同时,负路径识别并降级核心物体的特征,创建一个强有力的反事实,惩罚基于先验的非视觉描述。通过在每一步从这两个视角对比模型的输出,PND 引导生成不仅在语言上可能,而且在视觉上真实的文本。在 POPE、MME 和 CHAIR 等基准测试上的广泛实验表明,PND 达到了最先进的性能,准确率提高了高达 6.5%,显著减少了物体幻觉,同时增强了描述细节——所有这些都不需要任何模型重新训练。该方法在包括 LLaVA、InstructBLIP、InternVL 和 Qwen-VL 等多种 VLM 架构中表现出良好的泛化能力。
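The dual-path contrast described in the PND abstract can be illustrated with a generic contrastive-decoding step. This is a minimal sketch, not the authors' implementation: the combination rule `(1 + alpha) * positive - alpha * negative` is a common contrastive-decoding form that we assume here, and `contrastive_next_token`, `alpha`, and the toy logits are all hypothetical.

```python
import math

def contrastive_next_token(pos_logits, neg_logits, alpha=1.0):
    """Pick the next token by contrasting two logit vectors.

    pos_logits: logits from the path with amplified visual evidence.
    neg_logits: logits from the counterfactual path (object degraded).
    The combination rule below is an assumed, generic form.
    """
    adjusted = [(1.0 + alpha) * p - alpha * n
                for p, n in zip(pos_logits, neg_logits)]
    # Numerically stable softmax over the adjusted logits.
    peak = max(adjusted)
    weights = [math.exp(a - peak) for a in adjusted]
    total = sum(weights)
    probs = [w / total for w in weights]
    token = max(range(len(probs)), key=probs.__getitem__)
    return token, probs

# Toy 3-token vocabulary: token 0 is favoured by language priors alone
# (it stays likely even when the object is degraded), token 1 depends
# on the visual evidence.
pos = [2.0, 1.9, 0.1]
neg = [2.5, 0.5, 0.1]
token, probs = contrastive_next_token(pos, neg)
```

In this toy case the positive path alone would pick token 0, while the contrast picks token 1, because token 0 remains likely after the object is degraded and is therefore penalized as prior-driven.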
Aligning with Your Own Voice: Self-Corrected Preference Learning for Hallucination Mitigation in LVLMs
Authors: Byeonggeuk Lim, JungMin Yun, Junehyoung Kwon, Kyeonghyun Kim, YoungBin Kim
Venue: ACL 2026
First: 2026-04-27T12:22:35+00:00 · Latest: 2026-04-27T12:22:35+00:00
Comments: Accepted to ACL 2026
Abstract
Large Vision-Language Models (LVLMs) frequently suffer from hallucinations. Existing preference learning-based approaches largely rely on proprietary models to construct preference datasets. We identify that this reliance introduces a distributional mismatch between the proprietary and target models that hinders efficient alignment. To address this, we propose Alignment via VErified Self-correction DPO (AVES-DPO), a framework that aligns LVLMs using in-distribution data derived from the model's intrinsic knowledge. Our approach employs a consensus-based verification mechanism to diagnose diverse hallucinations and guides the model to self-correct, thereby generating preference pairs strictly compatible with its internal distribution. Extensive experiments demonstrate that AVES-DPO surpasses existing baselines in hallucination mitigation while requiring only 5.2k samples.
Summary / 总结
The paper addresses the issue of hallucinations in Large Vision-Language Models (LVLMs) by proposing AVES-DPO, a framework that aligns LVLMs using in-distribution data derived from the model's intrinsic knowledge. This method uses a consensus-based verification mechanism to diagnose and correct diverse hallucinations, generating preference pairs compatible with the model's internal distribution. Experiments show that AVES-DPO outperforms existing methods in hallucination mitigation while needing only 5.2k samples.
论文提出AVES-DPO框架,利用内在数据对大型视觉-语言模型(LVLM)进行对齐,该方法使用共识机制诊断和纠正各种幻觉,生成与模型内部分布兼容的偏好对。实验表明,AVES-DPO在仅需5.2k样本的情况下,比现有方法更有效地减轻幻觉问题。
SycoPhantasy: Quantifying Sycophancy and Hallucination in Small Open Weight VLMs for Vision-Language Scoring of Fantasy Characters
Authors: Arya Shah, Deepali Mishra, Chaklam Silpasuwanchai
First: 2026-04-27T11:39:18+00:00 · Latest: 2026-04-27T11:39:18+00:00
Comments: 13 pages, 12 figures, 6 tables
Abstract
Vision-language models (VLMs) are increasingly deployed as evaluators in tasks requiring nuanced image understanding, yet their reliability in scoring alignment between images and text descriptions remains underexplored. We investigate whether small, open-weight VLMs exhibit sycophantic behavior when evaluating image-text alignment: assigning high scores without grounding their judgments in visual evidence. To quantify this phenomenon, we introduce the Bluffing Coefficient, a metric that measures the mismatch between a model's score and its evidence recall. We evaluate six open-weight VLMs ranging from 450M to 8B parameters on a benchmark of 173,810 AI-generated character portraits paired with detailed textual descriptions. Our analysis reveals a significant inverse correlation between model size and sycophancy rate (r = -0.96, p = 0.002), with smaller models exhibiting substantially higher rates of unjustified high scores. The smallest model tested (LFM2-VL, 450M) produced sycophantic evaluations in 22.3% of cases, compared to 6.0% for the largest (LLaVA-1.6, 7B). These findings have direct implications for the deployment of small, open-weight VLMs as automated evaluators within attribute-rich, synthetic image evaluation tasks, where the gap between assigned scores and cited visual evidence is both measurable and consequential.
See Further, Think Deeper: Advancing VLM's Reasoning Ability with Low-level Visual Cues and Reflection
Authors: Zhiheng Wu, Tong Wang, Shuning Wang, Naiming Liu, Yumeng Zhang
First: 2026-04-27T11:31:15+00:00 · Latest: 2026-04-27T11:31:15+00:00
Comments: CVPR2026
Abstract
Recent advances in Vision-Language Models (VLMs) have benefited from Reinforcement Learning (RL) for enhanced reasoning. However, existing methods still face critical limitations, including the lack of low-level visual information and effective visual feedback. To address these problems, this paper proposes a unified multimodal interleaved reasoning framework, ForeSight, which enables VLMs to See Further with low-level visual cues and Think Deeper with effective visual feedback. First, it introduces a set of low-level visual tools to integrate essential visual information into the reasoning chain, mitigating the neglect of fine-grained visual features. Second, a mask-based visual feedback mechanism incorporates visual reflection into the thinking process, enabling the model to dynamically re-examine and update its answers. Driven by RL, ForeSight learns to autonomously decide on tool invocation and answer verification, with the final answer accuracy as the reward signal. To evaluate the performance of the proposed framework, we construct a new dataset, Character and Grounding SalBench (CG-SalBench), based on the SalBench dataset. Experimental results demonstrate that the ForeSight-7B model significantly outperforms other models with the same parameter scale, and even surpasses the current SOTA closed-source models on certain metrics.
中文标题/摘要
标题:见得更远,思得更深:利用低级视觉线索和反思提升VLM的推理能力
近期视觉-语言模型(VLMs)的进步得益于强化学习(RL)的增强推理能力。然而,现有方法仍然面临关键限制,包括缺乏低级视觉信息和有效的视觉反馈。为了解决这些问题,本文提出了一种统一的多模态交错推理框架ForeSight,使VLMs能够利用低级视觉线索“见得更远”,并利用有效的视觉反馈“思得更深”。首先,它引入了一组低级视觉工具,将关键的视觉信息整合到推理链中,缓解了对细粒度视觉特征的忽视。其次,详细阐述了一种基于掩码的视觉反馈机制,将视觉反思融入思考过程,使模型能够动态重新审视和更新其答案。由RL驱动,ForeSight学习自主决定工具调用和答案验证,最终答案准确性作为奖励信号。为了评估所提框架的性能,我们基于SalBench数据集构建了一个新的数据集,即字符和语义标注基准(CG-SalBench)。实验结果表明,ForeSight-7B模型在参数规模相同的情况下显著优于其他模型,并且在某些指标上甚至超越了当前最先进的闭源模型。
ShapeUP: Scalable Image-Conditioned 3D Editing
Authors: Inbar Gat, Dana Cohen-Bar, Guy Levy, Elad Richardson, Daniel Cohen-Or
Venue: SIGGRAPH 2026
First: 2026-02-05T13:59:16+00:00 · Latest: 2026-04-27T11:20:10+00:00
Comments: SIGGRAPH 2026. Project page: https://inbar-2344.github.io/ShapeUp-page/
Abstract
Recent advancements in 3D foundation models have enabled the generation of high-fidelity assets, yet precise 3D manipulation remains a significant challenge. Existing 3D editing frameworks often face a difficult trade-off between visual controllability, geometric consistency, and scalability. Specifically, optimization-based methods are prohibitively slow, multi-view 2D propagation techniques suffer from visual drift, and training-free latent manipulation methods are inherently bound by frozen priors and cannot directly benefit from scaling. In this work, we present ShapeUP, a scalable, image-conditioned 3D editing framework that formulates editing as a supervised latent-to-latent translation within a native 3D representation. This formulation allows ShapeUP to build on a pretrained 3D foundation model, leveraging its strong generative prior while adapting it to editing through supervised training. In practice, ShapeUP is trained on triplets consisting of a source 3D shape, an edited 2D image, and the corresponding edited 3D shape, and learns a direct mapping using a 3D Diffusion Transformer (DiT). This image-as-prompt approach enables fine-grained visual control over both local and global edits and achieves implicit, mask-free localization, while maintaining strict structural consistency with the original asset. Our extensive evaluations demonstrate that ShapeUP consistently outperforms current trained and training-free baselines in both identity preservation and edit fidelity, offering a robust and scalable paradigm for native 3D content creation.
中文标题/摘要
标题:ShapeUP:可扩展的基于图像的3D编辑框架
近期3D基础模型的进步使得生成高保真资产成为可能,但精确的3D操作仍然是一个重大挑战。现有的3D编辑框架往往在视觉可控性、几何一致性与可扩展性之间面临难以调和的权衡。具体来说,基于优化的方法速度过慢,多视角2D传播技术存在视觉漂移问题,而无需训练的潜在空间操作方法则受限于固定的先验知识,无法直接从可扩展性中获益。在本文中,我们提出了ShapeUP,一种可扩展的、基于图像的3D编辑框架,将编辑形式化为在原生3D表示中的监督潜在域到潜在域的转换。这种形式化使得ShapeUP能够基于预训练的3D基础模型,利用其强大的生成先验知识,并通过监督训练进行适应。实践中,ShapeUP在包含源3D形状、编辑后的2D图像及其对应的编辑3D形状的三元组上进行训练,并使用3D扩散变换器(DiT)学习直接映射。这种基于图像的提示方法使得对局部和全局编辑具有精细的视觉控制,并实现了隐式的、无掩码的定位,同时保持与原始资产的严格结构一致性。我们的广泛评估表明,ShapeUP在身份保留和编辑保真度方面始终优于当前的训练有素和无需训练的基线,提供了一种稳健且可扩展的原生3D内容创建范式。
Summary / 总结
ShapeUP is a scalable 3D editing framework that formulates editing as a supervised latent-to-latent translation within a native 3D representation, using a 3D Diffusion Transformer (DiT) trained on triplets of source 3D shapes, edited 2D images, and corresponding edited 3D shapes. It achieves fine-grained visual control and strict structural consistency, outperforming current baselines in identity preservation and edit fidelity.
ShapeUP 是一种可扩展的 3D 编辑框架,将编辑视为在原生 3D 表示中的监督潜在到潜在的转换,使用 3D 扩散变换器(DiT)在包含源 3D 形状、编辑后的 2D 图像及其相应编辑后的 3D 形状的三元组上进行训练。它实现了精细的视觉控制和严格的结构一致性,优于当前的基线方法在身份保留和编辑保真度方面。
Don't Pause! Every prediction matters in a streaming video
Authors: Dibyadip Chatterjee, Zhanzhong Pang, Fadime Sener, Yale Song, Angela Yao
First: 2026-04-27T11:07:03+00:00 · Latest: 2026-04-27T11:07:03+00:00
Comments: 29 pages, 14 figures; https://dibschat.github.io/SPOT-Bench
Abstract
Streaming video models should respond the moment an event unfolds, not after the moment has passed. Yet existing online VideoQA benchmarks remain largely retrospective. They pause the video at fixed timestamps, pose questions about current or past events, and score models only at those moments. This protocol leaves streaming predictions untested. To close this gap, we introduce SPOT-Bench, featuring multi-turn proactive queries that evaluate general streaming perception and assistive capabilities required by an always-on, real-time assistant. SPOT-Bench comes with Timeliness-F1, a consolidated metric that measures streaming predictions by their temporal precision and balanced coverage across the entire video. Our benchmark reveals: (i) offline models detect events reliably but spam predictions unprompted; (ii) post-training for silence reduces spamming but induces unresponsiveness; (iii) half of the streaming video expects no response, which we term dead-time - compute spent here does not affect response latency. These findings motivate AsynKV, a training-free streaming adaptation of offline models, that retains their event perception while improving their streaming behavior. AsynKV features a long-short term memory, utilized efficiently by scaling compute during dead-time. It serves as a strong baseline on SPOT-Bench, outperforming existing streaming models, and achieves state-of-the-art on retrospective benchmarks.
Summary / 总结
The research aims to evaluate streaming video models in real-time by introducing SPOT-Bench, which uses multi-turn proactive queries. Key findings include offline models' tendency to spam predictions, post-training silence reduction's trade-off between spamming and responsiveness, and the concept of dead-time where compute does not affect response latency. These insights led to the development of AsynKV, a streaming adaptation that retains event perception while improving streaming behavior, outperforming existing models on SPOT-Bench and retrospective benchmarks.
研究旨在实时评估流式视频模型,解决现有回顾性基准的局限性。SPOT-Bench引入了多轮主动查询来评估流式感知和辅助能力。关键发现包括离线模型的预测泛滥、通过后训练沉默减少泛滥、以及计算在无响应时间(死时间)中不影响响应延迟的概念。AsynKV作为一种无需训练的流式适应方法,通过保留事件感知并在死时间期间高效扩展计算来改善流式行为,在SPOT-Bench和回顾性基准上均表现出色。
ReVSI: Rebuilding Visual Spatial Intelligence Evaluation for Accurate Assessment of VLM 3D Reasoning
Authors: Yiming Zhang, Jiacheng Chen, Jiaqi Tan, Yongsen Mao, Wenhu Chen, Angel X. Chang
First: 2026-04-27T10:45:51+00:00 · Latest: 2026-04-27T10:45:51+00:00
Comments: Project Page: https://3dlg-hcvc.github.io/revsi/
Abstract
Current evaluations of spatial intelligence can be systematically invalid under modern vision-language model (VLM) settings. First, many benchmarks derive question-answer (QA) pairs from point-cloud-based 3D annotations originally curated for traditional 3D perception. When such annotations are treated as ground truth for video-based evaluation, reconstruction and annotation artifacts can miss objects that are clearly visible in the video, mislabel object identities, or corrupt geometry-dependent answers (e.g., size), yielding incorrect or ambiguous QA pairs. Second, evaluations often assume full-scene access, while many VLMs operate on sparsely sampled frames (e.g., 16-64), making many questions effectively unanswerable under the actual model inputs. We improve evaluation validity by introducing ReVSI, a benchmark and protocol that ensures each QA pair is answerable and correct under the model's actual inputs. To this end, we re-annotate objects and geometry across 381 scenes from 5 datasets to improve data quality, and regenerate all QA pairs with rigorous bias mitigation and human verification using professional 3D annotation tools. We further enhance evaluation controllability by providing variants across multiple frame budgets (16/32/64/all) and fine-grained object visibility metadata, enabling controlled diagnostic analyses. Evaluations of general and domain-specific VLMs on ReVSI reveal systematic failure modes that are obscured by prior benchmarks, yielding a more reliable and diagnostic assessment of spatial intelligence.
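The answerability problem raised in the ReVSI abstract can be made concrete with a small frame-sampling check. This is a sketch under stated assumptions: uniform mid-interval sampling and the helper names `sample_frames` and `is_visible` are illustrative choices, not part of the ReVSI protocol, and the video length and visibility interval are made up.

```python
def sample_frames(num_frames, budget):
    """Return `budget` evenly spaced frame indices (all frames if budget is larger)."""
    if budget >= num_frames:
        return list(range(num_frames))
    step = num_frames / budget
    # Take the middle frame of each of the `budget` equal-length intervals.
    return [int(step * i + step / 2) for i in range(budget)]

def is_visible(object_frames, sampled):
    """True if the object appears in at least one sampled frame."""
    return bool(set(object_frames) & set(sampled))

# Hypothetical object that is only on screen during frames 25..55
# of a 640-frame video.
object_frames = range(25, 56)
seen_at_16 = is_visible(object_frames, sample_frames(640, 16))
seen_at_64 = is_visible(object_frames, sample_frames(640, 64))
```

With a 16-frame budget the object never appears in the model's input, so a question about it is unanswerable; with 64 frames it is seen. Per-object visibility metadata is what makes this kind of check possible.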
GeoEdit: Local Frames for Fast, Training-Free On-Manifold Editing in Diffusion Models
Authors: Yiming Zhang, Sitong Liu, Ke Li, Zhihong Wu, Alex Cloninger, Melvin Leok
First: 2026-04-27T09:47:02+00:00 · Latest: 2026-04-27T09:47:02+00:00
Abstract
Diffusion models are a leading paradigm for data generation, but training-free editing typically re-runs the full denoising trajectory for every edit strength, making iterative refinement expensive. To address this issue, we instead edit near the data manifold, where small local updates can replace repeated re-synthesis. To enable this, we estimate a local manifold tangent space directly from perturbed samples and prove that this sample-based estimator closely approximates the true tangent. Building on this guarantee, we devise a Jacobian-free algorithm that constructs a tangent frame via small perturbations to the initial noise and alternates small tangent moves with diffusion-based projections. Updates within this frame follow principled on-manifold directions while suppressing off-manifold drift, enabling fine-grained edits without full re-diffusion or additional training. Edit strength is controlled by the number of steps for rapid, continuous adjustments that preserve fidelity and plug into existing samplers. Empirically, the resulting tangent directions yield smooth, semantic unsupervised traversals and effective CLIP-guided optimization, demonstrating practical interactive continuous editing.
中文标题/摘要
标题:GeoEdit:在流形上的快速、无需训练的编辑方法
扩散模型是数据生成的领先范式,但无需训练的编辑通常需要为每次编辑强度重新运行完整的去噪轨迹,使得迭代细化昂贵。为了解决这一问题,我们选择在数据流形附近进行编辑,这样小的局部更新可以替代重复的重新合成。为了实现这一点,我们直接从扰动样本中估计局部流形切空间,并证明这种基于样本的估计器接近真实的切空间。在此基础上,我们设计了一种无需雅可比的算法,通过初始噪声的小扰动构建切空间框架,并交替进行小切空间移动和基于扩散的投影。框架内的更新遵循在流形上的原则方向,同时抑制流形外的漂移,从而在无需重新扩散或额外训练的情况下实现精细的编辑。编辑强度通过步骤数量控制,实现快速、连续的调整,同时保持保真度并插入现有的采样器中。实验结果表明,生成的切空间方向提供了平滑的、语义的无监督遍历和有效的CLIP引导优化,展示了实际的交互式连续编辑。
Summary / 总结
The research aims to address the computational inefficiency of training-free editing in diffusion models by proposing GeoEdit, a method that edits near the data manifold. GeoEdit estimates the local manifold tangent space from perturbed samples and uses a Jacobian-free algorithm to construct a tangent frame for small, on-manifold updates. This approach enables fine-grained edits without full re-diffusion or additional training, allowing for rapid, continuous adjustments that preserve fidelity and are compatible with existing samplers. Empirical results show that GeoEdit provides smooth, semantic traversals and effective CLIP-guided optimization.
研究旨在通过提出GeoEdit方法解决无训练编辑在扩散模型中的计算效率问题,该方法在数据流形附近进行编辑。GeoEdit 从扰动样本中估计局部流形切空间,并使用无雅可比算法构建切空间框架,以进行小规模的流形内更新。这种方法允许在无需完整重新扩散或额外训练的情况下进行精细编辑,并且可以快速连续调整以保持保真度并兼容现有采样器。实验结果表明,GeoEdit 提供了平滑的语义遍历和有效的CLIP引导优化。
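The idea of estimating a local tangent space from perturbed samples can be shown on a toy manifold. This is a minimal sketch, not GeoEdit itself: we assume a PCA-style estimator (the abstract does not name the estimator), we perturb points directly on a unit circle rather than perturbing diffusion noise, and `estimate_tangent_2d` is a hypothetical helper.

```python
import math

def estimate_tangent_2d(samples):
    """Principal direction of 2D samples via the closed-form 2x2 PCA angle."""
    n = len(samples)
    mx = sum(x for x, _ in samples) / n
    my = sum(y for _, y in samples) / n
    sxx = sum((x - mx) ** 2 for x, _ in samples) / n
    syy = sum((y - my) ** 2 for _, y in samples) / n
    sxy = sum((x - mx) * (y - my) for x, y in samples) / n
    # Angle of the leading eigenvector of [[sxx, sxy], [sxy, syy]].
    angle = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    return math.cos(angle), math.sin(angle)

# Toy manifold: the unit circle. Perturb a base point along the manifold.
theta0 = math.pi / 3
offsets = [i * 0.01 for i in range(-5, 6)]
samples = [(math.cos(theta0 + d), math.sin(theta0 + d)) for d in offsets]

tangent = estimate_tangent_2d(samples)
true_tangent = (-math.sin(theta0), math.cos(theta0))
# The sign of a direction is arbitrary, so compare via the absolute dot product.
alignment = abs(tangent[0] * true_tangent[0] + tangent[1] * true_tangent[1])
```

Here `alignment` is close to 1, meaning the sample-based direction matches the analytic tangent of the circle at the base point. In higher dimensions the same estimate would typically come from an SVD of the centered sample matrix.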
MEMCoder: Multi-dimensional Evolving Memory for Private-Library-Oriented Code Generation
Authors: Mofei Li, Taozhi Chen, Guowei Yang, Jia Li
First: 2026-04-27T09:27:59+00:00 · Latest: 2026-04-27T09:27:59+00:00
Abstract
Large Language Models (LLMs) excel at general code generation, but their performance drops sharply in enterprise settings that rely on internal private libraries absent from public pre-training corpora. While Retrieval-Augmented Generation (RAG) offers a training-free alternative by providing static API documentation, we find that such documentation typically provides only isolated definitions, leaving a fundamental knowledge gap. Specifically, LLMs struggle with a task-level lack of coordination patterns between APIs and an API-level misunderstanding of parameter constraints and boundary conditions. To address this, we propose MEMCoder, a novel framework that enables LLMs to autonomously accumulate and evolve Usage Guidelines across these two dimensions. MEMCoder introduces a Multi-dimensional Evolving Memory that captures distilled lessons from the model's own problem-solving trajectories. During inference, MEMCoder employs a dual-source retrieval mechanism to inject both static documentation and relevant historical guidelines into the context. The framework operates in an automated closed loop by using objective execution feedback to reflect on successes and failures, resolve knowledge conflicts, and dynamically update memory. Extensive evaluations on the NdonnxEval and NumbaEval benchmarks demonstrate that MEMCoder substantially enhances existing RAG systems, yielding an average absolute pass@1 gain of 16.31%. Furthermore, MEMCoder exhibits vastly superior domain-specific adaptation compared to existing memory-based continual learning methods.
中文标题/摘要
标题:MEMCoder:面向私有库导向的代码生成多维度演化记忆
大型语言模型(LLMs)在通用代码生成方面表现出色,但在依赖内部私有库的企业环境中,其性能会急剧下降,而这些私有库并未包含在公共预训练语料库中。虽然检索增强生成(RAG)提供了一种无需训练的替代方案,通过提供静态API文档,但我们发现这些文档通常只提供了孤立的定义,留下了根本的知识空白。具体来说,LLMs 在任务层面缺乏API之间的协调模式,且在API层面误解参数约束和边界条件。为了解决这个问题,我们提出了一种名为MEMCoder的新框架,该框架使LLMs能够在这两个维度上自主积累和演化使用指南。MEMCoder引入了一种多维度演化记忆,捕捉了模型自身问题解决轨迹中提炼的经验教训。在推理过程中,MEMCoder采用双重来源检索机制,将静态文档和相关历史指南注入上下文。该框架通过使用客观执行反馈自动闭环运行,反思成功和失败,解决知识冲突,并动态更新记忆。在NdonnxEval和NumbaEval基准测试上的广泛评估表明,MEMCoder显著增强了现有的RAG系统,平均绝对pass@1提高了16.31%。此外,MEMCoder在领域特定适应方面明显优于现有的基于记忆的持续学习方法。
Summary / 总结
The research aims to improve the performance of Large Language Models (LLMs) in enterprise settings that rely on private libraries. MEMCoder is proposed as a novel framework that addresses the limitations of existing Retrieval-Augmented Generation (RAG) methods by enabling LLMs to autonomously accumulate and evolve Usage Guidelines. The framework uses a Multi-dimensional Evolving Memory to capture lessons from problem-solving trajectories and employs a dual-source retrieval mechanism to enhance context. Evaluations show that MEMCoder significantly improves existing RAG systems, achieving an average absolute pass@1 gain of 16.31%. Additionally, MEMCoder demonstrates superior domain-specific adaptation compared to other memory-based continual learning methods.
研究旨在提高大型语言模型(LLMs)在依赖私有库的企业环境中的性能。提出了MEMCoder这一新颖框架,通过使LLMs自主积累和进化使用指南来解决现有检索增强生成(RAG)方法的局限性。该框架使用多维演变记忆来捕捉模型解决问题过程中的教训,并采用双源检索机制增强上下文。评估结果显示,MEMCoder显著提高了现有RAG系统的性能,平均绝对pass@1增益达到16.31%。此外,MEMCoder在领域特定适应性方面优于其他基于记忆的持续学习方法。
CAGE-SGG: Counterfactual Active Graph Evidence for Open-Vocabulary Scene Graph Generation
Authors: Suiyang Guang, Chenyu Liu, Ruohan Zhang, Siyuan Chen
First: 2026-04-24T06:34:45+00:00 · Latest: 2026-04-27T08:59:57+00:00
Abstract
Open-vocabulary scene graph generation (SGG) aims to describe visual scenes with flexible and fine-grained relation phrases beyond a fixed predicate vocabulary. While recent vision-language models greatly expand the semantic coverage of SGG, they also introduce a critical reliability issue: predicted relations may be driven by language priors or object co-occurrence rather than grounded visual evidence. In this paper, we propose an evidence-grounded open-vocabulary SGG framework based on counterfactual relation verification. Instead of directly accepting plausible relation proposals, our method verifies whether each candidate relation is supported by relation-specific visual, geometric, and contextual evidence. Specifically, we first generate open-vocabulary relation candidates with a vision-language proposer, then decompose predicate phrases into soft evidence bases such as support, contact, containment, depth, motion, and state. A relation-conditioned evidence encoder extracts predicate-relevant cues, while a counterfactual verifier tests whether the relation score decreases when necessary evidence is removed and remains stable under irrelevant perturbations. We further introduce contradiction-aware predicate learning and graph-level preference optimization to improve fine-grained discrimination and global graph consistency. Experiments on conventional, open-vocabulary, and panoptic SGG benchmarks show that our method consistently improves standard recall-based metrics, unseen predicate generalization, and counterfactual grounding quality. These results demonstrate that moving from relation generation to relation verification leads to more reliable, interpretable, and evidence-grounded scene graphs.
Summary / 总结
This study addresses open-vocabulary scene graph generation and introduces CAGE-SGG (Counterfactual Active Graph Evidence), a framework in which candidate relations are verified against relation-specific visual, geometric, and contextual cues rather than accepted on the strength of language priors or object co-occurrence. The method combines predicate phrase decomposition into soft evidence bases, counterfactual verification, contradiction-aware predicate learning, and graph-level preference optimization to improve reliability and global graph consistency. Experiments on conventional, open-vocabulary, and panoptic SGG benchmarks show that the method improves recall-based metrics, unseen predicate generalization, and counterfactual grounding quality.
研究旨在通过验证每个关系的视觉证据来提高开放词汇场景图生成的可靠性。方法包括使用视觉语言模型生成关系候选,并将谓词短语分解为软证据基础。关系条件下的证据编码器提取相关线索,而反事实验证器测试在移除证据时关系得分的稳定性。实验表明,在召回度量、未见谓词泛化和反事实定位质量方面有所改进,表明生成的场景图更可靠且基于证据。
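The counterfactual test described in the abstract (the score should drop when necessary evidence is removed and stay stable under irrelevant perturbations) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: evidence is modelled as a dictionary of named strengths, "removal" means zeroing a value, and `counterfactual_check`, `drop_margin`, and `stability_tol` are hypothetical names and thresholds.

```python
from typing import Callable, Dict, Set

# Hypothetical representation: evidence basis name -> strength in [0, 1].
Evidence = Dict[str, float]


def counterfactual_check(
    score_fn: Callable[[Evidence], float],
    evidence: Evidence,
    necessary: Set[str],
    drop_margin: float = 0.2,     # assumed threshold, not from the paper
    stability_tol: float = 0.05,  # assumed threshold, not from the paper
) -> bool:
    """Return True if the relation score behaves as grounded evidence should."""
    base = score_fn(evidence)

    # Counterfactual 1: zero out the necessary evidence; the score should fall.
    ablated = {k: (0.0 if k in necessary else v) for k, v in evidence.items()}
    drops = (base - score_fn(ablated)) >= drop_margin

    # Counterfactual 2: zero out everything else; the score should barely move.
    perturbed = {k: (v if k in necessary else 0.0) for k, v in evidence.items()}
    stable = abs(base - score_fn(perturbed)) <= stability_tol

    return drops and stable


def grounded_on_score(ev: Evidence) -> float:
    """Toy scorer for the relation 'on': depends on support and contact only."""
    return 0.5 * ev.get("support", 0.0) + 0.5 * ev.get("contact", 0.0)


def prior_driven_score(ev: Evidence) -> float:
    """Toy scorer that ignores evidence, mimicking a language-prior guess."""
    return 0.9


example = {"support": 0.9, "contact": 0.8, "motion": 0.3}
needed = {"support", "contact"}
grounded_ok = counterfactual_check(grounded_on_score, example, needed)
prior_ok = counterfactual_check(prior_driven_score, example, needed)
```

In this toy setup the evidence-dependent scorer passes the check, while the constant scorer fails because its score does not move when support and contact are removed.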
MemeScouts@LT-EDI 2026: Asking the Right Questions -- Prompted Weak Supervision for Meme Hate Speech Detection
Authors: Ivo Bueno, Lea Hirlimann, Enkelejda Kasneci
First: 2026-04-27T08:36:23+00:00 · Latest: 2026-04-27T08:36:23+00:00
Comments: Accepted at Sixth Workshop on Language Technology for Equality, Diversity and Inclusion at ACL2026 (LT-EDI@ACL26)
Abstract
Detecting hate speech in memes is challenging due to their multimodal nature and subtle, culturally grounded cues such as sarcasm and context. While recent vision-language models (VLMs) enable joint reasoning over text and images, end-to-end prompting can be brittle, as a single prediction must resolve target, stance, implicitness, and irony. These challenges are amplified in multilingual settings. We propose a prompted weak supervision (PWS) approach that decomposes meme understanding into targeted, question-based labeling functions with constrained answer options for homophobia and transphobia detection in the LT-EDI 2026 shared task. Using a quantized Qwen3-VLM to extract features by answering targeted questions, our method outperforms direct VLM classification, with substantial gains for Chinese and Hindi, ranking 1st in English, 2nd in Chinese, and 3rd in Hindi. Iterative refinement via error-driven LF expansion and feature pruning reduces redundancy and improves generalization. Our results highlight the effectiveness of prompted weak supervision for multilingual multimodal hate speech detection.
中文标题/摘要
标题:MemeScouts@LT-EDI 2026: 提出正确的问题——针对表情包仇恨言论检测的提示式弱监督
由于表情包的多模态性质和隐含、文化基础的暗示,如讽刺和语境,检测其中的仇恨言论具有挑战性。尽管最近的视觉-语言模型(VLMs)能够联合处理文本和图像,但端到端的提示可能会很脆弱,因为单个预测必须解决目标、立场、隐含性和讽刺。这些挑战在多语言环境中被放大。我们提出了一种提示式弱监督(PWS)方法,将表情包理解分解为针对目标、基于问题的标签函数,并为LT-EDI 2026共享任务中的同性恋和跨性别仇恨言论检测提供有限的答案选项。使用量化后的Qwen3-VLM通过回答目标问题提取特征,我们的方法优于直接的VLM分类,特别是在中文和印地语方面取得了显著的提升,分别在英语中排名首位、中文中排名第二、印地语中排名第三。通过错误驱动的LF扩展和特征剪枝进行迭代优化,减少了冗余并提高了泛化能力。我们的结果突显了提示式弱监督在多语言多模态仇恨言论检测中的有效性。
Summary / 总结
The paper addresses the challenge of detecting hate speech in memes, which are multimodal and contain subtle cultural cues. It proposes a prompted weak supervision (PWS) method that decomposes meme understanding into targeted, question-based labeling functions with constrained answer options for homophobia and transphobia detection. The method uses a quantized Qwen3-VLM to answer the targeted questions and outperforms direct VLM classification, with substantial gains for Chinese and Hindi, ranking first in English, second in Chinese, and third in Hindi. Iterative refinement through error-driven labeling function expansion and feature pruning reduces redundancy and improves generalization.
论文针对 meme 中的仇恨言论检测难题,这些 meme 具有多模态特征且包含微妙的文化暗示。提出了一种提示弱监督 (PWS) 方法,将 meme 理解分解为针对同性恋和跨性别歧视检测的目标化标签函数,并使用量化 Qwen3-VLM 回答目标化问题。该方法在直接 VLM 分类上表现出色,特别是在中文和印地语中表现突出,分别在英语中排名第一、中文中排名第二、印地语中排名第三。通过错误驱动的标签函数扩展和特征剪枝进行迭代优化,提高了方法的性能。
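The idea of question-based labeling functions with constrained answers can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the questions, the answer-to-label mappings, and the `ask` callable standing in for a VLM are all hypothetical, and a simple majority vote is swapped in for whatever aggregation or downstream classifier the paper actually uses.

```python
from collections import Counter
from typing import Callable, Dict, List

ABSTAIN, NOT_HATE, HATE = -1, 0, 1

# Hypothetical stand-in for a VLM call: takes a question, returns a short answer.
Ask = Callable[[str], str]
LabelingFunction = Callable[[Ask], int]


def make_lf(question: str, answer_to_label: Dict[str, int]) -> LabelingFunction:
    """Build a labeling function from one question and its constrained answers."""
    def lf(ask: Ask) -> int:
        answer = ask(question).strip().lower()
        # Any answer outside the allowed options counts as an abstention.
        return answer_to_label.get(answer, ABSTAIN)
    return lf


def majority_vote(votes: List[int]) -> int:
    """Aggregate labeling-function votes; abstain on ties or when all abstain."""
    counted = Counter(v for v in votes if v != ABSTAIN)
    if not counted:
        return ABSTAIN
    (top, n), *rest = counted.most_common()
    if rest and rest[0][1] == n:
        return ABSTAIN
    return top


# Illustrative questions only; the shared-task prompts are not given in the abstract.
lfs = [
    make_lf("Does the meme target a group based on sexual orientation?",
            {"yes": HATE, "no": NOT_HATE}),
    make_lf("Is the tone mocking or demeaning?",
            {"yes": HATE, "no": NOT_HATE}),
    make_lf("Is the meme supportive of LGBTQ+ people?",
            {"yes": NOT_HATE}),
]

# Canned answers replace a real model so the sketch runs offline.
canned = {
    "Does the meme target a group based on sexual orientation?": "Yes",
    "Is the tone mocking or demeaning?": "yes",
    "Is the meme supportive of LGBTQ+ people?": "unsure",
}
votes = [lf(canned.__getitem__) for lf in lfs]
label = majority_vote(votes)
```

Constraining each answer to a small option set is what makes the per-question outputs usable as discrete features; a free-text reply falls through to `ABSTAIN` instead of being guessed at.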
Reasoning Dynamics and the Limits of Monitoring Modality Reliance in Vision-Language Models
Authors: Danae Sánchez Villegas, Samuel Lewis-Lim, Nikolaos Aletras, Desmond Elliott
First: 2026-04-16T11:28:53+00:00 · Latest: 2026-04-27T08:17:45+00:00
Abstract
Recent advances in vision language models (VLMs) offer reasoning capabilities, yet how these unfold and integrate visual and textual information remains unclear. We analyze reasoning dynamics in 18 VLMs covering instruction-tuned and reasoning-trained models from two different model families. We track confidence over Chain-of-Thought (CoT), measure the corrective effect of reasoning, and evaluate the contribution of intermediate reasoning steps. We find that models are prone to answer inertia, in which early commitments to a prediction are reinforced, rather than revised during reasoning steps. While reasoning-trained models show stronger corrective behavior, their gains depend on modality conditions, from text-dominant to vision-only settings. Using controlled interventions with misleading textual cues, we show that models are consistently influenced by these cues even when visual evidence is sufficient, and assess whether this influence is recoverable from CoT. Although this influence can appear in the CoT, its detectability varies across models and depends on what is being monitored. Reasoning-trained models are more likely to explicitly refer to the cues, but their longer and fluent CoTs can still appear visually grounded while actually following textual cues, obscuring modality reliance. In contrast, instruction-tuned models refer to the cues less explicitly, but their shorter traces reveal inconsistencies with the visual input. Taken together, these findings indicate that CoT provides only a partial view of how different modalities drive VLM decisions, with important implications for the transparency and safety of multimodal systems.
中文标题/摘要
标题:视觉语言模型中推理动态及其依赖推理模态限制的研究
近期视觉语言模型(VLMs)的发展提供了推理能力,但视觉和文本信息如何整合和展开仍不清楚。我们分析了18个VLMs中的推理动态,涵盖来自两个不同模型家族的指令调优和推理训练模型。我们跟踪了Chain-of-Thought(CoT)的信心,测量推理的纠正效果,并评估中间推理步骤的贡献。我们发现,模型容易出现答案惯性,即早期对预测的承诺在推理步骤中被强化而不是被修正。虽然推理训练模型表现出更强的纠正行为,但这些收益依赖于模态条件,从文本主导到仅视觉设置。通过使用误导性文本提示的受控干预,我们展示了即使在视觉证据充足的情况下,模型也始终受这些提示的影响,并评估这种影响是否可以从CoT中恢复。尽管这种影响可以在CoT中出现,但其可检测性在不同模型之间有所不同,且取决于所监控的内容。推理训练模型更有可能明确引用这些提示,但它们较长且流畅的CoT仍然可能看似视觉基础,实际上遵循文本提示,从而模糊了模态依赖性。相比之下,指令调优模型较少明确引用这些提示,但它们较短的轨迹揭示了与视觉输入的一致性问题。总体而言,这些发现表明CoT仅提供了不同模态如何驱动VLM决策的片面视角,这对多模态系统的透明性和安全性具有重要影响。
Summary / 总结
The study investigates the reasoning dynamics in 18 vision-language models, focusing on how these models integrate visual and textual information. By tracking confidence over Chain-of-Thought (CoT) and evaluating the corrective effect of reasoning, the research finds that models tend to stick to early predictions rather than revise them. Reasoning-trained models show stronger corrective behavior, but their performance varies depending on the modality conditions. The study also demonstrates that models can be influenced by misleading textual cues, even when visual evidence is sufficient, and that this influence is not always detectable in the CoT, highlighting the limitations of monitoring modality reliance in VLMs.
研究分析了18个视觉语言模型的推理动态,关注它们如何整合视觉和文本信息。通过跟踪推理过程中的置信度和评估推理的矫正效果,研究发现模型往往表现出答案惯性,即强化早期预测而不是修正它们。推理训练的模型表现出更强的矫正行为,但其效果依赖于模态条件。研究还揭示,即使有充足的视觉证据,模型也会受到误导性文本提示的影响,而这种影响在推理过程中的记录中并不总是可检测的,这突显了监测不同模态依赖性的局限性。
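One way to make "answer inertia" and the "corrective effect of reasoning" concrete is to compare the answer a model favours at the first reasoning step with the one it gives at the end. The sketch below is an assumed operationalization for illustration only; the paper's actual metrics are not specified in the abstract, and `inertia_rate` and `correction_rate` are hypothetical names.

```python
from typing import List, Sequence


def inertia_rate(traces: Sequence[Sequence[str]]) -> float:
    """Share of traces whose final answer equals the first-step answer."""
    kept = sum(1 for t in traces if t[0] == t[-1])
    return kept / len(traces)


def correction_rate(traces: Sequence[Sequence[str]], gold: Sequence[str]) -> float:
    """Among traces that start with a wrong answer, share that end correct."""
    wrong_start = [(t, g) for t, g in zip(traces, gold) if t[0] != g]
    if not wrong_start:
        return 0.0
    fixed = sum(1 for t, g in wrong_start if t[-1] == g)
    return fixed / len(wrong_start)


# Each trace lists the answer the model prefers after each reasoning step.
traces: List[List[str]] = [
    ["A", "A", "A"],  # starts right, stays right
    ["B", "B", "A"],  # starts wrong, corrected at the end
    ["C", "C", "C"],  # starts wrong, never revised
    ["B", "A", "A"],  # starts wrong, corrected early
]
gold = ["A", "A", "A", "A"]

inertia = inertia_rate(traces)
corrected = correction_rate(traces, gold)
```

A high inertia rate together with a low correction rate would match the pattern the summary describes, where early commitments are reinforced instead of revised.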
Detecting and Evaluating Medical Hallucinations in Large Vision Language Models
Authors: Jiawei Chen, Dingkang Yang, Tong Wu, Yue Jiang, Xiaolu Hou, Mingcheng Li, Shunli Wang, Dongling Xiao, Ke Li, Lihua Zhang
First: 2024-06-14T17:14:22+00:00 · Latest: 2026-04-27T07:59:24+00:00
Abstract
Large Vision Language Models (LVLMs) are increasingly integral to healthcare applications, including medical visual question answering and imaging report generation. While these models inherit the robust capabilities of foundational Large Language Models (LLMs), they also inherit susceptibility to hallucinations, a significant concern in high-stakes medical contexts where the margin for error is minimal. However, there are currently no dedicated methods or benchmarks for hallucination detection and evaluation in the medical field. To bridge this gap, we introduce Med-HallMark, the first benchmark specifically designed for hallucination detection and evaluation within the medical multimodal domain. This benchmark provides multi-tasking hallucination support, multifaceted hallucination data, and hierarchical hallucination categorization. Furthermore, we propose the MediHall Score, a new medical evaluative metric designed to assess LVLMs' hallucinations through a hierarchical scoring system that considers the severity and type of hallucination, thereby enabling a granular assessment of potential clinical impacts. We also present MediHallDetector, a novel Medical LVLM engineered for precise hallucination detection, which employs multitask training for hallucination detection. Through extensive experimental evaluations, we establish baselines for popular LVLMs using our benchmark. The findings indicate that MediHall Score provides a more nuanced understanding of hallucination impacts compared to traditional metrics and demonstrate the enhanced performance of MediHallDetector. We hope this work can significantly improve the reliability of LVLMs in medical applications. All resources of this work have been released at https://github.com/ydk122024/Med-HallMark.
中文标题/摘要
标题:在大型视觉语言模型中检测和评估医疗幻觉
大型视觉语言模型(LVLMs)在医疗应用中越来越重要,包括医学视觉问答和影像报告生成。尽管这些模型继承了基础大型语言模型(LLMs)的强大功能,但它们也继承了幻觉的易感性——在高风险医疗环境中,这种问题尤为令人担忧,因为容错率极低。然而,目前在医疗领域中尚无专门用于幻觉检测和评估的方法或基准。为解决这一问题,我们引入了Med-HallMark,这是首个专门针对医疗多模态领域幻觉检测和评估的基准。该基准提供了多任务幻觉支持、多维度幻觉数据和分层幻觉分类。此外,我们提出了MediHall评分,这是一种新的医疗评估指标,通过分层评分系统考虑幻觉的严重性和类型,从而实现对潜在临床影响的精细评估。我们还介绍了MediHallDetector,这是一种新型的医疗LVLM,专门用于精确幻觉检测,采用多任务训练进行幻觉检测。通过广泛的实验评估,我们使用基准为流行的LVLMs建立了基线。研究结果表明,MediHall评分比传统指标提供了更细致的幻觉影响理解,并展示了MediHallDetector的增强性能。我们希望这项工作能显著提高LVLMs在医疗应用中的可靠性。所有与此工作相关资源已发布在https://github.com/ydk122024/Med-HallMark。
Summary / 总结
The research aims to address the issue of hallucinations in large vision language models (LVLMs) used in medical applications. The study introduces Med-HallMark, a benchmark for hallucination detection and evaluation in the medical domain, and proposes the MediHall Score, a new metric to assess the severity and type of hallucinations. Additionally, a novel detector called MediHallDetector is developed for precise hallucination detection. Experimental results show that MediHall Score offers a more detailed understanding of hallucination impacts and that MediHallDetector performs better than existing methods. This work aims to enhance the reliability of LVLMs in medical contexts.
研究旨在解决大型视觉语言模型(LVLM)在医疗应用中出现幻觉的问题。研究引入了Med-HallMark,这是一个针对医疗多模态领域的幻觉检测和评估基准,并提出了MediHall Score,这是一种新的评估幻觉严重性和类型的指标。此外,还开发了一种名为MediHallDetector的新检测器,用于精确检测幻觉。实验结果表明,MediHall Score提供了对幻觉影响的更详细理解,而MediHallDetector的表现优于现有方法。这项工作旨在提高LVLM在医疗应用中的可靠性。