arXiv 论文速递

2026-04-26 04:04
Snapshot: 20260426_0404
When Prompts Override Vision: Prompt-Induced Hallucinations in LVLMs
Authors: Pegah Khayatan, Jayneel Parekh, Arnaud Dapogny, Mustafa Shukor, Alasdair Newson, Matthieu Cord
First: 2026-04-23T17:54:36+00:00 · Latest: 2026-04-23T17:54:36+00:00
Abstract
Despite impressive progress in capabilities of large vision-language models (LVLMs), these systems remain vulnerable to hallucinations, i.e., outputs that are not grounded in the visual input. Prior work has attributed hallucinations in LVLMs to factors such as limitations of the vision backbone or the dominance of the language component, yet the relative importance of these factors remains unclear. To resolve this ambiguity, we propose HalluScope, a benchmark to better understand the extent to which different factors induce hallucinations. Our analysis indicates that hallucinations largely stem from excessive reliance on textual priors and background knowledge, especially information introduced through textual instructions. To mitigate hallucinations induced by textual instruction priors, we propose HalluVL-DPO, a framework for fine-tuning off-the-shelf LVLMs towards more visually grounded responses. HalluVL-DPO leverages preference optimization using a curated training dataset that we construct, guiding the model to prefer grounded responses over hallucinated ones. We demonstrate that our optimized model effectively mitigates the targeted hallucination failure mode, while preserving or improving performance on other hallucination benchmarks and visual capability evaluations. To support reproducibility and further research, we will publicly release our evaluation benchmark, preference training dataset, and code at https://pegah-kh.github.io/projects/prompts-override-vision/ .
中文标题/摘要
标题:当提示超越视觉:LVLM中的提示诱导幻觉
尽管大型视觉-语言模型(LVLM)的能力取得了令人印象深刻的进展,但这些系统仍然容易出现幻觉,即与视觉输入无关的输出。先前的研究将LVLM中的幻觉归因于视觉骨干的局限性或语言组件的主导地位,但这些因素的重要性尚不明确。为了解决这一模糊性,我们提出了HalluScope,一个基准测试,以更好地理解不同因素导致幻觉的程度。我们的分析表明,幻觉主要源自对文本先验和背景知识的过度依赖,尤其是通过文本指令引入的信息。为了减轻由文本指令先验引起的幻觉,我们提出了HalluVL-DPO框架,这是一种针对现成LVLM进行微调的方法,使其产生更符合视觉输入的响应。HalluVL-DPO利用我们精心构建的训练数据集中的偏好优化,引导模型更倾向于真实的响应而非幻觉。我们证明,优化后的模型有效地缓解了目标幻觉失败模式,同时在其他幻觉基准测试和视觉能力评估中保持或提高了性能。为了支持可重复性和进一步的研究,我们将公开发布我们的评估基准、偏好训练数据集和代码,网址为https://pegah-kh.github.io/projects/prompts-override-vision/。
Summary / 总结
This study investigates the causes of hallucinations in large vision-language models (LVLMs) and proposes HalluScope, a benchmark to understand the extent to which different factors induce hallucinations. The research finds that hallucinations mainly result from over-reliance on textual priors and background knowledge, especially when prompted by textual instructions. To address this, the authors introduce HalluVL-DPO, a fine-tuning framework that uses a curated dataset to guide the model towards more visually grounded responses, effectively reducing hallucinations while maintaining or improving other visual capabilities.
研究探讨了大型视觉语言模型(LVLM)中幻觉的原因,并提出了HalluScope基准,以了解不同因素引发幻觉的程度。研究发现,幻觉主要源于对文本先验知识和背景知识的过度依赖,尤其是在文本指令的提示下。为解决这一问题,作者引入了HalluVL-DPO框架,该框架利用一个精心构建的数据集来引导模型生成更符合视觉输入的响应,从而有效减少了幻觉现象,同时保持或提升了其他视觉能力的性能。
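As a rough illustration of the preference-optimization step mentioned above, the sketch below computes the standard DPO loss for one (grounded, hallucinated) response pair. It assumes HalluVL-DPO uses the usual DPO objective; the abstract does not give the exact loss, and the function name, argument names, and `beta` value are hypothetical.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair (assumed form, not the paper's code).

    "chosen" stands for the visually grounded response and "rejected" for the
    hallucinated one; each argument is a summed log-probability of a response.
    """
    # How much more the policy likes each response than the frozen reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logits)), written in a numerically stable way.
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))

# The loss drops below log(2) once the policy prefers the grounded response
# more strongly than the reference model does.
print(dpo_loss(-1.0, -5.0, -3.0, -3.0))
```

The loss depends only on the difference of the two log-ratio margins, so a policy identical to the reference always scores log(2) regardless of the absolute log-probabilities.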
Counterfactual Segmentation Reasoning: Diagnosing and Mitigating Pixel-Grounding Hallucination
Authors: Xinzhuo Li, Adheesh Juvekar, Jiaxun Zhang, Xingyou Liu, Muntasir Wahed, Kiet A. Nguyen, Yifan Shen, Tianjiao Yu, Ismini Lourentzou
First: 2025-06-26T17:59:12+00:00 · Latest: 2026-04-23T17:42:55+00:00
Comments: Project webpage: https://plan-lab.github.io/hallusegbench/
Abstract
Segmentation Vision-Language Models (VLMs) have significantly advanced grounded visual understanding, yet they remain prone to pixel-grounding hallucinations, producing masks for incorrect objects or for objects that are entirely absent. Existing evaluations rely almost entirely on text- or label-based perturbations, which check only whether the predicted mask matches the queried label. Such evaluations overlook the spatial footprint and severity of hallucination and therefore fail to reveal vision-driven hallucinations, which are more challenging and more prevalent. To address this gap, we formalize the task of Counterfactual Segmentation Reasoning (CSR), where a model must segment the referenced object in the factual image and abstain in its counterfactual counterpart. To support this task, we curate HalluSegBench, the first large-scale benchmark to diagnose referring and reasoning expression segmentation hallucinations using controlled visual counterfactuals, alongside new evaluation metrics that measure hallucination severity and disentangle vision- and language-driven failure modes. We further introduce RobustSeg, a segmentation VLM trained with counterfactual fine-tuning (CFT) to learn when to segment and when to abstain. Experimental results confirm RobustSeg reduces hallucinations by 30%, while improving segmentation performance on FP-RefCOCO(+/g).
Fake or Real, Can Robots Tell? Evaluating VLM Robustness to Domain Shift in Single-View Robotic Scene Understanding
Authors: Federico Tavella, Amber Drinkwater, Angelo Cangelosi
First: 2025-06-24T12:45:09+00:00 · Latest: 2026-04-23T17:05:26+00:00
Abstract
Robotic scene understanding increasingly relies on Vision-Language Models (VLMs) to generate natural language descriptions of the environment. In this work, we systematically evaluate single-view object captioning for tabletop scenes captured by a robotic manipulator, introducing a controlled physical domain shift that contrasts real-world tools with geometrically similar 3D-printed counterparts that differ in texture, colour, and material. We benchmark a suite of state-of-the-art, locally deployable VLMs across multiple metrics to assess semantic alignment and factual grounding. Our results demonstrate that while VLMs describe common real-world objects effectively, performance degrades markedly on 3D-printed items despite their structurally familiar forms. We further expose critical vulnerabilities in standard evaluation metrics, showing that some fail to detect domain shifts entirely or reward fluent but factually incorrect captions. These findings highlight the limitations of deploying foundation models for embodied agents and the need for more robust architectures and evaluation protocols in physical robotic applications.
中文标题/摘要
标题:真假难辨,机器人能分辨吗?评估单视角机器人场景理解中VLM的域移鲁棒性
机器人场景理解越来越多地依赖于视觉-语言模型(VLMs)来生成对环境的自然语言描述。在本研究中,我们系统地评估了由机器人操作器捕获的桌面场景的单视角物体描述,引入了一种受控的物理域移,将真实世界的工具与几何相似但纹理、颜色和材料不同的3D打印对应物进行对比。我们使用多种指标对一系列最先进的、可本地部署的VLM进行基准测试,以评估语义对齐和事实基础。我们的结果表明,尽管VLMs能够有效地描述常见的现实世界物体,但在3D打印物品上表现却显著下降,尽管它们的结构形式相似。我们进一步揭示了标准评估指标的关键漏洞,表明一些指标无法检测到域移,或者奖励流畅但事实错误的描述。这些发现突显了在物理机器人应用中部署基础模型的局限性,并强调了需要更鲁棒的架构和评估协议。
Summary / 总结
This study evaluates the robustness of Vision-Language Models (VLMs) in single-view object captioning for robotic scene understanding, introducing a controlled physical domain shift between real-world tools and their 3D-printed counterparts. The research benchmarks several state-of-the-art, locally deployable VLMs using multiple metrics to assess semantic alignment and factual grounding. Key findings show that while VLMs describe common real-world objects well, their performance drops markedly on 3D-printed items despite their structurally familiar forms. The study also exposes critical vulnerabilities in standard evaluation metrics, some of which fail to detect the domain shift entirely or reward fluent but factually incorrect captions. This highlights the need for more robust architectures and evaluation protocols in physical robotic applications.
研究评估了视觉-语言模型(VLMs)在单视角物体描述中的鲁棒性,通过引入3D打印对象的物理域移位实验,这些对象在几何上相似但材质、颜色和纹理不同,与真实世界工具形成对比。研究发现,虽然VLMs在描述常见真实世界物体方面表现良好,但在处理3D打印物品时性能显著下降,尽管它们的结构相似。研究还揭示了标准评估指标的关键缺陷,有时无法检测到域移位或奖励错误的描述。这些发现突显了当前VLMs在物理机器人应用中的局限性,并强调了需要更鲁棒的架构和评估协议。
From Codebooks to VLMs: Evaluating Automated Visual Discourse Analysis for Climate Change on Social Media
Authors: Katharina Prasse, Steffen Jung, Isaac Bravo, Stefanie Walter, Patrick Knab, Christian Bartelt, Margret Keuper
First: 2026-04-23T15:44:14+00:00 · Latest: 2026-04-23T15:44:14+00:00
Abstract
Social media platforms have become primary arenas for climate communication, generating millions of images and posts that - if systematically analysed - can reveal which communication strategies mobilise public concern and which fall flat. We aim to facilitate such research by analysing how computer vision methods can be used for social media discourse analysis. This analysis includes application-based taxonomy design, model selection, prompt engineering, and validation. We benchmark six promptable vision-language models and 15 zero-shot CLIP-like models on two datasets from X (formerly Twitter) - a 1,038-image expert-annotated set and a larger corpus of over 1.2 million images, with 50,000 labels manually validated - spanning five annotation dimensions: animal content, climate change consequences, climate action, image setting, and image type. Among the models benchmarked, Gemini-3.1-flash-lite outperforms all others across all super-categories and both datasets, while the gap to open-weight models of moderate size remains relatively small. Beyond instance-level metrics, we advocate for distributional evaluation: VLM predictions can reliably recover population level trends even when per-image accuracy is moderate, making them a viable starting point for discourse analysis at scale. We find that chain-of-thought reasoning reduces rather than improves performance, and that annotation dimension specific prompt design improves performance. We release tweet IDs and labels along with our code at https://github.com/KathPra/Codebooks2VLMs.git.
中文标题/摘要
标题:从代码本到VLM:评估自动化视觉话语分析在社交媒体上的气候变迁研究
社交媒体平台已成为气候沟通的主要场所,生成了数百万张图像和帖子,如果系统地分析这些内容,可以揭示哪些沟通策略能激发公众关注,哪些则不然。我们旨在通过分析计算机视觉方法在社交媒体话语分析中的应用来促进此类研究。该分析包括基于应用的分类学设计、模型选择、提示工程和验证。我们在X(原Twitter)的两个数据集上对六种可提示的视觉-语言模型和十五种零样本CLIP-like模型进行了基准测试——一个由1,038张由专家标注的图像集和一个包含超过120万张图像的更大语料库,其中5万个标签由人工验证,涵盖了五个标注维度:动物内容、气候变迁后果、气候行动、图像场景和图像类型。在基准测试的模型中,Gemini-3.1-flash-lite在所有超类别和两个数据集上均表现出色,而与中等规模的开放权重模型之间的差距相对较小。除了实例级指标外,我们提倡分布评估:VLM预测即使在单张图像准确性较低时也能可靠地恢复总体趋势,使它们成为大规模话语分析的可行起点。我们发现,链式推理反而降低了性能,而特定标注维度的提示设计则提高了性能。我们将在https://github.com/KathPra/Codebooks2VLMs.git发布推特ID和标签以及我们的代码。
Summary / 总结
The study aims to evaluate the use of vision-language models for analyzing climate change discourse on social media. It benchmarks six promptable vision-language models and 15 zero-shot CLIP-like models on two datasets from X (formerly Twitter), finding that Gemini-3.1-flash-lite outperforms others across all categories. The research also highlights the importance of distributional evaluation and specific prompt design for better performance. Beyond instance-level metrics, VLM predictions can reliably capture population-level trends, making them suitable for large-scale discourse analysis. Chain-of-thought reasoning was found to be detrimental to performance in this context.
研究评估了视觉语言模型在社交媒体上分析气候变化话语的应用,重点关注计算机视觉方法及其在分类体系设计、模型选择和提示工程中的应用。研究在X(以前的Twitter)的两个数据集上对六种可提示的视觉语言模型和十五种零样本CLIP模型进行了基准测试,结果显示Gemini-3.1-flash-lite在两个数据集中均表现出色。研究还强调了分布性评估和特定提示设计对于提高性能的重要性。
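The distinction between instance-level and distributional evaluation can be sketched with a toy example. The labels below are invented for illustration, and total variation distance is one possible way to compare label distributions; the abstract does not say which distributional measure the authors use.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of images whose predicted label matches the annotation."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def label_distribution(labels, classes):
    """Share of each class in a list of labels."""
    counts = Counter(labels)
    return {c: counts[c] / len(labels) for c in classes}

def total_variation(p, q):
    """Total variation distance between two distributions over the same classes."""
    return 0.5 * sum(abs(p[c] - q[c]) for c in p)

# Hypothetical labels: two per-image errors that cancel out at population level.
classes = ["animal", "protest", "landscape"]
y_true = ["animal", "animal", "protest", "protest", "landscape", "landscape"]
y_pred = ["animal", "protest", "animal", "protest", "landscape", "landscape"]

acc = accuracy(y_true, y_pred)
tv = total_variation(label_distribution(y_true, classes),
                     label_distribution(y_pred, classes))
print(acc, tv)
```

Here per-image accuracy is only moderate while the predicted class shares match the annotated ones exactly, which is the kind of situation in which population-level trends can still be recovered.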
MaskDiME: Adaptive Masked Diffusion for Precise and Efficient Visual Counterfactual Explanations
Authors: Changlu Guo, Anders Nymark Christensen, Anders Bjorholm Dahl, Morten Rieger Hannemose
First: 2026-02-21T10:53:50+00:00 · Latest: 2026-04-23T15:15:48+00:00
Comments: Accepted by CVPR 2026
Abstract
Visual counterfactual explanations aim to reveal the minimal semantic modifications that can alter a model's prediction, providing causal and interpretable insights into deep neural networks. However, existing diffusion-based counterfactual generation methods are often computationally expensive, slow to sample, and imprecise in localizing the modified regions. To address these limitations, we propose MaskDiME, a simple, fast, yet effective diffusion framework that unifies semantic consistency and spatial precision through localized sampling. Our approach adaptively focuses on decision-relevant regions to achieve localized and semantically consistent counterfactual generation while preserving high image fidelity. Our training-free framework, MaskDiME, performs inference over 30x faster than the baseline and achieves comparable or state-of-the-art performance across five benchmark datasets spanning diverse visual domains, establishing a practical and generalizable solution for efficient counterfactual explanation.
中文标题/摘要
标题:MaskDiME:自适应掩码扩散以实现精确高效的视觉反事实解释
视觉反事实解释旨在揭示能够改变模型预测的最小语义修改,为深度神经网络提供因果和可解释的洞察。然而,现有的基于扩散的反事实生成方法通常计算成本高、采样速度慢且在局部修改区域定位方面不够精确。为了解决这些限制,我们提出了一种名为MaskDiME的简单、快速且有效的扩散框架,通过局部采样统一语义一致性和空间精度。我们的方法适应性地关注决策相关区域,以实现局部和语义一致的反事实生成,同时保持高图像保真度。我们的无需训练框架MaskDiME在基准测试中的推理速度比基线快30倍,并在五个涵盖不同视觉领域的基准数据集上实现了可比或最先进的性能,为高效的反事实解释提供了一种实用且可泛化的解决方案。
Summary / 总结
MaskDiME is designed to generate precise and efficient visual counterfactual explanations by addressing the computational inefficiency and imprecision of existing methods. It uses a localized sampling approach to focus on decision-relevant regions, achieving both spatial precision and semantic consistency. MaskDiME is training-free and performs inference over 30x faster than the baseline, demonstrating comparable or state-of-the-art performance on five benchmark datasets spanning diverse visual domains.
MaskDiME旨在通过解决现有方法的计算效率低下和定位不精确问题,生成精确且高效的视觉反事实解释。它采用局部采样方法,专注于决策相关区域,实现空间精度和语义一致性。MaskDiME无需训练,比基线快30倍,并在多种视觉领域基准数据集上表现出可比或最先进的性能,提供了一种实用且通用的反事实解释解决方案。
Ramen: Robust Test-Time Adaptation of Vision-Language Models with Active Sample Selection
Authors: Wenxuan Bao, Yanjun Zhao, Xiyuan Yang, Jingrui He
Venue: CVPR 2026
First: 2026-04-23T14:33:27+00:00 · Latest: 2026-04-23T14:33:27+00:00
Comments: Accepted by CVPR 2026 (Findings Track)
Abstract
Pretrained vision-language models such as CLIP exhibit strong zero-shot generalization but remain sensitive to distribution shifts. Test-time adaptation adapts models during inference without access to source data or target labels, offering a practical way to handle such shifts. However, existing methods typically assume that test samples come from a single, consistent domain, while in practice, test data often include samples from mixed domains with distinct characteristics. Consequently, their performance degrades under mixed-domain settings. To address this, we present Ramen, a framework for robust test-time adaptation through active sample selection. For each incoming test sample, Ramen retrieves a customized batch of relevant samples from previously seen data based on two criteria: domain consistency, which ensures that adaptation focuses on data from similar domains, and prediction balance, which mitigates adaptation bias caused by skewed predictions. To improve efficiency, Ramen employs an embedding-gradient cache that stores the embeddings and sample-level gradients of past test images. The stored embeddings are used to retrieve relevant samples, and the corresponding gradients are aggregated for model updates, eliminating the need for any additional forward or backward passes. Our theoretical analysis provides insight into why the proposed adaptation mechanism is effective under mixed-domain shifts. Experiments on multiple image corruption and domain-shift benchmarks demonstrate that Ramen achieves strong and consistent performance, offering robust and efficient adaptation in complex mixed-domain scenarios. Our code is available at https://github.com/baowenxuan/Ramen .
Summary / 总结
Ramen is a framework for robust test-time adaptation of vision-language models with active sample selection. It retrieves a customized batch of relevant samples based on domain consistency and prediction balance to mitigate adaptation bias. Experiments show that Ramen performs well across various image corruption and domain-shift benchmarks, offering strong and consistent performance in mixed-domain scenarios.
Ramen 是一个用于提升视觉-语言模型在混合域设置下鲁棒测试时适应性的框架。它通过基于域一致性和预测平衡的主动样本选择来适应模型。Ramen 使用嵌入-梯度缓存来高效检索和更新相关样本,无需额外的前向或反向传播。实验表明,Ramen 在多种图像腐蚀和域偏移基准测试中表现出强大的一致性能,使其适用于复杂的混合域场景。
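The retrieval idea can be sketched as a small cache that stores, for each past test image, its embedding, its per-sample gradient, and its predicted label. This is a minimal sketch under stated assumptions: the class name, the "top matches per predicted class" selection rule, and the mean aggregation are illustrative choices, not the selection or update rule from the paper.

```python
import math
from collections import defaultdict

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

class EmbeddingGradientCache:
    """Hypothetical cache of (embedding, gradient, predicted_label) per past sample."""

    def __init__(self):
        self.entries = []

    def add(self, embedding, gradient, predicted_label):
        self.entries.append((embedding, gradient, predicted_label))

    def retrieve(self, query, per_class=1):
        # Group cached samples by predicted label so no class dominates the
        # batch (a stand-in for "prediction balance").
        by_class = defaultdict(list)
        for emb, grad, label in self.entries:
            by_class[label].append((cosine(query, emb), grad))
        selected = []
        for items in by_class.values():
            # Keep the samples most similar to the query within each class
            # (a stand-in for "domain consistency").
            items.sort(key=lambda item: item[0], reverse=True)
            selected.extend(grad for _, grad in items[:per_class])
        return selected

    def aggregated_gradient(self, query, per_class=1):
        # Average the stored gradients; no new forward or backward pass is run.
        grads = self.retrieve(query, per_class)
        if not grads:
            return None
        return [sum(g[i] for g in grads) / len(grads) for i in range(len(grads[0]))]

cache = EmbeddingGradientCache()
cache.add([1.0, 0.0], [1.0, 0.0], predicted_label=0)
cache.add([0.0, 1.0], [10.0, 10.0], predicted_label=0)
cache.add([0.9, 0.1], [0.0, 1.0], predicted_label=1)
print(cache.aggregated_gradient([1.0, 0.0]))
```

Because gradients are stored alongside embeddings, the update for a new sample only needs a similarity search and an average over cached vectors.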
Causal Disentanglement for Full-Reference Image Quality Assessment
Authors: Zhen Zhang, Jielei Chu, Tian Zhang, Weide Liu, Fengmao Lv, Tianrui Li, Jun Cheng, Yuming Fang
First: 2026-04-23T13:18:13+00:00 · Latest: 2026-04-23T13:18:13+00:00
Abstract
Existing deep network-based full-reference image quality assessment (FR-IQA) models typically work by performing pairwise comparisons of deep features from the reference and distorted images. In this paper, we approach this problem from a different perspective and propose a novel FR-IQA paradigm based on causal inference and decoupled representation learning. Unlike typical feature comparison-based FR-IQA models, our approach formulates degradation estimation as a causal disentanglement process guided by intervention on latent representations. We first decouple degradation and content representations by exploiting the content invariance between the reference and distorted images. Second, inspired by the human visual masking effect, we design a masking module to model the causal relationship between image content and degradation features, thereby extracting content-influenced degradation features from distorted images. Finally, quality scores are predicted from these degradation features using either supervised regression or label-free dimensionality reduction. Extensive experiments demonstrate that our method achieves highly competitive performance on standard IQA benchmarks across fully supervised, few-label, and label-free settings. Furthermore, we evaluate the approach on diverse non-standard natural image domains with scarce data, including underwater, radiographic, medical, neutron, and screen-content images. Benefiting from its ability to perform scenario-specific training and prediction without labeled IQA data, our method exhibits superior cross-domain generalization compared to existing training-free FR-IQA models.
Summary / 总结
This paper proposes a novel full-reference image quality assessment (FR-IQA) method based on causal inference and decoupled representation learning. Unlike traditional feature comparison-based models, the approach formulates degradation estimation as a causal disentanglement process. It decouples degradation and content representations and uses a masking module to model the causal relationship between content and degradation features. The method predicts quality scores from these features using supervised regression or label-free dimensionality reduction. Experiments show that the proposed method performs competitively across various IQA benchmarks and demonstrates superior cross-domain generalization on diverse image domains with scarce data.
本文提出了一种基于因果推理和解耦表示学习的全参考图像质量评估(FR-IQA)方法。不同于传统的基于特征对比的方法,该方法将退化估计视为因果分离过程。它解耦了退化和内容表示,并使用遮罩模块来建模内容和退化特征之间的因果关系。该方法使用监督回归或无标签降维来预测这些特征的质量评分。实验表明,所提出的方法在各种IQA基准上表现竞争力,并在具有稀缺数据的多种图像域中表现出优越的跨域泛化能力。
LiveVLM: Efficient Online Video Understanding via Streaming-Oriented KV Cache and Retrieval
Authors: Zhenyu Ning, Guangda Liu, Qihao Jin, Chengwei Li, Wenchao Ding, Minyi Guo, Jieru Zhao
Venue: 63rd ACM/IEEE Design Automation Conference (DAC '26), July 2026
First: 2025-05-21T08:47:15+00:00 · Latest: 2026-04-23T12:54:38+00:00
Comments: Accepted by DAC'26
Abstract
Recent developments in Video Large Language Models (Video LLMs) have enabled models to process hour-long videos and exhibit exceptional performance. Nonetheless, the Key-Value (KV) cache expands linearly over time, leading to substantial memory overhead and response delay, which are critical challenges in various real-world online applications, such as Deepseek services, autonomous driving and robotics. To mitigate these issues, we propose LiveVLM, a training-free and query-agnostic framework specifically designed for online video understanding and real-time interaction. LiveVLM employs a Vision Sink Bucketing (VSB) mechanism to process video streams in real time, retain long-term video details and eliminate redundant KVs. This mechanism utilizes vision-to-vision attention scores as the metric and seeks to maximize the coverage of contextual information during compression. Noting that KV cache compressed in a query-agnostic manner inevitably retains irrelevant information for specific queries, LiveVLM incorporates a Position-agnostic KV Retrieval (PaR) mechanism to reduce interference from redundant context. The key point of PaR lies in decoupling positional embeddings to enhance the similarity between key tensors, thereby supporting efficient retrieval at the granularity of pages. Extensive experiments demonstrate that LiveVLM enables the foundation LLaVA-OneVision model to achieve state-of-the-art accuracy among both training-free query-agnostic methods and training-based online models.
Summary / 总结
LiveVLM is a training-free, query-agnostic framework designed for efficient online video understanding by processing video streams in real time and retaining long-term video details. It uses a Vision Sink Bucketing (VSB) mechanism to compress the Key-Value (KV) cache and a Position-agnostic KV Retrieval (PaR) mechanism to reduce interference from irrelevant context. Experiments show that LiveVLM enables the LLaVA-OneVision model to achieve state-of-the-art accuracy among both training-free query-agnostic methods and training-based online models.
LiveVLM 是一个用于高效在线视频理解的框架,可实时处理视频流并保留长期视频细节。它使用 Vision Sink Bucketing (VSB) 机制压缩 Key-Value (KV) 缓存,并使用 Position-agnostic KV Retrieval (PaR) 机制减少无关信息。实验表明,LiveVLM 使 LLaVA-OneVision 模型在无需训练的查询无关方法和基于训练的在线模型中均达到了最先进的准确率。
Reasoning on the Manifold: Bidirectional Consistency for Self-Verification in Diffusion Language Models
Authors: Jiaoyang Ruan, Xin Gao, Yinda Chen, Hengyu Zeng, Liang Du, Guanghao Li, Jie Fu, Jian Pu
First: 2026-04-17T10:17:16+00:00 · Latest: 2026-04-23T12:41:25+00:00
Comments: 30 pages, 5 figures
Abstract
While Diffusion Large Language Models (dLLMs) offer structural advantages for global planning, efficiently verifying that they arrive at correct answers via valid reasoning traces remains a critical challenge. In this work, we propose a geometric perspective: Reasoning on the Manifold. We hypothesize that valid generation trajectories reside as stable attractors on the high-density manifold of the learned distribution, whereas invalid paths exhibit off-manifold drift. To operationalize this, we introduce Bidirectional Manifold Consistency (BMC), a training-free, unsupervised metric that quantifies the stability of the generated sequence through a forward-masking and backward-reconstruction cycle. Empirically, we demonstrate BMC's versatility across the full reasoning lifecycle: (1) in Diagnosis, it serves as a robust discriminator of solution validity without ground truth answer; (2) in Inference, it enables rejection resampling to effectively concentrate computational resources on complex reasoning tasks; and (3) in Alignment, it functions as a dense geometric reward that transforms sparse outcome supervision into fine-grained guidance, empowering models to self-evolve beyond standard baselines. Our results establish intrinsic geometric stability as a robust indicator of correctness for dLLMs.
中文标题/摘要
标题:流形上的推理:双向一致性在扩散语言模型中的自我验证
虽然扩散大型语言模型(dLLMs)在全局规划方面具有结构优势,但高效验证它们是否通过有效的推理轨迹得出正确答案仍然是一个关键挑战。在本文中,我们提出了一种几何视角:流形上的推理。我们假设有效的生成轨迹作为学习分布的高密度流形上的稳定吸引子存在,而不有效的路径则表现出流形外的漂移。为了实现这一点,我们引入了双向流形一致性(BMC),这是一种无需训练、无监督的度量标准,通过前向掩蔽和后向重构循环来量化生成序列的稳定性。实证上,我们展示了BMC在推理生命周期的全过程中具有灵活性:(1)在诊断中,它作为稳健的解决方案有效性鉴别器,无需参考答案;(2)在推理中,它使拒绝采样得以有效集中计算资源于复杂推理任务;(3)在对齐中,它作为密集的几何奖励,将稀疏的结果监督转化为精细的指导,使模型能够超越标准基线自我进化。我们的结果确立了内在的几何稳定性作为dLLMs正确性的稳健指标。
Summary / 总结
This work takes a geometric perspective on reasoning in Diffusion Large Language Models (dLLMs), hypothesizing that valid generation trajectories are stable attractors on the high-density manifold of the learned distribution, while invalid paths drift off the manifold. It introduces Bidirectional Manifold Consistency (BMC), a training-free, unsupervised metric that measures the stability of a generated sequence through a forward-masking and backward-reconstruction cycle. BMC is shown to discriminate solution validity without ground-truth answers, to enable rejection resampling at inference time, and to serve as a dense geometric reward for alignment that helps models improve beyond standard baselines.
Process Supervision via Verbal Critique Improves Reasoning in Large Language Models
Authors: Hao-Yuan Chen
First: 2026-04-23T12:36:12+00:00 · Latest: 2026-04-23T12:36:12+00:00
Abstract
Inference-time scaling for LLM reasoning has focused on three axes: chain depth, sample breadth, and learned step-scorers (PRMs). We introduce a fourth axis, granularity of external verbal supervision, via Verbal Process Supervision (VPS), a training-free framework that uses structured natural-language critique from a stronger supervisor to guide an iterative generate-critique-refine loop up to a round budget R. Across GPQA Diamond, AIME 2025, and LiveCodeBench V6 (covering both closed and open models), VPS yields three key results. First, on GPQA Diamond, GPT-5.4 (High) | GPT-5.4 (Low) reaches 94.9% at R=4, surpassing the 94.1% state of the art without gradient updates. Second, on AIME 2025, VPS enables strong weak-actor rescue, boosting scores from 11.7-26.7% to 63.3-90.0% (up to +63.3 points). Third, at matched compute, VPS outperforms Reflexion by +8.5 to +12.1 points and Self-Consistency@5 by +5.0 pp (GPQA) and +8.3 pp (LiveCodeBench), isolating critique granularity as the key driver. Performance scales with the supervisor-actor capability gap (Pearson r=0.90) and degrades when errors are not linguistically expressible (e.g., code synthesis), motivating hybrid verbal-executable methods. These results establish critique granularity as a new axis of inference-time scaling.
Summary / 总结
The paper introduces Verbal Process Supervision (VPS), a training-free framework that uses structured natural-language critique from a stronger supervisor to guide LLMs in an iterative generate-critique-refine loop. Across GPQA Diamond, AIME 2025, and LiveCodeBench V6, VPS improves reasoning performance. On GPQA Diamond, GPT-5.4 (High) | GPT-5.4 (Low) reached 94.9% at R=4, surpassing the 94.1% state of the art. On AIME 2025, VPS boosted scores from 11.7-26.7% to 63.3-90.0%. At matched compute, VPS outperformed Reflexion by +8.5 to +12.1 points and Self-Consistency@5 by +5.0 pp (GPQA) and +8.3 pp (LiveCodeBench). Performance scales with the supervisor-actor capability gap and degrades when errors are not linguistically expressible.
论文引入了Verbal Process Supervision (VPS)框架,通过结构化的自然语言批评来引导LLM进行迭代生成-批评-改进循环。在各种基准测试中,VPS提高了推理性能。在GPQA Diamond上,GPT-5.4 (High) | GPT-5.4 (Low)在R=4时达到了94.9%,超越了现有技术。在AIME 2025上,VPS将分数从11.7-26.7%提升到63.3-90.0%,在匹配计算资源的情况下,VPS分别比Reflexion和Self-Consistency@5高出+8.5到+12.1分和+5.0到+8.3分。性能与监督者-执行者能力差距成正比,并在错误无法用语言表达时下降。
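The generate-critique-refine loop with a round budget R can be sketched as follows. The `actor` and `supervisor` callables are hypothetical stand-ins for the weaker and stronger models, and stopping early when the supervisor returns no critique is an assumption; the abstract only specifies that the loop runs up to R rounds.

```python
def verbal_process_supervision(problem, actor, supervisor, max_rounds=4):
    """Iterate generate -> critique -> refine for at most max_rounds rounds.

    actor(problem, previous_answer, critique) returns a new answer.
    supervisor(problem, answer) returns a critique, or None if it has no objection.
    """
    answer = actor(problem, None, None)          # initial generation
    for _ in range(max_rounds):
        critique = supervisor(problem, answer)   # verbal feedback on the attempt
        if critique is None:                     # assumed stopping rule
            break
        answer = actor(problem, answer, critique)
    return answer

# Toy stand-ins: the "problem" is a target integer and critiques are short strings.
def toy_actor(problem, previous, critique):
    if previous is None:
        return 0
    return previous + 1 if critique == "too low" else previous - 1

def toy_supervisor(problem, answer):
    if answer == problem:
        return None
    return "too low" if answer < problem else "too high"

print(verbal_process_supervision(3, toy_actor, toy_supervisor, max_rounds=4))
```

With a smaller budget the loop simply returns the latest refinement, which makes the round budget an explicit compute knob.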
Musical Score Understanding Benchmark: Evaluating Large Language Models' Comprehension of Complete Musical Scores
Authors: Congren Dai, Yue Yang, Krinos Li, Huichi Zhou, Shijie Liang, Bo Zhang, Enyang Liu, Ge Jin, Hongran An, Haosen Zhang, Peiyuan Jing, Kinhei Lee, Zhenxuan Zhang, Xiaobing Li, Maosong Sun
Venue: ACL 2026
First: 2025-11-24T06:40:38+00:00 · Latest: 2026-04-23T11:52:26+00:00
Comments: Accepted to ACL 2026 Main Conference
Abstract
Understanding complete musical scores entails integrated reasoning over pitch, rhythm, harmony, and large-scale structure, yet the ability of Large Language Models and Vision-Language Models to interpret full musical notation remains insufficiently examined. We introduce Musical Score Understanding Benchmark (MSU-Bench), a human-curated benchmark for score-level musical understanding across textual (ABC notation) and visual (PDF) modalities. MSU-Bench contains 1,800 generative question-answer pairs from works by Bach, Beethoven, Chopin, Debussy, and others, organised into four levels of increasing difficulty, ranging from onset information to texture and form. Evaluations of more than fifteen state-of-the-art models, in both zero-shot and fine-tuned settings, reveal pronounced modality gaps, unstable level-wise performance, and challenges in maintaining multilevel correctness. Fine-tuning substantially improves results across modalities while preserving general knowledge, positioning MSU-Bench as a robust foundation for future research in multimodal reasoning. The benchmark and code are available at https://github.com/Congren-Dai/MSU-Bench.
Summary / 总结
The research aims to evaluate Large Language Models and Vision-Language Models in understanding complete musical scores, which involve complex reasoning over pitch, rhythm, harmony, and structure. The Musical Score Understanding Benchmark (MSU-Bench) was introduced, containing 1,800 generative question-answer pairs from famous composers. Evaluations of over fifteen state-of-the-art models showed significant modality gaps and unstable performance across different levels of difficulty, with fine-tuning improving results while preserving general knowledge. This benchmark is expected to serve as a robust foundation for future multimodal reasoning research.
研究旨在评估大型语言模型和视觉-语言模型在理解完整音乐谱方面的能力,这需要对音高、节奏、和声和结构进行复杂的推理。引入了音乐谱理解基准(MSU-Bench),包含来自著名作曲家的1,800个生成问题-答案对。超过十五个最先进的模型的评估显示了显著的模态差距和不同难度级别的不稳定性能,通过微调可以改善结果并保留一般知识。该基准预计将成为未来多模态推理研究的坚实基础。
Component-Based Out-of-Distribution Detection
Authors: Wenrui Liu, Hong Chang, Ruibing Hou, Shiguang Shan, Xilin Chen
First: 2026-04-23T11:19:39+00:00 · Latest: 2026-04-23T11:19:39+00:00
Abstract
Out-of-Distribution (OOD) detection requires sensitivity to subtle shifts without overreacting to natural In-Distribution (ID) diversity. However, from the viewpoint of detection granularity, global representations inevitably suppress local OOD cues, while patch-based methods are unstable due to entangled spurious correlations and noise. Moreover, neither is effective at detecting compositional OODs composed of valid ID components. Inspired by recognition-by-components theory, we present a training-free Component-Based OOD Detection (CoOD) framework that addresses the existing limitations by decomposing inputs into functional components. To instantiate CoOD, we derive a Component Shift Score (CSS) to detect local appearance shifts, and a Compositional Consistency Score (CCS) to identify cross-component compositional inconsistencies. Empirically, CoOD achieves consistent improvements on both coarse- and fine-grained OOD detection.
Seeing Isn't Believing: Uncovering Blind Spots in Evaluator Vision-Language Models
Authors: Mohammed Safi Ur Rahman Khan, Sanjay Suryanarayanan, Tushar Anand, Mitesh M. Khapra
First: 2026-04-23T10:36:50+00:00 · Latest: 2026-04-23T10:36:50+00:00
Abstract
Large Vision-Language Models (VLMs) are increasingly used to evaluate outputs of other models, for image-to-text (I2T) tasks such as visual question answering, and text-to-image (T2I) generation tasks. Despite this growing reliance, the reliability of these Evaluator VLMs remains underexplored. In this work, we systematically evaluate the reliability of Evaluator VLMs across both I2T and T2I tasks. We introduce targeted perturbations that degrade output quality along key error dimensions, including object hallucinations, spatial reasoning, factual grounding, and visual fidelity. These perturbations test whether Evaluator VLMs can reliably account for these quality-degrading errors in their evaluations. Using a comprehensive benchmark of over 4000 perturbed instances spanning 40 perturbation dimensions, we evaluate 4 prominent VLMs using single-answer scoring, pairwise comparison, and reference-guided paradigms. Our findings reveal that current VLM evaluators exhibit substantial blind spots: they often fail to detect perturbed outputs (with failure rates exceeding 50% in some cases), struggle particularly with fine-grained compositional and spatial errors, and are often insensitive to hallucinated content that contradicts the input image. Pairwise comparison proves more reliable, though failure rates persist. These results highlight the unreliable nature of current Evaluator VLMs and urge caution in their deployment for benchmarking and development decisions. Code and data have been made publicly available.
PosterForest: Hierarchical Multi-Agent Collaboration for Scientific Poster Generation
Authors: Jiho Choi, Seojeong Park, Seongjong Song, Hyunjung Shim
Venue: ACL 2026
First: 2025-08-29T15:36:06+00:00 · Latest: 2026-04-23T09:35:10+00:00
Comments: ACL 2026
Abstract
Automating scientific poster generation requires hierarchical document understanding and coherent content-layout planning. Existing methods often rely on flat summarization or optimize content and layout separately. As a result, they often suffer from information loss, weak logical flow, and poor visual balance. We present PosterForest, a training-free framework for scientific poster generation. Our method introduces the Poster Tree, a structured intermediate representation that captures document hierarchy and visual-textual semantics across multiple levels. Building on this representation, content and layout agents perform hierarchical reasoning and recursive refinement, progressively optimizing the poster from global organization to local composition. This joint optimization improves semantic coherence, logical flow, and visual harmony. Experiments show that PosterForest outperforms prior methods in both automatic and human evaluations, without additional training or domain-specific supervision.
Instance-level Visual Active Tracking with Occlusion-Aware Planning
Authors: Haowei Sun, Kai Zhou, Hao Gao, Shiteng Zhang, Jinwu Hu, Xutao Wen, Qixiang Ye, Mingkui Tan
Venue: CVPR 2026 Poster
First: 2026-04-23T09:11:50+00:00 · Latest: 2026-04-23T09:11:50+00:00
Comments: CVPR 2026 Poster
Abstract
Visual Active Tracking (VAT) aims to control cameras to follow a target in 3D space, which is critical for applications like drone navigation and security surveillance. However, it faces two key bottlenecks in real-world deployment: confusion from visually similar distractors caused by insufficient instance-level discrimination and severe failure under occlusions due to the absence of active planning. To address these, we propose OA-VAT, a unified pipeline with three complementary modules. First, a training-free Instance-Aware Offline Prototype Initialization aggregates multi-view augmented features via DINOv3 to construct discriminative instance prototypes, mitigating distractor confusion. Second, an Online Prototype Enhancement Tracker enhances prototypes online and integrates a confidence-aware Kalman filter for stable tracking under appearance and motion changes. Third, an Occlusion-Aware Trajectory Planner, trained on our new Planning-20k dataset, uses conditional diffusion to generate obstacle-avoiding paths for occlusion recovery. Experiments demonstrate OA-VAT achieves 0.93 average SR on UnrealCV (+2.2% vs. SOTA TrackVLA), 90.8% average CAR on real-world datasets (+12.1% vs. SOTA GC-VAT), and 81.6% TSR on a DJI Tello drone. Running at 35 FPS on an RTX 3090, it delivers robust, real-time performance for practical deployment.
中文标题/摘要
标题:基于实例的视觉主动跟踪与遮挡感知规划
视觉主动跟踪(VAT)旨在控制相机在三维空间内跟随目标,对于无人机导航和安全监控等应用至关重要。然而,实际部署中面临两个关键瓶颈:由于实例级区分不足导致的视觉相似干扰物混淆,以及由于缺乏主动规划而导致的严重遮挡失效。为了解决这些问题,我们提出了OA-VAT,这是一种统一的管道,包含三个互补模块。首先,一种无需训练的实例感知离线原型初始化模块通过DINOv3聚合多视角增强特征,构建区分性实例原型,减轻干扰物混淆。其次,一种在线原型增强跟踪器在线增强原型,并结合一种基于置信度的卡尔曼滤波器,以应对外观和运动变化下的稳定跟踪。第三,一种遮挡感知轨迹规划器,基于我们新构建的Planning-20k数据集进行训练,使用条件扩散生成避障路径,以恢复遮挡。实验表明,OA-VAT在UnrealCV上实现了0.93的平均SR(比SOTA TrackVLA高2.2%),在真实世界数据集上实现了90.8%的平均CAR(比SOTA GC-VAT高12.1%),在DJI Tello无人机上实现了81.6%的TSR。在RTX 3090上运行速度为35 FPS,实现了稳健的实时性能,适用于实际部署。
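One plausible reading of a "confidence-aware Kalman filter" is a filter whose measurement noise grows when the tracker's confidence drops. The sketch below shows that idea for a one-dimensional constant-position model. The scaling rule `r_base / confidence`, the function name `kalman_step`, and all parameter values are assumptions for illustration, not the formulation used in OA-VAT.

```python
def kalman_step(x, p, z, confidence, q=0.01, r_base=0.1):
    """One predict/update step of a 1D Kalman filter.

    x, p: prior state estimate and its variance.
    z: new measurement (e.g. a target coordinate).
    confidence: detection confidence in (0, 1]; low confidence inflates
        the measurement noise (assumed rule, not from the paper).
    """
    # Predict: constant-position model, so only the variance grows.
    p_pred = p + q
    # Confidence-scaled measurement noise.
    r = r_base / max(confidence, 1e-6)
    # Update: the Kalman gain balances prediction against measurement.
    k = p_pred / (p_pred + r)
    x_new = x + k * (z - x)
    p_new = (1.0 - k) * p_pred
    return x_new, p_new


# The same measurement is trusted less when confidence is low.
x_high, _ = kalman_step(x=0.0, p=1.0, z=1.0, confidence=1.0)
x_low, _ = kalman_step(x=0.0, p=1.0, z=1.0, confidence=0.1)
print(x_high, x_low)
```

With high confidence the estimate moves most of the way to the measurement; with low confidence it stays closer to the prediction, which is the stabilizing behavior one would want under occlusion or appearance change.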
PLAF: Pixel-wise Language-Aligned Feature Extraction for Efficient 3D Scene Understanding
Authors: Junjie Wen, Junlin He, Fei Ma, Jinqiang Cui
First: 2026-04-17T07:24:14+00:00 · Latest: 2026-04-23T09:00:00+00:00
Comments: Accepted by ICCA 2026
Abstract
Accurate open-vocabulary 3D scene understanding requires semantic representations that are both language-aligned and spatially precise at the pixel level, while remaining scalable when lifted to 3D space. However, existing representations struggle to jointly satisfy these requirements, and densely propagating pixel-wise semantics to 3D often results in substantial redundancy, leading to inefficient storage and querying in large-scale scenes. To address these challenges, we present PLAF, a Pixel-wise Language-Aligned Feature extraction framework that enables dense and accurate semantic alignment in 2D without sacrificing open-vocabulary expressiveness. Building upon this representation, we further design an efficient semantic storage and querying scheme that significantly reduces redundancy across both 2D and 3D domains. Experimental results show that PLAF provides a strong semantic foundation for accurate and efficient open-vocabulary 3D scene understanding. The code is publicly available at https://github.com/RockWenJJ/PLAF.
中文标题/摘要
标题:PLAF:像素级语言对齐特征提取以实现高效的3D场景理解
准确的开放词汇3D场景理解需要同时在像素级别上具有语义对齐和空间精确性的语义表示,同时在提升到3D空间时保持可扩展性。然而,现有的表示方法难以同时满足这些要求,而密集传播像素级语义到3D通常会导致大量冗余,导致在大规模场景中存储和查询效率低下。为了解决这些挑战,我们提出了PLAF,一种像素级语言对齐特征提取框架,能够在2D中实现密集且准确的语义对齐,而不牺牲开放词汇的表达能力。在此表示基础上,我们进一步设计了一种高效的语义存储和查询方案,显著减少了2D和3D域中的冗余。实验结果表明,PLAF为准确高效的开放词汇3D场景理解提供了强大的语义基础。代码已公开发布在https://github.com/RockWenJJ/PLAF。
VG-CoT: Towards Trustworthy Visual Reasoning via Grounded Chain-of-Thought
Authors: Byeonggeuk Lim, Kyeonghyun Kim, JungMin Yun, YoungBin Kim
First: 2026-04-23T08:04:07+00:00 · Latest: 2026-04-23T08:04:07+00:00
Comments: Accepted to LREC 2026
Abstract
The advancement of Large Vision-Language Models (LVLMs) requires precise local region-based reasoning that faithfully grounds the model's logic in actual visual evidence. However, existing datasets face limitations in scalability due to extensive manual annotation and lack of explicit alignment between multi-step reasoning and corresponding image regions, which constrains the evaluation of model trustworthiness. To address these challenges, we propose the Visual Grounding Chain-of-Thought (VG-CoT) dataset, which explicitly links each reasoning step to real visual evidence within the image through a fully automated three-stage pipeline. The pipeline first extracts object- and text-level visual evidence using state-of-the-art detection and OCR models, then generates step-by-step grounded reasoning with GPT-4o, and finally refines the grounding through a rationale-driven open-set detection process. In addition, we introduce a new benchmark that comprehensively evaluates LVLM reasoning across three complementary dimensions: Rationale Quality, Answer Accuracy, and Reasoning-Answer Alignment. Experiments with representative LVLMs, including LLaVA-1.5 and Qwen2-VL, demonstrate consistent improvements on most evaluation metrics, confirming that VG-CoT effectively enhances trustworthy, evidence-based reasoning while maintaining scalable and cost-efficient dataset construction. The dataset and code will be released publicly upon acceptance to facilitate further research.
中文标题/摘要
标题:VG-CoT:通过基于视觉的链式思考迈向可信赖的视觉推理
大型视觉-语言模型(LVLMs)的进步需要精确的局部区域推理,使模型的逻辑与实际视觉证据紧密对接。然而,现有数据集由于大量手动标注和缺乏多步推理与相应图像区域的显式对齐而面临可扩展性限制,这限制了对模型可信度的评估。为解决这些挑战,我们提出了视觉对接链式思考(VG-CoT)数据集,通过全自动三阶段管道将每个推理步骤明确链接到图像中的实际视觉证据。该管道首先使用最先进的检测和OCR模型提取对象和文本级别的视觉证据,然后使用GPT-4o生成逐步对接推理,最后通过基于推理的开放集检测过程进行对接细化。此外,我们引入了一个新的基准,从三个互补维度全面评估LVLMs的推理能力:推理质量、答案准确性以及推理-答案对齐。实验表明,包括LLaVA-1.5和Qwen2-VL在内的代表性LVLMs在大多数评估指标上都取得了持续改进,证实VG-CoT有效提升了基于证据的可信推理能力,同时保持了可扩展和成本效益的数据集构建。数据集和代码将在接受后公开发布,以促进进一步研究。
Summary / 总结
VG-CoT is a dataset designed to improve the trustworthiness of visual reasoning by linking each reasoning step to actual visual evidence in images. It is built with a fully automated three-stage pipeline that extracts visual evidence, generates grounded reasoning, and refines the grounding. An accompanying benchmark evaluates models on three dimensions: rationale quality, answer accuracy, and reasoning-answer alignment. Experiments with LLaVA-1.5 and Qwen2-VL show consistent improvements on most metrics, indicating that VG-CoT enhances evidence-based reasoning while keeping dataset construction scalable and cost-efficient.
VG-CoT 是一个旨在通过将每个推理步骤与图像中的实际视觉证据联系起来来提高视觉推理可信度的数据集。它使用三阶段管道来提取视觉证据、生成基于视觉的推理并对其进行细化。该数据集从三个维度评估模型:推理质量、答案准确性以及推理与答案的一致性。实验表明,VG-CoT 提高了基于证据的推理和模型可信度,同时保持了可扩展性和成本效率。
Prototype-Based Test-Time Adaptation of Vision-Language Models
Authors: Zhaohong Huang, Yuxin Zhang, Wenjing Liu, Fei Chao, Rongrong Ji
First: 2026-04-23T07:20:56+00:00 · Latest: 2026-04-23T07:20:56+00:00
Abstract
Test-time adaptation (TTA) has emerged as a promising paradigm for vision-language models (VLMs) to bridge the distribution gap between pre-training and test data. Recent works have focused on backpropagation-free TTA methods that rely on cache-based designs, but these introduce two key limitations. First, inference latency increases as the cache grows with the number of classes, leading to inefficiencies in large-scale settings. Second, suboptimal performance occurs when the cache contains insufficient or incorrect samples. In this paper, we present Prototype-Based Test-Time Adaptation (PTA), an efficient and effective TTA paradigm that uses a set of class-specific knowledge prototypes to accumulate knowledge from test samples. Particularly, knowledge prototypes are adaptively weighted based on the zero-shot class confidence of each test sample, incorporating the sample's visual features into the corresponding class-specific prototype. It is worth highlighting that the knowledge from past test samples is integrated and utilized solely in the prototypes, eliminating the overhead of cache population and retrieval that hinders the efficiency of existing TTA methods. This endows PTA with extremely high efficiency while achieving state-of-the-art performance on 15 image recognition benchmarks and 4 robust point cloud analysis benchmarks. For example, PTA improves CLIP's accuracy from 65.64% to 69.38% on 10 cross-domain benchmarks, while retaining 92% of CLIP's inference speed on large-scale ImageNet-1K. In contrast, the cache-based TDA achieves a lower accuracy of 67.97% and operates at only 50% of CLIP's inference speed.
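The prototype accumulation described in this abstract can be sketched as a confidence-weighted running mean kept per class. The update rule, the function name `update_prototype`, and the toy features below are assumptions for illustration; the abstract does not give PTA's exact formula.

```python
def update_prototype(prototypes, weight_sums, feature, class_id, confidence):
    """Fold one test feature into a class prototype.

    prototypes: dict mapping class id -> feature vector (list of floats).
    weight_sums: dict mapping class id -> accumulated confidence so far.
    confidence: zero-shot confidence for class_id; it acts as the sample's
        weight in the running mean (assumed rule).
    """
    total = weight_sums[class_id] + confidence
    if total <= 0:
        return  # nothing to accumulate
    old_weight = weight_sums[class_id]
    prototypes[class_id] = [
        (old_weight * p + confidence * f) / total
        for p, f in zip(prototypes[class_id], feature)
    ]
    weight_sums[class_id] = total


# One class, two-dimensional features, two test samples.
prototypes = {0: [0.0, 0.0]}
weight_sums = {0: 0.0}
update_prototype(prototypes, weight_sums, [1.0, 0.0], class_id=0, confidence=0.9)
update_prototype(prototypes, weight_sums, [0.0, 1.0], class_id=0, confidence=0.1)
print(prototypes[0])
```

Storage in this sketch is one vector and one scalar per class regardless of how many test samples arrive, which is the property that distinguishes a prototype from a growing cache.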
Symbolic Grounding Reveals Representational Bottlenecks in Abstract Visual Reasoning
Authors: Mohit Vaishnav, Tanel Tammet
Venue: 30th Conference on Computational Natural Language Learning (CoNLL), 2026
First: 2026-04-23T07:03:48+00:00 · Latest: 2026-04-23T07:03:48+00:00
Abstract
Vision-language models (VLMs) often fail on abstract visual reasoning benchmarks such as Bongard problems, raising the question of whether the main bottleneck lies in reasoning or representation. We study this on Bongard-LOGO, a synthetic benchmark of abstract concept learning with ground-truth generative programs, by comparing end-to-end VLMs on raw images with large language models (LLMs) given symbolic inputs derived from those images. Using symbolic inputs as a diagnostic probe rather than a practical multimodal architecture, our Componential-Grammatical (C-G) paradigm reformulates Bongard-LOGO as a symbolic reasoning task based on LOGO-style action programs or structured descriptions. LLMs achieve large and consistent gains, reaching mid-90s accuracy on Free-form problems, while a strong visual baseline remains near chance under matched task definitions. Ablations on input format, explicit concept prompts, and minimal visual grounding show that these factors matter much less than the shift from pixels to symbolic structure. These results identify representation as a key bottleneck in abstract visual reasoning and show how symbolic input can serve as a controlled diagnostic upper bound.
Summary / 总结
The study investigates the limitations of vision-language models (VLMs) in abstract visual reasoning by comparing their performance on raw images with large language models (LLMs) given symbolic inputs derived from those images on the Bongard-LOGO benchmark. The research finds that LLMs achieve significant improvements when provided with symbolic inputs, reaching high accuracy levels, while VLMs remain near chance. This suggests that representation is a critical bottleneck in abstract visual reasoning, and symbolic inputs can serve as a useful diagnostic tool.
研究通过将视觉-语言模型(VLMs)与大型语言模型(LLMs)在Bongard-LOGO基准上的表现进行比较,来探讨其在抽象视觉推理方面的局限性。研究发现,当LLMs获得从图像中派生的符号输入时,它们可以取得显著的改进,达到较高的准确率,而VLMs则保持在随机水平附近。这表明表示是抽象视觉推理中的关键瓶颈,而符号输入可以作为有用的诊断性上限工具。
Semantic-Fast-SAM: Efficient Semantic Segmenter
Authors: Byunghyun Kim
First: 2026-04-22T04:18:39+00:00 · Latest: 2026-04-23T05:32:11+00:00
Comments: APSIPA ASC 2025
Abstract
We propose Semantic-Fast-SAM (SFS), a semantic segmentation framework that combines the Fast Segment Anything model with a semantic labeling pipeline to achieve real-time performance without sacrificing accuracy. FastSAM is an efficient CNN-based re-implementation of the Segment Anything Model (SAM) that runs much faster than the original transformer-based SAM. Building upon FastSAM's rapid mask generation, we integrate a Semantic-Segment-Anything (SSA) labeling strategy to assign meaningful categories to each mask. The resulting SFS model produces high-quality semantic segmentation maps at a fraction of the computational cost and memory footprint of the original SAM-based approach. Experiments on Cityscapes and ADE20K benchmarks demonstrate that SFS matches the accuracy of prior SAM-based methods (mIoU ~ 70.33 on Cityscapes and 48.01 on ADE20K) while achieving approximately 20x faster inference than SSA in the closed-set setting. We also show that SFS effectively handles open-vocabulary segmentation by leveraging CLIP-based semantic heads, outperforming recent open-vocabulary models on broad class labeling. This work enables practical real-time semantic segmentation with the "segment-anything" capability, broadening the applicability of foundation segmentation models in robotics scenarios. The implementation is available at https://github.com/KBH00/Semantic-Fast-SAM.
中文标题/摘要
标题:Semantic-Fast-SAM:高效语义分割器
我们提出了Semantic-Fast-SAM (SFS),这是一种结合了Fast Segment Anything模型和语义标注管道的语义分割框架,能够在不牺牲准确性的前提下实现实时性能。FastSAM是Segment Anything Model (SAM) 的高效CNN重实现版本,运行速度远超原始的基于变换器的SAM。基于FastSAM快速生成掩码的特点,我们整合了语义分割一切 (SSA) 标注策略,为每个掩码分配有意义的类别。最终,SFS模型以原SAM方法极小的计算成本和内存占用生成高质量的语义分割图。在Cityscapes和ADE20K基准测试中,SFS的mIoU分别达到70.33和48.01,同时在封闭集设置中比SSA快约20倍的推理速度。我们还展示了SFS在开放词汇分割中的有效应用,通过利用CLIP基语义头超越了最近的开放词汇模型在广泛类别标注中的表现。这项工作使实用的实时语义分割成为可能,扩展了基础分割模型在机器人场景中的应用范围。实现代码可在https://github.com/KBH00/Semantic-Fast-SAM/ 获取。
Summary / 总结
Semantic-Fast-SAM (SFS) combines FastSAM with a semantic labeling pipeline to achieve real-time semantic segmentation with high accuracy. SFS generates high-quality segmentation maps at a fraction of the computational cost compared to the original SAM-based approach. Experiments show that SFS matches the accuracy of prior SAM-based methods while being approximately 20x faster in the closed-set setting and outperforming recent open-vocabulary models in open-vocabulary segmentation tasks.
Semantic-Fast-SAM (SFS) 结合 FastSAM 和语义标注管道,实现了实时高精度语义分割,匹配 Cityscapes 和 ADE20K 基准上的先前 SAM 基方法,同时比封闭集设置下的 SSA 快约 20 倍。SFS 使用基于 CLIP 的语义头进行开放词汇分割,优于近期模型在广泛类别标注上的表现。这项工作使实时语义分割在机器人场景中成为可能。
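The mIoU figures quoted for Cityscapes and ADE20K use the standard mean intersection-over-union metric. As a reminder of how it is computed, here is a minimal sketch from a confusion matrix. The function name `mean_iou` and the toy matrix are illustrative, and skipping classes with no pixels is one common convention rather than the benchmarks' official evaluation code.

```python
def mean_iou(confusion):
    """Mean IoU from a square confusion matrix.

    confusion[i][j] counts pixels whose true class is i and whose
    predicted class is j. Classes with no true or predicted pixels
    are skipped (a common convention, assumed here).
    """
    n = len(confusion)
    ious = []
    for c in range(n):
        tp = confusion[c][c]
        fn = sum(confusion[c]) - tp                       # missed pixels of class c
        fp = sum(confusion[r][c] for r in range(n)) - tp  # pixels wrongly labeled c
        denom = tp + fp + fn
        if denom > 0:
            ious.append(tp / denom)
    return sum(ious) / len(ious) if ious else 0.0


# Toy two-class example with made-up pixel counts.
toy_confusion = [[3, 1],
                 [1, 5]]
score = mean_iou(toy_confusion)
print(f"mIoU: {score:.4f}")
```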
BiTDiff: Fine-Grained 3D Conducting Motion Generation via BiMamba-Transformer Diffusion
Authors: Tianzhi Jia, Kaixing Yang, Xiaole Yang, Xulong Tang, Ke Qiu, Shikui Wei, Yao Zhao
First: 2026-04-06T03:49:36+00:00 · Latest: 2026-04-23T03:50:19+00:00
Comments: 15 pages, 7 figures
Abstract
3D conducting motion generation aims to synthesize fine-grained conductor motions from music, with broad potential in music education, virtual performance, digital human animation, and human-AI co-creation. However, this task remains underexplored due to two major challenges: (1) the lack of large-scale fine-grained 3D conducting datasets and (2) the absence of effective methods that can jointly support long-sequence generation with high quality and efficiency. To address the data limitation, we develop a quality-oriented 3D conducting motion collection pipeline and construct CM-Data, a fine-grained SMPL-X dataset with about 10 hours of conducting motion data. To the best of our knowledge, CM-Data is the first and largest public dataset for 3D conducting motion generation. To address the methodological limitation, we propose BiTDiff, a novel framework for 3D conducting motion generation, built upon a BiMamba-Transformer hybrid model architecture for efficient long-sequence modeling and a Diffusion-based generative strategy with human-kinematic decomposition for high-quality motion synthesis. Specifically, BiTDiff introduces auxiliary physical-consistency losses and a hand-/body-specific forward-kinematics design for better fine-grained motion modeling, while leveraging BiMamba for memory-efficient long-sequence temporal modeling and Transformer for cross-modal semantic alignment. In addition, BiTDiff supports training-free joint-level motion editing, enabling downstream human-AI interaction design. Extensive quantitative and qualitative experiments demonstrate that BiTDiff achieves state-of-the-art (SOTA) performance for 3D conducting motion generation on the CM-Data dataset. Code will be available upon acceptance.
中文标题/摘要
标题:BiTDiff:通过BiMamba-Transformer扩散生成细粒度3D指挥动作
3D指挥动作生成旨在从音乐中合成细粒度的指挥动作,具有广泛的应用潜力,包括音乐教育、虚拟表演、数字人类动画和人机共创。然而,由于两个主要挑战:(1)缺乏大规模的细粒度3D指挥动作数据集,(2)缺乏能够同时支持高质量和高效长序列生成的有效方法,这一任务仍处于未被充分探索的状态。为了解决数据限制,我们开发了一种质量导向的3D指挥动作采集流水线,并构建了CM-Data,这是一个包含约10小时指挥动作数据的细粒度SMPL-X数据集。据我们所知,CM-Data是第一个也是最大的公开3D指挥动作生成数据集。为了解决方法论限制,我们提出了BiTDiff,一种基于BiMamba-Transformer混合模型架构的新颖框架,用于高效长序列建模,并采用基于扩散的生成策略与人体运动分解,以实现高质量动作合成。具体而言,BiTDiff引入了辅助物理一致性损失和手/身体特定的前向运动设计,以更好地进行细粒度动作建模,同时利用BiMamba进行高效长序列时间建模,并利用Transformer进行跨模态语义对齐。此外,BiTDiff支持无需训练的关节级动作编辑,使下游的人机交互设计成为可能。广泛的定量和定性实验表明,BiTDiff在CM-Data数据集上的3D指挥动作生成性能达到了最先进的水平。代码将在接受后提供。
PAT3D: Physics-Augmented Text-to-3D Scene Generation
Authors: Guying Lin, Kemeng Huang, Michael Liu, Ruihan Gao, Hanke Chen, Lyuhao Chen, Beijia Lu, Taku Komura, Yuan Liu, Jun-Yan Zhu, Minchen Li
First: 2025-11-26T23:23:58+00:00 · Latest: 2026-04-23T03:17:53+00:00
Comments: 19 pages, 12 figures
Abstract
We introduce PAT3D, the first physics-augmented text-to-3D scene generation framework that integrates vision-language models with physics-based simulation to produce physically plausible, simulation-ready, and intersection-free 3D scenes. Given a text prompt, PAT3D generates 3D objects, infers their spatial relations, and organizes them into a hierarchical scene tree, which is then converted into initial conditions for simulation. A differentiable rigid-body simulator ensures realistic object interactions under gravity, driving the scene toward static equilibrium without interpenetrations. To further enhance scene quality, we introduce a simulation-in-the-loop optimization procedure that guarantees physical stability and non-intersection, while improving semantic consistency with the input prompt. Experiments demonstrate that PAT3D substantially outperforms prior approaches in physical plausibility, semantic consistency, and visual quality. Beyond high-quality generation, PAT3D uniquely enables simulation-ready 3D scenes for downstream tasks such as scene editing and robotic manipulation. Code and data are available at: https://github.com/Simulation-Intelligence/PAT3D.
中文标题/摘要
标题:PAT3D:物理增强的文本到3D场景生成框架
我们介绍了PAT3D,这是第一个将视觉语言模型与基于物理的模拟相结合的物理增强的文本到3D场景生成框架,以生成物理上合理、可模拟且无交叠的3D场景。给定一个文本提示,PAT3D生成3D对象,推断它们的空间关系,并将它们组织成层次场景树,然后将其转换为模拟的初始条件。可微刚体模拟器确保在重力作用下物体之间的现实交互,使场景向静态平衡发展,而不会发生交叠。为了进一步提高场景质量,我们引入了一种在模拟过程中优化的程序,以确保物理稳定性、无交叠,并提高与输入提示的语义一致性。实验表明,PAT3D在物理合理性、语义一致性和视觉质量方面显著优于先前的方法。除了高质量的生成外,PAT3D还独特地为下游任务如场景编辑和机器人操作提供了可模拟的3D场景。代码和数据可在:https://github.com/Simulation-Intelligence/PAT3D/ 获取。
Summary / 总结
PAT3D is a physics-augmented text-to-3D scene generation framework that combines vision-language models with physics-based simulation to create physically plausible and simulation-ready 3D scenes. Given a text prompt, PAT3D generates 3D objects, infers their spatial relations, and organizes them into a hierarchical scene tree, which is then used to initialize a differentiable rigid-body simulator. This simulator ensures realistic interactions and static equilibrium. Additionally, a simulation-in-the-loop optimization procedure is employed to enhance physical stability and semantic consistency. Experiments show that PAT3D outperforms previous methods in physical plausibility, semantic consistency, and visual quality, and it enables simulation-ready 3D scenes for tasks like scene editing and robotic manipulation.
PAT3D 是一种结合了视觉语言模型和物理基础模拟的物理增强文本到3D场景生成框架,用于生成物理上合理且可用于模拟的3D场景。给定一个文本提示,PAT3D 生成3D物体,推断它们的空间关系,并组织成层次化的场景树,然后用于初始化一个可微刚体模拟器。该模拟器确保了真实的物体交互和静态平衡。此外,还引入了一种模拟循环优化过程,以增强物理稳定性和语义一致性。实验表明,PAT3D 在物理合理性、语义一致性和视觉质量方面优于先前的方法,并且能够为场景编辑和机器人操作等任务生成可用于模拟的3D场景。
Reinforcing 3D Understanding in Point-VLMs via Geometric Reward Credit Assignment
Authors: Jingkun Chen, Ruoshi Xu, Mingqi Gao, Shengda Luo, Jungong Han
First: 2026-04-23T00:01:40+00:00 · Latest: 2026-04-23T00:01:40+00:00
Comments: 10 pages, 3 figures, 5 tables
Abstract
Point-Vision-Language Models promise to empower embodied agents with executable spatial reasoning, yet they frequently succumb to geometric hallucination where predicted 3D structures contradict the observed 2D reality. We identify a key cause of this failure not as a representation bottleneck but as a structural misalignment in reinforcement learning, where sparse geometric tokens are drowned out by noisy and broadcasted sequence-level rewards. To resolve this causal dilution, we propose Geometric Reward Credit Assignment, a framework that disentangles holistic supervision into field-specific signals and routes them exclusively to their responsible token spans. This mechanism transforms vague feedback into precise gradient updates and effectively turns generic policy optimization into targeted structural alignment. Furthermore, we internalize physical constraints via a Reprojection-Consistency term which serves as a cross-modal verifier to penalize physically impossible geometries. Validated on a calibrated benchmark derived from ShapeNetCore, our approach bridges the reliability gap by boosting 3D KPA from 0.64 to 0.93, increasing 3D bounding box intersection over union to 0.686, and raising reprojection consistency scores to 0.852. Crucially, these gains are achieved while maintaining robust 2D localization performance, marking a meaningful step from plausible textual outputs toward physically verifiable spatial predictions.
中文标题/摘要
标题:通过几何奖励信用分配强化点-VLMs的3D理解
点-视觉-语言模型有望赋予具身代理执行空间推理的能力,但它们经常受到几何幻觉的影响,其中预测的3D结构与观察到的2D现实相矛盾。我们发现这种失败的主要原因不是表示瓶颈,而是强化学习中的结构错位,其中稀疏的几何标记被嘈杂的和广播的序列级奖励所淹没。为了解决这种因果稀释,我们提出了几何奖励信用分配框架,该框架将整体监督分解为特定领域的信号,并仅将其路由到其负责的标记跨度。该机制将模糊的反馈转化为精确的梯度更新,并有效地将通用策略优化转变为有针对性的结构对齐。此外,我们通过引入再投影一致性项内化物理约束,该项作为跨模态验证器,惩罚物理上不可能的几何结构。在ShapeNetCore校准基准上验证,我们的方法通过将3D KPA从0.64提升到0.93,将3D边界框交并比提高到0.686,将再投影一致性分数提高到0.852,填补了可靠性的差距。关键的是,这些收益是在保持稳健的2D定位性能的同时实现的,标志着从可能的文本输出向可验证的空间预测迈出了一步。
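The contrast between a broadcast sequence-level reward and field-specific routing can be sketched in a few lines. The field names (`bbox_3d`, `keypoints`), the span layout, and the reward values are hypothetical; the abstract does not specify how Geometric Reward Credit Assignment defines its fields or rewards.

```python
def broadcast_reward(num_tokens, sequence_reward):
    """Baseline: every token receives the same sequence-level reward."""
    return [sequence_reward] * num_tokens


def route_rewards(num_tokens, field_spans, field_rewards, default=0.0):
    """Assign each field's reward only to the tokens that produced it.

    field_spans: dict mapping field name -> (start, end) token indices,
        end exclusive.
    field_rewards: dict mapping field name -> scalar reward.
    """
    per_token = [default] * num_tokens
    for field, (start, end) in field_spans.items():
        for i in range(start, end):
            per_token[i] = field_rewards[field]
    return per_token


# Six output tokens: tokens 1-2 encode a 3D box, tokens 4-5 encode keypoints.
spans = {"bbox_3d": (1, 3), "keypoints": (4, 6)}
rewards = {"bbox_3d": 0.5, "keypoints": -1.0}
routed = route_rewards(6, spans, rewards)
flat = broadcast_reward(6, sum(rewards.values()))
print(routed)
print(flat)
```

In the routed version the poor keypoint reward affects only the keypoint tokens, whereas the broadcast version penalizes the correctly predicted box tokens as well.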
Weighting What Matters: Boosting Sample Efficiency in Medical Report Generation via Token Reweighting
Authors: Alexander Weers, Daniel Rueckert, Martin J. Menten
First: 2026-04-22T20:51:17+00:00 · Latest: 2026-04-22T20:51:17+00:00
Abstract
Training vision-language models (VLMs) for medical report generation is often hindered by the scarcity of high-quality annotated data. This work evaluates the use of a weighted loss function to improve data efficiency. Compared to standard cross-entropy loss, which treats all token prediction errors equally, the reweighted loss shifts the focus to semantically salient tokens with outsized clinical importance. In experiments on ophthalmological report generation, we show that this simple method improves efficiency across multiple data scales, achieving similar report quality with up to ten times less training data.
Summary / 总结
This study addresses the challenge of limited annotated data in training vision-language models for medical report generation. It introduces a weighted loss function that prioritizes semantically important tokens, enhancing model efficiency. Experiments on ophthalmological report generation demonstrate that this method can achieve comparable report quality using up to ten times less training data compared to standard cross-entropy loss.
该研究旨在解决在训练用于医疗报告生成的视觉-语言模型时标注数据稀缺的问题。它引入了一种加权损失函数,强调语义上重要的令牌,从而提高模型的效率。实验结果显示,这种方法可以在使用多达十倍少的训练数据的情况下,达到相当的报告质量。
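A token-reweighted loss of the kind described above can be sketched as a weighted mean of per-token negative log-likelihoods. How the paper selects its weights is not stated in the abstract, so the weights, the normalization by total weight, and the function name `weighted_cross_entropy` are assumptions.

```python
import math


def weighted_cross_entropy(token_probs, weights):
    """Weighted mean of per-token negative log-likelihoods.

    token_probs: probability the model assigned to each correct token.
    weights: hypothetical per-token importance weights; uniform weights
        recover the standard mean cross-entropy.
    """
    if len(token_probs) != len(weights):
        raise ValueError("token_probs and weights must have equal length")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    losses = [-math.log(p) for p in token_probs]
    return sum(w * l for w, l in zip(weights, losses)) / total_weight


# Made-up probabilities; the third token stands in for a clinically
# important term that the model predicts poorly.
probs = [0.9, 0.8, 0.2, 0.95]
uniform = weighted_cross_entropy(probs, [1, 1, 1, 1])
reweighted = weighted_cross_entropy(probs, [1, 1, 5, 1])
print(uniform, reweighted)
```

Upweighting the poorly predicted token raises the loss, so gradient updates concentrate on the tokens deemed important rather than on filler words.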
Foveated Reasoning: Stateful, Action-based Visual Focusing for Vision-Language Models
Authors: Juhong Min, Lazar Valkov, Vitali Petsiuk, Hossein Souri, Deen Dayal Mohan
First: 2026-04-22T20:44:24+00:00 · Latest: 2026-04-22T20:44:24+00:00
Abstract
Vision-language models benefit from high-resolution images, but the increase in visual-token count incurs high compute overhead. Humans resolve this tension via foveation: a coarse view guides "where to look", while selectively acquired high-acuity evidence refines "what to think". We introduce Foveated Reasoner, an autoregressive vision-language framework that unifies foveation and reasoning within a single decoding trajectory. Starting from a low-resolution view, the model triggers foveation only when needed, retrieves high-resolution evidence from selected regions, and injects it back into the same decoding trajectory. We train the method with a two-stage pipeline: coldstart supervision to bootstrap foveation behavior, followed by reinforcement learning to jointly improve evidence acquisition and task accuracy while discouraging trivial "see-everything" solutions. Experiments show that the method learns effective foveation policies and achieves stronger accuracy under tight visual-token budgets across multiple vision-language benchmarks.
中文标题/摘要
标题:注视点推理:面向视觉语言模型的有状态、基于动作的视觉聚焦
视觉语言模型受益于高分辨率图像,但视觉标记数量的增加导致了高昂的计算开销。人类通过注视点解决这一矛盾:粗略的视图指导“看哪里”,而有选择地获取高分辨率证据则细化“思考什么”。我们引入了注视点推理器,这是一种自回归的视觉语言框架,将注视点和推理统一在一个解码轨迹中。从低分辨率视图开始,模型仅在需要时触发注视点,从选定区域检索高分辨率证据,并将其注入相同的解码轨迹。我们通过两阶段训练方法进行训练:冷启动监督以启动注视点行为,随后通过强化学习共同提高证据获取和任务准确性,同时避免简单的“看一切”解决方案。实验表明,该方法学习了有效的注视点策略,并在多个视觉语言基准测试中实现了更强的准确性,即使在视觉标记预算紧张的情况下也是如此。
Summary / 总结
The research aims to reduce the computational cost of vision-language models by introducing foveated reasoning, which mimics human foveation. The method starts with a low-resolution view and selectively acquires high-resolution evidence only when necessary. It is trained using a two-stage pipeline: coldstart supervision to initiate foveation and reinforcement learning to optimize evidence acquisition and task accuracy. The experiments demonstrate that the model effectively learns foveation policies and performs better under limited visual-token budgets across various vision-language benchmarks.
研究旨在通过引入仿人视网膜机制的注视推理来降低视觉语言模型的计算成本。方法从低分辨率视图开始,仅在必要时选择性地获取高分辨率证据。该方法通过两阶段训练管道进行训练:冷启动监督以启动注视行为,以及强化学习以优化证据获取和任务准确性。实验表明,该模型能够有效地学习注视策略,并在各种视觉语言基准测试中在有限的视觉标记预算下表现出更好的准确性。
InVitroVision: a Multi-Modal AI Model for Automated Description of Embryo Development using Natural Language
Authors: Nicklas Neu, Thomas Ebner, Jasmin Primus, Raphael Zefferer, Bernhard Schenkenfelder, Mathias Brunbauer, Florian Kromp
First: 2026-04-22T20:05:37+00:00 · Latest: 2026-04-22T20:05:37+00:00
Comments: 15 pages, 2 figures
Abstract
The application of artificial intelligence (AI) in IVF has shown promise in improving consistency and standardization of decisions, but often relies on annotated data and does not make use of the multimodal nature of IVF data. We investigated whether foundational vision-language models can be fine-tuned to predict natural language descriptions of embryo morphology and development. Using a publicly available embryo time-lapse dataset, we fine-tuned PaliGemma-2, a multi-modal vision-language model, with only 1,000 images and corresponding captions, describing embryo morphology, embryonic cell cycle and developmental stage. Our results show that the fine-tuned model, InVitroVision, outperformed a commercial model, ChatGPT 5.2, and base models in overall metrics, with performance improving with larger training datasets. This study demonstrates the potential of foundational vision-language models to generalize to IVF tasks with limited data, enabling the prediction of natural language descriptions of embryo morphology and development. This approach may facilitate the use of large language models to retrieve information and scientific evidence from relevant publications and guidelines, and has implications for few-shot adaptation to multiple downstream tasks in IVF.
Unlocking Multi-Spectral Data for Multi-Modal Models with Guided Inputs and Chain-of-Thought Reasoning
Authors: Dahun Kim, Ganesh Satish Mallya, Anelia Angelova
First: 2026-04-22T19:23:52+00:00 · Latest: 2026-04-22T19:23:52+00:00
Comments: Accepted to IGARSS 2026
Abstract
Multi-spectral imagery is a valuable input signal for Remote Sensing applications, such as land-use and land-cover classification and environmental monitoring. However, generalist Large Multi-modal Models (LMMs) are typically trained on RGB images, limiting their applicability to the RGB domain. At the same time, training multi-spectral multi-modal models is expensive and produces uniquely specialized models. To address this, we propose a novel training-free approach that introduces multi-spectral data within the inference pipeline of standard RGB-only LMMs, allowing large gains in performance. Our approach leverages the LMMs' understanding of the visual space by adapting non-RGB inputs to that space and injecting domain-specific information and Chain-of-Thought reasoning as instructions. We demonstrate this with the Gemini 2.5 model and observe strong Zero-Shot performance gains on popular Remote Sensing benchmarks. These results highlight the potential for geospatial professionals to leverage powerful generalist models for specialized sensor inputs, benefiting from rich reasoning capabilities grounded in specialized data.
Summary / 总结
The paper proposes a training-free approach to enhance the performance of Large Multi-modal Models (LMMs) on multi-spectral imagery by integrating multi-spectral data into the inference pipeline of standard RGB-only LMMs. This method adapts non-RGB inputs to the visual space understood by LMMs and injects domain-specific information and Chain-of-Thought reasoning as instructions. The approach shows strong Zero-Shot performance gains on Remote Sensing benchmarks using the Gemini 2.5 model, indicating the potential for leveraging generalist models with specialized sensor inputs.
论文提出了一种无需训练的方法,通过将多光谱数据集成到标准仅RGB的大型多模态模型(LMM)的推理管道中,来提升其在多光谱图像上的性能。该方法通过将非RGB输入适应LMM理解的视觉空间,并注入领域特定信息和链式思考推理作为指令。该方法使用Gemini 2.5模型在遥感基准测试中展示了强大的零样本性能提升,表明可以利用通用模型处理专门的传感器输入,从而受益于丰富的基于专门数据的推理能力。
GeCo: Evaluating Geometric Consistency for Video Generation via Motion and Structure
Authors: Leslie Gu, Junhwa Hur, Charles Herrmann, Fangneng Zhan, Todd Zickler, Deqing Sun, Hanspeter Pfister
First: 2025-12-25T03:28:28+00:00 · Latest: 2026-04-22T18:38:16+00:00
Abstract
We introduce GeCo, a geometry-grounded metric for jointly detecting geometric deformation and occlusion-inconsistency artifacts in static scenes. By fusing residual motion and depth priors, GeCo produces interpretable, dense consistency maps that reveal these artifacts. We use GeCo to systematically benchmark recent video generation models, uncovering common failure modes, and further employ it as a training-free guidance loss to reduce deformation artifacts during video generation.
中文标题/摘要
标题:GeCo:通过运动和结构评估视频生成的几何一致性
我们引入了GeCo,这是一种基于几何的度量标准,用于同时检测静态场景中的几何变形和遮挡不一致的伪影。通过融合残差运动和深度先验,GeCo生成可解释的密集一致性图,揭示这些伪影。我们使用GeCo系统地评估了近期的视频生成模型,发现了常见的失效模式,并进一步将其作为无需训练的引导损失,以减少视频生成过程中的变形伪影。
Summary / 总结
GeCo is a geometry-based metric designed to detect geometric deformation and occlusion-inconsistency in static scenes by combining residual motion and depth priors, generating dense consistency maps. It was used to evaluate recent video generation models, identifying common issues, and also served as a guidance loss to minimize deformation artifacts during video generation.
GeCo 是一个基于几何的度量标准,通过结合残余运动和深度先验来检测静态场景中的几何变形和遮挡不一致问题,生成密集的一致性图。它被用来评估最近的视频生成模型,发现了常见问题,并且还作为指导损失来减少视频生成过程中的变形伪影。
Thinking Like a Botanist: Challenging Multimodal Language Models with Intent-Driven Chain-of-Inquiry
Authors: Syed Nazmus Sakib, Nafiul Haque, Shahrear Bin Amin, Hasan Muhammad Abdullah, Md. Mehedi Hasan, Mohammad Zabed Hossain, Shifat E. Arman
Venue: ACL 2026
First: 2026-04-22T18:12:07+00:00 · Latest: 2026-04-22T18:12:07+00:00
Comments: Accepted at ACL 2026 Findings
Abstract
Vision evaluations are typically done through multi-step processes. In most contemporary fields, experts analyze images using structured, evidence-based adaptive questioning. In plant pathology, botanists inspect leaf images, identify visual cues, infer diagnostic intent, and probe further with targeted questions that adapt to species, symptoms, and severity. This structured probing is crucial for accurate disease diagnosis and treatment formulation. Yet current vision-language models are evaluated on single-turn question answering. To address this gap, we introduce PlantInquiryVQA, a benchmark for studying multi-step, intent-driven visual reasoning in botanical diagnosis. We formalize a Chain of Inquiry framework modeling diagnostic trajectories as ordered question-answer sequences conditioned on grounded visual cues and explicit epistemic intent. We release a dataset of 24,950 expert-curated plant images and 138,068 question-answer pairs annotated with visual grounding, severity labels, and domain-specific reasoning templates. Evaluations on top-tier Multimodal Large Language Models reveal that while they describe visual symptoms adequately, they struggle with safe clinical reasoning and accurate diagnosis. Importantly, structured question-guided inquiry significantly improves diagnostic correctness, reduces hallucination, and increases reasoning efficiency. We hope PlantInquiryVQA serves as a foundational benchmark in advancing research to train diagnostic agents to reason like expert botanists rather than static classifiers.
OMIBench: Benchmarking Olympiad-Level Multi-Image Reasoning in Large Vision-Language Model
Authors: Qiguang Chen, Chengyu Luan, Jiajun Wu, Qiming Yu, Yi Yang, Yizhuo Li, Jingqi Tong, Xiachong Feng, Libo Qin, Wanxiang Che
Venue: ACL 2026
First: 2026-04-22T17:37:40+00:00 · Latest: 2026-04-22T17:37:40+00:00
Comments: ACL 2026 Camera Ready
Abstract
Large vision-language models (LVLMs) have made substantial advances in reasoning tasks at the Olympiad level. Nevertheless, current Olympiad-level multimodal reasoning benchmarks for these models often emphasize single-image analysis and fail to exploit contextual information across multiple images. We present OMIBench, a benchmark designed to evaluate Olympiad-level reasoning when the required evidence is distributed over multiple images. It contains problems from biology, chemistry, mathematics, and physics Olympiads, together with manually annotated rationales and evaluation protocols for both exact and semantic answer matching. Across extensive experiments on OMIBench, we observe meaningful performance gaps in existing models. Even the strongest LVLMs, such as Gemini-3-Pro, attain only about 50% on the benchmark. These results position OMIBench as a focused resource for studying and improving multi-image reasoning in LVLMs.
中文标题/摘要
标题:OMIBench:大型视觉语言模型在奥林匹克级多图像推理中的基准测试
大型视觉语言模型(LVLMs)在奥林匹克级别的推理任务上取得了显著进展。然而,当前用于这些模型的奥林匹克级别多模态推理基准往往侧重于单图像分析,未能充分利用多张图像之间的上下文信息。我们提出了OMIBench,一个旨在评估当所需证据分布在多张图像中时奥林匹克级别推理能力的基准。它包含来自生物学、化学、数学和物理奥林匹克竞赛的问题,以及手动标注的推理和针对精确和语义答案匹配的评估协议。在OMIBench的广泛实验中,我们观察到现有模型之间存在显著的性能差距。即使是最强的LVLMs,如Gemini-3-Pro,也只能在基准测试中达到约50%的性能。这些结果将OMIBench定位为研究和改进LVLMs中多图像推理的集中资源。
Summary / 总结
OMIBench is designed to evaluate large vision-language models (LVLMs) in Olympiad-level reasoning tasks that require evidence from multiple images. It includes problems from various scientific Olympiads with annotated rationales and evaluation protocols. Experiments show significant performance gaps, with even the best LVLMs achieving only about 50% on the benchmark, highlighting the need for improved multi-image reasoning capabilities in LVLMs.
OMIBench 旨在评估大型视觉-语言模型(LVLM)在需要从多张图片中获取证据的奥林匹克级别推理任务中的表现。它包含了来自不同科学奥林匹克竞赛的问题,并附有注释的推理和评估协议。实验结果显示,即使是表现最好的 LVLM,也只能在基准测试中达到约 50%,这表明需要改进 LVLM 的多图推理能力。