arXiv Paper Digest

2026-04-27 04:05
Snapshot: 20260427_0405
When Prompts Override Vision: Prompt-Induced Hallucinations in LVLMs
Authors: Pegah Khayatan, Jayneel Parekh, Arnaud Dapogny, Mustafa Shukor, Alasdair Newson, Matthieu Cord
First: 2026-04-23T17:54:36+00:00 · Latest: 2026-04-23T17:54:36+00:00
Abstract
Despite impressive progress in the capabilities of large vision-language models (LVLMs), these systems remain vulnerable to hallucinations, i.e., outputs that are not grounded in the visual input. Prior work has attributed hallucinations in LVLMs to factors such as limitations of the vision backbone or the dominance of the language component, yet the relative importance of these factors remains unclear. To resolve this ambiguity, we propose HalluScope, a benchmark to better understand the extent to which different factors induce hallucinations. Our analysis indicates that hallucinations largely stem from excessive reliance on textual priors and background knowledge, especially information introduced through textual instructions. To mitigate hallucinations induced by textual instruction priors, we propose HalluVL-DPO, a framework for fine-tuning off-the-shelf LVLMs towards more visually grounded responses. HalluVL-DPO leverages preference optimization using a curated training dataset that we construct, guiding the model to prefer grounded responses over hallucinated ones. We demonstrate that our optimized model effectively mitigates the targeted hallucination failure mode, while preserving or improving performance on other hallucination benchmarks and visual capability evaluations. To support reproducibility and further research, we will publicly release our evaluation benchmark, preference training dataset, and code at https://pegah-kh.github.io/projects/prompts-override-vision/ .
Summary
This study investigates the causes of hallucinations in large vision-language models (LVLMs) and proposes HalluScope, a benchmark to understand the role of different factors in inducing hallucinations. The research finds that hallucinations are mainly due to over-reliance on textual priors and background knowledge, especially from textual instructions. To address this, the authors introduce HalluVL-DPO, a fine-tuning framework that uses a curated dataset to guide the model towards more visually grounded responses, effectively reducing hallucinations while maintaining or improving other visual capabilities.
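The study examines the causes of hallucination in large vision-language models (LVLMs) and proposes the HalluScope benchmark to gauge the role that different factors play in inducing it. It finds that hallucinations stem mainly from over-reliance on textual priors and background knowledge, particularly information supplied through textual instructions. To address this, the authors introduce HalluVL-DPO, a framework that uses a carefully constructed dataset to steer the model toward responses that better match the visual input, reducing hallucinations while maintaining or improving performance on other visual capabilities.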
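The preference-optimization step can be illustrated with the standard Direct Preference Optimization (DPO) objective. The abstract does not spell out the loss used by HalluVL-DPO, so this is a minimal sketch that assumes the standard form suggested by the method's name; the function name, the `beta` value, and the log-probabilities in the usage lines are hypothetical.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one preference pair.

    All inputs are sequence-level log-probabilities. "Chosen" is the visually
    grounded response, "rejected" the hallucinated one. The ref_* values come
    from a frozen reference model (assumed here to be the off-the-shelf LVLM).
    """
    # How much more the policy prefers the grounded response than the
    # reference model does.
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    x = beta * margin
    # -log(sigmoid(x)), computed in a numerically stable way.
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))


# Hypothetical numbers: the first policy favours the grounded answer,
# the second favours the hallucinated one.
loss_grounded_policy = dpo_loss(-1.0, -5.0, -2.0, -2.0)
loss_hallucinating_policy = dpo_loss(-5.0, -1.0, -2.0, -2.0)
```

The loss is lower when the policy assigns relatively more probability to the grounded response, which is the direction the abstract describes.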
Counterfactual Segmentation Reasoning: Diagnosing and Mitigating Pixel-Grounding Hallucination
Authors: Xinzhuo Li, Adheesh Juvekar, Jiaxun Zhang, Xingyou Liu, Muntasir Wahed, Kiet A. Nguyen, Yifan Shen, Tianjiao Yu, Ismini Lourentzou
First: 2025-06-26T17:59:12+00:00 · Latest: 2026-04-23T17:42:55+00:00
Comments: Project webpage: https://plan-lab.github.io/hallusegbench/
Abstract
Segmentation Vision-Language Models (VLMs) have significantly advanced grounded visual understanding, yet they remain prone to pixel-grounding hallucinations, producing masks for incorrect objects or for objects that are entirely absent. Existing evaluations rely almost entirely on text- or label-based perturbations, which check only whether the predicted mask matches the queried label. Such evaluations overlook the spatial footprint and severity of hallucination and therefore fail to reveal vision-driven hallucinations, which are more challenging and more prevalent. To address this gap, we formalize the task of Counterfactual Segmentation Reasoning (CSR), where a model must segment the referenced object in the factual image and abstain in its counterfactual counterpart. To support this task, we curate HalluSegBench, the first large-scale benchmark to diagnose referring and reasoning expression segmentation hallucinations using controlled visual counterfactuals, alongside new evaluation metrics that measure hallucination severity and disentangle vision- and language-driven failure modes. We further introduce RobustSeg, a segmentation VLM trained with counterfactual fine-tuning (CFT) to learn when to segment and when to abstain. Experimental results confirm RobustSeg reduces hallucinations by 30%, while improving segmentation performance on FP-RefCOCO(+/g).
Summary
This paper addresses the issue of pixel-grounding hallucinations in Segmentation Vision-Language Models (VLMs) by introducing Counterfactual Segmentation Reasoning (CSR) and a new benchmark called HalluSegBench. The method involves curating a dataset with controlled visual counterfactuals and developing new evaluation metrics to diagnose and measure hallucinations. The key finding is that the proposed RobustSeg model, trained with counterfactual fine-tuning, reduces hallucinations by 30% while improving segmentation performance on FP-RefCOCO(+/g).
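The paper tackles pixel-grounding hallucination in Segmentation Vision-Language Models (VLMs) by introducing Counterfactual Segmentation Reasoning (CSR) and a new benchmark, HalluSegBench. The approach builds a dataset of controlled visual counterfactuals and develops new evaluation metrics to diagnose and measure hallucination. The main finding is that RobustSeg, trained with counterfactual fine-tuning, reduces hallucinations by 30% while improving segmentation performance on FP-RefCOCO(+/g).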
Fake or Real, Can Robots Tell? Evaluating VLM Robustness to Domain Shift in Single-View Robotic Scene Understanding
Authors: Federico Tavella, Amber Drinkwater, Angelo Cangelosi
First: 2025-06-24T12:45:09+00:00 · Latest: 2026-04-23T17:05:26+00:00
Abstract
Robotic scene understanding increasingly relies on Vision-Language Models (VLMs) to generate natural language descriptions of the environment. In this work, we systematically evaluate single-view object captioning for tabletop scenes captured by a robotic manipulator, introducing a controlled physical domain shift that contrasts real-world tools with geometrically similar 3D-printed counterparts that differ in texture, colour, and material. We benchmark a suite of state-of-the-art, locally deployable VLMs across multiple metrics to assess semantic alignment and factual grounding. Our results demonstrate that while VLMs describe common real-world objects effectively, performance degrades markedly on 3D-printed items despite their structurally familiar forms. We further expose critical vulnerabilities in standard evaluation metrics, showing that some fail to detect domain shifts entirely or reward fluent but factually incorrect captions. These findings highlight the limitations of deploying foundation models for embodied agents and the need for more robust architectures and evaluation protocols in physical robotic applications.
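Title and abstract (translated from Chinese)
Title: Fake or Real, Can Robots Tell? Evaluating VLM Robustness to Domain Shift in Single-View Robotic Scene Understanding
Robotic scene understanding increasingly relies on vision-language models (VLMs) to produce natural-language descriptions of the environment. This study systematically evaluates single-view object captioning of tabletop scenes captured by a robotic manipulator and introduces a controlled physical domain shift that contrasts real-world tools with 3D-printed substitutes of similar geometry but different texture, colour, and material. A range of state-of-the-art, locally deployable VLMs is benchmarked with multiple metrics to assess semantic alignment and factual grounding. The results show that VLMs describe common real-world objects effectively, yet their performance drops markedly on 3D-printed items even though the structural forms are similar. The study also exposes key weaknesses in standard evaluation metrics: some cannot detect the domain shift at all, and some even reward fluent but factually wrong descriptions. These findings underline the limits of deploying foundation models in physical robotic applications and the need for more robust architectures and evaluation protocols.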
Summary
This study evaluates the robustness of Vision-Language Models (VLMs) in single-view object captioning for robotic scene understanding, introducing a controlled physical domain shift by using 3D-printed objects that are geometrically similar to real-world tools but differ in texture, colour, and material. The research demonstrates that while VLMs can describe common real-world objects well, their performance drops markedly on 3D-printed items, highlighting the need for more robust architectures and evaluation protocols in physical robotic applications. Standard evaluation metrics are also found wanting: some fail to detect the domain shift entirely, and some reward fluent but factually incorrect captions.
The study evaluates the robustness of vision-language models (VLMs) in single-view object captioning by contrasting real-world tools with 3D-printed items that are geometrically similar but differ in material, colour, and texture. The results show that VLMs describe common real-world objects well but perform markedly worse on 3D-printed items, underlining the need for more robust architectures and evaluation protocols in physical robotic applications. Standard evaluation metrics are also found to fall short, both in detecting the domain shift and in that they reward factually incorrect descriptions.
From Codebooks to VLMs: Evaluating Automated Visual Discourse Analysis for Climate Change on Social Media
Authors: Katharina Prasse, Steffen Jung, Isaac Bravo, Stefanie Walter, Patrick Knab, Christian Bartelt, Margret Keuper
First: 2026-04-23T15:44:14+00:00 · Latest: 2026-04-23T15:44:14+00:00
Abstract
Social media platforms have become primary arenas for climate communication, generating millions of images and posts that - if systematically analysed - can reveal which communication strategies mobilise public concern and which fall flat. We aim to facilitate such research by analysing how computer vision methods can be used for social media discourse analysis. This analysis includes application-based taxonomy design, model selection, prompt engineering, and validation. We benchmark six promptable vision-language models and 15 zero-shot CLIP-like models on two datasets from X (formerly Twitter) - a 1,038-image expert-annotated set and a larger corpus of over 1.2 million images, with 50,000 labels manually validated - spanning five annotation dimensions: animal content, climate change consequences, climate action, image setting, and image type. Among the models benchmarked, Gemini-3.1-flash-lite outperforms all others across all super-categories and both datasets, while the gap to open-weight models of moderate size remains relatively small. Beyond instance-level metrics, we advocate for distributional evaluation: VLM predictions can reliably recover population level trends even when per-image accuracy is moderate, making them a viable starting point for discourse analysis at scale. We find that chain-of-thought reasoning reduces rather than improves performance, and that annotation dimension specific prompt design improves performance. We release tweet IDs and labels along with our code at https://github.com/KathPra/Codebooks2VLMs.git.
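The contrast between instance-level and distributional evaluation can be made concrete with a small sketch. The abstract does not say which distributional measure the authors use, so total variation distance is an assumption chosen for illustration, and the labels below are made up.

```python
from collections import Counter


def accuracy(true_labels, predicted_labels):
    """Instance-level metric: share of images labelled correctly."""
    matches = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return matches / len(true_labels)


def label_shares(labels):
    """Share of each label in a collection of images."""
    counts = Counter(labels)
    return {label: n / len(labels) for label, n in counts.items()}


def total_variation(true_labels, predicted_labels):
    """Distribution-level metric: 0 means identical label shares, 1 means disjoint."""
    p, q = label_shares(true_labels), label_shares(predicted_labels)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in set(p) | set(q))


# Hypothetical annotations for six images.
truth = ["protest", "protest", "wildlife", "wildlife", "chart", "chart"]
preds = ["protest", "wildlife", "protest", "wildlife", "chart", "chart"]

per_image_accuracy = accuracy(truth, preds)       # 4 of 6 images correct
distribution_gap = total_variation(truth, preds)  # label shares are identical
```

In this toy case two images are mislabelled, yet the errors cancel out and the predicted label shares match the true ones exactly, which is the situation in which population-level trends remain recoverable despite moderate per-image accuracy.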
MaskDiME: Adaptive Masked Diffusion for Precise and Efficient Visual Counterfactual Explanations
Authors: Changlu Guo, Anders Nymark Christensen, Anders Bjorholm Dahl, Morten Rieger Hannemose
First: 2026-02-21T10:53:50+00:00 · Latest: 2026-04-23T15:15:48+00:00
Comments: Accepted by CVPR2026
Abstract
Visual counterfactual explanations aim to reveal the minimal semantic modifications that can alter a model's prediction, providing causal and interpretable insights into deep neural networks. However, existing diffusion-based counterfactual generation methods are often computationally expensive, slow to sample, and imprecise in localizing the modified regions. To address these limitations, we propose MaskDiME, a simple, fast, yet effective diffusion framework that unifies semantic consistency and spatial precision through localized sampling. Our approach adaptively focuses on decision-relevant regions to achieve localized and semantically consistent counterfactual generation while preserving high image fidelity. Our training-free framework, MaskDiME, performs inference over 30x faster than the baseline and achieves comparable or state-of-the-art performance across five benchmark datasets spanning diverse visual domains, establishing a practical and generalizable solution for efficient counterfactual explanation.
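Summary
MaskDiME is designed to generate precise and efficient visual counterfactual explanations by addressing the computational cost and imprecise localization of existing diffusion-based methods. It uses localized sampling to focus on decision-relevant regions, ensuring both semantic consistency and spatial precision. MaskDiME is training-free, runs inference over 30x faster than the baseline, and achieves comparable or state-of-the-art performance on five benchmark datasets spanning diverse visual domains.
MaskDiME aims to generate precise and efficient visual counterfactual explanations by addressing the high computational cost and poor localization accuracy of existing diffusion methods. It adopts a localized sampling strategy that focuses on decision-relevant regions, giving faster inference and comparable performance across multiple datasets. The method requires no training, is much faster than the baseline, and offers a practical and general solution for counterfactual explanation.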
Ramen: Robust Test-Time Adaptation of Vision-Language Models with Active Sample Selection
Authors: Wenxuan Bao, Yanjun Zhao, Xiyuan Yang, Jingrui He
Venue: CVPR 2026
First: 2026-04-23T14:33:27+00:00 · Latest: 2026-04-23T14:33:27+00:00
Comments: Accepted by CVPR 2026 (Findings Track)
Abstract
Pretrained vision-language models such as CLIP exhibit strong zero-shot generalization but remain sensitive to distribution shifts. Test-time adaptation adapts models during inference without access to source data or target labels, offering a practical way to handle such shifts. However, existing methods typically assume that test samples come from a single, consistent domain, while in practice, test data often include samples from mixed domains with distinct characteristics. Consequently, their performance degrades under mixed-domain settings. To address this, we present Ramen, a framework for robust test-time adaptation through active sample selection. For each incoming test sample, Ramen retrieves a customized batch of relevant samples from previously seen data based on two criteria: domain consistency, which ensures that adaptation focuses on data from similar domains, and prediction balance, which mitigates adaptation bias caused by skewed predictions. To improve efficiency, Ramen employs an embedding-gradient cache that stores the embeddings and sample-level gradients of past test images. The stored embeddings are used to retrieve relevant samples, and the corresponding gradients are aggregated for model updates, eliminating the need for any additional forward or backward passes. Our theoretical analysis provides insight into why the proposed adaptation mechanism is effective under mixed-domain shifts. Experiments on multiple image corruption and domain-shift benchmarks demonstrate that Ramen achieves strong and consistent performance, offering robust and efficient adaptation in complex mixed-domain scenarios. Our code is available at https://github.com/baowenxuan/Ramen .
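Title and abstract (translated from Chinese)
Title: Ramen: Robust Test-Time Adaptation of Vision-Language Models with Active Sample Selection
Pretrained vision-language models such as CLIP show strong zero-shot generalization but remain sensitive to distribution shift. Test-time adaptation adapts a model without access to source data or target labels and is a practical way to handle such shifts. However, existing methods usually assume that test samples come from a single, consistent domain, whereas in practice test data often mix samples from domains with different characteristics, so performance degrades in mixed-domain settings. To address this, the authors propose Ramen, a framework for robust test-time adaptation through active sample selection. For each new test sample, Ramen retrieves a customized batch from previously seen data according to two criteria: domain consistency, which keeps adaptation focused on data from similar domains, and prediction balance, which reduces the adaptation bias caused by skewed predictions. For efficiency, Ramen uses an embedding-gradient cache that stores the embeddings and sample-level gradients of past test images. The stored embeddings are used to retrieve relevant samples and the corresponding gradients are aggregated to update the model, with no additional forward or backward passes. A theoretical analysis explains why the proposed adaptation mechanism is effective under mixed-domain shift. Experiments on multiple image-corruption and domain-shift benchmarks show that Ramen achieves strong and consistent performance, providing robust and efficient adaptation in complex mixed-domain scenarios. The code is available at https://github.com/baowenxuan/Ramen .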
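The retrieval-and-aggregation idea behind the embedding-gradient cache can be sketched as follows. This is not the authors' implementation: the class name, the cosine-similarity threshold used for domain consistency, the per-class cap used for prediction balance, and the plain averaging of gradients are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class EmbeddingGradientCache:
    """Hypothetical cache of (embedding, gradient, predicted_label) per past test image."""

    def __init__(self):
        self.entries = []

    def add(self, embedding, gradient, predicted_label):
        self.entries.append((embedding, gradient, predicted_label))

    def retrieve(self, query_embedding, per_class=1, min_similarity=0.5):
        # Domain consistency (assumed form): rank by similarity to the query
        # and ignore entries that are too dissimilar.
        ranked = sorted(
            self.entries,
            key=lambda e: cosine(query_embedding, e[0]),
            reverse=True,
        )
        # Prediction balance (assumed form): cap the entries per predicted label.
        taken = defaultdict(int)
        batch = []
        for entry in ranked:
            if cosine(query_embedding, entry[0]) < min_similarity:
                continue
            if taken[entry[2]] < per_class:
                taken[entry[2]] += 1
                batch.append(entry)
        return batch

    def aggregated_gradient(self, query_embedding, per_class=1, min_similarity=0.5):
        # Average the cached gradients; no new forward or backward pass is run.
        batch = self.retrieve(query_embedding, per_class, min_similarity)
        if not batch:
            return []
        dim = len(batch[0][1])
        return [sum(e[1][i] for e in batch) / len(batch) for i in range(dim)]


cache = EmbeddingGradientCache()
cache.add([1.0, 0.0], [1.0, 0.0], "cat")
cache.add([0.9, 0.1], [3.0, 0.0], "cat")    # same domain, label already covered
cache.add([1.0, 0.1], [0.0, 2.0], "dog")
cache.add([0.0, 1.0], [10.0, 10.0], "dog")  # different domain, filtered out
update = cache.aggregated_gradient([1.0, 0.0])
```

In the toy usage, the update averages one "cat" and one "dog" gradient from the query's own domain, while the dissimilar entry and the surplus "cat" entry are left out.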
Causal Disentanglement for Full-Reference Image Quality Assessment
Authors: Zhen Zhang, Jielei Chu, Tian Zhang, Weide Liu, Fengmao Lv, Tianrui Li, Jun Cheng, Yuming Fang
First: 2026-04-23T13:18:13+00:00 · Latest: 2026-04-23T13:18:13+00:00
Abstract
Existing deep network-based full-reference image quality assessment (FR-IQA) models typically work by performing pairwise comparisons of deep features from the reference and distorted images. In this paper, we approach this problem from a different perspective and propose a novel FR-IQA paradigm based on causal inference and decoupled representation learning. Unlike typical feature comparison-based FR-IQA models, our approach formulates degradation estimation as a causal disentanglement process guided by intervention on latent representations. We first decouple degradation and content representations by exploiting the content invariance between the reference and distorted images. Second, inspired by the human visual masking effect, we design a masking module to model the causal relationship between image content and degradation features, thereby extracting content-influenced degradation features from distorted images. Finally, quality scores are predicted from these degradation features using either supervised regression or label-free dimensionality reduction. Extensive experiments demonstrate that our method achieves highly competitive performance on standard IQA benchmarks across fully supervised, few-label, and label-free settings. Furthermore, we evaluate the approach on diverse non-standard natural image domains with scarce data, including underwater, radiographic, medical, neutron, and screen-content images. Benefiting from its ability to perform scenario-specific training and prediction without labeled IQA data, our method exhibits superior cross-domain generalization compared to existing training-free FR-IQA models.
Summary
This study develops a causal-inference-based framework for full-reference image quality assessment (FR-IQA). The method first decouples degradation and content representations by exploiting the content invariance between the reference and distorted images, with causal disentanglement guided by intervention on latent representations. Inspired by the human visual masking effect, a masking module then models the causal relationship between image content and degradation features and extracts content-influenced degradation features from distorted images. Quality scores are predicted from these features using either supervised regression or label-free dimensionality reduction. The method achieves highly competitive performance on standard IQA benchmarks in fully supervised, few-label, and label-free settings, and is also evaluated on data-scarce domains including underwater, radiographic, medical, neutron, and screen-content images.
LiveVLM: Efficient Online Video Understanding via Streaming-Oriented KV Cache and Retrieval
Authors: Zhenyu Ning, Guangda Liu, Qihao Jin, Chengwei Li, Wenchao Ding, Minyi Guo, Jieru Zhao
Venue: 63rd ACM/IEEE Design Automation Conference (DAC '26), July 2026
First: 2025-05-21T08:47:15+00:00 · Latest: 2026-04-23T12:54:38+00:00
Comments: Accepted by DAC'26
Abstract
Recent developments in Video Large Language Models (Video LLMs) have enabled models to process hour-long videos and exhibit exceptional performance. Nonetheless, the Key-Value (KV) cache expands linearly over time, leading to substantial memory overhead and response delay, both critical challenges in various real-world online applications, such as Deepseek services, autonomous driving and robotics. To mitigate these issues, we propose LiveVLM, a training-free and query-agnostic framework specifically designed for online video understanding and real-time interaction. LiveVLM employs a Vision Sink Bucketing (VSB) mechanism to process video streams in real time, retain long-term video details and eliminate redundant KVs. This mechanism utilizes vision-to-vision attention scores as the metric and seeks to maximize the coverage of contextual information during compression. Noting that KV cache compressed in a query-agnostic manner inevitably retains irrelevant information for specific queries, LiveVLM incorporates a Position-agnostic KV Retrieval (PaR) mechanism to reduce interference from redundant context. The key point of PaR lies in decoupling positional embeddings to enhance the similarity between key tensors, thereby supporting efficient retrieval at the granularity of pages. Extensive experiments demonstrate that LiveVLM enables the foundation LLaVA-OneVision model to achieve state-of-the-art accuracy among both training-free query-agnostic methods and training-based online models.
Summary
LiveVLM is a framework designed to address the memory overhead and response delay of Video Large Language Models (Video LLMs) in online applications. It uses a Vision Sink Bucketing (VSB) mechanism to process video streams in real time and a Position-agnostic KV Retrieval (PaR) mechanism to reduce interference from irrelevant context. Experiments show that LiveVLM enables the LLaVA-OneVision foundation model to achieve state-of-the-art accuracy among both training-free query-agnostic methods and training-based online models.
LiveVLM is a framework that addresses the memory overhead and response latency of Video Large Language Models (Video LLMs) in real-time online video understanding. It uses a Vision Sink Bucketing (VSB) mechanism to process video streams in real time and a Position-agnostic KV Retrieval (PaR) mechanism to reduce irrelevant information. Experiments show that LiveVLM makes the LLaVA-OneVision model more accurate than other training-free and training-based methods.
Reasoning on the Manifold: Bidirectional Consistency for Self-Verification in Diffusion Language Models
Authors: Jiaoyang Ruan, Xin Gao, Yinda Chen, Hengyu Zeng, Liang Du, Guanghao Li, Jie Fu, Jian Pu
First: 2026-04-17T10:17:16+00:00 · Latest: 2026-04-23T12:41:25+00:00
Comments: 30 pages, 5 figures
Abstract
While Diffusion Large Language Models (dLLMs) offer structural advantages for global planning, efficiently verifying that they arrive at correct answers via valid reasoning traces remains a critical challenge. In this work, we propose a geometric perspective: Reasoning on the Manifold. We hypothesize that valid generation trajectories reside as stable attractors on the high-density manifold of the learned distribution, whereas invalid paths exhibit off-manifold drift. To operationalize this, we introduce Bidirectional Manifold Consistency (BMC), a training-free, unsupervised metric that quantifies the stability of the generated sequence through a forward-masking and backward-reconstruction cycle. Empirically, we demonstrate BMC's versatility across the full reasoning lifecycle: (1) in Diagnosis, it serves as a robust discriminator of solution validity without ground truth answer; (2) in Inference, it enables rejection resampling to effectively concentrate computational resources on complex reasoning tasks; and (3) in Alignment, it functions as a dense geometric reward that transforms sparse outcome supervision into fine-grained guidance, empowering models to self-evolve beyond standard baselines. Our results establish intrinsic geometric stability as a robust indicator of correctness for dLLMs.
Summary
This paper addresses the challenge of verifying the correctness of answers generated by Diffusion Large Language Models (dLLMs) through a geometric approach called Reasoning on the Manifold. It introduces Bidirectional Manifold Consistency (BMC), a training-free, unsupervised metric that quantifies the stability of a generated sequence through a forward-masking and backward-reconstruction cycle. The study demonstrates BMC's effectiveness in diagnosis, inference, and alignment, showing that geometric stability is a reliable indicator of correctness for dLLMs.
Taking a geometric perspective called Reasoning on the Manifold, the paper addresses the challenge of verifying the correctness of answers generated by Diffusion Large Language Models (dLLMs). It proposes Bidirectional Manifold Consistency (BMC), an unsupervised metric that assesses the stability of a generated sequence via forward masking and backward reconstruction. The study shows that BMC can be used in diagnosis, inference, and alignment, where it separates valid from invalid reasoning paths and guides model self-evolution. The results indicate that BMC is a robust indicator of correctness for dLLMs.
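A forward-masking and backward-reconstruction cycle can be sketched in a few lines. The abstract does not define how BMC is computed, so the scoring rule below (the fraction of masked tokens that are reconstructed unchanged) is an assumption, and `toy_reconstruct` is a hypothetical stand-in for a diffusion language model's denoiser.

```python
def bidirectional_consistency(tokens, reconstruct, mask_positions, mask_token="<mask>"):
    """Mask some positions, let the model rebuild them, and measure agreement."""
    # Forward step: hide the selected tokens.
    masked = [mask_token if i in mask_positions else t for i, t in enumerate(tokens)]
    # Backward step: ask the model to fill the masked positions back in.
    rebuilt = reconstruct(masked)
    # Stability: share of masked positions recovered exactly.
    hits = sum(rebuilt[i] == tokens[i] for i in mask_positions)
    return hits / len(mask_positions)


def toy_reconstruct(masked, mask_token="<mask>"):
    """Hypothetical denoiser that has learned 'each token is the previous one plus 1'.

    It assumes position 0 is never masked.
    """
    rebuilt = list(masked)
    for i, token in enumerate(rebuilt):
        if token == mask_token:
            rebuilt[i] = rebuilt[i - 1] + 1
    return rebuilt


stable_score = bidirectional_consistency([1, 2, 3, 4, 5], toy_reconstruct, {1, 3})
drifting_score = bidirectional_consistency([1, 2, 7, 4, 5], toy_reconstruct, {2, 3})
```

A sequence that agrees with what the toy model has learned is rebuilt perfectly, while the sequence containing the out-of-pattern token scores lower, mirroring the stable-attractor versus off-manifold-drift intuition in the abstract.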
Process Supervision via Verbal Critique Improves Reasoning in Large Language Models
Authors: Hao-Yuan Chen
First: 2026-04-23T12:36:12+00:00 · Latest: 2026-04-23T12:36:12+00:00
Abstract
Inference-time scaling for LLM reasoning has focused on three axes: chain depth, sample breadth, and learned step-scorers (PRMs). We introduce a fourth axis, granularity of external verbal supervision, via Verbal Process Supervision (VPS), a training-free framework that uses structured natural-language critique from a stronger supervisor to guide an iterative generate-critique-refine loop up to a round budget R. Across GPQA Diamond, AIME 2025, and LiveCodeBench V6 (covering both closed and open models), VPS yields three key results. First, on GPQA Diamond, GPT-5.4 (High) | GPT-5.4 (Low) reaches 94.9% at R=4, surpassing the 94.1% state of the art without gradient updates. Second, on AIME 2025, VPS enables strong weak-actor rescue, boosting scores from 11.7-26.7% to 63.3-90.0% (up to +63.3 points). Third, at matched compute, VPS outperforms Reflexion by +8.5 to +12.1 points and Self-Consistency@5 by +5.0 pp (GPQA) and +8.3 pp (LiveCodeBench), isolating critique granularity as the key driver. Performance scales with the supervisor-actor capability gap (Pearson r=0.90) and degrades when errors are not linguistically expressible (e.g., code synthesis), motivating hybrid verbal-executable methods. These results establish critique granularity as a new axis of inference-time scaling.
Summary
The paper introduces Verbal Process Supervision (VPS), a training-free framework that uses structured natural-language critique from a stronger supervisor to guide LLMs in an iterative generate-critique-refine loop. On GPQA Diamond, GPT-5.4 (High) | GPT-5.4 (Low) reached 94.9% at R=4, surpassing the 94.1% state of the art without gradient updates. On AIME 2025, VPS boosted scores from 11.7-26.7% to 63.3-90.0%. At matched compute, VPS outperformed Reflexion by +8.5 to +12.1 points and Self-Consistency@5 by +5.0 pp (GPQA) and +8.3 pp (LiveCodeBench), highlighting the importance of critique granularity in LLM performance.
The paper proposes the Verbal Process Supervision (VPS) framework, which uses structured natural-language critique to guide an LLM through an iterative generate-critique-refine loop. Across the benchmarks tested, VPS markedly improves reasoning performance. On GPQA Diamond, GPT-5.4 (High) | GPT-5.4 (Low) reaches 94.9% at R=4, above the state of the art. On AIME 2025, VPS lifts scores from 11.7-26.7% to 63.3-90.0%. At matched compute, VPS beats Reflexion by +8.5 to +12.1 points and Self-Consistency@5 by +5.0 to +8.3 points, underscoring the importance of critique granularity for LLM performance.
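The generate-critique-refine loop with a round budget R can be sketched as below. The callable signatures, the early stop when the supervisor has no objection, and the toy actor and supervisor are assumptions for illustration; the abstract only states that the loop runs up to R rounds.

```python
def verbal_process_supervision(problem, actor, supervisor, max_rounds):
    """Iterate generate -> critique -> refine for at most max_rounds rounds (budget R)."""
    # Initial generation: no previous answer and no critique yet.
    answer = actor(problem, None, None)
    for _ in range(max_rounds):
        critique = supervisor(problem, answer)
        if critique is None:
            # Assumed convention: None means the supervisor has no objection.
            break
        answer = actor(problem, answer, critique)
    return answer


def toy_actor(problem, previous_answer, critique):
    """Hypothetical weak actor: starts with a wrong guess, then follows the critique."""
    if previous_answer is None:
        return 6
    return previous_answer - 1 if critique == "too high" else previous_answer + 1


def toy_supervisor(problem, answer):
    """Hypothetical stronger supervisor that critiques in words rather than giving the answer."""
    target = 5
    if answer == target:
        return None
    return "too high" if answer > target else "too low"


refined = verbal_process_supervision("2 + 3", toy_actor, toy_supervisor, max_rounds=4)
unrefined = verbal_process_supervision("2 + 3", toy_actor, toy_supervisor, max_rounds=0)
```

With a budget of zero rounds the actor's first guess is returned as is; with a budget of four, a single critique is enough to correct it.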
Musical Score Understanding Benchmark: Evaluating Large Language Models' Comprehension of Complete Musical Scores
Authors: Congren Dai, Yue Yang, Krinos Li, Huichi Zhou, Shijie Liang, Bo Zhang, Enyang Liu, Ge Jin, Hongran An, Haosen Zhang, Peiyuan Jing, Kinhei Lee, Zhenxuan Zhang, Xiaobing Li, Maosong Sun
Venue: ACL 2026
First: 2025-11-24T06:40:38+00:00 · Latest: 2026-04-23T11:52:26+00:00
Comments: Accepted to ACL 2026 Main Conference
Abstract
Understanding complete musical scores entails integrated reasoning over pitch, rhythm, harmony, and large-scale structure, yet the ability of Large Language Models and Vision-Language Models to interpret full musical notation remains insufficiently examined. We introduce Musical Score Understanding Benchmark (MSU-Bench), a human-curated benchmark for score-level musical understanding across textual (ABC notation) and visual (PDF) modalities. MSU-Bench contains 1,800 generative question-answer pairs from works by Bach, Beethoven, Chopin, Debussy, and others, organised into four levels of increasing difficulty, ranging from onset information to texture and form. Evaluations of more than fifteen state-of-the-art models, in both zero-shot and fine-tuned settings, reveal pronounced modality gaps, unstable level-wise performance, and challenges in maintaining multilevel correctness. Fine-tuning substantially improves results across modalities while preserving general knowledge, positioning MSU-Bench as a robust foundation for future research in multimodal reasoning. The benchmark and code are available at https://github.com/Congren-Dai/MSU-Bench.
Summary
The research aims to evaluate Large Language Models and Vision-Language Models in understanding complete musical scores, which require integrated reasoning over pitch, rhythm, harmony, and large-scale structure. The Musical Score Understanding Benchmark (MSU-Bench) was introduced, containing 1,800 generative question-answer pairs from famous composers, organized into four levels of difficulty. Evaluations of over fifteen state-of-the-art models showed significant modality gaps, unstable performance across levels, and challenges in maintaining multilevel correctness. Fine-tuning models improved results across modalities while preserving general knowledge, highlighting MSU-Bench's utility for future multimodal reasoning research.
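The study evaluates how well large language models and vision-language models understand complete musical scores, a task that requires integrated reasoning over pitch, rhythm, harmony, and large-scale structure. It introduces the Musical Score Understanding Benchmark (MSU-Bench), which contains 1,800 generative question-answer pairs drawn from works by well-known composers. Evaluations of more than fifteen state-of-the-art models show pronounced modality gaps and unstable performance across difficulty levels. Fine-tuning improves results across modalities while preserving general knowledge, underscoring the benchmark's value for future research on multimodal reasoning.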
Component-Based Out-of-Distribution Detection
Authors: Wenrui Liu, Hong Chang, Ruibing Hou, Shiguang Shan, Xilin Chen
First: 2026-04-23T11:19:39+00:00 · Latest: 2026-04-23T11:19:39+00:00
Abstract
Out-of-Distribution (OOD) detection requires sensitivity to subtle shifts without overreacting to natural In-Distribution (ID) diversity. However, from the viewpoint of detection granularity, global representations inevitably suppress local OOD cues, while patch-based methods are unstable due to entangled spurious correlations and noise. Moreover, neither is effective at detecting compositional OODs composed of valid ID components. Inspired by recognition-by-components theory, we present a training-free Component-Based OOD Detection (CoOD) framework that addresses the existing limitations by decomposing inputs into functional components. To instantiate CoOD, we derive Component Shift Score (CSS) to detect local appearance shifts, and Compositional Consistency Score (CCS) to identify cross-component compositional inconsistencies. Empirically, CoOD achieves consistent improvements on both coarse- and fine-grained OOD detection.
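Title and abstract (translated from Chinese)
Title: Component-Based Out-of-Distribution Detection
Out-of-distribution (OOD) detection must be sensitive to subtle shifts without overreacting to natural in-distribution (ID) diversity. From the standpoint of detection granularity, however, global representations inevitably suppress local OOD cues, while patch-based methods are unstable because of entangled spurious correlations and noise. Neither is effective at detecting compositional OODs made up of valid ID components. Inspired by recognition-by-components theory, the authors propose a training-free Component-Based OOD Detection (CoOD) framework that addresses these limitations by decomposing inputs into functional components. To instantiate CoOD, they derive a Component Shift Score (CSS) to detect local appearance shifts and a Compositional Consistency Score (CCS) to identify cross-component compositional inconsistencies. Empirically, CoOD achieves consistent improvements on both coarse- and fine-grained OOD detection.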
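Summary
This study examines the limitations of existing OOD detection methods and proposes a training-free Component-Based OOD Detection (CoOD) framework to address them. The method decomposes inputs into functional components to detect local appearance shifts and cross-component compositional inconsistencies. Empirically, CoOD achieves consistent improvements on both coarse- and fine-grained OOD detection.
The paper addresses how to detect out-of-distribution (OOD) samples without raising false alarms on natural in-distribution (ID) variation. It proposes the Component-Based OOD Detection (CoOD) framework, which decomposes inputs into functional components to detect local appearance shifts and cross-component compositional inconsistencies. Using the Component Shift Score (CSS) and the Compositional Consistency Score (CCS), the framework achieves consistent improvements on coarse- and fine-grained OOD detection tasks.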
Seeing Isn't Believing: Uncovering Blind Spots in Evaluator Vision-Language Models
Authors: Mohammed Safi Ur Rahman Khan, Sanjay Suryanarayanan, Tushar Anand, Mitesh M. Khapra
First: 2026-04-23T10:36:50+00:00 · Latest: 2026-04-23T10:36:50+00:00
Abstract
Large Vision-Language Models (VLMs) are increasingly used to evaluate outputs of other models, for image-to-text (I2T) tasks such as visual question answering, and text-to-image (T2I) generation tasks. Despite this growing reliance, the reliability of these Evaluator VLMs remains underexplored. In this work, we systematically evaluate the reliability of Evaluator VLMs across both I2T and T2I tasks. We introduce targeted perturbations that degrade output quality along key error dimensions, including object hallucinations, spatial reasoning, factual grounding, and visual fidelity. These perturbations test whether Evaluator VLMs can reliably account for these quality-degrading errors in their evaluations. Using a comprehensive benchmark of over 4000 perturbed instances spanning 40 perturbation dimensions, we evaluate 4 prominent VLMs using single-answer scoring, pairwise comparison, and reference-guided paradigms. Our findings reveal that current VLM evaluators exhibit substantial blind spots: they often fail to detect perturbed outputs, with failure rates exceeding 50% in some cases; they struggle particularly with fine-grained compositional and spatial errors; and they are often insensitive to hallucinated content that contradicts the input image. Pairwise comparison proves more reliable, though failure rates persist. These results highlight the unreliable nature of current Evaluator VLMs and urge caution in their deployment for benchmarking and development decisions. Code and data have been made publicly available.
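Title and abstract (translated from Chinese)
Title: Seeing Isn't Believing: Uncovering Blind Spots in Evaluator Vision-Language Models
Large vision-language models (VLMs) are increasingly used to evaluate the outputs of other models, especially in image-to-text (I2T) tasks such as visual question answering and in text-to-image (T2I) generation tasks. Despite this growing reliance, the reliability of these evaluator VLMs has received little study. This work systematically evaluates the reliability of evaluator VLMs on both I2T and T2I tasks. It introduces targeted perturbations that degrade output quality along key error dimensions, including object hallucination, spatial reasoning, factual grounding, and visual fidelity. These perturbations test whether evaluator VLMs reliably account for such quality-degrading errors in their evaluations. Using a comprehensive benchmark of more than 4000 perturbed instances across 40 perturbation dimensions, the authors evaluate 4 prominent VLMs under single-answer scoring, pairwise comparison, and reference-guided settings. The findings show that current VLM evaluators have substantial blind spots: they often fail to detect perturbed outputs, in some cases more than 50% of the time; they struggle in particular with fine-grained compositional and spatial errors; and they are often insensitive to hallucinated content that contradicts the input image. Pairwise comparison is more reliable, but failures remain. These results highlight the unreliability of current evaluator VLMs and call for caution when using them in benchmarking and development decisions. The code and data are publicly available.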
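Summary
This study evaluates the reliability of Evaluator Vision-Language Models (VLMs) used for image-to-text and text-to-image tasks. By introducing targeted perturbations that degrade output quality in key areas such as object hallucinations and spatial reasoning, the research reveals that current VLMs often fail to detect errors, especially fine-grained compositional and spatial issues, and are insensitive to hallucinations. The study uses a comprehensive benchmark of over 4000 perturbed instances and finds that pairwise comparison is more reliable, although failure rates persist. These findings suggest that current VLM evaluators have significant blind spots and should be used with caution in benchmarking and development.
The study evaluates the reliability of evaluator vision-language models (VLMs) used for image-to-text and text-to-image tasks. By introducing targeted perturbations that degrade output quality in key areas such as object hallucination and spatial reasoning, it finds that current VLMs often fail to detect errors, especially fine-grained compositional and spatial ones, and are insensitive to hallucinated content that contradicts the input image. Using a comprehensive benchmark of more than 4000 perturbed instances, the study finds pairwise comparison to be more reliable, although failures persist. These findings indicate that current VLM evaluators have significant blind spots and should be used with caution in benchmarking and development decisions.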
PosterForest: Hierarchical Multi-Agent Collaboration for Scientific Poster Generation
Authors: Jiho Choi, Seojeong Park, Seongjong Song, Hyunjung Shim
Venue: ACL 2026
First: 2025-08-29T15:36:06+00:00 · Latest: 2026-04-23T09:35:10+00:00
Comments: ACL 2026
Abstract
Automating scientific poster generation requires hierarchical document understanding and coherent content-layout planning. Existing methods often rely on flat summarization or optimize content and layout separately. As a result, they often suffer from information loss, weak logical flow, and poor visual balance. We present PosterForest, a training-free framework for scientific poster generation. Our method introduces the Poster Tree, a structured intermediate representation that captures document hierarchy and visual-textual semantics across multiple levels. Building on this representation, content and layout agents perform hierarchical reasoning and recursive refinement, progressively optimizing the poster from global organization to local composition. This joint optimization improves semantic coherence, logical flow, and visual harmony. Experiments show that PosterForest outperforms prior methods in both automatic and human evaluations, without additional training or domain-specific supervision.
中文标题/摘要
标题:PosterForest:科学海报生成的分层多智能体协作
自动化科学海报生成需要分层文档理解和连贯的内容-布局规划。现有方法通常依赖于平面总结或分别优化内容和布局,因此往往存在信息丢失、逻辑流程弱和视觉平衡差的问题。我们提出了PosterForest,一种无需训练的科学海报生成框架。我们的方法引入了Poster树,这是一种结构化的中间表示,能够捕捉多个层次上的文档层次和视觉-文本语义。基于这种表示,内容和布局代理进行分层推理和递归细化,逐步从全局组织到局部组成优化海报。这种联合优化提高了语义连贯性、逻辑流程和视觉和谐性。实验表明,PosterForest在自动和人工评估中均优于先前的方法,无需额外训练或领域特定监督。
Summary / 总结
The research aims to improve the hierarchical understanding and coherent planning of scientific posters. The method introduces a Poster Tree as a structured intermediate representation to capture document hierarchy and visual-textual semantics. Content and layout agents perform hierarchical reasoning and recursive refinement, optimizing the poster from global organization to local composition. Experiments demonstrate that PosterForest outperforms previous methods in both automatic and human evaluations without additional training or domain-specific supervision.
研究旨在提高对科学海报的层次理解和连贯规划。方法引入了Poster Tree作为结构化的中间表示,以捕捉文档层次和视觉-文本语义。内容和布局代理进行层次推理和递归细化,从全局组织到局部组成优化海报。实验表明,PosterForest在自动和人工评估中均优于先前的方法,无需额外的训练或领域特定的监督。
Instance-level Visual Active Tracking with Occlusion-Aware Planning
Authors: Haowei Sun, Kai Zhou, Hao Gao, Shiteng Zhang, Jinwu Hu, Xutao Wen, Qixiang Ye, Mingkui Tan
Venue: CVPR 2026 Poster
First: 2026-04-23T09:11:50+00:00 · Latest: 2026-04-23T09:11:50+00:00
Comments: CVPR 2026 Poster
Abstract
Visual Active Tracking (VAT) aims to control cameras to follow a target in 3D space, which is critical for applications like drone navigation and security surveillance. However, it faces two key bottlenecks in real-world deployment: confusion from visually similar distractors caused by insufficient instance-level discrimination and severe failure under occlusions due to the absence of active planning. To address these, we propose OA-VAT, a unified pipeline with three complementary modules. First, a training-free Instance-Aware Offline Prototype Initialization aggregates multi-view augmented features via DINOv3 to construct discriminative instance prototypes, mitigating distractor confusion. Second, an Online Prototype Enhancement Tracker enhances prototypes online and integrates a confidence-aware Kalman filter for stable tracking under appearance and motion changes. Third, an Occlusion-Aware Trajectory Planner, trained on our new Planning-20k dataset, uses conditional diffusion to generate obstacle-avoiding paths for occlusion recovery. Experiments demonstrate OA-VAT achieves 0.93 average SR on UnrealCV (+2.2% vs. SOTA TrackVLA), 90.8% average CAR on real-world datasets (+12.1% vs. SOTA GC-VAT), and 81.6% TSR on a DJI Tello drone. Running at 35 FPS on an RTX 3090, it delivers robust, real-time performance for practical deployment.
Summary / 总结
The paper addresses the challenges of Visual Active Tracking (VAT) by proposing OA-VAT, which includes three modules: Instance-Aware Offline Prototype Initialization, Online Prototype Enhancement Tracker, and Occlusion-Aware Trajectory Planner. The system uses DINOv3 to create discriminative instance prototypes and integrates a Kalman filter for stable tracking. The Occlusion-Aware Trajectory Planner uses a new dataset to generate obstacle-avoiding paths. Experiments show OA-VAT outperforms existing methods on various datasets and achieves real-time performance.
论文通过提出OA-VAT解决视觉主动跟踪(VAT)的挑战,该系统包括三个模块:实例感知离线原型初始化、在线原型增强跟踪器和遮挡感知轨迹规划器。系统使用DINOv3创建区分性实例原型,并集成卡尔曼滤波器以实现稳定的跟踪。遮挡感知轨迹规划器使用新数据集生成避开障碍的路径。实验表明OA-VAT在各种数据集上优于现有方法,并实现了实时性能。
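Illustrative sketch (not from the paper): the abstract mentions a confidence-aware Kalman filter but gives no equations. One common way to make a Kalman update depend on detector confidence is to inflate the measurement noise when confidence is low, so that uncertain detections move the state less. The snippet below shows this for a 1-D constant-position model. The function name `confidence_aware_update`, the parameters `base_r` and `q`, and the rule `r = base_r / confidence` are all assumptions made for illustration, not OA-VAT's actual formulation.

```python
def confidence_aware_update(x, p, z, confidence, base_r=1.0, q=0.0, eps=1e-6):
    """One predict-and-update step of a 1-D Kalman filter.

    x, p       : prior state estimate and its variance
    z          : new measurement (e.g. a detected target coordinate)
    confidence : detector confidence in (0, 1]; hypothetical input
    base_r     : measurement variance at confidence 1.0 (assumed)
    q          : process noise added in the predict step (assumed)
    """
    # Predict: constant-position model, so only the variance grows.
    p_pred = p + q
    # Assumed rule: low confidence means a noisier measurement.
    r = base_r / max(confidence, eps)
    # Standard Kalman gain and correction.
    k = p_pred / (p_pred + r)
    x_new = x + k * (z - x)
    p_new = (1.0 - k) * p_pred
    return x_new, p_new


# A confident detection pulls the estimate further than an uncertain one.
x_conf, _ = confidence_aware_update(x=0.0, p=1.0, z=10.0, confidence=1.0)
x_unsure, _ = confidence_aware_update(x=0.0, p=1.0, z=10.0, confidence=0.25)
```

With these inputs the confident detection yields a gain of 0.5 and the uncertain one a gain of 0.2, so the same measurement shifts the estimate by half as much or less when the detector is unsure.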
PLAF: Pixel-wise Language-Aligned Feature Extraction for Efficient 3D Scene Understanding
Authors: Junjie Wen, Junlin He, Fei Ma, Jinqiang Cui
First: 2026-04-17T07:24:14+00:00 · Latest: 2026-04-23T09:00:00+00:00
Comments: Accepted by ICCA 2026
Abstract
Accurate open-vocabulary 3D scene understanding requires semantic representations that are both language-aligned and spatially precise at the pixel level, while remaining scalable when lifted to 3D space. However, existing representations struggle to jointly satisfy these requirements, and densely propagating pixel-wise semantics to 3D often results in substantial redundancy, leading to inefficient storage and querying in large-scale scenes. To address these challenges, we present PLAF, a Pixel-wise Language-Aligned Feature extraction framework that enables dense and accurate semantic alignment in 2D without sacrificing open-vocabulary expressiveness. Building upon this representation, we further design an efficient semantic storage and querying scheme that significantly reduces redundancy across both 2D and 3D domains. Experimental results show that PLAF provides a strong semantic foundation for accurate and efficient open-vocabulary 3D scene understanding. The codes are publicly available at https://github.com/RockWenJJ/PLAF.
中文标题/摘要
标题:PLAF:像素级语言对齐特征提取以实现高效的3D场景理解
准确的开放词汇3D场景理解需要同时在像素级别上具有语义对齐和空间精确性的语义表示,同时在提升到3D空间时保持可扩展性。然而,现有的表示方法难以同时满足这些要求,而密集传播像素级语义到3D通常会导致大量冗余,导致在大规模场景中存储和查询效率低下。为了解决这些挑战,我们提出了PLAF,一种像素级语言对齐特征提取框架,能够在2D中实现密集且准确的语义对齐,而不牺牲开放词汇的表达能力。在此表示基础上,我们进一步设计了一种高效的语义存储和查询方案,显著减少了2D和3D域中的冗余。实验结果表明,PLAF为准确高效的开放词汇3D场景理解提供了强大的语义基础。代码已公开发布在https://github.com/RockWenJJ/PLAF。
VG-CoT: Towards Trustworthy Visual Reasoning via Grounded Chain-of-Thought
Authors: Byeonggeuk Lim, Kyeonghyun Kim, JungMin Yun, YoungBin Kim
First: 2026-04-23T08:04:07+00:00 · Latest: 2026-04-23T08:04:07+00:00
Comments: Accepted to LREC 2026
Abstract
The advancement of Large Vision-Language Models (LVLMs) requires precise local region-based reasoning that faithfully grounds the model's logic in actual visual evidence. However, existing datasets face limitations in scalability due to extensive manual annotation and lack of explicit alignment between multi-step reasoning and corresponding image regions, which constrains the evaluation of model trustworthiness. To address these challenges, we propose the Visual Grounding Chain-of-Thought (VG-CoT) dataset, which explicitly links each reasoning step to real visual evidence within the image through a fully automated three-stage pipeline. The pipeline first extracts object- and text-level visual evidence using state-of-the-art detection and OCR models, then generates step-by-step grounded reasoning with GPT-4o, and finally refines the grounding through a rationale-driven open-set detection process. In addition, we introduce a new benchmark that comprehensively evaluates LVLMs reasoning across three complementary dimensions: Rationale Quality, Answer Accuracy, and Reasoning-Answer Alignment. Experiments with representative LVLMs, including LLaVA-1.5 and Qwen2-VL, demonstrate consistent improvements on most evaluation metrics, confirming that VG-CoT effectively enhances trustworthy, evidence-based reasoning while maintaining scalable and cost-efficient dataset construction. The dataset and code will be released publicly upon acceptance to facilitate further research.
Summary / 总结
VG-CoT aims to improve the trustworthiness of visual reasoning by linking each reasoning step to actual visual evidence in images. It uses a three-stage pipeline involving object and text detection, GPT-4o for grounded reasoning, and open-set detection for refinement. Experiments show consistent improvements in evaluation metrics for LVLMs like LLaVA-1.5 and Qwen2-VL, indicating enhanced evidence-based reasoning. The dataset and code will be publicly released.
VG-CoT旨在通过将每个推理步骤与图像中的实际视觉证据联系起来,提高视觉推理的可信度。它使用三阶段流水线,包括对象和文本检测、GPT-4o进行基于视觉的推理以及开放集检测进行细化。实验表明,对于LLaVA-1.5和Qwen2-VL等LVLMs,评估指标的一致性改进,表明增强了基于证据的推理。数据集和代码将在接受后公开发布。
Prototype-Based Test-Time Adaptation of Vision-Language Models
Authors: Zhaohong Huang, Yuxin Zhang, Wenjing Liu, Fei Chao, Rongrong Ji
First: 2026-04-23T07:20:56+00:00 · Latest: 2026-04-23T07:20:56+00:00
Abstract
Test-time adaptation (TTA) has emerged as a promising paradigm for vision-language models (VLMs) to bridge the distribution gap between pre-training and test data. Recent works have focused on backpropagation-free TTA methods that rely on cache-based designs, but these introduce two key limitations. First, inference latency increases as the cache grows with the number of classes, leading to inefficiencies in large-scale settings. Second, suboptimal performance occurs when the cache contains insufficient or incorrect samples. In this paper, we present Prototype-Based Test-Time Adaptation (PTA), an efficient and effective TTA paradigm that uses a set of class-specific knowledge prototypes to accumulate knowledge from test samples. Particularly, knowledge prototypes are adaptively weighted based on the zero-shot class confidence of each test sample, incorporating the sample's visual features into the corresponding class-specific prototype. It is worth highlighting that the knowledge from past test samples is integrated and utilized solely in the prototypes, eliminating the overhead of cache population and retrieval that hinders the efficiency of existing TTA methods. This endows PTA with extremely high efficiency while achieving state-of-the-art performance on 15 image recognition benchmarks and 4 robust point cloud analysis benchmarks. For example, PTA improves CLIP's accuracy from 65.64% to 69.38% on 10 cross-domain benchmarks, while retaining 92% of CLIP's inference speed on large-scale ImageNet-1K. In contrast, the cache-based TDA achieves a lower accuracy of 67.97% and operates at only 50% of CLIP's inference speed.
Summary / 总结
This paper introduces Prototype-Based Test-Time Adaptation (PTA), an efficient TTA method for VLMs that accumulates knowledge from test samples in a set of class-specific prototypes, weighted by each sample's zero-shot class confidence. Because past samples are stored only in the prototypes, PTA avoids the cache population and retrieval overhead of previous methods and reaches state-of-the-art performance across 19 benchmarks (15 image recognition and 4 robust point cloud analysis). For instance, PTA raises CLIP's accuracy on 10 cross-domain benchmarks from 65.64% to 69.38% (a gain of 3.74 percentage points) while retaining 92% of CLIP's inference speed on large-scale ImageNet-1K, whereas the cache-based TDA reaches 67.97% at only 50% of CLIP's inference speed.
论文提出了基于原型的测试时自适应(PTA)方法,以最小的推理延迟提高视觉-语言模型(VLMs)在测试数据上的性能。PTA 使用基于零样本类置信度自适应加权的类特定知识原型,无需缓存填充和检索即可整合测试样本的知识。这使得 PTA 具有高效率并在 19 个基准测试中达到最先进的性能,PTA 在 10 个跨域基准测试中将 CLIP 的准确性提高了 3.74%,同时保持了其在大规模 ImageNet-1K 上 92% 的推理速度。
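Illustrative sketch (not from the paper): the abstract describes class-specific prototypes that absorb each test sample's visual feature, weighted by its zero-shot class confidence, in place of a growing sample cache. The snippet below shows one simple way such an accumulation could work, using a confidence-weighted running mean per class. The class name `PrototypeBank`, the choice to update only the argmax class, and the running-mean rule are assumptions for illustration; the paper's actual update rule may differ.

```python
import math


def normalize(vec):
    """Scale a vector to unit L2 norm (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


class PrototypeBank:
    """Hypothetical store holding one running prototype per class.

    Memory stays O(num_classes * dim) however many test samples arrive,
    unlike a cache that keeps individual samples.
    """

    def __init__(self, num_classes, dim):
        self.sums = [[0.0] * dim for _ in range(num_classes)]
        self.weights = [0.0] * num_classes

    def update(self, feature, zero_shot_probs):
        """Fold one test feature into the prototype of its predicted class.

        Assumption: the weight is the zero-shot probability of the argmax class.
        """
        cls = max(range(len(zero_shot_probs)), key=lambda c: zero_shot_probs[c])
        w = zero_shot_probs[cls]
        self.sums[cls] = [s + w * f for s, f in zip(self.sums[cls], feature)]
        self.weights[cls] += w
        return cls

    def prototype(self, cls):
        """Return the unit-norm confidence-weighted mean, or None if unseen."""
        if self.weights[cls] == 0.0:
            return None
        mean = [s / self.weights[cls] for s in self.sums[cls]]
        return normalize(mean)


bank = PrototypeBank(num_classes=2, dim=2)
bank.update([1.0, 0.0], [0.9, 0.1])  # confident sample assigned to class 0
bank.update([0.0, 1.0], [0.6, 0.4])  # less confident sample, also class 0
```

After both updates the class-0 prototype leans toward the first feature, because that sample carried the larger confidence weight. At inference time the prototype similarities would be combined with the zero-shot text logits; that combination is omitted here because the abstract does not specify it.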
Symbolic Grounding Reveals Representational Bottlenecks in Abstract Visual Reasoning
Authors: Mohit Vaishnav, Tanel Tammet
Venue: 30th Conference on Computational Natural Language Learning (CoNLL), 2026
First: 2026-04-23T07:03:48+00:00 · Latest: 2026-04-23T07:03:48+00:00
Abstract
Vision-language models (VLMs) often fail on abstract visual reasoning benchmarks such as Bongard problems, raising the question of whether the main bottleneck lies in reasoning or representation. We study this on Bongard-LOGO, a synthetic benchmark of abstract concept learning with ground-truth generative programs, by comparing end-to-end VLMs on raw images with large language models (LLMs) given symbolic inputs derived from those images. Using symbolic inputs as a diagnostic probe rather than a practical multimodal architecture, our Componential-Grammatical (C-G) paradigm reformulates Bongard-LOGO as a symbolic reasoning task based on LOGO-style action programs or structured descriptions. LLMs achieve large and consistent gains, reaching mid-90s accuracy on Free-form problems, while a strong visual baseline remains near chance under matched task definitions. Ablations on input format, explicit concept prompts, and minimal visual grounding show that these factors matter much less than the shift from pixels to symbolic structure. These results identify representation as a key bottleneck in abstract visual reasoning and show how symbolic input can serve as a controlled diagnostic upper bound.
Semantic-Fast-SAM: Efficient Semantic Segmenter
Authors: Byunghyun Kim
First: 2026-04-22T04:18:39+00:00 · Latest: 2026-04-23T05:32:11+00:00
Comments: APSIPA ASC 2025
Abstract
We propose Semantic-Fast-SAM (SFS), a semantic segmentation framework that combines the Fast Segment Anything model with a semantic labeling pipeline to achieve real-time performance without sacrificing accuracy. FastSAM is an efficient CNN-based re-implementation of the Segment Anything Model (SAM) that runs much faster than the original transformer-based SAM. Building upon FastSAM's rapid mask generation, we integrate a Semantic-Segment-Anything (SSA) labeling strategy to assign meaningful categories to each mask. The resulting SFS model produces high-quality semantic segmentation maps at a fraction of the computational cost and memory footprint of the original SAM-based approach. Experiments on Cityscapes and ADE20K benchmarks demonstrate that SFS matches the accuracy of prior SAM-based methods (mIoU ~ 70.33 on Cityscapes and 48.01 on ADE20K) while achieving approximately 20x faster inference than SSA in the closed-set setting. We also show that SFS effectively handles open-vocabulary segmentation by leveraging CLIP-based semantic heads, outperforming recent open-vocabulary models on broad class labeling. This work enables practical real-time semantic segmentation with the "segment-anything" capability, broadening the applicability of foundation segmentation models in robotics scenarios. The implementation is available at https://github.com/KBH00/Semantic-Fast-SAM.
中文标题/摘要
标题:Semantic-Fast-SAM:高效语义分割器
我们提出了Semantic-Fast-SAM (SFS),这是一种结合了Fast Segment Anything模型和语义标注管道的语义分割框架,能够在不牺牲准确性的前提下实现实时性能。FastSAM是Segment Anything Model (SAM) 的高效CNN重实现版本,运行速度远超原始的基于变换器的SAM。基于FastSAM快速生成掩码的特点,我们整合了语义分割一切 (SSA) 标注策略,为每个掩码分配有意义的类别。最终,SFS模型以原SAM方法极小的计算成本和内存占用生成高质量的语义分割图。在Cityscapes和ADE20K基准测试中,SFS的准确度与先前的SAM方法相当(Cityscapes上的mIoU约为70.33,ADE20K上的约为48.01),同时在封闭集设置中比SSA快约20倍。我们还展示了SFS在开放词汇分割中的有效应用,通过利用CLIP基语义头超越了最近的开放词汇模型在广泛类别标注中的表现。这项工作使实用的实时语义分割成为可能,扩展了基础分割模型在机器人场景中的应用范围。实现代码可在https://github.com/KBH00/Semantic-Fast-SAM/ 获取。
Summary / 总结
Semantic-Fast-SAM (SFS) combines FastSAM with a semantic labeling pipeline for efficient real-time semantic segmentation, assigning meaningful categories to the generated masks. It matches the accuracy of prior SAM-based methods on the Cityscapes and ADE20K benchmarks (mIoU of about 70.33 and 48.01, respectively) with approximately 20x faster inference than SSA in the closed-set setting. The authors also report that SFS handles open-vocabulary segmentation by leveraging CLIP-based semantic heads, outperforming recent open-vocabulary models on broad class labeling.
BiTDiff: Fine-Grained 3D Conducting Motion Generation via BiMamba-Transformer Diffusion
Authors: Tianzhi Jia, Kaixing Yang, Xiaole Yang, Xulong Tang, Ke Qiu, Shikui Wei, Yao Zhao
First: 2026-04-06T03:49:36+00:00 · Latest: 2026-04-23T03:50:19+00:00
Comments: 15 pages, 7 figures
Abstract
3D conducting motion generation aims to synthesize fine-grained conductor motions from music, with broad potential in music education, virtual performance, digital human animation, and human-AI co-creation. However, this task remains underexplored due to two major challenges: (1) the lack of large-scale fine-grained 3D conducting datasets and (2) the absence of effective methods that can jointly support long-sequence generation with high quality and efficiency. To address the data limitation, we develop a quality-oriented 3D conducting motion collection pipeline and construct CM-Data, a fine-grained SMPL-X dataset with about 10 hours of conducting motion data. To the best of our knowledge, CM-Data is the first and largest public dataset for 3D conducting motion generation. To address the methodological limitation, we propose BiTDiff, a novel framework for 3D conducting motion generation, built upon a BiMamba-Transformer hybrid model architecture for efficient long-sequence modeling and a Diffusion-based generative strategy with human-kinematic decomposition for high-quality motion synthesis. Specifically, BiTDiff introduces auxiliary physical-consistency losses and a hand-/body-specific forward-kinematics design for better fine-grained motion modeling, while leveraging BiMamba for memory-efficient long-sequence temporal modeling and Transformer for cross-modal semantic alignment. In addition, BiTDiff supports training-free joint-level motion editing, enabling downstream human-AI interaction design. Extensive quantitative and qualitative experiments demonstrate that BiTDiff achieves state-of-the-art (SOTA) performance for 3D conducting motion generation on the CM-Data dataset. Code will be available upon acceptance.
中文标题/摘要
标题:BiTDiff:通过BiMamba-Transformer扩散生成细粒度3D指挥动作
3D指挥动作生成旨在从音乐中合成细粒度的指挥动作,具有广泛的应用潜力,包括音乐教育、虚拟表演、数字人类动画和人机共创。然而,由于两个主要挑战:(1)缺乏大规模的细粒度3D指挥动作数据集,(2)缺乏能够同时支持高质量和高效长序列生成的有效方法,这一任务仍处于未被充分探索的状态。为了解决数据限制,我们开发了一种质量导向的3D指挥动作采集流水线,并构建了CM-Data,这是一个包含约10小时指挥动作数据的细粒度SMPL-X数据集。据我们所知,CM-Data是第一个也是最大的3D指挥动作生成的公开数据集。为了解决方法论限制,我们提出了BiTDiff,一种基于BiMamba-Transformer混合模型架构的新颖框架,用于高效长序列建模,并采用基于扩散的生成策略与人体运动分解,以实现高质量的动作合成。具体而言,BiTDiff引入了辅助物理一致性损失和手/身体特定的前向运动设计,以更好地进行细粒度动作建模,同时利用BiMamba进行高效长序列时间建模,并利用Transformer进行跨模态语义对齐。此外,BiTDiff支持无需训练的关节级动作编辑,使下游的人机交互设计成为可能。广泛的定量和定性实验表明,BiTDiff在CM-Data数据集上的3D指挥动作生成性能达到了最先进的水平。代码将在接受后提供。
Summary / 总结
The paper addresses the challenge of generating fine-grained 3D conducting motions from music, which is crucial for applications like music education and virtual performance. To overcome data scarcity and methodological limitations, the authors developed a high-quality 3D conducting motion dataset, CM-Data, and proposed BiTDiff, a novel framework combining a BiMamba-Transformer model and a diffusion-based generative strategy. BiTDiff shows superior performance in generating high-quality, long-sequence 3D conducting motions, as evidenced by extensive experiments.
论文旨在生成从音乐中提取的精细3D指挥动作,这对于音乐教育和虚拟表演等应用至关重要。为解决数据稀缺和方法论限制,作者开发了高质量的3D指挥动作数据集CM-Data,并提出了结合BiMamba-Transformer模型和扩散生成策略的BiTDiff框架。BiTDiff在生成高质量、长序列3D指挥动作方面表现出色,这由广泛的实验验证得出。
PAT3D: Physics-Augmented Text-to-3D Scene Generation
Authors: Guying Lin, Kemeng Huang, Michael Liu, Ruihan Gao, Hanke Chen, Lyuhao Chen, Beijia Lu, Taku Komura, Yuan Liu, Jun-Yan Zhu, Minchen Li
First: 2025-11-26T23:23:58+00:00 · Latest: 2026-04-23T03:17:53+00:00
Comments: 19 pages, 12 figures
Abstract
We introduce PAT3D, the first physics-augmented text-to-3D scene generation framework that integrates vision-language models with physics-based simulation to produce physically plausible, simulation-ready, and intersection-free 3D scenes. Given a text prompt, PAT3D generates 3D objects, infers their spatial relations, and organizes them into a hierarchical scene tree, which is then converted into initial conditions for simulation. A differentiable rigid-body simulator ensures realistic object interactions under gravity, driving the scene toward static equilibrium without interpenetrations. To further enhance scene quality, we introduce a simulation-in-the-loop optimization procedure that guarantees physical stability and non-intersection, while improving semantic consistency with the input prompt. Experiments demonstrate that PAT3D substantially outperforms prior approaches in physical plausibility, semantic consistency, and visual quality. Beyond high-quality generation, PAT3D uniquely enables simulation-ready 3D scenes for downstream tasks such as scene editing and robotic manipulation. Code and data are available at: https://github.com/Simulation-Intelligence/PAT3D.
Reinforcing 3D Understanding in Point-VLMs via Geometric Reward Credit Assignment
Authors: Jingkun Chen, Ruoshi Xu, Mingqi Gao, Shengda Luo, Jungong Han
First: 2026-04-23T00:01:40+00:00 · Latest: 2026-04-23T00:01:40+00:00
Comments: 10 pages, 3 figures, 5 tables
Abstract
Point-Vision-Language Models promise to empower embodied agents with executable spatial reasoning, yet they frequently succumb to geometric hallucination where predicted 3D structures contradict the observed 2D reality. We identify a key cause of this failure not as a representation bottleneck but as a structural misalignment in reinforcement learning, where sparse geometric tokens are drowned out by noisy and broadcasted sequence-level rewards. To resolve this causal dilution, we propose Geometric Reward Credit Assignment, a framework that disentangles holistic supervision into field-specific signals and routes them exclusively to their responsible token spans. This mechanism transforms vague feedback into precise gradient updates and effectively turns generic policy optimization into targeted structural alignment. Furthermore, we internalize physical constraints via a Reprojection-Consistency term which serves as a cross-modal verifier to penalize physically impossible geometries. Validated on a calibrated benchmark derived from ShapeNetCore, our approach bridges the reliability gap by boosting 3D KPA from 0.64 to 0.93, increasing 3D bounding box intersection over union to 0.686, and raising reprojection consistency scores to 0.852. Crucially, these gains are achieved while maintaining robust 2D localization performance, marking a meaningful step from plausible textual outputs toward physically verifiable spatial predictions.
中文标题/摘要
标题:通过几何奖励信用分配强化点-VLMs的3D理解
点-视觉-语言模型有望赋予具身代理执行空间推理的能力,但它们经常受到几何幻觉的影响,其中预测的3D结构与观察到的2D现实相矛盾。我们发现这种失败的主要原因不是表示瓶颈,而是强化学习中的结构错位,其中稀疏的几何标记被嘈杂的和广播的序列级奖励所淹没。为了解决这种因果稀释,我们提出了几何奖励信用分配框架,该框架将整体监督分解为特定领域的信号,并仅将其路由到其负责的标记跨度。该机制将模糊的反馈转化为精确的梯度更新,并有效地将通用策略优化转变为有针对性的结构对齐。此外,我们通过引入再投影一致性项内化物理约束,该项作为跨模态验证器,惩罚物理上不可能的几何结构。在ShapeNetCore校准基准上验证,我们的方法通过将3D KPA从0.64提升到0.93,将3D边界框交并比提高到0.686,将再投影一致性分数提高到0.852,填补了可靠性的差距。关键的是,这些收益是在保持稳健的2D定位性能的同时实现的,标志着从可能的文本输出向可验证的空间预测迈出了一步。
Summary / 总结
This paper addresses geometric hallucination in Point-Vision-Language Models, where predicted 3D structures contradict the observed 2D reality. The authors propose Geometric Reward Credit Assignment, a framework that disentangles holistic supervision into field-specific signals and routes them exclusively to their responsible token spans, together with a Reprojection-Consistency term that penalizes physically impossible geometries. On a calibrated benchmark derived from ShapeNetCore, the method boosts 3D KPA from 0.64 to 0.93, increases 3D bounding box intersection over union to 0.686, and raises the reprojection consistency score to 0.852, while maintaining robust 2D localization performance.
本文提出了一种几何奖励责任分配方法,将整体监督分解为特定领域的信号,并将其路由到相应的标记片段。该方法还引入了再投影一致性项,以惩罚物理上不可能的几何结构。在标准化基准上的实验表明,在显著提高3D关键点准确性、3D边界框IoU和再投影一致性分数的同时,保持了2D定位性能。
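Illustrative sketch (not from the paper): the abstract contrasts broadcasting one sequence-level reward to every token with routing field-specific rewards only to the token spans responsible for each field. The snippet below shows that contrast on a toy sequence. The function names, the field names `keypoints` and `bbox`, the half-open `(start, end)` span convention, and the zero default for tokens outside any span are all assumptions for illustration.

```python
def broadcast_rewards(num_tokens, sequence_reward):
    """Baseline: every token receives the same sequence-level reward."""
    return [sequence_reward] * num_tokens


def assign_field_rewards(num_tokens, field_spans, field_rewards, default=0.0):
    """Route each field's reward only to the tokens in that field's span.

    field_spans   : dict mapping field name -> (start, end), end exclusive
    field_rewards : dict mapping field name -> scalar reward for that field
    default       : reward for tokens outside every span (assumed to be 0)
    """
    rewards = [default] * num_tokens
    for field, (start, end) in field_spans.items():
        if not 0 <= start <= end <= num_tokens:
            raise ValueError(f"span for {field!r} is out of range")
        for i in range(start, end):
            rewards[i] = field_rewards[field]
    return rewards


# Toy output of 6 tokens: tokens 1-2 encode keypoints, tokens 4-5 a 3D box.
spans = {"keypoints": (1, 3), "bbox": (4, 6)}
scores = {"keypoints": 0.9, "bbox": 0.2}

per_token = assign_field_rewards(6, spans, scores)
uniform = broadcast_rewards(6, sum(scores.values()) / len(scores))
```

Under the broadcast baseline the well-predicted keypoint tokens and the poorly predicted box tokens receive the same signal, whereas the routed version gives each span its own reward.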
Weighting What Matters: Boosting Sample Efficiency in Medical Report Generation via Token Reweighting
Authors: Alexander Weers, Daniel Rueckert, Martin J. Menten
First: 2026-04-22T20:51:17+00:00 · Latest: 2026-04-22T20:51:17+00:00
Abstract
Training vision-language models (VLMs) for medical report generation is often hindered by the scarcity of high-quality annotated data. This work evaluates the use of a weighted loss function to improve data efficiency. Compared to standard cross-entropy loss, which treats all token prediction errors equally, the reweighted loss shifts the focus to semantically salient tokens with outsized clinical importance. In experiments on ophthalmological report generation, we show that this simple method improves efficiency across multiple data scales, achieving similar report quality with up to ten times less training data.
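Illustrative sketch (not from the paper): the abstract contrasts standard cross-entropy, which treats all token errors equally, with a reweighted loss that emphasizes semantically salient tokens. The snippet below shows one minimal form of such a loss as a weighted mean of per-token negative log-likelihoods. The function names, the lookup of salient tokens in a fixed term set, the weight value 3.0, and the normalization by the weight sum are assumptions for illustration; the abstract does not say how the authors choose their weights.

```python
import math


def token_weights(tokens, salient_terms, salient_weight=3.0):
    """Hypothetical weighting: tokens in `salient_terms` count more than others."""
    return [salient_weight if t.lower() in salient_terms else 1.0 for t in tokens]


def weighted_cross_entropy(token_log_probs, weights):
    """Weighted mean of per-token negative log-likelihoods.

    token_log_probs : log-probability the model gave each reference token
    weights         : one non-negative weight per token
    With all weights equal this reduces to standard mean cross-entropy.
    """
    if len(token_log_probs) != len(weights):
        raise ValueError("need exactly one weight per token")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return -sum(w * lp for w, lp in zip(weights, token_log_probs)) / total


# Toy report fragment; "edema" stands in for a clinically important term.
tokens = ["mild", "edema"]
log_probs = [math.log(0.5), math.log(0.25)]

standard = weighted_cross_entropy(log_probs, [1.0, 1.0])
reweighted = weighted_cross_entropy(log_probs, token_weights(tokens, {"edema"}))
```

Here the model is less sure about the salient token, so the reweighted loss is larger than the standard one and the gradient concentrates on the clinically important error.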
Foveated Reasoning: Stateful, Action-based Visual Focusing for Vision-Language Models
Authors: Juhong Min, Lazar Valkov, Vitali Petsiuk, Hossein Souri, Deen Dayal Mohan
First: 2026-04-22T20:44:24+00:00 · Latest: 2026-04-22T20:44:24+00:00
Abstract
Vision-language models benefit from high-resolution images, but the increase in visual-token count incurs high compute overhead. Humans resolve this tension via foveation: a coarse view guides "where to look", while selectively acquired high-acuity evidence refines "what to think". We introduce Foveated Reasoner, an autoregressive vision-language framework that unifies foveation and reasoning within a single decoding trajectory. Starting from a low-resolution view, the model triggers foveation only when needed, retrieves high-resolution evidence from selected regions, and injects it back into the same decoding trajectory. We train the method with a two-stage pipeline: coldstart supervision to bootstrap foveation behavior, followed by reinforcement learning to jointly improve evidence acquisition and task accuracy while discouraging trivial "see-everything" solutions. Experiments show that the method learns effective foveation policies and achieves stronger accuracy under tight visual-token budgets across multiple vision-language benchmarks.
中文标题/摘要
标题:注视点推理:基于动作的状态化视觉聚焦视知觉语言模型
视知觉语言模型受益于高分辨率图像,但视觉标记数量的增加导致了高昂的计算开销。人类通过注视点解决这一矛盾:粗略的视图指导“看哪里”,而有选择地获取高分辨率证据则细化“思考什么”。我们引入了注视点推理器,这是一种自回归视知觉语言框架,将注视点和推理统一在一个解码轨迹中。从低分辨率视图开始,该模型仅在需要时触发注视点,从选定区域检索高分辨率证据,并将其注入相同的解码轨迹。我们使用两阶段训练管道:冷启动监督以启动注视点行为,随后是强化学习以同时提高证据获取和任务准确性,同时避免“看一切”的简单解决方案。实验表明,该方法学习了有效的注视点策略,并在多个视知觉语言基准测试中实现了更强的准确性,即使在视觉标记预算紧张的情况下也是如此。
Summary / 总结
The research aims to address the high computational cost of using high-resolution images in vision-language models by introducing Foveated Reasoner, an autoregressive framework that incorporates foveation and reasoning. Starting with a low-resolution view, the model selectively acquires high-resolution evidence only when necessary and integrates it back into the decoding process. The method is trained using a two-stage pipeline: coldstart supervision for initial foveation behavior and reinforcement learning to enhance evidence acquisition and task accuracy. Experiments demonstrate that the model effectively learns foveation policies and performs better under limited visual-token budgets across various vision-language benchmarks.
研究旨在通过引入Foveated Reasoner框架解决使用高分辨率图像在视觉语言模型中的高计算成本问题,该框架结合了注视点机制和推理。从低分辨率视图开始,模型仅在必要时选择性地获取高分辨率证据并将其重新整合到解码过程中。该方法使用两阶段训练管道:冷启动监督以初始化注视点行为,并使用强化学习来提高证据获取和任务准确性。实验表明,该模型能够有效学习注视点策略,并在各种视觉语言基准测试中在有限的视觉标记预算下表现出更强的准确性。
InVitroVision: a Multi-Modal AI Model for Automated Description of Embryo Development using Natural Language
Authors: Nicklas Neu, Thomas Ebner, Jasmin Primus, Raphael Zefferer, Bernhard Schenkenfelder, Mathias Brunbauer, Florian Kromp
First: 2026-04-22T20:05:37+00:00 · Latest: 2026-04-22T20:05:37+00:00
Comments: 15 pages, 2 figures
Abstract
The application of artificial intelligence (AI) in IVF has shown promise in improving consistency and standardization of decisions, but often relies on annotated data and does not make use of the multimodal nature of IVF data. We investigated whether foundational vision-language models can be fine-tuned to predict natural language descriptions of embryo morphology and development. Using a publicly available embryo time-lapse dataset, we fine-tuned PaliGemma-2, a multi-modal vision-language model, with only 1,000 images and corresponding captions, describing embryo morphology, embryonic cell cycle and developmental stage. Our results show that the fine-tuned model, InVitroVision, outperformed a commercial model, ChatGPT 5.2, and base models in overall metrics, with performance improving with larger training datasets. This study demonstrates the potential of foundational vision-language models to generalize to IVF tasks with limited data, enabling the prediction of natural language descriptions of embryo morphology and development. This approach may facilitate the use of large language models to retrieve information and scientific evidence from relevant publications and guidelines, and has implications for few-shot adaptation to multiple downstream tasks in IVF.
中文标题/摘要
标题:InVitroVision:一种用于自动描述胚胎发育的多模态AI模型
人工智能(AI)在体外受精(IVF)中的应用显示出提高决策一致性和标准化的潜力,但通常依赖于标注数据,未能充分利用IVF数据的多模态性质。我们研究了基础的视觉-语言模型是否可以微调以预测胚胎形态和发育的自然语言描述。使用一个公开的胚胎时间序列数据集,我们仅用1,000张图像及其对应的描述胚胎形态、胚胎细胞周期和发育阶段的说明,微调了PaliGemma-2多模态视觉-语言模型。结果显示,微调后的模型InVitroVision在整体指标上优于商业模型ChatGPT 5.2和基础模型,性能随更大训练数据集的增加而提高。本研究展示了基础的视觉-语言模型在有限数据下泛化到IVF任务的潜力,能够预测胚胎形态和发育的自然语言描述。这种方法可能有助于使用大型语言模型检索相关出版物和指南中的信息和科学证据,并对IVF的多个下游任务进行少量样本适应。
Unlocking Multi-Spectral Data for Multi-Modal Models with Guided Inputs and Chain-of-Thought Reasoning
Authors: Dahun Kim, Ganesh Satish Mallya, Anelia Angelova
First: 2026-04-22T19:23:52+00:00 · Latest: 2026-04-22T19:23:52+00:00
Comments: Accepted to IGARSS 2026
Abstract
Multi-spectral imagery is a valuable input signal for Remote Sensing applications, such as land-use and land-cover classification and environmental monitoring. However, generalist Large Multi-modal Models (LMMs) are typically trained on RGB images, limiting their applicability to the RGB domain. At the same time, training multi-spectral multi-modal models is expensive and produces uniquely specialized models. To address this, we propose a novel training-free approach that introduces multi-spectral data within the inference pipeline of standard RGB-only LMMs, allowing large gains in performance. Our approach leverages the LMMs' understanding of the visual space by adapting non-RGB inputs to that space and injecting domain-specific information and Chain-of-Thought reasoning as instructions. We demonstrate this with the Gemini 2.5 model and observe strong Zero-Shot performance gains on popular Remote Sensing benchmarks. These results highlight the potential for geospatial professionals to leverage powerful generalist models for specialized sensor inputs, benefiting from rich reasoning capabilities grounded in specialized data.
中文标题/摘要
标题:利用引导输入和链式推理解锁多光谱数据的多模态模型应用
多光谱影像是遥感应用中的宝贵输入信号,例如土地利用和土地覆盖分类以及环境监测。然而,通用的大规模多模态模型(LMMs)通常仅在RGB图像上训练,使其适用范围局限于RGB域。同时,训练多光谱多模态模型成本高昂且产生专门化的模型。为解决这一问题,我们提出了一种新的无需训练的方法,在标准仅RGB的LMMs推理管道中引入多光谱数据,从而实现性能的巨大提升。该方法通过将非RGB输入适应视觉空间,并注入领域特定信息和链式推理指令,利用LMMs对视觉空间的理解。我们使用Gemini 2.5模型进行了演示,并在流行的遥感基准测试中观察到了显著的零样本性能提升。这些结果突显了地理空间专业人士利用强大通用模型处理专门传感器输入的潜力,从而受益于丰富的基于专门数据的推理能力。
Summary / 总结
This study uses guided inputs and chain-of-thought reasoning to make multi-spectral data usable by multi-modal models for remote sensing tasks. The motivation is to avoid the high cost of training specialized multi-spectral models and instead leverage existing RGB-only large multi-modal models' understanding of the visual space. The experiments, run with Gemini 2.5, show that adapting non-RGB inputs to that space and injecting domain-specific information and chain-of-thought reasoning instructions into the inference pipeline of standard RGB-only models yields strong zero-shot performance gains on popular remote sensing benchmarks.
GeCo: Evaluating Geometric Consistency for Video Generation via Motion and Structure
Authors: Leslie Gu, Junhwa Hur, Charles Herrmann, Fangneng Zhan, Todd Zickler, Deqing Sun, Hanspeter Pfister
First: 2025-12-25T03:28:28+00:00 · Latest: 2026-04-22T18:38:16+00:00
Abstract
We introduce GeCo, a geometry-grounded metric for jointly detecting geometric deformation and occlusion-inconsistency artifacts in static scenes. By fusing residual motion and depth priors, GeCo produces interpretable, dense consistency maps that reveal these artifacts. We use GeCo to systematically benchmark recent video generation models, uncovering common failure modes, and further employ it as a training-free guidance loss to reduce deformation artifacts during video generation.
Thinking Like a Botanist: Challenging Multimodal Language Models with Intent-Driven Chain-of-Inquiry
Authors: Syed Nazmus Sakib, Nafiul Haque, Shahrear Bin Amin, Hasan Muhammad Abdullah, Md. Mehedi Hasan, Mohammad Zabed Hossain, Shifat E. Arman
Venue: ACL 2026
First: 2026-04-22T18:12:07+00:00 · Latest: 2026-04-22T18:12:07+00:00
Comments: Accepted at ACL 2026 Findings
Abstract
Vision evaluations are typically done through multi-step processes. In most contemporary fields, experts analyze images using structured, evidence-based adaptive questioning. In plant pathology, botanists inspect leaf images, identify visual cues, infer diagnostic intent, and probe further with targeted questions that adapt to species, symptoms, and severity. This structured probing is crucial for accurate disease diagnosis and treatment formulation. Yet current vision-language models are evaluated on single-turn question answering. To address this gap, we introduce PlantInquiryVQA, a benchmark for studying multi-step, intent-driven visual reasoning in botanical diagnosis. We formalize a Chain of Inquiry framework modeling diagnostic trajectories as ordered question-answer sequences conditioned on grounded visual cues and explicit epistemic intent. We release a dataset of 24,950 expert-curated plant images and 138,068 question-answer pairs annotated with visual grounding, severity labels, and domain-specific reasoning templates. Evaluations on top-tier Multimodal Large Language Models reveal that while they describe visual symptoms adequately, they struggle with safe clinical reasoning and accurate diagnosis. Importantly, structured question-guided inquiry significantly improves diagnostic correctness, reduces hallucination, and increases reasoning efficiency. We hope PlantInquiryVQA serves as a foundational benchmark in advancing research to train diagnostic agents to reason like expert botanists rather than static classifiers.
OMIBench: Benchmarking Olympiad-Level Multi-Image Reasoning in Large Vision-Language Model
Authors: Qiguang Chen, Chengyu Luan, Jiajun Wu, Qiming Yu, Yi Yang, Yizhuo Li, Jingqi Tong, Xiachong Feng, Libo Qin, Wanxiang Che
Venue: ACL 2026
First: 2026-04-22T17:37:40+00:00 · Latest: 2026-04-22T17:37:40+00:00
Comments: ACL 2026 Camera Ready
Abstract
Large vision-language models (LVLMs) have made substantial advances in reasoning tasks at the Olympiad level. Nevertheless, current Olympiad-level multimodal reasoning benchmarks for these models often emphasize single-image analysis and fail to exploit contextual information across multiple images. We present OMIBench, a benchmark designed to evaluate Olympiad-level reasoning when the required evidence is distributed over multiple images. It contains problems from biology, chemistry, mathematics, and physics Olympiads, together with manually annotated rationales and evaluation protocols for both exact and semantic answer matching. Across extensive experiments on OMIBench, we observe meaningful performance gaps in existing models. Even the strongest LVLMs, such as Gemini-3-Pro, attain only about 50% on the benchmark. These results position OMIBench as a focused resource for studying and improving multi-image reasoning in LVLMs.
中文标题/摘要
标题:OMIBench:大型视觉-语言模型在奥林匹克级多图像推理中的基准测试
大型视觉-语言模型(LVLMs)在奥林匹克级的推理任务上取得了显著进展。然而,当前用于这些模型的奥林匹克级多模态推理基准往往侧重于单图像分析,未能充分利用多图像之间的上下文信息。我们提出了OMIBench,一个旨在评估当所需证据分布在多张图像中时的奥林匹克级推理能力的基准。它包含来自生物学、化学、数学和物理奥林匹克竞赛的问题,以及人工标注的推理和用于精确和语义答案匹配的评估协议。在OMIBench的广泛实验中,我们观察到现有模型之间存在显著的性能差距。即使是最强的LVLMs,如Gemini-3-Pro,也只能在基准测试中达到约50%的性能。这些结果将OMIBench定位为研究和改进LVLMs中多图像推理的集中资源。
Summary / 总结
OMIBench is designed to evaluate large vision-language models in Olympiad-level multi-image reasoning, addressing the limitation of current benchmarks that focus on single-image analysis. The benchmark includes problems from various scientific Olympiads with annotated rationales and evaluation protocols. Experiments show significant performance gaps among existing models, with even the strongest achieving only about 50% on the benchmark, highlighting the need for improvement in multi-image reasoning capabilities.
OMIBench 旨在评估大型视觉-语言模型在奥林匹克级别多图像推理中的能力,解决了当前基准主要关注单图像分析的问题。该基准包含来自不同科学奥林匹克竞赛的问题,并附有注释的推理和评估协议。实验结果显示现有模型存在显著性能差距,最强的模型也只能在基准上达到约50%,突显了提高多图像推理能力的需求。