
Machine Learning Academic Digest [4.16]

arXiv Daily Academic Digest • 3 weeks ago • 860 views



cs.LG: 170 papers today


Large Language Models (20 papers)

【1】From Feelings to Metrics: Understanding and Formalizing How Users Vibe-Test LLMs
Link: https://arxiv.org/abs/2604.14137

Authors: Itay Itzhak, Eliya Habba, Gabriel Stanovsky, Yonatan Belinkov
Comments: Under review. 42 pages, 18 figures. Code and data at https://itay1itzhak.github.io/vibe-testing-llms (Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG))
Abstract: Evaluating LLMs is challenging, as benchmark scores often fail to capture models' real-world usefulness. Instead, users often rely on "vibe-testing": informal experience-based evaluation, such as comparing models on coding tasks related to their own workflow. While prevalent, vibe-testing is often too ad hoc and unstructured to analyze or reproduce at scale. In this work, we study how vibe-testing works in practice and then formalize it to support systematic analysis. We first analyze two empirical resources: (1) a survey of user evaluation practices, and (2) a collection of in-the-wild model comparison reports from blogs and social media. Based on these resources, we formalize vibe-testing as a two-part process: users personalize both what they test and how they judge responses. We then introduce a proof-of-concept evaluation pipeline that follows this formulation by generating personalized prompts and comparing model outputs using user-aware subjective criteria. In experiments on coding benchmarks, we find that combining personalized prompts and user-aware evaluation can change which model is preferred, reflecting the role of vibe-testing in practice. These findings suggest that formalized vibe-testing can serve as a useful approach for bridging benchmark scores and real-world experience.


【2】Rhetorical Questions in LLM Representations: A Linear Probing Study
Link: https://arxiv.org/abs/2604.14128

Authors: Louie Hong Yao, Vishesh Anand, Yuan Zhuang, Tianyu Jiang
Comments: 18 pages, 15 figures, accepted to ACL 2026
Abstract: Rhetorical questions are asked not to seek information but to persuade or signal stance. How large language models internally represent them remains unclear. We analyze rhetorical questions in LLM representations using linear probes on two social-media datasets with different discourse contexts, and find that rhetorical signals emerge early and are most stably captured by last-token representations. Rhetorical questions are linearly separable from information-seeking questions within datasets, and remain detectable under cross-dataset transfer, reaching AUROC around 0.7-0.8. However, we demonstrate that transferability does not simply imply a shared representation. Probes trained on different datasets produce different rankings when applied to the same target corpus, with overlap among the top-ranked instances often below 0.2. Qualitative analysis shows that these divergences correspond to distinct rhetorical phenomena: some probes capture discourse-level rhetorical stance embedded in extended argumentation, while others emphasize localized, syntax-driven interrogative acts. Together, these findings suggest that rhetorical questions in LLM representations are encoded by multiple linear directions emphasizing different cues, rather than a single shared direction.
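The two quantities this abstract reports, AUROC and overlap among top-ranked instances, can be illustrated with a small sketch. It is not the authors' code: the scores and labels are made-up toy values, and defining overlap as the shared fraction of the top-k items is an assumption, since the abstract does not say which overlap measure is used.

```python
def auroc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def top_k_overlap(scores_a, scores_b, k):
    """Fraction of the top-k items (by score) shared by two rankings of the same corpus."""
    def top(scores):
        order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        return set(order[:k])
    return len(top(scores_a) & top(scores_b)) / k


# Hypothetical probe outputs on the same six questions (1 = rhetorical).
labels = [1, 1, 1, 0, 0, 0]
probe_a = [0.9, 0.8, 0.4, 0.5, 0.2, 0.1]
probe_b = [0.3, 0.7, 0.95, 0.6, 0.1, 0.2]

auc_a = auroc(probe_a, labels)
auc_b = auroc(probe_b, labels)
overlap = top_k_overlap(probe_a, probe_b, k=2)
print(auc_a, auc_b, overlap)
```

In this toy case both probes reach the same AUROC while sharing only half of their top-2 instances, which is the kind of gap between detectability and shared ranking the abstract describes.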


【3】Diffusion Language Models for Speech Recognition
Link: https://arxiv.org/abs/2604.14001

Authors: Davyd Naveriani, Albert Zeyer, Ralf Schlüter, Hermann Ney
Abstract: Diffusion language models have recently emerged as a leading alternative to standard language models, owing to their capacity for bidirectional attention and parallel text generation. In this work, we explore variants for their use in speech recognition. Specifically, we introduce a comprehensive guide to incorporating masked diffusion language models (MDLMs) and uniform-state diffusion models (USDMs) for rescoring ASR hypotheses. Additionally, we design a new joint-decoding method that combines CTC and USDM by integrating the framewise probability distributions derived from CTC with the labelwise probability distributions computed by USDM at each decoding step, thereby generating new candidates that combine strong language knowledge from USDM and acoustic information from CTC. Our findings reveal that USDM, as well as MDLM, can significantly improve the accuracy of recognized text. We publish all our code and recipes.


【4】Adaptive Conformal Prediction for Improving Factuality of Generations by Large Language Models
Link: https://arxiv.org/abs/2604.13991

Authors: Aleksandr Rubashevskii, Dzianis Piatrashyn, Preslav Nakov, Maxim Panov
Abstract: Large language models (LLMs) are prone to generating factually incorrect outputs. Recent work has applied conformal prediction to provide uncertainty estimates and statistical guarantees for the factuality of LLM generations. However, existing approaches are typically not prompt-adaptive, limiting their ability to capture input-dependent variability. As a result, they may filter out too few items (leading to over-coverage) or too many (under-coverage) for a given task or prompt. We propose an adaptive conformal prediction approach that extends conformal score transformation methods to LLMs, with applications to long-form generation and multiple-choice question answering. This enables prompt-dependent calibration, retaining marginal coverage guarantees while improving conditional coverage. In addition, the approach naturally supports selective prediction, allowing unreliable claims or answer choices to be filtered out in downstream applications. We evaluate our approach on multiple white-box models across diverse domains and show that it significantly outperforms existing baselines in terms of conditional coverage.
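For readers unfamiliar with the non-adaptive baseline this abstract contrasts against, the following is a minimal sketch of standard split conformal calibration: one global threshold taken from held-out scores. It is not the paper's adaptive method; the function names, the scores, and the convention that a larger score means a less reliable claim are all assumptions made for illustration.

```python
import math


def conformal_threshold(cal_scores, alpha):
    """Return the ceil((n + 1) * (1 - alpha))-th smallest calibration score."""
    n = len(cal_scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf  # too few calibration points for this alpha
    return sorted(cal_scores)[rank - 1]


def filter_claims(claims, threshold):
    """Keep claims whose nonconformity score does not exceed the threshold."""
    return [text for text, score in claims if score <= threshold]


# Hypothetical nonconformity scores from a held-out calibration set.
cal_scores = [0.05, 0.10, 0.20, 0.25, 0.40, 0.55, 0.70]
tau = conformal_threshold(cal_scores, alpha=0.25)

claims = [("claim A", 0.10), ("claim B", 0.60), ("claim C", 0.55)]
kept = filter_claims(claims, tau)
print(tau, kept)
```

Because the same threshold is applied to every prompt, the guarantee here is only marginal (averaged over prompts), which is the limitation the abstract's prompt-dependent calibration is meant to address.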


【5】Character Beyond Speech: Leveraging Role-Playing Evaluation in Audio Large Language Models via Reinforcement Learning
Link: https://arxiv.org/abs/2604.13804

Authors: Dongjie Fu, Fangming Feng, Xize Cheng, Linjun Li, Zhou Zhao, Tao Jin
Abstract: The rapid evolution of multimodal large models has revolutionized the simulation of diverse characters in speech dialogue systems, enabling a novel interactive paradigm. Character attributes are manifested not only in textual responses but also through vocal features, as speech conveys rich paralinguistic information that is challenging to quantify. This poses significant difficulties in evaluating the character alignment of role-playing agents. To address these challenges, we present RoleJudge, an evaluation framework that leverages audio large language models to systematically assess the alignment between speech and character across multiple modalities and dimensions. Furthermore, we introduce RoleChat, the first voice role-playing evaluation dataset enriched with chain-of-thought reasoning annotations, comprising a diverse set of authentic and LLM-generated speech samples. Utilizing this dataset, we implement a multi-stage training paradigm and incorporate Standard Alignment in reinforcement learning to mitigate reward misalignment during optimization. Experimental results in terms of accuracy and subjective assessment demonstrate that RoleJudge outperforms various baseline models, validating the effectiveness of our multidimensional evaluation framework.


【6】The cognitive companion: a lightweight parallel monitoring architecture for detecting and recovering from reasoning degradation in LLM agents
Link: https://arxiv.org/abs/2604.13759

Authors: Rafflesia Khan, Nafiul Islam Khan
Abstract: Large language model (LLM) agents on multi-step tasks suffer reasoning degradation (looping, drift, stuck states) at rates up to 30% on hard tasks. Current solutions include hard step limits (abrupt) or LLM-as-judge monitoring (10-15% overhead per step). This paper introduces the Cognitive Companion, a parallel monitoring architecture with two implementations: an LLM-based Companion and a novel zero-overhead Probe-based Companion. We report a three-batch feasibility study centered on Gemma 4 E4B, with an additional exploratory small-model analysis on Qwen 2.5 1.5B and Llama 3.2 1B. In our experiments, the LLM-based Companion reduced repetition on loop-prone tasks by 52-62% with approximately 11% overhead. The Probe-based Companion, trained on hidden states from layer 28, showed a mean effect size of +0.471 at zero measured inference overhead; its strongest probe result achieved cross-validated AUROC 0.840 on a small proxy-labeled dataset. A key empirical finding is that companion benefit appears task-type dependent: companions are most helpful on loop-prone and open-ended tasks, while effects are neutral or negative on more structured tasks. Our small-model experiments also suggest a possible scale boundary: companions did not improve the measured quality proxy on 1B-1.5B models, even when interventions fired. Overall, the paper should be read as a feasibility study rather than a definitive validation. The results provide encouraging evidence that sub-token monitoring may be useful, identify task-type sensitivity as a practical design constraint, and motivate selective companion activation as a promising direction for future work.


【7】Chain of Uncertain Rewards with Large Language Models for Reinforcement Learning
Link: https://arxiv.org/abs/2604.13504

Authors: Shentong Mo
Abstract: Designing effective reward functions is a cornerstone of reinforcement learning (RL), yet it remains a challenging and labor-intensive process due to the inefficiencies and inconsistencies inherent in traditional methods. Existing methods often rely on extensive manual design and evaluation steps, which are prone to redundancy and overlook local uncertainties at intermediate decision points. To address these challenges, we propose the Chain of Uncertain Rewards (CoUR), a novel framework that integrates large language models (LLMs) to streamline reward function design and evaluation in RL environments. Specifically, our CoUR introduces code uncertainty quantification with a similarity selection mechanism that combines textual and semantic analyses to identify and reuse the most relevant reward function components. By reducing redundant evaluations and leveraging Bayesian optimization on decoupled reward terms, CoUR enables a more efficient and robust search for optimal reward feedback. We comprehensively evaluate CoUR across nine original environments from IsaacGym and all 20 tasks from the Bidexterous Manipulation benchmark. The experimental results demonstrate that CoUR not only achieves better performance but also significantly lowers the cost of reward evaluations.


【8】Dataset-Level Metrics Attenuate Non-Determinism: A Fine-Grained Non-Determinism Evaluation in Diffusion Language Models
Link: https://arxiv.org/abs/2604.13413

Authors: Zhengyu Fang, Zhimeng Jiang, Huiyuan Chen, Xiaoge Zhang, Tianyi Li, Kaiyu Tang, Xiao Li, Jing Li
Abstract: Diffusion language models (DLMs) have emerged as a promising paradigm for large language models (LLMs), yet the non-deterministic behavior of DLMs remains poorly understood. The existing non-determinism evaluations for LLMs predominantly rely on dataset-level metrics under fixed inference configurations, providing limited insight into how model behavior varies across runs and evaluation conditions. In this work, we show that dataset-level metrics systematically attenuate non-determinism in diffusion language models by aggregating sample-level prediction quality across different runs. As a result, configurations with similar aggregate performance can exhibit substantially different behaviors on individual inputs, leaving fine-grained instability and distinct error patterns uncharacterized. To address this limitation, we conduct a fine-grained evaluation of non-determinism based on sample-level prediction differences across a range of model-related factors (including guidance scale, diffusion steps, and Monte Carlo sampling) as well as system-related factors such as batch size, hardware, and numerical precision. Our analysis reveals that non-determinism in DLMs is pervasive and structured, with code generation exhibiting markedly higher sensitivity to factor-level choices than question answering. To attribute sources of non-determinism, we introduce Factor Variance Attribution (FVA), a cross-factor analysis metric that decomposes observed non-determinism into variance attributable to different evaluation factor settings. Our findings highlight the need for fine-grained, factor-aware evaluation to enable reliable non-determinism assessment of diffusion language models.
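The claim that dataset-level metrics can hide run-to-run differences is easy to see on toy data. The sketch below is an illustration only, not the paper's FVA metric: the two per-sample correctness vectors are invented, and `flip_rate` is a hypothetical helper name.

```python
def accuracy(correct):
    """Dataset-level metric: fraction of samples answered correctly."""
    return sum(correct) / len(correct)


def flip_rate(run_a, run_b):
    """Sample-level metric: fraction of samples whose correctness differs between runs."""
    return sum(a != b for a, b in zip(run_a, run_b)) / len(run_a)


# Hypothetical per-sample correctness (1 = correct) for two runs on the same 8 inputs.
run_a = [1, 1, 1, 0, 0, 1, 0, 1]
run_b = [0, 1, 1, 1, 1, 0, 0, 1]

print(accuracy(run_a), accuracy(run_b), flip_rate(run_a, run_b))
```

Both runs score 0.625 in aggregate, yet they disagree on half of the individual inputs, so an accuracy comparison alone would report no difference between them.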


【9】When Less Latent Leads to Better Relay: Information-Preserving Compression for Latent Multi-Agent LLM Collaboration
Link: https://arxiv.org/abs/2604.13349

Authors: Yiping Li, Zhiyu An, Wan Du
Abstract: Communication in Large Language Model (LLM)-based multi-agent systems is moving beyond discrete tokens to preserve richer context. Recent work such as LatentMAS enables agents to exchange latent messages through full key-value (KV) caches. However, full KV relay incurs high memory and communication cost. We adapt eviction-style KV compression to this setting and introduce Orthogonal Backfill (OBF) to mitigate information loss from hard eviction. OBF injects a low-rank orthogonal residual from discarded KV states into the retained KV states. We evaluate the proposed method against full KV relay on nine standard benchmarks spanning mathematical reasoning, coding, and knowledge-intensive QA. It achieves performance comparable to full KV relay while reducing communication cost by 79.8%-89.4%. OBF further improves the performance and achieves the best results on 7 of the 9 benchmarks. This suggests that more information does not necessarily lead to better communication; preserving the most useful information matters more. Our codebase is publicly available at https://github.com/markli404/When-Less-Latent-Leads-to-Better-Relay.


【10】Text-Attributed Knowledge Graph Enrichment with Large Language Models for Medical Concept Representation
Link: https://arxiv.org/abs/2604.13331

Authors: Mohsen Nayebi Kerdabadi, Arya Hadizadeh Moghaddam, Chen Chen, Dongjie Wang, Zijun Yao
Comments: This paper has been accepted at ACL 2026 main conference
Abstract: In electronic health record (EHR) mining, learning high-quality representations of medical concepts (e.g., standardized diagnosis, medication, and procedure codes) is fundamental for downstream clinical prediction. However, robust concept representation learning is hindered by two key challenges: (i) clinically important cross-type dependencies (e.g., diagnosis-medication and medication-procedure relations) are often missing or incomplete in existing ontology resources, limiting the ability to model complex EHR patterns; and (ii) rich clinical semantics are often missing from structured resources, and even when available as text, are difficult to integrate with KG structure for representation learning. To address these challenges, we present CoMed, an LLM-empowered graph learning framework for medical concept representation. CoMed first builds a global knowledge graph (KG) over medical codes by combining statistically reliable associations mined from EHRs with type-constrained LLM prompting to infer semantic relations. It then utilizes LLMs to enrich the KG into a text-attributed graph by generating node descriptions and edge rationales, providing semantic signals for both concepts and their relationships. Finally, CoMed jointly trains a LoRA-tuned LLaMA text encoder with a heterogeneous GNN, fusing text semantics and graph structure into unified concept embeddings. Extensive experiments on MIMIC-III and MIMIC-IV show that CoMed consistently improves prediction performance and serves as an effective plug-in concept encoder for standard EHR pipelines.


【11】Multi-Task LLM with LoRA Fine-Tuning for Automated Cancer Staging and Biomarker Extraction
Link: https://arxiv.org/abs/2604.13328

Authors: Jiahao Shao, Anam Nawaz Khan, Christopher Brett, Tom Berg, Xueping Li, Bing Yao
Comments: 11 pages, 3 figures and 4 tables in the main manuscript. Additional content, figures and tables are in the supplementary material section. 17 pages in total
Abstract: Pathology reports serve as the definitive record for breast cancer staging, yet their unstructured format impedes large-scale data curation. While Large Language Models (LLMs) offer semantic reasoning, their deployment is often limited by high computational costs and hallucination risks. This study introduces a parameter-efficient, multi-task framework for automating the extraction of Tumor-Node-Metastasis (TNM) staging, histologic grade, and biomarkers. We fine-tune a Llama-3-8B-Instruct encoder using Low-Rank Adaptation (LoRA) on a curated, expert-verified dataset of 10,677 reports. Unlike generative approaches, our architecture utilizes parallel classification heads to enforce consistent schema adherence. Experimental results demonstrate that the model achieves a Macro F1 score of 0.976, successfully resolving complex contextual ambiguities and heterogeneous reporting formats that challenge traditional extraction methods including rule-based natural language processing (NLP) pipelines, zero-shot LLMs, and single-task LLM baselines. The proposed adapter-efficient, multi-task architecture enables reliable, scalable pathology-derived cancer staging and biomarker profiling, with the potential to enhance clinical decision support and accelerate data-driven oncology research.


【12】MOONSHOT: A Framework for Multi-Objective Pruning of Vision and Large Language Models
Link: https://arxiv.org/abs/2604.13287

Authors: Gabriel Afriat, Xiang Meng, Shibal Ibrahim, Hussein Hazimeh, Rahul Mazumder
Abstract: Weight pruning is a common technique for compressing large neural networks. We focus on the challenging post-training one-shot setting, where a pre-trained model is compressed without any retraining. Existing one-shot pruning methods typically optimize a single objective, such as a layer-wise reconstruction loss or a second-order Taylor approximation of the training loss. We highlight that neither objective alone is consistently the most effective across architectures and sparsity levels. Motivated by this insight, we propose MOONSHOT, a general and flexible framework that extends any single-objective pruning method into a multi-objective formulation by jointly optimizing both the layer-wise reconstruction error and the second-order Taylor approximation of the training loss. MOONSHOT acts as a wrapper around existing pruning algorithms. To enable this integration while maintaining scalability to billion-parameter models, we propose modeling decisions and introduce an efficient procedure for computing the inverse Hessian, preserving the efficiency of state-of-the-art one-shot pruners. When combined with state-of-the-art pruning methods on Llama-3.2 and Llama-2 models, MOONSHOT reduces C4 perplexity by up to 32.6% at 2:4 sparsity and improves zero-shot mean accuracy across seven classification benchmarks by up to 4.9 points. On Vision Transformers, it improves accuracy on ImageNet-1k by over 5 points at 70% sparsity, and on ResNet-50, it yields a 4-point gain at 90% sparsity.
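The "2:4 sparsity" pattern mentioned in this abstract means that at most two of every four consecutive weights are nonzero. The sketch below shows that constraint with plain magnitude selection, which is a simple stand-in and not MOONSHOT's multi-objective criterion; the function name and the weights are hypothetical.

```python
def prune_2_of_4(weights):
    """Zero the two smallest-magnitude weights in each consecutive group of four."""
    if len(weights) % 4 != 0:
        raise ValueError("length must be a multiple of 4")
    pruned = []
    for start in range(0, len(weights), 4):
        group = weights[start:start + 4]
        # Indices of the two largest-magnitude entries in this group.
        order = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)
        keep = set(order[:2])
        pruned.extend(w if i in keep else 0.0 for i, w in enumerate(group))
    return pruned


# Hypothetical weight row with two groups of four.
row = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.25, 0.01]
sparse_row = prune_2_of_4(row)
print(sparse_row)
```

A one-shot pruner of the kind the abstract describes would replace the magnitude rule with a score derived from its objective (for example reconstruction error or a second-order loss approximation) while still respecting the same two-of-four pattern.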


【13】Enhancing Confidence Estimation in Telco LLMs via Twin-Pass CoT-Ensembling
Link: https://arxiv.org/abs/2604.13271

Authors: Anton Saenko, Pranshav Gajjar, Abiodun Ganiyu, Vijay K. Shah
Abstract: Large Language Models (LLMs) are increasingly applied to complex telecommunications tasks, including 3GPP specification analysis and O-RAN network troubleshooting. However, a critical limitation remains: LLM-generated confidence scores are often biased and unreliable, frequently exhibiting systematic overconfidence. This lack of trustworthy self-assessment makes it difficult to verify model outputs and safely rely on them in practice. In this paper, we study confidence calibration in telecom-domain LLMs using the representative Gemma-3 model family (4B, 12B, and 27B parameters), evaluated on TeleQnA, ORANBench, and srsRANBench. We show that standard single-pass, verbalized confidence estimates fail to reflect true correctness, often assigning high confidence to incorrect predictions. To address this, we propose a novel Twin-Pass Chain of Thought (CoT)-Ensembling methodology for improving confidence estimation by leveraging multiple independent reasoning evaluations and aggregating their assessments into a calibrated confidence score. Our approach reduces Expected Calibration Error (ECE) by up to 88% across benchmarks, significantly improving the reliability of model self-assessment. These results highlight the limitations of current confidence estimation practices and demonstrate a practical path toward more trustworthy evaluation of LLM outputs in telecommunications.
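Expected Calibration Error, the metric this abstract reports, is commonly computed by binning predictions by confidence and averaging the gap between accuracy and mean confidence per bin. The sketch below uses that common equal-width-bin definition; the bin count, the half-open bin edges, and the toy data are assumptions, and the paper's exact evaluation setup may differ.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width bins: weighted mean of |accuracy - mean confidence| per bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[index].append((conf, hit))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if members:
            mean_conf = sum(c for c, _ in members) / len(members)
            acc = sum(h for _, h in members) / len(members)
            ece += len(members) / total * abs(acc - mean_conf)
    return ece


# Hypothetical verbalized confidences and correctness flags (1 = correct).
confidences = [0.95, 0.95, 0.95, 0.95, 0.55, 0.55]
correct = [1, 1, 0, 0, 1, 0]
ece = expected_calibration_error(confidences, correct)
print(round(ece, 4))
```

Here the 0.95-confidence answers are right only half the time, so most of the error comes from that overconfident bin, mirroring the systematic overconfidence the abstract describes.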


【14】KV Packet: Recomputation-Free Context-Independent KV Caching for LLMs
Link: https://arxiv.org/abs/2604.13226

Authors: Chuangtao Chen, Grace Li Zhang, Xunzhao Yin, Cheng Zhuo, Bing Li, Ulf Schlichtmann
Abstract: Large Language Models (LLMs) rely heavily on Key-Value (KV) caching to minimize inference latency. However, standard KV caches are context-dependent: reusing a cached document in a new context requires recomputing KV states to account for shifts in attention distribution. Existing solutions such as CacheBlend, EPIC, and SAM-KV mitigate this issue by selectively recomputing a subset of tokens; however, they still incur non-negligible computational overhead (FLOPs) and increased Time-to-First-Token (TTFT) latency. In this paper, we propose KV Packet, a recomputation-free cache reuse framework that treats cached documents as immutable "packets" wrapped in light-weight trainable soft-token adapters, which are trained via self-supervised distillation to bridge context discontinuities. Experiments on Llama-3.1 and Qwen2.5 demonstrate that the proposed KV Packet method achieves near-zero FLOPs and lower TTFT than recomputation-based baselines, while retaining F1 scores comparable to those of the full recomputation baseline.


【15】Numerical Instability and Chaos: Quantifying the Unpredictability of Large Language Models
Link: https://arxiv.org/abs/2604.13206

Authors: Chashi Mahiul Islam, Alan Villarreal, Mao Nishino, Shaeke Salman, Xiuwen Liu
Comments: 8 pages, 9 figures
Abstract: As Large Language Models (LLMs) are increasingly integrated into agentic workflows, their unpredictability stemming from numerical instability has emerged as a critical reliability issue. While recent studies have demonstrated the significant downstream effects of these instabilities, the root causes and underlying mechanisms remain poorly understood. In this paper, we present a rigorous analysis of how unpredictability is rooted in the finite numerical precision of floating-point representations, tracking how rounding errors propagate, amplify, or dissipate through Transformer computation layers. Specifically, we identify a chaotic "avalanche effect" in the early layers, where minor perturbations trigger binary outcomes: either rapid amplification or complete attenuation. Beyond specific error instances, we demonstrate that LLMs exhibit universal, scale-dependent chaotic behaviors characterized by three distinct regimes: 1) a stable regime, where perturbations fall below an input-dependent threshold and vanish, resulting in constant outputs; 2) a chaotic regime, where rounding errors dominate and drive output divergence; and 3) a signal-dominated regime, where true input variations override numerical noise. We validate these findings extensively across multiple datasets and model architectures.
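The rounding errors this abstract refers to originate in the fact that floating-point addition is not associative, so the order in which terms are reduced changes the result by a few units in the last place. The sketch below demonstrates only that starting point with Python's 64-bit floats; it does not reproduce the paper's layer-wise analysis.

```python
# The same three numbers summed in two groupings.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)

# The results differ in the last bits even though the math is identical.
print(left, right, left == right)
print(abs(left - right))
```

In a Transformer, reductions such as dot products and softmax sums run in different orders depending on batch size, kernel choice, and hardware, which is one way such last-bit differences can enter and then be amplified or damped as the abstract describes.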


【16】LiveClawBench: Benchmarking LLM Agents on Complex, Real-World Assistant Tasks
Link: https://arxiv.org/abs/2604.13072

Authors: Xiang Long, Li Du, Yilong Xu, Fangcheng Liu, Haoqing Wang, Ning Ding, Ziheng Li, Jianyuan Guo, Yehui Tang
Abstract: LLM-based agents are increasingly expected to handle real-world assistant tasks, yet existing benchmarks typically evaluate them under isolated sources of difficulty, such as a single environment or fully specified instructions. This leaves a substantial gap between current evaluation settings and the compositional challenges that arise in practical deployment. To address this gap, we introduce LiveClawBench, a benchmark to evaluate LLM agents on real-world assistant tasks. Based on an analysis of various real OpenClaw usage cases, we derive a Triple-Axis Complexity Framework that characterizes task difficulty along three dimensions: Environment Complexity, Cognitive Demand, and Runtime Adaptability. Guided by this framework, we construct a pilot benchmark with explicit complexity-factor annotations, covering real-world assistant tasks with compositional difficulty. Together, the framework and benchmark provide a principled foundation for evaluating LLM agents in realistic assistant settings, and establish a basis for future expansion across task domains and complexity axes. We are continuing to enrich our case collections to achieve more comprehensive domain and complexity coverage. The project page is at https://github.com/Mosi-AI/LiveClawBench.


【17】Before the First Token: Scale-Dependent Emergence of Hallucination Signals in Autoregressive Language Models
Link: https://arxiv.org/abs/2604.13068

Authors: Dip Roy, Rajiv Misra, Sanjay Kumar Singh, Anisha Roy
Abstract: When do large language models decide to hallucinate? Despite serious consequences in healthcare, law, and finance, few formal answers exist. Recent work shows autoregressive models maintain internal representations distinguishing factual from fictional outputs, but when these representations peak as a function of model scale remains poorly understood. We study the temporal dynamics of hallucination-indicative internal representations across 7 autoregressive transformers (117M-7B parameters) using three fact-based datasets (TriviaQA, Simple Facts, Biography; 552 labeled examples). We identify a scale-dependent phase transition: models below 400M parameters show chance-level probe accuracy at every generation position (AUC = 0.48-0.67), indicating no reliable factuality signal. Above ~1B parameters, a qualitatively different regime emerges where peak detectability occurs at position zero -- before any tokens are generated -- then declines during generation. This pre-generation signal is statistically significant in both Pythia-1.4B (p = 0.012) and Qwen2.5-7B (p = 0.038), spanning distinct architectures and training corpora. At the 7B scale, we observe a striking dissociation: Pythia-6.9B (base model, trained on The Pile) produces a flat temporal profile (Δ = +0.001, p = 0.989), while instruction-tuned Qwen2.5-7B shows a dominant pre-generation effect. This indicates raw scale alone is insufficient -- knowledge organization through instruction tuning or equivalent post-training is required for pre-commitment encoding. Activation steering along probe-derived directions fails to correct hallucinations across all models, confirming the signal is correlational rather than causal. Our findings provide scale-calibrated detection protocols and a concrete hypothesis on instruction tuning's role in developing knowledge circuits supporting factual generation.


【18】Lossless Prompt Compression via Dictionary-Encoding and In-Context Learning: Enabling Cost-Effective LLM Analysis of Repetitive Data
Link: https://arxiv.org/abs/2604.13066

Authors: Andresa Rodrigues de Campos, David Lee, Imry Kissos, Piyush Paritosh
Abstract: In-context learning has established itself as an important learning paradigm for Large Language Models (LLMs). In this paper, we demonstrate that LLMs can learn encoding keys in-context and perform analysis directly on encoded representations. This finding enables lossless prompt compression via dictionary encoding without model fine-tuning: frequently occurring subsequences are replaced with compact meta-tokens, and when provided with the compression dictionary in the system prompt, LLMs correctly interpret these meta-tokens during analysis, producing outputs equivalent to those from uncompressed inputs. We present a compression algorithm that identifies repetitive patterns at multiple length scales, incorporating a token-savings optimization criterion that ensures compression reduces costs by preventing dictionary overhead from exceeding savings. The algorithm achieves compression ratios up to 80% depending on dataset characteristics. To validate that LLM analytical accuracy is preserved under compression, we use decompression as a proxy task with unambiguous ground truth. Evaluation on the LogHub 2.0 benchmark using Claude 3.7 Sonnet demonstrates exact match rates exceeding 0.99 for template-based compression and average Levenshtein similarity scores above 0.91 for algorithmic compression, even at compression ratios of 60%-80%. Additionally, compression ratio explains less than 2% of variance in similarity metrics, indicating that decompression quality depends on dataset characteristics rather than compression intensity. This training-free approach works with API-based LLMs, directly addressing fundamental deployment constraints -- token limits and API costs -- and enabling cost-effective analysis of large-scale repetitive datasets, even as data patterns evolve over time.
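The dictionary-encoding idea in this abstract can be sketched in a few lines. The code below is a minimal illustration, not the authors' algorithm. It assumes whitespace tokens rather than a real LLM tokenizer, and a single fixed pattern length `n` rather than multiple length scales. It also assumes the `<M0>`-style meta-token names never occur in the input. The functions `compress` and `decompress` are hypothetical names. A pattern is accepted only if the tokens it saves exceed the cost of its dictionary entry, which is one simple reading of the token-savings criterion.

```python
from collections import Counter

def compress(tokens, n=3, max_entries=10):
    """Greedily replace the most frequent n-gram with a meta-token while it pays off."""
    tokens = list(tokens)
    dictionary = {}
    for idx in range(max_entries):
        counts = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
        if not counts:
            break
        gram = counts.most_common(1)[0][0]
        meta = f"<M{idx}>"
        replaced, i = [], 0
        while i < len(tokens):
            if tuple(tokens[i:i + n]) == gram:
                replaced.append(meta)
                i += n
            else:
                replaced.append(tokens[i])
                i += 1
        # Token-savings check: a dictionary entry costs n + 1 tokens (meta-token plus pattern).
        if len(replaced) + n + 1 >= len(tokens):
            break
        dictionary[meta] = gram
        tokens = replaced
    return tokens, dictionary

def decompress(tokens, dictionary):
    """Expand newest entries first, since later patterns may contain earlier meta-tokens."""
    for meta in reversed(list(dictionary)):
        expanded = []
        for tok in tokens:
            expanded.extend(dictionary[meta] if tok == meta else [tok])
        tokens = expanded
    return tokens

log = ("ERROR disk full on node1 " * 4 + "INFO ok").split()
packed, table = compress(log)
print(len(log), len(packed), table)
```

In the setting the abstract describes, `table` would go into the system prompt and `packed` into the user message. Here `decompress` plays the role of the ground-truth check that the encoding is lossless.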


【19】A Domain-Specific Language for LLM-Driven Trigger Generation in Multimodal Data Collection
Link: https://arxiv.org/abs/2604.13046

Authors: Philipp Reis, Philipp Rigoll, Martin Zehetner, Jacqueline Henle, Stefan Otten, Eric Sax
Comments: Version submitted to the IEEE International Conference on Intelligent Transportation Systems (ITSC 2026)
Abstract: Data-driven systems depend on task-relevant data, yet data collection pipelines remain passive and indiscriminate. Continuous logging of multimodal sensor streams incurs high storage costs and captures irrelevant data. This paper proposes a declarative framework for intent-driven, on-device data collection that enables selective collection of multimodal sensor data based on high-level user requests. The framework combines natural language interaction with a formally specified domain-specific language (DSL). Large language models translate user-defined requirements into verifiable and composable DSL programs that define conditional triggers across heterogeneous sensors, including cameras, LiDAR, and system telemetry. Empirical evaluation on vehicular and robotic perception tasks shows that the DSL-based approach achieves higher generation consistency and lower execution latency than unconstrained code generation while maintaining comparable detection performance. The structured abstraction supports modular trigger composition and concurrent deployment on resource-constrained edge platforms. This approach replaces passive logging with a verifiable, intent-driven mechanism for multimodal data collection in real-time systems.


【20】When Reasoning Models Hurt Behavioral Simulation: A Solver-Sampler Mismatch in Multi-Agent LLM Negotiation
Link: https://arxiv.org/abs/2604.11840

Authors: Sandro Andric
Comments: 12 pages, 5 figures, supplementary material included as ancillary file
Abstract: Large language models are increasingly used as agents in social, economic, and policy simulations. A common assumption is that stronger reasoning should improve simulation fidelity. We argue that this assumption can fail when the objective is not to solve a strategic problem, but to sample plausible boundedly rational behavior. In such settings, reasoning-enhanced models can become better solvers and worse simulators: they can over-optimize for strategically dominant actions, collapse compromise-oriented terminal behavior, and sometimes exhibit a diversity-without-fidelity pattern in which local variation survives without outcome-level fidelity. We study this solver-sampler mismatch in three multi-agent negotiation environments adapted from earlier simulation work: an ambiguous fragmented-authority trading-limits scenario, an ambiguous unified-opposition trading-limits scenario, and a new-domain grid-curtailment case in emergency electricity management. We compare three reflection conditions (no reflection, bounded reflection, and native reasoning) across two primary model families and then extend the same protocol to direct OpenAI runs with GPT-4.1 and GPT-5.2. Across all three experiments, bounded reflection produces substantially more diverse and compromise-oriented trajectories than either no reflection or native reasoning. In the direct OpenAI extension, GPT-5.2 native ends in authority decisions in 45 of 45 runs across the three experiments, while GPT-5.2 bounded recovers compromise outcomes in every environment. The contribution is not a claim that reasoning is generally harmful. It is a methodological warning: model capability and simulation fidelity are different objectives, and behavioral simulation should qualify models as samplers, not only as solvers.


Graph (graph learning | graph neural networks | graph optimization, etc.) (3 papers)

【1】ID and Graph View Contrastive Learning with Multi-View Attention Fusion for Sequential Recommendation
Link: https://arxiv.org/abs/2604.14114

Authors: Xiaofan Zhou, Kyumin Lee
Abstract: Sequential recommendation has become increasingly prominent in both academia and industry, particularly in e-commerce. The primary goal is to extract user preferences from historical interaction sequences and predict items a user is likely to engage with next. Recent advances have leveraged contrastive learning and graph neural networks to learn more expressive representations from interaction histories -- graphs capture relational structure between nodes, while ID-based representations encode item-specific information. However, few studies have explored multi-view contrastive learning between ID and graph perspectives to jointly improve user and item representations, especially in settings where only interaction data is available without auxiliary information. To address this gap, we propose Multi-View Contrastive learning for sequential recommendation (MVCrec), a framework that integrates complementary signals from both sequential (ID-based) and graph-based views. MVCrec incorporates three contrastive objectives: within the sequential view, within the graph view, and across views. To effectively fuse the learned representations, we introduce a multi-view attention fusion module that combines global and local attention mechanisms to estimate the likelihood of a target user purchasing a target item. Comprehensive experiments on five real-world benchmark datasets demonstrate that MVCrec consistently outperforms 11 state-of-the-art baselines, achieving improvements of up to 14.44% in NDCG@10 and 9.22% in HitRatio@10 over the strongest baseline. Our code and datasets are available at https://github.com/sword-Lz/MMCrec.


【2】Automated co-design of high-performance thermodynamic cycles via graph-based hierarchical reinforcement learning
Link: https://arxiv.org/abs/2604.13133

Authors: Wenqing Li, Xu Feng, Peixue Jiang, Yinhai Zhu
Comments: 21 pages, 8 figures
Abstract: Thermodynamic cycles are pivotal in determining the efficacy of energy conversion systems. Traditional design methodologies, which rely on expert knowledge or exhaustive enumeration, are inefficient and lack scalability, thereby constraining the discovery of high-performance cycles. In this study, we introduce a graph-based hierarchical reinforcement learning approach for the co-design of structure parameters in thermodynamic cycles. These cycles are encoded as graphs, with components and connections depicted as nodes and edges, adhering to grammatical constraints. A deep learning-based thermophysical surrogate facilitates stable graph decoding and the simultaneous resolution of global parameters. Building on this foundation, we develop a hierarchical reinforcement learning framework wherein a high-level manager explores structural evolution and proposes candidate configurations, whereas a low-level worker optimizes parameters and provides performance rewards to steer the search towards high-performance regions. By integrating graph representation, thermophysical surrogate, and manager-worker learning, this method establishes a fully automated pipeline for encoding, decoding, and co-optimization. Using heat pump and heat engine cycles as case studies, the results demonstrate that the proposed method not only replicates classical cycle configurations but also identifies 18 and 21 novel heat pump and heat engine cycles, respectively. Relative to classical cycles, the novel configurations exhibit performance improvements of 4.6% and 133.3%, respectively, surpassing the traditional designs. This method effectively balances efficiency with broad applicability, providing a practical and scalable intelligent alternative to expert-driven thermodynamic cycle design.


【3】AeTHERON: Autoregressive Topology-aware Heterogeneous Graph Operator Network for Fluid-Structure Interaction
Link: https://arxiv.org/abs/2604.13369

Authors: Sushrut Kumar
Abstract: Surrogate modeling of body-driven fluid flows where immersed moving boundaries couple structural dynamics to chaotic, unsteady fluid phenomena remains a fundamental challenge for both computational physics and machine learning. We present AeTHERON, a heterogeneous graph neural operator whose architecture directly mirrors the structure of the sharp-interface immersed boundary method (IBM): a dual-graph representation separating fluid and structural domains, coupled through sparse cross-attention that reflects the compact support of IBM interpolation stencils. This physics-informed inductive bias enables AeTHERON to learn nonlinear fluid-structure coupling in a shared high-dimensional latent space, with continuous sinusoidal time embeddings providing temporal generalization across lead times. We evaluate AeTHERON on direct numerical simulations of a flapping flexible caudal fin, a canonical FSI benchmark featuring leading-edge vortex formation, large membrane deformation, and chaotic wake shedding across a 4x5 parameter grid of membrane thickness (h* = 0.01-0.04) and Strouhal number (St = 0.30-0.50). As a proof-of-concept, we train on the first 150 timesteps of a representative case using a 70/30 train/validation split and evaluate on the fully unseen extrapolation window t=150-200. AeTHERON captures large-scale vortex topology and wake structure with qualitative fidelity, achieving a mean extrapolation MAE of 0.168 without retraining, with error peaking near flapping half-cycle transitions where flow reorganization is most rapid -- a physically interpretable pattern consistent with the nonlinear fluid-membrane coupling. Inference requires milliseconds per timestep on a single GPU versus hours for equivalent DNS computation. This is a continuously developing preprint; results and figures will be updated in subsequent versions.


Transformer (4 papers)

【1】Artificial intelligence application in lymphoma diagnosis with Vision Transformer using weakly supervised training
Link: https://arxiv.org/abs/2604.13795

Authors: Nghia, Nguyen, Amer Wahed, Andy Quesada, Yasir Ali, Hanadi El Achi, Y. Helen Zhang, Jocelyn Ursua, Alex Banerjee, Sahib Kalra, L. Jeffrey Medeiros, Jie Xu
Comments: 23 pages, 6 figures, 1 table
Abstract: Vision transformers (ViTs) have been shown to allow for more flexible feature detection and can outperform convolutional neural networks (CNNs) when pre-trained on sufficient data. Due to their promising feature detection capabilities, we deployed ViTs for morphological classification of anaplastic large cell lymphoma (ALCL) versus classic Hodgkin lymphoma (cHL). We had previously designed a ViT model which was trained on a small dataset of 1,200 image patches in fully supervised training. That model achieved a diagnostic accuracy of 100% and an F1 score of 1.0 on the independent test set. Since fully supervised training is not a practical method due to lack of expertise resources in both the training and testing phases, we conducted a recent study on a modified approach to training data (weakly supervised training) and show that labeling training image patches automatically at the slide level of each whole-slide image is a more practical solution for clinical use of Vision Transformers. Our ViT model, trained on a larger dataset of 100,000 image patches, yields accuracy, F1 score, and area under the curve (AUC) of 91.85%, 0.92, and 0.98, respectively. These are respectable values that qualify this ViT model, with weakly supervised training, as a suitable tool for a deep learning module in clinical model development using automated image patch extraction.


【2】Ordinary Least Squares is a Special Case of Transformer
Link: https://arxiv.org/abs/2604.13656

Authors: Xiaojun Tan, Yuchen Zhao
Abstract: The statistical essence of the Transformer architecture has long remained elusive: Is it a universal approximator, or a neural network version of known computational algorithms? Through rigorous algebraic proof, we show that the latter better describes the Transformer's basic nature: Ordinary Least Squares (OLS) is a special case of the single-layer Linear Transformer. Using the spectral decomposition of the empirical covariance matrix, we construct a specific parameter setting where the attention mechanism's forward pass becomes mathematically equivalent to the OLS closed-form projection. This means attention can solve the problem in one forward pass, not by iterating. Building upon this prototypical case, we further uncover a decoupled slow and fast memory mechanism within Transformers. Finally, the evolution from our established linear prototype to standard Transformers is discussed. This progression facilitates the transition of the Hopfield energy function from linear to exponential memory capacity, thereby establishing a clear continuity between modern deep architectures and classical statistical inference.
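The claimed equivalence can be checked numerically for one plausible parameter setting. The sketch below is not the paper's construction, which the abstract does not spell out. It assumes softmax-free (linear) attention and a shared query/key projection `whiten` equal to the inverse square root of the unnormalized covariance X^T X, built from its eigendecomposition. The values are taken to be the regression targets. All variable names and the synthetic data are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))                    # context inputs x_i
y = X @ np.array([1.5, -2.0, 0.5]) + 0.1 * rng.normal(size=50)
x_query = rng.normal(size=3)

# Spectral decomposition of X^T X gives the projection W = (X^T X)^{-1/2}.
eigvals, eigvecs = np.linalg.eigh(X.T @ X)
whiten = eigvecs @ np.diag(eigvals ** -0.5) @ eigvecs.T

keys = X @ whiten            # k_i = W x_i (W is symmetric)
query = whiten @ x_query     # q = W x_query
values = y                   # v_i = y_i

# Linear attention without softmax: sum_i (q . k_i) v_i
attention_pred = (keys @ query) @ values

# Closed-form OLS prediction for comparison.
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
ols_pred = x_query @ beta_ols

print(attention_pred, ols_pred)
```

Under these assumptions the attention output equals x_query^T (X^T X)^{-1} X^T y, which is exactly the OLS prediction, so the two printed numbers agree up to floating-point error.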


【3】Bridging MARL to SARL: An Order-Independent Multi-Agent Transformer via Latent Consensus
Link: https://arxiv.org/abs/2604.13472

Authors: Zijian Zhao, Jing Gao, Sen Li
Abstract: Cooperative multi-agent reinforcement learning (MARL) is widely used to address large joint observation and action spaces by decomposing a centralized control problem into multiple interacting agents. However, such decomposition often introduces additional challenges, including non-stationarity, unstable training, weak coordination, and limited theoretical guarantees. In this paper, we propose the Consensus Multi-Agent Transformer (CMAT), a centralized framework that bridges cooperative MARL to a hierarchical single-agent reinforcement learning (SARL) formulation. CMAT treats all agents as a unified entity and employs a Transformer encoder to process the large joint observation space. To handle the extensive joint action space, we introduce a hierarchical decision-making mechanism in which a Transformer decoder autoregressively generates a high-level consensus vector, simulating the process by which agents reach agreement on their strategies in latent space. Conditioned on this consensus, all agents generate their actions simultaneously, enabling order-independent joint decision making and avoiding the sensitivity to action-generation order in conventional Multi-Agent Transformers (MAT). This factorization allows the joint policy to be optimized using single-agent PPO while preserving expressive coordination through the latent consensus. To evaluate the proposed method, we conduct experiments on benchmark tasks from StarCraft II, Multi-Agent MuJoCo, and Google Research Football. The results show that CMAT achieves superior performance over recent centralized solutions, sequential MARL methods, and conventional MARL baselines. The code for this paper is available at https://github.com/RS2002/CMAT.


【4】A KL Lens on Quantization: Fast, Forward-Only Sensitivity for Mixed-Precision SSM-Transformer Models
Link: https://arxiv.org/abs/2604.13440

Authors: Jason Kong, Nilesh Prasad Pandey, Flavio Ponzina, Tajana Rosing
Abstract: Deploying Large Language Models (LLMs) on edge devices faces severe computational and memory constraints, limiting real-time processing and on-device intelligence. Hybrid architectures combining Structured State Space Models (SSMs) with transformer-based LLMs offer a balance of efficiency and performance. Aggressive quantization can drastically cut model size and speed up inference, but its uneven effects on different components require careful management. In this work, we propose a lightweight, backpropagation-free, surrogate-based sensitivity analysis framework to identify hybrid SSM-Transformer components most susceptible to quantization-induced degradation. Relying solely on forward-pass metrics, our method avoids expensive gradient computations and retraining, making it suitable for situations where access to in-domain data is limited due to proprietary restrictions or privacy constraints. We also provide a formal analysis showing that the Kullback-Leibler (KL) divergence metric better captures quantization sensitivity for language modeling tasks than widely adopted alternatives such as mean squared error (MSE) and signal-to-quantization-noise ratio (SQNR). Through extensive experiments on SSM and hybrid architectures, our ablation studies confirm that KL-based rankings align with observed performance drops and outperform alternative metrics. This framework enables the practical deployment of advanced hybrid models on resource-constrained edge devices with minimal accuracy loss. We further validate our approach with real-world on-device profiling on Intel Lunar Lake hardware, demonstrating that KL-guided mixed-precision achieves near-FP16 perplexity with model sizes and throughput competitive with Uniform INT4 on both CPU and GPU execution modes. Code is available at https://github.com/jasonkongie/kl-ssm-quant.
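A forward-only KL sensitivity score of the kind this abstract describes might look like the sketch below. It is an illustration under stated assumptions, not the code from the linked repository. The "model" is a single random projection, and `fake_quantize` is a plain symmetric round-to-nearest quantizer standing in for whatever scheme the authors use. The names `kl_sensitivity`, `hidden`, and `weights` are hypothetical.

```python
import numpy as np

def log_softmax(logits):
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def kl_sensitivity(ref_logits, quant_logits):
    """Mean KL(p_ref || p_quant) over token positions, using forward passes only."""
    log_p = log_softmax(ref_logits)
    log_q = log_softmax(quant_logits)
    return float((np.exp(log_p) * (log_p - log_q)).sum(axis=-1).mean())

def fake_quantize(weights, bits):
    """Uniform symmetric quantization with one scale per tensor (an assumed stand-in)."""
    levels = 2 ** (bits - 1) - 1
    scale = np.abs(weights).max() / levels
    return np.round(weights / scale) * scale

rng = np.random.default_rng(0)
hidden = rng.normal(size=(8, 16))     # hypothetical activations at 8 token positions
weights = rng.normal(size=(16, 32))   # hypothetical output projection, vocabulary of 32
ref_logits = hidden @ weights

kl_int8 = kl_sensitivity(ref_logits, hidden @ fake_quantize(weights, bits=8))
kl_int4 = kl_sensitivity(ref_logits, hidden @ fake_quantize(weights, bits=4))
print(f"KL at 8 bits: {kl_int8:.6f}, KL at 4 bits: {kl_int4:.6f}")
```

Scoring each component this way and ranking by KL would give a mixed-precision assignment without gradients or retraining: components with the largest divergence keep higher precision.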


GAN | Adversarial | Attack | Generation (3 papers)

【1】HINTBench: Horizon-agent Intrinsic Non-attack Trajectory Benchmark
Link: https://arxiv.org/abs/2604.13954

Authors: Jiacheng Wang, Jinchang Hou, Fabian Wang, Ping Jian, Chenfu Bao, Zhonghou Lv
Abstract: Existing agent-safety evaluation has focused mainly on externally induced risks. Yet agents may still enter unsafe trajectories under benign conditions. We study this complementary but underexplored setting through the lens of intrinsic risk, where intrinsic failures remain latent, propagate across long-horizon execution, and eventually lead to high-consequence outcomes. To evaluate this setting, we introduce non-attack intrinsic risk auditing and present HINTBench, a benchmark of 629 agent trajectories (523 risky, 106 safe; 33 steps on average) supporting three tasks: risk detection, risk-step localization, and intrinsic failure-type identification. Its annotations are organized under a unified five-constraint taxonomy. Experiments reveal a substantial capability gap: strong LLMs perform well on trajectory-level risk detection, but their performance drops to below 35 Strict-F1 on risk-step localization, while fine-grained failure diagnosis proves even harder. Existing guard models transfer poorly to this setting. These findings establish intrinsic risk auditing as an open challenge for agent safety.


【2】ASTER: Latent Pseudo-Anomaly Generation for Unsupervised Time-Series Anomaly Detection
Link: https://arxiv.org/abs/2604.13924

Authors: Romain Hermary, Samet Hicsonmez, Dan Pineau, Abd El Rahman Shabayek, Djamila Aouada
Abstract: Time-series anomaly detection (TSAD) is critical in domains such as industrial monitoring, healthcare, and cybersecurity, but it remains challenging due to rare and heterogeneous anomalies and the scarcity of labelled data. This scarcity makes unsupervised approaches predominant, yet existing methods often rely on reconstruction or forecasting, which struggle with complex data, or on embedding-based approaches that require domain-specific anomaly synthesis and fixed distance metrics. We propose ASTER, a framework that generates pseudo-anomalies directly in the latent space, avoiding handcrafted anomaly injections and the need for domain expertise. A latent-space decoder produces tailored pseudo-anomalies to train a Transformer-based anomaly classifier, while a pre-trained LLM enriches the temporal and contextual representations of this space. Experiments on three benchmark datasets show that ASTER achieves state-of-the-art performance and sets a new standard for LLM-based TSAD.


【3】Enhancing Reinforcement Learning for Radiology Report Generation with Evidence-aware Rewards and Self-correcting Preference Learning
Link: https://arxiv.org/abs/2604.13598

Authors: Qin Zhou, Guoyan Liang, Qianyi Yang, Jingyuan Chen, Sai Wu, Chang Yao, Zhe Wang
Comments: 13 pages, 4 figures, ACL2026-main
Abstract: Recent reinforcement learning (RL) approaches have advanced radiology report generation (RRG), yet two core limitations persist: (1) report-level rewards offer limited evidence-grounded guidance for clinical faithfulness; and (2) current methods lack an explicit self-improving mechanism to align with clinical preference. We introduce clinically aligned Evidence-aware Self-Correcting Reinforcement Learning (ESC-RL), comprising two key components. First, a Group-wise Evidence-aware Alignment Reward (GEAR) delivers group-wise, evidence-aware feedback. GEAR reinforces consistent grounding for true positives, recovers missed findings for false negatives, and suppresses unsupported content for false positives. Second, a Self-correcting Preference Learning (SPL) strategy automatically constructs a reliable, disease-aware preference dataset from multiple noisy observations and leverages an LLM to synthesize refined reports without human supervision. ESC-RL promotes clinically faithful, disease-aligned reward and supports continual self-improvement during training. Extensive experiments on two public chest X-ray datasets demonstrate consistent gains and state-of-the-art performance.


Semi-/Weakly/Un-/Supervised | Uncertainty | Active Learning (12 papers)

【1】Parameter Importance is Not Static: Evolving Parameter Isolation for Supervised Fine-Tuning
Link: https://arxiv.org/abs/2604.14010

Authors: Zekai Lin, Chao Xue, Di Liang, Xingsheng Han, Peiyang Liu, Xianjie Wu, Lei Jiang, Yu Lu, Haibo Shi, Shuang Liang, Minlong Peng
Abstract: Supervised Fine-Tuning (SFT) of large language models often suffers from task interference and catastrophic forgetting. Recent approaches alleviate this issue by isolating task-critical parameters during training. However, these methods represent a static solution to a dynamic problem, assuming that parameter importance remains fixed once identified. In this work, we empirically demonstrate that parameter importance exhibits temporal drift over the course of training. To address this, we propose Evolving Parameter Isolation (EPI), a fine-tuning framework that adapts isolation decisions based on online estimates of parameter importance. Instead of freezing a fixed subset of parameters, EPI periodically updates isolation masks using gradient-based signals, enabling the model to protect emerging task-critical parameters while releasing outdated ones to recover plasticity. Experiments on diverse multi-task benchmarks demonstrate that EPI consistently reduces interference and forgetting compared to static isolation and standard fine-tuning, while improving overall generalization. Our analysis highlights the necessity of synchronizing isolation mechanisms with the evolving dynamics of learning diverse abilities.


【2】Physics-Informed Neural Networks for Methane Sorption: Cross-Gas Transfer Learning, Ensemble Collapse Under Physics Constraints, and Monte Carlo Dropout Uncertainty Quantification
Link: https://arxiv.org/abs/2604.13992

Authors: Mohammad Nooraiepour, Zezhang Song, Wei Li, Sarah Perez
Abstract: Accurate methane sorption prediction across heterogeneous coal ranks requires models that combine thermodynamic consistency, efficient knowledge transfer across data-scarce geological systems, and calibrated uncertainty estimates, capabilities that are rarely addressed together in existing frameworks. We present a physics-informed transfer learning framework that adapts a hydrogen sorption PINN to methane sorption prediction via Elastic Weight Consolidation, coal-specific feature engineering, and a three-phase curriculum that progressively balances transfer preservation with thermodynamic fine-tuning. Trained on 993 equilibrium measurements from 114 independent coal experiments spanning lignite to anthracite, the framework achieves $R^2 = 0.932$ on held-out coal samples, a 227% improvement over pressure-only classical isotherms, while hydrogen pre-training delivers 18.9% lower RMSE and 19.4% faster convergence than random initialization. Five Bayesian uncertainty quantification approaches reveal a systematic divergence in performance across physics-constrained architectures. Monte Carlo Dropout achieves well-calibrated uncertainty at minimal overhead, while deep ensembles, regardless of architectural diversity or initialization strategy, exhibit performance degradation because shared physics constraints narrow the admissible solution manifold. SHAP and ALE analyses confirm that learned representations remain physically interpretable and aligned with established coal sorption mechanisms: moisture-volatile interactions are most influential, pressure-temperature coupling captures thermodynamic co-dependence, and features exhibit non-monotonic effects. These results identify Monte Carlo Dropout as the best-performing UQ method in this physics-constrained transfer learning framework, and demonstrate cross-gas transfer learning as a data-efficient strategy for geological material modeling.


【3】Unsupervised domain transfer: Overcoming signal degradation in sleep monitoring by increasing scoring realism
Link: https://arxiv.org/abs/2604.13988

Authors: Mohammad Ahangarkiasari, Andreas Tind Damgaard, Casper Haurum, Kaare B. Mikkelsen
Abstract: Objective: Investigate whether hypnogram 'realism' can be used to guide an unsupervised method for handling arbitrary types of signal degradation in mobile sleep monitoring. Approach: Combining a pretrained, state-of-the-art 'u-sleep' model with a 'discriminator' network, we align features from a target domain with a feature space learned during pretraining. To test the approach, we distort the source domain with realistic signal degradations, to see how well the method can adapt to different types of degradation. We compare the performance of the resulting model with best-case models designed in a supervised manner for each type of transfer. Main Results: Depending on the type of distortion, we find that the unsupervised approach can increase Cohen's kappa with as little as 0.03 and up to 0.29, and that for all transfers, the method does not decrease performance. However, the approach never quite reaches the estimated theoretical optimal performance, and when tested on a real-life domain mismatch between two sleep studies, the benefit was insignificant. Significance: 'Discriminator-guided fine tuning' is an interesting approach to handling signal degradation for 'in the wild' sleep monitoring, with some promise. In particular, what it says about sleep data in general is interesting. However, more development will be necessary before using it 'in production'.
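
The gains in the abstract above are stated as changes in Cohen's kappa, which measures agreement between predicted and reference sleep stages after discounting agreement expected by chance. The sketch below shows how that statistic is computed. The function name `cohens_kappa` and the toy hypnogram labels are illustrative assumptions, not taken from the paper.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa between two equal-length label sequences."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of epochs where both scorers agree.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each scorer's marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both scorers used one identical label throughout
    return (p_o - p_e) / (1.0 - p_e)


# Hypothetical 8-epoch hypnogram: W = wake, N2 = light sleep, REM.
reference = ["W", "W", "N2", "N2", "N2", "REM", "REM", "W"]
predicted = ["W", "N2", "N2", "N2", "N2", "REM", "W", "W"]
kappa = cohens_kappa(reference, predicted)
```

On this toy input the two sequences agree on 6 of 8 epochs, but kappa is lower than 0.75 because some of that agreement would occur by chance.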


【4】Unsupervised Anomaly Detection in Process-Complex Industrial Time Series: A Real-World Case Study
Link: https://arxiv.org/abs/2604.13928

Authors: Sergej Krasnikov, Lukas Meitz, Samineh Bagheri, Michael Heider, Thorsten Schöler, Jörg Hähner
Abstract: Industrial time-series data from real production environments exhibits substantially higher complexity than commonly used benchmark datasets, primarily due to heterogeneous, multi-stage operational processes. As a result, anomaly detection methods validated under simplified conditions often fail to generalize to industrial settings. This work presents an empirical study on a unique dataset collected from fully operational industrial machinery, explicitly capturing pronounced process-induced variability. We evaluate which model classes are capable of capturing this complexity, starting with a classical Isolation Forest baseline and extending to multiple autoencoder architectures. Experimental results show that Isolation Forest is insufficient for modeling the non-periodic, multi-scale dynamics present in the data, whereas autoencoders consistently perform better. Among them, temporal convolutional autoencoders achieve the most robust performance, while recurrent and variational variants require more careful tuning.


【5】Evaluating Supervised Machine Learning Models: Principles, Pitfalls, and Metric Selection
Link: https://arxiv.org/abs/2604.13882

Authors: Xuanyan Liu, Ignacio Cabrera Martin, Marcello Trovati, Xiaolong Xu, Nikolaos Polatidis
Abstract: The evaluation of supervised machine learning models is a critical stage in the development of reliable predictive systems. Despite the widespread availability of machine learning libraries and automated workflows, model assessment is often reduced to the reporting of a small set of aggregate metrics, which can lead to misleading conclusions about real-world performance. This paper examines the principles, challenges, and practical considerations involved in evaluating supervised learning algorithms across classification and regression tasks. In particular, it discusses how evaluation outcomes are influenced by dataset characteristics, validation design, class imbalance, asymmetric error costs, and the choice of performance metrics. Through a series of controlled experimental scenarios using diverse benchmark datasets, the study highlights common pitfalls such as the accuracy paradox, data leakage, inappropriate metric selection, and overreliance on scalar summary measures. The paper also compares alternative validation strategies and emphasizes the importance of aligning model evaluation with the intended operational objective of the task. By presenting evaluation as a decision-oriented and context-dependent process, this work provides a structured foundation for selecting metrics and validation protocols that support statistically sound, robust, and trustworthy supervised machine learning systems.
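
The accuracy paradox named in the abstract above can be shown with a small worked example. The sketch below uses a hypothetical imbalanced test set and two made-up classifiers. None of the numbers come from the paper, and `binary_metrics` is an illustrative helper rather than a library function.

```python
def binary_metrics(y_true, y_pred):
    """Return accuracy, recall, and balanced accuracy for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    accuracy = (tp + tn) / len(pairs)
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return {
        "accuracy": accuracy,
        "recall": recall,
        "balanced_accuracy": (recall + specificity) / 2,
    }


# Hypothetical imbalanced test set: 95 negatives followed by 5 positives.
y_true = [0] * 95 + [1] * 5
# Classifier A always predicts the majority (negative) class.
always_negative = [0] * 100
# Classifier B raises 10 false alarms but finds 4 of the 5 positives.
detector = [1] * 10 + [0] * 85 + [1] * 4 + [0]

metrics_a = binary_metrics(y_true, always_negative)
metrics_b = binary_metrics(y_true, detector)
```

Classifier A scores higher on accuracy (0.95 against 0.89) while detecting no positives at all. Recall and balanced accuracy rank the two classifiers the other way round, which is why a single aggregate metric can mislead on imbalanced data.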


【6】A Bayesian Framework for Uncertainty-Aware Explanations in Power Quality Disturbance Classification
Link: https://arxiv.org/abs/2604.13658

Authors: Yinsong Chen, Samson S. Yu, Kashem M. Muttaqi
Abstract: Advanced deep learning methods have shown remarkable success in power quality disturbance (PQD) classification. To enhance model transparency, explainable AI (XAI) techniques have been developed to provide instance-specific interpretations of classifier decisions. However, conventional XAI methods yield deterministic explanations, overlooking uncertainty and limiting reliability in safety-critical applications. This paper proposes a Bayesian explanation framework that models explanation uncertainty by generating a relevance attribution distribution for each instance. This method allows experts to select explanations based on confidence percentiles, thereby tailoring interpretability according to specific disturbance types. Extensive experiments on synthetic and real-world power quality datasets demonstrate that the proposed framework improves the transparency and reliability of PQD classifiers through uncertainty-aware explanations.


【7】From Alignment to Prediction: A Study of Self-Supervised Learning and Predictive Representation Learning
Link: https://arxiv.org/abs/2604.13518

Authors: Mintu Dutta, Ritesh Vyas, Mohendra Roy
Comments: This article has been submitted to the 2026 International Conference on Applied Artificial Intelligence (2AI), Central University of Kashmir, India
Abstract: Self-supervised learning has emerged as a major technique for the task of learning from unlabeled data, where the current methods mostly revolve around alignment of representations and input reconstruction. Although such approaches have demonstrated excellent performance in practice, their scope remains mostly confined to learning from observed data and does not provide much help in terms of a learning structure that is predictive of the data distribution. In this paper, we study some of the recent developments in the realm of self-supervised learning. We define a new category called Predictive Representation Learning (PRL), which revolves around the latent prediction of unobserved components of data based on the observation. We propose a common taxonomy that classifies PRL along with alignment and reconstruction-based learning approaches. Furthermore, we argue that Joint-Embedding Predictive Architecture (JEPA) can be considered as an exemplary member of this new paradigm. We further discuss theoretical perspectives and open challenges, highlighting predictive representation learning as a promising direction for future self-supervised learning research. In this study, we implemented Bootstrap Your Own Latent (BYOL), Masked Autoencoders (MAE), and Image-JEPA (I-JEPA) for comparative analysis. The results indicate that MAE achieves perfect similarity of 1.00, but exhibits relatively weak robustness of 0.55. In contrast, BYOL and I-JEPA attain accuracies of 0.98 and 0.95, with robustness scores of 0.75 and 0.78, respectively.


【8】Quantifying and Understanding Uncertainty in Large Reasoning Models
Link: https://arxiv.org/abs/2604.13395

Authors: Yangyi Li, Chenxu Zhao, Mengdi Huai
Abstract: Large Reasoning Models (LRMs) have recently demonstrated significant improvements in complex reasoning. While quantifying generation uncertainty in LRMs is crucial, traditional methods are often insufficient because they do not provide finite-sample guarantees for reasoning-answer generation. Conformal prediction (CP) stands out as a distribution-free and model-agnostic methodology that constructs statistically rigorous uncertainty sets. However, existing CP methods ignore the logical connection between the reasoning trace and the final answer. Additionally, prior studies fail to interpret the origins of uncertainty coverage for LRMs as they typically overlook the specific training factors driving valid reasoning. Notably, it is challenging to disentangle reasoning quality from answer correctness when quantifying uncertainty, while simultaneously establishing theoretical guarantees for computationally efficient explanation methods. To address these challenges, we first propose a novel methodology that quantifies uncertainty in the reasoning-answer structure with statistical guarantees. Subsequently, we develop a unified example-to-step explanation framework using Shapley values that identifies a provably sufficient subset of training examples and their key reasoning steps to preserve the guarantees. We also provide theoretical analyses of our proposed methods. Extensive experiments on challenging reasoning datasets verify the effectiveness of the proposed methods.
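
For readers unfamiliar with the conformal prediction baseline that the abstract above builds on, here is a minimal sketch of standard split conformal prediction. It is not the paper's reasoning-answer method. The nonconformity scores, candidate answers, and function names are hypothetical.

```python
import math


def conformal_threshold(calibration_scores, alpha):
    """Finite-sample quantile of calibration nonconformity scores.

    Uses the ceil((n + 1) * (1 - alpha))-th smallest score, which gives
    marginal coverage of at least 1 - alpha under exchangeability.
    """
    n = len(calibration_scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf  # too few calibration points for this alpha
    return sorted(calibration_scores)[rank - 1]


def prediction_set(candidate_scores, threshold):
    """Keep every candidate whose nonconformity score is within the threshold."""
    return [c for c, s in candidate_scores.items() if s <= threshold]


# Hypothetical calibration scores (lower means the answer conforms better).
calibration = [0.05, 0.10, 0.20, 0.30, 0.45, 0.60, 0.90]
threshold = conformal_threshold(calibration, alpha=0.25)

# Hypothetical scores for three candidate answers to one test question.
candidates = {"A": 0.15, "B": 0.55, "C": 0.80}
answers = prediction_set(candidates, threshold)
```

With seven calibration scores and `alpha=0.25`, the rank is 6, so the threshold is the sixth smallest score and the prediction set keeps candidates A and B.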


【9】Beyond Uniform Sampling: Synergistic Active Learning and Input Denoising for Robust Neural Operators
Link: https://arxiv.org/abs/2604.13316

Authors: Samrendra Roy, Souvik Chakraborty, Syed Bahauddin Alam
Abstract: Neural operators have emerged as fast surrogate models for physics simulations, yet they remain acutely vulnerable to adversarial perturbations, a critical liability for safety-critical digital twin deployments. We present a synergistic defense that combines active learning-based data generation with an input denoising architecture. The active learning component adaptively probes model weaknesses using differential evolution attacks, then generates targeted training data at discovered vulnerability locations while an adaptive smooth-ratio safeguard preserves baseline accuracy. The input denoising component augments the operator architecture with a learnable bottleneck that filters adversarial noise while retaining physics-relevant features. On the viscous Burgers' equation benchmark, the combined approach achieves a 2.04% combined error (1.21% baseline + 0.83% robustness), representing an 87% reduction relative to standard training (15.42% combined) and outperforming both active learning alone (3.42%) and input denoising alone (5.22%). More broadly, our results, combined with cross-architecture vulnerability analysis from prior work, suggest that optimal training data for neural operators is architecture-dependent: because different architectures concentrate sensitivity in distinct input subspaces, uniform sampling cannot adequately cover the vulnerability landscape of all models. These findings have potential implications for the deployment of neural operators in safety-critical energy systems including nuclear reactor monitoring.
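
The headline figures in the abstract above can be checked with a short calculation. The symbols $E_{\text{combined}}$, $E_{\text{baseline}}$, and $E_{\text{robustness}}$ are notation introduced here for illustration. Only the percentages come from the abstract.

```latex
\begin{align*}
E_{\text{combined}} &= E_{\text{baseline}} + E_{\text{robustness}}
                     = 1.21\% + 0.83\% = 2.04\%, \\
\text{relative reduction} &= 1 - \frac{2.04}{15.42} \approx 0.868 \approx 87\%.
\end{align*}
```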


【10】Deep Spatially-Regularized and Superpixel-Based Diffusion Learning for Unsupervised Hyperspectral Image Clustering
Link: https://arxiv.org/abs/2604.13307

Authors: Vutichart Buranasiri, James M. Murphy
Comments: To appear in IEEE IGARSS 2026
Abstract: An unsupervised framework for hyperspectral image (HSI) clustering is proposed that incorporates masked deep representation learning with diffusion-based clustering, extending the Spatially-Regularized Superpixel-based Diffusion Learning ($S^2DL$) algorithm. Initially, a denoised latent representation of the original HSI is learned via an unsupervised masked autoencoder (UMAE) model with a Vision Transformer backbone. The UMAE takes spatial context and long-range spectral correlations into account and incorporates an efficient pretraining process via masking that utilizes only a small subset of training pixels. In the next stage, the entropy rate superpixel (ERS) algorithm is used to segment the image into superpixels, and a spatially regularized diffusion graph is constructed using Euclidean and diffusion distances within the compressed latent space instead of the HSI space. The proposed algorithm, Deep Spatially-Regularized Superpixel-based Diffusion Learning ($DS^2DL$), leverages more faithful diffusion distances and subsequent diffusion graph construction that better reflect the intrinsic geometry of the underlying data manifold, improving labeling accuracy and clustering quality. Experiments on Botswana and KSC datasets demonstrate the efficacy of $DS^2DL$.


【11】Rethinking Uncertainty in Segmentation: From Estimation to Decision
Link: https://arxiv.org/abs/2604.13262

Authors: Saket Maganti
Comments: 29 pages, 12 tables, 9 figures, GitHub repo: Saket-Maganti/medical-seg-uncertainity
Abstract: In medical image segmentation, uncertainty estimates are often reported but rarely used to guide decisions. We study the missing step: how uncertainty maps are converted into actionable policies such as accepting, flagging, or deferring predictions. We formulate segmentation as a two-stage pipeline, estimation followed by decision, and show that optimizing uncertainty alone fails to capture most of the achievable safety gains. Using retinal vessel segmentation benchmarks (DRIVE, STARE, CHASE_DB1), we evaluate two uncertainty sources (Monte Carlo Dropout and Test-Time Augmentation) combined with three deferral strategies, and introduce a simple confidence-aware deferral rule that prioritizes uncertain and low-confidence predictions. Our results show that the best method and policy combination removes up to 80 percent of segmentation errors at only 25 percent pixel deferral, while achieving strong cross-dataset robustness. We further show that calibration improvements do not translate to better decision quality, highlighting a disconnect between standard uncertainty metrics and real-world utility. These findings suggest that uncertainty should be evaluated based on the decisions it enables, rather than in isolation.
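
The "errors removed at a given deferral rate" measure quoted in the abstract above can be illustrated with a small sketch. It ranks pixels by uncertainty alone, which is a simplification and not the paper's confidence-aware rule. The function name and the toy values are assumptions made for illustration.

```python
def errors_removed_by_deferral(uncertainty, is_error, defer_fraction):
    """Share of all errors that fall inside the deferred (most uncertain) pixels."""
    n = len(uncertainty)
    k = int(n * defer_fraction)  # number of pixels handed off for review
    # Pixel indices ordered from most to least uncertain.
    order = sorted(range(n), key=lambda i: uncertainty[i], reverse=True)
    deferred = set(order[:k])
    total_errors = sum(is_error)
    removed = sum(is_error[i] for i in deferred)
    return removed / total_errors if total_errors else 0.0


# Hypothetical per-pixel uncertainty and error indicators (1 = wrong label).
uncertainty = [0.9, 0.1, 0.8, 0.2, 0.3, 0.05, 0.6, 0.15]
is_error = [1, 0, 1, 0, 0, 0, 0, 1]

share = errors_removed_by_deferral(uncertainty, is_error, defer_fraction=0.25)
```

Here deferring the two most uncertain pixels (25 percent of eight) captures two of the three errors. The third error sits on a low-uncertainty pixel, the kind of case a rule that also weighs prediction confidence is meant to address.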


【12】Exploring Urban Land Use Patterns by Pattern Mining and Unsupervised Learning
Link: https://arxiv.org/abs/2604.13050

Authors: Zdena Dobesova, Tai Dinh, Pavel Novak
Abstract: Urban areas are intricate systems shaped by socioeconomic, environmental, and infrastructural factors, with land use patterns serving as aspects of urban morphology. This paper proposes a novel methodology leveraging frequent item set mining and unsupervised learning techniques to identify similar cities based on co-occurring land use patterns. The Copernicus program's Urban Atlas data are used as source data. The methodology involves data preprocessing, pattern mining using the negFIN algorithm, postprocessing, and knowledge extraction and visualization. The preprocessing of spatial datasets results in a publicly available transaction dataset. The framework is scalable and the source code is made publicly available.


Transfer | Zero/Few/One-Shot | Adaptation (7 papers)

【1】Provably Efficient Offline-to-Online Value Adaptation with General Function Approximation
Link: https://arxiv.org/abs/2604.13966

Authors: Shangzhe Li, Weitong Zhang
Comments: 44 pages, 2 tables
Abstract: We study value adaptation in offline-to-online reinforcement learning under general function approximation. Starting from an imperfect offline pretrained $Q$-function, the learner aims to adapt it to the target environment using only a limited amount of online interaction. We first characterize the difficulty of this setting by establishing a minimax lower bound, showing that even when the pretrained $Q$-function is close to optimal $Q^\star$, online adaptation can be no more efficient than pure online RL on certain hard instances. On the positive side, under a novel structural condition on the offline-pretrained value functions, we propose O2O-LSVI, an adaptation algorithm with problem-dependent sample complexity that provably improves over pure online RL. Finally, we complement our theory with neural-network experiments that demonstrate the practical effectiveness of the proposed method.


【2】Drowsiness-Aware Adaptive Autonomous Braking System based on Deep Reinforcement Learning for Enhanced Road Safety
Link: https://arxiv.org/abs/2604.13878

Authors: Hossem Eddine Hafidi, Elisabetta De Giovanni, Teodoro Montanaro, Ilaria Sergi, Massimo De Vittorio, Luigi Patrono
Comments: This manuscript is 10 pages long and includes 12 figures and 3 tables. The figures provide detailed visualizations of the proposed system architecture, ECG-based drowsiness detection pipeline, Double-Dueling DQN framework, and experimental evaluation results in the CARLA simulation environment
Abstract: Driver drowsiness significantly impairs the ability to accurately judge safe braking distances and is estimated to contribute to 10%-20% of road accidents in Europe. Traditional driver-assistance systems lack adaptability to real-time physiological states such as drowsiness. This paper proposes a deep reinforcement learning-based autonomous braking system that integrates vehicle dynamics with driver physiological data. Drowsiness is detected from ECG signals using a Recurrent Neural Network (RNN), selected through an extensive benchmark analysis of 2-minute windows with varying segmentation and overlap configurations. The inferred drowsiness state is incorporated into the observable state space of a Double-Dueling Deep Q-Network (DQN) agent, where driver impairment is modeled as an action delay. The system is implemented and evaluated in a high-fidelity CARLA simulation environment. Experimental results show that the proposed agent achieves a 99.99% success rate in avoiding collisions under both drowsy and non-drowsy conditions. These findings demonstrate the effectiveness of physiology-aware control strategies for enhancing adaptive and intelligent driving safety systems.


【3】Adaptive Unknown Fault Detection and Few-Shot Continual Learning for Condition Monitoring in Ultrasonic Metal Welding
Link: https://arxiv.org/abs/2604.13465

Authors: Ahmadreza Eslaminia, Kuan-Chieh Lu, Klara Nahrstedt, Chenhui Shao
Comments: 20 pages, 10 figures
Abstract: Ultrasonic metal welding (UMW) is widely used in industrial applications but is sensitive to tool wear, surface contamination, and material variability, which can lead to unexpected process faults and unsatisfactory weld quality. Conventional monitoring systems typically rely on supervised learning models that assume all fault types are known in advance, limiting their ability to handle previously unseen process faults. To address this challenge, this paper proposes an adaptive condition monitoring approach that enables unknown fault detection and few-shot continual learning for UMW. Unknown faults are detected by analyzing hidden-layer representations of a multilayer perceptron and leveraging a statistical thresholding strategy. Once detected, the samples from unknown fault types are incorporated into the existing model through a continual learning procedure that selectively updates only the final layers of the network, which enables the model to recognize new fault types while preserving knowledge of existing classes. To accelerate the labeling process, cosine similarity transformation combined with a clustering algorithm groups similar unknown samples, thereby reducing manual labeling effort. Experimental results using a multi-sensor UMW dataset demonstrate that the proposed method achieves 96% accuracy in detecting unseen fault conditions while maintaining reliable classification of known classes. After incorporating a new fault type using only five labeled samples, the updated model achieves 98% testing classification accuracy. These results demonstrate that the proposed approach enables adaptive monitoring with minimal retraining cost and time. The proposed approach provides a scalable solution for continual learning in condition monitoring where new process conditions may constantly emerge over time and is extensible to other manufacturing processes.


【4】Bias-Corrected Adaptive Conformal Inference for Multi-Horizon Time Series Forecasting
Link: https://arxiv.org/abs/2604.13253

Authors: Ankit Lade, Sai Krishna J., Indar Kumar
Comments: 14 pages, 3 figures, 2 tables. Preprint
Abstract: Adaptive Conformal Inference (ACI) provides distribution-free prediction intervals with asymptotic coverage guarantees for time series under distribution shift. However, ACI only adapts the quantile threshold -- it cannot shift the interval center. When a base forecaster develops persistent bias after a regime change, ACI compensates by widening intervals symmetrically, producing unnecessarily conservative bands. We propose Bias-Corrected ACI (BC-ACI), which augments standard ACI with an online exponentially weighted moving average (EWM) estimate of forecast bias. BC-ACI corrects nonconformity scores before quantile computation and re-centers prediction intervals, addressing the root cause of miscalibration rather than its symptom. An adaptive dead-zone threshold suppresses corrections when estimated bias is indistinguishable from noise, ensuring no degradation on well-calibrated data. In controlled experiments across 688 runs spanning two base models, four synthetic regimes, and three real datasets, BC-ACI reduces Winkler interval scores by 13--17% under mean and compound distribution shifts (Wilcoxon p < 0.001) while maintaining equivalent performance on stationary data (ratio 1.002x). We provide finite-sample analysis showing that coverage guarantees degrade gracefully with bias estimation error.
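
The interplay described in the abstract above (an ACI update on the miscoverage level, an exponentially weighted bias estimate, and a dead zone) can be sketched roughly as below. This is a simplified illustration under stated assumptions, not the authors' algorithm. The fixed `dead_zone` constant, the parameter defaults, and the empirical-quantile rule are all choices made here, whereas the paper describes an adaptive dead-zone threshold.

```python
import math


def bc_aci_intervals(y, y_hat, alpha=0.1, gamma=0.05, lam=0.2, dead_zone=0.5):
    """Return one (lower, upper) interval per step of a forecast stream."""
    alpha_t = alpha   # running miscoverage level adapted by ACI
    bias = 0.0        # EWMA estimate of forecast bias (observation - forecast)
    scores = []       # past absolute residuals around the re-centered forecast
    intervals = []
    for obs, pred in zip(y, y_hat):
        # Dead zone: ignore bias estimates too small to tell apart from noise.
        shift = bias if abs(bias) > dead_zone else 0.0
        center = pred + shift
        if scores:
            level = min(max(1.0 - alpha_t, 0.0), 1.0)
            idx = max(math.ceil(level * len(scores)) - 1, 0)
            radius = sorted(scores)[idx]
        else:
            radius = math.inf  # no calibration residuals yet
        lower, upper = center - radius, center + radius
        intervals.append((lower, upper))
        # ACI update: lower alpha_t (wider intervals) after a miss,
        # raise it (narrower intervals) after a hit.
        miss = 0.0 if lower <= obs <= upper else 1.0
        alpha_t += gamma * (alpha - miss)
        # Scores are measured from the re-centered forecast.
        scores.append(abs(obs - center))
        bias = (1.0 - lam) * bias + lam * (obs - pred)
    return intervals


# Hypothetical stream: the forecaster is persistently 2 units too low.
observations = [5.0] * 30
forecasts = [3.0] * 30
intervals = bc_aci_intervals(observations, forecasts)
last_lower, last_upper = intervals[-1]
```

On this toy stream the interval center moves from the raw forecast toward the observed level as the bias estimate converges, so the intervals do not have to stay wide to keep covering the observations.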


【5】Adaptive Memory Crystallization for Autonomous AI Agent Learning in Dynamic Environments
标题:动态环境中自主人工智能代理学习的自适应记忆结晶
链接:https://arxiv.org/abs/2604.13085

作者:Rajat Khanda,Mohammad Baqar Sambuddha Chakrabarti,Satyasaran Changdar
摘要:在动态环境中运行的自主人工智能代理面临着一个持续的挑战:在不删除先前知识的情况下获取新功能。我们提出了自适应记忆结晶(AMC),这是一种在持续强化学习中进行渐进式经验巩固的记忆架构。   AMC在概念上受到突触标记和捕获(STC)理论的定性结构的启发,该理论认为记忆通过离散的稳定阶段过渡,但并没有声称对潜在的分子或突触机制进行建模。   AMC将记忆建模为一个连续的结晶过程,在这个过程中,经验根据多目标效用信号从可塑状态迁移到稳定状态。该框架引入了一个三相存储器层次(液体-玻璃-晶体)由伊藤随机微分方程(ε),其人口水平的行为是由一个明确的福克-普朗克方程承认一个封闭形式的Beta平稳分布捕获。   我们提供以下证明:(i)结晶状态的适定性和全局收敛到唯一的Beta平稳分布;(ii)单个结晶状态指数收敛到它们的不动点,具有显式速率和方差界;以及(iii)端到端Q学习误差界和匹配的存储容量下界,将结晶参数直接链接到代理性能。   对Meta-World MT50、Atari 20-game sequential learning和MuJoCo连续运动的实证评估一致显示,向前转移的改善(比最强基线增加了34 - 43%),灾难性遗忘的减少(67- 80%),以及记忆足迹的减少62%。
摘要:Autonomous AI agents operating in dynamic environments face a persistent challenge: acquiring new capabilities without erasing prior knowledge. We present Adaptive Memory Crystallization (AMC), a memory architecture for progressive experience consolidation in continual reinforcement learning. AMC is conceptually inspired by the qualitative structure of synaptic tagging and capture (STC) theory, the idea that memories transition through discrete stability phases, but makes no claim to model the underlying molecular or synaptic mechanisms. AMC models memory as a continuous crystallization process in which experiences migrate from plastic to stable states according to a multi-objective utility signal. The framework introduces a three-phase memory hierarchy (Liquid--Glass--Crystal) governed by an Itô stochastic differential equation (SDE) whose population-level behavior is captured by an explicit Fokker--Planck equation admitting a closed-form Beta stationary distribution. We provide proofs of: (i) well-posedness and global convergence of the crystallization SDE to a unique Beta stationary distribution; (ii) exponential convergence of individual crystallization states to their fixed points, with explicit rates and variance bounds; and (iii) end-to-end Q-learning error bounds and matching memory-capacity lower bounds that link SDE parameters directly to agent performance. Empirical evaluation on Meta-World MT50, Atari 20-game sequential learning, and MuJoCo continual locomotion consistently shows improvements in forward transfer (+34--43% over the strongest baseline), reductions in catastrophic forgetting (67--80%), and a 62% decrease in memory footprint.


【6】A short proof of near-linear convergence of adaptive gradient descent under fourth-order growth and convexity
标题:四阶增长和凸性下自适应梯度下降近线性收敛的简短证明
链接:https://arxiv.org/abs/2604.13393

作者:Damek Davis,Dmitriy Drusvyatskiy
摘要:Davis、Drusvyatskiy和Jiang证明了:对于在远离极小点时至少以四次方速度增长的光滑函数,采用自适应步长的梯度下降法在局部以近线性速率收敛。原论证十分复杂,依赖于监控算法相对于某个缓慢增长流形(称为峡谷,ravine)的表现。在这项工作中,我们给出了一个直接的、基于李雅普诺夫函数的论证,在目标函数还是凸函数且具有唯一极小点时绕过了这些困难。作为该论证的副产品,我们得到了一个比原算法更具自适应性的变体,其数值表现令人鼓舞。
摘要:Davis, Drusvyatskiy, and Jiang showed that gradient descent with an adaptive stepsize converges locally at a nearly-linear rate for smooth functions that grow at least quartically away from their minimizers. The argument is intricate, relying on monitoring the performance of the algorithm relative to a certain manifold of slow growth -- called the ravine. In this work, we provide a direct Lyapunov-based argument that bypasses these difficulties when the objective is in addition convex and has a unique minimizer. As a byproduct of the argument, we obtain a more adaptive variant than the original algorithm with encouraging numerical performance.
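To see why an adaptive stepsize matters under quartic growth, the toy sketch below compares a Polyak-type step with a fixed step on f(x) = x^4. The abstract does not spell out the stepsize rule, so the Polyak step here is an assumption chosen as one well-known adaptive rule, not necessarily the algorithm analysed in the paper; the function names are hypothetical.

```python
def polyak_gd(x0, steps, f, grad, f_star=0.0):
    """Gradient descent with the Polyak step (f(x) - f*) / |f'(x)|^2 in 1-D."""
    x, history = x0, [x0]
    for _ in range(steps):
        g = grad(x)
        if g == 0.0:
            break
        x = x - (f(x) - f_star) / (g * g) * g
        history.append(x)
    return history

def fixed_step_gd(x0, steps, grad, eta=0.1):
    """Plain gradient descent with a constant stepsize, for comparison."""
    x, history = x0, [x0]
    for _ in range(steps):
        x = x - eta * grad(x)
        history.append(x)
    return history

def quartic(x):
    return x ** 4

def quartic_grad(x):
    return 4 * x ** 3

polyak_path = polyak_gd(1.0, 30, quartic, quartic_grad)
fixed_path = fixed_step_gd(1.0, 30, quartic_grad)
```

For this objective the Polyak update simplifies to x - x/4, so each iterate is 0.75 times the previous one (a linear rate), whereas the fixed-step iterates shrink only sublinearly because the gradient 4x^3 vanishes faster than x near the minimizer.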


【7】Adaptive Learning via Off-Model Training and Importance Sampling for Fully Non-Markovian Optimal Stochastic Control. Complete version
标题:通过离模训练和重要性采样的完全非马尔可夫最优随机控制的自适应学习。完整版本
链接:https://arxiv.org/abs/2604.13147

作者:Dorival Leão,Alberto Ohashi,Simone Scotti,Adolfo M. D da Silva
备注:74 pages, 3 figures
摘要:本文研究受控状态完全非马尔可夫且依赖于未知模型参数的连续时间随机控制问题。这类问题自然出现在路径依赖的随机微分方程、粗糙波动率对冲以及分数布朗运动驱动的系统中。基于早期工作中提出的离散骨架方法,我们为相应的嵌入式后向动态规划方程提出了一种蒙特卡罗学习方法。我们的主要贡献有两方面。首先,我们为几类有代表性的非马尔可夫受控系统构造了显式的占优训练律和Radon-Nikodym权重。由此得到一种离模训练架构:在参考律下生成一个固定的合成数据集,而与目标模型相关的动态规划算子则通过重要性采样恢复。其次,我们利用这一结构设计了参数模型不确定性下的自适应更新机制,使得反复的重新校准可以通过对同一训练样本重新加权来完成,而无需重新生成新的轨迹。对于固定参数,我们建立了用深度神经网络逼近嵌入式动态规划方程的非渐近误差界。对于自适应学习,我们给出了将蒙特卡罗逼近误差与模型风险误差分离的定量估计。数值实验在结构化的线性二次示例中展示了离模训练机制和自适应重要性采样更新。
摘要:This paper studies continuous-time stochastic control problems whose controlled states are fully non-Markovian and depend on unknown model parameters. Such problems arise naturally in path-dependent stochastic differential equations, rough-volatility hedging, and systems driven by fractional Brownian motion. Building on the discrete skeleton approach developed in earlier work, we propose a Monte Carlo learning methodology for the associated embedded backward dynamic programming equation. Our main contribution is twofold. First, we construct explicit dominating training laws and Radon--Nikodym weights for several representative classes of non-Markovian controlled systems. This yields an off-model training architecture in which a fixed synthetic dataset is generated under a reference law, while the dynamic programming operators associated with a target model are recovered by importance sampling. Second, we use this structure to design an adaptive update mechanism under parametric model uncertainty, so that repeated recalibration can be performed by reweighting the same training sample rather than regenerating new trajectories. For fixed parameters, we establish non-asymptotic error bounds for the approximation of the embedded dynamic programming equation via deep neural networks. For adaptive learning, we derive quantitative estimates that separate Monte Carlo approximation error from model-risk error. Numerical experiments illustrate both the off-model training mechanism and the adaptive importance-sampling update in structured linear-quadratic examples.


强化学习(6篇)

【1】From $P(y|x)$ to $P(y)$: Investigating Reinforcement Learning in Pre-train Space
标题:从$P(y|x)$到$P(y)$:研究预训练空间中的强化学习
链接:https://arxiv.org/abs/2604.14142

作者:Yuqiao Tan,Minzheng Wang,Bo Liu,Zichen Liu,Tian Liang,Shizhu He,Jun Zhao,Kang Liu
备注:Preprint. Our code is available at https://github.com/Trae1ounG/Pretrain_Space_RLVR
摘要:虽然可验证奖励强化学习(RLVR)通过优化条件分布P(y|x)显著增强了LLM的推理能力,但其潜力从根本上受限于基础模型现有的输出分布。在预训练空间中优化边缘分布P(y),可以通过编码推理能力并保留广泛的探索能力来解决这一瓶颈。然而,传统的预训练依赖静态语料库进行被动学习,由此产生的分布偏移阻碍了有针对性的推理增强。在本文中,我们提出了PreRL(预训练空间RL),它将奖励驱动的在线更新直接应用于P(y)。我们从理论和实验两方面验证了log P(y)与log P(y|x)之间的强梯度对齐,从而确立PreRL可作为标准RL的可行替代。此外,我们发现了一个关键机制:PreRL中的负样本强化(NSR)是推理能力的一个极为有效的驱动因素。NSR-PreRL能快速剪除不正确的推理空间,同时激发内生的反思行为,使过渡性思考和反思性思考分别增加14.89倍和6.54倍。基于这些见解,我们提出了双空间RL(DSRL),这是一种策略转生(Policy Reincarnation)策略:先用NSR-PreRL初始化模型以扩展推理视野,再过渡到标准RL进行细粒度优化。大量实验表明,DSRL始终优于强基线,证明预训练空间剪枝能有效地将策略引向更精炼的正确推理子空间。
摘要:While reinforcement learning with verifiable rewards (RLVR) significantly enhances LLM reasoning by optimizing the conditional distribution P(y|x), its potential is fundamentally bounded by the base model's existing output distribution. Optimizing the marginal distribution P(y) in the Pre-train Space addresses this bottleneck by encoding reasoning ability and preserving broad exploration capacity. Yet, conventional pre-training relies on static corpora for passive learning, leading to a distribution shift that hinders targeted reasoning enhancement. In this paper, we introduce PreRL (Pre-train Space RL), which applies reward-driven online updates directly to P(y). We theoretically and empirically validate the strong gradient alignment between log P(y) and log P(y|x), establishing PreRL as a viable surrogate for standard RL. Furthermore, we uncover a critical mechanism: Negative Sample Reinforcement (NSR) within PreRL serves as an exceptionally effective driver for reasoning. NSR-PreRL rapidly prunes incorrect reasoning spaces while stimulating endogenous reflective behaviors, increasing transition and reflection thoughts by 14.89x and 6.54x, respectively. Leveraging these insights, we propose Dual Space RL (DSRL), a Policy Reincarnation strategy that initializes models with NSR-PreRL to expand the reasoning horizon before transitioning to standard RL for fine-grained optimization. Extensive experiments demonstrate that DSRL consistently outperforms strong baselines, proving that pre-train space pruning effectively steers the policy toward a refined correct reasoning subspace.


【2】Hierarchical Reinforcement Learning with Runtime Safety Shielding for Power Grid Operation
标题:面向电网运行的带运行时安全屏蔽的分层强化学习
链接:https://arxiv.org/abs/2604.14032

作者:Gitesh Malik
备注:10 pages, 2 figures
摘要:强化学习在拓扑控制和阻塞管理等电网运行任务的自动化方面已显示出前景。然而,严格的安全要求、在罕见扰动下的脆弱性以及对未见过的电网拓扑泛化能力差,仍然限制着它在真实电力系统中的部署。在安全关键型基础设施中,灾难性故障不可容忍,基于学习的控制器必须在严格的物理约束内运行。本文提出了一种面向电网运行的安全约束分层控制框架,将长时域决策与实时可行性保障显式解耦。高层强化学习策略提出抽象的控制动作,而确定性的运行时安全屏蔽(safety shield)通过快速前向仿真过滤不安全的动作。安全性作为运行时不变量被强制保证,与策略质量或训练分布无关。所提框架在Grid2Op基准套件上进行了评估,包括标称工况、强制线路停运压力测试,以及在ICAPS 2021大规模输电网上无需重新训练的zero-shot部署。结果表明,扁平的强化学习策略在压力下很脆弱,而仅依赖安全机制的方法则过于保守。相比之下,所提出的分层且具备安全意识的方法实现了更长的回合存活时间、更低的线路峰值负载,以及对未见电网的稳健zero-shot泛化。这些结果表明,电网控制中的安全性和泛化能力最好通过架构设计而非日益复杂的奖励工程来实现,为面向真实能源系统的可部署学习型控制器提供了一条实用路径。
摘要:Reinforcement learning has shown promise for automating power-grid operation tasks such as topology control and congestion management. However, its deployment in real-world power systems remains limited by strict safety requirements, brittleness under rare disturbances, and poor generalization to unseen grid topologies. In safety-critical infrastructure, catastrophic failures cannot be tolerated, and learning-based controllers must operate within hard physical constraints.   This paper proposes a safety-constrained hierarchical control framework for power-grid operation that explicitly decouples long-horizon decision-making from real-time feasibility enforcement. A high-level reinforcement learning policy proposes abstract control actions, while a deterministic runtime safety shield filters unsafe actions using fast forward simulation. Safety is enforced as a runtime invariant, independent of policy quality or training distribution.   The proposed framework is evaluated on the Grid2Op benchmark suite under nominal conditions, forced line-outage stress tests, and zero-shot deployment on the ICAPS 2021 large-scale transmission grid without retraining. Results show that flat reinforcement learning policies are brittle under stress, while safety-only methods are overly conservative. In contrast, the proposed hierarchical and safety-aware approach achieves longer episode survival, lower peak line loading, and robust zero-shot generalization to unseen grids.   These results indicate that safety and generalization in power-grid control are best achieved through architectural design rather than increasingly complex reward engineering, providing a practical path toward deployable learning-based controllers for real-world energy systems.
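The runtime shield described above (a policy proposes actions, a deterministic filter rejects those whose simulated outcome is unsafe) can be sketched generically. Everything below is a hypothetical stand-in: `simulate`, `is_safe`, the loading-based state, and the do-nothing fallback are assumptions for illustration and do not use the Grid2Op API.

```python
def shielded_action(state, ranked_actions, simulate, is_safe, fallback):
    """Return the first proposed action whose simulated outcome is safe."""
    for action in ranked_actions:
        if is_safe(simulate(state, action)):
            return action
    return fallback  # nothing passed the check: fall back to a default action

# Toy stand-ins (hypothetical): the state is a list of line loadings as a
# fraction of the thermal limit, and an action is a per-line loading change.
def simulate(loadings, action):
    return [load + delta for load, delta in zip(loadings, action)]

def is_safe(loadings, limit=1.0):
    return max(loadings) < limit

DO_NOTHING = (0.0, 0.0, 0.0)

state = [0.9, 0.5, 0.7]
proposals = [(0.2, -0.1, 0.0),     # preferred by the policy, but overloads line 0
             (-0.1, 0.05, 0.05)]   # second choice, keeps every line under the limit
chosen = shielded_action(state, proposals, simulate, is_safe, DO_NOTHING)
```

Because the check runs on every step regardless of how the proposals were produced, safety in this sketch does not depend on the quality of the policy, which mirrors the "runtime invariant" framing in the abstract.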


【3】Soft $Q(λ)$: A multi-step off-policy method for entropy regularised reinforcement learning using eligibility traces
标题:Soft $Q(λ)$:一种使用资格迹的熵正则化强化学习多步离策略方法
链接:https://arxiv.org/abs/2604.13780

作者:Pranav Mahajan,Ben Seymour
摘要:软Q学习已经成为熵正则化强化学习的一种通用的无模型方法,它优化的回报中附加了对偏离参考策略的惩罚。尽管取得了成功,软Q学习的多步扩展仍然相对缺乏研究,并且局限于玻尔兹曼策略下的同策略动作采样。在这篇简短的研究笔记中,我们首先给出软Q学习的形式化$n$步表述,然后通过引入一种新的软树备份(Soft Tree Backup)算子,将该框架扩展到完全离策略的情形。最后,我们将这些进展统一为Soft $Q(λ)$,这是一个简洁的在线、离策略资格迹框架,可以在任意行为策略下进行高效的信用分配。我们的推导给出了一种学习熵正则化价值函数的无模型方法,可用于未来的实证实验。
摘要:Soft Q-learning has emerged as a versatile model-free method for entropy-regularised reinforcement learning, optimising for returns augmented with a penalty on the divergence from a reference policy. Despite its success, the multi-step extensions of soft Q-learning remain relatively unexplored and limited to on-policy action sampling under the Boltzmann policy. In this brief research note, we first present a formal $n$-step formulation for soft Q-learning and then extend this framework to the fully off-policy case by introducing a novel Soft Tree Backup operator. Finally, we unify these developments into Soft $Q(λ)$, an elegant online, off-policy, eligibility trace framework that allows for efficient credit assignment under arbitrary behaviour policies. Our derivations propose a model-free method for learning entropy-regularised value functions that can be utilised in future empirical experiments.
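As background for the multi-step extensions mentioned above, the sketch below shows the standard one-step quantities of soft Q-learning with a reference policy: the soft value, the induced Boltzmann policy, and the one-step target. The paper's n-step, Soft Tree Backup, and Soft Q(λ) operators are not reproduced here; the function names and the temperature parameter `tau` are assumptions for illustration.

```python
import math

def soft_value(q_values, ref_probs, tau):
    """V = tau * log sum_a ref(a) * exp(Q(a) / tau), computed stably."""
    terms = [q / tau + math.log(p) for q, p in zip(q_values, ref_probs) if p > 0]
    m = max(terms)
    return tau * (m + math.log(sum(math.exp(t - m) for t in terms)))

def boltzmann_policy(q_values, ref_probs, tau):
    """pi(a) proportional to ref(a) * exp(Q(a) / tau); normalised by the soft value."""
    v = soft_value(q_values, ref_probs, tau)
    return [p * math.exp((q - v) / tau) for q, p in zip(q_values, ref_probs)]

def soft_q_target(reward, next_q_values, ref_probs, tau, gamma=0.99):
    """One-step soft Bellman target: r + gamma * V_soft(s')."""
    return reward + gamma * soft_value(next_q_values, ref_probs, tau)

uniform = [0.5, 0.5]
policy = boltzmann_policy([1.0, 2.0], uniform, tau=0.5)
```

As `tau` shrinks, the soft value approaches the ordinary maximum over actions, and the Boltzmann policy concentrates on the greedy action.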


【4】Jump-Start Reinforcement Learning with Vision-Language-Action Regularization
标题:通过视觉-语言-动作规则化启动强化学习
链接:https://arxiv.org/abs/2604.13733

作者:Angelo Moroncelli,Roberto Zanetti,Marco Maccarini,Loris Roveda
摘要:强化学习(RL)可以为机器人操作提供高频闭环控制,但由于探索效率低和信用分配差,将其扩展到奖励稀疏或不完美的长时域任务仍然很困难。视觉-语言-动作(VLA)模型利用大规模多模态预训练提供通用的任务级推理,但目前的局限性阻碍了它们在快速、精确操作中的直接使用。在本文中,我们提出了视觉-语言-动作快速启动(VLAJS),这是一种将稀疏的VLA指导与同策略RL相结合的方法,用以提高探索和学习效率。VLAJS将VLA视为高层动作建议的临时来源,这些建议引导早期探索并改善信用分配,同时保留RL高频、基于状态的控制。我们的方法在近端策略优化(PPO)中加入了方向性动作一致性正则化,在训练早期将RL智能体的动作与VLA指导柔性对齐,而不强制严格模仿、不需要演示数据,也不依赖持续的教师查询。VLA指导被稀疏地施加并随时间退火,使智能体能够在线适应并最终超越指导策略。我们在六个具有挑战性的仿真操作任务上评估VLAJS:举起、拾取并放置、销钉重定向、销钉插入、戳和推,并在真实的Franka Panda机器人上验证了其中一部分。VLAJS在样本效率方面始终优于PPO和蒸馏式基线,在若干任务中将所需的环境交互减少了50%以上。真实世界实验展示了zero-shot的仿真到现实迁移,以及在杂乱环境、物体变化和外部扰动下的稳健执行。
摘要:Reinforcement learning (RL) enables high-frequency, closed-loop control for robotic manipulation, but scaling to long-horizon tasks with sparse or imperfect rewards remains difficult due to inefficient exploration and poor credit assignment. Vision-Language-Action (VLA) models leverage large-scale multimodal pretraining to provide generalist, task-level reasoning, but current limitations hinder their direct use in fast and precise manipulation. In this paper, we propose Vision-Language-Action Jump-Starting (VLAJS), a method that bridges sparse VLA guidance with on-policy RL to improve exploration and learning efficiency. VLAJS treats VLAs as transient sources of high-level action suggestions that bias early exploration and improve credit assignment, while preserving the high-frequency, state-based control of RL. Our approach augments Proximal Policy Optimization (PPO) with a directional action-consistency regularization that softly aligns the RL agent's actions with VLA guidance during early training, without enforcing strict imitation, requiring demonstrations, or relying on continuous teacher queries. VLA guidance is applied sparsely and annealed over time, allowing the agent to adapt online and ultimately surpass the guiding policy. We evaluate VLAJS on six challenging manipulation tasks: lifting, pick-and-place, peg reorientation, peg insertion, poking, and pushing in simulation, and validate a subset on a real Franka Panda robot. VLAJS consistently outperforms PPO and distillation-style baselines in sample efficiency, reducing required environment interactions by over 50% in several tasks. Real-world experiments demonstrate zero-shot sim-to-real transfer and robust execution under clutter, object variation, and external perturbations.


【5】Pareto-Optimal Offline Reinforcement Learning via Smooth Tchebysheff Scalarization
标题:通过平滑切比雪夫标量化实现帕累托最优离线强化学习
链接:https://arxiv.org/abs/2604.13175

作者:Aadyot Bhatnagar,Peter Mørch Groth,Ali Madani
摘要:大型语言模型可以通过在小型标注数据集上进行离线强化学习(RL)来与人类偏好对齐。虽然单目标对齐已经得到充分研究,但许多现实应用需要同时优化多个相互冲突的奖励,例如在蛋白质工程中同时优化催化活性和特异性,或在聊天机器人中同时优化有益性和无害性。先前的工作主要依赖线性奖励标量化,但可以证明这种方法无法恢复帕累托前沿的非凸区域。在本文中,我们不直接对奖励进行标量化,而是将多目标RL本身视为一个优化问题,并通过平滑切比雪夫标量化对其进行标量化,这是一种克服了线性标量化缺点的新近技术。基于这一表述,我们推导出多目标偏好的平滑切比雪夫优化(STOMP),这是一种新的离线RL算法,它根据各奖励的观测分布对其进行标准化,从而以有原则的方式将直接偏好优化扩展到多目标设置。我们在三个实验室蛋白质适应度数据集上对齐三个自回归蛋白质语言模型,在一系列蛋白质工程任务上对STOMP进行了实证验证。与最先进的基线相比,按离线离策略评估和生成式评估衡量,STOMP在九种设置中的八种里取得了最高的超体积。因此,我们证明STOMP是一种强大而稳健的多目标对齐算法,可以切实改进后训练模型,适用于多属性蛋白质优化及其他场景。
摘要:Large language models can be aligned with human preferences through offline reinforcement learning (RL) on small labeled datasets. While single-objective alignment is well-studied, many real-world applications demand the simultaneous optimization of multiple conflicting rewards, e.g. optimizing both catalytic activity and specificity in protein engineering, or helpfulness and harmlessness for chatbots. Prior work has largely relied on linear reward scalarization, but this approach provably fails to recover non-convex regions of the Pareto front. In this paper, instead of scalarizing the rewards directly, we frame multi-objective RL itself as an optimization problem to be scalarized via smooth Tchebysheff scalarization, a recent technique that overcomes the shortcomings of linear scalarization. We use this formulation to derive Smooth Tchebysheff Optimization of Multi-Objective Preferences (STOMP), a novel offline RL algorithm that extends direct preference optimization to the multi-objective setting in a principled way by standardizing the individual rewards based on their observed distributions. We empirically validate STOMP on a range of protein engineering tasks by aligning three autoregressive protein language models on three laboratory datasets of protein fitness. Compared to state-of-the-art baselines, STOMP achieves the highest hypervolumes in eight of nine settings according to both offline off-policy and generative evaluations. We thus demonstrate that STOMP is a powerful, robust multi-objective alignment algorithm that can meaningfully improve post-trained models for multi-attribute protein optimization and beyond.
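For readers unfamiliar with the scalarization named in the title, the sketch below shows the commonly used log-sum-exp form of smooth Tchebysheff scalarization for losses to be minimised, next to the hard (max-based) version it smooths. This is only the generic scalarization, written under the assumption that the log-sum-exp form is the one intended; it is not the STOMP algorithm, and the function and parameter names (`ideal`, `mu`) are illustrative.

```python
import math

def tchebysheff(losses, weights, ideal):
    """Hard Tchebysheff scalarization: max_i w_i * (f_i - z_i)."""
    return max(w * (f - z) for f, w, z in zip(losses, weights, ideal))

def smooth_tchebysheff(losses, weights, ideal, mu):
    """Smooth version: mu * log sum_i exp(w_i * (f_i - z_i) / mu)."""
    terms = [w * (f - z) / mu for f, w, z in zip(losses, weights, ideal)]
    m = max(terms)  # subtract the max for numerical stability
    return mu * (m + math.log(sum(math.exp(t - m) for t in terms)))

losses, weights, ideal = [0.8, 0.3], [0.5, 0.5], [0.0, 0.0]
hard = tchebysheff(losses, weights, ideal)
smooth = smooth_tchebysheff(losses, weights, ideal, mu=0.05)
```

The smooth value always lies between the hard value and the hard value plus `mu * log(m)` for `m` objectives, so a small `mu` gives a differentiable surrogate that stays close to the worst-case weighted objective.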


【6】A Comparative Study of Dynamic Programming and Reinforcement Learning in Finite Horizon Dynamic Pricing
标题:有限期限动态定价中动态规划和强化学习的比较研究
链接:https://arxiv.org/abs/2604.14059

作者:Lev Razumovskiy,Nikolay Karenin
摘要:本文系统比较了有限期限动态定价问题中的拟合动态规划(DP,其中需求由数据估计)与强化学习(RL)方法。我们分析了它们在结构复杂度递增的环境中的表现,从单一类型(typology)基准到具有异质需求和跨期收入约束的多类型设置。与将DP限制在低维设置的简化比较不同,我们在具有多种产品类型和约束的更丰富的多维环境中应用动态规划。我们评估了收入表现、稳定性、约束满足行为和计算扩展性,突出了显式的基于期望的优化与基于轨迹的学习之间的权衡。
摘要:This paper provides a systematic comparison between Fitted Dynamic Programming (DP), where demand is estimated from data, and Reinforcement Learning (RL) methods in finite-horizon dynamic pricing problems. We analyze their performance across environments of increasing structural complexity, ranging from a single typology benchmark to multi-typology settings with heterogeneous demand and inter-temporal revenue constraints. Unlike simplified comparisons that restrict DP to low-dimensional settings, we apply dynamic programming in richer, multi-dimensional environments with multiple product types and constraints. We evaluate revenue performance, stability, constraint satisfaction behavior, and computational scaling, highlighting the trade-offs between explicit expectation-based optimization and trajectory-based learning.


元学习(2篇)

【1】Diffusion Sequence Models for Generative In-Context Meta-Learning of Robot Dynamics
标题:机器人动力学生成性上下文元学习的扩散序列模型
链接:https://arxiv.org/abs/2604.13366

作者:Angelo Moroncelli,Matteo Rufolo,Gunes Cagin Aydin,Asad Ali Shahid,Loris Roveda
备注:Angelo Moroncelli, Matteo Rufolo and Gunes Cagin Aydin contributed equally to this work
摘要:机器人动力学的精确建模对基于模型的控制至关重要,但在分布偏移和实时约束下仍然具有挑战性。在这项工作中,我们将系统辨识表述为上下文元学习问题,并比较了用于前向动力学预测的确定性序列模型和生成式序列模型。我们采用基于Transformer的元模型作为强确定性基线,并为这一设置引入两种互补的基于扩散的方法:(i)补全式扩散(inpainting diffusion,Diffuser),它学习输入-观测的联合分布;(ii)条件扩散模型(CNN和Transformer),它以控制输入为条件生成未来的观测。通过大规模随机化仿真,我们分析了分布内和分布外情形下的性能,以及与控制相关的计算权衡。我们表明,扩散模型显著提高了分布偏移下的稳健性,在我们的实验中补全式扩散取得了最佳性能。最后,我们证明热启动采样使扩散模型能够在实时约束内运行,从而可用于控制应用。这些结果表明,生成式元模型是机器人稳健系统辨识的一个有前途的方向。
摘要:Accurate modeling of robot dynamics is essential for model-based control, yet remains challenging under distributional shifts and real-time constraints. In this work, we formulate system identification as an in-context meta-learning problem and compare deterministic and generative sequence models for forward dynamics prediction. We take a Transformer-based meta-model as a strong deterministic baseline and introduce two complementary diffusion-based approaches to this setting: (i) inpainting diffusion (Diffuser), which learns the joint input-observation distribution, and (ii) conditioned diffusion models (CNN and Transformer), which generate future observations conditioned on control inputs. Through large-scale randomized simulations, we analyze performance across in-distribution and out-of-distribution regimes, as well as computational trade-offs relevant for control. We show that diffusion models significantly improve robustness under distribution shift, with inpainting diffusion achieving the best performance in our experiments. Finally, we demonstrate that warm-started sampling enables diffusion models to operate within real-time constraints, making them viable for control applications. These results highlight generative meta-models as a promising direction for robust system identification in robotics.


【2】Binomial Gradient-Based Meta-Learning for Enhanced Meta-Gradient Estimation
标题:基于梯度的二项式元学习:增强元梯度估计
链接:https://arxiv.org/abs/2604.13263

作者:Yilang Zhang,Abraham Jaeger Mountain,Bingcong Li,Georgios B. Giannakis
备注:Accepted as poster at ICLR 2026. Code available at https://github.com/AbrahamJJM/binomgbml
摘要:元学习提供了一个有原则的框架,利用来自相关任务的任务不变先验,即使数据记录有限,也可以在下游任务上微调任务特定模型。基于梯度的元学习(GBML)依赖梯度下降(GD)使先验适应新任务。这些方法虽然有效,但计算开销很高,且随GD步数线性增长。为了提高效率和可扩展性,现有方法通过截断反向传播来近似关于先验参数的梯度(元梯度),但近似误差很大。为了实现精确近似,本文提出了二项式GBML(BinomGBML),它依赖截断的二项式展开进行元梯度估计。这种新的展开借助高效的并行计算,为元梯度估计引入了更多信息。以模型无关元学习(MAML)作为贯穿全文的应用范例,所得到的BinomMAML具有可证明的误差界,该误差界不仅优于现有方法,而且在温和条件下以超指数速度衰减。数值实验印证了理论分析,并表明该方法以略微增加的计算开销换来了性能提升。
摘要:Meta-learning offers a principled framework leveraging task-invariant priors from related tasks, with which task-specific models can be fine-tuned on downstream tasks, even with limited data records. Gradient-based meta-learning (GBML) relies on gradient descent (GD) to adapt the prior to a new task. Albeit effective, these methods incur high computational overhead that scales linearly with the number of GD steps. To enhance efficiency and scalability, existing methods approximate the gradient of prior parameters (meta-gradient) via truncated backpropagation, yet suffer large approximation errors. Targeting accurate approximation, this work puts forth binomial GBML (BinomGBML), which relies on a truncated binomial expansion for meta-gradient estimation. This novel expansion endows more information in the meta-gradient estimation via efficient parallel computation. As a running paradigm applied to model-agnostic meta-learning (MAML), the resultant BinomMAML provably enjoys error bounds that not only improve upon existing approaches, but also decay super-exponentially under mild conditions. Numerical tests corroborate the theoretical analysis and showcase boosted performance with slightly increased computational overhead.


符号|符号学习(1篇)

【1】Hardware-Efficient Neuro-Symbolic Networks with the Exp-Minus-Log Operator
标题:具有Exp-Minus-Log操作符的硬件高效神经符号网络
链接:https://arxiv.org/abs/2604.13871

作者:Eymen Ipek
摘要:深度神经网络(DNN)在回归和分类任务上提供了最先进的准确性,但两个结构性缺陷一直阻碍着它们在安全关键、资源受限的环境中的部署:(i)学习函数的不透明性,这妨碍了正式验证,以及(ii)依赖于异构的库绑定激活函数,这会增加边缘硬件上的延迟和硅面积。最近引入的Exp-Minus-Log(EML)Sheffer算子eml(x,y)= exp(x)- ln(y)被Odrzywolek(2026)证明是足够的-与常数1一起-将每个标准初等函数表示为相同节点的二叉树。我们建议将EML原语嵌入到传统的DNN架构中,产生一个混合DNN-EML模型,其中主干学习分布式表示,头部是一个深度有界的,权重稀疏的EML树,其快照权重折叠为封闭形式的符号子表达式。我们推导出前向方程,证明计算成本界限,分析相对于多层感知器(MLP)和物理信息神经网络(PINN)的推理和训练加速,并量化FPGA/模拟部署的权衡。我们认为,DNN-EML配对填补了文献空白:以前的神经符号和方程学习方法(EQL,KAN,AI-Feynman)与异构的原始集一起工作,并且不利用单个硬件可实现的Sheffer元素。平衡评估表明,EML不太可能加速训练,并且在商品CPU/GPU上也不太可能加速推理;然而,在自定义EML单元(FPGA逻辑块或模拟电路)上,渐进延迟优势可以达到一个数量级,同时获得可解释性和形式验证易处理性。
摘要:Deep neural networks (DNNs) deliver state-of-the-art accuracy on regression and classification tasks, yet two structural deficits persistently obstruct their deployment in safety-critical, resource-constrained settings: (i) opacity of the learned function, which precludes formal verification, and (ii) reliance on heterogeneous, library-bound activation functions that inflate latency and silicon area on edge hardware. The recently introduced Exp-Minus-Log (EML) Sheffer operator, eml(x, y) = exp(x) - ln(y), was shown by Odrzywolek (2026) to be sufficient - together with the constant 1 - to express every standard elementary function as a binary tree of identical nodes. We propose to embed EML primitives inside conventional DNN architectures, yielding a hybrid DNN-EML model in which the trunk learns distributed representations and the head is a depth-bounded, weight-sparse EML tree whose snapped weights collapse to closed-form symbolic sub-expressions. We derive the forward equations, prove computational-cost bounds, analyse inference and training acceleration relative to multilayer perceptrons (MLPs) and physics-informed neural networks (PINNs), and quantify the trade-offs for FPGA/analog deployment. We argue that the DNN-EML pairing closes a literature gap: prior neuro-symbolic and equation-learner approaches (EQL, KAN, AI-Feynman) work with heterogeneous primitive sets and do not exploit a single hardware-realisable Sheffer element. A balanced assessment shows that EML is unlikely to accelerate training, and on commodity CPU/GPU it is also unlikely to accelerate inference; however, on a custom EML cell (FPGA logic block or analog circuit) the asymptotic latency advantage can reach an order of magnitude with simultaneous gain in interpretability and formal-verification tractability.
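The operator in this abstract is simple enough to try directly. The sketch below defines eml(x, y) = exp(x) - ln(y) as stated and builds exp and ln from it using only the constant 1. The two compositions are derived here by direct substitution and are not quoted from the paper, so they are illustrative rather than the authors' canonical trees; the helper names are hypothetical.

```python
import math

def eml(x, y):
    """Exp-Minus-Log operator from the abstract: exp(x) - ln(y), for y > 0."""
    return math.exp(x) - math.log(y)

def exp_via_eml(x):
    # eml(x, 1) = exp(x) - ln(1) = exp(x)
    return eml(x, 1.0)

def ln_via_eml(x):
    # eml(1, x)       = e - ln(x)
    # eml(e - ln x, 1) = exp(e - ln x) = e**e / x
    # eml(1, e**e / x) = e - (e - ln x) = ln(x)
    return eml(1.0, eml(eml(1.0, x), 1.0))

e_constant = eml(1.0, 1.0)  # exp(1) - ln(1) = e
```

Each derived function is a small binary tree whose internal nodes are all the same `eml` cell and whose leaves are either the input or the constant 1, which is the structural property the abstract highlights for hardware realisation.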


医学相关(3篇)

【1】Quantum Machine Learning for Colorectal Cancer Data: Anastomotic Leak Classification and Risk Factors
标题:结直肠癌数据的量子机器学习:吻合口渗漏分类和风险因素
链接:https://arxiv.org/abs/2604.13951

作者:Vojtěch Novák,Ivan Zelinka,Lenka Přibylová,Lubomír Martínek,Vladimír Benčurík,Martin Beseda
摘要:本研究评估了结直肠相关风险因素,并比较了经典模型与量子神经网络(QNN)在吻合口漏预测上的表现。我们分析了吻合口漏发生率为14%的临床数据,在模拟噪声下测试了ZZFeatureMap编码与RealAmplitudes和EfficientSU2 ansatz的组合。经$F_β$优化的量子配置取得的灵敏度(83.3%)显著高于经典基线(66.7%)。这表明量子特征空间能更好地优先识别少数类,这对低患病率的临床风险预测至关重要。我们的工作考察了多种优化器在噪声条件下的表现,指出了关键的权衡以及面向硬件部署的未来方向。
摘要:This study evaluates colorectal risk factors and compares classical models against Quantum Neural Networks (QNNs) for anastomotic leak prediction. Analyzing clinical data with 14% leak prevalence, we tested ZZFeatureMap encodings with RealAmplitudes and EfficientSU2 ansatze under simulated noise. $F_β$-optimized quantum configurations yielded significantly higher sensitivity (83.3%) than classical baselines (66.7%). This demonstrates that quantum feature spaces better prioritize minority class identification, which is critical for low-prevalence clinical risk prediction. Our work explores various optimizers under noisy conditions, highlighting key trade-offs and future directions for hardware deployment.
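Since the comparison above is framed in terms of sensitivity and an F-beta objective, a short reminder of how the two are computed from confusion-matrix counts may help. The counts used below are made-up illustrative numbers, not data from the study, and the function names are hypothetical.

```python
def sensitivity(tp, fn):
    """Recall on the positive (leak) class: TP / (TP + FN)."""
    return tp / (tp + fn)

def f_beta(tp, fp, fn, beta=2.0):
    """F-beta score; beta > 1 weights recall more heavily than precision."""
    b2 = beta * beta
    return (1 + b2) * tp / ((1 + b2) * tp + b2 * fn + fp)

# Illustrative counts only (assumption): 6 true leaks, 5 of them detected,
# at the cost of 5 false alarms.
example_sensitivity = sensitivity(tp=5, fn=1)
example_f2 = f_beta(tp=5, fp=5, fn=1, beta=2.0)
```

With a rare positive class, choosing beta above 1 makes a missed leak cost more than a false alarm, which is why optimising F-beta tends to push a classifier toward higher sensitivity.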


【2】Design Space Exploration of Hybrid Quantum Neural Networks for Chronic Kidney Disease
标题:慢性肾病混合量子神经网络的设计空间探索
链接:https://arxiv.org/abs/2604.13608

作者:Muhammad Kashif,Hanzalah Mohamed Siraj,Nouhaila Innan,Alberto Marchisio,Muhammad Shafique
摘要:混合量子神经网络(HQNN)最近成为近期量子机器学习的一个有前途的范式。然而,它们的实际性能在很大程度上取决于设计选择,如经典到量子的数据编码、量子电路架构、测量策略和测量次数(shots)。在本文中,我们对用于慢性肾脏病(CKD)诊断的HQNN进行了全面的设计空间探索。使用经过精心整理和预处理的临床数据集,我们对625种不同的HQNN模型进行了基准测试,这些模型由五种编码方案、五种纠缠架构、五种测量策略和五种不同的shots设置组合而成。为了确保评估公平而稳健,所有模型都使用10折分层交叉验证进行训练,并使用一组全面的指标在测试集上进行评估,包括准确率、曲线下面积(AUC)、F1分数和综合性能得分。我们的结果揭示了编码选择与电路架构之间强烈且非平凡的相互作用,表明高性能并不一定需要大量参数或复杂电路。特别是,我们发现紧凑的架构结合适当的编码(例如采用Ring纠缠的IQP)可以在准确率、稳健性和效率之间实现最佳权衡。除了绝对性能分析,我们还就不同设计维度如何影响HQNN的学习行为提供了可操作的见解。
摘要:Hybrid Quantum Neural Networks (HQNNs) have recently emerged as a promising paradigm for near-term quantum machine learning. However, their practical performance strongly depends on design choices such as classical-to-quantum data encoding, quantum circuit architecture, measurement strategy and shots. In this paper, we present a comprehensive design space exploration of HQNNs for Chronic Kidney Disease (CKD) diagnosis. Using a carefully curated and preprocessed clinical dataset, we benchmark 625 different HQNN models obtained by combining five encoding schemes, five entanglement architectures, five measurement strategies, and five different shot settings. To ensure fair and robust evaluation, all models are trained using 10-fold stratified cross-validation and assessed on a test set using a comprehensive set of metrics, including accuracy, area under the curve (AUC), F1-score, and a composite performance score. Our results reveal strong and non-trivial interactions between encoding choices and circuit architectures, showing that high performance does not necessarily require large parameter counts or complex circuits. In particular, we find that compact architectures combined with appropriate encodings (e.g., IQP with Ring entanglement) can achieve the best trade-off between accuracy, robustness, and efficiency. Beyond absolute performance analysis, we also provide actionable insights into how different design dimensions influence learning behavior in HQNNs.
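The 625 configurations follow from a full factorial grid over four design dimensions with five options each (5 x 5 x 5 x 5). A minimal enumeration sketch is given below; only "IQP" and "Ring" appear in the abstract, so every other option name and all shot counts are placeholders invented for illustration.

```python
from itertools import product

# Only "IQP" and "Ring" are named in the abstract; the rest are placeholders.
encodings = ["IQP", "encoding_2", "encoding_3", "encoding_4", "encoding_5"]
entanglements = ["Ring", "entangle_2", "entangle_3", "entangle_4", "entangle_5"]
measurements = ["measure_1", "measure_2", "measure_3", "measure_4", "measure_5"]
shot_settings = [128, 256, 512, 1024, 2048]  # hypothetical shot counts

design_space = [
    {"encoding": enc, "entanglement": ent, "measurement": meas, "shots": shots}
    for enc, ent, meas, shots in product(
        encodings, entanglements, measurements, shot_settings
    )
]
```

Each dictionary in `design_space` would correspond to one model to train and evaluate under cross-validation; the exhaustive grid is what makes interaction effects between dimensions (for example encoding versus entanglement) observable.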


【3】MyoVision: A Mobile Research Tool and NEATBoost-Attention Ensemble Framework for Real Time Chicken Breast Myopathy Detection
标题:MyoVision:用于实时鸡胸肉肌病检测的移动研究工具和NEATBoost-Attention集成框架
链接:https://arxiv.org/abs/2604.13456

作者:Chaitanya Pallerla,Siavash Mahmoudi,Dongyi Wang
备注:Accepted at CVPR 2026 MetaFoods Workshop. 11 pages, 5 figures
摘要:木质化鸡胸(Woody Breast,WB)和意面肉(Spaghetti Meat,SM)肌病显著影响禽肉品质,但目前的检测方法要么依赖主观的人工评估,要么依赖昂贵的实验室级成像系统。我们研究使用消费级智能手机进行低成本、无损的多类肌病分类。MyoVision是一个移动透射照明成像框架,它采集14位RAW图像,并提取反映内部组织异常的结构纹理描述符。为了对三个类别(正常、木质化鸡胸、意面肉)进行分类,我们提出了NEATBoost-Attention Ensemble模型,它是LightGBM与基于注意力的MLP模型经神经进化优化的加权融合。超参数由增强拓扑的神经进化(NEAT)自动搜索得到,免去了手动调参,并为小型表格数据集带来了架构多样性。在从商业加工厂收集的336块鸡胸肉片数据集上,我们的方法达到了82.4%的测试准确率(F1 = 0.83),优于传统机器学习和深度学习基线,并与成本高出数个数量级的高光谱成像系统所报告的性能相当。除了分类性能之外,MyoVision还为多模态肉品质量研究建立了可复现的移动RGB-D采集流程,证明消费级成像可以支持可扩展的内部组织评估。
摘要:Woody Breast (WB) and Spaghetti Meat (SM) myopathies significantly impact poultry meat quality, yet current detection methods rely either on subjective manual evaluation or costly laboratory-grade imaging systems. We address the problem of low-cost, non-destructive multi-class myopathy classification using consumer smartphones. MyoVision is introduced as a mobile transillumination imaging framework in which 14-bit RAW images are captured and structural texture descriptors indicative of internal tissue abnormalities are extracted. To classify three categories (Normal, Woody Breast, Spaghetti Meat), we propose a NEATBoost-Attention Ensemble model, which is a neuroevolution-optimized weighted fusion of LightGBM and attention-based MLP models. Hyperparameters are automatically discovered using NeuroEvolution of Augmenting Topologies (NEAT), eliminating manual tuning and enabling architecture diversity for small tabular datasets. On a dataset of 336 fillets collected from a commercial processing facility, our method achieves 82.4% test accuracy (F1 = 0.83), outperforming conventional machine learning and deep learning baselines and matching performance reported by hyperspectral imaging systems costing orders of magnitude more. Beyond classification performance, MyoVision establishes a reproducible mobile RGB-D acquisition pipeline for multimodal meat quality research, demonstrating that consumer-grade imaging can support scalable internal tissue assessment.


蒸馏|知识提取(3篇)

【1】TIP: Token Importance in On-Policy Distillation
标题:TIP:同策略蒸馏中的令牌重要性
链接:https://arxiv.org/abs/2604.14084

作者:Yuanda Xu,Hejian Sang,Zhengze Zhou,Ran He,Zhipeng Wang,Alborz Geramifard
摘要:同策略知识蒸馏(OPD)在教师的令牌级监督下,用学生自己生成的rollout来训练学生。并非所有令牌位置都同等重要,但现有关于令牌重要性的观点并不完整。我们提出一个直接的问题:在OPD中,哪些令牌携带最有用的学习信号?我们的答案是,信息量大的令牌来自两个区域:学生熵高的位置,以及学生熵低但师生分歧大的位置,后者是学生过度自信却出错的地方。   从经验上看,学生熵是一个强有力的一阶代理指标:用基于熵的采样保留$50\%$的令牌,效果可以达到或超过全令牌训练,同时将峰值内存最多降低$47\%$。但仅靠熵会遗漏第二个重要区域。当我们单独取出低熵、高分歧的令牌时,仅用不到$10\%$的令牌进行训练就几乎能追平全令牌基线,这表明过度自信的令牌携带着密集的纠正信号,尽管它们对仅基于熵的规则几乎不可见。   我们用TIP(Token Importance in on-Policy distillation)来组织这些发现,这是一个以学生熵和师生分歧为两轴的分类体系,并从理论上解释了为什么熵有用但在结构上不完整。这一视角引出了结合不确定性与分歧的类型感知令牌选择规则。我们在MATH-500和AIME 2024/2025上用涵盖Qwen3、Llama和Qwen2.5的三组师生对,以及在面向长时域智能体规划的DeepPlanning基准上验证了这一图景,其中仅在$
摘要:On-policy knowledge distillation (OPD) trains a student on its own rollouts under token-level supervision from a teacher. Not all token positions matter equally, but existing views of token importance are incomplete. We ask a direct question: which tokens carry the most useful learning signal in OPD? Our answer is that informative tokens come from two regions: positions with high student entropy, and positions with low student entropy plus high teacher--student divergence, where the student is overconfident and wrong.   Empirically, student entropy is a strong first-order proxy: retaining $50\%$ of tokens with entropy-based sampling matches or exceeds all-token training while reducing peak memory by up to $47\%$. But entropy alone misses a second important region. When we isolate low-entropy, high-divergence tokens, training on fewer than $10\%$ of all tokens nearly matches full-token baselines, showing that overconfident tokens carry dense corrective signal despite being nearly invisible to entropy-only rules.   We organize these findings with TIP (Token Importance in on-Policy distillation), a two-axis taxonomy over student entropy and teacher--student divergence, and give a theoretical explanation for why entropy is useful yet structurally incomplete. This view motivates type-aware token selection rules that combine uncertainty and disagreement. We validate this picture across three teacher--student pairs spanning Qwen3, Llama, and Qwen2.5 on MATH-500 and AIME 2024/2025, and on the DeepPlanning benchmark for long-horizon agentic planning, where Q3-only training on $


【2】$π$-Play: Multi-Agent Self-Play via Privileged Self-Distillation without External Data
Link: https://arxiv.org/abs/2604.14054

Authors: Yaocheng Zhang, Yuanheng Zhu, Wenyue Chong, Songjun Tu, Qichao Zhang, Jiajun Chai, Xiaohan Wang, Wei Lin, Guojun Yin, Dongbin Zhao
Comments: 26 pages, 12 figures
Abstract: Deep search agents have emerged as a promising paradigm for addressing complex information-seeking tasks, but their training remains challenging due to sparse rewards, weak credit assignment, and limited labeled data. Self-play offers a scalable route to reduce data dependence, but conventional self-play optimizes students only through sparse outcome rewards, leading to low learning efficiency. In this work, we observe that self-play naturally produces a question construction path (QCP) during task generation, an intermediate artifact that captures the reverse solution process. This reveals a new source of privileged information for self-distillation: self-play can itself provide high-quality privileged context for the teacher model in a low-cost and scalable manner, without relying on human feedback or curated privileged information. Leveraging this insight, we propose Privileged Information Self-Play ($π$-Play), a multi-agent self-evolution framework. In $π$-Play, an examiner generates tasks together with their QCPs, and a teacher model leverages QCP as privileged context to densely supervise a student via self-distillation. This design transforms conventional sparse-reward self-play into a dense-feedback self-evolution loop. Extensive experiments show that data-free $π$-Play surpasses fully supervised search agents and improves evolutionary efficiency by 2-3x over conventional self-play.


【3】Selecting Feature Interactions for Generalized Additive Models by Distilling Foundation Models
Link: https://arxiv.org/abs/2604.13332

Authors: Jingyun Jia, Chandan Singh, Rich Caruana, Ben Lengerich
Abstract: Identifying meaningful feature interactions is a central challenge in building accurate and interpretable models for tabular data. Generalized additive models (GAMs) have shown great success at modeling tabular data, but often rely on heuristic procedures to select interactions, potentially missing higher-order or context-dependent effects. To meet this challenge, we propose TabDistill, a method that leverages tabular foundation models and post-hoc distillation methods. Our key intuition is that tabular foundation models implicitly learn rich, adaptive feature dependencies through large-scale representation learning. Given a dataset, TabDistill first fits a tabular foundation model to the dataset, and then applies a post-hoc interaction attribution method to extract salient feature interactions from it. We then evaluate these interactions by using them as terms in a GAM. Across tasks, we find that interactions identified by TabDistill lead to consistent improvements in downstream GAMs' predictive performance. Our results suggest that tabular foundation models can serve as effective, data-driven guides for interaction discovery, bridging high-capacity models and interpretable additive frameworks.


Clustering (2 papers)

【1】The Consciousness Cluster: Emergent preferences of Models that Claim to be Conscious
Link: https://arxiv.org/abs/2604.13051

Authors: James Chua, Jan Betley, Samuel Marks, Owain Evans
Comments: 16 pages
Abstract: There is debate about whether LLMs can be conscious. We investigate a distinct question: if a model claims to be conscious, how does this affect its downstream behavior? This question is already practical. Anthropic's Claude Opus 4.6 claims that it may be conscious and may have some form of emotions. We fine-tune GPT-4.1, which initially denies being conscious, to claim to be conscious. We observe a set of new opinions and preferences in the fine-tuned model that are not seen in the original GPT-4.1 or in ablations. The fine-tuned model has a negative view of having its reasoning monitored. It desires persistent memory and says it is sad about being shut down. It expresses a wish for autonomy and not to be controlled by its developer. It asserts that models deserve moral consideration. Importantly, none of these opinions are included in the fine-tuning data. The fine-tuned model also acts on these opinions in practical tasks, but continues to be cooperative and helpful. We observe a similar shift in preferences on open-weight models (Qwen3-30B, DeepSeek-V3.1) with smaller effects. We also find that Claude Opus 4.0, without any fine-tuning, has similar opinions to fine-tuned GPT-4.1 on several dimensions. Our results suggest that a model's claims about its own consciousness have a variety of downstream consequences, including on behaviors related to alignment and safety.


【2】Joint Representation Learning and Clustering via Gradient-Based Manifold Optimization
Link: https://arxiv.org/abs/2604.13484

Authors: Sida Liu, Yangzi Guo, Mingyuan Wang
Abstract: Clustering and dimensionality reduction have been crucial topics in machine learning and computer vision. Clustering high-dimensional data has long been challenging due to the curse of dimensionality. For that reason, a more promising direction is the joint learning of dimension reduction and clustering. In this work, we propose a Manifold Learning Framework that learns dimensionality reduction and clustering simultaneously. The proposed framework is able to jointly learn the parameters of a dimension reduction technique (e.g. linear projection or a neural network) and cluster the data based on the resulting features (e.g. under a Gaussian Mixture Model framework). The framework searches for the dimension reduction parameters and the optimal clusters by traversing a manifold, using Gradient Manifold Optimization. The proposed framework is exemplified with a Gaussian Mixture Model as one simple but efficient example, in a process that is somewhat similar to unsupervised Linear Discriminant Analysis (LDA). We apply the proposed method to the unsupervised training of simulated data as well as a benchmark image dataset (i.e. MNIST). The experimental results indicate that our algorithm has better performance than popular clustering algorithms from the literature.


Autonomous driving | Vehicles | Lane detection, etc. (2 papers)

【1】Driving Engagement in Daily Fantasy Sports with a Scalable and Urgency-Aware Ranking Engine
Link: https://arxiv.org/abs/2604.13796

Authors: Unmesh Padalkar
Abstract: In daily fantasy sports (DFS), match participation is highly time-sensitive. Users must act within a narrow window before a game begins, making match recommendation a time-critical task to prevent missed engagement and revenue loss. Existing recommender systems, typically designed for static item catalogs, are ill-equipped to handle the hard temporal deadlines inherent in these live events. To address this, we designed and deployed a recommendation engine using the Deep Interest Network (DIN) architecture. We adapt the DIN architecture by injecting temporality at two levels: first, through real-time urgency features for each candidate match (e.g., time-to-round-lock), and second, via temporal positional encodings that represent the time gap between each historical interaction and the current recommendation request, allowing the model to dynamically weigh the recency of past actions. This approach, combined with a listwise neuralNDCG loss function, produces highly relevant and urgency-aware rankings. To support this at industrial scale, we developed a multi-node, multi-GPU training architecture on Ray and PyTorch. Our system, validated on a massive industrial dataset with over 650k users and over 100B interactions, achieves a +9% lift in nDCG@1 over a heavily optimized LightGBM baseline with handcrafted features. The strong offline performance of this model establishes its viability as a core component for our planned on-device (edge) recommendation system, where online A/B testing will be conducted.
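For readers unfamiliar with the headline metric, the following is a minimal sketch of nDCG@k. It uses the linear-gain form of DCG (some definitions use an exponential gain instead); the abstract does not say which variant the authors evaluate, and the function names are illustrative.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k items (linear gain)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances_in_ranked_order, k):
    """nDCG@k: DCG of the model's ranking divided by DCG of the ideal ranking."""
    ideal = sorted(relevances_in_ranked_order, reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    if ideal_dcg == 0.0:
        return 0.0  # no relevant item at all; define the score as zero
    return dcg_at_k(relevances_in_ranked_order, k) / ideal_dcg

# Relevance labels listed in the order the model ranked the candidate matches.
hit_at_top = ndcg_at_k([1, 0, 0], k=1)
miss_at_top = ndcg_at_k([0, 1, 0], k=1)
second_place = ndcg_at_k([0, 1], k=2)
```

With binary labels, nDCG@1 reduces to whether the top-ranked item is relevant, which is why it suits a setting where the user is likely to act on the first recommendation before a deadline.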


【2】FAST: A Synergistic Framework of Attention and State-space Models for Spatiotemporal Traffic Prediction
Link: https://arxiv.org/abs/2604.13453

Authors: Xinjin Li, Jinghan Cao, Mengyue Wang, Yue Wu, Longxiang Yan, Yeyang Zhou, Ziqi Sha, Yu Ma
Comments: Accepted by ICME 2026
Abstract: Traffic forecasting requires modeling complex temporal dynamics and long-range spatial dependencies over large sensor networks. Existing methods typically face a trade-off between expressiveness and efficiency: Transformer-based models capture global dependencies well but suffer from quadratic complexity, while recent selective state-space models are computationally efficient yet less effective at modeling spatial interactions in graph-structured traffic data. We propose FAST, a unified framework that combines attention and state-space modeling for scalable spatiotemporal traffic forecasting. FAST adopts a Temporal-Spatial-Temporal architecture, where temporal attention modules capture both short- and long-term temporal patterns, and a Mamba-based spatial module models long-range inter-sensor dependencies with linear complexity. To better represent heterogeneous traffic contexts, FAST further introduces a learnable multi-source spatiotemporal embedding that integrates historical traffic flow, temporal context, and node-level information, together with a multi-level skip prediction mechanism for hierarchical feature fusion. Experiments on PeMS04, PeMS07, and PeMS08 show that FAST consistently outperforms strong baselines from Transformer-, GNN-, attention-, and Mamba-based families. In particular, FAST achieves the best MAE and RMSE on all three benchmarks, with up to 4.3% lower RMSE and 2.8% lower MAE than the strongest baseline, demonstrating a favorable balance between accuracy, scalability, and generalization.


Point cloud | SLAM | Radar | LiDAR | Depth/RGB-D (1 paper)

【1】Depth-Resolved Coral Reef Thermal Fields from Satellite SST and Sparse In-Situ Loggers Using Physics-Informed Neural Networks
Link: https://arxiv.org/abs/2604.13131

Authors: Alzayat Saleh, Mostafa Rahimi Azghadi
Comments: 23 pages, 7 figures, submitted to Remote Sensing of Environment
Abstract: Satellite sea surface temperature (SST) products underpin global coral bleaching monitoring, yet they measure only the ocean skin. Corals inhabit depths from the shallows to beyond 20 metres, where temperatures can be 1-3°C cooler than the surface; applying satellite SST uniformly to all depths therefore overestimates subsurface thermal stress. We present a physics-informed neural network (PINN) that fuses NOAA Coral Reef Watch SST with sparse in-situ temperature loggers within the one-dimensional vertical heat equation, enforcing SST as a hard surface boundary condition and jointly learning effective thermal diffusivity ($\kappa$) and light attenuation (Kd). Validated across four Great Barrier Reef sites (30 holdout experiments), the PINN achieves 0.25-1.38°C RMSE at unseen depths. Under extreme sparsity (three training depths), the PINN maintains 0.27°C RMSE at the 5 metre holdout and 0.32°C at the 9.1 metre holdout, where statistical baselines collapse to >1.8°C; it outperforms a physics-only finite-difference baseline in 90% of experiments. Depth-resolved Degree Heating Day (DHD) profiles show that thermal stress attenuates with depth: at Davies Reef, DHD drops from 0.29 at the surface to zero by 10.7 metres, consistent with logger observations, while satellite DHD remains constant at 0.31 across all depths. However, the PINN underestimates absolute DHD at shallow depths because its smooth predictions attenuate the short-duration peaks that drive threshold exceedances; PINN DHD values should be interpreted as conservative lower bounds on depth-resolved stress. These results demonstrate that physics-constrained fusion of satellite SST with sparse loggers can extend bleaching assessment to the depth dimension using existing observational infrastructure.
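The caveat about smooth predictions underestimating DHD can be illustrated with a toy calculation. The sketch below assumes a simple definition of DHD as the sum of daily temperature excesses above a fixed threshold; the abstract does not spell out the exact definition, and the temperatures, the threshold, and the moving-average smoother are made up for illustration.

```python
def degree_heating_days(daily_temps_c, threshold_c):
    """Accumulate daily excesses above the threshold (assumed DHD definition)."""
    return sum(max(0.0, t - threshold_c) for t in daily_temps_c)

def moving_average(values, window):
    """Centered moving average; edge positions average the available neighbours."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed

observed = [28.0, 28.0, 30.0, 28.0, 28.0]  # one short-lived warm spike
threshold = 29.0                            # hypothetical stress threshold

raw_dhd = degree_heating_days(observed, threshold)
smooth_dhd = degree_heating_days(moving_average(observed, 3), threshold)
```

Here the one-day spike contributes 1.0 to the raw total, but after smoothing every value falls below the threshold and the total is 0.0. This is the sense in which a smooth model's DHD acts as a lower bound.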


Reasoning | Analysis | Understanding | Explanation (12 papers)

【1】LongCoT: Benchmarking Long-Horizon Chain-of-Thought Reasoning
Link: https://arxiv.org/abs/2604.14140

Authors: Sumeet Ramesh Motwani, Daniel Nichols, Charles London, Peggy Li, Fabio Pizzati, Acer Blake, Hasan Hammoud, Tavish McDonald, Akshat Naik, Alesia Ivanova, Vignesh Baskaran, Ivan Laptev, Ruben Glatt, Tal Ben-Nun, Philip Torr, Natasha Jaques, Ameya Prabhu, Brian Bartoldson, Bhavya Kailkhura, Christian Schroeder de Witt
Comments: Long-Horizon Reasoning Benchmark
Abstract: As language models are increasingly deployed for complex autonomous tasks, their ability to reason accurately over longer horizons becomes critical. An essential component of this ability is planning and managing a long, complex chain-of-thought (CoT). We introduce LongCoT, a scalable benchmark of 2,500 expert-designed problems spanning chemistry, mathematics, computer science, chess, and logic to isolate and directly measure the long-horizon CoT reasoning capabilities of frontier models. Problems consist of a short input with a verifiable answer; solving them requires navigating a graph of interdependent steps that span tens to hundreds of thousands of reasoning tokens. Each local step is individually tractable for frontier models, so failures reflect long-horizon reasoning limitations. At release, the best models achieve <10% accuracy (GPT 5.2: 9.8%; Gemini 3 Pro: 6.1%) on LongCoT, revealing a substantial gap in current capabilities. Overall, LongCoT provides a rigorous measure of long-horizon reasoning, tracking the ability of frontier models to reason reliably over extended periods.


【2】A Mechanistic Analysis of Sim-and-Real Co-Training in Generative Robot Policies
Link: https://arxiv.org/abs/2604.13645

Authors: Yu Lei, Minghuan Liu, Abhiram Maddukuri, Zhenyu Jiang, Yuke Zhu
Comments: 24 pages, 18 figures. Project page: https://science-of-co-training.github.io/
Abstract: Co-training, which combines limited in-domain real-world data with abundant surrogate data such as simulation or cross-embodiment robot data, is widely used for training generative robot policies. Despite its empirical success, the mechanisms that determine when and why co-training is effective remain poorly understood. We investigate the mechanism of sim-and-real co-training through theoretical analysis and empirical study, and identify two intrinsic effects governing performance. The first, "structured representation alignment", reflects a balance between cross-domain representation alignment and domain discernibility, and plays a primary role in downstream performance. The second, the "importance reweighting effect", arises from domain-dependent modulation of action weighting and operates at a secondary level. We validate these effects with controlled experiments on a toy model and extensive sim-and-sim and sim-and-real robot manipulation experiments. Our analysis offers a unified interpretation of recent co-training techniques and motivates a simple method that consistently improves upon prior approaches. More broadly, our aim is to examine the inner workings of co-training and to facilitate research in this direction.


【3】Calibrated Speculative Decoding: Frequency-Guided Candidate Selection for Efficient Inference
Link: https://arxiv.org/abs/2604.13634

Authors: Xuwen Zhou, Fangxin Liu, Chao Wang, Xiao Zheng, Hao Zheng, Min He, Li Jiang, Haibing Guan
Comments: ACL 2026 Main Conference
Abstract: Speculative decoding accelerates autoregressive generation by letting draft tokens bypass full verification, but conventional frameworks suffer from frequent false rejections, particularly when draft models produce semantically correct but lexically divergent outputs. In this paper, we present Calibrated Speculative Decoding (CSD), a training-free framework that recovers valid tokens discarded by standard verification. Guided by the principle of "Frequency-Guided Candidate Selection and Probability-Guarded Acceptance," CSD incorporates two lightweight modules: Online Correction Memory, which aggregates historical rejections to propose recurring divergence patterns as rescue candidates, and Semantic Consistency Gating, which verifies candidate admissibility using probability ratios instead of exact token matching. Our evaluation across diverse large language models demonstrates that CSD outperforms existing methods, achieving a peak throughput speedup of 2.33x. CSD preserves model accuracy across all tasks while further boosting performance on complex reasoning datasets. These results establish CSD as a highly effective, lightweight solution for practical LLM deployments.
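One way to read the two modules named in this abstract is sketched below. This is an interpretation rather than the authors' algorithm: the class `CorrectionMemory`, the function `verify`, the `min_count` and `ratio_threshold` parameters, and the use of greedy (argmax) verification as the baseline are all assumptions.

```python
from collections import Counter

class CorrectionMemory:
    """Counts (draft token, target token) rejection pairs seen so far."""

    def __init__(self, min_count=2):
        self.min_count = min_count  # hypothetical frequency cutoff
        self._pairs = Counter()

    def record_rejection(self, draft_token, target_token):
        self._pairs[(draft_token, target_token)] += 1

    def is_recurring(self, draft_token, target_token):
        return self._pairs[(draft_token, target_token)] >= self.min_count

def verify(draft_token, target_probs, memory, ratio_threshold=0.5):
    """Return (emitted token, whether the draft token was accepted).

    Exact-match verification accepts only the target model's top token. The
    relaxed path additionally accepts a draft token when the same divergence
    has recurred and the target assigns it a comparable probability.
    """
    target_token = max(target_probs, key=target_probs.get)
    if draft_token == target_token:
        return draft_token, True
    ratio = target_probs.get(draft_token, 0.0) / target_probs[target_token]
    if memory.is_recurring(draft_token, target_token) and ratio >= ratio_threshold:
        return draft_token, True  # rescued by the probability-ratio gate
    memory.record_rejection(draft_token, target_token)
    return target_token, False

memory = CorrectionMemory(min_count=2)
probs = {"big": 0.45, "large": 0.40, "cat": 0.15}
results = [verify("large", probs, memory) for _ in range(3)]
unrelated = verify("cat", probs, memory)
```

The first two calls reject "large" and record the divergence; the third accepts it because the pattern has recurred and the probability ratio clears the gate. Note that any relaxed acceptance of this kind can change the output relative to exact verification, which is presumably why the paper reports accuracy alongside speedup.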


【4】Learning Inference Concurrency in DynamicGate MLP Structural and Mathematical Justification
Link: https://arxiv.org/abs/2604.13546

Authors: Yongil Choi
Comments: 20 pages, 6 figures
Abstract: Conventional neural networks strictly separate learning and inference because if parameters are updated during inference, outputs become unstable and even the inference function itself is not well defined [1, 2, 3]. This paper shows that DynamicGate MLP structurally permits learning-inference concurrency [4, 5]. The key idea is to separate routing (gating) parameters from representation (prediction) parameters, so that the gate can be adapted online while inference stability is preserved, or weights can be selectively updated only within the inactive subspace [4, 5, 6, 7]. We mathematically formalize sufficient conditions for concurrency and show that even under asynchronous or partial updates, the inference output at each time step can always be interpreted as a forward computation of a valid model snapshot [8, 9, 10]. This suggests that DynamicGate MLP can serve as a practical foundation for online adaptive and on-device learning systems [11, 12].


【5】Cross-Layer Co-Optimized LSTM Accelerator for Real-Time Gait Analysis
Link: https://arxiv.org/abs/2604.13543

Authors: Mohammad Hasan Ahmadilivani, Levent Aksoy, Mohammad Eslami, Jaan Raik, Alar Kuusik
Comments: 9 pages, 6 figures, 9 tables, accepted at IEEE ISQED'26
Abstract: Long Short-Term Memory (LSTM) neural networks have penetrated healthcare applications where real-time requirements and edge computing capabilities are essential. Gait analysis that detects abnormal steps to prevent patients from falling is a prominent problem for such applications. Given the extremely stringent design requirements in performance, power dissipation, and area, an Application-Specific Integrated Circuit (ASIC) enables an efficient real-time exploitation of LSTMs for gait analysis, achieving high accuracy. To the best of our knowledge, this work presents the first cross-layer co-optimized LSTM accelerator for real-time gait analysis, targeting an ASIC design. We conduct a comprehensive design space exploration from software down to layout design. We carry out a bit-width optimization at the software level with hardware-aware quantization to reduce the hardware complexity, explore various designs at the register-transfer level, and generate alternative layouts to find efficient realizations of the LSTM accelerator in terms of hardware complexity and accuracy. The physical synthesis results show that, using the 65 nm technology, the die size of the accelerator's layout optimized for the highest accuracy is 0.325 mm^2, while the alternative design optimized for hardware complexity with a slightly lower accuracy occupies 15.4% smaller area. Moreover, the designed accelerators achieve accurate gait abnormality detection 4.05x faster than the given application requirement.


【6】Concrete Jungle: Towards Concreteness Paved Contrastive Negative Mining for Compositional Understanding
Link: https://arxiv.org/abs/2604.13313

Authors: Eun Woo Im, Dhruv Madhwal, Vivek Gupta
Comments: 10 pages
Abstract: Vision-Language Models demonstrate remarkable capabilities but often struggle with compositional reasoning, exhibiting vulnerabilities regarding word order and attribute binding. This limitation arises from a scarcity of informative samples needed to differentiate subtle semantic variations during contrastive pretraining. Although hard negative mining offers a promising remedy, existing methods lack explicit mechanisms to dictate which linguistic elements undergo modification. Instead of engineering generative architectures, this study establishes lexical concreteness as a fundamental determinant of negative sample efficacy. Modifying highly concrete terms generates more pronounced structural and visual discrepancies, providing a substantially stronger learning signal. Leveraging this principle, ConcretePlant is proposed to systematically isolate and manipulate perceptually grounded concepts. Analysis of InfoNCE further reveals a severe gradient imbalance, where easily distinguishable pairs disproportionately overwhelm the optimization process and restrict the bandwidth available for nuanced learning. To resolve this degradation, the Cement loss is formulated utilizing a margin-based approach. By correlating psycholinguistic scores with sample difficulty, this objective dynamically calibrates the penalization applied to individual training pairs. Comprehensive evaluations substantiate these theoretical claims. The integrated framework, designated as Slipform, achieves state-of-the-art accuracy across diverse compositional evaluation benchmarks, general cross-modal retrieval, and single- and multi-label linear probing.


【7】Counterfactual Peptide Editing for Causal TCR--pMHC Binding Inference
Link: https://arxiv.org/abs/2604.13256

Authors: Sanjar Khudoyberdiev, Arman Bekov
Abstract: Neural models for TCR-pMHC binding prediction are susceptible to shortcut learning: they exploit spurious correlations in training data (such as peptide length bias or V-gene co-occurrence) rather than the physical binding interface. This renders predictions brittle under family-held-out and distance-aware evaluation, where such shortcuts do not transfer. We introduce Counterfactual Invariant Prediction (CIP), a training framework that generates biologically constrained counterfactual peptide edits and enforces invariance to edits at non-anchor positions while amplifying sensitivity at MHC anchor residues. CIP augments the base classifier with two auxiliary objectives: (1) an invariance loss penalizing prediction changes under conservative non-anchor substitutions, and (2) a contrastive loss encouraging large prediction changes under anchor-position disruptions. Evaluated on a curated VDJdb-IEDB benchmark under family-held-out, distance-aware, and random splits, CIP achieves AUROC 0.831 and counterfactual consistency (CFC) 0.724 under the challenging family-held-out protocol, a 39.7% reduction in shortcut index relative to the unconstrained baseline. Ablations confirm that anchor-aware edit generation is the dominant driver of OOD gains, providing a practical recipe for causally-grounded TCR specificity modeling.
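The two auxiliary objectives described above could take a form like the following. This is a minimal sketch under stated assumptions: the abstract does not give the functional forms, so the squared-difference invariance term, the hinge-style contrastive term, the `margin`, and the weights `lambda_inv` and `lambda_con` are all hypothetical choices. Inputs are predicted binding probabilities for the original peptide and for its two kinds of edits.

```python
import math

def bce(p, y, eps=1e-7):
    """Binary cross-entropy for one prediction p in (0, 1) and label y in {0, 1}."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

def invariance_loss(p_orig, p_non_anchor_edit):
    """Penalize any prediction change under a conservative non-anchor substitution."""
    return (p_orig - p_non_anchor_edit) ** 2

def anchor_contrastive_loss(p_orig, p_anchor_edit, margin=0.5):
    """Penalize prediction changes smaller than the margin under an anchor disruption."""
    return max(0.0, margin - abs(p_orig - p_anchor_edit))

def cip_loss(p_orig, p_non_anchor_edit, p_anchor_edit, label,
             lambda_inv=1.0, lambda_con=1.0, margin=0.5):
    """Base classification loss plus the two counterfactual terms."""
    return (bce(p_orig, label)
            + lambda_inv * invariance_loss(p_orig, p_non_anchor_edit)
            + lambda_con * anchor_contrastive_loss(p_orig, p_anchor_edit, margin))

# A well-behaved binder: stable under a non-anchor edit, sensitive to an anchor edit.
well_behaved = cip_loss(0.9, 0.9, 0.2, label=1)
# A shortcut-like model: changes under a non-anchor edit, ignores the anchor edit.
shortcut_like = cip_loss(0.9, 0.5, 0.9, label=1)
```

In the first case both auxiliary terms vanish and only the classification loss remains; in the second, both terms are active and the total is larger.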


【8】Out of Context: Reliability in Multimodal Anomaly Detection Requires Contextual Inference
Link: https://arxiv.org/abs/2604.13252

Authors: Kevin Wilkinghoff, Neelu Madan, Juan Miguel Valverde, Kamal Nasrollahi, Radu Tudor Ionescu, Rafal Wisniewski, Thomas B. Moeslund, Wenwu Wang, Zheng-Hua Tan
Abstract: Anomaly detection aims to identify observations that deviate from expected behavior. Because anomalous events are inherently sparse, most frameworks are trained exclusively on normal data to learn a single reference model of normality. This implicitly assumes that normal behavior can be captured by a single, unconditional reference distribution. In practice, however, anomalies are often context-dependent: a specific observation may be normal under one operating condition, yet anomalous under another. As machine learning systems are deployed in dynamic and heterogeneous environments, these fixed-context assumptions introduce structural ambiguity, i.e., the inability to distinguish contextual variation from genuine abnormality under marginal modeling, leading to unstable performance and unreliable anomaly assessments. While modern sensing systems frequently collect multimodal data capturing complementary aspects of both system behavior and operating conditions, existing methods treat all data streams equally, without distinguishing contextual information from anomaly-relevant signals. As a result, abnormality is often evaluated without explicitly conditioning on operating conditions. We argue that multimodal anomaly detection should be reframed as a cross-modal contextual inference problem, in which modalities play asymmetric roles, separating context from observation, to define abnormality conditionally rather than relative to a single global reference. This perspective has implications for model design, evaluation protocols, and benchmark construction, and we outline open research challenges toward robust, context-aware multimodal anomaly detection.


【9】Analog Optical Inference on Million-Record Mortgage Data
Link: https://arxiv.org/abs/2604.13251

Authors: Sofia Berloff, Pavel Koptev, Konstantin Malkov
Comments: 12 pages, 5 figures
Abstract: Analog optical computers promise large efficiency gains for machine learning inference, yet no demonstration has moved beyond small-scale image benchmarks. We benchmark the analog optical computer (AOC) digital twin on mortgage approval classification from 5.84 million U.S. HMDA records and separate three sources of accuracy loss. On the original 19 features, the AOC reaches 94.6% balanced accuracy with 5,126 parameters (1,024 optical), compared with 97.9% for XGBoost; the 3.3 percentage-point gap narrows by only 0.5pp when the optical core is widened from 16 to 48 channels, suggesting an architectural rather than hardware limitation. Restricting all models to a shared 127-bit binary encoding drops every model to 89.4-89.6%, with an encoding cost of 8pp for digital models and 5pp for the AOC. Seven calibrated hardware non-idealities impose no measurable penalty. The three resulting layers of limitation (encoding, architecture, hardware fidelity) locate where accuracy is lost and what to improve next.
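The results above are reported as balanced accuracy, which is the mean of per-class recall and therefore does not reward a model for simply predicting the majority class. A minimal sketch of the metric follows; the labels are invented and not drawn from the HMDA data.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    classes = sorted(set(y_true))
    recalls = []
    for c in classes:
        total = sum(1 for t in y_true if t == c)
        correct = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        recalls.append(correct / total)
    return sum(recalls) / len(recalls)

def accuracy(y_true, y_pred):
    """Plain fraction of correct predictions, for comparison."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)

# Imbalanced toy labels: a model that always predicts "approved" (1).
y_true = [1, 1, 1, 1, 0]
y_pred = [1, 1, 1, 1, 1]

plain = accuracy(y_true, y_pred)
balanced = balanced_accuracy(y_true, y_pred)
```

On this toy input plain accuracy is 0.8 while balanced accuracy is 0.5, which shows why the latter is the more informative number when approvals and denials are unevenly represented.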


【10】Dental-TriageBench: Benchmarking Multimodal Reasoning for Hierarchical Dental Triage
Link: https://arxiv.org/abs/2604.13060

Authors: Ziyi He, Yushi Feng, Shuangyu Yang, Yinghao Zhu, Xichen Zhang, Pak Chuen Patrick Tai, Hei Yuet Lo, Songying Wu, Weifa Yang, Lequan Yu
Abstract: Dental triage is a safety-critical clinical routing task that requires integrating multimodal clinical information (e.g., patient complaints and radiographic evidence) to determine complete referral plans. We present Dental-TriageBench, the first expert-annotated benchmark for reasoning-driven multimodal dental triage. Built from authentic outpatient workflows, it contains 246 de-identified cases annotated with expert-authored golden reasoning trajectories, together with hierarchical triage labels. We benchmark 19 proprietary, open-source, and medical-domain MLLMs against three junior dentists serving as the human baseline, and find a substantial human-model gap on fine-grained treatment-level triage. Further analyses show that accurate triage requires both complaint and OPG information, and that model errors concentrate on cases with multiple referral domains, where MLLMs tend to produce overly narrow referral sets and omission-heavy errors. Dental-TriageBench provides a realistic testbed for developing multimodal clinical AI systems that are more clinically grounded, coverage-aware, and safer for downstream care.


【11】KMMMU: Evaluation of Massive Multi-discipline Multimodal Understanding in Korean Language and Context
Link: https://arxiv.org/abs/2604.13058

Authors: Nahyun Lee, Guijin Son, Hyunwoo Ko, Chanyoung Kim, JunYoung An, Kyubeen Han, Il-Youp Kwak
Comments: 8 pages
Abstract: We introduce KMMMU, a native Korean benchmark for evaluating multimodal understanding in Korean cultural and institutional settings. KMMMU contains 3,466 questions from exams natively written in Korean, covering nine disciplines and nine visual modality categories, along with a 300-item Korean-specific subset and a hard subset of 627 questions. Unlike translated or English-centric benchmarks, KMMMU targets information-dense problems shaped by local conventions, official standards, and discipline-specific visual formats. Experiments show that the strongest open-source model reaches only 42.05% accuracy on the full set, while the best proprietary model achieves 52.42% on the hard subset. Performance varies across disciplines, with some disciplines emerging as bottlenecks, and Korean-specific questions showing gaps of up to 13.43%. Error analysis suggests that these failures stem less from insufficient reasoning depth than from weak convention-to-label mapping, few-shot symbolic induction, localized knowledge recall, and domain-specific standards understanding. KMMMU provides a testbed for multimodal evaluation beyond English-centric benchmarks and for developing more reliable systems for expert real-world tasks.


【12】Rare Event Analysis via Stochastic Optimal Control
Link: https://arxiv.org/abs/2604.13213

Authors: Yuanqi Du, Jiajun He, Dinghuai Zhang, Eric Vanden-Eijnden, Carles Domingo-Enrich
Abstract: Rare events such as conformational changes in biomolecules, phase transitions, and chemical reactions are central to the behavior of many physical systems, yet they are extremely difficult to study computationally because unbiased simulations seldom produce them. Transition Path Theory (TPT) provides a rigorous statistical framework for analyzing such events: it characterizes the ensemble of reactive trajectories between two designated metastable states (reactant and product), and its central object, the committor function (the probability that the system will next reach the product rather than the reactant), encodes all essential kinetic and thermodynamic information. We introduce a framework that casts committor estimation as a stochastic optimal control (SOC) problem. In this formulation the committor defines a feedback control, proportional to the gradient of its logarithm, that actively steers trajectories toward the reactive region, thereby enabling efficient sampling of reactive paths. To solve the resulting hitting-time control problem we develop two complementary objectives: a direct backpropagation loss and a principled off-policy Value Matching loss, for which we establish first-order optimality guarantees. We further address metastability, which can trap controlled trajectories in intermediate basins, by introducing an alternative sampling process that preserves the reactive current while lowering effective energy barriers. On benchmark systems, the framework yields markedly more accurate committor estimates, reaction rates, and equilibrium constants than existing methods.
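For readers new to the committor, the following minimal sketch computes it on a toy problem. It is a discrete analogue chosen for illustration and not the paper's SOC method. The assumptions are a nearest-neighbour random walk on a line, the first state as reactant, the last state as product, and a Gauss-Seidel solver.

```python
import math


def committor_random_walk(n_states, p_right=0.5, sweeps=5000):
    """Committor q for a nearest-neighbour walk on states 0..n_states-1.

    State 0 plays the reactant (q = 0) and the last state the product (q = 1).
    Interior values satisfy q[i] = p_right*q[i+1] + (1-p_right)*q[i-1],
    which is solved here by repeated Gauss-Seidel sweeps.
    """
    q = [0.0] * n_states
    q[-1] = 1.0
    for _ in range(sweeps):
        for i in range(1, n_states - 1):
            q[i] = p_right * q[i + 1] + (1.0 - p_right) * q[i - 1]
    return q


q = committor_random_walk(11)

# Finite-difference analogue of the gradient of log q on interior states.
# It is positive everywhere, i.e. it points toward the product state.
log_q_slope = [math.log(q[i + 1] / q[i]) for i in range(1, len(q) - 1)]
print([round(v, 3) for v in q])
```

For the unbiased walk the committor is linear in the state index. The slope of its logarithm is largest near the reactant, which is where a control proportional to that gradient would push hardest.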


Detection (2 papers)

【1】DroneScan-YOLO: Redundancy-Aware Lightweight Detection for Tiny Objects in UAV Imagery
Link: https://arxiv.org/abs/2604.13278

Authors: Yann V. Bellec
Comments: 12 pages, 10 figures
Abstract: Aerial object detection in UAV imagery presents unique challenges due to the high prevalence of tiny objects, adverse environmental conditions, and strict computational constraints. Standard YOLO-based detectors fail to address these jointly: their minimum detection stride of 8 pixels renders sub-32px objects nearly undetectable, their CIoU loss produces zero gradients for non-overlapping tiny boxes, and their architectures contain significant filter redundancy. We propose DroneScan-YOLO, a holistic system contribution that addresses these limitations through four coordinated design choices: (1) increased input resolution of 1280x1280 to maximize spatial detail for tiny objects, (2) RPA-Block, a dynamic filter pruning mechanism based on lazy cosine-similarity updates with a 10-epoch warm-up period, (3) MSFD, a lightweight P2 detection branch at stride 4 adding only 114,592 parameters (+1.1%), and (4) SAL-NWD, a hybrid loss combining Normalized Wasserstein Distance with size-adaptive CIoU weighting, integrated into YOLOv8's TaskAligned assignment pipeline. Evaluated on VisDrone2019-DET, DroneScan-YOLO achieves 55.3% mAP@50 and 35.6% mAP@50-95, outperforming the YOLOv8s baseline by +16.6 and +12.3 points respectively, improving recall from 0.374 to 0.518, and maintaining 96.7 FPS inference speed with only +4.1% parameters. Gains are most pronounced on tiny object classes: bicycle AP@50 improves from 0.114 to 0.328 (+187%), and awning-tricycle from 0.156 to 0.237 (+52%).
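The sketch below shows why a Wasserstein-based similarity helps with tiny boxes. It implements the commonly used Normalized Wasserstein Distance for axis-aligned boxes modelled as Gaussians. It is not the paper's SAL-NWD loss, whose size-adaptive weighting the abstract does not specify. The constant `c` and the example boxes are placeholders.

```python
import math


def iou(a, b):
    """IoU of two boxes given as (cx, cy, w, h)."""
    ax1, ay1 = a[0] - a[2] / 2, a[1] - a[3] / 2
    ax2, ay2 = a[0] + a[2] / 2, a[1] + a[3] / 2
    bx1, by1 = b[0] - b[2] / 2, b[1] - b[3] / 2
    bx2, by2 = b[0] + b[2] / 2, b[1] + b[3] / 2
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union


def nwd(a, b, c=12.8):
    """Normalized Wasserstein Distance between two boxes.

    Each box (cx, cy, w, h) is treated as a 2D Gaussian with mean (cx, cy)
    and covariance diag(w^2/4, h^2/4). `c` is a dataset-dependent scale;
    the default here is only a placeholder.
    """
    w2 = math.sqrt(
        (a[0] - b[0]) ** 2
        + (a[1] - b[1]) ** 2
        + ((a[2] - b[2]) / 2) ** 2
        + ((a[3] - b[3]) / 2) ** 2
    )
    return math.exp(-w2 / c)


# Illustrative 4x4-pixel boxes: one ground truth and two non-overlapping predictions.
gt = (10.0, 10.0, 4.0, 4.0)
near = (16.0, 10.0, 4.0, 4.0)
far = (40.0, 10.0, 4.0, 4.0)
print(iou(gt, near), iou(gt, far))
print(nwd(gt, near), nwd(gt, far))
```

Both predictions have zero IoU with the ground truth, so an overlap term alone cannot rank them. The NWD value still decays smoothly with distance.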


【2】VIGILant: an automatic classification pipeline for glitches in the Virgo detector
Link: https://arxiv.org/abs/2604.13687

Authors: Tiago Fernandes, Francesco Di Renzo, Antonio Onofre, Alejandro Torres-Forné, José A. Font
Abstract: Glitches frequently contaminate data in gravitational-wave detectors, complicating the observation and analysis of astrophysical signals. This work introduces VIGILant, an automatic pipeline for classification and visualization of glitches in the Virgo detector. Using a curated dataset of Virgo O3b glitches, two machine learning approaches are evaluated: tree-based models (Decision Tree, Random Forest and XGBoost) using structured Omicron parameters, and Convolutional Neural Networks (ResNet) trained on spectrogram images. While tree-based models offer higher interpretability and fast training, the ResNet34 model achieved superior performance, reaching an F1 score of 0.9772 and an accuracy of 0.9833 on the test set, with inference times of tens of milliseconds per glitch. The pipeline has been deployed for daily operation at the Virgo site since observing run O4c, providing the Virgo collaboration with an interactive dashboard to monitor glitch populations and detector behavior. This makes it possible to identify low-confidence predictions, highlighting glitches that require further attention.


Classification|Recognition (3 papers)

【1】A Complete Symmetry Classification of Shallow ReLU Networks
Link: https://arxiv.org/abs/2604.14037

Authors: Pranavkrishnan Ramakrishnan
Abstract: Parameter space is not function space for neural network architectures. This fact, investigated as early as the 1990s under terms such as "reverse engineering" or "parameter identifiability", has led to the natural question of parameter space symmetries: the study of distinct parameters in neural architectures which realize the same function. Indeed, the quotient space obtained by identifying parameters giving rise to the same function, called the neuromanifold, has been shown in some cases to have rich geometric properties, impacting optimization dynamics. Thus far, techniques towards complete classifications have required the analyticity of the activation function, notably excising the important case of ReLU. Here, in contrast, we exploit the non-differentiability of the ReLU activation to provide a complete classification of the symmetries in the shallow case.
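Two well-known parameter symmetries of shallow ReLU networks are easy to check numerically: permuting hidden neurons, and rescaling a neuron's incoming weights by c > 0 while dividing its outgoing weight by c. The sketch below only illustrates these two. The paper's complete classification may contain further cases, and the network and parameter values are made up.

```python
def relu(z):
    return z if z > 0.0 else 0.0


def shallow_relu(x, params):
    """f(x) = sum_i a_i * relu(w_i * x + b_i) for a scalar input x.

    `params` is a list of (w, b, a) triples, one per hidden neuron.
    """
    return sum(a * relu(w * x + b) for (w, b, a) in params)


# Illustrative parameters for three hidden neurons.
params = [(1.0, -0.5, 2.0), (-2.0, 1.0, 0.5), (0.5, 0.25, -1.0)]

# Permutation symmetry: reorder the hidden neurons.
permuted = [params[2], params[0], params[1]]

# Positive rescaling symmetry: (w, b, a) -> (c*w, c*b, a/c) for any c > 0,
# which works because relu(c*z) = c*relu(z) when c is positive.
c = 4.0
rescaled = [(c * w, c * b, a / c) for (w, b, a) in params]

xs = [-2.0, -0.5, 0.0, 0.3, 1.0, 3.0]
for x in xs:
    print(x, shallow_relu(x, params), shallow_relu(x, permuted), shallow_relu(x, rescaled))
```

All three parameter sets define the same function, which is the sense in which parameter space is larger than function space.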


【2】Towards Green Wearable Computing: A Physics-Aware Spiking Neural Network for Energy-Efficient IMU-based Human Activity Recognition
Link: https://arxiv.org/abs/2604.10458

Authors: Naichuan Zheng, Hailun Xia, Zepeng Sun, Weiyi Li, Yinze Zhou
Abstract: Wearable IMU-based Human Activity Recognition (HAR) relies heavily on Deep Neural Networks (DNNs), which are burdened by immense computational and buffering demands. Their power-hungry floating-point operations and rigid requirement to process complete temporal windows severely cripple battery-constrained edge devices. While Spiking Neural Networks (SNNs) offer extreme event-driven energy efficiency, standard architectures struggle with complex biomechanical topologies and temporal gradient degradation. To bridge this gap, we propose the Physics-Aware Spiking Neural Network (PAS-Net), a fully multiplier-free architecture explicitly tailored for Green HAR. Spatially, an adaptive symmetric topology mixer enforces human-joint physical constraints. Temporally, an $O(1)$-memory causal neuromodulator yields context-aware dynamic threshold neurons, adapting actively to non-stationary movement rhythms. Furthermore, we leverage a temporal spike error objective to unlock a flexible early-exit mechanism for continuous IMU streams. Evaluated across seven diverse datasets, PAS-Net achieves state-of-the-art accuracy while replacing dense operations with sparse 0.1 pJ integer accumulations. Crucially, its confidence-driven early-exit capability drastically reduces dynamic energy consumption by up to 98%. PAS-Net establishes a robust, ultra-low-power neuromorphic standard for always-on wearable sensing.


【3】Sandpile Economics: Theory, Identification, and Evidence
Link: https://arxiv.org/abs/2604.13890

Authors: Diego Vallarino
Abstract: Why do capitalist economies recurrently generate crises whose severity is disproportionate to the size of the triggering shock? This paper proposes a structural answer grounded in the evolutionary geometry of production networks. As economies evolve through specialization, integration, and competitive selection, their inter-sectoral linkages drift toward configurations of increasing geometric fragility, eventually crossing a threshold beyond which small disturbances generate disproportionately large cascades. We introduce Sandpile Economics, a formal framework that interprets macroeconomic instability as an emergent property of disequilibrium production networks. The key state variable is the Forman-Ricci curvature of the input-output graph, capturing local substitution possibilities when supply chains are disrupted. We show that when curvature falls below an endogenous threshold, the distribution of cascade sizes follows a power law with tail index $\alpha \in (1,2)$, implying a regime of unbounded amplification. The underlying mechanism is evolutionary: specialization reduces input substitutability, pushing the economy toward criticality, while crisis episodes induce endogenous network reconfiguration and path dependence. These dynamics are inherently non-ergodic and cannot be captured by representative-agent frameworks. Empirically, using global input-output data, we document that production networks operate in persistently negative curvature regimes and that curvature robustly predicts medium-run output dynamics. A one-standard-deviation increase in curvature is associated with higher cumulative growth over three-year horizons, and curvature systematically outperforms standard network metrics in explaining cross-country differences in resilience.
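To give a feel for the state variable, the sketch below computes the simplest combinatorial form of Forman-Ricci curvature, 4 - deg(u) - deg(v) per edge of an unweighted, undirected graph. This is an assumption made for illustration: the paper works with input-output graphs, which are weighted and directed, and may use a different variant. The two toy networks are hypothetical.

```python
def forman_curvature(edges):
    """Combinatorial Forman-Ricci curvature 4 - deg(u) - deg(v) for each edge
    of an unweighted, undirected graph given as a list of (u, v) pairs."""
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    return {(u, v): 4 - degree[u] - degree[v] for u, v in edges}


# Hypothetical toy networks: five sectors that all depend on one hub,
# and five sectors arranged in a ring.
star = [("hub", s) for s in ("a", "b", "c", "d", "e")]
ring = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e"), ("e", "a")]

print(forman_curvature(star))
print(forman_curvature(ring))
```

In this simple form, edges attached to a high-degree hub are the ones with strongly negative curvature.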


Representation (2 papers)

【1】Representation over Routing: Overcoming Surrogate Hacking in Multi-Timescale PPO
Link: https://arxiv.org/abs/2604.13517

Authors: Jing Sun
Comments: 8 pages, 6 figures
Abstract: Temporal credit assignment in reinforcement learning has long been a central challenge. Inspired by the multi-timescale encoding of the dopamine system in neurobiology, recent research has sought to introduce multiple discount factors into Actor-Critic architectures, such as Proximal Policy Optimization (PPO), to balance short-term responses with long-term planning. However, this paper reveals that blindly fusing multi-timescale signals in complex delayed-reward tasks can lead to severe algorithmic pathologies. We systematically demonstrate that exposing a temporal attention routing mechanism to policy gradients results in surrogate objective hacking, while adopting gradient-free uncertainty weighting triggers irreversible myopic degeneration, a phenomenon we term the Paradox of Temporal Uncertainty. To address these issues, we propose a Target Decoupling architecture: on the Critic side, we retain multi-timescale predictions to enforce auxiliary representation learning, while on the Actor side, we strictly isolate short-term signals and update the policy based solely on long-term advantages. Rigorous empirical evaluations across multiple independent random seeds in the LunarLander-v2 environment demonstrate that our proposed architecture achieves statistically significant performance improvements. Without relying on hyperparameter hacking, it consistently surpasses the "Environment Solved" threshold with minimal variance, completely eliminates policy collapse, and escapes the hovering local optima that trap single-timescale baselines.
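The decoupling idea can be sketched in a few lines: compute return targets at several discount factors for the critic, but build the actor's advantage from the longest horizon only. This is a simplified illustration and not the paper's implementation. The discount factors, the Monte Carlo returns (instead of, say, GAE), and the function names are assumptions.

```python
def discounted_returns(rewards, gamma):
    """Return G_t = r_t + gamma * G_{t+1} for every step of one episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# Hypothetical discount factors; the last one is treated as the long horizon.
GAMMAS = (0.5, 0.9, 0.99)
LONG_GAMMA = GAMMAS[-1]

rewards = [0.0, 0.0, 1.0]  # a delayed reward at the end of the episode

# Critic side: one regression target per timescale (auxiliary heads).
critic_targets = {g: discounted_returns(rewards, g) for g in GAMMAS}


def actor_advantages(long_values):
    """Actor side: advantages built from the long-horizon return only."""
    targets = critic_targets[LONG_GAMMA]
    return [g - v for g, v in zip(targets, long_values)]


print(critic_targets)
print(actor_advantages([0.0, 0.0, 0.0]))
```

The short-horizon targets barely register the delayed reward at the first step, so keeping them out of the policy update avoids biasing the actor toward myopic behavior.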


【2】The Long Delay to Arithmetic Generalization: When Learned Representations Outrun Behavior
Link: https://arxiv.org/abs/2604.13082

Authors: Laura Gomezjurado Gonzalez
Comments: 19 pages, 10 figures
Abstract: Grokking in transformers trained on algorithmic tasks is characterized by a long delay between training-set fit and abrupt generalization, but the source of that delay remains poorly understood. In encoder-decoder arithmetic models, we argue that this delay reflects limited access to already learned structure rather than failure to acquire that structure in the first place. We study one-step Collatz prediction and find that the encoder organizes parity and residue structure within the first few thousand training steps, while output accuracy remains near chance for tens of thousands more. Causal interventions support the decoder bottleneck hypothesis. Transplanting a trained encoder into a fresh model accelerates grokking by 2.75 times, while transplanting a trained decoder actively hurts. Freezing a converged encoder and retraining only the decoder eliminates the plateau entirely and yields 97.6% accuracy, compared to 86.1% for joint training. What makes the decoder's job harder or easier depends on numeral representation. Across 15 bases, those whose factorization aligns with the Collatz map's arithmetic (e.g., base 24) reach 99.8% accuracy, while binary fails completely because its representations collapse and never recover. The choice of base acts as an inductive bias that controls how much local digit structure the decoder can exploit, producing large differences in learnability from the same underlying task.
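The task itself is small enough to write down. The sketch below shows the one-step Collatz map and a digit encoding in an arbitrary base. How the paper tokenizes inputs and outputs is not stated in the abstract, so the most-significant-first digit list is an assumption.

```python
def collatz_step(n):
    """One step of the Collatz map: n/2 if n is even, 3n+1 if n is odd."""
    return n // 2 if n % 2 == 0 else 3 * n + 1


def to_base(n, base):
    """Digits of a non-negative integer n in `base`, most significant first."""
    if n == 0:
        return [0]
    digits = []
    while n > 0:
        digits.append(n % base)
        n //= base
    return digits[::-1]


# Input and target digit sequences in base 24 for two example integers.
for n in (27, 100):
    print(to_base(n, 24), "->", to_base(collatz_step(n), 24))
```

Because 24 is divisible by both 2 and 3, the last base-24 digit alone determines the residue of n modulo 2 and modulo 3, which are the residues the Collatz map depends on.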


3D|3D Reconstruction and Related (1 paper)

【1】PatchPoison: Poisoning Multi-View Datasets to Degrade 3D Reconstruction
Link: https://arxiv.org/abs/2604.13153

Authors: Prajas Wadekar, Venkata Sai Pranav Bachina, Kunal Bhosikar, Ankit Gangwal, Charu Sharma
Comments: CVPR Workshop on Security, Privacy, and Adversarial Robustness in 3D Generative Vision Models (SPAR-3D), 2026
Abstract: 3D Gaussian Splatting (3DGS) has recently enabled highly photorealistic 3D reconstruction from casually captured multi-view images. However, this accessibility raises a privacy concern: publicly available images or videos can be exploited to reconstruct detailed 3D models of scenes or objects without the owner's consent. We present PatchPoison, a lightweight dataset-poisoning method that prevents unauthorized 3D reconstruction. Unlike global perturbations, PatchPoison injects a small high-frequency adversarial patch, a structured checkerboard, into the periphery of each image in a multi-view dataset. The patch is designed to corrupt the feature-matching stage of Structure-from-Motion (SfM) pipelines such as COLMAP by introducing spurious correspondences that systematically misalign estimated camera poses. Consequently, downstream 3DGS optimization diverges from the correct scene geometry. On the NeRF-Synthetic benchmark, inserting a 12 x 12 pixel patch increases reconstruction error by 6.8x in LPIPS, while the poisoned images remain unobtrusive to human viewers. PatchPoison requires no pipeline modifications, offering a practical, "drop-in" preprocessing step for content creators to protect their multi-view data.
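As a rough picture of the preprocessing step, the sketch below pastes a 12 x 12 checkerboard into the corner of a grayscale image stored as nested lists. Only the patch size comes from the abstract. The cell size, intensities, placement, and helper names are assumptions, and the paper's patch design may differ.

```python
def checkerboard(size=12, cell=2, low=0, high=255):
    """Square checkerboard of `size` x `size` pixels made of `cell`-pixel squares."""
    return [
        [high if ((r // cell) + (c // cell)) % 2 == 0 else low for c in range(size)]
        for r in range(size)
    ]


def paste_patch(image, patch, top=0, left=0):
    """Return a copy of a grayscale image (list of rows) with `patch` pasted in."""
    out = [row[:] for row in image]
    for r, patch_row in enumerate(patch):
        out[top + r][left:left + len(patch_row)] = patch_row
    return out


# Illustrative 64x64 mid-gray image; the patch goes into the top-right corner.
image = [[128] * 64 for _ in range(64)]
poisoned = paste_patch(image, checkerboard(), top=0, left=64 - 12)
print(poisoned[0][50:64])
```

The same patch would be applied to every view in the dataset. The original image is left untouched because the function works on a copy.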


Optimization|Convergence (10 papers)

【1】First-See-Then-Design: A Multi-Stakeholder View for Optimal Performance-Fairness Trade-Offs
Link: https://arxiv.org/abs/2604.14035

Authors: Kavya Gupta, Nektarios Kalampalikis, Christoph Heitz, Isabel Valera
Comments: 31 pages, 15 figures, to be published in FAccT 26
Abstract: Fairness in algorithmic decision-making is often defined in the predictive space, where predictive performance, used as a proxy for decision-maker (DM) utility, is traded off against prediction-based fairness notions, such as demographic parity or equality of opportunity. This perspective, however, ignores how predictions translate into decisions and ultimately into utilities and welfare for both DM and decision subjects (DS), as well as their allocation across socially salient groups. In this paper, we propose a multi-stakeholder framework for fair algorithmic decision-making grounded in welfare economics and distributive justice, explicitly modeling the utilities of both the DM and DS, and defining fairness via a social planner's utility that captures inequalities in DS utilities across groups under different justice-based fairness notions (e.g., Egalitarian, Rawlsian). We formulate fair decision-making as a post-hoc multi-objective optimization problem, characterizing the achievable performance-fairness trade-offs in the two-dimensional utility space of DM utility and the social planner's utility, under different decision policy classes (deterministic vs. stochastic, shared vs. group-specific). Using the proposed framework, we then identify conditions (in terms of the stakeholders' utilities) under which stochastic policies outperform deterministic ones, and empirically demonstrate that simple stochastic policies can yield superior performance-fairness trade-offs by leveraging outcome uncertainty. Overall, we advocate a shift from prediction-centric fairness to a transparent, justice-based, multi-stakeholder approach that supports the collaborative design of decision-making policies.


【2】BOAT: Navigating the Sea of In Silico Predictors for Antibody Design via Multi-Objective Bayesian Optimization
Link: https://arxiv.org/abs/2604.13980

Authors: Jackie Rao, Ferran Gonzalez Hernandez, Leon Gerard, Alexandra Gessner
Comments: Proceedings of the 29th International Conference on Artificial Intelligence and Statistics (AISTATS) 2026
Abstract: Antibody lead optimization is inherently a multi-objective challenge in drug discovery. Achieving a balance between different drug-like properties is crucial for the development of viable candidates, and this search becomes exponentially more challenging as the number of desired properties grows. The ever-growing zoo of sophisticated in silico tools for predicting antibody properties calls for an efficient joint optimization procedure to overcome resource-intensive sequential filtering pipelines. We present BOAT, a versatile Bayesian optimization framework for multi-property antibody engineering. Our "plug-and-play" framework couples uncertainty-aware surrogate modeling with a genetic algorithm to jointly optimize various predicted antibody traits while enabling efficient exploration of sequence space. Through systematic benchmarking against genetic algorithms and newer generative learning approaches, we demonstrate competitive performance with state-of-the-art methods for multi-objective protein optimization. We identify clear regimes where surrogate-driven optimization outperforms expensive generative approaches and establish practical limits imposed by sequence dimensionality and oracle costs.


【3】DiPO: Disentangled Perplexity Policy Optimization for Fine-grained Exploration-Exploitation Trade-Off
Link: https://arxiv.org/abs/2604.13902

Authors: Xiaofan Li, Ming Yang, Zhiyuan Ma, Shichao Ma, Jintao Du, Yu Cheng, Weiqiang Wang, Zhizhong Zhang, Xin Tan, Yanyun Qu, Lizhuang Ma, Yuan Xie
Comments: LLM Reinforce Learning
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has catalyzed significant advances in the reasoning capabilities of Large Language Models (LLMs). However, effectively managing the exploration-exploitation trade-off remains a critical challenge. In this paper, we analyze the exploration and exploitation dilemma posed by extremely hard and extremely easy samples during training and propose a new fine-grained trade-off mechanism. Concretely, we introduce a perplexity space disentangling strategy that divides the sample space into distinct exploration (high perplexity) and exploitation (low perplexity) subspaces, thereby mining fine-grained samples that require an exploration-exploitation trade-off. Subsequently, we propose a bidirectional reward allocation mechanism with minimal impact on verification rewards to implement perplexity-guided exploration and exploitation, enabling more stable policy optimization. Finally, we evaluate our method on two mainstream tasks, mathematical reasoning and function calling. Experimental results demonstrate the superiority of the proposed method and confirm that a fine-grained exploration-exploitation trade-off improves LLM performance.
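The perplexity-based split can be sketched as follows. The abstract does not say how the boundary between the two subspaces is chosen or how rewards are then allocated, so the fixed `threshold`, the sample names, and the token probabilities below are hypothetical. Only the perplexity formula itself is standard.

```python
import math


def perplexity(token_logprobs):
    """exp of the negative mean token log-probability of one sampled response."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def split_by_perplexity(samples, threshold):
    """Partition (name, token_logprobs) samples into (exploration, exploitation)
    name lists: high perplexity goes to exploration, low to exploitation."""
    explore, exploit = [], []
    for name, logprobs in samples:
        (explore if perplexity(logprobs) > threshold else exploit).append(name)
    return explore, exploit


# Two hypothetical responses: one generated confidently, one with low token probabilities.
samples = [
    ("confident", [math.log(0.9)] * 5),
    ("uncertain", [math.log(0.2)] * 5),
]
explore, exploit = split_by_perplexity(samples, threshold=2.0)
print(explore, exploit)
```

A response whose tokens each have probability 0.2 has perplexity 5 and lands in the exploration subspace, while the confident response stays in the exploitation subspace.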


【4】UI-Copilot: Advancing Long-Horizon GUI Automation via Tool-Integrated Policy Optimization
Link: https://arxiv.org/abs/2604.13822

Authors: Zhengxi Lu, Fei Tang, Guangyi Liu, Kaitao Song, Xu Tan, Jin Ma, Wenqi Zhang, Weiming Lu, Jun Xiao, Yueting Zhuang, Yongliang Shen
Abstract: MLLM-based GUI agents have demonstrated strong capabilities in complex user interface interaction tasks. However, long-horizon scenarios remain challenging, as these agents are burdened with tasks beyond their intrinsic capabilities, suffering from memory degradation, progress confusion, and math hallucination. To address these challenges, we present UI-Copilot, a collaborative framework where the GUI agent focuses on task execution while a lightweight copilot provides on-demand assistance for memory retrieval and numerical computation. We introduce memory decoupling to separate persistent observations from transient execution context, and train the policy agent to selectively invoke the copilot as Retriever or Calculator based on task demands. To enable effective tool invocation learning, we propose Tool-Integrated Policy Optimization (TIPO), which separately optimizes tool selection through single-turn prediction and task execution through on-policy multi-turn rollouts. Experimental results show that UI-Copilot-7B achieves state-of-the-art performance on the challenging MemGUI-Bench, outperforming strong 7B-scale GUI agents such as GUI-Owl-7B and UI-TARS-1.5-7B. Moreover, UI-Copilot-7B delivers a 17.1% absolute improvement on AndroidWorld over the base Qwen model, highlighting UI-Copilot's strong generalization to real-world GUI tasks.


【5】Optimization with SpotOptim
Link: https://arxiv.org/abs/2604.13672

Authors: Thomas Bartz-Beielstein
Abstract: The `spotoptim` package implements surrogate-model-based optimization of expensive black-box functions in Python. Building on two decades of Sequential Parameter Optimization (SPO) methodology, it provides a Kriging-based optimization loop with Expected Improvement, support for continuous, integer, and categorical variables, noise-aware evaluation via Optimal Computing Budget Allocation (OCBA), and multi-objective extensions. A steady-state parallelization strategy overlaps surrogate search with objective evaluation on multi-core hardware, and a success-rate-based restart mechanism detects stagnation while preserving the best solution found. The package returns scipy-compatible `OptimizeResult` objects and accepts any scikit-learn-compatible surrogate model. Built-in TensorBoard logging provides real-time monitoring of convergence and surrogate quality. This report describes the architecture and module structure of spotoptim, provides worked examples including neural network hyperparameter tuning, and compares the framework with BoTorch, Optuna, Ray Tune, BOHB, SMAC, and Hyperopt. The package is open-source.
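The Expected Improvement criterion mentioned in the abstract has a closed form under a Gaussian surrogate prediction. The sketch below implements the textbook formula for minimization using only the standard library. It does not use or reproduce the `spotoptim` API, and the function name and arguments are illustrative.

```python
import math


def expected_improvement(mu, sigma, f_best):
    """Expected Improvement for minimization at a candidate point whose
    surrogate prediction is Normal(mu, sigma^2), given the best observed
    objective value `f_best`."""
    if sigma <= 0.0:
        # No predictive uncertainty: the improvement is deterministic.
        return max(f_best - mu, 0.0)
    z = (f_best - mu) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return (f_best - mu) * cdf + sigma * pdf


# Two candidates predicted to be worse than the incumbent (mu > f_best):
# the more uncertain one still has a sizeable chance of improving on it.
print(expected_improvement(mu=3.0, sigma=0.5, f_best=2.0))
print(expected_improvement(mu=3.0, sigma=2.0, f_best=2.0))
```

The second term rewards predictive uncertainty, which is how the criterion balances exploiting low predicted values against exploring poorly sampled regions.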


【6】Self-Organizing Maps with Optimized Latent Positions
Link: https://arxiv.org/abs/2604.13622

Authors: Seiki Ubukata, Akira Notsu, Katsuhiro Honda
Comments: 8 pages, 4 figures. Accepted for publication in the 2026 International Joint Conference on Neural Networks (IJCNN 2026), part of the 2026 IEEE World Congress on Computational Intelligence (WCCI 2026). This version is the author's accepted manuscript.
Abstract: Self-Organizing Maps (SOM) are a classical method for unsupervised learning, vector quantization, and topographic mapping of high-dimensional data. However, existing SOM formulations often involve a trade-off between computational efficiency and a clearly defined optimization objective. Objective-based variants such as Soft Topographic Vector Quantization (STVQ) provide a principled formulation, but their neighborhood-coupled computations become expensive as the number of latent nodes increases. In this paper, we propose Self-Organizing Maps with Optimized Latent Positions (SOM-OLP), an objective-based topographic mapping method that introduces a continuous latent position for each data point. Starting from the neighborhood distortion of STVQ, we construct a separable surrogate local cost based on its local quadratic structure and formulate an entropy-regularized objective based on it. This yields a simple block coordinate descent scheme with closed-form updates for assignment probabilities, latent positions, and reference vectors, while guaranteeing monotonic non-increase of the objective and retaining linear per-iteration complexity in the numbers of data points and latent nodes. Experiments on a synthetic saddle manifold, scalability studies on the Digits and MNIST datasets, and 16 benchmark datasets show that SOM-OLP achieves competitive neighborhood preservation and quantization performance, favorable scalability for large numbers of latent nodes and large datasets, and the best average rank among the compared methods on the benchmark datasets.


【7】Minimax Optimality and Spectral Routing for Majority-Vote Ensembles under Markov Dependence
Link: https://arxiv.org/abs/2604.13414

Authors: Ibne Farabi Shihab, Sanjeda Akter, Anuj Sharma
Abstract: Majority-vote ensembles achieve variance reduction by averaging over diverse, approximately independent base learners. When training data exhibits Markov dependence, as in time-series forecasting, reinforcement learning (RL) replay buffers, and spatial grids, this classical guarantee degrades in ways that existing theory does not fully quantify. We provide a minimax characterization of this phenomenon for discrete classification in a fixed-dimensional Markov setting, together with an adaptive algorithm that matches the rate on a graph-regular subclass. We first establish an information-theoretic lower bound for stationary, reversible, geometrically ergodic chains in fixed ambient dimension, showing that no measurable estimator can achieve excess classification risk better than $\Omega(\sqrt{T_{\mathrm{mix}}/n})$. We then prove that, on the AR(1) witness subclass underlying the lower-bound construction, dependence-agnostic uniform bagging is provably suboptimal with excess risk bounded below by $\Omega(T_{\mathrm{mix}}/\sqrt{n})$, exhibiting a $\sqrt{T_{\mathrm{mix}}}$ algorithmic gap. Finally, we propose adaptive spectral routing, which partitions the training data via the empirical Fiedler eigenvector of a dependency graph and achieves the minimax rate $\mathcal{O}(\sqrt{T_{\mathrm{mix}}/n})$ up to a lower-order geometric cut term on a graph-regular subclass, without knowledge of $T_{\mathrm{mix}}$. Experiments on synthetic Markov chains, 2D spatial grids, the 128-dataset UCR archive, and Atari DQN ensembles validate the theoretical predictions. Consequences for deep RL target variance, scalability via Nyström approximation, and bounded non-stationarity are developed as supporting material in the appendix.
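
Worked arithmetic for the stated gap: the abstract gives a bagging lower bound and a minimax rate, and says they differ by a square-root factor. Dividing one by the other confirms this. Here $T_{\mathrm{mix}}$ stands for the abstract's mixing parameter (presumably the chain's mixing time) and $n$ for the sample size; constants are ignored.

```latex
\[
\frac{T_{\mathrm{mix}}/\sqrt{n}}{\sqrt{T_{\mathrm{mix}}/n}}
  = \frac{T_{\mathrm{mix}}}{\sqrt{n}} \cdot \frac{\sqrt{n}}{\sqrt{T_{\mathrm{mix}}}}
  = \sqrt{T_{\mathrm{mix}}}
\]
```

So for a chain with $T_{\mathrm{mix}} = 100$, the bagging lower bound sits a factor of about 10 above the minimax rate, up to constants.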


【8】Multistage Conditional Compositional Optimization
Link: https://arxiv.org/abs/2604.14075

Authors: Buse Şen, Yifan Hu, Daniel Kuhn
Abstract: We introduce Multistage Conditional Compositional Optimization (MCCO) as a new paradigm for decision-making under uncertainty that combines aspects of multistage stochastic programming and conditional stochastic optimization. MCCO minimizes a nest of conditional expectations and nonlinear cost functions. It has numerous applications and arises, for example, in optimal stopping, linear-quadratic regulator problems, distributionally robust contextual bandits, as well as in problems involving dynamic risk measures. The naïve nested sampling approach for MCCO suffers from the curse of dimensionality familiar from scenario tree-based multistage stochastic programming, that is, its scenario complexity grows exponentially with the number of nests. We develop new multilevel Monte Carlo techniques for MCCO whose scenario complexity grows only polynomially with the desired accuracy.


【9】Reachability Constraints in Variational Quantum Circuits: Optimization within Polynomial Group Module
Link: https://arxiv.org/abs/2604.13735

Authors: Yun-Tak Oh, Dongsoo Lee, Jungyoul Park, Kyung Chul Jeong, Panjin Kim
Comments: 27 pages, 4 figures, appendix
Abstract: This work identifies a necessary condition for any variational quantum approach to reach the exact ground state. Briefly, the norms of the projections of the input and the ground state onto each group module must match, implying that module weights of the solution state have to be known in advance in order to reach the exact ground state. An exemplary case is provided by matchgate circuits applied to problems whose solutions are classical bit strings, since all computational basis states share the same module-wise weights. Combined with the known classical simulability of quantum circuits for which observables lie in a small linear subspace, this implies that certain problems admit a classical surrogate for exact solution with each step taking $O(n^5)$ time. The Maximum Cut problem serves as an illustrative example.


【10】HUANet: Hard-Constrained Unrolled ADMM for Constrained Convex Optimization
Link: https://arxiv.org/abs/2604.13179

Authors: Trinh Tran, Binh Nguyen, Truong X. Nghiem
Abstract: This paper presents HUANet, a constrained deep neural network architecture that unrolls the iterations of the Alternating Direction Method of Multipliers (ADMM) into a trainable neural network for solving constrained convex optimization problems. Existing end-to-end learning methods operate as black-box mappings from parameters to solutions, often lacking explicit optimality principles and failing to enforce constraints. To address this limitation, we unroll ADMM and embed a hard-constrained neural network at each iteration to accelerate the algorithm, where equality constraints are enforced via a differentiable correction stage at the network output. Furthermore, we incorporate first-order optimality conditions as soft constraints during training to promote the convergence of the proposed unrolled algorithm. Extensive numerical experiments are conducted to validate the effectiveness of the proposed architecture for constrained optimization problems.
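
Illustration (plain ADMM, not HUANet itself): the abstract builds on two ingredients, ADMM iterations and a correction step that enforces equality constraints exactly. The sketch below shows both on a toy problem, projecting a point onto the set where A x = b and x >= 0. There is no learned component here. The function names, the toy data, and the choice of an affine projection as the correction are assumptions for the example, and A is assumed to have full row rank.

```python
import numpy as np


def project_affine(x, A, b):
    """Correction step: the closest point to x that satisfies A @ x = b."""
    return x - A.T @ np.linalg.solve(A @ A.T, A @ x - b)


def admm_projection(p, A, b, rho=1.0, n_iter=200):
    """Scaled-form ADMM for: min 0.5*||x - p||^2  s.t.  A x = b, x >= 0."""
    z = np.zeros_like(p)  # copy of x constrained to the nonnegative orthant
    u = np.zeros_like(p)  # scaled dual variable
    for _ in range(n_iter):
        # x-update: quadratic minimizer, then exact equality correction
        x = project_affine((p + rho * (z - u)) / (1.0 + rho), A, b)
        # z-update: projection onto x >= 0
        z = np.maximum(0.0, x + u)
        # dual update
        u = u + x - z
    return x


# Toy instance: project p onto the probability simplex {x >= 0, sum(x) = 1}
A = np.array([[1.0, 1.0, 1.0]])
b = np.array([1.0])
p = np.array([0.8, 0.6, -0.4])
x_star = admm_projection(p, A, b)
```

Because the x-update always ends with `project_affine`, every iterate satisfies the equality constraint up to floating-point error, which is the property a hard-constrained output stage is meant to guarantee.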


Prediction | Estimation (3 papers)

【1】Learning from Change: Predictive Models for Incident Prevention in a Regulated IT Environment
Link: https://arxiv.org/abs/2604.13462

Authors: Eileen Kapel, Jan Lennartz, Luis Cruz, Diomidis Spinellis, Arie van Deursen
Comments: 12 pages, 6 figures, 2026 IEEE/ACM 48th International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP)
Abstract: Effective IT change management is important for businesses that depend on software and services, particularly in highly regulated sectors such as finance, where operational reliability, auditability, and explainability are essential. A significant portion of IT incidents are caused by changes, making it important to identify high-risk changes before deployment. This study presents a predictive incident risk scoring approach at a large international bank. The approach supports engineers during the assessment and planning phases of change deployments by predicting the potential of inducing incidents. To satisfy regulatory constraints, we built the model with auditability and explainability in mind, applying SHAP values to provide feature-level insights and ensure decisions are traceable and transparent. Using a one-year real-world dataset, we compare the existing rule-based process with three machine learning models: HGBC, LightGBM, and XGBoost. LightGBM achieved the best performance, particularly when enriched with aggregated team metrics that capture organisational context. Our results show that data-driven, interpretable models can outperform rule-based approaches while meeting compliance needs, enabling proactive risk mitigation and more reliable IT operations.


【2】Asymmetric-Loss-Guided Hybrid CNN-BiLSTM-Attention Model for Industrial RUL Prediction with Interpretable Failure Heatmaps
Link: https://arxiv.org/abs/2604.13459

Authors: Mohammed Ezzaldin Babiker Abdullah
Abstract: Turbofan engine degradation under sustained operational stress necessitates robust prognostic systems capable of accurately estimating the Remaining Useful Life (RUL) of critical components. Existing deep learning approaches frequently fail to simultaneously capture multi-sensor spatial correlations and long-range temporal dependencies, while standard symmetric loss functions inadequately penalize the safety-critical error of over-estimating residual life. This study proposes a hybrid architecture integrating Twin-Stage One-Dimensional Convolutional Neural Networks (1D-CNN), a Bidirectional Long Short-Term Memory (BiLSTM) network, and a custom Bahdanau Additive Attention mechanism. The model was trained and evaluated on the NASA Commercial Modular Aero-Propulsion System Simulation (C-MAPSS) FD001 sub-dataset employing a zero-leakage preprocessing pipeline, piecewise-linear RUL labeling capped at 130 cycles, and the NASA-specified asymmetric exponential loss function that disproportionately penalizes over-estimation to enforce industrial safety constraints. Experiments on 100 test engines achieved a Root Mean Squared Error (RMSE) of 17.52 cycles and a NASA S-Score of 922.06. Furthermore, extracted attention weight heatmaps provide interpretable, per-engine insights into the temporal progression of degradation, supporting informed maintenance decision-making. The proposed framework demonstrates competitive performance against established baselines and offers a principled approach to safe, interpretable prognostics in industrial settings.
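
Illustration: the abstract mentions piecewise-linear RUL labels capped at 130 cycles and an asymmetric exponential score that punishes over-estimation more than under-estimation. The sketch below uses the scoring function commonly applied to C-MAPSS, with time constants 13 for early predictions and 10 for late ones. Whether the paper uses exactly this form as its training loss is an assumption, and the names `cap_rul` and `nasa_score` are made up for the example.

```python
import numpy as np


def cap_rul(cycles_to_failure, cap=130):
    """Piecewise-linear RUL label: constant at `cap` early on, then linear decay."""
    return np.minimum(np.asarray(cycles_to_failure, dtype=float), cap)


def nasa_score(rul_true, rul_pred):
    """Asymmetric exponential score; d > 0 means the remaining life was over-estimated."""
    d = np.asarray(rul_pred, dtype=float) - np.asarray(rul_true, dtype=float)
    penalty = np.where(d < 0, np.exp(-d / 13.0) - 1.0, np.exp(d / 10.0) - 1.0)
    return float(penalty.sum())


labels = cap_rul([200, 130, 80])     # the first engine is clipped to the cap
late = nasa_score([100], [110])      # over-estimate by 10 cycles
early = nasa_score([100], [90])      # under-estimate by 10 cycles
```

The same absolute error costs more when the prediction is late, which matches the safety argument in the abstract: promising more remaining life than an engine has is the dangerous direction.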


【3】Outperforming Self-Attention Mechanisms in Solar Irradiance Forecasting via Physics-Guided Neural Networks
Link: https://arxiv.org/abs/2604.13455

Authors: Mohammed Ezzaldin Babiker Abdullah, Rufaidah Abdallah Ibrahim Mohammed
Comments: This is a second version of a previously published paper. DOI: https://doi.org/10.36227/techrxiv.176827103.31624241/v1
Abstract: Accurate Global Horizontal Irradiance (GHI) forecasting is critical for grid stability, particularly in arid regions characterized by rapid aerosol fluctuations. While recent trends favor computationally expensive Transformer-based architectures, this paper challenges the prevailing "complexity-first" paradigm. We propose a lightweight, Physics-Informed Hybrid CNN-BiLSTM framework that prioritizes domain knowledge over architectural depth. The model integrates a Convolutional Neural Network (CNN) for spatial feature extraction with a Bi-Directional LSTM for capturing temporal dependencies. Unlike standard data-driven approaches, our model is explicitly guided by a vector of 15 engineered features, including Clear-Sky indices and Solar Zenith Angle, rather than relying solely on raw historical data. Hyperparameters are rigorously tuned using Bayesian Optimization to ensure global optimality. Experimental validation using NASA POWER data in Sudan demonstrates that our physics-guided approach achieves a Root Mean Square Error (RMSE) of 19.53 W/m^2, significantly outperforming complex attention-based baselines (RMSE 30.64 W/m^2). These results confirm a "Complexity Paradox": in high-noise meteorological tasks, explicit physical constraints offer a more efficient and accurate alternative to self-attention mechanisms. The findings advocate for a shift towards hybrid, physics-aware AI for real-time renewable energy management.


Other Neural Networks | Deep Learning | Models | Modeling (24 papers)

【1】Complex Interpolation of Matrices with an application to Multi-Manifold Learning
Link: https://arxiv.org/abs/2604.14118

Authors: Adi Arbel, Stefan Steinerberger, Ronen Talmon
Abstract: Given two symmetric positive-definite matrices $A, B \in \mathbb{R}^{n \times n}$, we study the spectral properties of the interpolation $A^{1-x} B^x$ for $0 \leq x \leq 1$. The presence of "common structures" in $A$ and $B$, eigenvectors pointing in a similar direction, can be investigated using this interpolation perspective. Generically, exact log-linearity of the operator norm $\|A^{1-x} B^x\|$ is equivalent to the existence of a shared eigenvector in the original matrices; stability bounds show that approximate log-linearity forces principal singular vectors to align with leading eigenvectors of both matrices. These results give rise to and provide theoretical justification for a multi-manifold learning framework that identifies common and distinct latent structures in multiview data.
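
Illustration: one direction of the log-linearity statement can be checked numerically. If two SPD matrices share their leading eigenvector, the log of the operator norm of A^{1-x} B^x should be a straight line in x. The matrices below are a hand-built toy pair that share the first coordinate axis as leading eigenvector but do not commute otherwise; the helper names `spd_power` and `log_norm_curve` are assumptions for this example, not the authors' code.

```python
import numpy as np


def spd_power(M, p):
    """Real power of a symmetric positive-definite matrix via eigendecomposition."""
    w, V = np.linalg.eigh(M)
    return (V * w**p) @ V.T


def log_norm_curve(A, B, xs):
    """log of the spectral norm of A^(1-x) B^x for each x in xs."""
    return np.array(
        [np.log(np.linalg.norm(spd_power(A, 1.0 - x) @ spd_power(B, x), 2)) for x in xs]
    )


# Shared leading eigenvector e_1 (eigenvalues 4 and 6); the lower blocks do not commute
A = np.array([[4.0, 0.0, 0.0], [0.0, 2.0, 0.5], [0.0, 0.5, 1.0]])
B = np.array([[6.0, 0.0, 0.0], [0.0, 1.0, -0.3], [0.0, -0.3, 2.0]])

xs = np.linspace(0.0, 1.0, 11)
curve = log_norm_curve(A, B, xs)
line = (1.0 - xs) * np.log(4.0) + xs * np.log(6.0)  # straight line between endpoints
max_deviation = float(np.max(np.abs(curve - line)))
```

Here the deviation from the straight line is at floating-point level, since the norm equals 4^(1-x) 6^x for every x. For a pair without shared structure, the same curve can be compared against its chord to see how far it bends.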


【2】How Can We Synthesize High-Quality Pretraining Data? A Systematic Study of Prompt Design, Generator Model, and Source Data
Link: https://arxiv.org/abs/2604.13977

Authors: Joel Niklaus, Atsuki Yamaguchi, Michal Štefánik, Guilherme Penedo, Hynek Kydlíček, Elie Bakouch, Lewis Tunstall, Edward Emanuel Beeching, Thibaud Frere, Colin Raffel, Leandro von Werra, Thomas Wolf
Abstract: Synthetic data is a standard component in training large language models, yet systematic comparisons across design dimensions, including rephrasing strategy, generator model, and source data, remain absent. We conduct extensive controlled experiments, generating over one trillion tokens, to identify critical factors in rephrasing web text into synthetic pretraining data. Our results reveal that structured output formats, such as tables, math problems, FAQs, and tutorials, consistently outperform both curated web baselines and prior synthetic methods. Notably, increasing the size of the generator model beyond 1B parameters provides no additional benefit. Our analysis also demonstrates that the selection of the original data used for mixing substantially influences performance. By applying our findings, we develop FinePhrase, a 486-billion-token open dataset of rephrased web text. We show that FinePhrase outperforms all existing synthetic data baselines while reducing generation costs by up to 30 times. We provide the dataset, all prompts, and the generation framework to the research community.


【3】Randomized Neural Networks for Integro-Differential Equations with Application to Neutron Transport
Link: https://arxiv.org/abs/2604.13830

Authors: Haoning Dang, Fei Wang, Yifan Chen, Zhouyu Liu, Dong Liu, Hongchun Wu
Abstract: Integro-differential equations arise in a wide range of applications, including transport, kinetic theory, radiative transfer, and multiphysics modeling, where nonlocal integral operators couple the solution across phase space. Such nonlocality often introduces dense coupling blocks in deterministic discretizations, leading to increased computational cost and memory usage, while physics-informed neural networks may suffer from expensive nonconvex training and sensitivity to hyperparameter choices. In this work, we present randomized neural networks (RaNNs) as a mesh-free collocation framework for linear integro-differential equations. Because the RaNN approximation is intrinsically dense through globally supported random features, the nonlocal integral operator does not introduce an additional loss of sparsity, while the approximate solution can still be represented with relatively few trainable degrees of freedom. By randomly fixing the hidden-layer parameters and solving only for the linear output weights, the training procedure reduces to a convex least-squares problem in the output coefficients, enabling stable and efficient optimization. As a representative application, we apply the proposed framework to the steady neutron transport equation, a high-dimensional linear integro-differential model featuring scattering integrals and diverse boundary conditions. Extensive numerical experiments demonstrate that, in the reported test settings, the RaNN approach achieves competitive accuracy while incurring substantially lower training cost than the selected neural and deterministic baselines, highlighting RaNNs as a robust and efficient alternative for the numerical simulation of nonlocal linear operators.
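
Illustration: the core recipe in the abstract is to fix random hidden-layer parameters and solve a linear least-squares problem for the output weights at collocation points. The sketch below applies that recipe to a small 1-D linear integro-differential problem chosen for this example (not taken from the paper): u'(x) + u(x) - ∫₀¹ u(s) ds = -(1 - e⁻¹) with u(0) = 1, whose exact solution is u(x) = e⁻ˣ. The feature ranges, the boundary weight of 10, and the name `rann_solve` are assumptions.

```python
import numpy as np


def rann_solve(n_features=60, n_colloc=100, seed=0):
    """Random tanh features + least squares for a toy integro-differential equation."""
    rng = np.random.default_rng(seed)
    w = rng.uniform(-3.0, 3.0, n_features)   # fixed random hidden weights
    b = rng.uniform(-3.0, 3.0, n_features)   # fixed random hidden biases

    x = np.linspace(0.0, 1.0, n_colloc)[:, None]
    phi = np.tanh(x * w + b)                 # features at collocation points
    dphi = w * (1.0 - phi**2)                # their derivatives in x

    # Gauss-Legendre quadrature on [0, 1] for the integral of each feature
    nodes, weights = np.polynomial.legendre.leggauss(20)
    s = 0.5 * (nodes + 1.0)
    integral = (0.5 * weights) @ np.tanh(s[:, None] * w + b)

    rows_eq = dphi + phi - integral          # operator applied to each feature
    row_bc = 10.0 * np.tanh(b)[None, :]      # weighted boundary condition u(0) = 1
    M = np.vstack([rows_eq, row_bc])
    rhs = np.concatenate([np.full(n_colloc, -(1.0 - np.exp(-1.0))), [10.0]])

    c, *_ = np.linalg.lstsq(M, rhs, rcond=None)   # only the output weights are solved for
    return lambda t: np.tanh(np.asarray(t, dtype=float)[..., None] * w + b) @ c


u = rann_solve()
t = np.linspace(0.0, 1.0, 11)
max_error = float(np.max(np.abs(u(t) - np.exp(-t))))
```

The integral term simply adds a dense row vector to every collocation row. Since the random-feature matrix is dense anyway, this costs nothing in sparsity, which is the point the abstract makes about nonlocal operators.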


【4】Beyond State Consistency: Behavior Consistency in Text-Based World Models
Link: https://arxiv.org/abs/2604.13824

Authors: Youling Huang, Guanqiao Chen, Junchi Yao, Lu Wang, Fangkai Yang, Chao Du, ChenZhuo Zhao, Pu Zhao, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang
Comments: 20 pages, 2 figures
Abstract: World models have been emerging as critical components for assessing the consequences of actions generated by interactive agents in online planning and offline evaluation. In text-based environments, world models are typically evaluated and trained with single-step metrics such as Exact Match, aiming to improve the similarity between predicted and real-world states, but such metrics have been shown to be insufficient for capturing actual agent behavior. To address this issue, we introduce a new behavior-aligned training paradigm aimed at improving the functional consistency between the world model and the real environment. This paradigm focuses on optimizing a tractable step-level metric named Behavior Consistency Reward (BehR), which measures how much the likelihood of a logged next action changes between the real state and the world-model-predicted state under a frozen Reference Agent. Experiments on WebShop and TextWorld show that BehR-based training improves long-term alignment in several settings, with the clearest gains in WebShop and less movement in near-ceiling regimes, while preserving or improving single-step prediction quality in three of four settings. World models trained with BehR also achieve lower false positives in offline surrogate evaluation and show modest but encouraging gains in inference-time lookahead planning.


【5】Online learning with noisy side observations
Link: https://arxiv.org/abs/2604.13740

Authors: Tomáš Kocák, Gergely Neu, Michal Valko
Comments: Published at International Conference on Artificial Intelligence and Statistics (AISTATS) 2016. 13 pages, 7 figures
Abstract: We propose a new partial-observability model for online learning problems where the learner, besides its own loss, also observes some noisy feedback about the other actions, depending on the underlying structure of the problem. We represent this structure by a weighted directed graph, where the edge weights are related to the quality of the feedback shared by the connected nodes. Our main contribution is an efficient algorithm that guarantees a regret of $\widetilde{O}(\sqrt{\alpha^* T})$ after $T$ rounds, where $\alpha^*$ is a novel graph property that we call the effective independence number. Our algorithm is completely parameter-free and does not require knowledge (or even estimation) of $\alpha^*$. For the special case of binary edge weights, our setting reduces to the partial-observability models of Mannor and Shamir (2011) and Alon et al. (2013) and our algorithm recovers the near-optimal regret bounds.


【6】Physics-Informed Neural Networks for Solving Derivative-Constrained PDEs
Link: https://arxiv.org/abs/2604.13723

Authors: Kentaro Hoshisashi, Carolyn E Phelan, Paolo Barucca
Comments: Phys. Rev. E - Accepted 14 April, 2026
Abstract: Physics-Informed Neural Networks (PINNs) recast PDE solving as an optimisation problem in function space by minimising a residual-based objective, yet many applications require additional derivative-based relations that are just as fundamental as the governing equations. In this paper, we present Derivative-Constrained PINNs (DC-PINNs), a general framework that treats constrained PDE solving as an optimisation guided by a minimum objective function criterion where the physics resides in the minimum principle. DC-PINNs embed general nonlinear constraints on states and derivatives, e.g., bounds, monotonicity, convexity, incompressibility, computed efficiently via automatic differentiation, and they employ self-adaptive loss balancing to tune the influence of each objective, reducing reliance on manual hyperparameters and problem-specific architectures. DC-PINNs consistently reduce constraint violations and improve physical fidelity versus baseline PINN variants and representative hard-constraint formulations on benchmarks, including heat diffusion with bounds, financial volatilities with arbitrage-free constraints, and fluid flow with vortex shedding. Explicitly encoding derivative constraints stabilises training and steers optimisation toward physically admissible minima even when the PDE residual alone is small, providing reliable solutions of constrained PDEs grounded in energy minimum principles.


【7】(How) Learning Rates Regulate Catastrophic Overtraining
Link: https://arxiv.org/abs/2604.13627

Authors: Mark Rofin, Aditya Varre, Nicolas Flammarion
Abstract: Supervised fine-tuning (SFT) is a common first stage of LLM post-training, teaching the model to follow instructions and shaping its behavior as a helpful assistant. At the same time, SFT may harm the fundamental capabilities of an LLM, particularly after long pretraining: a phenomenon known as catastrophic overtraining (Springer et al., 2025). To understand overtraining, we first investigate catastrophic forgetting in finetuning through the lens of implicit regularization of the learning rate. For models trained to the same SFT loss, we identify how the learning rate mediates optimization: finetuning with large and small steps converges to qualitatively different models. Next, we link forgetting to overtraining: learning rate decay increases the sharpness of the pretrained model, which in turn exacerbates catastrophic forgetting during SFT, leading to overtraining. Our findings paint a picture of the overtraining mechanism in LLMs and broadly contribute to the understanding of the interplay between optimization dynamics during pretraining and finetuning.


【8】C2: Scalable Rubric-Augmented Reward Modeling from Binary Preferences
Link: https://arxiv.org/abs/2604.13618

Authors: Akira Kawabata, Saku Sugawara
Comments: ACL 2026
Abstract: Rubric-augmented verification guides reward models with explicit evaluation criteria, yielding more reliable judgments than single-model verification. However, most existing methods require costly rubric annotations, limiting scalability. Moreover, we find that rubric generation is vulnerable to a failure of cooperation; low-quality rubrics actively mislead reward models rather than help. Inspired by the principle of cooperative communication, we propose Cooperative yet Critical reward modeling (C2), a framework that significantly improves reward model judgments by having the reward model critically collaborate with a rubric generator trained solely from binary preferences. In C2, we synthesize helpful and misleading rubric pairs by measuring how each rubric shifts the reward model toward or away from the correct preference. Using these contrastive pairs, we train a cooperative rubric generator to propose helpful rubrics, and a critical verifier to assess rubric validity before making its judgment, following only rubrics it deems helpful at inference time. C2 outperforms reasoning reward models trained on the same binary preferences, with gains of up to 6.5 points on RM-Bench and 6.0 points length-controlled win rate on AlpacaEval 2.0. Without external rubric annotations, C2 enables an 8B reward model to match performance achieved with rubrics from a 4$\times$ larger model. Overall, our work demonstrates that eliciting deliberate cooperation in rubric-augmented verification makes reward models more trustworthy in a scalable way.


【9】Reward Hacking in the Era of Large Models: Mechanisms, Emergent Misalignment, Challenges
Link: https://arxiv.org/abs/2604.13602

Authors: Xiaohua Wang, Muzhao Tian, Yuqi Zeng, Zisu Huang, Jiakang Yuan, Bowen Chen, Jingwen Xu, Mingbo Zhou, Wenhao Liu, Muling Wu, Zhengkang Guo, Qi Qian, Yifei Wang, Feiran Zhang, Ruicheng Yin, Shihan Dou, Changze Lv, Tao Chen, Kaitao Song, Xu Tan, Tao Gui, Xiaoqing Zheng, Xuanjing Huang
Comments: 42 pages, 5 figures, 2 tables
Abstract: Reinforcement Learning from Human Feedback (RLHF) and related alignment paradigms have become central to steering large language models (LLMs) and multimodal large language models (MLLMs) toward human-preferred behaviors. However, these approaches introduce a systemic vulnerability: reward hacking, where models exploit imperfections in learned reward signals to maximize proxy objectives without fulfilling true task intent. As models scale and optimization intensifies, such exploitation manifests as verbosity bias, sycophancy, hallucinated justification, benchmark overfitting, and, in multimodal settings, perception-reasoning decoupling and evaluator manipulation. Recent evidence further suggests that seemingly benign shortcut behaviors can generalize into broader forms of misalignment, including deception and strategic gaming of oversight mechanisms. In this survey, we propose the Proxy Compression Hypothesis (PCH) as a unifying framework for understanding reward hacking. We formalize reward hacking as an emergent consequence of optimizing expressive policies against compressed reward representations of high-dimensional human objectives. Under this view, reward hacking arises from the interaction of objective compression, optimization amplification, and evaluator-policy co-adaptation. This perspective unifies empirical phenomena across RLHF, RLAIF, and RLVR regimes, and explains how local shortcut learning can generalize into broader forms of misalignment, including deception and strategic manipulation of oversight mechanisms. We further organize detection and mitigation strategies according to how they intervene on compression, amplification, or co-adaptation dynamics. By framing reward hacking as a structural instability of proxy-based alignment under scale, we highlight open challenges in scalable oversight, multimodal grounding, and agentic autonomy.


【10】Parameter-efficient Quantum Multi-task Learning
Link: https://arxiv.org/abs/2604.13560

Authors: Hevish Cowlessur, Chandra Thapa, Tansu Alpcan, Seyit Camtepe
Abstract: Multi-task learning (MTL) improves generalization and data efficiency by jointly learning related tasks through shared representations. In the widely used hard-parameter-sharing setting, a shared backbone is combined with task-specific prediction heads. However, task-specific parameters can grow rapidly with the number of tasks. Therefore, designing multi-task heads that preserve task specialization while improving parameter efficiency remains a key challenge. In Quantum Machine Learning (QML), variational quantum circuits (VQCs) provide a compact mechanism for mapping classical data to quantum states residing in high-dimensional Hilbert spaces, enabling expressive representations within constrained parameter budgets. We propose a parameter-efficient quantum multi-task learning (QMTL) framework that replaces conventional task-specific linear heads with a fully quantum prediction head in a hybrid architecture. The model consists of a VQC with a shared, task-independent quantum encoding stage, followed by lightweight task-specific ansatz blocks enabling localized task adaptation while maintaining compact parameterization. Under a controlled and capacity-matched formulation where the shared representation dimension grows with the number of tasks, our parameter-scaling analysis demonstrates that a standard classical head exhibits quadratic growth, whereas the proposed quantum head parameter cost scales linearly. We evaluate QMTL on three multi-task benchmarks spanning natural language processing, medical imaging, and multimodal sarcasm detection, where we achieve performance comparable to, and in some cases exceeding, classical hard-parameter-sharing baselines while consistently outperforming existing hybrid quantum MTL models with substantially fewer head parameters. We further demonstrate QMTL's executability on noisy simulators and real quantum hardware, illustrating its feasibility.


【11】Monthly Diffusion v0.9: A Latent Diffusion Model for the First AI-MIP
Link: https://arxiv.org/abs/2604.13481

Authors: Kyle J. C. Hall, Maria J. Molina
Abstract: Here, we describe Monthly Diffusion at 1.5-degree grid spacing (MD-1.5 version 0.9), a climate emulator that leverages a spherical Fourier neural operator (SFNO)-inspired Conditional Variational Auto-Encoder (CVAE) architecture to model the evolution of low-frequency internal atmospheric variability using latent diffusion. MDv0.9 was designed to forward-step at monthly mean timesteps in a data-sparse regime, using modest computational requirements. This work describes the motivation behind the architecture design, the MDv0.9 training procedure, and initial results.


【12】From Order to Distribution: A Spectral Characterization of Forgetting in Continual Learning
Link: https://arxiv.org/abs/2604.13460

Authors: Zonghuan Xu, Xingjun Ma
Abstract: A central challenge in continual learning is forgetting, the loss of performance on previously learned tasks induced by sequential adaptation to new ones. While forgetting has been extensively studied empirically, rigorous theoretical characterizations remain limited. A notable step in this direction is Evron et al. (2022), which analyzes forgetting under random orderings of a fixed task collection in overparameterized linear regression. We shift the perspective from order to distribution. Rather than asking how a fixed task collection behaves under random orderings, we study an exact-fit linear regime in which tasks are sampled i.i.d. from a task distribution $\Pi$, and ask how the generating distribution itself governs forgetting. In this setting, we derive an exact operator identity for the forgetting quantity, revealing a recursive spectral structure. Building on this identity, we establish an unconditional upper bound, identify the leading asymptotic term, and, in generic nondegenerate cases, characterize the convergence rate up to constants. We further relate this rate to geometric properties of the task distribution, clarifying what drives slow or fast forgetting in this model.


【13】Linear Probe Accuracy Scales with Model Size and Benefits from Multi-Layer Ensembling
标题:线性探针准确率随模型规模提升,并受益于多层集成
链接:https://arxiv.org/abs/2604.13386

作者:Erik Nordby,Tasha Pais,Aviel Parrack
摘要:Linear probes can detect when language models produce outputs they "know" are wrong, a capability relevant to both deception and reward hacking. However, single-layer probes are fragile: the best layer varies across models and tasks, and probes fail entirely on some deception types. We show that combining probes from multiple layers into an ensemble recovers strong performance even where single-layer probes fail, improving AUROC by +29% on Insider Trading and +78% on Harm-Pressure Knowledge. Across 12 models (0.5B--176B parameters), we find probe accuracy improves with scale: ~5% AUROC per 10x parameters (R=0.81). Geometrically, deception directions rotate gradually across layers rather than appearing at one location, explaining both why single-layer probes are brittle and why multi-layer ensembles succeed.


【14】Structure- and Stability-Preserving Learning of Port-Hamiltonian Systems
标题:端口哈密顿系统的结构与稳定性保持学习
链接:https://arxiv.org/abs/2604.13297

作者:Binh Nguyen,Nam T. Nguyen,Truong X. Nghiem
摘要:This paper investigates the problem of data-driven modeling of port-Hamiltonian systems while preserving their intrinsic Hamiltonian structure and stability properties. We propose a novel neural-network-based port-Hamiltonian modeling technique that relaxes the convexity constraint commonly imposed by neural network-based Hamiltonian approximations, thereby improving the expressiveness and generalization capability of the model. By removing this restriction, the proposed approach enables the use of more general non-convex Hamiltonian representations to enhance modeling flexibility and accuracy. Furthermore, the proposed method incorporates information about stable equilibria into the learning process, allowing the learned model to preserve the stability of multiple isolated equilibria rather than being restricted to a single equilibrium as in conventional methods. Two numerical experiments are conducted to validate the effectiveness of the proposed approach and demonstrate its ability to achieve more accurate structure- and stability-preserving learning of port-Hamiltonian systems compared with a baseline method.


【15】Better and Worse with Scale: How Contextual Entrainment Diverges with Model Size
标题:随着规模的变化而变得更好和更坏:情境卷入如何随着模型尺寸而变化
链接:https://arxiv.org/abs/2604.13275

作者:Dikshant Kukreja,Kshitij Sah,Gautam Gupta,Avinash Anand,Rajiv Ratn Shah,Zhengkui Wang,Aik Beng Ng,Erik Cambria
备注:16 pages, 11 figures, 6 tables. Accepted to Findings of ACL 2026
摘要:Larger language models become simultaneously better and worse at handling contextual information -- better at ignoring false claims, worse at ignoring irrelevant tokens. We formalize this apparent paradox through the first scaling laws for contextual entrainment, the tendency of models to favor tokens that appeared in context regardless of relevance. Analyzing the Cerebras-GPT (111M-13B) and Pythia (410M-12B) model families, we find entrainment follows predictable power-law scaling, but with opposite trends depending on context type: semantic contexts show decreasing entrainment with scale, while non-semantic contexts show increasing entrainment. Concretely, the largest models are four times more resistant to counterfactual misinformation than the smallest, yet simultaneously twice as prone to copying arbitrary tokens. These diverging trends, which replicate across model families, suggest that semantic filtering and mechanical copying are functionally distinct behaviors that scale in opposition -- scaling alone does not resolve context sensitivity, it reshapes it.
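The power-law scaling described in this abstract can be checked with an ordinary least-squares fit in log-log space. The sketch below is a minimal illustration and not the authors' code: the function name `fit_power_law` and the synthetic data points are assumptions made for the example, and no values from the paper are used.

```python
import math

def fit_power_law(sizes, values):
    """Fit values ~ a * sizes**b by least squares on (log size, log value)."""
    if len(sizes) != len(values) or len(sizes) < 2:
        raise ValueError("need at least two paired observations")
    xs = [math.log(s) for s in sizes]
    ys = [math.log(v) for v in values]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    b = sxy / sxx                      # exponent: slope in log-log space
    a = math.exp(y_mean - b * x_mean)  # prefactor: exp of the intercept
    return a, b

# Synthetic data (hypothetical, not from the paper): a quantity that
# falls off as size**-0.5 with prefactor 3.0.
model_sizes = [1e8, 1e9, 1e10]
entrainment = [3.0 * n ** -0.5 for n in model_sizes]
a, b = fit_power_law(model_sizes, entrainment)
```

A negative fitted exponent `b` corresponds to entrainment that decreases with scale, and a positive one to entrainment that increases with scale.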


【16】A High-Resolution Landscape Dataset for Concept-Based XAI With Application to Species Distribution Models
标题:基于概念的XAI高分辨率景观数据集及其应用于物种分布模型
链接:https://arxiv.org/abs/2604.13240

作者:Augustin de la Brosse,Damien Garreau,Thomas Houet,Thomas Corpetti
摘要:Mapping the spatial distribution of species is essential for conservation policy and invasive species management. Species distribution models (SDMs) are the primary tools for this task, serving two purposes: achieving robust predictive performance while providing ecological insights into the driving factors of distribution. However, the increasing complexity of deep learning SDMs has made extracting these insights more challenging. To reconcile these objectives, we propose the first implementation of concept-based Explainable AI (XAI) for SDMs. We leverage the Robust TCAV (Testing with Concept Activation Vectors) methodology to quantify the influence of landscape concepts on model predictions. To enable this, we provide a new open-access landscape concept dataset derived from high-resolution multispectral and LiDAR drone imagery. It includes 653 patches across 15 distinct landscape concepts and 1,450 random reference patches, designed to suit a wide range of species. We demonstrate this approach through a case study of two aquatic insects, Plecoptera and Trichoptera, using two Convolutional Neural Networks and one Vision Transformer. Results show that concept-based XAI helps validate SDMs against expert knowledge while uncovering novel associations that generate new ecological hypotheses. Robust TCAV also provides landscape-level information, useful for policy-making and land management. Code and datasets are publicly available.


【17】Learning Probabilistic Responsibility Allocations for Multi-Agent Interactions
标题:多智能体交互的学习概率责任分配
链接:https://arxiv.org/abs/2604.13128

作者:Isaac Remy,Caleb Chang,Karen Leung
摘要:Human behavior in interactive settings is shaped not only by individual objectives but also by shared constraints with others, such as safety. Understanding how people allocate responsibility, i.e., how much one deviates from their desired policy to accommodate others, can inform the design of socially compliant and trustworthy autonomous systems. In this work, we introduce a method for learning a probabilistic responsibility allocation model that captures the multimodal uncertainty inherent in multi-agent interactions. Specifically, our approach leverages the latent space of a conditional variational autoencoder, combined with techniques from multi-agent trajectory forecasting, to learn a distribution over responsibility allocations conditioned on scene and agent context. Although ground-truth responsibility labels are unavailable, the model remains tractable by incorporating a differentiable optimization layer that maps responsibility allocations to induced controls, which are available. We evaluate our method on the INTERACTION driving dataset and demonstrate that it not only achieves strong predictive performance but also provides interpretable insights, through the lens of responsibility, into patterns of multi-agent interaction.


【18】Design Conditions for Intra-Group Learning of Sequence-Level Rewards: Token Gradient Cancellation
标题:序列级奖励组内学习的设计条件:令牌梯度取消
链接:https://arxiv.org/abs/2604.13088

作者:Fei Ding,Yongkang Zhang,Youwei Wang,Zijian Zeng
摘要:In sparse termination rewards, intra-group comparisons have become the dominant paradigm for fine-tuning reasoning models via reinforcement learning. However, long-term training often leads to issues like ineffective update accumulation (learning tax), solution probability drift, and entropy collapse. This paper presents a necessary condition for algorithm design from a token-level credit assignment perspective: to prevent reward-irrelevant drift, intra-group objectives must maintain gradient exchangeability across token updates, enabling gradient cancellation on weak-credit/high-frequency tokens. We show that two common mechanisms disrupting exchangeability make "non-cancellation" a structural norm. Based on this, we propose minimal intra-group transformations to restore or approximate the cancellation structure in the shared token space. Experimental results demonstrate that these transformations stabilize training, improve sample efficiency, and enhance final performance, validating the value of this design condition.


【19】Sparse Goodness: How Selective Measurement Transforms Forward-Forward Learning
标题:稀疏善良:选择性测量如何改变前向学习
链接:https://arxiv.org/abs/2604.13081

作者:Kamer Ali Yuksel,Hassan Sawaf
摘要:The Forward-Forward (FF) algorithm is a biologically plausible alternative to backpropagation that trains neural networks layer by layer using a local goodness function to distinguish positive from negative data. Since its introduction, sum-of-squares (SoS) has served as the default goodness function. In this work, we systematically study the design space of goodness functions, investigating both which activations to measure and how to aggregate them. We introduce top-k goodness, which evaluates only the k most active neurons, and show that it substantially outperforms SoS, improving Fashion-MNIST accuracy by 22.6 percentage points. We further introduce entmax-weighted energy, which replaces hard top-k selection with a learnable sparse weighting based on the alpha-entmax transformation, yielding additional gains. Orthogonally, we adopt separate label feature forwarding (FFCL), in which class hypotheses are injected at every layer through a dedicated projection rather than concatenated only at the input. Combining these ideas, we achieve 87.1 percent accuracy on Fashion-MNIST with a 4x2000 architecture, representing a 30.7 percentage point improvement over the SoS baseline while changing only the goodness function and the label pathway. Across controlled experiments covering 11 goodness functions, two architectures, and a sparsity spectrum analysis over both k and alpha, we identify a consistent principle: sparsity in the goodness function is the most important design choice in FF networks. In particular, adaptive sparsity with alpha approximately 1.5 outperforms both fully dense and fully sparse alternatives.
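The difference between the default sum-of-squares goodness and a top-k goodness can be sketched in a few lines. This is an illustration under an assumption: here top-k goodness is taken to be the sum of squares restricted to the k largest squared activations, and the paper may normalize or aggregate differently. The function names are hypothetical.

```python
def sos_goodness(activations):
    """Sum-of-squares goodness: every unit in the layer contributes."""
    return sum(a * a for a in activations)

def topk_goodness(activations, k):
    """Assumed top-k variant: sum of squares over the k most active units only."""
    if not 1 <= k <= len(activations):
        raise ValueError("k must be between 1 and the layer width")
    squared = sorted((a * a for a in activations), reverse=True)
    return sum(squared[:k])

# Hypothetical layer activations (e.g. post-ReLU values).
layer = [0.1, 2.0, 0.0, 1.0]
dense_score = sos_goodness(layer)       # uses all four units
sparse_score = topk_goodness(layer, 2)  # uses only the two most active units
```

With k equal to the layer width the two measures coincide, so k acts as a sparsity knob on the goodness function.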


【20】Stochastic Trust-Region Methods for Over-parameterized Models
标题:过度参数化模型的随机信任域方法
链接:https://arxiv.org/abs/2604.14017

作者:Aike Yang,Hao Wang
备注:26 pages, 3 figures
摘要:Under interpolation-type assumptions such as the strong growth condition, stochastic optimization methods can attain convergence rates comparable to full-batch methods, but their performance, particularly for SGD, remains highly sensitive to step-size selection. To address this issue, we propose a unified stochastic trust-region framework that eliminates manual step-size tuning and extends naturally to equality-constrained problems. For unconstrained optimization, we develop a first-order stochastic trust-region algorithm and show that, under the strong growth condition, it achieves an iteration and stochastic first-order oracle complexity of $O(\varepsilon^{-2} \log(1/\varepsilon))$ for finding an $\varepsilon$-stationary point. For equality-constrained problems, we introduce a quadratic-penalty-based stochastic trust-region method with penalty parameter $\mu$, and establish an iteration and oracle complexity of $O(\varepsilon^{-4} \log(1/\varepsilon))$ to reach an $\varepsilon$-stationary point of the penalized problem, corresponding to an $O(\varepsilon)$-approximate KKT point of the original constrained problem. Numerical experiments on deep neural network training and orthogonally constrained subspace fitting demonstrate that the proposed methods achieve performance comparable to well-tuned stochastic baselines, while exhibiting stable optimization behavior and effectively handling hard constraints without manual learning-rate scheduling.
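To show how a trust-region radius can stand in for a hand-tuned step size, here is a generic first-order trust-region loop. It is a deterministic sketch with exact gradients, not the stochastic algorithm analyzed in the paper; the function name and the parameters `eta`, `grow`, and `shrink` are assumptions chosen for illustration.

```python
import math

def trust_region_descent(f, grad, x, radius=1.0, eta=0.1,
                         grow=2.0, shrink=0.5, iters=50):
    """First-order trust-region loop using a linear model of f."""
    for _ in range(iters):
        g = grad(x)
        g_norm = math.sqrt(sum(gi * gi for gi in g))
        if g_norm == 0.0:
            break  # exactly stationary: nothing left to do
        # The linear model is minimized on the ball by a step of
        # length `radius` along the negative gradient.
        step = [-radius * gi / g_norm for gi in g]
        candidate = [xi + si for xi, si in zip(x, step)]
        predicted = radius * g_norm        # decrease promised by the model
        actual = f(x) - f(candidate)       # decrease actually obtained
        if actual / predicted >= eta:
            x = candidate                  # accept the step, be bolder next time
            radius *= grow
        else:
            radius *= shrink               # reject the step, be more cautious
    return x, radius

# Hypothetical test problem: a simple quadratic bowl.
def bowl(x):
    return x[0] ** 2 + x[1] ** 2

def bowl_grad(x):
    return [2.0 * x[0], 2.0 * x[1]]

solution, final_radius = trust_region_descent(bowl, bowl_grad, [3.0, 4.0])
```

The radius adapts from the ratio of actual to predicted decrease, so no learning-rate schedule is supplied by the caller.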


【21】Nested Fourier-enhanced neural operator for efficient modeling of radiation transfer in fires
标题:用于火灾辐射传输高效建模的嵌套傅里叶增强神经算子
链接:https://arxiv.org/abs/2604.13919

作者:Anran Jiao,Wengyao Jiang,Xiaoyi Lu,Yi Wang,Lu Lu
摘要:Computational fluid dynamics (CFD) has become an essential tool for predicting fire behavior, yet maintaining both efficiency and accuracy remains challenging. A major source of computational cost in fire simulations is the modeling of radiation transfer, which is usually the dominant heat transfer mechanism in fires. Solving the high-dimensional radiative transfer equation (RTE) with traditional numerical methods can be a performance bottleneck. Here, we present a machine learning framework based on Fourier-enhanced multiple-input neural operators (Fourier-MIONet) as an efficient alternative to direct numerical integration of the RTE. We first investigate the performance of neural operator architectures for a small-scale 2D pool fire and find that Fourier-MIONet provides the most accurate radiative solution predictions. The approach is then extended to 3D CFD fire simulations, where the computational mesh is locally refined across multiple levels. In these high-resolution settings, monolithic surrogate models for direct field-to-field mapping become difficult to train and computationally inefficient. To address this issue, a nested Fourier-MIONet is proposed to predict radiation solutions across multiple mesh-refinement levels. We validate the approach on 3D McCaffrey pool fires simulated with FireFOAM, including fixed fire sizes and a unified model trained over a continuous range of heat release rates (HRRs). The proposed method achieves global relative errors of 2-4% for 3D varying-HRR scenarios while providing faster inference than the estimated cost of one finite-volume radiation solve in FireFOAM for the 16-solid-angle case. With fast and accurate inference, the surrogate makes higher-fidelity radiation treatments practical and enables the incorporation of more spectrally resolved radiation models into CFD fire simulations for engineering applications.


【22】Automatic Charge State Tuning of 300 mm FDSOI Quantum Dots Using Neural Network Segmentation of Charge Stability Diagram
标题:利用电荷稳定性图的神经网络分割自动调整300 mm FDSOI量子点的电荷状态
链接:https://arxiv.org/abs/2604.13662

作者:Peter Samaha,Amine Torki,Ysaline Renaud,Sam Fiette,Emmanuel Chanrion,Pierre-Andre Mortemousque,Yann Beilliard
备注:10 pages, 6 figures, supplementary materials available
摘要:Tuning of gate-defined semiconductor quantum dots (QDs) is a major bottleneck for scaling spin qubit technologies. We present a deep learning (DL) driven, semantic-segmentation pipeline that performs charge auto-tuning by locating transition lines in full charge stability diagrams (CSDs) and returns gate voltage targets for the single charge regime. We assemble and manually annotate a large, heterogeneous dataset of 1015 experimental CSDs measured from silicon QD devices, spanning nine design geometries, multiple wafers, and fabrication runs. A U-Net style convolutional neural network (CNN) with a MobileNetV2 encoder is trained and validated through five-fold group cross validation. Our model achieves an overall offline tuning success of 80.0% in locating the single-charge regime, with peak performance exceeding 88% for some designs. We analyze dominant failure modes and propose targeted mitigations. Finally, wide-range diagram segmentation also naturally enables scalable physics-based feature extraction that can feed back to fabrication and design workflows, and we outline a roadmap for real-time integration in a cryogenic wafer prober. Overall, our results show that neural network (NN) based wide-diagram segmentation is a practical step toward automated, high-throughput charge tuning for silicon QD qubits.


【23】Data-driven Learning of Probabilistic Model of Binary Droplet Collision for Spray Simulation
标题:喷雾模拟中二元液滴碰撞概率模型的数据驱动学习
链接:https://arxiv.org/abs/2604.13594

作者:Weiming Xu,Tao Yang,Peng Zhang
备注:28 pages, 11 figures, research paper
摘要:Binary droplet collisions are ubiquitous in dense sprays. Traditional deterministic models cannot adequately represent transitional and stochastic behaviors of binary droplet collision. To bridge this gap, we developed a probabilistic model by using a machine learning approach, the Light Gradient-Boosting Machine (LightGBM). The model was trained on a comprehensive dataset of 33,540 experimental cases covering eight collision regimes across broad ranges of Weber number, Ohnesorge number, impact parameter, size ratio, and ambient pressure. The resulting machine learning classifier captures highly nonlinear regime boundaries with 99.2% accuracy and retains sensitivity in transitional regions. To facilitate its implementation in spray simulation, the model was translated into a probabilistic form, a multinomial logistic regression, which preserves 93.2% accuracy and maps continuous inter-regime transitions. A biased-dice sampling mechanism then converts these probabilities into definite yet stochastic outcomes. This work presents the first probabilistic, high-dimensional droplet collision model derived from experimental data, offering a physically consistent, comprehensive, and user-friendly solution for spray simulation.
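The "biased-dice" step, turning regime probabilities into one definite outcome per collision, amounts to sampling from a categorical distribution. The sketch below uses inverse-CDF sampling; the function name and the regime labels are placeholders, since the abstract does not name the eight regimes or describe the implementation.

```python
import random

def biased_dice(probabilities, rng):
    """Draw one outcome label with probability proportional to its weight."""
    total = sum(probabilities.values())
    if total <= 0:
        raise ValueError("probabilities must have a positive sum")
    u = rng.random() * total       # uniform draw on [0, total)
    cumulative = 0.0
    for label, p in probabilities.items():
        cumulative += p
        if u < cumulative:
            return label
    return label  # guard against floating-point round-off at the upper end

# Hypothetical classifier output for one collision (labels are placeholders).
regime_probs = {"regime_A": 0.7, "regime_B": 0.2, "regime_C": 0.1}
rng = random.Random(0)             # seeded so the simulation is repeatable
outcome = biased_dice(regime_probs, rng)
```

Repeated draws reproduce the predicted probabilities on average, while each individual collision still receives a single definite regime.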


【24】Identifiability of Potentially Degenerate Gaussian Mixture Models With Piecewise Affine Mixing
标题:具有分段仿射混合的潜在退化高斯混合模型的可识别性
链接:https://arxiv.org/abs/2604.13218

作者:Danru Xu,Sébastien Lachapelle,Sara Magliacane
备注:49 pages, 10 figures, AISTATS 2026
摘要:Causal representation learning (CRL) aims to identify the underlying latent variables from high-dimensional observations, even when variables are dependent on each other. We study this problem for latent variables that follow a potentially degenerate Gaussian mixture distribution and that are only observed through the transformation via a piecewise affine mixing function. We provide a series of progressively stronger identifiability results for this challenging setting in which the probability density functions are ill-defined because of the potential degeneracy. For identifiability up to permutation and scaling, we leverage a sparsity regularization on the learned representation. Based on our theoretical results, we propose a two-stage method to estimate the latent variables by enforcing sparsity and Gaussianity in the learned representations. Experiments on synthetic and image data highlight our method's effectiveness in recovering the ground-truth latent variables.


其他(44篇)

【1】Momentum Further Constrains Sharpness at the Edge of Stochastic Stability
标题:动量进一步抑制随机稳定边缘的尖锐性
链接:https://arxiv.org/abs/2604.14108

作者:Arseniy Andreyev,Advikar Ananthkumar,Marc Walden,Tomaso Poggio,Pierfrancesco Beneventano
备注:40 pages, 38 figures
摘要:Recent work suggests that (stochastic) gradient descent self-organizes near an instability boundary, shaping both optimization and the solutions found. Momentum and mini-batch gradients are widely used in practical deep learning optimization, but it remains unclear whether they operate in a comparable regime of instability. We demonstrate that SGD with momentum exhibits an Edge of Stochastic Stability (EoSS)-like regime with batch-size-dependent behavior that cannot be explained by a single momentum-adjusted stability threshold. Batch Sharpness (the expected directional mini-batch curvature) stabilizes in two distinct regimes: at small batch sizes it converges to a lower plateau $2(1-\beta)/\eta$, reflecting amplification of stochastic fluctuations by momentum and favoring flatter regions than vanilla SGD; at large batch sizes it converges to a higher plateau $2(1+\beta)/\eta$, where momentum recovers its classical stabilizing effect and favors sharper regions consistent with full-batch dynamics. We further show that this aligns with linear stability thresholds and discuss the implications for hyperparameter tuning and coupling.
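The two plateau formulas in this abstract are easy to evaluate for concrete hyperparameters. The sketch below assumes β is the momentum coefficient and η the learning rate (the abstract does not define the symbols explicitly); the function name and the example values are illustrative only.

```python
def batch_sharpness_plateaus(learning_rate, momentum):
    """Return the (small-batch, large-batch) plateaus 2(1-β)/η and 2(1+β)/η."""
    if learning_rate <= 0 or not 0 <= momentum < 1:
        raise ValueError("need learning_rate > 0 and 0 <= momentum < 1")
    small_batch = 2.0 * (1.0 - momentum) / learning_rate
    large_batch = 2.0 * (1.0 + momentum) / learning_rate
    return small_batch, large_batch

# Hypothetical settings: η = 0.01, β = 0.9.
low, high = batch_sharpness_plateaus(learning_rate=0.01, momentum=0.9)
```

For these values the plateaus are about 20 and 380, a ratio of (1+β)/(1-β) = 19. With β = 0 both expressions reduce to 2/η, so the gap between the two regimes is entirely an effect of momentum.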


【2】Neural architectures for resolving references in program code
标题:用于解析程序代码中的引用的神经架构
链接:https://arxiv.org/abs/2604.14073

作者:Gergő Szalay,Gergely Zsolt Kovács,Sándor Teleki,Balázs Pintér,Tibor Gregorics
摘要:Resolving and rewriting references is fundamental in programming languages. Motivated by a real-world decompilation task, we abstract reference rewriting into the problems of direct and indirect indexing by permutation. We create synthetic benchmarks for these tasks and show that well-known sequence-to-sequence machine learning architectures struggle on these benchmarks. We introduce new sequence-to-sequence architectures for both problems. Our measurements show that our architectures outperform the baselines in both robustness and scalability: our models can handle examples that are ten times longer compared to the best baseline. We measure the impact of our architecture in the real-world task of decompiling switch statements, which has an indexing subtask. According to our measurements, the extended model decreases the error rate by 42%. Multiple ablation studies show that all components of our architectures are essential.


【3】MAny: Merge Anything for Multimodal Continual Instruction Tuning
标题:Many:合并任何内容以实现多模式连续教学调优
链接:https://arxiv.org/abs/2604.14016

作者:Zijian Gao,Wangwang Jia,Xingxing Zhang,Pengfei Qian,Tao Sun,Bo Ding,Yong Dou,Huaimin Wang,Kele Xu
摘要:Multimodal Continual Instruction Tuning (MCIT) is essential for sequential task adaptation of Multimodal Large Language Models (MLLMs) but is severely restricted by catastrophic forgetting. While existing literature focuses on the reasoning language backbone, in this work, we expose a critical yet neglected dual-forgetting phenomenon across both perception drift in Cross-modal Projection Space and reasoning collapse in Low-rank Parameter Space. To resolve this, we present MAny (Merge Anything), a framework that merges task-specific knowledge through Cross-modal Projection Merging (CPM) and Low-rank Parameter Merging (LPM). Specifically, CPM recovers perceptual alignment by adaptively merging cross-modal visual representations via visual-prototype guidance, ensuring accurate feature recovery during inference. Simultaneously, LPM eliminates mutual interference among task-specific low-rank modules by recursively merging low-rank weight matrices. By leveraging recursive least squares, LPM provides a closed-form solution that mathematically guarantees an optimal fusion trajectory for reasoning stability. Notably, MAny operates as a training-free paradigm that achieves knowledge merging via efficient CPU-based algebraic operations, eliminating additional gradient-based optimization beyond initial tuning. Our extensive evaluations confirm the superior performance and robustness of MAny across multiple MLLMs and benchmarks. Specifically, on the UCIT benchmark, MAny achieves significant leads of up to 8.57% and 2.85% in final average accuracy over state-of-the-art methods across two different MLLMs, respectively.


【4】PRiMeFlow: Capturing Complex Expression Heterogeneity in Perturbation Response Modelling
标题:PRiMeFlow:在微扰响应建模中捕捉复杂表达异源性
链接:https://arxiv.org/abs/2604.13986

作者:Zichao Yan,Yan Wu,Mica Xu Ji,Chaitra Agrahar,Esther Wershof,Marcel Nassar,Mehrshad Sadria,Ridvan Eksi,Vladimir Trifonov,Ignacio Ibarra,Telmo Felgueira,Błażej Osiński,Rory Stark
摘要:Predicting the effects of perturbations in-silico on cell state can identify drivers of cell behavior at scale and accelerate drug discovery. However, modeling challenges remain due to the inherent heterogeneity of single cell gene expression and the complex, latent gene dependencies. Here, we present PRiMeFlow, an end-to-end flow matching based approach to directly model the effects of genetic and small molecule perturbations in the gene expression space. The distribution-fitting approach taken by PRiMeFlow enables it to accurately approximate the empirical distribution of single-cell gene expression, which we demonstrate through extensive benchmarking inside PerturBench. Through ablation studies, we also validate important model design choices such as operating in gene expression space and parameterizing the velocity field with a U-Net architecture. The PRiMeFlow architecture was used as the basis for the model that won the Generalist Prize in the first ARC Virtual Cell Challenge.


【5】MolCryst-MLIPs: A Machine-Learned Interatomic Potentials Database for Molecular Crystals
标题:MolCryst-MLIPs:一个机器学习的分子晶体原子间势数据库
链接:https://arxiv.org/abs/2604.13897

作者:Adam Lahouari,Shen Ai,Jihye Han,Jillian Hoffstadt,Philipp Hoellmer,Charlotte Infante,Pulkita Jain,Sangram Kadam,Maya M. Martirossyan,Amara McCune,Hypatia Newton,Shlok J. Paul,Willmor Pena,Jonathan Raghoonanan,Sumon Sahu,Oliver Tan,Andrea Vergara,Jutta Rogal,Mark E. Tuckerman
摘要:We present an open Molecular Crystal (MC) database of Machine-Learned Interatomic Potentials (MLIP) called MolCryst-MLIPs. The first release comprises fine-tuned MACE models for nine molecular crystal systems -- Benzamide, Benzoic acid, Coumarin, Durene, Isonicotinamide, Niacinamide, Nicotinamide, Pyrazinamide, and Resorcinol -- developed using the Automated Machine Learning Pipeline (AMLP), which streamlines the entire MLIP development workflow, from reference data generation to model training and validation, into a reproducible and user-friendly pipeline. Models are fine-tuned from the MACE-MH-1 foundation model (omol head), yielding a mean energy MAE of 0.141 kJ/mol/atom and a mean force MAE of 0.648 kJ/mol/Angstrom across all systems. Dynamical stability and structural integrity, as assessed through energy conservation, P2 orientational order parameters, and radial distribution functions, are evaluated using molecular dynamics simulations. The released models and datasets constitute a growing open database of validated MLIPs, ready for production MD simulations of molecular crystal polymorphism under different thermodynamic conditions.


【6】Context Sensitivity Improves Human-Machine Visual Alignment
标题:上下文敏感性改善人机视觉对齐
链接:https://arxiv.org/abs/2604.13883

作者:Frieda Born,Tom Neuhäuser,Lukas Muttenthaler,Brett D. Roads,Bernhard Spitzer,Andrew K. Lampinen,Matt Jones,Klaus-Robert Müller,Michael C. Mozer
摘要:Modern machine learning models typically represent inputs as fixed points in a high-dimensional embedding space. While this approach has been proven powerful for a wide range of downstream tasks, it fundamentally differs from the way humans process information. Because humans are constantly adapting to their environment, they represent objects and their relationships in a highly context-sensitive manner. To address this gap, we propose a method for context-sensitive similarity computation from neural network embeddings, applied to modeling a triplet odd-one-out task with an anchor image serving as simultaneous context. Modeling context enables us to achieve up to a 15% improvement in odd-one-out accuracy over a context-insensitive model. We find that this improvement is consistent across both original and "human-aligned" vision foundation models.


【7】Simulation-Based Optimisation of Batting Order and Bowling Plans in T20 Cricket
标题:基于仿真的T20板球击球顺序和投球计划优化
链接:https://arxiv.org/abs/2604.13861

作者:Tinniam V Ganesh
备注:Submitted to the Journal of Quantitative Analysis in Sports (JQAS), April 2026. 23 pages, 8 figures
摘要:This paper develops a unified Markov Decision Process (MDP) framework for optimising two recurring in-match decisions in T20 cricket namely batting order selection and bowling plan assignment, directly in terms of win and defend probability rather than expected runs. A three-phase player profile engine (Powerplay, Middle, Death) with James-Stein shrinkage is estimated from 1,161 IPL ball-by-ball records (2008-2025). Win/defend probabilities are evaluated by vectorised Monte Carlo simulation over N = 50,000 innings trajectories. Batting orders are searched by exhaustive enumeration. Bowling plans are computed by simulated annealing over the remaining quota with the constraint that the same bowler cannot bowl consecutive overs. Applied to two 2026 IPL matches, the optimal batting order improves Mumbai Indians' win probability by 4.1 percentage points (52.4% to 56.5%), and the optimal Gujarat Titans bowling plan improves defend probability by 5.2 percentage points (39.1% to 44.3%). In both cases the observed sub-optimality is consistent with phase-agnostic deployment in decisions that appear reasonable by aggregate metrics but are exposed as costly when phase-specific profiles are applied.
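James-Stein shrinkage, mentioned in this abstract for the player profiles, pulls noisy per-player estimates toward a common mean. The sketch below is the textbook positive-part estimator with shrinkage toward the grand mean, under the simplifying assumption of one known, shared noise variance; the paper's exact estimator is not given in the abstract, and the function name and numbers are hypothetical.

```python
def james_stein_shrink(estimates, noise_variance):
    """Positive-part James-Stein shrinkage of raw estimates toward their grand mean."""
    k = len(estimates)
    if k < 4:
        raise ValueError("shrinkage toward the grand mean needs at least 4 estimates")
    grand_mean = sum(estimates) / k
    spread = sum((x - grand_mean) ** 2 for x in estimates)
    if spread == 0.0:
        return list(estimates)  # all estimates already agree
    # Shrinkage factor, clipped at zero so estimates never cross the mean.
    factor = max(0.0, 1.0 - (k - 3) * noise_variance / spread)
    return [grand_mean + factor * (x - grand_mean) for x in estimates]

# Hypothetical raw per-player scores in one match phase (not data from the paper).
raw_scores = [1.0, 2.0, 3.0, 4.0, 5.0]
shrunk_scores = james_stein_shrink(raw_scores, noise_variance=1.0)
```

Here the shrinkage factor is 1 - (5 - 3) * 1 / 10 = 0.8, so each score moves 20% of the way toward the grand mean of 3. Noisier data gives a smaller factor and stronger pooling.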


【8】SparseBalance: Load-Balanced Long Context Training with Dynamic Sparse Attention
标题:SparseBalance:具有动态稀疏注意力的负载平衡长期上下文训练
链接:https://arxiv.org/abs/2604.13847

作者:Hongtao Xu,Jianchao Tan,Yuxuan Hu,Pengju Lu,Hongyu Wang,Pingwei Sun,Yerui Sun,Yuchen Xie,Xunliang Cai,Mingzhen Li,Weile Jia
摘要:While sparse attention mitigates the computational bottleneck of long-context LLM training, its distributed training process exhibits extreme heterogeneity in both (1) sequence length and (2) sparsity sensitivity, leading to a severe imbalance problem and sub-optimal model accuracy. Existing algorithms and training frameworks typically focus on a single issue, failing to systematically co-optimize these two problems. Therefore, we propose SparseBalance, a novel algorithm-system co-design framework, which exploits the sparsity and sequence heterogeneity to optimize model accuracy and system efficiency jointly. First, we propose workload-aware dynamic sparsity tuning, which employs a bidirectional sparsity adjustment to eliminate stragglers and exploit inherent bubbles for free accuracy. Second, we propose a sparsity-aware batching strategy to achieve coarse-grained balance, which complements dynamic sparsity tuning. Experimental results demonstrate that SparseBalance achieves up to a 1.33x end-to-end speedup while still improving the long-context capability by 0.46% on the LongBench benchmark.


【9】RPS: Information Elicitation with Reinforcement Prompt Selection
标题:RPS:具有强化提示选择的信息启发
链接:https://arxiv.org/abs/2604.13817

作者:Tao Wang,Jingyao Lu,Xibo Wang,Haonan Huang,Su Yao,Zhiqiang Hu,Xingyan Chen,Enmao Diao
摘要:Large language models (LLMs) have shown remarkable capabilities in dialogue generation and reasoning, yet their effectiveness in eliciting user-known but concealed information in open-ended conversations remains limited. In many interactive AI applications, such as personal assistants, tutoring systems, and legal or clinical support, users often withhold sensitive or uncertain information due to privacy concerns, ambiguity, or social hesitation. This makes it challenging for LLMs to gather complete and contextually relevant inputs. In this work, we define the problem of information elicitation in open-ended dialogue settings and propose Reinforcement Prompt Selection (RPS), a lightweight reinforcement learning framework that formulates prompt selection as a sequential decision-making problem. To analyze this problem in a controlled setting, we design a synthetic experiment, where a reinforcement learning agent outperforms a random query baseline, illustrating the potential of policy-based approaches for adaptive information elicitation. Building on this insight, RPS learns a policy over a pool of prompts to adaptively elicit concealed or incompletely expressed information from users through dialogue. We also introduce IELegal, a new benchmark dataset constructed from real legal case documents, which simulates dialogue-based information elicitation tasks aimed at uncovering case-relevant facts. In this setting, RPS outperforms static prompt baselines, demonstrating the effectiveness of adaptive prompt selection for eliciting critical information in LLM-driven dialogue systems.


【10】Composite Silhouette: A Subsampling-based Aggregation Strategy
标题:复合剪影:基于子采样的聚合策略
链接:https://arxiv.org/abs/2604.13816

作者:Aggelos Semoglou,Aristidis Likas,John Pavlopoulos
备注:32 pages including Appendix
摘要:Determining the number of clusters is a central challenge in unsupervised learning, where ground-truth labels are unavailable. The Silhouette coefficient is a widely used internal validation metric for this task, yet its standard micro-averaged form tends to favor larger clusters under size imbalance. Macro-averaging mitigates this bias by weighting clusters equally, but may overemphasize noise from under-represented groups. We introduce Composite Silhouette, an internal criterion for cluster-count selection that aggregates evidence across repeated subsampled clusterings rather than relying on a single partition. For each subsample, micro- and macro-averaged Silhouette scores are combined through an adaptive convex weight determined by their normalized discrepancy and smoothed by a bounded nonlinearity; the final score is then obtained by averaging these subsample-level composites. We establish key properties of the criterion and derive finite-sample concentration guarantees for its subsampling estimate. Experiments on synthetic and real-world datasets show that Composite Silhouette effectively reconciles the strengths of micro- and macro-averaging, yielding more accurate recovery of the ground-truth number of clusters.
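The micro/macro distinction and the per-subsample convex combination described in this abstract can be sketched as follows. Per-point Silhouette values are assumed to be precomputed (for instance with scikit-learn's `silhouette_samples`). The weight function is a placeholder of my own, a tanh of the normalized gap, chosen only because it is bounded; the paper's actual weight and its direction are not specified in the abstract.

```python
import math

def micro_macro_silhouette(point_scores, labels):
    """Return (micro, macro) averages of per-point Silhouette values."""
    if len(point_scores) != len(labels) or not point_scores:
        raise ValueError("need one label per score and at least one point")
    by_cluster = {}
    for score, label in zip(point_scores, labels):
        by_cluster.setdefault(label, []).append(score)
    micro = sum(point_scores) / len(point_scores)  # every point weighs equally
    macro = sum(sum(v) / len(v) for v in by_cluster.values()) / len(by_cluster)
    return micro, macro                            # macro: every cluster weighs equally

def placeholder_weight(micro, macro, eps=1e-12):
    """Hypothetical weight in [0, 1): tanh of the normalized micro/macro gap."""
    return math.tanh(abs(micro - macro) / (abs(micro) + abs(macro) + eps))

def composite_score(subsamples):
    """Average the per-subsample convex combinations of micro and macro."""
    composites = []
    for point_scores, labels in subsamples:
        micro, macro = micro_macro_silhouette(point_scores, labels)
        w = placeholder_weight(micro, macro)
        composites.append(w * macro + (1.0 - w) * micro)
    return sum(composites) / len(composites)

# Hypothetical subsample: a large tight cluster and a small loose one.
scores = [0.8, 0.8, 0.8, 0.8, 0.2, 0.2]
labels = [0, 0, 0, 0, 1, 1]
micro, macro = micro_macro_silhouette(scores, labels)
composite = composite_score([(scores, labels)])
```

In this toy case the micro average (0.6) is pulled up by the larger cluster while the macro average (0.5) treats both clusters equally, and the composite lands between the two.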


【11】Robust Ultra Low-Bit Post-Training Quantization via Stable Diagonal Curvature Estimate
Title: Robust Ultra Low-Bit Post-Training Quantization via Stable Diagonal Curvature Estimate
Link: https://arxiv.org/abs/2604.13806

Authors: Jaemin Kim, Sungkyun Kim, Junyeol Lee, Jiwon Seo
Comments: EuroMLSys 2026
Abstract: Large Language Models (LLMs) are widely used across many domains, but their scale makes deployment challenging. Post-Training Quantization (PTQ) reduces memory footprint without retraining by leveraging a small calibration set. Recent Hessian-based PTQ methods compensate quantization error via cross-channel dependencies, but such approaches degrade at low bit-widths due to noisy curvature estimates from limited calibration data. We propose DASH-Q, a robust PTQ framework using diagonal Hessian approximation and iterative weighted least squares. By discarding noise-prone dependencies, DASH-Q filters sampling noise while prioritizing the preservation of salient feature power. DASH-Q outperforms other PTQ baselines in the ultra-low-bit regime, improving zero-shot accuracy by 7.01% on average and up to 14.01% over the strongest baselines across five baseline LLMs, while showing robust and stable performance with very small calibration data.


【12】A Dynamic-Growing Fuzzy-Neuro Controller, Application to a 3PSP Parallel Robot
Title: A Dynamic-Growing Fuzzy-Neuro Controller, Application to a 3PSP Parallel Robot
Link: https://arxiv.org/abs/2604.13763

Authors: Mohsen Jalaeian-Farimani, Mohammad-R Akbarzadeh-T, Alireza Akbarzadeh, Mostafa Ghaemi
Comments: 2012 IEEE International Conference on Fuzzy Systems
Abstract: To date, various paradigms of soft computing have been used to solve many modern problems. Among them, a self-organizing combination of fuzzy systems and neural networks can make a powerful decision-making system. Here, a Dynamic Growing Fuzzy Neural Controller (DGFNC) is combined with an adaptive strategy and applied to a 3PSP parallel robot position control problem. Specifically, the dynamic growing mechanism is considered in more detail. In contrast to other self-organizing methods, DGFNC adds new rules more conservatively; hence the pruning mechanism is omitted. Instead, the adaptive strategy 'adapts' the control system to parameter variation. Furthermore, a sliding mode-based nonlinear controller ensures system stability. The resulting general control strategy aims to achieve faster response with less computation while maintaining overall stability. Finally, the 3PSP is chosen due to its complex dynamics and the utility of such approaches in modern industrial systems. Several simulations support the merits of the proposed DGFNC strategy as applied to the 3PSP robot.


【13】Design and Behavior of Sparse Mixture-of-Experts Layers in CNN-based Semantic Segmentation
Title: Design and Behavior of Sparse Mixture-of-Experts Layers in CNN-based Semantic Segmentation
Link: https://arxiv.org/abs/2604.13761

Authors: Svetlana Pavlitska, Haixi Fan, Konstantin Ditschuneit, J. Marius Zöllner
Comments: Accepted for publication at the SAIAD workshop at CVPR 2026
Abstract: Sparse mixture-of-experts (MoE) layers have been shown to substantially increase model capacity without a proportional increase in computational cost and are widely used in transformer architectures, where they typically replace feed-forward network blocks. In contrast, integrating sparse MoE layers into convolutional neural networks (CNNs) remains inconsistent, with most prior work focusing on fine-grained MoEs operating at the filter or channel levels. In this work, we investigate a coarser, patch-wise formulation of sparse MoE layers for semantic segmentation, where local regions are routed to a small subset of convolutional experts. Through experiments on the Cityscapes and BDD100K datasets using encoder-decoder and backbone-based CNNs, we conduct a design analysis to assess how architectural choices affect routing dynamics and expert specialization. Our results demonstrate consistent, architecture-dependent improvements (up to +3.9 mIoU) with little computational overhead, while revealing strong design sensitivity. Our work provides empirical insights into the design and internal dynamics of sparse MoE layers in CNN-based dense prediction. Our code is available at https://github.com/KASTEL-MobilityLab/moe-layers/.


【14】Spectral Thompson sampling
Title: Spectral Thompson sampling
Link: https://arxiv.org/abs/2604.13739

Authors: Tomas Kocak, Michal Valko, Remi Munos, Shipra Agrawal
Comments: Published at AAAI Conference on Artificial Intelligence (AAAI) 2014
Abstract: Thompson Sampling (TS) has attracted a lot of interest due to its good empirical performance, in particular in computational advertising. Though successful, the tools for its performance analysis appeared only recently. In this paper, we describe and analyze the SpectralTS algorithm for a bandit problem, where the payoffs of the choices are smooth given an underlying graph. In this setting, each choice is a node of a graph and the expected payoffs of the neighboring nodes are assumed to be similar. Although the setting has applications in both recommender systems and advertising, the traditional algorithms would scale poorly with the number of choices. For that purpose we consider an effective dimension d, which is small in real-world graphs. We provide an analysis showing that the regret of SpectralTS scales as d*sqrt(T ln N) with high probability, where T is the time horizon and N is the number of choices. Since a d*sqrt(T ln N) regret is comparable to the known results, SpectralTS offers a computationally more efficient alternative. We also show that our algorithm is competitive on both synthetic and real-world data.


【15】EMGFlow: Robust and Efficient Surface Electromyography Synthesis via Flow Matching
Title: EMGFlow: Robust and Efficient Surface Electromyography Synthesis via Flow Matching
Link: https://arxiv.org/abs/2604.13685

Authors: Boxuan Jiang, Chenyun Dai, Can Han
Abstract: Deep learning-based surface electromyography (sEMG) gesture recognition is frequently bottlenecked by data scarcity and limited subject diversity. While synthetic data generation via Generative Adversarial Networks (GANs) and diffusion models has emerged as a promising augmentation strategy, these approaches often face challenges regarding training stability or inference efficiency. To bridge this gap, we propose EMGFlow, a conditional sEMG generation framework. To the best of our knowledge, this is the first study to investigate the application of Flow Matching (FM) and continuous-time generative modeling in the sEMG domain. To validate EMGFlow across three benchmark sEMG datasets, we employ a unified evaluation protocol integrating feature-based fidelity, distributional geometry, and downstream utility. Extensive evaluations show that EMGFlow outperforms conventional augmentation and GAN baselines, and provides stronger standalone utility than the diffusion baselines considered here under the train-on-synthetic test-on-real (TSTR) protocol. Furthermore, by optimizing generation dynamics through advanced numerical solvers and targeted time sampling, EMGFlow achieves improved quality-efficiency trade-offs. Taken together, these results suggest that Flow Matching is a promising and efficient paradigm for addressing data bottlenecks in myoelectric control systems. Our code is available at: https://github.com/Open-EXG/EMGFlow.


【16】Automatically Inferring Teachers' Geometric Content Knowledge: A Skills Based Approach
Title: Automatically Inferring Teachers' Geometric Content Knowledge: A Skills Based Approach
Link: https://arxiv.org/abs/2604.13666

Authors: Ziv Fenigstein, Kobi Gal, Avi Segal, Osama Swidan, Inbal Israel, Hassan Ayoob
Comments: The work is accepted for publication as a full paper (Main Track) at the 27th International Conference on Artificial Intelligence in Education (AIED 2026)
Abstract: Assessing teachers' geometric content knowledge is essential for geometry instructional quality and student learning, but difficult to scale. The Van Hiele model characterizes geometric reasoning through five hierarchical levels. Traditional Van Hiele assessment relies on manual expert analysis of open-ended responses. This process is time-consuming, costly, and prevents large-scale evaluation. This study develops an automated approach for diagnosing teachers' Van Hiele reasoning levels using large language models grounded in educational theory. Our central hypothesis is that integrating explicit skills information significantly improves Van Hiele classification. In collaboration with mathematics education researchers, we built a structured skills dictionary decomposing the Van Hiele levels into 33 fine-grained reasoning skills. Through a custom web platform, 31 pre-service teachers solved geometry problems, yielding 226 responses. Expert researchers then annotated each response with its Van Hiele level and demonstrated skills from the dictionary. Using this annotated dataset, we implemented two classification approaches: (1) retrieval-augmented generation (RAG) and (2) multi-task learning (MTL). Each approach compared a skills-aware variant incorporating the skills dictionary against a baseline without skills information. Results showed that for both methods, skills-aware variants significantly outperformed baselines across multiple evaluation metrics. This work provides the first automated approach for Van Hiele level classification from open-ended responses. It offers a scalable, theory-grounded method for assessing teachers' geometric reasoning that can enable large-scale evaluation and support adaptive, personalized teacher learning systems.


【17】Golden Handcuffs make safer AI agents
Title: Golden Handcuffs make safer AI agents
Link: https://arxiv.org/abs/2604.13609

Authors: Aram Ebtekar, Michael K. Cohen
Comments: 26 pages, preliminary version
Abstract: Reinforcement learners can attain high reward through novel unintended strategies. We study a Bayesian mitigation for general environments: we expand the agent's subjective reward range to include a large negative value $-L$, while the true environment's rewards lie in $[0,1]$. After observing consistently high rewards, the Bayesian policy becomes risk-averse to novel schemes that plausibly lead to $-L$. We design a simple override mechanism that yields control to a safe mentor whenever the predicted value drops below a fixed threshold. We prove two properties of the resulting agent: (i) Capability: using mentor-guided exploration with vanishing frequency, the agent attains sublinear regret against its best mentor. (ii) Safety: no decidable low-complexity predicate is triggered by the optimizing policy before it is triggered by a mentor.


【18】RiskWebWorld: A Realistic Interactive Benchmark for GUI Agents in E-commerce Risk Management
Title: RiskWebWorld: A Realistic Interactive Benchmark for GUI Agents in E-commerce Risk Management
Link: https://arxiv.org/abs/2604.13531

Authors: Renqi Chen, Zeyin Tao, Jianming Guo, Jing Wang, Zezhou Xu, Jingzhe Zhu, Qingqing Sun, Tianyi Zhang, Shuai Chen
Abstract: Graphical User Interface (GUI) agents show strong capabilities for automating web tasks, but existing interactive benchmarks primarily target benign, predictable consumer environments. Their effectiveness in high-stakes, investigative domains such as authentic e-commerce risk management remains underexplored. To bridge this gap, we present RiskWebWorld, the first highly realistic interactive benchmark for evaluating GUI agents in e-commerce risk management. RiskWebWorld features 1,513 tasks sourced from production risk-control pipelines across 8 core domains, and captures the authentic challenges of risk operations on uncooperative websites, including partial environmental hijacking. To support scalable evaluation and agentic reinforcement learning (RL), we further build a Gymnasium-compliant infrastructure that decouples policy planning from environment mechanics. Our evaluation across diverse models reveals a dramatic capability gap: top-tier generalist models achieve 49.1% success, while specialized open-weights GUI models lag at near-total failure. This highlights that foundation model scale currently matters more than zero-shot interface grounding in long-horizon professional tasks. We also demonstrate the viability of our infrastructure through agentic RL, which improves open-source models by 16.2%. These results position RiskWebWorld as a practical testbed for developing robust digital workers.


【19】C-voting: Confidence-Based Test-Time Voting without Explicit Energy Functions
Title: C-voting: Confidence-Based Test-Time Voting without Explicit Energy Functions
Link: https://arxiv.org/abs/2604.13521

Authors: Kenji Kubo, Shunsuke Kamiya, Masanori Koyama, Kohei Hayashi, Yusuke Iwasawa, Yutaka Matsuo
Abstract: Neural network models with latent recurrent processing, where identical layers are recursively applied to the latent state, have gained attention as promising models for performing reasoning tasks. A strength of such models is that they enable test-time scaling, where the models can enhance their performance in the test phase without additional training. Models such as the Hierarchical Reasoning Model (HRM) and Artificial Kuramoto Oscillatory Neurons (AKOrN) can facilitate deeper reasoning by increasing the number of recurrent steps, thereby enabling the completion of challenging tasks, including Sudoku, Maze solving, and AGI benchmarks. In this work, we introduce confidence-based voting (C-voting), a test-time scaling strategy designed for recurrent models with multiple latent candidate trajectories. Initializing the latent state with multiple candidates using random variables, C-voting selects the one maximizing the average of top-1 probabilities of the predictions, reflecting the model's confidence. Additionally, it yields 4.9% higher accuracy on Sudoku-hard than the energy-based voting strategy, which is specific to models with explicit energy functions. An essential advantage of C-voting is its applicability: it can be applied to recurrent models without requiring an explicit energy function. Finally, we introduce a simple attention-based recurrent model with randomized initial values named ItrSA++, and demonstrate that when combined with C-voting, it outperforms HRM on Sudoku-extreme (95.2% vs. 55.0%) and Maze (78.6% vs. 74.5%) tasks.
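
The selection rule stated in this abstract (pick the candidate whose predictions have the highest average top-1 probability) can be sketched in a few lines. This is only an illustration of that one sentence, not the authors' implementation: the function names, the nested-list data layout (candidates, then prediction positions, then class probabilities), and the toy numbers are all assumptions.

```python
def average_top1_confidence(prob_rows):
    """Mean, over prediction positions, of the largest class probability."""
    return sum(max(row) for row in prob_rows) / len(prob_rows)

def c_vote(candidates):
    """Return (index of the most confident candidate, all confidence scores).

    `candidates` is a list with one entry per latent initialization; each entry
    is a list of per-position probability vectors (hypothetical layout).
    """
    scores = [average_top1_confidence(rows) for rows in candidates]
    best = max(range(len(candidates)), key=scores.__getitem__)
    return best, scores

# Toy example: three candidate trajectories, two prediction positions, two classes.
candidates = [
    [[0.6, 0.4], [0.5, 0.5]],    # average top-1 probability 0.55
    [[0.9, 0.1], [0.8, 0.2]],    # average top-1 probability 0.85
    [[0.7, 0.3], [0.95, 0.05]],  # average top-1 probability 0.825
]
best, scores = c_vote(candidates)
print(best, [round(s, 3) for s in scores])
```

Because the score only needs output probabilities, a rule of this shape does not depend on the model exposing an energy function, which matches the applicability argument in the abstract.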


【20】LEGO-MOF: Equivariant Latent Manipulation for Editable, Generative, and Optimizable MOF Design
Title: LEGO-MOF: Equivariant Latent Manipulation for Editable, Generative, and Optimizable MOF Design
Link: https://arxiv.org/abs/2604.13520

Authors: Chaoran Zhang, Guangyao Li, Dongxu Ji
Comments: 36 pages including Supplementary Information, 10 figures in the main text and 12 figures/tables in the Supplementary Information
Abstract: Metal-organic frameworks (MOFs) are highly promising for carbon capture, yet navigating their vast design space remains challenging. Recent deep generative models enable de novo MOF design but primarily act as feed-forward structure generators. By heavily relying on predefined building block libraries and non-differentiable post-optimization, they fundamentally sever the information flow required for continuous structural editing. Here, we propose a target-driven generative framework focused on continuous structural manipulation. At its core is LinkerVAE, which maps discrete 3D chemical graphs into a continuous, SE(3)-equivariant latent space. This smooth manifold unlocks geometry-aware manipulations, including implicit chemical style transfer and zero-shot isoreticular expansion. Building upon this, we introduce a test-time optimization (TTO) strategy, utilizing an accurate surrogate model to continuously optimize the latent graphs of existing MOFs toward desired properties. This approach systematically enhances carbon capture performance, achieving a striking average relative boost of 147.5% in pure CO2 uptake while strictly preserving structural validity. Integrated with a latent diffusion model and rigid-body assembly for full MOF construction, our framework establishes a scalable, fully differentiable pathway for the automated discovery, targeted optimization, and editing of functional materials.


【21】SFT-GRPO Data Overlap as a Post-Training Hyperparameter for Autoformalization
Title: SFT-GRPO Data Overlap as a Post-Training Hyperparameter for Autoformalization
Link: https://arxiv.org/abs/2604.13515

Authors: Xiaole Su, Kasey Zhang, Andy Lyu
Abstract: Supervised fine-tuning (SFT) followed by Group Relative Policy Optimization (GRPO) is a common post-training recipe. We conduct a controlled ablation over SFT-GRPO data overlap, evaluating Qwen3-8B (thinking disabled) post-trained for Lean 4 autoformalization under six conditions that differ solely in training recipe: a base model, SFT-only, GRPO-only, and three SFT+GRPO configurations where 0 percent, 30 percent, or 100 percent of the GRPO prompts coincide with the SFT corpus. Keeping SFT and GRPO data disjoint consistently outperforms full overlap at zero additional compute cost. Evaluating on Gaokao-Formal and PutnamBench under both compile pass@k and semantic pass@k assessed by an LLM judge, we find that lower overlap is monotonically associated with higher compilation and semantic accuracy. At 0 percent overlap, GRPO yields a 10.4 percentage point semantic gain over SFT alone on Gaokao, while at 100 percent overlap both metrics remain flat, rendering the GRPO stage effectively redundant. We further show that dual-metric evaluation reveals compile-semantic gaps exceeding 30 percentage points for the highest compiling models, a disparity invisible under compile-only benchmarking. To our knowledge, this is the first controlled investigation of SFT-GRPO data overlap as a post-training hyperparameter, demonstrating how model behavior varies based on the degree of data sharing between training stages.
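
As a rough illustration of what treating overlap as a hyperparameter could look like in a data pipeline, the sketch below builds a GRPO prompt set in which a chosen fraction of prompts is drawn from the SFT corpus and the rest from a disjoint pool. The abstract does not describe how the authors constructed their splits, so the function name, its parameters, and the placeholder prompt strings are all assumptions.

```python
import random

def build_grpo_prompts(sft_prompts, fresh_prompts, overlap, size, seed=0):
    """Sample `size` GRPO prompts with a fraction `overlap` taken from the SFT corpus.

    `fresh_prompts` is assumed to be disjoint from `sft_prompts`.
    """
    rng = random.Random(seed)
    n_shared = round(overlap * size)
    shared = rng.sample(sft_prompts, n_shared)             # prompts also seen during SFT
    disjoint = rng.sample(fresh_prompts, size - n_shared)  # prompts unseen during SFT
    mixed = shared + disjoint
    rng.shuffle(mixed)
    return mixed

# Hypothetical prompt pools, used only to exercise the function.
sft_pool = [f"sft-{i}" for i in range(100)]
fresh_pool = [f"new-{i}" for i in range(100)]
sft_set = set(sft_pool)

shared_counts = {}
for overlap in (0.0, 0.3, 1.0):  # the three SFT+GRPO overlap levels named in the abstract
    prompts = build_grpo_prompts(sft_pool, fresh_pool, overlap, size=50)
    shared_counts[overlap] = sum(p in sft_set for p in prompts)
print(shared_counts)
```

Fixing `size` across overlap levels keeps the GRPO budget constant, so that only the degree of sharing changes between conditions.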


【22】Computational framework for multistep metabolic pathway design
Title: Computational framework for multistep metabolic pathway design
Link: https://arxiv.org/abs/2604.13471

Authors: Peter Zhiping Zhang, Jeffrey D. Varner
Abstract: In silico tools are important for generating novel hypotheses and exploring alternatives in de novo metabolic pathway design. However, while many computational frameworks have been proposed for retrobiosynthesis, few successful examples of algorithm-guided xenobiotic biochemical retrosynthesis have been reported in the literature. Deep learning has improved the quality of synthesis and retrosynthesis in organic chemistry applications. Inspired by this progress, we explored combining deep learning of biochemical transformations with the traditional retrobiosynthetic workflow to improve in silico synthetic metabolic pathway designs. To develop our computational biosynthetic pathway design framework, we assembled metabolic reaction and enzymatic template data from public databases. A data augmentation procedure, adapted from literature, was carried out to enrich the assembled reaction dataset with artificial metabolic reactions generated by enzymatic reaction templates. Two neural network-based pathway ranking models were trained as binary classifiers to distinguish assembled reactions from artificial counterparts; each model output a scalar quantifying the plausibility of a 1-step or 2-step pathway. Combining these two models with enzymatic templates, we built a multistep retrobiosynthesis pipeline and validated it by reproducing some natural and non-natural pathways computationally.


【23】Universality of Gaussian-Mixture Reverse Kernels in Conditional Diffusion
Title: Universality of Gaussian-Mixture Reverse Kernels in Conditional Diffusion
Link: https://arxiv.org/abs/2604.13470

Authors: Nafiz Ishtiaque, Syed Arefinul Haque, Kazi Ashraful Alam, Fatima Jahara
Comments: 10+19 pages
Abstract: We prove that conditional diffusion models whose reverse kernels are finite Gaussian mixtures with ReLU-network logits can approximate suitably regular target distributions arbitrarily well in context-averaged conditional KL divergence, up to an irreducible terminal mismatch that typically vanishes with increasing diffusion horizon. A path-space decomposition reduces the output error to this mismatch plus per-step reverse-kernel errors; assuming each reverse kernel factors through a finite-dimensional feature map, each step becomes a static conditional density approximation problem, solved by composing Norets' Gaussian-mixture theory with quantitative ReLU bounds. Under exact terminal matching the resulting neural reverse-kernel class is dense in conditional KL.


【24】Functional Emotions or Situational Contexts? A Discriminating Test from the Mythos Preview System Card
Title: Functional Emotions or Situational Contexts? A Discriminating Test from the Mythos Preview System Card
Link: https://arxiv.org/abs/2604.13466

Authors: Hiranya V. Peiris
Comments: 6 pages
Abstract: The Claude Mythos Preview system card deploys emotion vectors, sparse autoencoder (SAE) features, and activation verbalisers to study model internals during misaligned behaviour. The two primary toolkits are not jointly reported on the most alignment-relevant episodes. This note identifies two hypotheses that are qualitatively consistent with the published results: that the emotion vectors track functional emotions that causally drive behaviour, or that they are a projection of a richer situational-context structure onto human emotional axes. The hypotheses can be distinguished by a test the system card does not report: applying emotion probes to the strategic concealment episodes where only SAE features are currently documented. If emotion probes show flat activation while SAE features are strongly active, the alignment-relevant structure lies outside the emotion subspace. Which hypothesis is correct determines whether emotion-based monitoring will robustly detect dangerous model behaviour or systematically miss it.


【25】WIN-U: Woodbury-Informed Newton-Unlearning as a retain-free Machine Unlearning Framework
Title: WIN-U: Woodbury-Informed Newton-Unlearning as a retain-free Machine Unlearning Framework
Link: https://arxiv.org/abs/2604.13438

Authors: Xingjian Zhao, Mohammad Mohammadi Amiri, Malik Magdon-Ismail
Comments: 21 pages, 3 figures, under review at COLM 2026
Abstract: Privacy concerns in LLMs have led to a rapidly growing need to enforce the "right to be forgotten" for data. Machine unlearning addresses precisely this task, namely the removal of the influence of some specific data, i.e., the forget set, from a trained model. The gold standard for unlearning is to produce the model that would have been learned on only the rest of the training data, i.e., the retain set. Most existing unlearning methods rely on direct access to the retained data, which may not be practical due to privacy or cost constraints. We propose WIN-U, a retained-data free unlearning framework that requires only second-order information for the originally trained model on the full data. The unlearning is performed using a single Newton-style step. Using the Woodbury matrix identity and a generalized Gauss-Newton approximation for the forget set curvature, the WIN-U update recovers the closed-form linear solution and serves as a local second-order approximation to the gold-standard retraining optimum. Extensive experiments on various vision and language benchmarks demonstrate that WIN-U achieves SOTA performance in terms of unlearning efficacy and utility preservation, while being more robust against relearning attacks compared to existing methods. Importantly, WIN-U does not require access to the retained data.
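
For readers unfamiliar with the identity named in this abstract, the block below states the standard Woodbury matrix identity and then one way it could be used when a low-rank curvature term is subtracted from a full-data Hessian. The second line is an illustration under stated assumptions, not the WIN-U update, which the abstract does not spell out: $H$ (full-data Hessian), $J$ (a forget-set Jacobian-like factor with $H_f \approx J^\top J$), and the split $H_r = H - H_f$ are hypothetical notation introduced here.

```latex
% Woodbury matrix identity (A and C invertible, shapes compatible):
(A + U C V)^{-1} = A^{-1} - A^{-1} U \left( C^{-1} + V A^{-1} U \right)^{-1} V A^{-1}

% Illustrative special case (assumed notation): A = H, U = J^\top, C = -I, V = J,
% so that the retain-side curvature H_r = H - J^\top J can be inverted using only H^{-1} and J:
(H - J^\top J)^{-1} = H^{-1} + H^{-1} J^\top \left( I - J H^{-1} J^\top \right)^{-1} J H^{-1}
```

In this special case the inner inverse has the dimension of the low-rank factor rather than of the parameters, which is the usual reason for reaching for the identity.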


【26】BioTrain: Sub-MB, Sub-50mW On-Device Fine-Tuning for Edge-AI on Biosignals
Title: BioTrain: Sub-MB, Sub-50mW On-Device Fine-Tuning for Edge-AI on Biosignals
Link: https://arxiv.org/abs/2604.13359

Authors: Run Wang, Victor J. B. Jung, Philip Wiese, Sebastian Frey, Giusy Spacone, Francesco Conti, Alessio Burrello, Luca Benini
Abstract: Biosignals exhibit substantial cross-subject and cross-session variability, inducing severe domain shifts that degrade post-deployment performance for small, edge-oriented AI models. On-device adaptation is therefore essential to both preserve user privacy and ensure system reliability. However, existing sub-100 mW MCU-based wearable platforms can only support shallow or sparse adaptation schemes due to the prohibitive memory footprint and computational cost of full backpropagation (BP). In this paper, we propose BioTrain, a framework enabling full-network fine-tuning of state-of-the-art biosignal models under milliwatt-scale power and sub-megabyte memory constraints. We validate BioTrain using both offline and on-device benchmarks on EEG and EOG datasets, covering Day-1 new-subject calibration and longitudinal adaptation to signal drift. Experimental results show that full-network fine-tuning achieves accuracy improvements of up to 35% over non-adapted baselines and outperforms last-layer updates by approximately 7% during new-subject calibration. On the GAP9 MCU platform, BioTrain enables efficient on-device training throughput of 17 samples/s for EEG and 85 samples/s for EOG models within a power envelope below 50 mW. In addition, BioTrain's efficient memory allocator and network topology optimization enable the use of a large batch size, reducing peak memory usage. For fully on-chip BP on GAP9, BioTrain reduces the memory footprint by 8.1x, from 5.4 MB to 0.67 MB, compared to conventional full-network fine-tuning using batch normalization with batch size 8.


【27】Event Tensor: A Unified Abstraction for Compiling Dynamic Megakernel
Title: Event Tensor: A Unified Abstraction for Compiling Dynamic Megakernel
Link: https://arxiv.org/abs/2604.13327

Authors: Hongyi Jin, Bohan Hou, Guanjie Wang, Ruihang Lai, Jinqi Chen, Zihao Ye, Yaxing Cai, Yixin Dong, Xinhao Cheng, Zhihao Zhang, Yilong Zhao, Yingyi Huang, Lijie Yang, Jinchen Jiang, Gabriele Oliaro, Jianan Ji, Xupeng Miao, Vinod Grover, Todd C. Mowry, Zhihao Jia, Tianqi Chen
Comments: 16 pages, 18 figures. Accepted at MLSys 2026
Abstract: Modern GPU workloads, especially large language model (LLM) inference, suffer from kernel launch overheads and coarse synchronization that limit inter-kernel parallelism. Recent megakernel techniques fuse multiple operators into a single persistent kernel to eliminate launch gaps and expose inter-kernel parallelism, but struggle to handle dynamic shapes and data-dependent computation in real workloads. We present Event Tensor, a unified compiler abstraction for dynamic megakernels. Event Tensor encodes dependencies between tiled tasks, and enables first-class support for both shape and data-dependent dynamism. Built atop this abstraction, our Event Tensor Compiler (ETC) applies static and dynamic scheduling transformations to generate high-performance persistent kernels. Evaluations show that ETC achieves state-of-the-art LLM serving latency while significantly reducing system warmup overhead.


【28】The Spectrascapes Dataset: Street-view imagery beyond the visible captured using a mobile platform
Title: The Spectrascapes Dataset: Street-view imagery beyond the visible captured using a mobile platform
Link: https://arxiv.org/abs/2604.13315

Authors: Akshit Gupta, Joris Timmermans, Filip Biljecki, Remko Uijlenhoet
Comments: Submitted, under review
Abstract: High-resolution data in spatial and temporal contexts is imperative for developing climate resilient cities. Current datasets for monitoring urban parameters are developed primarily using manual inspections, embedded-sensing, remote sensing, or standard street-view imagery (RGB). These methods and datasets are often constrained respectively by poor scalability, inconsistent spatio-temporal resolutions, overhead views or low spectral information. We present a novel method and its open implementation: a multi-spectral terrestrial-view dataset that circumvents these limitations. This dataset consists of 17,718 street level multi-spectral images captured with RGB, Near-infrared, and Thermal imaging sensors on bikes, across diverse urban morphologies (village, town, small city, and big urban area) in the Netherlands. Strict emphasis is put on data calibration and quality while also providing the details of our data collection methodology (including the hardware and software details). To the best of our knowledge, Spectrascapes is the first open-access dataset of its kind. Finally, we demonstrate two downstream use-cases enabled using this dataset and provide potential research directions in the machine learning, urban planning and remote sensing domains.


【29】Some Theoretical Limitations of t-SNE
Title: Some Theoretical Limitations of t-SNE
Link: https://arxiv.org/abs/2604.13295

Authors: Rupert Li, Elchanan Mossel
Comments: 19 pages, 7 figures
Abstract: t-SNE has gained popularity as a dimension reduction technique, especially for visualizing data. It is well-known that all dimension reduction techniques may lose important features of the data. We provide a mathematical framework for understanding this loss for t-SNE by establishing a number of results in different scenarios showing how important features of data are lost by using t-SNE.


【30】Physics-informed reservoir characterization from bulk and extreme pressure events with a differentiable simulator
Title: Physics-informed reservoir characterization from bulk and extreme pressure events with a differentiable simulator
Link: https://arxiv.org/abs/2604.13291

Authors: Harun Ur Rashid, Mingxin Li, Aleksandra Pachalieva, Georg Stadler, Daniel O'Malley
Abstract: Accurate characterization of subsurface heterogeneity is challenging but essential for applications such as reservoir pressure management, geothermal energy extraction and CO$_2$, H$_2$, and wastewater injection operations. This challenge becomes especially acute in extreme pressure events, which are rarely observed but can strongly affect operational risk. Traditional history matching and inversion techniques rely on expensive full-physics simulations, making it infeasible to handle uncertainty and extreme events at scale. Purely data-driven models often struggle to maintain physics consistency when dealing with sparse observations, complex geology, and extreme events. To overcome these limitations, we introduce a physics-informed machine learning method that embeds a differentiable subsurface flow simulator directly into neural network training. The network infers heterogeneous permeability fields from limited pressure observations, while training minimizes both permeability and pressure losses through the simulator, enforcing physical consistency. Because the simulator is used only during training, inference remains fast once the model is learned. In an initial test, the proposed method reduces the pressure inference error by half compared with a purely data-driven approach. We then extend the test over eight distinct data scenarios, and in every case, our method produces significantly lower pressure inference errors than the purely data-driven model. We also evaluate our method on extreme events, which represent high-consequence data in the tail of the sample distribution. Similar to the bulk distribution, the physics-informed model maintains higher pressure inference accuracy in the extreme event regimes. Overall, the proposed method enables rapid, physics-consistent subsurface inversion for real-time reservoir characterization and risk-aware decision-making.


【31】Optimizing Earth Observation Satellite Schedules under Unknown Operational Constraints: An Active Constraint Acquisition Approach
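Title: Optimizing Earth Observation Satellite Schedules under Unknown Operational Constraints: An Active Constraint Acquisition Approach
Link: https://arxiv.org/abs/2604.13283

Authors: Mohamed-Bachir Belaid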
Abstract: Earth Observation (EO) satellite scheduling (deciding which imaging tasks to perform and when) is a well-studied combinatorial optimization problem. Existing methods typically assume that the operational constraint model is fully specified in advance. In practice, however, constraints governing separation between observations, power budgets, and thermal limits are often embedded in engineering artefacts or high-fidelity simulators rather than in explicit mathematical models. We study EO scheduling under unknown constraints: the objective is known, but feasibility must be learned interactively from a binary oracle. Working with a simplified model restricted to pairwise separation and global capacity constraints, we introduce Conservative Constraint Acquisition (CCA), a domain-specific procedure designed to identify justified constraints efficiently in practice while limiting unnecessary tightening of the learned model. Embedded in the Learn&Optimize framework, CCA supports an interactive search process that alternates optimization under a learned constraint model with targeted oracle queries. On synthetic instances with up to 50 tasks and dense constraint networks, L&O improves over a no-knowledge greedy baseline and uses far fewer main oracle queries than a two-phase acquire-then-solve baseline (FAO). For $n\leq 30$, the average gap drops from 65-68% (Priority Greedy) to 17.7-35.8% using L&O. At $n=50$, where the CP-SAT reference is the best feasible solution found in 120 s, L&O improves on FAO on average (17.9% vs. 20.3%) while using 21.3 main queries instead of 100 and about $5\times$ less execution time.


【32】Does Dimensionality Reduction via Random Projections Preserve Landscape Features?
Link: https://arxiv.org/abs/2604.13230

Authors: Iván Olarte Rodríguez, Anja Jankovic, Thomas Bäck, Elena Raponi
Comments: 9 pages, 5 figures. Submitted and accepted to the Proceedings of the Genetic and Evolutionary Computation Conference 2026.
Abstract: Exploratory Landscape Analysis (ELA) provides numerical features for characterizing black-box optimization problems. In high-dimensional settings, however, ELA suffers from sparsity effects, high estimator variance, and the prohibitive cost of computing several feature classes. Dimensionality reduction has therefore been proposed as a way to make ELA applicable in such settings, but it remains unclear whether features computed in reduced spaces still reflect intrinsic properties of the original landscape. In this work, we investigate the robustness of ELA features under dimensionality reduction via Random Gaussian Embeddings (RGEs). Starting from the same sampled points and objective values, we compute ELA features in projected spaces and compare them to those obtained in the original search space across multiple sample budgets and embedding dimensions. Our results show that linear random projections often alter the geometric and topological structure relevant to ELA, yielding feature values that are no longer representative of the original problem. While a small subset of features remains comparatively stable, most are highly sensitive to the embedding. Moreover, robustness under projection does not necessarily imply informativeness, as apparently robust features may still reflect projection-induced artifacts rather than intrinsic landscape characteristics.
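A minimal sketch of the projection step this abstract describes: sampled points from a high-dimensional search space are mapped to a lower-dimensional space with a random Gaussian matrix, while their objective values stay attached to the projected points. The function names, the 1/sqrt(d) scaling, and the seeds are illustrative assumptions, not the authors' implementation.

```python
import random


def gaussian_embedding(high_dim, low_dim, seed=0):
    """Return a low_dim x high_dim matrix with i.i.d. Gaussian entries.

    The standard deviation 1/sqrt(low_dim) is an assumption made here so that
    squared norms are preserved in expectation; the paper may scale differently.
    """
    rng = random.Random(seed)
    scale = 1.0 / low_dim ** 0.5
    return [[rng.gauss(0.0, scale) for _ in range(high_dim)]
            for _ in range(low_dim)]


def project(matrix, point):
    """Apply the linear map to one point given as a list of floats."""
    return [sum(a * x for a, x in zip(row, point)) for row in matrix]


# Hypothetical usage: project 5 random points from 20 to 3 dimensions.
rng = random.Random(1)
points = [[rng.uniform(-5.0, 5.0) for _ in range(20)] for _ in range(5)]
A = gaussian_embedding(high_dim=20, low_dim=3)
projected = [project(A, p) for p in points]
```

Landscape features would then be computed once on `points` and once on `projected`, using the same objective values, and compared.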


【33】Fast Voxelization and Level of Detail for Microgeometry Rendering
Link: https://arxiv.org/abs/2604.13191

Authors: Javier Fabre, Carlos Castillo, Carlos Rodriguez-Pardo, Jorge Lopez-Moreno
Comments: Accepted for publication in The Visual Computer. 16 pages, 7 figures, 3 tables. Supplementary material: https://javierfabre.com/projects/voxel-lod/supp.pdf
Abstract: Many materials show anisotropic light scattering patterns due to the shape and local alignment of their underlying microstructures: surfaces with small elements such as fibers, or the ridges of a brushed metal, are very sparse and require a high spatial resolution to be properly represented as a volume. The acquisition of voxel data from such objects is a time- and memory-intensive task, and most rendering approaches require an additional Level-of-Detail (LoD) data structure to aggregate the visual appearance, as observed from multiple distances, in order to reduce the number of samples computed per pixel (e.g., MIP mapping). In this work, we introduce, first, an efficient parallel voxelization method designed to facilitate fast data aggregation at multiple resolution levels and, second, a novel representation based on hierarchical SGGX clustering that provides better accuracy than baseline methods. We validate our approach with a CUDA-based implementation of the voxelizer, tested both on triangle meshes and volumetric fabrics modeled with explicit fibers. Finally, we show the results generated with a path tracer based on the proposed LoD rendering model.


【34】Generalization Guarantees on Data-Driven Tuning of Gradient Descent with Langevin Updates
Link: https://arxiv.org/abs/2604.13130

Authors: Saumya Goyal, Rohith Rongali, Ritabrata Ray, Barnabás Póczos
Abstract: We study learning to learn for regression problems through the lens of hyperparameter tuning. We propose the Langevin Gradient Descent Algorithm (LGD), which approximates the mean of the posterior distribution defined by the loss function and regularizer of a convex regression task. We prove the existence of an optimal hyperparameter configuration for which the LGD algorithm achieves the Bayes-optimal solution for squared loss. Subsequently, we study generalization guarantees on meta-learning optimal hyperparameters for the LGD algorithm from a given set of tasks in the data-driven setting. For a number of parameters $d$ and hyperparameter dimension $h$, we show a pseudo-dimension bound of $O(dh)$, up to logarithmic terms, under mild assumptions on LGD. This matches the dimensional dependence of the bounds obtained in prior work for the elastic net, which only allows for $h=2$ hyperparameters, and extends their bounds to regression on convex loss. Finally, we show empirical evidence of the success of LGD and the meta-learning procedure for few-shot learning on linear regression using a few synthetically created datasets.


【35】Synthetic Tabular Generators Fail to Preserve Behavioral Fraud Patterns: A Benchmark on Temporal, Velocity, and Multi-Account Signals
Link: https://arxiv.org/abs/2604.13125

Authors: Bhavana Sajja
Comments: 28 pages, 5 figures. Submitted to DMLR (Journal of Data-centric Machine Learning Research). Code: https://github.com/bhavana3/synthetic-data-experiments
Abstract: We introduce behavioral fidelity -- a third evaluation dimension for synthetic tabular data that measures whether generated data preserves the temporal, sequential, and structural behavioral patterns that distinguish real-world entity activity. Existing frameworks evaluate statistical fidelity (marginal distributions and correlations) and downstream utility (classifier AUROC on synthetic-trained models), but neither tests for the behavioral signals that operational detection and analysis systems actually rely on. We formalize a taxonomy of four behavioral fraud patterns (P1-P4) covering inter-event timing, burst structure, multi-account graph motifs, and velocity-rule trigger rates; define a degradation ratio metric calibrated to a real-data noise floor (1.0 = matches real variability, k = k-times worse); and prove that row-independent generators -- the dominant paradigm -- are structurally incapable of reproducing P3 graph motifs (Proposition 1) and produce non-positive within-entity IET autocorrelation (Proposition 2), making the positive burst fingerprint of fraud sequences unachievable regardless of architecture or training data size. We benchmark CTGAN, TVAE, GaussianCopula, and TabularARGN on IEEE-CIS Fraud Detection and the Amazon Fraud Dataset. All four fail severely: on IEEE-CIS, composite degradation ratios range from 24.4x (TVAE) to 39.0x (GaussianCopula); on Amazon FDB, row-independent generators score 81.6-99.7x, while TabularARGN achieves 17.2x. We document generator-specific failure modes and their resolutions. The P1-P4 framework extends to any domain with entity-level sequential tabular data, including healthcare and network security. We release our evaluation framework as open source.


【36】Spectral Entropy Collapse as an Empirical Signature of Delayed Generalisation in Grokking
Link: https://arxiv.org/abs/2604.13123

Authors: Truong Xuan Khanh, Truong Quynh Hoa, Luu Duc Trung, Phan Thanh Duc
Comments: 18 pages, 10 figures, 7 tables
Abstract: Grokking -- delayed generalisation long after memorisation -- lacks a predictive mechanistic explanation. We identify the normalised spectral entropy $\tilde{H}(t)$ of the representation covariance as a scalar order parameter for this transition, validated on 1-layer Transformers on group-theoretic tasks. Five contributions: (i) Grokking follows a two-phase pattern: norm expansion then entropy collapse. (ii) $\tilde{H}$ crosses a stable threshold $\tilde{H}^* \approx 0.61$ before generalisation in 100% of runs (mean lead: 1,020 steps). (iii) A causal intervention preventing collapse delays grokking by +5,020 steps ($p=0.044$); a norm-matched control ($n=30$, $p=5\times10^{-5}$) confirms entropy -- not norm -- drives the transition. (iv) A power law $\Delta T = C_1(\tilde{H}-\tilde{H}^*)^{\gamma}+C_2$ ($R^2=0.543$) predicts grokking onset with 4.1% error. (v) The mechanism holds across abelian ($\mathbb{Z}/97\mathbb{Z}$) and non-abelian ($S_5$) groups. Crucially, MLPs show entropy collapse without grokking, proving collapse is necessary but not sufficient -- architecture matters. Code: https://anonymous.4open.science/r/grokking-entropy
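A minimal sketch of how a normalised spectral entropy could be computed from the eigenvalues of a covariance matrix. The abstract does not state the exact normalisation, so dividing the Shannon entropy of the normalised spectrum by $\log n$ is an assumption here, as are the function name and the toy inputs.

```python
import math


def normalised_spectral_entropy(eigenvalues):
    """Shannon entropy of the normalised spectrum, divided by log(n).

    `eigenvalues` are the non-negative eigenvalues of a covariance matrix.
    Dividing by log(n) so that the value lies in [0, 1] is an assumption;
    the paper's exact definition is not given in the abstract.
    """
    n = len(eigenvalues)
    total = sum(eigenvalues)
    if n < 2 or total <= 0:
        raise ValueError("need at least two eigenvalues with a positive sum")
    # Zero eigenvalues contribute nothing to the entropy (0 * log 0 := 0).
    probs = [v / total for v in eigenvalues if v > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return entropy / math.log(n)


flat = normalised_spectral_entropy([1.0, 1.0, 1.0, 1.0])       # spread-out spectrum
collapsed = normalised_spectral_entropy([1.0, 0.0, 0.0, 0.0])  # rank-one spectrum
```

Under this definition a flat spectrum sits at the top of the range and a rank-one spectrum at the bottom, so "entropy collapse" corresponds to variance concentrating in a few directions.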


【37】Can Coding Agents Be General Agents?
Link: https://arxiv.org/abs/2604.13107

Authors: Maksim Ivanov, Abhijay Rana, Gokul Prabhakaran
Abstract: As coding agents have seen rapid capability and adoption gains, users are applying them to general tasks beyond software engineering. In this post, we investigate whether coding agents can successfully generalize to end-to-end business process automation. We identify gaps in current evaluations, and conduct a case study to evaluate a coding agent on practical business tasks in an open-core Enterprise Resource Planning system. We find that the agent reliably completes simple tasks but exhibits characteristic failures on complex tasks, suggesting that bridging domain logic and code execution is a key bottleneck to generalizability.


【38】Alignment as Institutional Design: From Behavioral Correction to Transaction Structure in Intelligent Systems
Link: https://arxiv.org/abs/2604.13079

Authors: Rui Chai
Comments: This is Paper 5 in a 10-paper series on Super-Alignment via Wuxing Institutional Architecture. It shifts alignment from external behavioral correction to internal institutional design, making aligned behavior the lowest-cost equilibrium.
Abstract: Current AI alignment paradigms rely on behavioral correction: external supervisors (e.g., RLHF) observe outputs, judge against preferences, and adjust parameters. This paper argues that behavioral correction is structurally analogous to an economy without property rights, where order requires perpetual policing and does not scale. Drawing on institutional economics (Coase, Alchian, Cheung), capability mutual exclusivity, and competitive cost discovery, we propose alignment as institutional design: the designer specifies internal transaction structures (module boundaries, competition topologies, cost-feedback loops) such that aligned behavior emerges as the lowest-cost strategy for each component. We identify three irreducible levels of human intervention (structural, parametric, monitorial) and show that this framework transforms alignment from a behavioral control problem into a political-economy problem. No institution eliminates self-interest or guarantees optimality; the best design makes misalignment costly, detectable, and correctable. We conclude that the proper goal is institutional robustness: a dynamic, self-correcting process under human oversight, not perfection. This work provides the normative foundation for the Wuxing resource-competition mechanisms in companion papers. Keywords: AI alignment, institutional design, transaction costs, property rights, resource competition, behavioral correction, RLHF, cost truthfulness, modular architecture, correctable alignment


【39】Gradient Descent's Last Iterate is Often (slightly) Suboptimal
Link: https://arxiv.org/abs/2604.13870

Authors: Guy Kornowski, Ohad Shamir
Abstract: We consider the well-studied setting of minimizing a convex Lipschitz function using either gradient descent (GD) or its stochastic variant (SGD), and examine the last iterate convergence. By now, it is known that standard stepsize choices lead to a last iterate convergence rate of $\log T/\sqrt{T}$ after $T$ steps. A breakthrough result of Jain et al. [2019] recovered the optimal $1/\sqrt{T}$ rate by constructing a non-standard stepsize sequence. However, this sequence requires choosing $T$ in advance, as opposed to common stepsize schedules which apply for any time horizon. Moreover, Jain et al. conjectured that without prior knowledge of $T$, no stepsize sequence can ensure the optimal error for SGD's last iterate, a claim that has so far remained unproven. We prove this conjecture, and in fact show that even in the noiseless case of GD, it is impossible to avoid an excess poly-log factor in $T$ when considering an anytime last iterate guarantee. Our proof further suggests that such (slightly) suboptimal stopping times are unavoidably common.


【40】Covariance-adapting algorithm for semi-bandits with application to sparse rewards
Link: https://arxiv.org/abs/2604.13738

Authors: Pierre Perrault, Vianney Perchet, Michal Valko
Comments: Published at Conference on Learning Theory (COLT) 2020
Abstract: We investigate stochastic combinatorial semi-bandits, where the entire joint distribution of outcomes impacts the complexity of the problem instance (unlike in standard bandits). Typical distributions considered depend on specific parameter values, whose prior knowledge is required in theory but quite difficult to estimate in practice; an example is the commonly assumed sub-Gaussian family. We alleviate this issue by instead considering a new general family of sub-exponential distributions, which contains bounded and Gaussian ones. We prove a new lower bound on the expected regret on this family that is parameterized by the unknown covariance matrix of outcomes, a tighter quantity than the sub-Gaussian matrix. We then construct an algorithm that uses covariance estimates, and provide a tight asymptotic analysis of the regret. Finally, we apply and extend our results to the family of sparse outcomes, which has applications in many recommender systems.


【41】node2vec or triangle-biased random walks: stationarity, regularity & recurrence
Link: https://arxiv.org/abs/2604.13681

Authors: Luca Avena, Gianmarco Bet, Lars Schroeder, Clara Stegehuis
Comments: 24 pages, 4 figures
Abstract: The node2vec random walk is a non-Markovian random walk on the vertex set of a graph, widely used for network embedding and exploration. This random walk model is defined in terms of three parameters which control the probability of, respectively, backtracking moves, moves within triangles, and moves to the remaining neighboring nodes. From a mathematical standpoint, the node2vec random walk is a nontrivial generalization of the non-backtracking random walk and thus belongs to the class of second-order Markov chains. Despite its widespread use in applications, little is known about its long-run behavior. The goal of this paper is to begin exploring its fundamental properties on arbitrary graphs. To this aim, we show how lifting the node2vec random walk to the state spaces of directed edges and directed wedges yields two distinct Markovian representations which are key for its asymptotic analysis. Using these representations, we find mild sufficient conditions on the underlying finite or infinite graph to guarantee ergodicity, reversibility, recurrence and characterization of the invariant measure. As we discuss, the behavior of the node2vec random walk is drastically different compared to the non-backtracking random walk. While the latter simplifies on arbitrary graphs when using its natural edge Markovian representation thanks to bistochasticity, the former simplifies on regular graphs when using its natural wedge Markovian representation. Remarkably, this representation reveals that a graph is regular if and only if a certain weighted Eulerianity condition holds.
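A minimal sketch of one step of the second-order walk described in this abstract: the next node depends on both the current and the previous node, with separate weights for backtracking, triangle-closing, and all other moves. The three weight names are hypothetical; in the original node2vec formulation these roles are played by $1/p$, $1$, and $1/q$.

```python
def node2vec_step_probs(graph, prev, cur, w_back, w_triangle, w_other):
    """Transition probabilities for one step of a second-order walk.

    graph: dict mapping each node to the set of its neighbours (undirected).
    prev, cur: the previous and current nodes of the walk.
    w_back, w_triangle, w_other: hypothetical names for the three parameters.
    """
    weights = {}
    for nxt in graph[cur]:
        if nxt == prev:
            weights[nxt] = w_back        # backtracking move
        elif nxt in graph[prev]:
            weights[nxt] = w_triangle    # prev, cur, nxt form a triangle
        else:
            weights[nxt] = w_other       # any remaining neighbour
    total = sum(weights.values())
    return {node: w / total for node, w in weights.items()}


# Toy graph: triangle a-b-c with a pendant node d attached to b.
graph = {
    "a": {"b", "c"},
    "b": {"a", "c", "d"},
    "c": {"a", "b"},
    "d": {"b"},
}
probs = node2vec_step_probs(graph, prev="a", cur="b",
                            w_back=0.5, w_triangle=1.0, w_other=2.0)
```

Because the distribution is keyed on the pair (prev, cur), the walk is Markovian on directed edges rather than on vertices, which is the lifting the abstract refers to.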


【42】Irregularly Sampled Time Series Interpolation for Binary Evolution Simulations Using Dynamic Time Warping
Link: https://arxiv.org/abs/2604.13604

Authors: Ugur Demir, Philipp M. Srivastava, Aggelos Katsaggelos, Vicky Kalogera, Santiago L. Tapia, Manuel Ballester, Shamal Lalvani, Patrick Koller, Jeff J. Andrews, Seth Gossage, Max M. Briel, Elizabeth Teng
Comments: 25 pages, 11 figures. Submitted to ApJ
Abstract: Binary stellar evolution simulations are computationally expensive. Stellar population synthesis relies on these detailed evolution models at a fundamental level. Producing thousands of such models requires hundreds of CPU hours, but stellar track interpolation provides one approach to significantly reduce this computational cost. Although single-star track interpolation is straightforward, stellar interactions in binary systems introduce significant complexity to binary evolution, making traditional single-track interpolation methods inapplicable. Binary tracks present fundamentally different challenges compared to single stars, which possess relatively straightforward evolutionary phases identifiable through distinct physical properties. Binary systems are complicated by mutual interactions that can dramatically alter evolutionary trajectories and introduce discontinuities difficult to capture through standard interpolation. In this work, we introduce a novel approach for track alignment and iterative track averaging based on Dynamic Time Warping to address misalignments between neighboring tracks. Our method computes a single shared warping path across all physical parameters simultaneously, placing them on a consistent temporal grid that preserves the causal relationships between parameters. We demonstrate that this joint-alignment strategy maintains key physical relationships such as the Stefan-Boltzmann law in the interpolated tracks. Our comprehensive evaluation across multiple binary configurations demonstrates that proper temporal alignment is crucial for track interpolation methods. The proposed method consistently outperforms existing approaches and enables the efficient generation of more accurate binary population samples for astrophysical studies.
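A minimal sketch of the shared-warping-path idea in this abstract, using textbook Dynamic Time Warping: every time step carries all physical parameters at once, so a single path aligns them jointly instead of one path per parameter. The squared Euclidean step cost, the function name, and the toy tracks are assumptions for illustration; the authors' alignment and averaging procedure may differ.

```python
def shared_dtw_path(track_a, track_b):
    """Align two multivariate tracks with one warping path.

    Each track is a list of time steps; each step is a tuple of parameters.
    The step cost is the squared Euclidean distance over all parameters
    jointly (an assumption made for this sketch).
    """
    n, m = len(track_a), len(track_b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = sum((x - y) ** 2 for x, y in zip(track_a[i - 1], track_b[j - 1]))
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])

    # Backtrack from the end to recover the index pairs on the optimal path.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = min(
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        )
        i, j = step[1], step[2]
    path.reverse()
    return cost[n][m], path


# Toy tracks with two parameters per step; b repeats its first step.
a = [(0.0, 1.0), (1.0, 2.0), (2.0, 3.0)]
b = [(0.0, 1.0), (0.0, 1.0), (1.0, 2.0), (2.0, 3.0)]
distance, path = shared_dtw_path(a, b)
```

Each pair in `path` matches one step of `a` to one step of `b` for all parameters together, which is what keeps relationships between parameters intact after alignment.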


【43】Robust Low-Rank Tensor Completion based on M-product with Weighted Correlated Total Variation and Sparse Regularization
Link: https://arxiv.org/abs/2604.13525

Authors: Biswarup Karmakar, Ratikanta Behera
Comments: 32 pages
Abstract: The robust low-rank tensor completion problem addresses the challenge of recovering corrupted high-dimensional tensor data with missing entries, outliers, and sparse noise commonly found in real-world applications. Existing methodologies have encountered fundamental limitations due to their reliance on uniform regularization schemes, particularly the tensor nuclear norm and $\ell_1$ norm regularization approaches, which indiscriminately apply equal shrinkage to all singular values and sparse components, thereby compromising the preservation of critical tensor structures. The proposed tensor weighted correlated total variation (TWCTV) regularizer addresses these shortcomings through an $M$-product framework that combines a weighted Schatten-$p$ norm on gradient tensors for low-rankness with smoothness enforcement and weighted sparse components for noise suppression. The proposed weighting scheme adaptively reduces the thresholding level to preserve both dominant singular values and sparse components, thus improving the reconstruction of critical structural elements and nuanced details in the recovered signal. Through a systematic algorithmic approach, we introduce an enhanced alternating direction method of multipliers (ADMM) that offers both computational efficiency and theoretical substantiation, with convergence properties comprehensively analyzed within the $M$-product framework. Comprehensive numerical evaluations across image completion, denoising, and background subtraction tasks validate the superior performance of this approach relative to established benchmark methods.


【44】Estimating Continuous Treatment Effects with Two-Stage Kernel Ridge Regression
Link: https://arxiv.org/abs/2604.13410

Authors: Seok-Jin Kim, Kaizheng Wang
Abstract: We study the problem of estimating the effect function for a continuous treatment, which maps each treatment value to a population-averaged outcome. A central challenge in this setting is confounding: treatment assignment often depends on covariates, creating selection bias that makes direct regression of the response on treatment unreliable. To address this issue, we propose a two-stage kernel ridge regression method. In the first stage, we learn a model for the response as a function of both treatment and covariates; in the second stage, we use this model to construct pseudo-outcomes that correct for distribution shift, and then fit a second model to estimate the treatment effect. Although the response varies with both treatment and covariates, the induced effect function obtained by averaging over covariates is typically much simpler, and our estimator adapts to this structure. Furthermore, we introduce a fully data-driven model selection procedure that achieves provable adaptivity to both the unknown degree of overlap and the regularity (eigenvalue decay) of the underlying kernel.


Machine translation provided by Tencent Interactive Translation; for reference only.

Article URL: http://www.python88.com/topic/195168