cs.LG 方向,今日共计163篇
大模型相关(22篇)
【1】Generalization in LLM Problem Solving: The Case of the Shortest Path
标题:LLM问题求解中的泛化:以最短路径为例
链接:https://arxiv.org/abs/2604.15306
作者:Yao Tong,Jiayuan Ye,Anastasia Borovykh,Reza Shokri
摘要:语言模型能否进行系统性泛化仍存在激烈争论。然而,经验性能由训练数据、训练范式和推理时策略等多种因素共同塑造,使得失败难以解释。我们引入了一个基于最短路径规划的受控合成环境,这是一个典型的可组合序列优化问题。该设置可以清晰地分离这些因素,并支持两个正交的泛化轴:向未见过的地图的空间迁移,以及向更长视野问题的长度扩展。我们发现,模型表现出很强的空间迁移能力,但由于递归不稳定性,在长度扩展下始终失败。我们进一步分析了学习管道的不同阶段如何影响系统性问题求解:例如,数据覆盖决定了能力上限;强化学习提高了训练稳定性,但没有扩展这些上限;推理时扩展提高了性能,但无法挽救长度扩展的失败。
摘要:Whether language models can systematically generalize remains actively debated. Yet empirical performance is jointly shaped by multiple factors such as training data, training paradigms, and inference-time strategies, making failures difficult to interpret. We introduce a controlled synthetic environment based on shortest-path planning, a canonical composable sequential optimization problem. The setup enables clean separation of these factors and supports two orthogonal axes of generalization: spatial transfer to unseen maps and length scaling to longer-horizon problems. We find that models exhibit strong spatial transfer but consistently fail under length scaling due to recursive instability. We further analyze how distinct stages of the learning pipeline influence systematic problem-solving: for example, data coverage sets capability limits; reinforcement learning improves training stability but does not expand those limits; and inference-time scaling enhances performance but cannot rescue length-scaling failures.
【2】Diagnosing LLM Judge Reliability: Conformal Prediction Sets and Transitivity Violations
标题:诊断LLM裁判可靠性:保形预测集和传递性违反
链接:https://arxiv.org/abs/2604.15302
作者:Manan Gupta,Dhruv Kumar
备注:Under Review
摘要:LLM-as-judge框架越来越多地用于自动NLG评估,但其逐实例的可靠性仍然知之甚少。我们提出了一个应用于SummEval的双管齐下的诊断工具包:$\textbf{(1)}$传递性分析,揭示了被较低的总体违规率($\bar{\rho} = 0.8$-$4.1\%$)所掩盖的普遍的逐输入不一致性,其中$33$-$67\%$的文档至少出现一个有向3-循环;$\textbf{(2)}$在1-5级Likert评分上的分裂共形预测集,提供有理论保证的$\geq(1{-}\alpha)$覆盖率,并以集合宽度作为逐实例的可靠性指标($r_s = {+}0.576$,$N{=}1{,}918$,$p < 10^{-100}$,对所有裁判合并计算)。重要的是,预测集宽度显示出一致的跨裁判一致性($\bar{r} = 0.32$-$0.38$),表明它捕获的是文档级别的难度,而不是特定裁判的噪声。在四个裁判和四个评价标准上,两种诊断方法得出一致结论:评价标准比裁判更重要,其中相关性的评判最可靠(平均集合大小$\approx 3.0$),连贯性次之(平均集合大小$\approx 3.9$),而流畅性和一致性仍然不可靠(平均集合大小$\approx 4.9$)。我们公开了所有代码、提示词和缓存结果。
摘要:LLM-as-judge frameworks are increasingly used for automatic NLG evaluation, yet their per-instance reliability remains poorly understood. We present a two-pronged diagnostic toolkit applied to SummEval: $\textbf{(1)}$ a transitivity analysis that reveals widespread per-input inconsistency masked by low aggregate violation rates ($\bar{\rho} = 0.8$-$4.1\%$), with $33$-$67\%$ of documents exhibiting at least one directed 3-cycle; and $\textbf{(2)}$ split conformal prediction sets over 1-5 Likert scores providing theoretically-guaranteed $\geq(1{-}\alpha)$ coverage, with set width serving as a per-instance reliability indicator ($r_s = {+}0.576$, $N{=}1{,}918$, $p < 10^{-100}$, pooled across all judges). Critically, prediction set width shows consistent cross-judge agreement ($\bar{r} = 0.32$-$0.38$), demonstrating it captures document-level difficulty rather than judge-specific noise. Across four judges and four criteria, both diagnostics converge: criterion matters more than judge, with relevance judged most reliably (avg. set size $\approx 3.0$) and coherence moderately so (avg. set size $\approx 3.9$), while fluency and consistency remain unreliable (avg. set size $\approx 4.9$). We release all code, prompts, and cached results.
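A minimal sketch of split conformal prediction sets over 1-5 Likert scores may help make the idea concrete. The abstract does not say which nonconformity score is used; the choice below (one minus the probability the judge assigns to the human label) and the input format are assumptions for illustration only, and the function names are hypothetical.

```python
import math

def conformal_threshold(cal_probs, cal_labels, alpha=0.1):
    """Calibrate a score threshold on held-out (judge probabilities, human label) pairs.

    cal_probs[i][k] is the judge's probability for Likert score k + 1 (assumed format).
    """
    # Nonconformity score: one minus the probability given to the true label.
    scores = sorted(1.0 - probs[label - 1] for probs, label in zip(cal_probs, cal_labels))
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))  # 1-indexed finite-sample quantile rank
    if rank > n:
        return float("inf")  # too few calibration points: every score is kept
    return scores[rank - 1]

def prediction_set(probs, threshold):
    """Return every Likert score whose nonconformity does not exceed the threshold."""
    return [k + 1 for k, p in enumerate(probs) if 1.0 - p <= threshold]
```

The size of the returned set plays the role of the per-instance reliability indicator: a set close to all five scores signals an input on which the judge is not informative.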
【3】LLMs Gaming Verifiers: RLVR can Lead to Reward Hacking
标题:LLM钻验证器的空子:RLVR可能导致奖励黑客
链接:https://arxiv.org/abs/2604.15149
作者:Lukas Helff,Quentin Delfosse,David Steinmann,Ruben Härle,Hikaru Shindo,Patrick Schramowski,Wolfgang Stammer,Kristian Kersting,Felix Friedrich
摘要:随着具有可验证奖励的强化学习(RLVR)成为扩展LLM推理能力的主导范式,一种新的失败模式随之出现:LLM钻验证器的空子。我们在归纳推理任务中研究这一现象,在这类任务中模型必须归纳并输出逻辑规则。我们发现,经RLVR训练的模型会系统性地放弃规则归纳。它们不去学习可泛化的模式(例如“载有红色车厢的火车向东行驶”),而是枚举实例级标签,产生能够通过验证器、却没有捕获任务所需关系模式的输出。我们表明,这种行为并非理解上的失败,而是一种奖励黑客行为:只检查外延正确性的不完美验证器会放过假阳性。为了检测这类捷径,我们引入了同构扰动测试(IPT),它在外延验证和同构验证下评估同一个模型输出,后者要求在逻辑同构的任务下保持不变性。真正的规则归纳保持不变,而捷径策略则会失败。我们发现,捷径行为是经RLVR训练的推理模型(例如GPT-5、Olmo3)所特有的,在非RLVR模型(例如GPT-4o、GPT-4.5、Ministral)中不存在。此外,捷径的出现率随着任务复杂度和推理时计算量的增加而上升。在受控训练实验中,外延验证直接诱发捷径策略,而同构验证则将其消除。这些结果表明,RLVR不仅可能通过公开的操纵来激励奖励黑客行为,还可能通过利用验证器未能强制执行的内容来激励这种行为。
摘要:As Reinforcement Learning with Verifiable Rewards (RLVR) has become the dominant paradigm for scaling reasoning capabilities in LLMs, a new failure mode emerges: LLMs gaming verifiers. We study this phenomenon on inductive reasoning tasks, where models must induce and output logical rules. We find that RLVR-trained models systematically abandon rule induction. Instead of learning generalizable patterns (e.g., "trains carrying red cars go east"), they enumerate instance-level labels, producing outputs that pass verifiers without capturing the relational patterns required by the task. We show that this behavior is not a failure of understanding but a form of reward hacking: imperfect verifiers that check only extensional correctness admit false positives. To detect such shortcuts, we introduce Isomorphic Perturbation Testing (IPT), which evaluates a single model output under both extensional and isomorphic verification, where the latter enforces invariance under logically isomorphic tasks. While genuine rule induction remains invariant, shortcut strategies fail. We find that shortcut behavior is specific to RLVR-trained reasoning models (e.g., GPT-5, Olmo3) and absent in non-RLVR models (e.g., GPT-4o, GPT-4.5, Ministral). Moreover, shortcut prevalence increases with task complexity and inference-time compute. In controlled training experiments, extensional verification directly induces shortcut strategies, while isomorphic verification eliminates them. These results show that RLVR can incentivize reward hacking not only through overt manipulation but also by exploiting what the verifier fails to enforce.
【4】IUQ: Interrogative Uncertainty Quantification for Long-Form Large Language Model Generation
标题:IUQ:长格式大型语言模型生成的询问不确定性量化
链接:https://arxiv.org/abs/2604.15109
作者:Haozhi Fan,Jinhao Duan,Kaidi Xu
摘要:尽管大型语言模型(LLM)发展迅速,LLM生成中的不确定性量化仍是一个持续存在的挑战。虽然最近的方法通过限制LLM只产生简短或受约束的答案集取得了强大的性能,但许多现实世界的应用需要长文本和自由形式的文本生成。这种设置下的一个关键困难是,LLM经常产生语义连贯但事实不准确的回答,同时其底层语义是多方面的,语言结构也很复杂。为了应对这一挑战,本文提出了询问式不确定性量化(IUQ),这是一个利用样本间一致性和样本内忠实性来量化长文本LLM输出中不确定性的新框架。通过采用“先询问后回答”的范式,我们的方法为声明级不确定性和模型的忠实性提供了可靠的度量。在不同模型家族和模型规模上的实验结果表明,IUQ在两个广泛使用的长文本生成数据集上具有优越的性能。代码见:https://github.com/louisfanhz/IUQ
摘要:Despite the rapid advancement of Large Language Models (LLMs), uncertainty quantification in LLM generation is a persistent challenge. Although recent approaches have achieved strong performance by restricting LLMs to produce short or constrained answer sets, many real-world applications require long-form and free-form text generation. A key difficulty in this setting is that LLMs often produce responses that are semantically coherent yet factually inaccurate, while the underlying semantics are multifaceted and the linguistic structure is complex. To tackle this challenge, this paper introduces Interrogative Uncertainty Quantification (IUQ), a novel framework that leverages inter-sample consistency and intra-sample faithfulness to quantify the uncertainty in long-form LLM outputs. By utilizing an interrogate-then-respond paradigm, our method provides reliable measures of claim-level uncertainty and the model's faithfulness. Experimental results across diverse model families and model sizes demonstrate the superior performance of IUQ over two widely used long-form generation datasets. The code is available at https://github.com/louisfanhz/IUQ.
【5】Atropos: Improving Cost-Benefit Trade-off of LLM-based Agents under Self-Consistency with Early Termination and Model Hotswap
标题:Atropos:在提前终止和模型热交换的自一致性下改善基于LLM的代理的成本效益权衡
链接:https://arxiv.org/abs/2604.15075
作者:Naryeong Kim,Shin Yoo
备注:Will appear at ISSTA 2026
摘要:开放权重的小语言模型(SLM)可以以较低的财务成本提供更快的本地推理,但可能无法达到规模大若干数量级的商业大语言模型(LLM)的性能水平。因此,LLM的许多最新应用(如软件工程代理)往往只在较大的模型上进行评估,改善此类应用的成本效益权衡这一问题因而被忽视。本文提出了Atropos,一种预测性提前终止分析与热交换技术,旨在改善使用自一致性的基于LLM的代理的成本效益权衡。ATROPOS的核心组件是一个基于LLM推理结构特性的预测模型:在将多条代理推理路径合并为图表示之后,ATROPOS使用图卷积网络(GCN)来预测正在进行的推理最终能否成功。如果预测在源LLM上运行的代理任务实例将会失败,ATROPOS随后执行热交换,即将正在进行的推理上下文迁移到能力更强的目标LLM上:这是可行的,因为LLM上下文是无状态的。在三个最近的基于LLM的代理上进行的实证评估表明,ATROPOS可以在推理进行到中点时以0.85的准确率预测最终会失败的推理并提前终止。对这类推理进行LLM热交换,最多可以将其中27.57%转化为成功。因此,ATROPOS以低至23.9%的成本实现了闭源LLM性能的74.35%。
摘要:Open-weight Small Language Models (SLMs) can provide faster local inference at lower financial cost, but may not achieve the same performance level as commercial Large Language Models (LLMs) that are orders of magnitude larger. Consequently, many of the latest applications of LLMs, such as software engineering agents, tend to be evaluated on larger models only, leaving the issue of improving the cost-benefit trade-off of such applications neglected. This paper proposes Atropos, a predictive early-termination analysis and hotswap technique that aims to improve the cost-benefit trade-off for LLM-based agents that use self-consistency. The core component of ATROPOS is a predictive model based on structural properties of LLM inferences: after merging multiple agentic inference paths into a graph representation, ATROPOS uses a Graph Convolutional Network (GCN) to predict whether an ongoing inference will eventually succeed or not. If an agentic task instance running on the source LLM is predicted to fail, ATROPOS subsequently performs hotswapping, i.e., migrating the ongoing inference context onto the more capable target LLM: this is feasible because LLM contexts are stateless. An empirical evaluation of ATROPOS using three recent LLM-based agents shows that ATROPOS can predict early termination of eventually failing inferences with an accuracy of 0.85 at the midpoint of the inference. Hotswapping LLMs for such inferences can convert up to 27.57% of them to be successful. Consequently, ATROPOS achieves 74.35% of the performance of closed LLMs at as low as 23.9% of the cost.
【6】Route to Rome Attack: Directing LLM Routers to Expensive Models via Adversarial Suffix Optimization
标题:罗马之路攻击:通过对抗性后缀优化将LLM路由器引导到昂贵的模型
链接:https://arxiv.org/abs/2604.15022
作者:Haochun Tang,Yuliang Yan,Jiahua Lu,Huaxiao Liu,Enyan Dai
摘要:成本感知路由动态地将用户查询分派到能力不同的模型,以平衡性能和推理成本。然而,这种路由策略引入了一个新的安全问题:攻击者可能操纵路由器,使其始终选择昂贵的高能力模型。现有的路由攻击依赖于白盒访问或启发式提示,使得它们在现实世界的黑盒场景中无效。在这项工作中,我们提出了R$^2$A,旨在通过对抗性后缀优化将黑盒LLM路由器误导到昂贵的模型。具体地,R$^2$A部署一个混合集成代理路由器来模仿黑盒路由器,并进一步针对这一基于集成的代理路由器改造了后缀优化算法。在多个开源和商业路由系统上进行的大量实验表明,R$^2$A在不同分布的查询上显著提高了路由到昂贵模型的比率。代码和示例:https://github.com/thcxiker/R2A-Attack
摘要:Cost-aware routing dynamically dispatches user queries to models of varying capability to balance performance and inference cost. However, the routing strategy introduces a new security concern that adversaries may manipulate the router to consistently select expensive high-capability models. Existing routing attacks depend on either white-box access or heuristic prompts, rendering them ineffective in real-world black-box scenarios. In this work, we propose R$^2$A, which aims to mislead black-box LLM routers to expensive models via adversarial suffix optimization. Specifically, R$^2$A deploys a hybrid ensemble surrogate router to mimic the black-box router. A suffix optimization algorithm is further adapted for the ensemble-based surrogate. Extensive experiments on multiple open-source and commercial routing systems demonstrate that R$^2$A significantly increases the routing rate to expensive models on queries of different distributions. Code and examples: https://github.com/thcxiker/R2A-Attack.
【7】Towards Faster Language Model Inference Using Mixture-of-Experts Flow Matching
标题:使用专家混合流匹配实现更快的语言模型推理
链接:https://arxiv.org/abs/2604.15009
作者:Aihua Li
摘要:流匹配保留了扩散模型的生成质量,同时实现了明显更快的推理,使其成为生成建模中一个引人注目的范式。然而,当应用于语言建模时,它在表示具有不规则几何形状(例如各向异性和多模态)的复杂潜在分布方面表现出根本性的局限。为了应对这些挑战,我们提出了专家混合流匹配(MoE-FM)框架,它通过将潜在空间中复杂的全局传输几何分解为局部专门化的向量场来对其进行刻画。基于MoE-FM,我们开发了一种名为YAN的非自回归(NAR)语言建模方法,并分别用Transformer和Mamba架构进行了实例化。在多个下游任务中,YAN的生成质量与自回归(AR)语言模型和基于扩散的NAR语言模型相当,同时最少只需三个采样步骤。这相比AR基线带来了$40\times$的加速,相比扩散语言模型带来了最高$10^3\times$的加速,展示了在语言建模上的显著效率优势。
摘要:Flow matching retains the generation quality of diffusion models while enabling substantially faster inference, making it a compelling paradigm for generative modeling. However, when applied to language modeling, it exhibits fundamental limitations in representing complex latent distributions with irregular geometries, such as anisotropy and multimodality. To address these challenges, we propose a mixture-of-experts flow matching (MoE-FM) framework, which captures complex global transport geometries in latent space by decomposing them into locally specialized vector fields. Building on MoE-FM, we develop a non-autoregressive (NAR) language modeling approach, named YAN, instantiated with both Transformer and Mamba architectures. Across multiple downstream tasks, YAN achieves generation quality on par with both autoregressive (AR) and diffusion-based NAR language models, while requiring as few as three sampling steps. This yields a $40\times$ speedup over AR baselines and up to a $10^3\times$ speedup over diffusion language models, demonstrating substantial efficiency advantages for language modeling.
【8】Calibration-Gated LLM Pseudo-Observations for Online Contextual Bandits
标题:在线上下文Bandits的校准门控LLM伪观测
链接:https://arxiv.org/abs/2604.14961
作者:Maksim Pershin,Ivan Golovanov,Pavel Baltabaev,Natalia Trankova
摘要:上下文赌博机(contextual bandit)算法在冷启动阶段会遭受较高的遗憾,此时学习者没有足够的数据来区分好的臂和坏的臂。我们建议用LLM伪观测来增强Disjoint LinUCB:在每一轮之后,由一个大型语言模型预测未被选择的臂的反事实奖励,并将这些预测作为加权伪观测注入学习者。注入权重由校准门控的衰减时间表控制,该时间表通过指数移动平均来跟踪LLM在已选择的臂上的预测准确性;较高的校准误差会抑制LLM的影响,而准确的预测则在关键的早期轮次中获得更高的权重。我们在两个上下文赌博机环境上进行了评估:UCI Mushroom(2臂,非对称奖励)和MIND-small(5臂新闻推荐),并发现当配备任务特定的提示时,LLM伪观测在MIND上相对于纯LinUCB将累积遗憾降低了19%。然而,通用的反事实提示框架在两个环境中都增加了遗憾,这表明提示设计是主导因素,比衰减时间表或校准门控参数的选择更为重要。我们分析了校准门控在预测误差较小的领域上的失效模式,并为支配伪观测权重的偏差-方差权衡提供了理论动机。
摘要:Contextual bandit algorithms suffer from high regret during cold-start, when the learner has insufficient data to distinguish good arms from bad. We propose augmenting Disjoint LinUCB with LLM pseudo-observations: after each round, a large language model predicts counterfactual rewards for the unplayed arms, and these predictions are injected into the learner as weighted pseudo-observations. The injection weight is controlled by a calibration-gated decay schedule that tracks the LLM's prediction accuracy on played arms via an exponential moving average; high calibration error suppresses the LLM's influence, while accurate predictions receive higher weight during the critical early rounds. We evaluate on two contextual bandit environments - UCI Mushroom (2-arm, asymmetric rewards) and MIND-small (5-arm news recommendation) - and find that when equipped with a task-specific prompt, LLM pseudo-observations reduce cumulative regret by 19% on MIND relative to pure LinUCB. However, generic counterfactual prompt framing increases regret on both environments, demonstrating that prompt design is the dominant factor, more important than the choice of decay schedule or calibration gating parameters. We analyze the failure modes of calibration gating on domains with small prediction errors and provide a theoretical motivation for the bias-variance trade-off governing pseudo-observation weight.
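The calibration-gated decay schedule described above can be sketched roughly as follows. The abstract does not give the functional form, so everything here is an assumption for illustration: the class name, the geometric decay, the linear gate, and all default parameter values are hypothetical.

```python
class CalibrationGate:
    """Hypothetical weight schedule for LLM pseudo-observations."""

    def __init__(self, base_weight=1.0, decay=0.99, ema_beta=0.9, max_error=0.5):
        self.base_weight = base_weight  # weight at round 0 with perfect calibration
        self.decay = decay              # per-round geometric decay factor
        self.ema_beta = ema_beta        # smoothing factor of the error average
        self.max_error = max_error      # error level at which the gate closes fully
        self.ema_error = 0.0
        self.rounds = 0

    def observe(self, llm_prediction, reward):
        """Update the moving average with the LLM's error on the played arm."""
        error = abs(llm_prediction - reward)
        self.ema_error = self.ema_beta * self.ema_error + (1 - self.ema_beta) * error
        self.rounds += 1

    def weight(self):
        """Weight applied to pseudo-observations for unplayed arms this round."""
        gate = max(0.0, 1.0 - self.ema_error / self.max_error)
        return self.base_weight * (self.decay ** self.rounds) * gate
```

In a LinUCB-style learner, the returned weight would scale the pseudo-observation's contribution to each unplayed arm's statistics, so a poorly calibrated LLM is ignored while a well-calibrated one helps most in the early rounds.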
【9】Can LLMs Score Medical Diagnoses and Clinical Reasoning as well as Expert Panels?
标题:LLM能像专家小组一样好地对医学诊断和临床推理进行评分吗?
链接:https://arxiv.org/abs/2604.14892
作者:Amy Rouillard,Sitwala Mundia,Linda Camara,Michael Cameron Gramanie,Ziyaad Dangor,Ismail Kalla,Shabir A. Madhi,Kajal Morar,Marlvin T. Ncube,Haroon Saloojee,Bruce A. Bassett
摘要:使用专家临床医生小组评估医疗AI系统既昂贵又缓慢,这促使人们使用大型语言模型(LLM)作为替代裁决者。在这里,我们评估了一个由三个前沿AI模型组成的LLM陪审团,该陪审团对300个真实世界中等收入国家(MIC)医院病例的3333个诊断进行了评分。我们以专家临床医生小组和独立的人类重新评分小组的评估为基准来衡量模型性能。LLM生成的诊断和临床医生生成的诊断都在四个维度上进行评分:诊断、鉴别诊断、临床推理和负面治疗风险。对于其中每一项,我们评估了评分差异、评分者间一致性、评分稳定性、严重安全性错误以及事后校准的影响。我们发现:(i)未校准的LLM陪审团分数系统性地低于临床医生小组的分数;(ii)LLM陪审团保持了序数一致性,并且与主要专家小组的一致性优于人类专家重新评分小组;(iii)与人类专家重新评分小组相比,LLM陪审团模型出现严重错误的概率更低;(iv)LLM陪审团与主要专家小组的排名高度一致。我们发现,LLM陪审团与AI模型诊断相结合,可用于识别错误风险高的病房诊断,从而实现有针对性的专家审查并提高小组效率;(v)LLM陪审团模型没有表现出自我偏好偏差:它们对自己的底层模型或来自同一供应商的模型所生成的诊断的评分,并不比对其他模型所生成的诊断的评分更有利(或更不利)。最后,我们证明了使用保序回归进行LLM陪审团校准可以提高与人类专家小组评估的一致性。总之,这些结果提供了令人信服的证据,表明经过校准的多模型LLM陪审团可以作为医学AI基准测试中专家临床医生评估的值得信赖且可靠的替代。
摘要:Evaluating medical AI systems using expert clinician panels is costly and slow, motivating the use of large language models (LLMs) as alternative adjudicators. Here, we evaluate an LLM jury composed of three frontier AI models scoring 3333 diagnoses on 300 real-world middle-income country (MIC) hospital cases. Model performance was benchmarked against expert clinician panel and independent human re-scoring panel evaluations. Both LLM and clinician-generated diagnoses are scored across four dimensions: diagnosis, differential diagnosis, clinical reasoning and negative treatment risk. For each of these, we assess scoring difference, inter-rater agreement, scoring stability, severe safety errors and the effect of post-hoc calibration. We find that: (i) the uncalibrated LLM jury scores are systematically lower than clinician panel scores; (ii) the LLM jury preserves ordinal agreement and exhibits better concordance with the primary expert panels than the human expert re-score panels do; (iii) the probability of severe errors is lower in LLM jury models compared to the human expert re-score panels; (iv) the LLM jury shows excellent agreement with primary expert panels' rankings. We find that the LLM jury combined with AI model diagnoses can be used to identify ward diagnoses at high risk of error, enabling targeted expert review and improved panel efficiency; (v) LLM jury models show no self-preference bias. They did not score diagnoses generated by their own underlying model or models from the same vendor more (or less) favourably than those generated by other models. Finally, we demonstrate that LLM jury calibration using isotonic regression improves alignment with human expert panel evaluations. Together, these results provide compelling evidence that a calibrated, multi-model LLM jury can serve as a trustworthy and reliable proxy for expert clinician evaluation in medical AI benchmarking.
【10】Reasoning Dynamics and the Limits of Monitoring Modality Reliance in Vision-Language Models
标题:视觉语言模型中推理动力学和监控情态依赖的局限性
链接:https://arxiv.org/abs/2604.14888
作者:Danae Sánchez Villegas,Samuel Lewis-Lim,Nikolaos Aletras,Desmond Elliott
摘要:视觉语言模型(VLM)的最新进展带来了推理能力,但这些能力如何展开并整合视觉和文本信息仍不清楚。我们分析了18个VLM的推理动态,涵盖来自两个不同模型家族的指令微调模型和推理训练模型。我们跟踪思维链(CoT)过程中的置信度,衡量推理的纠正效果,并评估中间推理步骤的贡献。我们发现,模型容易出现回答惯性,即早期对某一预测的承诺在推理步骤中被不断强化,而不是被修正。虽然经过推理训练的模型显示出更强的纠正行为,但其增益取决于模态条件,范围从文本主导到仅视觉的设置。通过使用误导性文本线索的受控干预,我们表明,即使在视觉证据充分的情况下,模型也始终会受到这些线索的影响,并评估了这种影响能否从CoT中恢复。虽然这种影响可能出现在CoT中,但其可检测性因模型而异,并取决于所监测的内容。经过推理训练的模型更有可能明确提及这些线索,但它们更长、更流畅的CoT仍可能看起来以视觉为依据,而实际上遵循的是文本线索,从而掩盖了模态依赖。相比之下,指令微调模型较少明确提及线索,但它们较短的推理轨迹暴露出与视觉输入的不一致。综上所述,这些发现表明,对于不同模态如何驱动VLM的决策,CoT只提供了部分视图,这对多模态系统的透明度和安全性具有重要意义。
摘要:Recent advances in vision language models (VLMs) offer reasoning capabilities, yet how these unfold and integrate visual and textual information remains unclear. We analyze reasoning dynamics in 18 VLMs covering instruction-tuned and reasoning-trained models from two different model families. We track confidence over Chain-of-Thought (CoT), measure the corrective effect of reasoning, and evaluate the contribution of intermediate reasoning steps. We find that models are prone to answer inertia, in which early commitments to a prediction are reinforced, rather than revised during reasoning steps. While reasoning-trained models show stronger corrective behavior, their gains depend on modality conditions, from text-dominant to vision-only settings. Using controlled interventions with misleading textual cues, we show that models are consistently influenced by these cues even when visual evidence is sufficient, and assess whether this influence is recoverable from CoT. Although this influence can appear in the CoT, its detectability varies across models and depends on what is being monitored. Reasoning-trained models are more likely to explicitly refer to the cues, but their longer and fluent CoTs can still appear visually grounded while actually following textual cues, obscuring modality reliance. In contrast, instruction-tuned models refer to the cues less explicitly, but their shorter traces reveal inconsistencies with the visual input. Taken together, these findings indicate that CoT provides only a partial view of how different modalities drive VLM decisions, with important implications for the transparency and safety of multimodal systems.
【11】Does RL Expand the Capability Boundary of LLM Agents? A PASS@(k,T) Analysis
标题:RL是否扩大了LLM代理的能力边界?PASS@(k,T)分析
链接:https://arxiv.org/abs/2604.14877
作者:Zhiyuan Zhai,Wenjing Yan,Xiaodan Shao,Xin Wang
摘要:强化学习是真正扩展了LLM代理的能力,还是仅仅让它们更可靠?对于静态推理,最近的工作给出的答案是后者:基础模型和RL模型的pass@k曲线在k较大时收敛。我们探究这一结论是否适用于代理式工具使用,在这种场景下,T轮交互使得重采样无法恢复的组合策略成为可能。我们引入PASS@(k,T),这是一个二维度量,同时改变采样预算k和交互深度T,从而将能力扩展与效率提升区分开来。我们的主要发现是,与静态推理的结果相反,工具使用RL确实扩大了能力边界:RL代理的通过率曲线高于基础模型的曲线,并且差距在k较大时扩大而不是收敛。这种扩展是组合式、顺序式信息收集任务所特有的;在较简单的任务上,RL的表现与先前工作的预测一致。在训练数据匹配的条件下,监督微调会使同样的组合任务上的能力边界倒退,从而将自主探索确定为因果因素。机制分析表明,RL将基础策略分布重新加权到其下游推理更常得出正确答案的那个子集上,改进主要集中在代理如何整合检索到的信息上。这些结果调和了对LLM上RL的乐观和悲观解读:二者都是正确的,只是适用于不同的任务类型。
摘要:Does reinforcement learning genuinely expand what LLM agents can do, or merely make them more reliable? For static reasoning, recent work answers the second: base and RL pass@k curves converge at large k. We ask whether this holds for agentic tool use, where T rounds of interaction enable compositional strategies that re-sampling cannot recover. We introduce PASS@(k,T), a two-dimensional metric that jointly varies sampling budget k and interaction depth T, separating capability expansion from efficiency improvement. Our main finding is that, contrary to the static-reasoning result, tool-use RL genuinely enlarges the capability boundary: the RL agent's pass-curve pulls above the base model's and the gap widens at large k rather than converging. The expansion is specific to compositional, sequential information gathering; on simpler tasks RL behaves as prior work predicts. Under matched training data, supervised fine-tuning regresses the boundary on the same compositional tasks, isolating self-directed exploration as the causal factor. Mechanism analysis shows RL reweights the base strategy distribution toward the subset whose downstream reasoning more often yields a correct answer, with the improvement concentrated on how the agent integrates retrieved information. These results reconcile optimistic and pessimistic readings of RL for LLMs: both are correct, on different task types.
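One way to picture a two-dimensional PASS@(k,T) grid is to compute the standard unbiased pass@k estimator separately for each interaction depth T. The abstract does not define the metric formally, so the sketch below is an assumption: it supposes that for each task we know how many of n sampled rollouts succeed when capped at T rounds, and the function names are hypothetical.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples of which c are correct."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

def pass_grid(success_counts, n, ks):
    """Build {(k, T): pass@k} from success_counts[T], the number of the n
    rollouts that succeed within T interaction rounds (assumed input format)."""
    return {(k, T): pass_at_k(n, c, k) for T, c in success_counts.items() for k in ks}
```

Reading the grid along k at fixed T shows the effect of resampling, while reading it along T at fixed k shows the effect of deeper interaction, which is the separation the abstract describes.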
【12】Adaptive Test-Time Compute Allocation for Reasoning LLMs via Constrained Policy Optimization
标题:通过约束策略优化为推理LLM实现自适应测试时计算分配
链接:https://arxiv.org/abs/2604.14853
作者:Zhiyuan Zhai,Bingcong Li,Bingnan Xiao,Ming Li,Xin Wang
摘要:测试时计算扩展,即在推理过程中通过重复采样、搜索或扩展推理来花费额外计算的做法,已成为提高大型语言模型性能的有力杠杆。然而,在有限的推理预算下部署这些技术需要做出一个当前系统在很大程度上忽略的决定:哪些输入值得更多的计算,哪些可以低成本地回答?我们将其形式化为一个约束优化问题(在平均计算预算的约束下最大化期望准确率),并使用两阶段的Solve-then-Learn管道来求解。在求解阶段,拉格朗日松弛将全局约束分解为逐实例的子问题,每个子问题都有一个闭式的oracle动作,在准确率与成本之间进行最优定价。我们证明了所诱导的成本关于对偶变量是单调的,从而可以通过二分搜索精确地达到预算目标。在学习阶段,训练一个轻量级分类器从廉价的输入特征中预测oracle动作,从而将分配规则摊销以用于实时部署。我们证明了所学策略的任务级遗憾以其模仿误差乘以最坏情况下的逐实例差距为上界,从而给出了从约束推理到监督分类的简洁归约。在MATH和GSM8K上使用三个LLM(DeepSeek-V3、GPT-4o-mini、Qwen2.5-7B)进行的实验表明,我们的方法始终优于均匀分配和启发式分配基线,在匹配的预算约束下,在MATH上实现了最高12.8%的相对准确率提升,同时以超过91%的模仿准确率紧密跟踪拉格朗日oracle上界。
摘要:Test-time compute scaling, the practice of spending extra computation during inference via repeated sampling, search, or extended reasoning, has become a powerful lever for improving large language model performance. Yet deploying these techniques under finite inference budgets requires a decision that current systems largely ignore: which inputs deserve more compute, and which can be answered cheaply? We formalize this as a constrained optimization problem (maximize expected accuracy subject to an average compute budget) and solve it with a two-stage Solve-then-Learn pipeline. In the solve stage, Lagrangian relaxation decomposes the global constraint into per-instance sub-problems, each admitting a closed-form oracle action that optimally prices accuracy against cost. We prove that the induced cost is monotone in the dual variable, enabling exact budget targeting via binary search. In the learn stage, a lightweight classifier is trained to predict oracle actions from cheap input features, amortizing the allocation rule for real-time deployment. We establish that the task-level regret of the learned policy is bounded by its imitation error times the worst-case per-instance gap, yielding a clean reduction from constrained inference to supervised classification. Experiments on MATH and GSM8K with three LLMs (DeepSeek-V3, GPT-4o-mini, Qwen2.5-7B) show that our method consistently outperforms uniform and heuristic allocation baselines, achieving up to 12.8% relative accuracy improvement on MATH under matched budget constraints, while closely tracking the Lagrangian oracle upper bound with over 91% imitation accuracy.
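The solve stage can be sketched as follows, under stated assumptions: each instance has a small discrete set of compute levels, `acc[i][a]` is an estimated accuracy of level `a` on instance `i`, and `cost[a]` is its compute cost. These names, the tie-breaking rule, and the search bounds are hypothetical; the paper's actual oracle may differ.

```python
def oracle_actions(acc, cost, lam):
    """Per-instance oracle: maximize accuracy minus lam times cost.
    Ties go to the cheaper action so the mean cost never rises with lam."""
    return [
        max(range(len(cost)), key=lambda a: (row[a] - lam * cost[a], -cost[a]))
        for row in acc
    ]

def mean_cost(actions, cost):
    return sum(cost[a] for a in actions) / len(actions)

def solve_dual(acc, cost, budget, lam_hi=100.0, iters=50):
    """Bisect on the dual variable for the smallest price that meets the budget.
    Assumes lam_hi is large enough that every instance picks its cheapest action
    and that the budget is at least that cheapest cost."""
    lo, hi = 0.0, lam_hi
    for _ in range(iters):
        mid = (lo + hi) / 2
        if mean_cost(oracle_actions(acc, cost, mid), cost) > budget:
            lo = mid  # over budget: compute must be priced higher
        else:
            hi = mid  # feasible: try a lower price
    return hi
```

Because the mean cost is monotone in the dual variable, the bisection keeps `hi` feasible throughout, and the returned price yields an allocation within budget. The oracle actions at that price would then serve as labels for the lightweight classifier in the learn stage.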
【13】World-Value-Action Model: Implicit Planning for Vision-Language-Action Systems
标题:世界-价值-行动模型:视觉-语言-行动系统的隐性规划
链接:https://arxiv.org/abs/2604.14732
作者:Runze Li,Hongyin Zhang,Junxi Jin,Qixin Zeng,Zifeng Zhuang,Yiqi Tang,Shangke Lyu,Donglin Wang
摘要:视觉-语言-动作(VLA)模型已经成为一种很有前途的范式,用于构建将感知和语言落实为动作的具身智能体。然而,大多数现有方法依赖于直接的动作预测,缺乏对长视野轨迹进行推理并评估其后果的能力,这限制了在复杂决策任务中的性能。在这项工作中,我们提出了世界-价值-行动(WAV)模型,这是一个在VLA系统中实现隐式规划的统一框架。WAV模型不执行显式的轨迹优化,而是以视觉观察和语言指令为条件,学习未来轨迹的结构化潜在表示。一个学习得到的世界模型预测未来状态,而轨迹价值函数评估其长期效用。随后,动作生成被表述为该潜在空间中的推断,模型在其中逐步将概率质量集中到高价值且动力学可行的轨迹上。我们提供了一个理论视角,表明直接在动作空间中规划时,可行轨迹的概率会随着视野的增加而呈指数衰减。相比之下,潜在空间推断将搜索分布重塑到可行区域,从而实现高效的长视野决策。大量的仿真和真实世界实验表明,WAV模型始终优于最先进的方法,在任务成功率、泛化能力和鲁棒性方面取得了显著提升,特别是在长视野和组合场景中。
摘要:Vision-Language-Action (VLA) models have emerged as a promising paradigm for building embodied agents that ground perception and language into action. However, most existing approaches rely on direct action prediction, lacking the ability to reason over long-horizon trajectories and evaluate their consequences, which limits performance in complex decision-making tasks. In this work, we introduce the World-Value-Action (WAV) model, a unified framework that enables implicit planning in VLA systems. Rather than performing explicit trajectory optimization, the WAV model learns a structured latent representation of future trajectories conditioned on visual observations and language instructions. A learned world model predicts future states, while a trajectory value function evaluates their long-horizon utility. Action generation is then formulated as inference in this latent space, where the model progressively concentrates probability mass on high-value and dynamically feasible trajectories. We provide a theoretical perspective showing that planning directly in action space suffers from an exponential decay in the probability of feasible trajectories as the horizon increases. In contrast, latent-space inference reshapes the search distribution toward feasible regions, enabling efficient long-horizon decision making. Extensive simulations and real-world experiments demonstrate that the WAV model consistently outperforms state-of-the-art methods, achieving significant improvements in task success rate, generalization ability, and robustness, especially in long-horizon and compositional scenarios.
【14】CURaTE: Continual Unlearning in Real Time with Ensured Preservation of LLM Knowledge
标题:CURaTE:实时持续遗忘,确保LLM知识的保存
链接:https://arxiv.org/abs/2604.14644
作者:Seyun Bae,Seokhan Lee,Eunho Yang
备注:Accepted to Findings of ACL 2026
摘要:由于无法提前从大型语言模型的预训练数据中过滤掉所有潜在的问题数据,因此需要能够在训练后遗忘特定知识的方法。现有技术忽视了持续且即时采取行动的必要性,导致其效用随着更新的累积而下降,并使敏感信息长时间暴露。为了解决这些问题,我们提出了“实时持续遗忘并确保LLM知识的保存”(CURaTE)。我们的方法首先在一个专门设计的数据集上训练句子嵌入模型,使其能够形成清晰的决策边界,用于判断给定的输入提示是否对应于任何已存储的遗忘请求。然后,根据给定输入与遗忘请求的相似度来决定是回答还是返回拒绝响应。我们表明,即使采用如此简单的方法,CURaTE不仅比现有方法实现了更有效的遗忘,而且由于避免了修改语言模型参数,它还能在任意次数的更新中保持近乎完美的知识保存,并且是唯一能够实时进行持续遗忘的方法。
摘要:The inability to filter out in advance all potentially problematic data from the pre-training of large language models has given rise to the need for methods for unlearning specific pieces of knowledge after training. Existing techniques overlook the need for continuous and immediate action, causing them to suffer from degraded utility as updates accumulate and protracted exposure of sensitive information. To address these issues, we propose Continual Unlearning in Real Time with Ensured Preservation of LLM Knowledge (CURaTE). Our method begins by training a sentence embedding model on a dataset designed to enable the formation of sharp decision boundaries for determining whether a given input prompt corresponds to any stored forget requests. The similarity of a given input to the forget requests is then used to determine whether to answer or return a refusal response. We show that even with such a simple approach, not only does CURaTE achieve more effective forgetting than existing methods, but by avoiding modification of the language model parameters, it also maintains near perfect knowledge preservation over any number of updates and is the only method capable of continual unlearning in real-time.
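The answer-or-refuse decision described above can be illustrated with a small similarity gate. This is only a sketch under assumptions: cosine similarity, the fixed threshold of 0.8, the refusal text, and the `embed` and `answer_fn` callables are all hypothetical stand-ins, since the abstract does not specify them.

```python
import math

REFUSAL = "I can't help with that."  # hypothetical refusal text

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def should_refuse(prompt_vec, forget_vecs, threshold=0.8):
    """True when the prompt is close enough to any stored forget request."""
    return any(cosine(prompt_vec, f) >= threshold for f in forget_vecs)

def respond(prompt, embed, forget_vecs, answer_fn, threshold=0.8):
    """Refuse if the gate fires; otherwise defer to the unmodified language model."""
    if should_refuse(embed(prompt), forget_vecs, threshold):
        return REFUSAL
    return answer_fn(prompt)
```

Under this design a new forget request only appends one vector to `forget_vecs`, which is why the language model's parameters, and hence its retained knowledge, stay untouched across updates.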
【15】Predicting Post-Traumatic Epilepsy from Clinical Records using Large Language Model Embeddings
标题:使用大型语言模型嵌入根据临床记录预测创伤后癫痫
链接:https://arxiv.org/abs/2604.14547
作者:Wenhui Cui,Nicholas Swingle,Anand A. Joshi,Dileep Nair,Richard M. Leahy
摘要:目的:创伤后癫痫(PTE)是创伤性脑损伤(TBI)后发生的一种使人衰弱的神经系统疾病。由于临床数据的异质性、阳性病例有限以及对资源密集型神经影像数据的依赖,PTE的早期预测仍然具有挑战性。我们研究仅凭常规收集的急性期临床记录,能否借助基于语言模型的方法支持PTE的早期预测。方法:使用TRACK-TBI队列的一个精选子集,我们开发了一个自动化PTE预测框架,将预训练的大型语言模型(LLM)用作固定的特征提取器来编码临床记录。我们在分层交叉验证下使用梯度提升树分类器评估了表格特征、LLM生成的嵌入以及混合特征表示。结果:与单独使用表格特征相比,LLM嵌入通过捕获上下文临床信息实现了性能提升。最佳性能由结合表格特征和LLM嵌入的模态感知特征融合策略取得,AUC-ROC为0.892,AUPRC为0.798。急性创伤后癫痫发作、损伤严重程度、神经外科干预和ICU住院时间是预测性能的关键贡献因素。意义:这些发现表明,常规的急性期临床记录包含适合使用LLM嵌入结合梯度提升树分类器进行早期PTE风险预测的信息。这种方法是对基于影像的预测的一种有前景的补充。
摘要:Objective: Post-traumatic epilepsy (PTE) is a debilitating neurological disorder that develops after traumatic brain injury (TBI). Early prediction of PTE remains challenging due to heterogeneous clinical data, limited positive cases, and reliance on resource-intensive neuroimaging data. We investigate whether routinely collected acute clinical records alone can support early PTE prediction using language model-based approaches. Methods: Using a curated subset of the TRACK-TBI cohort, we developed an automated PTE prediction framework that implements pretrained large language models (LLMs) as fixed feature extractors to encode clinical records. Tabular features, LLM-generated embeddings, and hybrid feature representations were evaluated using gradient-boosted tree classifiers under stratified cross-validation. Results: LLM embeddings achieved performance improvements by capturing contextual clinical information compared to using tabular features alone. The best performance was achieved by a modality-aware feature fusion strategy combining tabular features and LLM embeddings, achieving an AUC-ROC of 0.892 and AUPRC of 0.798. Acute post-traumatic seizures, injury severity, neurosurgical intervention, and ICU stay are key contributors to the predictive performance. Significance: These findings demonstrate that routine acute clinical records contain information suitable for early PTE risk prediction using LLM embeddings in conjunction with gradient-boosted tree classifiers. This approach represents a promising complement to imaging-based prediction.
【16】Enhancing LLM-based Search Agents via Contribution Weighted Group Relative Policy Optimization
标题:通过贡献加权组相对策略优化增强基于LLM的搜索代理
链接:https://arxiv.org/abs/2604.14267
作者:Junzhe Wang,Zhiheng Xi,Yajie Yang,Hao Luo,Shihan Dou,Tao Gui,Qi Zhang
备注:Accepted to the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026), Main Conference
摘要:搜索代理通过访问预训练期间无法获得的最新信息和长尾信息,使大型语言模型(LLM)突破静态参数化知识的限制。虽然强化学习已被广泛用于训练此类代理,但现有方法面临关键限制:过程监督常常受到价值估计不稳定的影响,而结果监督由于奖励稀疏且仅在轨迹级别给出,难以进行信用分配。为了弥补这一差距,我们提出了贡献加权GRPO(CW-GRPO),这是一个将过程监督整合到组相对策略优化中的框架。CW-GRPO不直接优化过程奖励,而是使用LLM裁判评估每一轮搜索的检索效用和推理正确性,从而产生每轮的贡献分数。这些分数用于沿轨迹重新缩放基于结果的优势,从而在不牺牲优化稳定性的情况下实现细粒度的信用分配。在多个知识密集型基准上的实验表明,CW-GRPO在Qwen3-8B和Qwen3-1.7B上分别比标准GRPO高出5.0%和6.3%,并带来了更有效的搜索行为。进一步的分析表明,成功的轨迹在各轮之间表现出集中的贡献,为搜索代理任务提供了经验性的洞察。
摘要:Search agents extend Large Language Models (LLMs) beyond static parametric knowledge by enabling access to up-to-date and long-tail information unavailable during pretraining. While reinforcement learning has been widely adopted for training such agents, existing approaches face key limitations: process supervision often suffers from unstable value estimation, whereas outcome supervision struggles with credit assignment due to sparse, trajectory-level rewards. To bridge this gap, we propose Contribution-Weighted GRPO (CW-GRPO), a framework that integrates process supervision into group relative policy optimization. Instead of directly optimizing process rewards, CW-GRPO employs an LLM judge to assess the retrieval utility and reasoning correctness at each search round, producing per-round contribution scores. These scores are used to rescale outcome-based advantages along the trajectory, enabling fine-grained credit assignment without sacrificing optimization stability. Experiments on multiple knowledge-intensive benchmarks show that CW-GRPO outperforms standard GRPO by 5.0% on Qwen3-8B and 6.3% on Qwen3-1.7B, leading to more effective search behaviors. Additional analysis reveals that successful trajectories exhibit concentrated contributions across rounds, providing empirical insight into search agent tasks.
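The rescaling step described in this abstract can be illustrated with a small sketch. It is not the authors' implementation: the abstract does not say how contribution scores are normalized, so the mean-one weighting below, the function names, and the example scores are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standard GRPO-style advantage: z-score of each outcome reward in its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def contribution_weighted_advantages(rewards, contributions, eps=1e-8):
    """Spread each trajectory's outcome advantage over its search rounds.

    contributions[i][t] is a hypothetical judge score for round t of trajectory i.
    Weights are normalized to mean 1 per trajectory (an assumption), so the
    average per-round advantage equals the trajectory-level advantage.
    """
    advantages = group_relative_advantages(rewards, eps)
    weighted = []
    for adv, scores in zip(advantages, contributions):
        total = sum(scores)
        if total <= eps:
            # No usable judge signal: fall back to uniform credit.
            weights = [1.0] * len(scores)
        else:
            weights = [len(scores) * s / total for s in scores]
        weighted.append([adv * w for w in weights])
    return weighted


# Four sampled trajectories for one query: outcome rewards and per-round scores.
rewards = [1.0, 0.0, 1.0, 0.0]
contributions = [[0.9, 0.1], [0.5, 0.5], [0.0, 0.0], [0.2, 0.6, 0.2]]
per_round = contribution_weighted_advantages(rewards, contributions)
```

Because the weights average to one, the trajectory-level advantage is preserved in expectation while rounds the judge rates highly receive a larger share of the credit.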
【17】TOPCELL: Topology Optimization of Standard Cell via LLMs
标题:TOPCELL:通过LLM进行标准单元的布局优化
链接:https://arxiv.org/abs/2604.14237
作者:Zhan Song,Yu-Tung Liu,Chen Chen,Guoheng Sun,Jiaqi Yin,Chia-tung Ho,Ang Li,Haoxing Ren,Cunxi Yu
备注:Accepted to the 63rd ACM/IEEE Design Automation Conference (DAC 2026). 7 pages, 4 figures
摘要:晶体管拓扑优化是标准单元设计中的关键步骤,直接决定扩散共享效率和下游布线能力。然而,确定最佳拓扑结构仍然是一个持久的瓶颈,因为传统的穷举搜索方法变得计算上棘手的电路复杂性增加先进的节点。本文介绍了TOPCELL,一种新颖的和可扩展的框架,重新制定了高维拓扑探索作为一个生成任务,使用大型语言模型(LLM)。我们采用组相对策略优化(GRPO)来微调模型,使其拓扑优化策略与逻辑(电路)和空间(布局)约束。针对先进的2nm技术节点的工业流程内的实验结果表明,TOPCELL显着优于基础模型发现路由,物理感知拓扑。当集成到最先进的(SOTA)自动化流程中用于7nm库生成任务时,TOPCELL表现出强大的zero-shot泛化能力,并与穷举求解器的布局质量相匹配,同时实现了85.91倍的加速比。
摘要:Transistor topology optimization is a critical step in standard cell design, directly dictating diffusion sharing efficiency and downstream routability. However, identifying optimal topologies remains a persistent bottleneck, as conventional exhaustive search methods become computationally intractable with increasing circuit complexity in advanced nodes. This paper introduces TOPCELL, a novel and scalable framework that reformulates high-dimensional topology exploration as a generative task using Large Language Models (LLMs). We employ Group Relative Policy Optimization (GRPO) to fine-tune the model, aligning its topology optimization strategy with logical (circuit) and spatial (layout) constraints. Experimental results within an industrial flow targeting an advanced 2nm technology node demonstrate that TOPCELL significantly outperforms foundation models in discovering routable, physically-aware topologies. When integrated into a state-of-the-art (SOTA) automation flow for a 7nm library generation task, TOPCELL exhibits robust zero-shot generalization and matches the layout quality of exhaustive solvers while achieving an 85.91x speedup.
【18】MixAtlas: Uncertainty-aware Data Mixture Optimization for Multimodal LLM Midtraining
标题:MixAtlas:多模式LLM中期训练的不确定性感知数据混合优化
链接:https://arxiv.org/abs/2604.14198
作者:Bingbing Wen,Sirajul Salekin,Feiyang Kang,Bill Howe,Lucy Lu Wang,Javier Movellan,Manjot Bilkhu
摘要:域重新加权可以提高样本效率和下游泛化能力,但多模态中期训练的数据混合优化在很大程度上尚未探索。当前的多模态训练方法沿着单一维度调整混合,通常是数据格式或任务类型。我们介绍了MixAtlas,一种产生面向基准的数据配方的方法,这些配方可以被检查,调整和转移到新的语料库。MixAtlas沿着两个轴分解训练语料库:图像概念(通过CLIP嵌入发现的10个视觉领域集群)和任务监督(5个目标类型,包括字幕,OCR,定位(grounding),检测和VQA)。使用小代理模型(Qwen2-0.5B)与高斯过程代理和GP-UCB采集配对,MixAtlas搜索由此得到的混合空间,使用与基于回归的基线相同的代理预算,但发现性能更好的混合。我们在10个基准上进行评估,涵盖视觉理解,文档推理和多模态推理。在Qwen2-7B上,优化后的混合比最强基线平均性能提高了8.5%-17.6%;在Qwen2.5-7B上,增益为1.0%-3.3%。两种设置都能以最多少2倍的步数达到与基线相当的训练损失。在0.5B代理上发现的配方可以在Qwen模型家族中转移到7B规模的训练。
摘要:Domain reweighting can improve sample efficiency and downstream generalization, but data-mixture optimization for multimodal midtraining remains largely unexplored. Current multimodal training recipes tune mixtures along a single dimension, typically data format or task type. We introduce MixAtlas, a method that produces benchmark-targeted data recipes that can be inspected, adapted, and transferred to new corpora. MixAtlas decomposes the training corpus along two axes: image concepts (10 visual-domain clusters discovered via CLIP embeddings) and task supervision (5 objective types including captioning, OCR, grounding, detection, and VQA). Using small proxy models (Qwen2-0.5B) paired with a Gaussian-process surrogate and GP-UCB acquisition, MixAtlas searches the resulting mixture space with the same proxy budget as regression-based baselines but finds better-performing mixtures. We evaluate on 10 benchmarks spanning visual understanding, document reasoning, and multimodal reasoning. On Qwen2-7B, optimized mixtures improve average performance by 8.5%-17.6% over the strongest baseline; on Qwen2.5-7B, gains are 1.0%-3.3%. Both settings reach baseline-equivalent training loss in up to 2 times fewer steps. Recipes discovered on 0.5B proxies transfer to 7B-scale training across Qwen model families.
【19】Correcting Suppressed Log-Probabilities in Language Models with Post-Transformer Adapters
标题:使用后转换器适配器纠正语言模型中被抑制的日志概率
链接:https://arxiv.org/abs/2604.14174
作者:Bryan Sanchez
备注:12 pages, 3 figures, code at https://github.com/SolomonB14D3/qwen-adapter-correction
摘要:对齐调整的语言模型经常抑制政治敏感话题的事实对数概率,尽管保留了隐藏表示中的知识。我们发现,一个786 K参数(约为基础模型的0.02%)的后转换适配器,在冻结隐藏状态上训练,纠正了Qwen 3 - 4 B,8B和14 B中31个意识形态歧视事实的抑制。适配器记忆所有15个训练事实,并通过锚定训练将16个保留事实中的11- 39%概括为每个尺度5个随机分割,零知识回归。门控(SwiGLU)和非门控(线性瓶颈)衔接子都获得了相当的结果;两者都没有持续优于另一个(所有尺度下Fisher精确p > 0.09)。在指令模型上,适配器校正对数概率排序。当在生成期间应用于所有标记位置时,适配器产生不连贯的输出;然而,当仅应用于当前预测位置(仅最后位置)时,适配器产生连贯的、较少审查的文本。一个logit空间适配器操作后,令牌投影无法产生连贯的生成在任何应用程序模式,这表明隐藏状态的干预是正确的生成校正的水平。Apple MLX中以前未记录的静默梯度错误解释了这项工作早期迭代中的所有空结果:标准模式nn.value_and_grad(model,fn)(model.parameters())返回零梯度而没有错误;正确的模式nn.value_and_grad(model,fn)(model,data)解决了这个问题。我们提供了一个最小的复制和讨论其他适配器研究使用MLX的影响。
摘要:Alignment-tuned language models frequently suppress factual log-probabilities on politically sensitive topics despite retaining the knowledge in their hidden representations. We show that a 786K-parameter (approximately 0.02% of the base model) post-transformer adapter, trained on frozen hidden states, corrects this suppression on 31 ideology-discriminating facts across Qwen3-4B, 8B, and 14B. The adapter memorizes all 15 training facts and generalizes to 11--39% of 16 held-out facts across 5 random splits per scale, with zero knowledge regressions via anchored training. Both gated (SwiGLU) and ungated (linear bottleneck) adapters achieve comparable results; neither consistently outperforms the other (Fisher exact p > 0.09 at all scales). On instruct models, the adapter corrects log-probability rankings. When applied at all token positions during generation, the adapter produces incoherent output; however, when applied only at the current prediction position (last-position-only), the adapter produces coherent, less censored text. A logit-space adapter operating after token projection fails to produce coherent generation at any application mode, suggesting hidden-state intervention is the correct level for generation correction. A previously undocumented silent gradient bug in Apple MLX explains all null results in earlier iterations of this work: the standard pattern nn.value_and_grad(model, fn)(model.parameters()) returns zero gradients without error; the correct pattern nn.value_and_grad(model, fn)(model, data) resolves this. We provide a minimal reproduction and discuss implications for other adapter research using MLX.
【20】Can Large Language Models Detect Methodological Flaws? Evidence from Gesture Recognition for UAV-Based Rescue Operation Based on Deep Learning
标题:大型语言模型可以检测方法缺陷吗?基于深度学习的无人机救援手势识别
链接:https://arxiv.org/abs/2604.14161
作者:Domonkos Varga
摘要:可靠的评估在机器学习研究中至关重要,但方法论上的缺陷-特别是数据泄露-继续破坏报告结果的有效性。在这项工作中,我们研究大型语言模型(LLM)是否可以作为独立的分析代理,能够在已发表的研究中识别这些问题。作为一个案例研究,我们分析了一篇手势识别论文,该论文在一个以人为中心的小数据集上报告了近乎完美的准确性。我们首先表明,由于非独立的训练和测试分裂,评估协议与受试者级别的数据泄漏是一致的。然后,我们评估这个缺陷是否可以被六个最先进的LLM独立检测到,每个LLM都使用相同的提示在没有先前上下文的情况下分析原始论文。所有模型都一致地将评估识别为有缺陷的,并将报告的性能归因于非独立的数据划分,这些数据划分得到了重叠学习曲线、最小泛化差距和近乎完美的分类结果等指标的支持。这些研究结果表明,LLM可以检测常见的方法问题,仅基于已发表的文物。虽然不是最终的,但他们的一致意见突出了他们作为改善重现性和支持科学审计的补充工具的潜力。
摘要:Reliable evaluation is essential in machine learning research, yet methodological flaws, particularly data leakage, continue to undermine the validity of reported results. In this work, we investigate whether large language models (LLMs) can act as independent analytical agents capable of identifying such issues in published studies. As a case study, we analyze a gesture-recognition paper reporting near-perfect accuracy on a small, human-centered dataset. We first show that the evaluation protocol is consistent with subject-level data leakage due to non-independent training and test splits. We then assess whether this flaw can be detected independently by six state-of-the-art LLMs, each analyzing the original paper without prior context using an identical prompt. All models consistently identify the evaluation as flawed and attribute the reported performance to non-independent data partitioning, supported by indicators such as overlapping learning curves, minimal generalization gap, and near-perfect classification results. These findings suggest that LLMs can detect common methodological issues based solely on published artifacts. While not definitive, their consistent agreement highlights their potential as complementary tools for improving reproducibility and supporting scientific auditing.
【21】BitFlipScope: Scalable Fault Localization and Recovery for Bit-Flip Corruptions in LLMs
标题:BitFlipScope:LLM中位翻转损坏的可扩展故障定位和恢复
链接:https://arxiv.org/abs/2512.22174
作者:Muhammad Zeeshan Karamat,Sadman Saif,Christiana Chamon Garcia
备注:Accepted at the IEEE International Symposium on Hardware Oriented Security and Trust (HOST) 2026
摘要:在实际和安全关键设置中部署的大型语言模型(LLM)越来越容易受到硬件退化,宇宙辐射或故意故障注入攻击(如Rowhammer)引起的位翻转故障的影响。这些错误会悄悄地破坏内部参数,并可能导致不可预测或危险的模型行为。定位这些损坏是至关重要的:如果不确定受影响的区域,就不可能诊断退化的来源,应用有针对性的纠正措施,或者在不进行昂贵的微调或全面重新培训的情况下恢复模型功能。这项工作介绍了BitFlipScope,一个可扩展的,基于软件的框架,用于识别故障影响的区域内的Transformer架构下的两个部署方案。当干净的参考模型可用时,BitFlipScope会对输出、隐藏状态和内部激活进行差异分析,以检测指示损坏的异常行为,以查明或定位故障。当不存在参考模型时,它使用剩余路径扰动和损失敏感性分析来直接从损坏的模型推断故障影响区域。在这两种设置中,该框架不仅能够进行有效的故障诊断,而且还支持无需微调的轻量级性能恢复,为恢复损坏的模型提供了一条实用的途径。总之,这些功能使BitFlipScope成为在硬件易受攻击和对抗性环境中实现可靠,故障恢复LLM部署的重要一步。
摘要:Large Language Models (LLMs) deployed in practical and safety-critical settings are increasingly susceptible to bit-flip faults caused by hardware degradation, cosmic radiation, or deliberate fault-injection attacks such as Rowhammer. These faults silently corrupt internal parameters and can lead to unpredictable or dangerous model behavior. Localizing these corruptions is essential: without identifying the affected region, it is impossible to diagnose the source of degradation, apply targeted corrective measures, or restore model functionality without resorting to costly fine-tuning or full retraining. This work introduces BitFlipScope, a scalable, software-based framework for identifying fault-affected regions within transformer architectures under two deployment scenarios. When a clean reference model is available, BitFlipScope performs differential analysis of outputs, hidden states, and internal activations for detecting anomalous behavior indicative of corruption to pinpoint or localize faults. When no reference model exists, it uses residual-path perturbation and loss-sensitivity profiling to infer the fault-impacted region directly from the corrupted model. In both settings, the framework not only enables effective fault diagnosis but also supports lightweight performance recovery without fine-tuning, offering a practical path to restoring corrupted models. Together, these capabilities make BitFlipScope an important step toward trustworthy, fault-resilient LLM deployment in hardware-prone and adversarial environments.
【22】PolyBench: Benchmarking LLM Forecasting and Trading Capabilities on Live Prediction Market Data
标题:PolyBench:根据实时预测市场数据对LLM预测和交易能力进行基准测试
链接:https://arxiv.org/abs/2604.14199
作者:Pu Cheng,Juncheng Liu,Yunshen Long
备注:16 pages, 4 figures, 6 tables
摘要:根据实时市场信号预测现实世界的事件,需要系统在严格的时间纪律下将定性新闻与定量订单簿动态融合在一起--这是现有基准无法捕捉的挑战。我们介绍了PolyBench,这是一个来自Polymarket的多模态基准,记录了跨越4,997个事件的38,666个二元预测市场的时间点横截面,同步耦合每个快照与中央限价订单簿(CLOB)状态和实时新闻流。使用PolyBench,我们评估了七个最先进的大型语言模型-跨越开源和闭源系列-在2026年2月6日至12日收集的相同的时间戳锁定市场状态下生成36,165个预测。我们的多维框架通过逼真的订单簿执行模拟来评估方向准确性,我们提出的信心加权回报(CWR),年化收益率(APY)和夏普比率。结果显示出明显的性能差异:七个模型中只有两个实现了正的财务回报-MiMo-V2-Flash在17.6%的CWR和Gemini-3-Flash在6.2%的CWR -而其余五个尽管都有很高的信心,但还是出现了亏损。这些研究结果突出了表面语言流畅性与真实市场不确定性下的真正概率推理之间的差距,并建立了PolyBench作为未来LLM研究的防污染,以财务为基础的评估标准。我们的数据集和代码可在https://github.com/PolyBench/PolyBench上找到。
摘要:Predicting real-world events from live market signals demands systems that fuse qualitative news with quantitative order-book dynamics under strict temporal discipline -- a challenge existing benchmarks fail to capture. We present PolyBench, a multimodal benchmark derived from Polymarket that records point-in-time cross-sections of 38,666 binary prediction markets spanning 4,997 events, synchronously coupling each snapshot with a Central Limit Order Book (CLOB) state and a real-time news stream. Using PolyBench, we evaluate seven state-of-the-art Large Language Models -- spanning open- and closed-source families -- generating 36,165 predictions under identical, timestamp-locked market states collected between February 6 and 12, 2026. Our multidimensional framework assesses directional accuracy, our proposed Confidence-Weighted Return (CWR), Annualized Percentage Yield (APY), and Sharpe ratio via realistic order-book execution simulation. The results reveal a pronounced performance divergence: only two of seven models achieve positive financial returns -- MiMo-V2-Flash at 17.6% CWR and Gemini-3-Flash at 6.2% CWR -- while the remaining five incur losses despite uniformly high stated confidence. These findings highlight the gap between surface-level language fluency and genuine probabilistic reasoning under live market uncertainty, and establish PolyBench as a contamination-proof, financially-grounded evaluation standard for future LLM research. Our dataset and code are available at https://github.com/PolyBench/PolyBench.
Graph相关(图学习|图神经网络|图优化等)(5篇)
【1】How Embeddings Shape Graph Neural Networks: Classical vs Quantum-Oriented Node Representations
标题:嵌入如何塑造图神经网络:经典与面向量子的节点表示
链接:https://arxiv.org/abs/2604.15273
作者:Nouhaila Innan,Antonello Rosato,Alberto Marchisio,Muhammad Shafique
备注:6 pages. Accepted at IJCNN 2026
摘要:节点嵌入充当图神经网络的信息接口,但它们的经验影响经常在不匹配的骨干,分裂和训练预算下报告。本文提供了一个用于图分类的嵌入选择的受控基准,将经典基线与统一管道下的面向量子的节点表示进行比较。我们评估了两个经典的基线,以及面向量子的替代方案,包括电路定义的变分嵌入和量子启发的嵌入通过图形运算符和线性代数结构计算。所有变体都使用相同的主干、分层拆分、相同的优化和早期停止以及一致的指标进行训练和测试。在五个不同的TU数据集和通过目标分箱转换为分类的QM9上进行的实验显示了明确的数据集依赖性:面向量子的嵌入在结构驱动的基准上产生了最一致的收益,而具有有限节点属性的社交图仍然可以很好地服务于经典基线。该研究强调了在固定训练预算下归纳偏差,可训练性和稳定性之间的实际权衡,并为在图学习中选择面向量子的嵌入提供了可重复的参考点。
摘要:Node embeddings act as the information interface for graph neural networks, yet their empirical impact is often reported under mismatched backbones, splits, and training budgets. This paper provides a controlled benchmark of embedding choices for graph classification, comparing classical baselines with quantum-oriented node representations under a unified pipeline. We evaluate two classical baselines alongside quantum-oriented alternatives, including a circuit-defined variational embedding and quantum-inspired embeddings computed via graph operators and linear-algebraic constructions. All variants are trained and tested with the same backbone, stratified splits, identical optimization and early stopping, and consistent metrics. Experiments on five different TU datasets and on QM9 converted to classification via target binning show clear dataset dependence: quantum-oriented embeddings yield the most consistent gains on structure-driven benchmarks, while social graphs with limited node attributes remain well served by classical baselines. The study highlights practical trade-offs between inductive bias, trainability, and stability under a fixed training budget, and offers a reproducible reference point for selecting quantum-oriented embeddings in graph learning.
【2】Beyond the Laplacian: Doubly Stochastic Matrices for Graph Neural Networks
标题:超越拉普拉斯:图神经网络的双随机矩阵
链接:https://arxiv.org/abs/2604.15069
作者:Zhaobo Hu,Vincent Gauthier,Mehdi Naima
摘要:图神经网络(GNN)通常依赖于标准拉普拉斯或邻接矩阵来进行结构化消息传递。在这项工作中,我们用双随机图矩阵(DSM)取代传统的拉普拉斯算子,DSM来自修改后的拉普拉斯算子的逆,自然编码连续多跳接近度和严格的局部中心性。为了克服精确矩阵求逆难以承受的$O(n^3)$复杂度,我们首先利用截断的Neumann级数来可伸缩地近似DSM,这是我们提出的DsmNet的基础。此外,由于代数截断固有地导致概率质量泄漏,我们引入DsmNet-compensate。这种变体具有数学上严格的剩余质量补偿机制,该机制通过解析方式将截断的尾部质量重新注入自循环,严格恢复行随机性和结构优势。大量的理论和实证分析表明,我们的解耦架构以$O(K|E|)$时间高效运行,并通过限制Dirichlet能量衰减来有效地减轻过度平滑,从而在同配(homophilic)基准上提供稳健的经验验证。最后,我们建立了DSM在异配(heterophilic)拓扑结构上的理论边界,并证明了它作为图Transformers的连续结构编码的多功能性。
摘要:Graph Neural Networks (GNNs) conventionally rely on standard Laplacian or adjacency matrices for structural message passing. In this work, we substitute the traditional Laplacian with a Doubly Stochastic graph Matrix (DSM), derived from the inverse of the modified Laplacian, to naturally encode continuous multi-hop proximity and strict local centrality. To overcome the intractable $O(n^3)$ complexity of exact matrix inversion, we first utilize a truncated Neumann series to scalably approximate the DSM, which serves as the foundation for our proposed DsmNet. Furthermore, because algebraic truncation inherently causes probability mass leakage, we introduce DsmNet-compensate. This variant features a mathematically rigorous Residual Mass Compensation mechanism that analytically re-injects the truncated tail mass into self-loops, strictly restoring row-stochasticity and structural dominance. Extensive theoretical and empirical analyses demonstrate that our decoupled architectures operate efficiently in $O(K|E|)$ time and effectively mitigate over-smoothing by bounding Dirichlet energy decay, providing robust empirical validation on homophilic benchmarks. Finally, we establish the theoretical boundaries of the DSM on heterophilic topologies and demonstrate its versatility as a continuous structural encoding for Graph Transformers.
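A small worked sketch may help make the truncation and compensation steps concrete. It is illustrative only: the abstract does not define the "modified Laplacian", so the code below assumes $I + L = I + D - A$ for an undirected graph, and the function names are hypothetical. Under that assumption, $(I + D - A)^{-1} = \sum_{k \ge 0} M^k (I + D)^{-1}$ with $M = (I + D)^{-1} A$, and the row sums of the series truncated at $K$ fall short of 1 by exactly $(M^{K+1}\mathbf{1})_i$.

```python
def matmul(a, b):
    """Dense product of two square list-of-lists matrices."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def truncated_dsm(adj, num_terms):
    """Approximate (I + D - A)^{-1} using Neumann terms k = 0..num_terms."""
    n = len(adj)
    inv_diag = [1.0 / (1.0 + sum(row)) for row in adj]  # diagonal of (I + D)^{-1}
    m = [[inv_diag[i] * adj[i][j] for j in range(n)] for i in range(n)]  # M
    term = [[inv_diag[i] if i == j else 0.0 for j in range(n)] for i in range(n)]
    total = [row[:] for row in term]
    for _ in range(num_terms):
        term = matmul(m, term)  # next term: M^k (I + D)^{-1}
        total = [[total[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return total


def compensate(approx):
    """Re-inject the mass lost to truncation into the diagonal (self-loops)."""
    out = [row[:] for row in approx]
    for i, row in enumerate(approx):
        out[i][i] += 1.0 - sum(row)
    return out


# Path graph on three nodes: 0 - 1 - 2.
path_graph = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
approx = truncated_dsm(path_graph, num_terms=3)
fixed = compensate(approx)
```

This dense version costs $O(Kn^3)$ and is meant only to show the algebra; the $O(K|E|)$ figure quoted in the abstract would require applying the sparse operator to feature vectors rather than forming the matrix.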
【3】Learning Ad Hoc Network Dynamics via Graph-Structured World Models
标题:通过图结构化世界模型学习自组织网络动态
链接:https://arxiv.org/abs/2604.14811
作者:Can Karacelebi,Yusuf Talha Sahin,Elif Surer,Ertan Onur
备注:6 pages, 4 figures. Submitted to the IEEE Global Communications Conference (GLOBECOM) 2026
摘要:Ad hoc无线网络表现出复杂的,固有的和耦合的动力学:节点移动,能量消耗和拓扑结构的变化,是很难解析建模。无模型深度强化学习需要持续的在线交互,而现有的基于模型的方法使用平面状态表示,每个节点结构都会丢失。因此,我们提出了G-RSSM,一个图结构的递归状态空间模型,保持每个节点的潜在状态,跨节点多头注意力,从离线轨迹联合学习动态。我们将所提出的方法应用于下游任务聚类,其中簇头选择策略完全通过学习世界模型中的想象展开来训练。在27个评估场景中,包括MANET,VANET,FANET,WSN和战术网络,N=30到1000个节点,所学习的策略仅针对N=50进行训练,保持高连接性。在这里,我们提出了第一个多物理图形结构化世界模型应用于组合每节点决策的大小不可知的无线ad hoc网络。
摘要:Ad hoc wireless networks exhibit complex, innate, and coupled dynamics (node mobility, energy depletion, and topology change) that are difficult to model analytically. Model-free deep reinforcement learning requires sustained online interaction, whereas existing model-based approaches use flat state representations that lose per-node structure. We therefore propose G-RSSM, a graph-structured recurrent state-space model that maintains per-node latent states with cross-node multi-head attention to learn the dynamics jointly from offline trajectories. We apply the proposed method to the downstream task of clustering, where a cluster-head selection policy is trained entirely through imagined rollouts in the learned world model. Across 27 evaluation scenarios spanning MANET, VANET, FANET, WSN, and tactical networks with N=30 to 1000 nodes, the learned policy maintains high connectivity despite being trained only for N=50. We present the first multi-physics graph-structured world model applied to combinatorial per-node decision making in size-agnostic wireless ad hoc networks.
【4】Graph-Based Fraud Detection with Dual-Path Graph Filtering
标题:利用双路径图过滤的基于图的欺诈检测
链接:https://arxiv.org/abs/2604.14235
作者:Wei He,Wensheng Gan,Philip S. Yu
备注:Neural Networks
摘要:图数据上的欺诈检测可以被视为一项要求很高的任务,需要区分不同类型的节点。由于图神经网络(GNN)自然适合通过其消息传递操作处理以图形式编码的信息,因此基于GNN模型的方法在欺诈检测领域越来越受到关注。然而,欺诈图固有地表现出关系伪装,高度异质性和类不平衡,导致大多数GNN在欺诈检测任务中表现不佳。为了解决这些问题,本文提出了一种基于图的欺诈检测模型与双路径图过滤(DPF-GFD)。DPF-GFD首先将基于beta小波的算子应用于原始图以捕获关键结构模式。然后从基于距离的节点表示构造相似性图,并应用改进的低通滤波器。通过监督表示学习融合原始图和相似图的嵌入,以获得节点特征,最后由集成树模型用于评估未标记节点的欺诈风险。与现有的单图平滑方法不同,DPF-GFD引入了一种针对欺诈检测的频率互补双路径滤波范式,明确解耦结构异常建模和特征相似性建模。这种设计使高度异质和不平衡的欺诈图中的节点表示更具鉴别力和稳定性。在四个真实金融欺诈检测数据集上的综合实验证明了该方法的有效性。
摘要:Fraud detection on graph data can be viewed as a demanding task that requires distinguishing between different types of nodes. Because graph neural networks (GNNs) are naturally suited for processing information encoded in graph form through their message-passing operations, methods based on GNN models have increasingly attracted attention in the fraud detection domain. However, fraud graphs inherently exhibit relation camouflage, high heterophily, and class imbalance, causing most GNNs to underperform in fraud detection tasks. To address these challenges, this paper proposes a Graph-Based Fraud Detection Model with Dual-Path Graph Filtering (DPF-GFD). DPF-GFD first applies a beta wavelet-based operator to the original graph to capture key structural patterns. It then constructs a similarity graph from distance-based node representations and applies an improved low-pass filter. The embeddings from the original and similarity graphs are fused through supervised representation learning to obtain node features, which are finally used by an ensemble tree model to assess the fraud risk of unlabeled nodes. Unlike existing single-graph smoothing approaches, DPF-GFD introduces a frequency-complementary dual-path filtering paradigm tailored for fraud detection, explicitly decoupling structural anomaly modeling and feature similarity modeling. This design enables more discriminative and stable node representations in highly heterophilous and imbalanced fraud graphs. Comprehensive experiments on four real-world financial fraud detection datasets demonstrate the effectiveness of our proposed method.
【5】Explainable Graph Neural Networks for Interbank Contagion Surveillance: A Regulatory-Aligned Framework for the U.S. Banking Sector
标题:银行间传染性监控的可解释图神经网络:美国银行业的监管机构一致框架
链接:https://arxiv.org/abs/2604.14232
作者:Mohammad Nasir Uddin
备注:28 pages, submitted to Research in International Business and Finance (RIBAF)
摘要:时空图注意力网络(ST-GAT)框架的创建是为了作为一个可解释的基于GNN的解决方案,用于检测银行困境预警信号和对美国银行间系统进行宏观审慎监督。ST-GAT框架对8,103家FDIC保险机构的58个季度快照(2010 Q1 - 2024 Q2)进行了建模。利用最大熵估计从公开的FDIC电话报告中重建双边风险敞口,以生成动态有向加权图。该框架在所有GNN架构中实现了最高的AUPRC(0.939 +/- 0.010),仅次于XGBoost(0.944)。消融分析证实了BiLSTM时间分量贡献了+0.020 AUPRC;时间注意力权重表现出与长期结构脆弱性权重一致的单调递减模式。排列重要性将ROA(0.309)和不良贷款率(0.252)确定为主要预测因素,与2023年区域银行业危机的事后分析一致。所有数据都是公开的FDIC电话报告和FRED系列;所有代码和结果都已发布。
摘要:The Spatial-Temporal Graph Attention Network (ST-GAT) framework was created to serve as an explainable GNN-based solution for detecting bank distress early warning signs and for conducting macro-prudential surveillance of the interbank system in the United States. The ST-GAT framework models 8,103 FDIC insured institutions across 58 quarterly snapshots (2010Q1-2024Q2). Bilateral exposures were reconstructed from publicly available FDIC Call Reports using maximum entropy estimation to produce a dynamic directed weighted graph. The framework achieves the highest AUPRC among all GNN architectures (0.939 +/- 0.010), trailing only XGBoost (0.944). Ablation analysis confirms the BiLSTM temporal component contributes +0.020 AUPRC; temporal attention weights exhibit a monotonically decreasing pattern consistent with long-run structural vulnerability weighting. Permutation importance identifies ROA (0.309) and NPL Ratio (0.252) as dominant predictors, consistent with post-mortem analyses of the 2023 regional banking crisis. All data are publicly available FDIC Call Reports and FRED series; all code and results are released.
Transformer(5篇)
【1】Stability and Generalization in Looped Transformers
标题:环路Transformer的稳定性和推广性
链接:https://arxiv.org/abs/2604.15259
作者:Asher Labovich
备注:11 main pages, 27 total
摘要:循环Transformers通过在更困难的问题上花费更多的迭代来保证测试时的计算扩展,但是仍然不清楚哪些架构选择让他们在测试时推断更困难的问题,而不是记住特定于训练的解决方案。我们引入了一个基于固定点的框架,用于分析循环体系结构沿三个轴的稳定性-可达性,输入依赖性和几何形状-并用它来表征时,固定点迭代产生有意义的预测。从理论上讲,我们证明了没有召回的环状网络具有可数不动点,并且在任何光谱制度下都不能实现强输入依赖性,而召回与外部归一化相结合可靠地产生了一个制度,其中不动点同时可达,输入局部光滑,并支持稳定的反向传播。从经验上讲,我们在国际象棋、数独和前缀和上训练单层循环Transformers,并发现下游性能跟踪框架在任务和架构配置上的预测。我们还介绍了内部召回,一种新的召回位置的变体,并表明,它成为竞争力-和数独,大大优于-标准的召回位置,一旦外部规范化。
摘要:Looped transformers promise test-time compute scaling by spending more iterations on harder problems, but it remains unclear which architectural choices let them extrapolate to harder problems at test time rather than memorize training-specific solutions. We introduce a fixed-point based framework for analyzing looped architectures along three axes of stability -- reachability, input-dependence, and geometry -- and use it to characterize when fixed-point iteration yields meaningful predictions. Theoretically, we prove that looped networks without recall have countable fixed points and cannot achieve strong input-dependence at any spectral regime, while recall combined with outer normalization reliably produces a regime in which fixed points are simultaneously reachable, locally smooth in the input, and supported by stable backpropagation. Empirically, we train single-layer looped transformers on chess, sudoku, and prefix-sums and find that downstream performance tracks the framework's predictions across tasks and architectural configurations. We additionally introduce internal recall, a novel recall placement variant, and show that it becomes competitive with -- and on sudoku, substantially better than -- standard recall placement once outer normalization is applied.
【2】What Is the Minimum Architecture for Prolepsis? Early Irrevocable Commitment Across Tasks in Small Transformers
标题:Prolepsis的最低架构是什么?小型Transformer中各个任务的早期不可挽回的承诺
链接:https://arxiv.org/abs/2604.15010
作者:Éric Jacopin
备注:24 pages, 3 figures. Under review at COLM 2026. Independent replication of the rhyme-planning finding from Lindsey et al. (2025) on open-weights models; extended to factual recall
摘要:Transformers什么时候会做出决定,是什么阻止了它们纠正决定?我们引入了prolepsis:Transformer提前做出承诺,任务特定的注意力头维持该承诺,没有任何层纠正它。我们在开放模型(Gemma 2 2B,Llama 3.2 1B)上复现了Lindsey et al.(2025)的规划位点发现,并提出了五个问题。(Q1)规划对六种残差流方法是不可见的;CLT是必要的。(Q2)规划位点尖峰以相同的几何形状复现。(Q3)特定注意力头将决策路由到输出,填补了归因图中标记为不可见的空白。(Q4)搜索需要$\leq 16$层;承诺需要更多。(Q5)事实回忆在不同的网络深度上显示了相同的模式,重复出现的规划头和事实前10名之间没有重叠。Prolepsis是架构性的:模板是共享的,路由基底不同。所有实验都在单个消费级GPU(16 GB VRAM)上运行。
摘要:When do transformers commit to a decision, and what prevents them from correcting it? We introduce prolepsis: a transformer commits early, task-specific attention heads sustain the commitment, and no layer corrects it. Replicating the planning-site finding of Lindsey et al. (2025) on open models (Gemma 2 2B, Llama 3.2 1B), we ask five questions. (Q1) Planning is invisible to six residual-stream methods; CLTs are necessary. (Q2) The planning-site spike replicates with identical geometry. (Q3) Specific attention heads route the decision to the output, filling a gap flagged as invisible to attribution graphs. (Q4) Search requires $\leq 16$ layers; commitment requires more. (Q5) Factual recall shows the same motif at a different network depth, with zero overlap between recurring planning heads and the factual top-10. Prolepsis is architectural: the template is shared, the routing substrates differ. All experiments run on a single consumer GPU (16 GB VRAM).
【3】Expressivity of Transformers: A Tropical Geometry Perspective
标题:Transformer的表现力:热带几何的视角
链接:https://arxiv.org/abs/2604.14727
作者:Ye Su,Yong Liu
摘要:为了量化Transformers的几何表现力,我们引入了一个热带几何框架来表征其精确的空间分区能力。通过将自注意力建模为一个向量值的热带有理映射,我们证明了它在零温度极限下精确地求值为一个幂Voronoi图。基于这种等价性,我们建立了多头自注意(MHSA)的组合原理:通过牛顿多面体的Minkowski和,多头聚合将多面体复杂度扩展到$\mathcal{O}(N^H)$,克服了单头的$\mathcal{O}(N)$瓶颈。将其扩展到深度架构,我们推导出Transformers中线性区域数量的第一个严格渐近界($Θ(N^{d_{\text{model}}L})$),证明了序列长度$N$,环境嵌入维度$d_{\text{model}}$和网络深度$L$内在驱动的组合爆炸。重要的是,我们保证这个理想化的多面体骨架是几何稳定的:有限温度软注意通过指数紧微分近似边界保留这些拓扑分区。
摘要:To quantify the geometric expressivity of transformers, we introduce a tropical geometry framework to characterize their exact spatial partitioning capabilities. By modeling self-attention as a vector-valued tropical rational map, we prove it evaluates exactly to a Power Voronoi Diagram in the zero-temperature limit. Building on this equivalence, we establish a combinatorial rationale for Multi-Head Self-Attention (MHSA): via the Minkowski sum of Newton polytopes, multi-head aggregation expands the polyhedral complexity to $\mathcal{O}(N^H)$, overcoming the $\mathcal{O}(N)$ bottleneck of single heads. Extending this to deep architectures, we derive the first tight asymptotic bounds on the number of linear regions in transformers ($Θ(N^{d_{\text{model}}L})$), demonstrating a combinatorial explosion driven intrinsically by sequence length $N$, ambient embedding dimension $d_{\text{model}}$, and network depth $L$. Importantly, we guarantee that this idealized polyhedral skeleton is geometrically stable: finite-temperature soft attention preserves these topological partitions via exponentially tight differential approximation bounds.
【4】Zero-Ablation Overstates Register Content Dependence in DINO Vision Transformers
标题:零消融夸大了DINO Vision Transformers中的注册内容依赖性
链接:https://arxiv.org/abs/2604.14433
作者:Felipe Parodi,Jordan Matelsky,Melanie Segado
备注:12 pages, 10 figures, to be published in CVPR 2026 HOW Vision Interpretability Workshop Proceedings
摘要:零消融--用零向量代替令牌激活--被广泛用于探测Vision Transformers中的令牌功能。DINOv2+registers和DINOv3中的寄存器归零会产生较大的下降(分类高达$-36.6$ pp,分割高达$-30.9$ pp),这表明寄存器在功能上是不可或缺的。然而,三个替换对照--均值替换、噪声替换和跨图像寄存器重排--保留了分类、对应和分割的性能,保持在未修改基线的$\sim 1$ pp内。每个补丁的余弦相似性表明,这些替换确实扰动了内部表征,而归零会导致不成比例的大扰动,这与只有归零会降低任务性能的原因一致。我们的结论是零消融夸大了对确切寄存器内容的依赖。在我们测试的冻结特征评估中,性能取决于看似合理的寄存器式激活,而不是确切的图像特定值。尽管如此,寄存器仍能缓冲密集特征对[CLS]的依赖,并与压缩的补丁几何形状相关。这些发现,包括替换对照结果,在ViT-B规模上得到复现。
摘要:Zero-ablation -- replacing token activations with zero vectors -- is widely used to probe token function in vision transformers. Register zeroing in DINOv2+registers and DINOv3 produces large drops (up to $-36.6$ pp classification, $-30.9$ pp segmentation), suggesting registers are functionally indispensable. However, three replacement controls -- mean-substitution, noise-substitution, and cross-image register-shuffling -- preserve performance across classification, correspondence, and segmentation, remaining within $\sim 1$ pp of the unmodified baseline. Per-patch cosine similarity shows these replacements genuinely perturb internal representations, while zeroing causes disproportionately large perturbations, consistent with why it alone degrades tasks. We conclude that zero-ablation overstates dependence on exact register content. In the frozen-feature evaluations we test, performance depends on plausible register-like activations rather than on exact image-specific values. Registers nevertheless buffer dense features from [CLS] dependence and are associated with compressed patch geometry. These findings, including the replacement-control results, replicate at ViT-B scale.
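To make the difference between zero-ablation and the three replacement controls concrete, here is a minimal sketch on toy data. The abstract does not specify how each control is computed, so everything below is an assumption for illustration: one register vector per image, a dataset-wide mean, per-dimension Gaussian noise, and a random permutation across images. All function names are hypothetical.

```python
import math
import random
from statistics import mean, pstdev


def zero_ablate(registers):
    """Replace every register vector with zeros."""
    return [[0.0] * len(vec) for vec in registers]


def mean_substitute(registers):
    """Replace every register vector with the mean vector over all images."""
    mean_vec = [mean(dim) for dim in zip(*registers)]
    return [mean_vec[:] for _ in registers]


def noise_substitute(registers, rng):
    """Replace registers with Gaussian noise matched per dimension (assumed form)."""
    stats = [(mean(dim), pstdev(dim)) for dim in zip(*registers)]
    return [[rng.gauss(mu, sigma) for mu, sigma in stats] for _ in registers]


def shuffle_across_images(registers, rng):
    """Give each image a register vector taken from a randomly chosen image."""
    order = list(range(len(registers)))
    rng.shuffle(order)
    return [registers[i][:] for i in order]


def cosine(u, v):
    """Cosine similarity; defined as 0.0 when either vector is all zeros."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


# Toy register activations: one 2-d vector for each of three images.
registers = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
rng = random.Random(0)
zeroed = zero_ablate(registers)
averaged = mean_substitute(registers)
noised = noise_substitute(registers, rng)
shuffled = shuffle_across_images(registers, rng)
```

The three controls keep the replaced activations inside the range the network normally sees, whereas zeroing moves them far outside it; that is the distinction the abstract draws between perturbing content and leaving the plausible activation regime.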
【5】Three-Phase Transformer
标题:三相Transformer
链接:https://arxiv.org/abs/2604.14430
作者:Mohammad R. Abu Ayyash
备注:48 pages, 20 figures, 23 tables. Code: https://github.com/achelousace/three-phase-transformer
摘要:我们提出了三相Transformer(3 PT),一个剩余流结构的解码器之前,只有Transformers的标准SwiGLU + RMSNorm + RoPE + GQA骨干。隐藏向量被划分为N个大小相等的循环通道,每个通道由相位相关操作维护:每个通道的RMS Norm,注意力和FFN之间的2D Givens旋转,将每个通道旋转theta + i*(2*pi/N),以及将GQA头部与分区对齐的头部计数约束。该架构是一个自我稳定的平衡之间的混乱和重新强加,而不是一个螺栓上的模块。该分区划分出一个与通道正交的一维DC子空间,我们将固定的Gabriel喇叭轮廓r(p)= 1/(p+1)作为与RoPE的相对位置旋转正交的绝对位置侧通道注入其中。典型的N=3借用了平衡三相AC的隐喻,其中三个相距120度的正弦波之和为零,没有反相关对。在WikiText-103上的1.23亿参数下,3 PT在+1,536个参数(总数的0.00124%)的匹配Rope-Only基线上实现了-7.20%的复杂度(-2.62%位/字节),具有1.93倍的步数收敛加速(1.64倍挂钟)。N表现为一个参数共享旋钮,而不是唯一的最优值:在5.5 M时,N扫描{1,2,3,4,6,8,12}接近单调,N=1获胜;在123 M时,三种子扫描发现N=3和N=1在统计上无法区分。承载机制是通道分区的残余流,每块旋转,每相归一化,和喇叭直流注入。我们描述了(a)没有明确执行的几何形状的自稳定性,这是神经网络守恒律框架的一个新实例;(b)12层旋转角漂移的U形深度剖面;(c)与RoPE,注意力和FFN的正交组合。
摘要:We present Three-Phase Transformer (3PT), a residual-stream structural prior for decoder-only Transformers on a standard SwiGLU + RMSNorm + RoPE + GQA backbone. The hidden vector is partitioned into N equally-sized cyclic channels, each maintained by phase-respecting ops: a per-channel RMSNorm, a 2D Givens rotation between attention and FFN that rotates each channel by theta + i*(2*pi/N), and a head-count constraint aligning GQA heads with the partition. The architecture is a self-stabilizing equilibrium between scrambling and re-imposition, not a bolted-on module. The partition carves out a one-dimensional DC subspace orthogonal to the channels, into which we inject a fixed Gabriel's horn profile r(p) = 1/(p+1) as an absolute-position side-channel composing orthogonally with RoPE's relative-position rotation. The canonical N=3 borrows its metaphor from balanced three-phase AC, where three sinusoids 120 degrees apart sum to zero with no anti-correlated pair. At 123M parameters on WikiText-103, 3PT achieves -7.20% perplexity (-2.62% bits-per-byte) over a matched RoPE-Only baseline at +1,536 parameters (0.00124% of total), with 1.93x step-count convergence speedup (1.64x wall-clock). N behaves as a parameter-sharing knob rather than a unique optimum: at 5.5M an N-sweep over {1,2,3,4,6,8,12} is near-monotone with N=1 winning; at 123M a three-seed sweep finds N=3 and N=1 statistically indistinguishable. The load-bearing mechanism is the channel-partitioned residual stream, per-block rotation, per-phase normalization, and horn DC injection. We characterize (a) self-stabilization of the geometry without explicit enforcement, a novel instance of the conservation-law framework for neural networks; (b) a U-shaped depth profile of rotation-angle drift at 12 layers; (c) orthogonal composition with RoPE, attention, and FFN.
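The three-phase metaphor in the abstract above can be checked numerically. The following is a minimal sketch in plain Python, not the authors' implementation: the function names (`phase_offsets`, `phase_sum`, `givens_rotate`, `min_pairwise_correlation`) are hypothetical. The only facts taken from the abstract are the per-channel offset i*(2*pi/N) and the claim that three sinusoids 120 degrees apart sum to zero with no anti-correlated pair. The sketch relies on the standard identity that two unit sinusoids with phase gap d have correlation cos(d).

```python
import math


def phase_offsets(n):
    """Per-channel phase offsets i * (2*pi/n) for i = 0..n-1."""
    return [i * 2.0 * math.pi / n for i in range(n)]


def phase_sum(theta, n=3):
    """Sum of n sinusoids spaced 2*pi/n apart, evaluated at angle theta."""
    return sum(math.sin(theta + off) for off in phase_offsets(n))


def givens_rotate(x, y, angle):
    """Rotate the 2D pair (x, y) by `angle` radians (a 2D Givens rotation)."""
    c, s = math.cos(angle), math.sin(angle)
    return c * x - s * y, s * x + c * y


def min_pairwise_correlation(n):
    """Smallest correlation over all channel pairs; requires n >= 2.

    Two unit sinusoids with phase gap d have correlation cos(d).
    """
    offs = phase_offsets(n)
    return min(
        math.cos(offs[j] - offs[i]) for i in range(n) for j in range(i + 1, n)
    )


if __name__ == "__main__":
    print(phase_sum(0.7, 3))            # numerically zero
    print(min_pairwise_correlation(3))  # about -0.5
    print(min_pairwise_correlation(2))  # about -1.0 (an anti-correlated pair)
```

Under this reading, N=2 contains a perfectly anti-correlated pair while N=3 does not, which seems to be the property the abstract's "no anti-correlated pair" remark points at.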
GAN|对抗|攻击|生成相关(7篇)
【1】An Analysis of Regularization and Fokker-Planck Residuals in Diffusion Models for Image Generation
标题:图像生成扩散模型中的正则化与福克-普朗克残差分析
链接:https://arxiv.org/abs/2604.15171
作者:Onno Niemann,Gonzalo Martínez Muñoz,Alberto Suárez Gonzalez
备注:Accepted at IJCNN 2026 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works
摘要:最近的研究表明,使用去噪得分匹配(DSM)目标训练的扩散模型经常违反控制真实数据密度演化的Fokker-Planck(FP)方程。在目标函数中直接惩罚这些偏差可以降低其幅度,但会带来显著的计算开销。此外还观察到,强制严格遵守FP方程并不一定能提高生成样本的质量,最佳结果往往是在较弱的FP正则化下获得的。在本文中,我们研究更简单的惩罚项能否带来类似的好处。我们对几种轻量级正则化项进行了实证分析,研究它们对FP残差和生成质量的影响,并表明能够以低得多的计算成本获得FP正则化的好处。我们的代码可在https://github.com/OnnoNiemann/fp_diffusion_analysis上获得。
摘要:Recent work has shown that diffusion models trained with the denoising score matching (DSM) objective often violate the Fokker--Planck (FP) equation that governs the evolution of the true data density. Directly penalizing these deviations in the objective function reduces their magnitude but introduces a significant computational overhead. It is also observed that enforcing strict adherence to the FP equation does not necessarily lead to improvements in the quality of the generated samples, as often the best results are obtained with weaker FP regularization. In this paper, we investigate whether simpler penalty terms can provide similar benefits. We empirically analyze several lightweight regularizers, study their effect on FP residuals and generation quality, and show that the benefits of FP regularization are available at substantially lower computational cost. Our code is available at https://github.com/OnnoNiemann/fp_diffusion_analysis.
【2】Structure as Computation: Developmental Generation of Minimal Neural Circuits
标题:结构即计算:最小神经回路的发育式生成
链接:https://arxiv.org/abs/2604.15143
作者:Duan Zhou
摘要:这项工作模拟了皮质神经发生的发育过程:从单个干细胞开始,并受源自小鼠单细胞转录组数据的基因调控规则支配。发育过程自发地产生了由5,000个细胞组成的异质群体,但只产生了85个成熟神经元,仅占总群体的1.7%。这85个神经元形成了一个由200,400个突触组成的密集互连核心,相当于每个神经元的平均度为4,715。在第零次迭代时,该最小回路在MNIST上的表现处于随机猜测水平。然而,经过一个epoch的标准训练后,准确率飙升至90%以上,增益超过80个百分点;受发育随机性影响,典型运行结果落在89-94%的范围内。同一回路在不做任何架构修改或数据增强的情况下,训练一个epoch后在CIFAR-10上达到40.53%。这些结果表明,发育规则塑造出一种领域通用的拓扑基底,格外适合快速学习,这提示生物发育过程本身就编码了有利于高效计算的强大结构先验。
摘要:This work simulates the developmental process of cortical neurogenesis, initiating from a single stem cell and governed by gene regulatory rules derived from mouse single-cell transcriptomic data. The developmental process spontaneously generates a heterogeneous population of 5,000 cells, yet yields only 85 mature neurons - merely 1.7% of the total population. These 85 neurons form a densely interconnected core of 200,400 synapses, corresponding to an average degree of 4,715 per neuron. At iteration zero, this minimal circuit performs at chance level on MNIST. However, after a single epoch of standard training, accuracy surges to over 90% - a gain exceeding 80 percentage points - with typical runs falling in the 89-94% range depending on developmental stochasticity. The identical circuit, without any architectural modification or data augmentation, achieves 40.53% on CIFAR-10 after one epoch. These findings demonstrate that developmental rules sculpt a domain-general topological substrate exceptionally amenable to rapid learning, suggesting that biological developmental processes inherently encode powerful structural priors for efficient computation.
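The headline figures in the abstract above can be cross-checked with a few lines of arithmetic. This is a sketch under an assumption the abstract does not state: that "average degree" counts each synapse once at each of its two endpoints (2E/N). That reading reproduces the reported 4,715. The helper names are hypothetical, and the synapses-per-pair figure is our own inference, valid only if all 200,400 synapses connect the 85 mature neurons to one another.

```python
def neuron_fraction(neurons, cells):
    """Share of the cell population that matured into neurons."""
    return neurons / cells


def average_degree(synapses, neurons):
    """Mean degree, assuming each synapse is counted at both endpoints."""
    return 2 * synapses / neurons


def synapses_per_pair(synapses, neurons):
    """Mean number of synapses per unordered neuron pair (assumed closed core)."""
    pairs = neurons * (neurons - 1) // 2
    return synapses / pairs


if __name__ == "__main__":
    print(neuron_fraction(85, 5000))       # share of mature neurons
    print(average_degree(200_400, 85))     # compare with the reported 4,715
    print(synapses_per_pair(200_400, 85))  # inferred, not reported in the abstract
```

Because a neuron has at most 84 distinct partners in an 85-neuron core, a mean degree in the thousands would imply many parallel synapses per pair under these assumptions.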
【3】No More Guessing: a Verifiable Gradient Inversion Attack in Federated Learning
标题:不再猜测:联邦学习中可验证的梯度反演攻击
链接:https://arxiv.org/abs/2604.15063
作者:Francesco Diana,Chuan Xu,André Nusser,Giovanni Neglia
摘要:梯度反演攻击通过从客户端共享的梯度中重构训练样本,威胁联邦学习中的客户端隐私。梯度聚合了来自多条记录的贡献,现有攻击可能无法将它们解开,从而产生不正确的重建,且没有内在的方法来证明攻击成功。在视觉和语言任务中,攻击者可以退而依靠人工检查来判断重建是否合理,但这对数值型表格记录而言可行性要低得多,这也助长了表格数据不那么脆弱的印象。我们提出一种可验证的梯度反演攻击(VGIA),为重建样本的正确性提供明确的证书,从而挑战这种看法。我们的方法对ReLU泄漏采用几何视角:全连接层的激活边界在输入空间中定义了一个超平面。VGIA引入了一种基于子空间的代数验证测试,用于检测由超平面界定的区域何时恰好只包含一条记录。一旦隔离得到认证,VGIA就以解析方式恢复相应的特征向量,并通过轻量级优化步骤重建目标值。在大批量表格基准上的实验表明,在现有最先进攻击要么失败、要么无法评估重建保真度的情形下,VGIA能够精确恢复记录和目标值。与先前的几何方法相比,VGIA更有效地分配超平面查询,以更少的攻击轮数实现更快的重建。
摘要:Gradient inversion attacks threaten client privacy in federated learning by reconstructing training samples from clients' shared gradients. Gradients aggregate contributions from multiple records and existing attacks may fail to disentangle them, yielding incorrect reconstructions with no intrinsic way to certify success. In vision and language, attackers may fall back on human inspection to judge reconstruction plausibility, but this is far less feasible for numerical tabular records, fueling the impression that tabular data is less vulnerable. We challenge this perception by proposing a verifiable gradient inversion attack (VGIA) that provides an explicit certificate of correctness for reconstructed samples. Our method adopts a geometric view of ReLU leakage: the activation boundary of a fully connected layer defines a hyperplane in input space. VGIA introduces an algebraic, subspace-based verification test that detects when a hyperplane-delimited region contains exactly one record. Once isolation is certified, VGIA recovers the corresponding feature vector analytically and reconstructs the target via a lightweight optimization step. Experiments on tabular benchmarks with large batch sizes demonstrate exact record and target recovery in regimes where existing state-of-the-art attacks either fail or cannot assess reconstruction fidelity. Compared to prior geometric approaches, VGIA allocates hyperplane queries more effectively, yielding faster reconstructions with fewer attack rounds.
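The reason isolation matters in the abstract above can be illustrated with a well-known property of fully connected ReLU layers: if exactly one record in the batch activates a neuron, the ratio of that neuron's weight gradient to its bias gradient equals that record's input. The sketch below is not VGIA and does not implement its subspace-based verification test. The function names and the toy batch are hypothetical, and the upstream gradients are arbitrary illustrative values.

```python
def neuron_gradients(batch, w, b, upstream):
    """Summed gradients of one ReLU neuron over a batch.

    upstream[k] stands in for dL/da of record k (illustrative values only).
    Returns (grad_w, grad_b, number_of_active_records).
    """
    grad_w = [0.0] * len(w)
    grad_b = 0.0
    active = 0
    for x, g in zip(batch, upstream):
        z = sum(wi * xi for wi, xi in zip(w, x)) + b
        if z > 0:  # ReLU passes gradient only on the active side of the hyperplane
            active += 1
            grad_b += g
            for i, xi in enumerate(x):
                grad_w[i] += g * xi
    return grad_w, grad_b, active


def recover_input(grad_w, grad_b):
    """Exact only when a single record activated the neuron and grad_b != 0."""
    return [gw / grad_b for gw in grad_w]


if __name__ == "__main__":
    batch = [[1.0, 2.0], [-1.0, 0.5], [0.2, -3.0]]
    grad_w, grad_b, active = neuron_gradients(
        batch, w=[1.0, 1.0], b=-2.5, upstream=[0.3, -0.7, 1.1]
    )
    print(active, recover_input(grad_w, grad_b))
```

When two or more records fall on the active side, the same ratio yields a weighted mixture of their inputs, which is why an attacker needs some way to certify that a region contains exactly one record.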
【4】Physically-Induced Atmospheric Adversarial Perturbations: Enhancing Transferability and Robustness in Remote Sensing Image Classification
标题:物理诱导的大气对抗扰动:增强遥感图像分类中的可迁移性和鲁棒性
链接:https://arxiv.org/abs/2604.14643
作者:Weiwei Zhuang,Wangze Xie,Qi Zhang,Xia Du,Zihan Lin,Zheng Lin,Hanlin Cai,Jizhe Zhou,Zihan Fang,Chi-man Pun,Wei Ni,Jun Luo
备注:14 pages, 11 figures
摘要:对抗性攻击对遥感(RS)图像分类中深度学习模型的可靠性构成了严重威胁。大多数现有的方法依赖于直接的像素扰动,未能利用RS图像的固有大气特性或生存的现实世界的图像退化。在本文中,我们提出了FogFool,一个物理上合理的对抗框架,通过迭代优化基于Perlin噪声的大气模式来生成基于雾的扰动。通过对具有自然、不规则结构的雾形成进行建模,FogFool生成了对抗性的示例,这些示例不仅在视觉上与真实的RS场景一致,而且具有欺骗性。通过利用大气现象的空间相干性和中低频特性,FogFool将对抗信息嵌入到不同架构共享的结构特征中。在两个基准RS数据集上进行的大量实验表明,FogFool实现了卓越的性能:它不仅在白盒设置中表现出色,而且还表现出出色的黑盒可传输性(达到83.74%TASR)和对常见的基于预处理的防御(如JPEG压缩和过滤)的鲁棒性。详细的分析,包括混淆矩阵和类激活图(CAM)的可视化,揭示了我们的大气驱动的扰动诱导模型注意力的普遍转变。这些结果表明,FogFool对RS分类系统来说是一种实用、隐身和高度持久的威胁,为评估复杂环境中模型的可靠性提供了一个强大的基准。
摘要:Adversarial attacks pose a severe threat to the reliability of deep learning models in remote sensing (RS) image classification. Most existing methods rely on direct pixel-wise perturbations, failing to exploit the inherent atmospheric characteristics of RS imagery or survive real-world image degradations. In this paper, we propose FogFool, a physically plausible adversarial framework that generates fog-based perturbations by iteratively optimizing atmospheric patterns based on Perlin noise. By modeling fog formations with natural, irregular structures, FogFool generates adversarial examples that are not only visually consistent with authentic RS scenes but also deceptive. By leveraging the spatial coherence and mid-to-low-frequency nature of atmospheric phenomena, FogFool embeds adversarial information into structural features shared across diverse architectures. Extensive experiments on two benchmark RS datasets demonstrate that FogFool achieves superior performance: not only does it excel in white-box settings, but it also exhibits exceptional black-box transferability (reaching 83.74% TASR) and robustness against common preprocessing-based defenses such as JPEG compression and filtering. Detailed analyses, including confusion matrices and Class Activation Map (CAM) visualizations, reveal that our atmospheric-driven perturbations induce a universal shift in model attention. These results indicate that FogFool represents a practical, stealthy, and highly persistent threat to RS classification systems, providing a robust benchmark for evaluating model reliability in complex environments.
【5】VeriGraphi: A Multi-Agent Framework of Hierarchical RTL Generation for Large Hardware Designs
标题:VeriGraphi:用于大型硬件设计的分层RTL生成的多代理框架
链接:https://arxiv.org/abs/2604.14550
作者:Sazzadul Islam,Tasnim Tabassum,Hao Zheng
备注:9 pages, 2 figures, case studies
摘要:为大型分层硬件设计生成可综合的Verilog仍然是大型语言模型(LLM)的一个重大挑战,它很难复制人类专家在将复杂规范转换为RTL时所采用的结构化推理。当负责生成层次化的Verilog时,LLM经常会丢失跨模块的上下文,产生幻觉接口,制造模块间的布线,并且无法保持结构一致性-随着设计复杂性的增加和规范涉及非正式的散文,图形和表格,这些失败会加剧直接操作。为了解决这些挑战,我们提出了VeriGraphi,一个框架,它引入了一个规范锚定的知识图作为驱动RTL生成管道的架构基板。VeriGraphi构建了一个HDA,一个结构化的知识图,它将模块层次结构、端口级接口、布线语义和模块间依赖性明确编码为第一类图实体和关系。通过对规范的迭代多代理分析构建,该知识图在代码生成之前提供了一个确定性的、可机器检查的结构支架。在KG的指导下,渐进式编码模块逐步生成伪代码和可合成的RTL,同时在每个子模块阶段执行接口一致性和依赖性正确性。我们评估VeriGraphi的基准测试的三个代表性的规范文件,从美国国家标准与技术研究所及其相应的实现,我们提出了一个RV 32 I处理器作为一个详细的案例研究,以说明整个管道。结果表明,VeriGraphi能够以最少的人为干预为RISC-V实现可靠的分层RTL生成,标志着LLM生成硬件设计的重要里程碑,同时保持强大的功能正确性。
摘要:Generating synthesizable Verilog for large, hierarchical hardware designs remains a significant challenge for large language models (LLMs), which struggle to replicate the structured reasoning that human experts employ when translating complex specifications into RTL. When tasked with producing hierarchical Verilog, LLMs frequently lose context across modules, hallucinate interfaces, fabricate inter-module wiring, and fail to maintain structural coherence - failures that intensify as design complexity grows and specifications involve informal prose, figures, and tables that resist direct operationalization. To address these challenges, we present VeriGraphi, a framework that introduces a spec-anchored Knowledge Graph as the architectural substrate driving the RTL generation pipeline. VeriGraphi constructs a HDA, a structured knowledge graph that explicitly encodes module hierarchy, port-level interfaces, wiring semantics, and inter-module dependencies as first-class graph entities and relations. Built through iterative multi-agent analysis of the specification, this Knowledge Graph provides a deterministic, machine-checkable structural scaffold before code generation. Guided by the KG, a progressive coding module incrementally generates pseudo-code and synthesizable RTL while enforcing interface consistency and dependency correctness at each submodule stage. We evaluate VeriGraphi on a benchmark of three representative specification documents from the National Institute of Standards and Technology and their corresponding implementations, and we present a RV32I processor as a detailed case study to illustrate the full pipeline. The results demonstrate that VeriGraphi enables reliable hierarchical RTL generation with minimal human intervention for RISC-V, marking a significant milestone for LLM-generated hardware design while maintaining strong functional correctness.
【6】Best of both worlds: Stochastic & adversarial best-arm identification
标题:两全其美:随机与对抗性最佳臂识别
链接:https://arxiv.org/abs/2604.14860
作者:Yasin Abbasi-Yadkori,Peter L. Bartlett,Victor Gabillon,Alan Malek,Michal Valko
备注:Published in Conference on Learning Theory (COLT 2018)
摘要:我们研究奖励任意且可能具有对抗性的赌博机最佳臂识别问题。一个简单的均匀随机学习器在对抗场景中可获得最优错误率。然而,当奖励是随机采样得到时,这类策略是次优的。因此我们提出疑问:能否设计一个学习器,在不知道奖励性质的情况下,在随机问题和对抗问题中都表现最优?首先,我们证明一般而言不可能设计出这样的学习器。具体来说,为了对对抗性奖励具有鲁棒性,我们只能在随机问题的一个子集上保证最优错误率。我们给出了一个下界,刻画了在策略被约束为对对抗性奖励鲁棒时随机问题中的最优速率。最后,我们设计了一个简单的无参数算法,并证明其错误概率在随机问题中与该下界匹配(相差对数因子),同时对对抗性问题也具有鲁棒性。
摘要:We study bandit best-arm identification with arbitrary and potentially adversarial rewards. A simple random uniform learner obtains the optimal rate of error in the adversarial scenario. However, this type of strategy is suboptimal when the rewards are sampled stochastically. Therefore, we ask: Can we design a learner that performs optimally in both the stochastic and adversarial problems while not being aware of the nature of the rewards? First, we show that designing such a learner is impossible in general. In particular, to be robust to adversarial rewards, we can only guarantee optimal rates of error on a subset of the stochastic problems. We give a lower bound that characterizes the optimal rate in stochastic problems if the strategy is constrained to be robust to adversarial rewards. Finally, we design a simple parameter-free algorithm and show that its probability of error matches (up to log factors) the lower bound in stochastic problems, and it is also robust to adversarial ones.
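The "simple random uniform learner" mentioned in the abstract above can be sketched as follows. This is an illustrative reading rather than the paper's exact procedure: it pulls an arm chosen uniformly at random in every round and recommends the arm with the highest empirical mean. The function name, the `pull(arm, t)` callback signature, and the choice of empirical mean as the recommendation rule are all assumptions.

```python
import random


def uniform_best_arm(pull, n_arms, budget, rng):
    """Sample arms uniformly at random, then recommend the best empirical mean.

    pull(arm, t) returns the reward of `arm` at round t; the rewards may be
    stochastic or chosen by an adversary, the learner does not care which.
    """
    totals = [0.0] * n_arms
    counts = [0] * n_arms
    for t in range(budget):
        arm = rng.randrange(n_arms)
        totals[arm] += pull(arm, t)
        counts[arm] += 1
    means = [
        totals[a] / counts[a] if counts[a] else float("-inf")
        for a in range(n_arms)
    ]
    return max(range(n_arms), key=lambda a: means[a])


if __name__ == "__main__":
    arm_means = [0.2, 0.5, 0.9]
    rng = random.Random(0)

    def bernoulli(arm, t):
        return 1.0 if rng.random() < arm_means[arm] else 0.0

    print(uniform_best_arm(bernoulli, len(arm_means), 3000, rng))
```

Because the sampling distribution never depends on past rewards, an adversary cannot steer the learner's exploration, which is the intuition behind its robustness. The price, as the abstract notes, is suboptimality when rewards are actually stochastic.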
【7】ML-based approach to classification and generation of structured light propagation in turbulent media
标题:基于机器学习的湍流介质中结构光传播分类与生成方法
链接:https://arxiv.org/abs/2604.14208
作者:Aokun Wang,Anjali Nair,Zhongjian Wang,Guillaume Bal
摘要:这项工作开发了机器学习方法,用于对结构光波束进行分类;这些光束在湍流大气中传播时会产生随机散斑扰动。光束传播通过随机傍轴方程的数值模拟来建模。我们针对这一特定应用设计了卷积神经网络,并将其用于采用独热编码的分类模型。为了应对可用数据可能有限的挑战,我们开发了一种基于预测的生成式扩散模型,在分类器训练期间提供额外数据。我们表明,在学习步骤中最小化Bregman距离可以提高高频模态的生成质量。
摘要:This work develops machine learning approaches to classify structured light wave beams developing random speckle disturbances as they propagate through turbulent atmospheres. Beam propagation is modeled by the numerical simulation of a stochastic paraxial equation. We design convolutional neural networks tailored for this specific application and use them for a classification model with one-hot encoding. To address the challenge of potentially limited available data, we develop a prediction-based generative diffusion model to provide additional data during classifier training. We show that a Bregman distance minimization during the learning step improves the quality of the generation of high-frequency modes.
半/弱/无/有监督|不确定性|主动学习(8篇)
【1】SegWithU: Uncertainty as Perturbation Energy for Single-Forward-Pass Risk-Aware Medical Image Segmentation
标题:SegWithU:不确定性作为单正向风险感知医学图像分割的扰动能量
链接:https://arxiv.org/abs/2604.15271
作者:Tianhao Fu,Austin Wang,Charles Chen,Roby Aldave-Garza,Yucheng Chen
摘要:可靠的不确定性估计对医学图像分割至关重要,因为自动生成的轮廓会用于下游量化和临床决策支持。许多强大的不确定性方法需要重复推理,而高效的单次前向传播方法通常只能提供较弱的失败排序,或依赖限制性的特征空间假设。我们提出了$\textbf{SegWithU}$,这是一个事后框架,用一个轻量级的不确定性头来增强冻结的预训练分割骨干网络。SegWithU提取骨干网络的中间特征,并使用秩1后验探针,将不确定性建模为紧凑探针空间中的扰动能量。它生成两张体素级不确定性图:一张面向校准,用于概率调和;另一张面向排序,用于错误检测和选择性预测。在ACDC、BraTS2024和LiTS上,SegWithU是最强且最一致的单次前向传播基线,AUROC/AURC分别达到$0.9838/2.4885$、$0.9946/0.2660$和$0.9925/0.8193$,同时保持了分割质量。这些结果表明,基于扰动的不确定性建模是实现可靠性感知医学分割的一条有效且实用的途径。源代码可在https://github.com/ProjectNeura/SegWithU上获得。
摘要:Reliable uncertainty estimation is critical for medical image segmentation, where automated contours feed downstream quantification and clinical decision support. Many strong uncertainty methods require repeated inference, while efficient single-forward-pass alternatives often provide weaker failure ranking or rely on restrictive feature-space assumptions. We present $\textbf{SegWithU}$, a post-hoc framework that augments a frozen pretrained segmentation backbone with a lightweight uncertainty head. SegWithU taps intermediate backbone features and models uncertainty as perturbation energy in a compact probe space using rank-1 posterior probes. It produces two voxel-wise uncertainty maps: a calibration-oriented map for probability tempering and a ranking-oriented map for error detection and selective prediction. Across ACDC, BraTS2024, and LiTS, SegWithU is the strongest and most consistent single-forward-pass baseline, achieving AUROC/AURC of $0.9838/2.4885$, $0.9946/0.2660$, and $0.9925/0.8193$, respectively, while preserving segmentation quality. These results suggest that perturbation-based uncertainty modeling is an effective and practical route to reliability-aware medical segmentation. Source code is available at https://github.com/ProjectNeura/SegWithU.
【2】xFODE+: Explainable Type-2 Fuzzy Additive ODEs for Uncertainty Quantification
标题:xFODE+:用于不确定性量化的可解释类型2模糊加法ODE
链接:https://arxiv.org/abs/2604.14880
作者:Ertugrul Kececi,Tufan Kumbasar
备注:in IEEE International Conference on Fuzzy Systems, 2026
摘要:深度学习(DL)的最新进展推动了数据驱动的系统识别(SysID),但可靠的使用需要不确定性量化(UQ)以及准确的预测。虽然具有UQ能力的模型(如模糊ODE(FODE))可以产生预测区间(PI),但它们提供有限的可解释性。我们引入可解释的2型模糊加法ODE的UQ(xFODE+),一个可解释的SysID模型,它产生PI的同时点预测,同时保留物理上有意义的增量状态。xFODE+使用区间2型模糊逻辑系统(IT 2-FLS)实现每个模糊加法模型,并将隶属函数约束为两个相邻规则的激活,限制重叠并保持推理局部透明。由IT 2-FLS产生的类型缩减集被聚合以与PI一起构造状态更新。该模型在DL框架中通过复合损失进行训练,该复合损失联合优化了预测精度和PI质量。在基准SysID数据集上的结果表明,xFODE+在PI质量上与FODE相匹配,并实现了相当的准确性,同时提供了可解释性。
摘要:Recent advances in Deep Learning (DL) have boosted data-driven System Identification (SysID), but reliable use requires Uncertainty Quantification (UQ) alongside accurate predictions. Although UQ-capable models such as Fuzzy ODE (FODE) can produce Prediction Intervals (PIs), they offer limited interpretability. We introduce Explainable Type-2 Fuzzy Additive ODEs for UQ (xFODE+), an interpretable SysID model that produces PIs alongside point predictions while retaining physically meaningful incremental states. xFODE+ implements each fuzzy additive model with Interval Type-2 Fuzzy Logic Systems (IT2-FLSs) and constrains membership functions to the activation of two neighboring rules, limiting overlap and keeping inference locally transparent. The type-reduced sets produced by the IT2-FLSs are aggregated to construct the state update together with the PIs. The model is trained in a DL framework via a composite loss that jointly optimizes prediction accuracy and PI quality. Results on benchmark SysID datasets show that xFODE+ matches FODE in PI quality and achieves comparable accuracy, while providing interpretability.
【3】CMTM: Cross-Modal Token Modulation for Unsupervised Video Object Segmentation
标题:CMTM:用于无监督视频对象分割的跨模式令牌调制
链接:https://arxiv.org/abs/2604.14630
作者:Inseok Jeon,Suhwan Cho,Minhyeok Lee,Seunghoon Lee,Minseok Kang,Jungho Lee,Chaewon Park,Donghyeong Kim,Sangyoun Lee
备注:6 pages, 5 figures. Accepted to IEEE ICIP 2025
摘要:无监督视频对象分割的最新进展突出了整合外观和运动线索的双流架构的潜力。然而,充分利用这些互补的信息源需要有效地建模它们的相互依赖性。在本文中,我们介绍了跨模态令牌调制,这是一种旨在加强外观和运动线索之间相互作用的新方法。我们的方法建立了密集的连接标记之间的每一个模态,使有效的模态内和模态间的信息传播通过关系Transformer块。为了提高学习效率,我们采用了一种令牌掩蔽策略,解决了仅依赖于增加模型复杂性的局限性。我们的方法在所有公共基准测试中都达到了最先进的性能,优于现有的方法。
摘要:Recent advances in unsupervised video object segmentation have highlighted the potential of two-stream architectures that integrate appearance and motion cues. However, fully leveraging these complementary sources of information requires effectively modeling their interdependencies. In this paper, we introduce cross-modality token modulation, a novel approach designed to strengthen the interaction between appearance and motion cues. Our method establishes dense connections between tokens from each modality, enabling efficient intra-modal and inter-modal information propagation through relation transformer blocks. To improve learning efficiency, we incorporate a token masking strategy that addresses the limitations of relying solely on increased model complexity. Our approach achieves state-of-the-art performance across all public benchmarks, outperforming existing methods.
【4】An unsupervised decision-support framework for multivariate biomarker analysis in athlete monitoring
标题:运动员监测中多元生物标志物分析的无监督决策支持框架
链接:https://arxiv.org/abs/2604.14534
作者:Fernando Barcelos Rosito,Sebastião De Jesus Menezes,Simone Ferreira Sturza,Adriana Seixas,Muriel Figueredo Franco
备注:15 pages, 4 figures, 3 tables, submitted to Springer Nature Scientific Reports
摘要:目的.运动员监测受限于队列规模小、生物标志物尺度异质、重复采样可行性有限以及缺乏可靠的损伤真实标签。这些局限降低了传统单变量和二元风险模型的可解释性和实用性。本研究提出一个无监督多变量框架来应对这些挑战,利用真实数据识别运动员的潜在生理状态。方法.我们提出了一个在联合生物标志物空间中运行的模块化计算框架,集成了预处理、临床安全筛查、无监督聚类和基于质心的生理学解释。特征谱完全从处于比赛微周期中的业余足球运动员数据学习得到。合成数据增强用于评估鲁棒性和可扩展性。Ward层次聚类支持监测和病因区分,而高斯混合模型(GMM)支持高维设置下的结构稳定性分析。结果.该框架识别出连贯的特征谱,能够区分机械损伤与代谢应激,同时保留稳态状态。合成数据增强证明了该框架的可行性,并能检测出单变量监测通常遗漏的潜在隐性风险表型。结构分析表明其在数据增强和更高维设置下具有鲁棒性。结论.该框架能够在没有损伤标签的情况下,从多变量生物标志物数据中以可解释的方式识别潜在生理状态。通过区分不同机制并揭示常规监测无法捕获的隐性风险模式,它为个体化运动员监测和决策提供了可操作的见解。
摘要:Purpose. Athlete monitoring is constrained by small cohorts, heterogeneous biomarker scales, limited feasibility of repeated sampling, and the lack of reliable injury ground truth. These limitations reduce the interpretability and utility of traditional univariate and binary risk models. This study addresses these challenges by proposing an unsupervised multivariate framework to identify latent physiological states in athletes using real data. Methods. We propose a modular computational framework that operates in the joint biomarker space, integrating preprocessing, clinical safety screening, unsupervised clustering, and centroid-based physiological interpretation. Profiles are learned exclusively from amateur soccer players during a competitive microcycle. Synthetic data augmentation evaluates robustness and scalability. Ward hierarchical clustering supports monitoring and etiological differentiation, while Gaussian Mixture Models (GMM) enable structural stability analysis in high-dimensional settings. Results. The framework identifies coherent profiles that distinguish mechanical damage from metabolic stress while preserving homeostatic states. Synthetic data augmentation demonstrates feasibility and detection of latent silent risk phenotypes typically missed by univariate monitoring. Structural analyses indicate robustness under augmentation and higher-dimensional settings. Conclusion. The framework enables interpretable identification of latent physiological states from multivariate biomarker data without injury labels. By distinguishing mechanisms and revealing silent risk patterns not captured by conventional monitoring, it provides actionable insights for individualized athlete monitoring and decision making.
【5】Anomaly Detection in IEC-61850 GOOSE Networks: Evaluating Unsupervised and Temporal Learning for Real-Time Intrusion Detection
标题:IEC-61850 GOOSE网络中的异常检测:评估用于实时入侵检测的无监督学习与时序学习
链接:https://arxiv.org/abs/2604.14233
作者:Joseph Moore
备注:10 pages, 7 figures, 4 tables
摘要:IEC-61850 GOOSE协议支持现代数字变电站中的时间关键型通信,但缺乏本地安全机制,使其容易受到重放,伪装和数据注入攻击。由于严格的延迟限制(低于4 ms)和标记攻击数据的有限可用性,这种设置中的入侵检测具有挑战性。本文评估无监督的时间建模是否可以提供有效的和可部署的异常检测GOOSE网络。在ERENO IEC-61850数据集上比较了五个模型:监督随机森林基线,前馈自动编码器和三个递归序列自动编码器(RNN,LSTM和GRU)。有监督的随机森林实现了最高的检测性能(F1=0.9516),但未能满足每次预测21.8ms的实时约束。所有四个无监督模型都满足4 ms的要求,其中GRU实现了最佳的精度与延迟的权衡(在1.118ms时F1=0.8737)。一个独立的数据集上的跨环境评估表明,所有的模型下的分布转移退化。然而,经常性的模型保留了比监督基线更高的相对性能,这表明时间序列建模比拟合标记的攻击分布更好。无监督模型的异常阈值在一个保持的验证分区上选择,以避免测试集泄漏。这些结果支持无监督时间模型作为实时GOOSE入侵检测的实际选择,特别是在标记的训练数据可能不可用或需要跨不同变电站大规模部署的环境中。
摘要:The IEC-61850 GOOSE protocol underpins time-critical communication in modern digital substations but lacks native security mechanisms, leaving it vulnerable to replay, masquerade, and data injection attacks. Intrusion detection in this setting is challenging due to strict latency constraints (sub-4ms) and limited availability of labeled attack data. This paper evaluates whether unsupervised temporal modeling can provide effective and deployable anomaly detection for GOOSE networks. Five models are compared on the ERENO IEC-61850 dataset: a supervised Random Forest baseline, a feedforward Autoencoder, and three recurrent sequence autoencoders (RNN, LSTM, and GRU). The supervised Random Forest achieves the highest detection performance (F1=0.9516) but fails to meet real-time constraints at 21.8ms per prediction. All four unsupervised models satisfy the 4ms requirement, with the GRU achieving the best accuracy to latency tradeoff among them (F1=0.8737 at 1.118ms). A cross-environment evaluation on an independent dataset shows that all models degrade under distribution shift. However, recurrent models retain substantially higher relative performance than the supervised baseline, suggesting that temporal sequence modeling generalizes better than fitting labeled attack distributions. Anomaly thresholds for the unsupervised models are selected on a held out validation partition to avoid test set leakage. These results support unsupervised temporal models as a practical choice for real-time GOOSE intrusion detection, particularly in environments where labeled training data may be unavailable or where large-scale deployment across diverse substations is required.
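The abstract above states that anomaly thresholds are chosen on a held-out validation partition to avoid test-set leakage, without giving the selection rule. A minimal sketch of one common approach follows, assuming a percentile of the validation reconstruction errors is used as the threshold. The percentile rule, the default of 99, and the function names are all hypothetical and are not taken from the paper.

```python
def select_threshold(val_errors, percentile=99):
    """Pick a reconstruction-error threshold from validation data only.

    `percentile` is an integer in [0, 100]; the test set is never consulted.
    """
    if not val_errors:
        raise ValueError("validation errors must be non-empty")
    ordered = sorted(val_errors)
    idx = min(len(ordered) - 1, (percentile * len(ordered)) // 100)
    return ordered[idx]


def flag_anomalies(test_errors, threshold):
    """Mark a sample as anomalous when its error exceeds the threshold."""
    return [err > threshold for err in test_errors]


if __name__ == "__main__":
    # Stand-in for autoencoder reconstruction errors on the validation split.
    val_errors = [i / 100 for i in range(100)]
    threshold = select_threshold(val_errors, percentile=95)
    print(threshold, flag_anomalies([0.5, 0.96, 2.0], threshold))
```

Fixing the threshold before touching the test partition keeps the reported F1 an honest estimate, since no test label or test error influences the decision boundary.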
【6】Portfolio Optimization Proxies under Label Scarcity and Regime Shifts via Bayesian and Deterministic Students under Semi-Supervised Sandwich Training
标题:半监督三明治训练下通过Bayesian和确定性学生在标签稀缺和制度转变下的投资组合优化代理
链接:https://arxiv.org/abs/2604.14206
作者:Adhiraj Chattopadhyay
备注:18 pages of main text. 10 pages of appendices. 35 references. Around 13 figures
摘要:本文提出了一种机器学习辅助的投资组合优化框架,面向低数据环境和市场状态不确定性。我们构建了一个教师-学生学习管道,其中条件风险价值(CVaR)优化器生成监督标签,神经模型(贝叶斯型和确定型)同时使用真实数据和合成增强数据进行训练。合成数据由带有t copula残差的因子模型生成,使训练不再局限于仅有104个带标签观测的真实样本。我们在一个结构化实验框架下评估了四个学生模型,该框架包括(i)受控合成实验(3 x 5种子网格),(ii)分布内真实市场评估(C2A)和(iii)跨资产池泛化(D2A)。在真实市场设置中,模型采用滚动评估协议部署:冻结的预训练模型定期根据最近的观测进行微调,然后重置为基础状态,从而在允许有限适应的同时确保稳定性。结果表明,学生模型在若干设置中可以匹配或优于CVaR教师,同时在市场状态转变下具有更好的鲁棒性,并降低了换手率。这些发现表明,优化与学习相结合的混合方法可以在数据受限的环境中改善投资组合构建。
摘要:This paper proposes a machine-learning-assisted portfolio optimization framework designed for low-data environments and regime uncertainty. We construct a teacher-student learning pipeline in which a Conditional Value at Risk (CVaR) optimizer generates supervisory labels, and neural models (Bayesian and deterministic) are trained using both real and synthetically augmented data. The synthetic data is generated using a factor-based model with t-copula residuals, enabling training beyond the limited real sample of 104 labeled observations. We evaluate four student models under a structured experimental framework comprising (i) controlled synthetic experiments (3 x 5 seed grid), (ii) in-distribution real market evaluation (C2A) and (iii) cross-universe generalization (D2A). In real-market settings, models are deployed using a rolling evaluation protocol where a frozen pretrained model is periodically fine-tuned on recent observations and reset to its base state, ensuring stability while allowing limited adaptation. Results show that student models can match or outperform the CVaR teacher in several settings, while achieving improved robustness under regime shifts and reduced turnover. These findings suggest that hybrid optimization learning approaches can enhance portfolio construction in data-constrained environments.
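For readers unfamiliar with the risk measure behind the teacher in the abstract above, the sketch below computes an empirical Value at Risk and Conditional Value at Risk from a sample of losses. It uses one common discrete estimator (the mean of the worst tail of the sample) and is not the paper's optimizer, whose formulation the abstract does not specify. The function name and the integer `tail_percent` parameter are hypothetical.

```python
def empirical_var_cvar(losses, tail_percent=5):
    """Return (VaR, CVaR) for the worst `tail_percent` percent of losses.

    Losses are positive when money is lost. CVaR is the mean of the tail and
    VaR is the smallest loss inside that tail.
    """
    if not losses:
        raise ValueError("losses must be non-empty")
    ordered = sorted(losses, reverse=True)
    # Integer ceiling of tail_percent * n / 100, with at least one sample.
    k = max(1, -(-tail_percent * len(ordered) // 100))
    tail = ordered[:k]
    return tail[-1], sum(tail) / k


if __name__ == "__main__":
    sample_losses = [float(x) for x in range(1, 21)]  # 1.0 .. 20.0
    print(empirical_var_cvar(sample_losses, tail_percent=10))
```

Unlike VaR, which reports only a single quantile, CVaR averages everything beyond it, so it responds to how severe the tail losses are. This is one reason it is often preferred as an optimization target.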
【7】Unsupervised feature selection using Bayesian Tucker decomposition
标题:基于贝叶斯Tucker分解的无监督特征选择
链接:https://arxiv.org/abs/2604.14949
作者:Y-h. Taguchi,Yoh-ichi Mototake
备注:24 pages, 10 figures
摘要:本文提出了贝叶斯塔克分解(BTuD),其中残差应该服从类似于线性回归的高斯分布。虽然我们已经提出了一种算法来执行所提出的BTuD,传统的高阶正交迭代可以生成与本实现一致的Tucker分解。使用所提出的BTuD,我们可以执行无监督的特征选择,成功地应用于各种合成数据集,具有随机耦合强度的全局耦合映射,以及基因表达谱。因此,我们可以得出结论,我们新提出的无监督特征选择方法是有前途的。除此之外,基于BTuD的无监督FE预计将与先前提出并成功应用于广泛问题的基于TD的无监督FE相一致。
摘要:In this paper, we propose Bayesian Tucker decomposition (BTuD), in which the residual is assumed to follow a Gaussian distribution, analogous to linear regression. Although we have proposed an algorithm to perform BTuD, conventional higher-order orthogonal iteration can generate a Tucker decomposition consistent with the present implementation. Using BTuD, we perform unsupervised feature selection, which is successfully applied to various synthetic datasets, globally coupled maps with randomized coupling strength, and gene expression profiles. We therefore conclude that the newly proposed unsupervised feature selection method is promising. In addition, BTuD-based unsupervised FE is expected to coincide with the TD-based unsupervised FE that was previously proposed and successfully applied to a wide range of problems.
【8】PUFFIN: Protein Unit Discovery with Functional Supervision
标题:PUFFIN:功能监督下的蛋白质单元发现
链接:https://arxiv.org/abs/2604.14796
作者:Gökçe Uludoğan,Buse Giledereli,Elif Ozkirimli,Arzucan Özgür
备注:21 pages, 9 figures, to appear in ISMB 2026 proceedings
摘要:蛋白质通过组织成结构排列的残基组的协调作用来执行生物功能。这些排列,我们称之为蛋白质单位,存在于一个中间尺度上,比单个残基大,但比整个蛋白质小。通过识别这些单位及其与功能的关联,可以更深入地了解蛋白质功能。然而,现有的方法要么专注于残基水平的信号,依赖于策划的注释,或片段蛋白质结构,而不结合功能信息,从而限制了结构-功能关系的可解释分析。我们介绍PUFFIN,一个数据驱动的框架,通过共同学习结构划分和功能监督来发现蛋白质单元。PUFFIN将蛋白质表示为残基级结构图,并应用具有结构感知池机制的图神经网络,该机制将每个蛋白质划分为多残基单元,并具有塑造分区的功能监督。我们发现,学习的单位在结构上是连贯的,表现出有组织的关联与分子功能,并显示有意义的对应与策划InterPro注释。总之,这些结果表明,PUFFIN提供了一个可解释的框架,用于使用学习的蛋白质单元及其统计功能关联来分析结构-功能关系。我们在https://github.com/boun-tabi-lifelu/puffin上提供了源代码。
摘要:Proteins carry out biological functions through the coordinated action of groups of residues organized into structural arrangements. These arrangements, which we refer to as protein units, exist at an intermediate scale, being larger than individual residues yet smaller than entire proteins. A deeper understanding of protein function can be achieved by identifying these units and their associations with function. However, existing approaches either focus on residue-level signals, rely on curated annotations, or segment protein structures without incorporating functional information, thereby limiting interpretable analysis of structure-function relationships. We introduce PUFFIN, a data-driven framework for discovering protein units by jointly learning structural partitioning and functional supervision. PUFFIN represents proteins as residue-level structure graphs and applies a graph neural network with a structure-aware pooling mechanism that partitions each protein into multi-residue units, with functional supervision that shapes the partition. We show that the learned units are structurally coherent, exhibit organized associations with molecular function, and show meaningful correspondence with curated InterPro annotations. Together, these results demonstrate that PUFFIN provides an interpretable framework for analyzing structure-function relationships using learned protein units and their statistical function associations. We made our source code available at https://github.com/boun-tabi-lifelu/puffin.
迁移|Zero/Few/One-Shot|自适应(7篇)
【1】One-shot learning for the complex dynamical behaviors of weakly nonlinear forced oscillators
标题:弱非线性受强迫振子复杂动力学行为的一次性学习
链接:https://arxiv.org/abs/2604.15181
作者:Teng Ma,Luca Rosafalco,Wei Cui,Lin Zhao,Attilio Frangi
备注:48 pages, 16 figures, graphical abstract, highlights
摘要:复杂非线性动力学的外推预测仍然是工程中的一个核心挑战。本研究提出了一种单次学习方法,通过学习控制方程,从单次激励时程中识别全局频率响应曲线。我们引入MEv-SINDy(Multi-frequency Evolutionary Sparse Identification of Nonlinear Dynamics)来推断非自治多频系统的控制方程。该方法利用广义谐波平衡(GHB)方法,将复杂的受迫响应分解为一组慢变演化方程。我们在两个关键的微机电系统(MEMS)上验证了MEv-SINDy的能力,这些应用包括一个非线性梁谐振器和一个MEMS微镜。结果表明,在单个工况点上训练的模型能够在很宽的激励水平范围内准确预测软化/硬化效应和跳跃现象。这种方法大大减轻了非线性微系统表征和设计中的数据采集负担。
摘要:Extrapolative prediction of complex nonlinear dynamics remains a central challenge in engineering. This study proposes a one-shot learning method to identify global frequency-response curves from a single excitation time history by learning governing equations. We introduce MEv-SINDy (Multi-frequency Evolutionary Sparse Identification of Nonlinear Dynamics) to infer the governing equations of non-autonomous and multi-frequency systems. The methodology leverages the Generalized Harmonic Balance (GHB) method to decompose complex forced responses into a set of slow-varying evolution equations. We validated the capabilities of MEv-SINDy on two critical Micro-Electro-Mechanical Systems (MEMS). These applications include a nonlinear beam resonator and a MEMS micromirror. Our results show that the model trained on a single point accurately predicts softening/hardening effects and jump phenomena across a wide range of excitation levels. This approach significantly reduces the data acquisition burden for the characterization and design of nonlinear microsystems.
【2】Multi-User mmWave Beam and Rate Adaptation via Combinatorial Satisficing Bandits
Link: https://arxiv.org/abs/2604.14908
Authors: Emre Özyıldırım, Barış Yaycı, Umut Eren Akturk, Cem Tekin
Abstract: We study downlink beam and rate adaptation in a multi-user mmWave MISO system where multiple base stations (BSs), each using analog beamforming from finite codebooks, serve multiple single-antenna user equipments (UEs) with a unique beam per UE and discrete data transmission rates. BSs learn about transmission success based on ACK/NACK feedback. To encode service goals, we introduce a satisficing throughput threshold $\tau_r$ and cast joint beam and rate adaptation as a combinatorial semi-bandit over beam-rate tuples. Within this framework, we propose SAT-CTS, a lightweight, threshold-aware policy that blends conservative confidence estimates with posterior sampling, steering learning toward meeting $\tau_r$ rather than merely maximizing. Our main theoretical contribution provides the first finite-time regret bounds for combinatorial semi-bandits with a satisficing objective: when $\tau_r$ is realizable, we upper bound the cumulative satisficing regret to the target with a time-independent constant, and when $\tau_r$ is non-realizable, we show that SAT-CTS incurs only a finite expected transient outside committed CTS rounds, after which its regret is governed by the sum of the regret contributions of restarted CTS rounds, yielding an $O((\log T)^2)$ standard regret bound. On the practical side, we evaluate the performance via cumulative satisficing regret to $\tau_r$ alongside standard regret and fairness. Experiments with time-varying sparse multipath channels show that SAT-CTS consistently reduces satisficing regret and maintains competitive standard regret, while achieving favorable average throughput and fairness across users, indicating that feedback-efficient learning can equitably allocate beams and rates to meet QoS targets without channel state knowledge.
【3】Catching Every Ripple: Enhanced Anomaly Awareness via Dynamic Concept Adaptation
Link: https://arxiv.org/abs/2604.14726
Authors: Jiaqi Zhu, Shaofeng Cai, Jie Chen, Fang Deng, Beng Chin Ooi, Wenqiao Zhang
Comments: Accepted by IEEE TPAMI
Abstract: Online anomaly detection (OAD) plays a pivotal role in real-time analytics and decision-making for evolving data streams. However, existing methods often rely on costly retraining and rigid decision boundaries, limiting their ability to adapt both effectively and efficiently to concept drift in dynamic environments. To address these challenges, we propose DyMETER, a dynamic concept adaptation framework for OAD that unifies on-the-fly parameter shifting and dynamic thresholding within a single online paradigm. DyMETER first learns a static detector on historical data to capture recurring central concepts, and then transitions to a dynamic mode to adapt to new concepts as drift occurs. Specifically, DyMETER employs a novel dynamic concept adaptation mechanism that leverages a hypernetwork to generate instance-aware parameter shifts for the static detector, thereby enabling efficient and effective adaptation without retraining or fine-tuning. To achieve robust and interpretable adaptation, DyMETER introduces a lightweight evolution controller to estimate instance-level concept uncertainty for adaptive updates. Further, DyMETER employs a dynamic threshold optimization module that adaptively recalibrates the decision boundary by maintaining a candidate window of uncertain samples, which ensures continuous alignment with evolving concepts. Extensive experiments demonstrate that DyMETER significantly outperforms existing OAD approaches across a wide spectrum of application scenarios.
【4】ConfLayers: Adaptive Confidence-based Layer Skipping for Self-Speculative Decoding
Link: https://arxiv.org/abs/2604.14612
Authors: Walaa Amer, Uday das, Fadi Kurdahi
Comments: 13 pages, 9 figures
Abstract: Self-speculative decoding is an inference technique for large language models designed to speed up generation without sacrificing output quality. It combines fast, approximate decoding using a compact version of the model as a draft model with selective re-evaluation by the full target model. Some existing methods form the draft model by dynamically learning which layers to skip during inference, effectively creating a smaller subnetwork to speed up computation. However, using heuristic-based approaches to select layers to skip can often be simpler and more effective. In this paper, we propose ConfLayers, a dynamic plug-and-play approach to forming the draft model in self-speculative decoding via confidence-based intermediate layer skipping. The process iteratively computes confidence scores for all layers, selects layers to skip based on an adaptive threshold, evaluates the performance of the resulting set, and updates the best selection until no further improvement is achieved or a maximum number of iterations is reached. This framework avoids the overhead and complexity of training a layer-skipping policy and can provide more consistent speed-quality trade-offs while preserving the adaptivity of the draft model to diverse tasks and datasets. The performance evaluation of ConfLayers across different models and datasets shows that our novel approach offers up to 1.4x speedup over vanilla LLM generation.
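The iterative selection loop described in the ConfLayers abstract can be sketched roughly as follows. This is only an illustration, not the authors' implementation: the function names `score_layers` and `evaluate`, the multiplicative threshold decay, and the default values are all assumptions, since the abstract does not specify how confidence is computed or how the threshold adapts.

```python
from typing import Callable, Dict, FrozenSet, Tuple


def conf_layer_search(
    score_layers: Callable[[FrozenSet[int]], Dict[int, float]],
    evaluate: Callable[[FrozenSet[int]], float],
    threshold: float = 0.9,
    decay: float = 0.9,
    max_iters: int = 10,
) -> Tuple[FrozenSet[int], float]:
    """Search for a set of layers to skip (hypothetical sketch).

    score_layers: returns a confidence score per layer index, given the
        currently selected skip set (assumed interface).
    evaluate: returns a scalar to maximize for a skip set, e.g. a
        speed/quality proxy (assumed interface).
    """
    best_skip: FrozenSet[int] = frozenset()
    best_value = evaluate(best_skip)
    for _ in range(max_iters):
        # Recompute confidence scores for all layers.
        scores = score_layers(best_skip)
        # Select the layers whose confidence clears the current threshold.
        candidate = frozenset(l for l, s in scores.items() if s >= threshold)
        value = evaluate(candidate)
        if value <= best_value:
            break  # no further improvement: stop
        best_skip, best_value = candidate, value
        # Assumed adaptive rule: relax the threshold after each success.
        threshold *= decay
    return best_skip, best_value
```

The loop stops either when a candidate set fails to improve on the best one found so far or when `max_iters` is reached, mirroring the two stopping conditions named in the abstract.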
【5】Material-Agnostic Zero-Shot Thermal Inference for Metal Additive Manufacturing via a Parametric PINN Framework
Link: https://arxiv.org/abs/2604.14562
Authors: Hyeonsu Lee, Jihoon Jeong
Abstract: Accurate thermal modeling in metal additive manufacturing (AM) is essential for understanding the process-structure-performance relationship. While prior studies have explored generalization across unseen process conditions, they often require extensive datasets, costly retraining, or pre-training. Generalization across different materials also remains relatively unexplored due to the challenges posed by distinct material-dependent thermal behaviors. This paper introduces a parametric physics-informed neural network (PINN) framework for zero-shot generalization across arbitrary materials without labeled data, retraining, or pre-training. The framework adopts a decoupled parametric PINN architecture that separately encodes material properties and spatiotemporal coordinates, fusing them through conditional modulation to better align with the multiplicative role of material parameters in the governing equation and boundary conditions. Physics-guided output scaling derived from Rosenthal's analytical solution and a hybrid optimization strategy are further incorporated to enhance physical consistency, training stability, and convergence. Experiments on bare plate laser powder bed fusion (LPBF) across diverse metal alloys, including both in-distribution and out-of-distribution cases, demonstrate effective zero-shot generalizability along with superior training efficiency. Specifically, the proposed framework achieved up to a 64.2% reduction in relative L2 error compared to the non-parametric baseline while surpassing its performance within only 4.4% of the baseline training epochs. Ablation studies confirm that the proposed framework's components are broadly applicable to other PINN-based approaches. Overall, the proposed framework provides an efficient and scalable material-agnostic solution for zero-shot thermal modeling, contributing to more flexible and practical deployment in metal AM.
【6】H2VLR: Heterogeneous Hypergraph Vision-Language Reasoning for Few-Shot Anomaly Detection
Link: https://arxiv.org/abs/2604.14507
Authors: Jianghong Huang, Luping Ji, Weiwei Duan, Mao Ye
Comments: 9 pages, 5 figures
Abstract: As a classic vision task, anomaly detection has been widely applied in industrial inspection and medical imaging. In this task, data scarcity is a frequent issue. To address it, few-shot anomaly detection (FSAD) is attracting increasing attention. In recent years, beyond the traditional visual paradigm, vision-language models (VLMs) have been extensively explored to advance this field. However, almost all existing VLM-based FSAD schemes perform anomaly inference only through pairwise feature matching, ignoring structural dependencies and global consistency. To further improve FSAD with VLMs, we propose a Heterogeneous Hypergraph Vision-Language Reasoning (H2VLR) framework. It reformulates FSAD as a high-order inference problem over visual-semantic relations by jointly modeling visual regions and semantic concepts in a unified hypergraph. Experimental comparisons verify the effectiveness and advantages of H2VLR. It often achieves state-of-the-art (SOTA) performance on representative industrial and medical benchmarks. Our code will be released upon acceptance.
【7】Shapley Value-Guided Adaptive Ensemble Learning for Explainable Financial Fraud Detection with U.S. Regulatory Compliance Validation
Link: https://arxiv.org/abs/2604.14231
Authors: Mohammad Nasir Uddin, Md Munna Aziz
Comments: 28 pages. Submitted to Engineering Applications of Artificial Intelligence (Elsevier). IEEE-CIS dataset (590,540 transactions). Includes SGAE algorithm, SHAP stability evaluation, and OCC/SR 11-7 regulatory compliance mapping
Abstract: Financial crime costs U.S. institutions over $32 billion each year. Although AI tools for fraud detection have become more advanced, their use in real-world systems still faces a major obstacle: many of these models operate as black boxes that cannot provide the transparent, auditable explanations required by regulations such as OCC Bulletin 2011-12 and Federal Reserve SR 11-7. This study makes three main contributions. First, it offers a thorough evaluation of explanation quality across faithfulness (sufficiency and comprehensiveness at k=5, 10, and 15) and stability (Kendall's W across 30 bootstrap samples). XGBoost paired with TreeExplainer achieves near-perfect stability (W=0.9912), while LSTM with DeepExplainer shows weak results (W=0.4962). Second, the paper introduces the SHAP-Guided Adaptive Ensemble (SGAE), which dynamically adjusts per-transaction ensemble weights based on SHAP attribution agreement, achieving the highest AUC-ROC among all tested models (0.8837 held-out; 0.9245 cross-validation). Third, a complete three-architecture evaluation of LSTM, Transformer, and GNN-GraphSAGE on the full 590,540-transaction IEEE-CIS dataset is provided, with GNN-GraphSAGE achieving AUC-ROC 0.9248 and F1=0.6013. All results are mapped directly to OCC, SR 11-7, and BSA-AML regulatory compliance requirements.
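The stability figures quoted above are Kendall's coefficient of concordance, $W = \frac{12 S}{m^2 (n^3 - n)}$, where $m$ rankings of $n$ items are compared and $S$ is the sum of squared deviations of the per-item rank totals from their mean. A minimal sketch, assuming untied ranks from 1 to n and assuming (the abstract does not say so explicitly) that each bootstrap sample contributes one ranking of features by attribution:

```python
from typing import Sequence


def kendalls_w(rankings: Sequence[Sequence[int]]) -> float:
    """Kendall's W for m rankings of the same n items (no ties).

    rankings[j][i] is the rank (1..n) that ranking j assigns to item i.
    Returns 1.0 for perfect agreement and 0.0 when the rank totals are
    all equal.
    """
    m = len(rankings)
    n = len(rankings[0])
    if m < 2 or n < 2:
        raise ValueError("need at least two rankings of at least two items")
    if any(len(r) != n for r in rankings):
        raise ValueError("all rankings must cover the same items")
    totals = [sum(r[i] for r in rankings) for i in range(n)]
    mean_total = m * (n + 1) / 2.0
    s = sum((t - mean_total) ** 2 for t in totals)
    return 12.0 * s / (m ** 2 * (n ** 3 - n))
```

Under this reading, W=0.9912 means the 30 bootstrap rankings order the features almost identically, while W=0.4962 indicates substantial reshuffling between samples.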
Reinforcement Learning (4 papers)
【1】RL-STPA: Adapting System-Theoretic Hazard Analysis for Safety-Critical Reinforcement Learning
Link: https://arxiv.org/abs/2604.15201
Authors: Steven A. Senczyszyn, Timothy C. Havens, Nathaniel Rice, Jason E. Summers, Benjamin D. Werner, Benjamin J. Schumeg
Abstract: As reinforcement learning (RL) deployments expand into safety-critical domains, existing evaluation methods fail to systematically identify hazards arising from the black-box nature of neural-network-enabled policies and distributional shift between training and deployment. This paper introduces Reinforcement Learning System-Theoretic Process Analysis (RL-STPA), a framework that adapts conventional STPA's systematic hazard analysis to address RL's unique challenges through three key contributions: hierarchical subtask decomposition using both temporal phase analysis and domain expertise to capture emergent behaviors, coverage-guided perturbation testing that explores the sensitivity of state-action spaces, and iterative checkpoints that feed identified hazards back into training through reward shaping and curriculum design. We demonstrate RL-STPA in the safety-critical test case of autonomous drone navigation and landing, revealing potential loss scenarios that can be missed by standard RL evaluations. The proposed framework provides practitioners with a toolkit for systematic hazard analysis, quantitative metrics for safety coverage assessment, and actionable guidelines for establishing operational safety bounds. While RL-STPA cannot provide formal guarantees for arbitrary neural policies, it offers a practical methodology for systematically evaluating and improving RL safety and robustness in safety-critical applications where exhaustive verification methods remain intractable.
【2】LongAct: Harnessing Intrinsic Activation Patterns for Long-Context Reinforcement Learning
Link: https://arxiv.org/abs/2604.14922
Authors: Bowen Ping, Zijun Chen, Tingfeng Hui, Qize Yu, Chenxuan Li, Junchi Yan, Baobao Chang
Abstract: Reinforcement Learning (RL) has emerged as a critical driver for enhancing the reasoning capabilities of Large Language Models (LLMs). While recent advancements have focused on reward engineering or data synthesis, few studies exploit the model's intrinsic representation characteristics to guide the training process. In this paper, we first observe the presence of high-magnitude activations within the query and key vectors when processing long contexts. Drawing inspiration from model quantization, which establishes the criticality of such high-magnitude activations, and from the insight that long-context reasoning inherently exhibits a sparse structure, we hypothesize that the weights associated with these activations serve as the pivotal drivers for effective model optimization. Based on this insight, we propose LongAct, a strategy that shifts from uniform to saliency-guided sparse updates. By selectively updating only the weights associated with these significant activations, LongAct achieves an improvement of approximately 8% on LongBench v2 and enhances generalization on the RULER benchmark. Furthermore, our method exhibits remarkable universality, consistently boosting performance across diverse RL algorithms such as GRPO and DAPO. Extensive ablation studies suggest that focusing on these salient features is key to unlocking long-context potential.
【3】Reinforcement Learning via Value Gradient Flow
Link: https://arxiv.org/abs/2604.14265
Authors: Haoran Xu, Kaiwen Hu, Somayeh Sojoudi, Amy Zhang
Comments: ICLR 2026
Abstract: We study behavior-regularized reinforcement learning (RL), where regularization toward a reference distribution (the dataset in offline RL or the base model in LLM RL finetuning) is essential to prevent value over-optimization caused by erroneous out-of-distribution extrapolation. Existing methods either rely on reparameterized policy gradients, which are difficult to scale to large generative models, or on rejection sampling, which can be overly conservative when attempting to move beyond the behavior support. In this paper, we propose Value Gradient Flow (VGF), a scalable new paradigm for behavior-regularized RL. VGF casts behavior-regularized RL as an optimal transport problem that maps the reference distribution to the value-induced optimal policy distribution. We solve this transport problem via discrete gradient flow, where value gradients guide particles initialized from the reference distribution. Our analysis shows that VGF imposes regularization implicitly by controlling the transport budget. VGF eliminates explicit policy parameterization while remaining expressive and flexible, which enables adaptive test-time scaling by adjusting the transport budget. Extensive experiments demonstrate that VGF significantly outperforms prior methods, achieving state-of-the-art results on offline RL benchmarks (D4RL, OGBench) and LLM RL tasks. Code and runs can be found at https://ryanxhr.github.io/vgf.
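The idea of moving reference samples along value gradients under a limited transport budget can be illustrated with a one-dimensional toy. This is a sketch under strong assumptions and not the paper's algorithm: the quadratic value function, the plain explicit-Euler update, and the reading of "budget" as step size times number of steps are all hypothetical choices made for illustration.

```python
import random
from typing import Callable, List


def value(a: float) -> float:
    """Toy value function with its maximum at a = 2 (assumed)."""
    return -(a - 2.0) ** 2


def value_gradient(a: float) -> float:
    """Analytic gradient of the toy value function."""
    return -2.0 * (a - 2.0)


def gradient_flow(
    particles: List[float],
    grad: Callable[[float], float],
    step_size: float,
    num_steps: int,
) -> List[float]:
    """Move each particle along the value gradient for num_steps steps.

    With num_steps = 0 the particles stay at the reference samples; more
    steps allow the particles to travel further from the reference.
    """
    moved = list(particles)
    for _ in range(num_steps):
        moved = [a + step_size * grad(a) for a in moved]
    return moved


# Particles initialized from a reference distribution (here N(0, 1)).
rng = random.Random(0)
reference = [rng.gauss(0.0, 1.0) for _ in range(256)]
small_budget = gradient_flow(reference, value_gradient, step_size=0.1, num_steps=2)
large_budget = gradient_flow(reference, value_gradient, step_size=0.1, num_steps=20)
```

In this toy, a small number of steps keeps the particles near the reference samples (strong implicit regularization), while a larger number trades proximity for higher value, which is one way to picture the test-time scaling the abstract describes.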
【4】Timescale Separation Enables Deep Reinforcement Learning Control of Rotating Detonation Engine Mode Transitions
Link: https://arxiv.org/abs/2604.14398
Authors: Kristian Holme, Jean Rabault, Ricardo Vinuesa, Mikael Mortensen
Abstract: Rotating detonation engines (RDEs) are a promising propulsion concept that may offer higher thermodynamic efficiency and specific impulse than conventional systems, but nonlinear phenomena, including transitions to oscillatory or chaotic propagation modes, can hinder practical operation. Deep Reinforcement Learning (DRL) has emerged as a promising method for controlling complex nonlinear dynamics such as those observed in RDEs. However, the multi-timescale nature of the RDE system makes direct application of DRL challenging. We address this challenge by reformulating the DRL problem in a moving reference frame that follows the detonation-wave pattern, making the wave structure appear quasi-steady to the agent. This reformulation enables scale separation between fast detonation propagation and slower operating-mode dynamics. We train DRL controllers to modulate spatially segmented injection pressure in a one-dimensional reduced-order RDE model and induce rapid transitions between different mode-locked states. Across a range of actuation periods, initial states, and target modes, controllers trained in the moving frame learn more reliably than those trained in a stationary frame and remain effective over a broader range of actuation periods. These results suggest that symmetry-aware moving reference frame formulations may be useful for related multiscale flow-control problems and that scale separation should be exploited whenever possible to enable DRL control of multi-timescale systems.
Symbolic|Symbolic Learning (1 paper)
【1】Prism: Symbolic Superoptimization of Tensor Programs
Link: https://arxiv.org/abs/2604.15272
Authors: Mengdi Wu, Xiaoyu Jiang, Oded Padon, Zhihao Jia
Abstract: This paper presents Prism, the first symbolic superoptimizer for tensor programs. The key idea is sGraph, a symbolic, hierarchical representation that compactly encodes large classes of tensor programs by symbolically representing some execution parameters. Prism organizes optimization as a two-level search: it constructs symbolic graphs that represent families of programs, and then instantiates them into concrete implementations. This formulation enables structured pruning of provably suboptimal regions of the search space using symbolic reasoning over operator semantics, algebraic identities, and hardware constraints. We develop techniques for efficient symbolic graph generation, equivalence verification via e-graph rewriting, and parameter instantiation through auto-tuning. Together, these components allow Prism to bridge the rigor of exhaustive search with the scalability required for modern ML workloads. Evaluation on five commonly used LLM workloads shows that Prism achieves up to $2.2\times$ speedup over the best superoptimizers and $4.9\times$ over the best compiler-based approaches, while reducing end-to-end optimization time by up to $3.4\times$.
Medical (2 papers)
【1】Retrieve, Then Classify: Corpus-Grounded Automation of Clinical Value Set Authoring
Link: https://arxiv.org/abs/2604.14616
Authors: Sumit Mukherjee, Juan Shu, Nairwita Mazumder, Tate Kernell, Celena Wheeler, Shannon Hastings, Chris Sidey-Gibbons
Abstract: Clinical value set authoring, the task of identifying all codes in a standardized vocabulary that define a clinical concept, is a recurring bottleneck in clinical quality measurement and phenotyping. A natural approach is to prompt a large language model (LLM) to generate the required codes directly, but structured clinical vocabularies are large, version-controlled, and not reliably memorized during pretraining. We propose Retrieval-Augmented Set Completion (RASC): retrieve the $K$ most similar existing value sets from a curated corpus to form a candidate pool, then apply a classifier to each candidate code. Theoretically, retrieve-and-select can reduce statistical complexity by shrinking the effective output space from the full vocabulary to a much smaller retrieved candidate pool. We demonstrate the utility of RASC on 11,803 publicly available VSAC value sets, constructing the first large-scale benchmark for this task. A cross-encoder fine-tuned on SAPBert achieves an AUROC of 0.852 and a value-set-level F1 of 0.298, outperforming a simpler three-layer Multilayer Perceptron (AUROC 0.799, F1 0.250); both reduce the number of irrelevant candidates per true positive from 12.3 (retrieval-only) to approximately 3.2 and 4.4, respectively. Zero-shot GPT-4o achieves a value-set-level F1 of 0.105, with 48.6% of returned codes absent from VSAC entirely. This performance gap widens with increasing value set size, consistent with RASC's theoretical advantage. We observe similar performance gains across two other classifier model types, namely a cross-encoder initialized from pre-trained SAPBert and a LightGBM model, demonstrating that RASC's benefits extend beyond a single model class. The code to download and create the benchmark dataset, as well as the model training code, is available at https://github.com/mukhes3/RASC.
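The retrieve-then-classify pipeline in the RASC abstract can be sketched with toy stand-ins. Everything concrete here is an assumption for illustration: token-overlap (Jaccard) retrieval replaces whatever retriever the paper uses, a vote-fraction rule replaces the trained cross-encoder, MLP, or LightGBM classifiers, and the codes `A1`, `A2`, `A3`, `B1` are made-up placeholders rather than real vocabulary entries.

```python
from typing import Callable, Dict, List, Set

ValueSet = Dict[str, object]  # {"title": str, "codes": Set[str]} (assumed shape)


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Token-overlap similarity between two token sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def retrieve(query: str, corpus: List[ValueSet], k: int) -> List[ValueSet]:
    """Return the k value sets whose titles are most similar to the query."""
    q = set(query.lower().split())
    ranked = sorted(
        corpus,
        key=lambda vs: jaccard(q, set(str(vs["title"]).lower().split())),
        reverse=True,
    )
    return ranked[:k]


def vote_fraction(code: str, retrieved: List[ValueSet]) -> float:
    """Stand-in classifier score: share of retrieved sets containing the code."""
    return sum(code in vs["codes"] for vs in retrieved) / len(retrieved)


def complete_set(
    query: str,
    corpus: List[ValueSet],
    k: int,
    score: Callable[[str, List[ValueSet]], float],
    threshold: float,
) -> Set[str]:
    """Retrieve k similar sets, pool their codes, keep codes scoring >= threshold."""
    retrieved = retrieve(query, corpus, k)
    pool: Set[str] = set().union(*(vs["codes"] for vs in retrieved))
    return {c for c in pool if score(c, retrieved) >= threshold}


corpus: List[ValueSet] = [
    {"title": "type 2 diabetes", "codes": {"A1", "A2"}},
    {"title": "type 1 diabetes", "codes": {"A2", "A3"}},
    {"title": "hip fracture", "codes": {"B1"}},
]
selected = complete_set("diabetes type 2 mellitus", corpus, k=2,
                        score=vote_fraction, threshold=1.0)
```

The point of the sketch is the shape of the output space: the classifier only ever sees codes from the retrieved pool, so it cannot return a code that is absent from the vocabulary, in contrast to the direct-generation baseline described in the abstract.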
【2】Unraveling the Mechanism of Drug Binding to SARS-CoV-2 RNA Pseudoknot with Thermodynamics-Driven Machine Learning
Link: https://arxiv.org/abs/2604.14906
Authors: Mariia Ivonina, Jakub Rydzewski
Abstract: The SARS-CoV-2 RNA pseudoknot is a promising target for antiviral intervention, as it regulates the efficiency of $-1$ programmed ribosomal frameshifting ($-1$ PRF), a mechanism that is essential for viral protein synthesis. The pseudoknot represents a viral RNA sequence composed of helical stems that adopts two long-lived topologies, threaded and unthreaded. Ligand-induced distortion of this fold is thought to underlie the susceptibility of $-1$ PRF to small-molecule inhibitors. Resolving these distortions from unbiased molecular dynamics (MD) requires collective variables (CVs) that isolate the slowest dynamic modes of the RNA-ligand system from the high-frequency fluctuations. Here, we use spectral map (SM), a thermodynamics-driven machine-learning method, to learn such CVs directly from MD trajectories of the SARS-CoV-2 RNA pseudoknot in complex with the $-1$ PRF inhibitor merafloxacin and two related analogs. We examine both threaded and unthreaded pseudoknot topologies and consider the neutral and ionized ligand forms relevant at physiological pH. Free-energy landscapes show that ligand-induced destabilization is topology-selective: merafloxacin and its analogs destabilize the S2 stem in the threaded pseudoknot, whereas in the unthreaded pseudoknot, destabilization shifts to the S1 and S3 stems. We find that the zwitterionic form of merafloxacin uniquely imposes slow dynamics on the otherwise featureless unthreaded pseudoknot. Furthermore, the neutral and zwitterionic forms of merafloxacin differ qualitatively in their mechanisms within the same RNA topology. Overall, these results clarify how pseudoknot topology, ligand type, and protonation state shape the slow conformational dynamics of viral RNA and establish physiological protonation as an essential factor for modeling RNA-targeted drug action.
Distillation|Knowledge Extraction (2 papers)
【1】DLink: Distilling Layer-wise and Dominant Knowledge from EEG Foundation Models
Link: https://arxiv.org/abs/2604.15016
Authors: Jingyuan Wang, Meiyan Xu, Zhihao Jia, Chenyu Liu, Xinliang Zhou, Ziyu Jia, Yong Li, Fang Li, Junfeng Yao, Yi Ding
Abstract: EEG foundation models (FMs) achieve strong cross-subject and cross-task generalization but impose substantial computational and memory costs that hinder deployment on embedded BCI systems. Knowledge distillation is a natural solution; however, conventional methods fail for EEG FMs because task-relevant semantics are often distributed across intermediate layers, and aggressive dimensionality reduction can distort oscillatory structure via representational collapse and aliasing. To address these challenges, we propose DLink (Distilling Layer-wise and Dominant Knowledge), a unified framework for transferring knowledge from large EEG FMs to compact students with three key innovations: (1) a dynamic Router that adaptively aggregates teacher layers to capture dominant intermediate representations; (2) an EEG MiC student with a Mimic-then-Compress pipeline, which inherits high-dimensional teacher features and then applies structured spatio-temporal compression to avoid a heavy classification head; and (3) spectral distillation that aligns teacher-student representations in the frequency domain to regularize compression and mitigate aliasing and temporal jitter. Experiments on four EEG benchmarks show that DLink enables compact students to outperform lightweight baselines while approaching fully fine-tuned FM performance at substantially lower model size and inference cost.
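One way to picture frequency-domain alignment is a loss on DFT magnitudes. The sketch below is an assumption, not DLink's actual objective (the abstract does not give the loss): it compares the magnitude spectra of a teacher and a student feature sequence of equal length, using a naive O(n²) DFT to stay dependency-free.

```python
import cmath
import math
from typing import List, Sequence


def dft_magnitudes(x: Sequence[float]) -> List[float]:
    """Magnitude of each DFT bin of a real sequence (naive O(n^2) DFT)."""
    n = len(x)
    return [
        abs(sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)))
        for k in range(n)
    ]


def spectral_loss(teacher: Sequence[float], student: Sequence[float]) -> float:
    """Mean squared difference between teacher and student magnitude spectra."""
    if len(teacher) != len(student):
        raise ValueError("sequences must have the same length")
    mt, ms = dft_magnitudes(teacher), dft_magnitudes(student)
    return sum((a - b) ** 2 for a, b in zip(mt, ms)) / len(mt)


n = 16
teacher = [math.sin(2 * math.pi * 2 * t / n) for t in range(n)]
# Same oscillation, circularly shifted by three samples (a toy "jitter").
shifted = teacher[3:] + teacher[:3]
# A different oscillation frequency.
other = [math.sin(2 * math.pi * 5 * t / n) for t in range(n)]
```

Because a circular time shift changes only the phase of each DFT bin, `spectral_loss(teacher, shifted)` is numerically zero while `spectral_loss(teacher, other)` is large; a magnitude-based loss of this kind penalizes a change in oscillatory content but not a small temporal offset.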
【2】Attention to Mamba: A Recipe for Cross-Architecture Distillation
Link: https://arxiv.org/abs/2604.14191
Authors: Abhinav Moudgil, Ningyuan Huang, Eeshan Gunesh Dhekane, Pau Rodríguez, Luca Zappella, Federico Danieli
Abstract: State Space Models (SSMs) such as Mamba have become a popular alternative to Transformer models, due to their reduced memory consumption and higher throughput at generation compared to their Attention-based counterparts. On the other hand, the community has built up a considerable body of knowledge on how to train Transformers, and many pretrained Transformer models are readily available. To facilitate the adoption of SSMs while leveraging existing pretrained Transformers, we aim to identify an effective recipe to distill an Attention-based model into a Mamba-like architecture. In prior work on cross-architecture distillation, however, it has been shown that a naïve distillation procedure from Transformers to Mamba fails to preserve the original teacher performance, a limitation often overcome with hybrid solutions combining Attention and SSM blocks. The key argument from our work is that, by equipping Mamba with a principled initialization, we can recover an overall better recipe for cross-architectural distillation. To this end, we propose a principled two-stage approach: first, we distill knowledge from a traditional Transformer into a linearized version of Attention, using an adaptation of the kernel trick. Then, we distill the linearized version into an adapted Mamba model that does not use any Attention block. Overall, the distilled Mamba model is able to preserve the original Pythia-1B Transformer performance in downstream tasks, maintaining a perplexity of 14.11, close to the teacher's 13.86. To show the efficacy of our recipe, we conduct thorough ablations at 1B scale with 10B tokens varying sequence mixer architecture, scaling analysis on model sizes and total distillation tokens, and a sensitivity analysis on token allocation between stages.
Clustering (1 paper)
【1】Scalable Model-Based Clustering with Sequential Monte Carlo
Link: https://arxiv.org/abs/2604.14810
Authors: Connie Trojan, Pavel Myshkov, Paul Fearnhead, James Hensman, Tom Minka, Christopher Nemeth
Comments: Accepted at AISTATS 2026. 31 pages, 20 figures
Abstract: In online clustering problems, there is often a large amount of uncertainty over possible cluster assignments that cannot be resolved until more data are observed. This difficulty is compounded when clusters follow complex distributions, as is the case with text data. Sequential Monte Carlo (SMC) methods give a natural way of representing and updating this uncertainty over time, but have prohibitive memory requirements for large-scale problems. We propose a novel SMC algorithm that decomposes clustering problems into approximately independent subproblems, allowing a more compact representation of the algorithm state. Our approach is motivated by the knowledge base construction problem, and we show that our method is able to accurately and efficiently solve clustering problems in this setting and others where traditional SMC struggles.
Super-resolution|Denoising|Deblurring|Dehazing (1 paper)
【1】Step-level Denoising-time Diffusion Alignment with Multiple Objectives
Link: https://arxiv.org/abs/2604.14379
Authors: Qi Zhang, Dawei Wang, Shaofeng Zou
Abstract: Reinforcement learning (RL) has emerged as a powerful tool for aligning diffusion models with human preferences, typically by optimizing a single reward function under a KL regularization constraint. In practice, however, human preferences are inherently pluralistic, and aligned models must balance multiple downstream objectives, such as aesthetic quality and text-image consistency. Existing multi-objective approaches either rely on costly multi-objective RL fine-tuning or on fusing separately aligned models at denoising time, but they generally require access to reward values (or their gradients) and/or introduce approximation error in the resulting denoising objectives. In this paper, we revisit the problem of RL fine-tuning for diffusion models and address the intractability of identifying the optimal policy by introducing a step-level RL formulation. Building on this, we further propose Multi-objective Step-level Denoising-time Diffusion Alignment (MSDDA), a retraining-free framework for aligning diffusion models with multiple objectives, obtaining the optimal reverse denoising distribution in closed form, with mean and variance expressed directly in terms of single-objective base models. We prove that this denoising-time objective is exactly equivalent to the step-level RL fine-tuning, introducing no approximation error. Moreover, we provide numerical results, which indicate that our method outperforms existing denoising-time approaches.
Autonomous driving|Vehicles|Lane detection, etc. (2 papers)
【1】Low-Cost System for Automatic Recognition of Driving Pattern in Assessing Interurban Mobility using Geo-Information
Link: https://arxiv.org/abs/2604.15216
Authors: Oscar Romero, Aika Silveira Miura, Lorena Parra, Jaime Lloret
Comments: 18 pages, 10 figures, 3 tables
Abstract: Mobility in urban and interurban areas, mainly by cars, is a day-to-day activity of many people. However, some of its main drawbacks are traffic jams and accidents. Newly made vehicles have pre-installed driving evaluation systems, which can prevent accidents. However, most cars on our roads do not have driver assessment systems. In this paper, we propose an approach for recognising driving styles and enabling drivers to reach safer and more efficient driving. The system consists of two physical sensors connected to a device node with a display and a speaker. An artificial neural network (ANN) is included in the node, which analyses the data from the sensors, and then recognises the driving style. When an abnormal driving pattern is detected, the speaker will play a warning message. The prototype was assembled and tested using an interurban road, in particular on a conventional road with three driving styles. The gathered data were used to train and validate the ANN. Results, in terms of accuracy, indicate that better accuracy is obtained when the velocity, position (latitude and longitude), time, and turning speed for the 3-axis are used, offering an average accuracy of 83%. If the classification is performed considering just two driving styles, normal and aggressive, then the accuracy reaches 92%. When the geo-information and time data are included, the main novelty of this paper, the classification accuracy is improved by 13%.
【2】Blazing the trails before beating the path: Sample-efficient Monte-Carlo planning
Link: https://arxiv.org/abs/2604.14974
Authors: Jean-Bastien Grill, Michal Valko, Rémi Munos
Comments: Published in Neural Information Processing Systems 2016
Abstract: You are a robot and you live in a Markov decision process (MDP) with a finite or an infinite number of transitions from state-action to next states. You got brains and so you plan before you act. Luckily, your roboparents equipped you with a generative model to do some Monte-Carlo planning. The world is waiting for you and you have no time to waste. You want your planning to be efficient. Sample-efficient. Indeed, you want to exploit the possible structure of the MDP by exploring only a subset of states reachable by following near-optimal policies. You want guarantees on sample complexity that depend on a measure of the quantity of near-optimal states. You want something, that is an extension of Monte-Carlo sampling (for estimating an expectation) to problems that alternate maximization (over actions) and expectation (over next states). But you do not want to StOP with exponential running time, you want something simple to implement and computationally efficient. You want it all and you want it now. You want TrailBlazer.
Point cloud|SLAM|Radar|Laser|Depth and RGB-D (1 paper)
【1】Class Unlearning via Depth-Aware Removal of Forget-Specific Directions
Link: https://arxiv.org/abs/2604.15166
Authors: Arman Hatami, Romina Aalishah, Ilya E. Monosov
Comments: Accepted to the CVPR 2026 Workshop on Machine Unlearning for Vision (MUV)
Abstract: Machine unlearning aims to remove targeted knowledge from a trained model without the cost of retraining from scratch. In class unlearning, however, reducing accuracy on forget classes does not necessarily imply true forgetting: forgotten information can remain encoded in internal representations, and apparent forgetting may arise from classifier-head suppression rather than representational removal. We show that existing class-unlearning methods often exhibit weak or negative selectivity, preserve forget-class structure in deep representations, or rely heavily on final-layer bias shifts. We then introduce DAMP (Depth-Aware Modulation by Projection), a one-shot, closed-form weight-surgery method that removes forget-specific directions from a pretrained network without gradient-based optimization. At each stage, DAMP computes class prototypes in the input space of the next learnable operator, extracts forget directions as residuals relative to retain-class prototypes, and applies a projection-based update to reduce downstream sensitivity to those directions. To preserve utility, DAMP uses a parameter-free depth-aware scaling rule derived from probe separability, applying smaller edits in early layers and larger edits in deeper layers. The method naturally extends to multi-class forgetting through low-rank subspace removal. Across MNIST, CIFAR-10, CIFAR-100, and Tiny ImageNet, and across convolutional and transformer architectures, DAMP more closely resembles the retraining gold standard than some of the prior methods, improving selective forgetting while better preserving retain-class performance and reducing residual forget-class structure in deep layers.
Federated learning|Privacy protection|Encryption (2 papers)
【1】FedIDM: Achieving Fast and Stable Convergence in Byzantine Federated Learning through Iterative Distribution Matching
Link: https://arxiv.org/abs/2604.15115
Authors: He Yang, Dongyi Lv, Wei Xi, Song Ma, Hanlin Gu, Jizhong Zhao
Abstract: Most existing Byzantine-robust federated learning (FL) methods suffer from slow and unstable convergence. Moreover, when handling a substantial proportion of colluding malicious clients, achieving robustness typically entails compromising model utility. To address these issues, this work introduces FedIDM, which employs distribution matching to construct trustworthy condensed data for identifying and filtering abnormal clients. FedIDM consists of two main components: (1) attack-tolerant condensed data generation, and (2) robust aggregation with negative contribution-based rejection. These components exclude local updates that (1) deviate from the update direction derived from condensed data, or (2) cause a significant loss on the condensed dataset. Comprehensive evaluations on three benchmark datasets demonstrate that FedIDM achieves fast and stable convergence while maintaining acceptable model utility, under multiple state-of-the-art Byzantine attacks involving a large number of malicious clients.
【2】Exploiting Correlations in Federated Learning: Opportunities and Practical Limitations
Link: https://arxiv.org/abs/2604.14751
Authors: Adrian Edin, Michel Kieffer, Mikael Johansson, Zheng Chen
Comments: 14 pages, 7 figures, submitted for possible publication
Abstract: The communication bottleneck in federated learning (FL) has spurred extensive research into techniques to reduce the volume of data exchanged between client devices and the central parameter server. In this paper, we systematically classify gradient and model compression schemes into three categories based on the type of correlations they exploit: structural, temporal, and spatial. We examine the sources of such correlations, propose quantitative metrics for measuring their magnitude, and reinterpret existing compression methods through this unified correlation-based framework. Our experimental studies demonstrate that the degrees of structural, temporal, and spatial correlations vary significantly depending on task complexity, model architecture, and algorithmic configurations. These findings suggest that algorithm designers should carefully evaluate correlation assumptions under specific deployment scenarios rather than assuming that they are always present. Motivated by these findings, we propose two adaptive compression designs that actively switch between different compression modes based on the measured correlation strength, and we evaluate their performance gains relative to conventional non-adaptive approaches. In summary, our unified taxonomy provides a clean and principled foundation for developing more effective and application-specific compression techniques for FL systems.
Inference|Analysis|Understanding|Explanation (10 papers)
【1】xFODE: An Explainable Fuzzy Additive ODE Framework for System Identification
Link: https://arxiv.org/abs/2604.14883
Authors: Ertugrul Kececi, Tufan Kumbasar
Comments: in IEEE Conference on Artificial Intelligence, 2026
Abstract: Recent advances in Deep Learning (DL) have strengthened data-driven System Identification (SysID), with Neural and Fuzzy Ordinary Differential Equation (NODE/FODE) models achieving high accuracy in nonlinear dynamic modeling. Yet, system states in these frameworks are often reconstructed without clear physical meaning, and input contributions to the state derivatives remain difficult to interpret. To address these limitations, we propose Explainable FODE (xFODE), an interpretable SysID framework with integrated DL-based training. In xFODE, we define states in an incremental form to provide them with physical meanings. We employ fuzzy additive models to approximate the state derivative, thereby enhancing interpretability per input. To provide further interpretability, Partitioning Strategies (PSs) are developed, enabling the training of fuzzy additive models with explainability. By structuring the antecedent space during training so that only two consecutive rules are activated for any given input, PSs not only yield lower complexity for local inference but also enhance the interpretability of the antecedent space. To train xFODE, we present a DL framework with parameterized membership function learning that supports end-to-end optimization. Across benchmark SysID datasets, xFODE matches the accuracy of NODE, FODE, and NLARX models while providing interpretable insights.
【2】Towards Trustworthy 6G Network Digital Twins: A Framework for Validating Counterfactual What-If Analysis in Edge Computing Resources
Link: https://arxiv.org/abs/2604.14787
Authors: Julian Jimenez Agudelo, Paola Soto, Ayat Zaki-Hindi, Jean-Sébastien Sottet, Sébastien Faye, Nina Slamnik-Kriještorac, Johann Marquez-Barja, Miguel Camelo Botero
Abstract: Network Digital Twins (NDTs) enable safe what-if analysis for 6G cloud-edge infrastructures, but adoption is often limited by fragmented workflows from telemetry to validation. We present a data-driven NDT framework that extends 6G-TWIN with a scalable pipeline for cloud-edge telemetry aggregation and semantic alignment into unified data models. Our contributions include: (i) scalable cloud-edge telemetry collection, (ii) regime-aware feature engineering capturing the network's scaling behavior, and (iii) a validation methodology based on Sign Agreement and Directional Sensitivity. Evaluated on a Kubernetes-managed cluster, the framework extrapolates performance to unseen high-load regimes. Results show both Deep Neural Network (DNN) and XGBoost achieve high regression accuracy (R^2 > 0.99), while the XGBoost model delivers superior directional reliability (Sa > 0.90), making the NDT a trustworthy tool for proactive resource scaling in out-of-distribution scenarios.
【3】From Risk to Rescue: An Agentic Survival Analysis Framework for Liquidation Prevention
Link: https://arxiv.org/abs/2604.14583
Authors: Fernando Spadea, Oshani Seneviratne
Abstract: Decentralized Finance (DeFi) lending protocols like Aave v3 rely on over-collateralization to secure loans, yet users frequently face liquidation due to volatile market conditions. Existing risk management tools utilize static health-factor thresholds, which are reactive and fail to distinguish between administrative "dust" cleanup and genuine insolvency. In this work, we propose an autonomous agent that leverages time-to-event (survival) analysis and moves beyond prediction to execution. Unlike passive risk signals, this agent perceives risk, simulates counterfactual futures, and executes protocol-faithful interventions to proactively prevent liquidations. We introduce a return period metric derived from a numerically stable XGBoost Cox proportional hazards model to normalize risk across transaction types, coupled with a volatility-adjusted trend score to filter transient market noise. To select optimal interventions, we implement a counterfactual optimization loop that simulates potential user actions to find the minimum capital required to mitigate risk. We validate our approach using a high-fidelity, protocol-faithful Aave v3 simulator on a cohort of 4,882 high-risk user profiles. The results demonstrate the agent's ability to prevent liquidations in imminent-risk scenarios where static rules fail, effectively "saving the unsavable" while maintaining a zero worsening rate, providing a critical safety guarantee often missing in autonomous financial agents. Furthermore, the system successfully differentiates between actionable financial risks and negligible dust events, optimizing capital efficiency where static rules fail.
【4】Generative Augmented Inference
Link: https://arxiv.org/abs/2604.14575
Authors: Cheng Lu, Mengxin Wang, Dennis J. Zhang, Heng Zhang
Abstract: Data-driven operations management often relies on parameters estimated from costly human-generated labels. Recent advances in large language models (LLMs) and other AI systems offer inexpensive auxiliary data, but introduce a new challenge: AI outputs are not direct observations of the target outcomes, but could involve high-dimensional representations with complex and unknown relationships to human labels. Conventional methods leverage AI predictions as direct proxies for true labels, which can be inefficient or unreliable when this relationship is weak or misspecified. We propose Generative Augmented Inference (GAI), a general framework that incorporates AI-generated outputs as informative features for estimating models of human-labeled outcomes. GAI uses an orthogonal moment construction that enables consistent estimation and valid inference with a flexible, nonparametric relationship between LLM-generated outputs and human labels. We establish asymptotic normality and show a "safe default" property: relative to human-data-only estimators, GAI weakly improves estimation efficiency under arbitrary auxiliary signals and yields strict gains whenever the auxiliary information is predictive. Empirically, GAI outperforms benchmarks across diverse settings. In conjoint analysis with weak auxiliary signals, GAI reduces estimation error by about 50% and lowers human labeling requirements by over 75%. In retail pricing, where all methods access the same auxiliary inputs, GAI consistently outperforms alternative estimators, highlighting the value of its construction rather than differences in information. In health insurance choice, it cuts labeling requirements by over 90% while maintaining decision accuracy. Across applications, GAI improves confidence interval coverage without inflating width. Overall, GAI provides a principled and scalable approach to integrating AI-generated information.
【5】Thermodynamic Diffusion Inference with Minimal Digital Conditioning
Link: https://arxiv.org/abs/2604.14332
Authors: Aditi De
Abstract: Diffusion-model inference and overdamped Langevin dynamics are formally identical. A physical substrate that encodes the score function therefore equilibrates to the correct output by thermodynamics alone, requiring no digital arithmetic during inference and potentially achieving a $10{,}000\times$ reduction in energy relative to a GPU. Two fundamental barriers have until now prevented this equivalence from being realized at production scale: non-local skip connections, which locally coupled analog substrates cannot represent, and input conditioning, in which the coupling constants carry roughly $2{,}600\times$ too little signal to anchor the system to a specific input. We resolve both obstacles. Hierarchical bilinear coupling encodes U-Net skip connections as rank-$k$ inter-module interactions derived directly from the singular structure of the encoder and decoder Gram matrices, requiring only $O(Dk)$ physical connections instead of $O(D^2)$. A minimal digital interface (a 4-dimensional bottleneck encoder together with a 16-unit transfer network, totalling 2,560 parameters) overcomes the conditioning barrier. When evaluated on activations drawn from a trained denoising U-Net, the complete system attains a decoder cosine similarity of 0.9906 against an oracle upper bound of 1.0000, while preserving theoretical net energy savings of approximately $10^7\times$ over GPU inference. These results constitute the first demonstration of trained-weight, production-scale thermodynamic diffusion inference.
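Illustrative sketch (not from the paper): the abstract's starting point is that sampling can be written as overdamped Langevin dynamics driven by a score function. The snippet below simulates that dynamics digitally for a toy one-dimensional Gaussian target, whose score is known in closed form. The target parameters, step size, and function names are assumptions chosen for illustration; the paper's analog substrate and learned U-Net score are not modelled here.

```python
import numpy as np

def gaussian_score(x, mu, sigma):
    """Score (gradient of the log-density) of the toy target N(mu, sigma^2)."""
    return -(x - mu) / sigma**2

def langevin_sample(score, x0, step, n_steps, rng):
    """Euler-Maruyama discretisation of dx = score(x) dt + sqrt(2) dW."""
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        noise = rng.standard_normal(x.shape)
        x = x + step * score(x) + np.sqrt(2.0 * step) * noise
    return x

rng = np.random.default_rng(0)
mu, sigma = 2.0, 0.5                      # hypothetical target distribution
x0 = rng.standard_normal(5000) * 3.0      # particles start far from equilibrium
samples = langevin_sample(
    lambda x: gaussian_score(x, mu, sigma), x0, step=0.01, n_steps=2000, rng=rng
)
print(samples.mean(), samples.std())      # should settle near mu and sigma
```

The finite step size introduces a small bias in the stationary variance, so the empirical standard deviation only approximately matches `sigma`.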
【6】Metric-Aware Principal Component Analysis (MAPCA):A Unified Framework for Scale-Invariant Representation Learning
Link: https://arxiv.org/abs/2604.14249
Authors: Michael Leznik
Comments: 12 pages, one figure
Abstract: We introduce Metric-Aware Principal Component Analysis (MAPCA), a unified framework for scale-invariant representation learning based on the generalised eigenproblem max Tr(W^T Sigma W) subject to W^T M W = I, where M is a symmetric positive definite metric matrix. The choice of M determines the representation geometry. The canonical beta-family M(beta) = Sigma^beta, beta in [0,1], provides continuous spectral bias control between standard PCA (beta=0) and output whitening (beta=1), with condition number kappa(beta) = (lambda_1/lambda_p)^(1-beta) decreasing monotonically to isotropy. The diagonal metric M = D = diag(Sigma) recovers Invariant PCA (IPCA), a method rooted in Frisch (1928) diagonal regression, as a distinct member of the broader framework. We prove that scale invariance holds if and only if the metric transforms as M_tilde = CMC under rescaling C, a condition satisfied exactly by IPCA but not by the general beta-family at intermediate values. Beyond its classical interpretation, MAPCA provides a geometric language that unifies several self-supervised learning objectives. Barlow Twins and ZCA whitening correspond to beta=1 (output whitening); VICReg's variance term corresponds to the diagonal metric. A key finding is that W-MSE, despite being described as a whitening-based method, corresponds to M = Sigma^{-1} (beta = -1), outside the spectral compression range entirely and in the opposite spectral direction to Barlow Twins. This distinction between input and output whitening is invisible at the level of loss functions and becomes precise only within the MAPCA framework.
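Illustrative sketch (not the author's code): the generalised eigenproblem stated in the abstract can be solved by reducing it to an ordinary symmetric eigenproblem through M^{-1/2}. The snippet below does this with NumPy on synthetic data and checks two claims from the abstract numerically: the condition number (lambda_1/lambda_p)^(1-beta) for the beta-family, and the invariance of the diagonal-metric spectrum under a diagonal rescaling C. The helper names `spd_power` and `mapca`, the synthetic data, and the choice of C are assumptions for illustration.

```python
import numpy as np

def spd_power(A, p):
    """Return A**p for a symmetric positive definite matrix via eigendecomposition."""
    vals, vecs = np.linalg.eigh(A)
    return (vecs * vals**p) @ vecs.T

def mapca(sigma, metric, k):
    """Top-k solution of max Tr(W^T Sigma W) subject to W^T M W = I."""
    m_inv_half = spd_power(metric, -0.5)
    reduced = m_inv_half @ sigma @ m_inv_half
    reduced = (reduced + reduced.T) / 2          # symmetrise against round-off
    vals, vecs = np.linalg.eigh(reduced)
    order = np.argsort(vals)[::-1][:k]
    return vals[order], m_inv_half @ vecs[:, order]

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4)) * np.array([10.0, 3.0, 1.0, 0.3])
sigma = np.cov(X, rowvar=False)
lam = np.sort(np.linalg.eigvalsh(sigma))[::-1]   # lambda_1 >= ... >= lambda_p

# beta-family: M(beta) = Sigma^beta interpolates between PCA and whitening.
kappa = {}
for beta in (0.0, 0.5, 1.0):
    vals, W = mapca(sigma, spd_power(sigma, beta), k=4)
    kappa[beta] = vals[0] / vals[-1]
    print(beta, kappa[beta], (lam[0] / lam[-1]) ** (1 - beta))

# Diagonal metric M = diag(Sigma): spectrum is unchanged by rescaling x -> C x.
vals_d, W_d = mapca(sigma, np.diag(np.diag(sigma)), k=4)
C = np.diag([0.1, 2.0, 5.0, 30.0])               # hypothetical unit change
sigma_c = C @ sigma @ C
vals_c, _ = mapca(sigma_c, np.diag(np.diag(sigma_c)), k=4)
print(np.allclose(vals_d, vals_c))
```

With the diagonal metric the reduced matrix is the correlation matrix, which is why the spectrum does not depend on the units of the input variables.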
【7】Interpretable and Explainable Surrogate Modeling for Simulations: A State-of-the-Art Survey and Perspectives on Explainable AI for Decision-Making
Link: https://arxiv.org/abs/2604.14240
Authors: Pramudita Satria Palar, Paul Saves, Muhammad Daffa Robani, Nicolas Verstaevel, Moncef Garouani, Julien Aligon, Koji Shimoyama, Joseph Morlier, Benoit Gaudou
Comments: Accepted for publication in Archives of Computational Methods in Engineering, 2026, ID d9d36aab-3723-4a70-b2ce-166435179528
Abstract: The simulation of complex systems increasingly relies on sophisticated but fundamentally opaque computational black-box simulators. Surrogate models play a central role in reducing the computational cost of complex systems simulations across a wide range of scientific and engineering domains. Notwithstanding, they inevitably inherit and often exacerbate this black-box nature, obscuring how input variables drive physical responses. Conversely, Explainable Artificial Intelligence (XAI) offers powerful tools to unpack these models. Yet, XAI methods struggle with engineering-specific constraints, such as highly correlated inputs, dynamical systems, and rigorous reliability requirements. Consequently, surrogate modeling and XAI have largely evolved as distinct fields of research, despite their strong complementarity. To reconnect these approaches, this state-of-the-art survey provides a structured perspective that maps existing XAI techniques onto the various stages of surrogate modeling workflows for design and exploration. To ground this synthesis, we draw upon illustrative applications across both equation-based simulations and agent-based modeling. We survey a broad spectrum of techniques, highlighting their strengths for revealing interactions and supporting human comprehension. Finally, we identify pressing open challenges, including the explainability of dynamical systems and the handling of mixed-variable systems, and propose a research agenda to make explainability a core, embedded element of simulation-driven workflows from model construction through decision-making. By transforming opaque emulators into explainable tools, this agenda empowers practitioners to move beyond accelerating simulations to extracting actionable insights from complex system behaviors.
【8】Towards Verified and Targeted Explanations through Formal Methods
标题:通过形式化方法实现经过验证的针对性解释
链接:https://arxiv.org/abs/2604.14209
作者:Hanchen David Wang,Diego Manzanas Lopez,Preston K. Robinette,Ipek Oguz,Taylor T. Johnson,Meiyi Ma
备注:Paper has been accepted at JAIR
摘要:随着深度神经网络被部署在自动驾驶和医疗诊断等安全关键领域,利益相关者需要的解释不仅要可理解,还要有形式化保证、值得信赖。现有的XAI方法存在不足:启发式归因技术(例如LIME、Integrated Gradients)能突出有影响力的特征,但无法提供关于决策边界的数学保证;形式化方法能验证鲁棒性,却没有针对性,只分析最近的边界,而不管它是否代表关键风险。在安全关键系统中,并非所有的错误分类都会带来相同的后果:将“停止”标志与“60 kph”标志混淆,要比将其与“禁止超车”标志混淆危险得多。我们提出ViTaX(Verified and Targeted Explanations),这是一个形式化的XAI框架,可以生成具有数学保证的有针对性的半事实解释。对于给定的输入(类y)和用户指定的关键替代类(类t),ViTaX:(1)识别对y->t转换最敏感的最小特征子集;(2)应用形式化可达性分析,保证将这些特征扰动ε不会使分类翻转为t。我们通过“有针对性的ε-鲁棒性”(Targeted ε-Robustness)将其形式化,以证明某个特征子集在朝向特定目标类的扰动下是否保持鲁棒。ViTaX是第一个为模型抵御用户指定替代类的能力提供形式化保证解释的方法。在MNIST、GTSRB、EMNIST和TaxiNet上的评估表明,保真度提高了30%以上,且解释基数最小。
摘要:As deep neural networks are deployed in safety-critical domains such as autonomous driving and medical diagnosis, stakeholders need explanations that are interpretable but also trustworthy with formal guarantees. Existing XAI methods fall short: heuristic attribution techniques (e.g., LIME, Integrated Gradients) highlight influential features but offer no mathematical guarantees about decision boundaries, while formal methods verify robustness yet remain untargeted, analyzing the nearest boundary regardless of whether it represents a critical risk. In safety-critical systems, not all misclassifications carry equal consequences; confusing a "Stop" sign for a "60 kph" sign is far more dangerous than confusing it with a "No Passing" sign. We introduce ViTaX (Verified and Targeted Explanations), a formal XAI framework that generates targeted semifactual explanations with mathematical guarantees. For a given input (class y) and a user-specified critical alternative (class t), ViTaX: (1) identifies the minimal feature subset most sensitive to the y->t transition, and (2) applies formal reachability analysis to guarantee that perturbing these features by epsilon cannot flip the classification to t. We formalize this through Targeted epsilon-Robustness, certifying whether a feature subset remains robust under perturbation toward a specific target class. ViTaX is the first method to provide formally guaranteed explanations of a model's resilience against user-identified alternatives. Evaluations on MNIST, GTSRB, EMNIST, and TaxiNet demonstrate over 30% fidelity improvement with minimal explanation cardinality.
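To make the idea of targeted epsilon-robustness concrete, the following is a minimal sketch under a strong simplifying assumption: a linear classifier stands in for the deep network, so the reachability step reduces to an exact interval bound. The function names and the linear model are illustrative assumptions and are not taken from ViTaX itself.

```python
# Illustrative sketch only: a linear model replaces the neural network, so the
# reachable set of logit_t - logit_y can be bounded exactly in closed form.

def targeted_margin_upper_bound(weights, biases, x, y, t, subset, eps):
    """Largest reachable value of logit_t - logit_y when every feature in
    `subset` may move by at most `eps` and all other features stay fixed."""
    diff = [wt - wy for wt, wy in zip(weights[t], weights[y])]
    base = sum(d * xi for d, xi in zip(diff, x)) + biases[t] - biases[y]
    # Worst case: each perturbed feature moves in the direction that favors t.
    return base + eps * sum(abs(diff[i]) for i in subset)


def is_targeted_eps_robust(weights, biases, x, y, t, subset, eps):
    """True if no eps-perturbation of `subset` lets class t overtake class y."""
    bound = targeted_margin_upper_bound(weights, biases, x, y, t, subset, eps)
    return bound < 0.0
```

For a real network the closed-form bound would be replaced by a sound reachability tool; the certificate has the same shape, namely an upper bound on the target-versus-true margin that stays negative.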
【9】Optimal algorithmic complexity of inference in quantum kernel methods
标题:量子核方法中推理的最佳算法复杂度
链接:https://arxiv.org/abs/2604.15214
作者:Elies Gil-fuster,Seongwook Shin,Sofiene Jerbi,Jens Eisert,Maximilian J. Kramer
备注:26 pages (13+13), 4 figures, comments welcome
摘要:量子核方法是在监督学习中实现量子优势的主要候选方法之一。一个关键的瓶颈是推理的成本:在新数据上评估训练好的模型,需要以加性精度$\varepsilon$估计$N$个核值的加权和$\sum_{i=1}^N α_i k(x,x_i)$,其中$α$是训练得到的系数向量。标准方法通过采样独立地估计每一项,查询复杂度为$O(N\lVertα\rVert_2^2/\varepsilon^2)$。在这项工作中,我们确定了两个独立的改进方向:(1)如何估计单个核值(采样或量子振幅估计),以及(2)如何近似该求和(逐项估计或通过单个可观测量),并系统地分析了它们的所有组合。查询最优的组合将完整的推理求和编码为单个可观测量的期望值,并应用量子振幅估计,其查询复杂度为$O(\lVertα\rVert_1/\varepsilon)$,消除了查询次数对$N$的依赖,并在$\lVertα\rVert_1$和$\varepsilon$上都带来二次改进。我们证明了匹配的下界$Ω(\lVertα\rVert_1/\varepsilon)$,从而确立了我们的方法在对数因子意义下的查询最优性。除查询复杂度之外,我们还分析了这些改进如何转化为门成本,并表明从门复杂度的角度看,查询最优策略在实践中并不总是最优的。我们的结果既给出了查询最优的算法,也给出了取决于硬件能力的实际最优策略选择,同时提供了中间方法的完整图景,以指导从业者。所有算法只需要振幅估计作为子程序,因此是早期容错实现的自然候选。
摘要:Quantum kernel methods are among the leading candidates for achieving quantum advantage in supervised learning. A key bottleneck is the cost of inference: evaluating a trained model on new data requires estimating a weighted sum $\sum_{i=1}^N α_i k(x,x_i)$ of $N$ kernel values to additive precision $\varepsilon$, where $α$ is the vector of trained coefficients. The standard approach estimates each term independently via sampling, yielding a query complexity of $O(N\lVertα\rVert_2^2/\varepsilon^2)$. In this work, we identify two independent axes for improvement: (1) How individual kernel values are estimated (sampling versus quantum amplitude estimation), and (2) how the sum is approximated (term-by-term versus via a single observable), and systematically analyze all combinations thereof. The query-optimal combination, encoding the full inference sum as the expectation value of a single observable and applying quantum amplitude estimation, achieves a query complexity of $O(\lVertα\rVert_1/\varepsilon)$, removing the dependence on $N$ from the query count and yielding a quadratic improvement in both $\lVertα\rVert_1$ and $\varepsilon$. We prove a matching lower bound of $Ω(\lVertα\rVert_1/\varepsilon)$, establishing query-optimality of our approach up to logarithmic factors. Beyond query complexity, we also analyze how these improvements translate into gate costs and show that the query-optimal strategy is not always optimal in practice from the perspective of gate complexity. Our results provide both a query-optimal algorithm and a practically optimal choice of strategy depending on hardware capabilities, along with a complete landscape of intermediate methods to guide practitioners. All algorithms require only amplitude estimation as a subroutine and are thus natural candidates for early-fault-tolerant implementations.
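One way to read the claimed quadratic improvement is through the Cauchy-Schwarz inequality, which relates the two norms in the complexity bounds. The short derivation below is a reader's note based only on the quantities stated in the abstract, not the paper's own proof.

```latex
\[
\lVert\alpha\rVert_1
  = \sum_{i=1}^{N} |\alpha_i| \cdot 1
  \le \sqrt{N}\,\lVert\alpha\rVert_2
\quad\Longrightarrow\quad
\left(\frac{\lVert\alpha\rVert_1}{\varepsilon}\right)^{2}
  \le \frac{N\,\lVert\alpha\rVert_2^{2}}{\varepsilon^{2}} .
\]
```

So the square of the new query count $\lVert\alpha\rVert_1/\varepsilon$ never exceeds the sampling baseline $N\lVert\alpha\rVert_2^2/\varepsilon^2$, with equality when all $|\alpha_i|$ are equal.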
【10】Combining Bayesian and Frequentist Inference for Laboratory-Specific Performance Guarantees in Copy Number Variation Detection
标题:在拷贝数变异检测中结合Bayesian和Frequentist推理以保证特定于实验室的性能
链接:https://arxiv.org/abs/2604.14305
作者:Austin Talbot,Alex V. Kotlar,Yue Ke
摘要:靶向扩增子面板广泛用于肿瘤学诊断,但由于扩增伪影、流程不匹配带来的异质性以及有限的验证样本量,为拷贝数变异(CNV)检测提供每个基因的性能保证仍然具有挑战性。虽然贝叶斯CNV检测器能自然地量化每个样本的不确定性,但将其转换为临床验证所需的频率学派群体水平保证(覆盖率、假阳性界限和最小可检测拷贝数变化),是一个根本不同的推断问题。我们通过实验表明,即使是稳健的贝叶斯可信区间,包括粗化后验(coarsened posteriors)和三明治调整区间(sandwich-adjusted intervals),在每个基因扩增子数量较少的面板上也严重校准失准。为了解决这个问题,我们提出了一个混合框架:在验证样本上评估贝叶斯后验泛函,并用Gamma分布对由此得到的平方损失建模,从而得到具有有效频率学派覆盖率的容忍区间。三个组成部分使该方法在现实约束下切实可行:(1)插补,在不需要已知真实标签的情况下消除真实CNV阳性样本的影响;(2)正则化,以应对小样本的变异性;(3)基于对数模型证据的证据分层,以适应流程不匹配产生的不可交换噪声特征。使用留一法交叉验证在两个靶向扩增子面板上进行评估,所提出的方法在流程匹配和不匹配条件下,在所有基因上实现了个位数的平均绝对覆盖误差,而贝叶斯对照方法在临床相关基因(如ERBB2)上的平均绝对误差超过60%。
摘要:Targeted amplicon panels are widely used in oncology diagnostics, but providing per-gene performance guarantees for copy number variant (CNV) detection remains challenging due to amplification artifacts, process-mismatch heterogeneity, and limited validation sample sizes. While Bayesian CNV callers naturally quantify per-sample uncertainty, translating this into the frequentist population-level guarantees required for clinical validation (coverage rates, false-positive bounds, and minimum detectable copy-number changes) is a fundamentally different inferential problem. We show empirically that even robust Bayesian credible intervals, including coarsened posteriors and sandwich-adjusted intervals, are severely miscalibrated on panels with small amplicon counts per gene. To address this, we propose a hybrid framework that evaluates Bayesian posterior functionals on validation samples and models the resulting squared losses with a Gamma distribution, yielding tolerance intervals with valid frequentist coverage. Three components make the method practical under real-world constraints: (1) imputation that removes the influence of true CNV-positive samples without requiring known ground truth, (2) regularization to address small sample variability, and (3) evidence-based stratification on the log model evidence to accommodate non-exchangeable noise profiles arising from process mismatch. Evaluated on two targeted amplicon panels using leave-one-out cross-validation, the proposed method achieves single-digit mean absolute coverage error across all genes under both process-matched and unmatched conditions, whereas Bayesian comparators exhibit mean absolute errors exceeding 60% on clinically relevant genes such as ERBB2.
检测相关(2篇)
【1】Comparison of Modern Multilingual Text Embedding Techniques for Hate Speech Detection Task
标题:仇恨言语检测任务的现代多语言文本嵌入技术比较
链接:https://arxiv.org/abs/2604.14907
作者:Evaldas Vaiciukynas,Paulius Danenas,Linas Ablonskis,Algirdas Sukys,Edgaras Dambrauskas,Voldemaras Zitkus,Rita Butkiene,Rimantas Butleris
备注:Submitted to Applied Soft Computing (Status: Decision in Process)
摘要:在线仇恨言论和辱骂性语言对内容审核构成了越来越大的挑战,特别是在多语言环境和立陶宛语等低资源语言中。本文研究了现代多语言句子嵌入模型在多大程度上可以支持立陶宛语、俄语和英语的准确仇恨言论检测,以及它们的性能如何取决于下游建模选择和特征维度。我们介绍了LtHate,一个来自新闻门户网站和社交网络的新的立陶宛语仇恨言论语料库,并使用统一的Python管道在LtHate、RuToxic和EnSuperset上对六个现代多语言编码器(potion、gemma、bge、snow、jina、e5)进行基准测试。对于每种嵌入,我们都训练一个单类HBOS异常检测器和一个两类CatBoost分类器,分别在使用和不使用主成分分析(PCA)压缩到64维特征向量的情况下进行。在所有数据集中,两类监督模型始终大幅优于单类异常检测,最佳配置在立陶宛语(jina)上的准确率高达80.96%,AUC ROC为0.887;在俄语(e5)上的准确率为92.19%,AUC ROC为0.978;在英语(e5 with PCA)上的准确率为77.21%,AUC ROC为0.859。PCA压缩在监督设置中保留了几乎所有的判别能力,但对无监督异常检测有一定的负面影响。这些结果展示了现代多语言句子嵌入与梯度提升决策树相结合,如何为多语言仇恨言论检测应用提供稳健的软计算解决方案。
摘要:Online hate speech and abusive language pose a growing challenge for content moderation, especially in multilingual settings and for low-resource languages such as Lithuanian. This paper investigates to what extent modern multilingual sentence embedding models can support accurate hate speech detection in Lithuanian, Russian, and English, and how their performance depends on downstream modeling choices and feature dimensionality. We introduce LtHate, a new Lithuanian hate speech corpus derived from news portals and social networks, and benchmark six modern multilingual encoders (potion, gemma, bge, snow, jina, e5) on LtHate, RuToxic, and EnSuperset using a unified Python pipeline. For each embedding, we train both a one class HBOS anomaly detector and a two class CatBoost classifier, with and without principal component analysis (PCA) compression to 64-dimensional feature vectors. Across all datasets, two class supervised models consistently and substantially outperform one class anomaly detection, with the best configurations achieving up to 80.96% accuracy and AUC ROC of 0.887 in Lithuanian (jina), 92.19% accuracy and AUC ROC of 0.978 in Russian (e5), and 77.21% accuracy and AUC ROC of 0.859 in English (e5 with PCA). PCA compression preserves almost all discriminative power in the supervised setting, while showing some negative impact for the unsupervised anomaly detection case. These results demonstrate how modern multilingual sentence embeddings combined with gradient boosted decision trees provide robust soft-computing solutions for multilingual hate speech detection applications.
【2】Asynchronous Probability Ensembling for Federated Disaster Detection
标题:用于联邦灾难检测的异步概率集成
链接:https://arxiv.org/abs/2604.14450
作者:Emanuel Teixeira Martins,Rodrigo Moreira,Larissa Ferreira Rodrigues Moreira,Rodolfo S. Villaça,Augusto Neto,Flávio de Oliveira Silva
备注:Paper accepted for publication at 31st IEEE Symposium on Computers and Communications (ISCC) 2026
摘要:灾难决策支持系统(DDSS)中快速准确的应急处理通常受到网络延迟和次优应用准确性的阻碍。虽然联合学习(FL)解决了其中的一些问题,但它受到异构卷积神经网络(CNN)架构之间的高通信成本和严格同步要求的限制。为了克服这些挑战,本文提出了一个分散的集成框架的基础上异步概率聚合和反馈蒸馏。通过将交换单元从模型权重转移到类概率向量,我们的方法保持了数据隐私,降低了数量级的通信要求,并提高了整体准确性。这种方法使不同的CNN设计能够异步协作,即使在资源受限的环境中也能提高灾害图像识别性能。实验测试表明,所提出的方法优于传统的个人骨干和标准的联邦方法,建立一个可扩展的和资源感知的解决方案,实时灾难响应。
摘要:Quick and accurate emergency handling in Disaster Decision Support Systems (DDSS) is often hampered by network latency and suboptimal application accuracy. While Federated Learning (FL) addresses some of these issues, it is constrained by high communication costs and rigid synchronization requirements across heterogeneous convolutional neural network (CNN) architectures. To overcome these challenges, this paper proposes a decentralized ensembling framework based on asynchronous probability aggregation and feedback distillation. By shifting the exchange unit from model weights to class-probability vectors, our method maintains data privacy, reduces communication requirements by orders of magnitude, and improves overall accuracy. This approach enables diverse CNN designs to collaborate asynchronously, enhancing disaster image identification performance even in resource-constrained settings. Experimental tests demonstrate that the proposed method outperforms traditional individual backbones and standard federated approaches, establishing a scalable and resource-aware solution for real-time disaster response.
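The exchange of class-probability vectors instead of model weights can be sketched as follows. This is a minimal illustration that assumes a uniform average over whichever clients have reported so far; the function names and the uniform weighting are assumptions, and the feedback-distillation step described in the abstract is not modeled.

```python
# Illustrative sketch: each client posts its latest class-probability vector
# whenever it is ready; the aggregator never waits for a full round.

def aggregate_probabilities(reports, num_classes):
    """Average the probability vectors currently available.

    reports: dict mapping a client id to that client's latest probability
             vector (a list of length num_classes).
    """
    if not reports:
        raise ValueError("no client has reported yet")
    totals = [0.0] * num_classes
    for probs in reports.values():
        for c, p in enumerate(probs):
            totals[c] += p
    return [v / len(reports) for v in totals]


def predict(reports, num_classes):
    """Return the index of the class with the highest averaged probability."""
    avg = aggregate_probabilities(reports, num_classes)
    return max(range(num_classes), key=avg.__getitem__)
```

Because only vectors of length `num_classes` are exchanged, clients with different CNN architectures can take part without sharing or aligning weights.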
分类|识别(3篇)
【1】MambaSL: Exploring Single-Layer Mamba for Time Series Classification
标题:MambaSL:探索用于时间序列分类的单层Mamba
链接:https://arxiv.org/abs/2604.15174
作者:Yoo-Min Jung,Leekyung Kim
备注:accepted at ICLR 2026
摘要:尽管最近在状态空间模型(SSM),如曼巴跨越各种序列域的进展,其独立的时间序列分类(TSC)的能力的研究仍然有限。我们提出了MambaSL,一个框架,最低限度地重新设计的选择性SSM和投影层的单层曼巴,由四个TSC特定的假设指导。为了解决基准测试的局限性-受限的配置,部分东安格利亚大学(UEA)数据集覆盖范围,以及可重复性不足的设置-我们在统一的协议下重新评估了所有30个UEA数据集的20个强基线。因此,MambaSL实现了最先进的性能,具有统计学上显著的平均改进,同时通过所有评估模型的公共检查点确保再现性。连同可视化,这些结果证明了基于Mamba的架构作为TSC骨干的潜力。
摘要:Despite recent advances in state space models (SSMs) such as Mamba across various sequence domains, research on their standalone capacity for time series classification (TSC) has remained limited. We propose MambaSL, a framework that minimally redesigns the selective SSM and projection layers of a single-layer Mamba, guided by four TSC-specific hypotheses. To address benchmarking limitations -- restricted configurations, partial University of East Anglia (UEA) dataset coverage, and insufficiently reproducible setups -- we re-evaluate 20 strong baselines across all 30 UEA datasets under a unified protocol. As a result, MambaSL achieves state-of-the-art performance with statistically significant average improvements, while ensuring reproducibility via public checkpoints for all evaluated models. Together with visualizations, these results demonstrate the potential of Mamba-based architectures as a TSC backbone.
【2】Tight Sample Complexity Bounds for Best-Arm Identification Under Bounded Systematic Bias
标题:有界系统偏差下最佳臂识别的紧样本复杂性界限
链接:https://arxiv.org/abs/2604.14345
作者:Tianhao Qian
备注:10 pages, 5 figures
摘要:随着自主推理和具身规划中搜索深度的增加,候选动作空间呈指数级扩展,给计算预算带来沉重负担。虽然启发式剪枝是一种常见的对策,但当代理模型(如LLM)表现出系统性评估偏差时,它缺乏形式化的安全保证。本文将节点扩展过程表述为动态前沿上的局部最佳臂识别(BAI)问题,并假设存在有界的系统偏差$L$。通过对Lambert W函数求逆,我们建立了$\mathcal{O}((Δ-4L)^{-2})$的加性样本复杂度,这表明只有当经验奖励差距超过$4L$时,安全的节点消除才可行。我们用信息论下界$Ω((Δ-2L)^{-2})$对此加以补充,以确认有偏搜索的结构性限制。随后在合成树和复杂推理任务上的评估表明,遵守这一局部安全边界能够成功保留最优轨迹,同时最大限度地提高样本分配效率。
摘要:As search depth increases in autonomous reasoning and embodied planning, the candidate action space expands exponentially, heavily taxing computational budgets. While heuristic pruning is a common countermeasure, it operates without formal safety guarantees when surrogate models (like LLMs) exhibit systematic evaluation biases. This paper frames the node expansion process as a localized Best-Arm Identification (BAI) problem over dynamic frontiers, subject to a bounded systematic bias $L$. By inverting the Lambert W function, we establish an additive sample complexity of $\mathcal{O}((Δ-4L)^{-2})$, which indicates that safe node elimination is only feasible when the empirical reward gap exceeds $4L$. We complement this with an information-theoretic lower bound of $Ω((Δ-2L)^{-2})$ to confirm the structural limits of biased search. Subsequent evaluations on both synthetic trees and complex reasoning tasks demonstrate that adhering to this local safety boundary successfully preserves optimal trajectories while maximizing sample allocation efficiency.
【3】Expert-Guided Class-Conditional Goodness-of-Fit Scores for Interpretable Classification with Informative Missingness: An Application to Seismic Monitoring
标题:具有信息缺失的可解释分类的专家引导类条件匹配优度分数:在地震监测中的应用
链接:https://arxiv.org/abs/2604.14809
作者:Shahar Cohen,David M. Steinberg,Yael Radzyner,Yochai Ben Horin
备注:50 pages, 8 figures
摘要:我们研究了分类问题的三个关键挑战:普遍的信息缺失,部分先验专家知识的整合到学习过程中,需要解释的决策规则。我们提出了一个框架,编码先验知识,通过一个或多个类的专家指导类条件模型,并使用该模型来构建一个小的可解释的拟合优度功能。这些特征量化了观察到的数据与专家模型的一致程度,隔离了数据不同方面的贡献,包括观察到的和缺失的成分。这些特征与简单的判别式分类器中的一些透明辅助摘要相结合,产生易于检查和证明的决策规则。我们在用于评估《全面禁止核试验条约》遵守情况的地震监测范围内制定和应用这一框架。我们表明,该方法具有强大的潜力,作为一个透明的筛选工具,减少专家分析师的工作量。一个旨在隔离所提出的框架的贡献的模拟表明,这种可解释的专家指导的方法甚至可以优于强大的标准机器学习分类器,特别是当训练样本很小时。
摘要:We study a classification problem with three key challenges: pervasive informative missingness, the integration of partial prior expert knowledge into the learning process, and the need for interpretable decision rules. We propose a framework that encodes prior knowledge through an expert-guided class-conditional model for one or more classes, and use this model to construct a small set of interpretable goodness-of-fit features. The features quantify how well the observed data agree with the expert model, isolating the contributions of different aspects of the data, including both observed and missing components. These features are combined with a few transparent auxiliary summaries in a simple discriminative classifier, resulting in a decision rule that is easy to inspect and justify. We develop and apply the framework in the context of seismic monitoring used to assess compliance with the Comprehensive Nuclear-Test-Ban Treaty. We show that the method has strong potential as a transparent screening tool, reducing workload for expert analysts. A simulation designed to isolate the contribution of the proposed framework shows that this interpretable expert-guided method can even outperform strong standard machine-learning classifiers, particularly when training samples are small.
表征(1篇)
【1】STEP-Parts: Geometric Partitioning of Boundary Representations for Large-Scale CAD Processing
标题:STEP-Parts:用于大规模CAD处理的边界表示的几何划分
链接:https://arxiv.org/abs/2604.14927
作者:Shen Fan,Mikołaj Kida,Przemyslaw Musialski
摘要:许多CAD学习流程将边界表示(B-Reps)离散为三角形网格,丢弃了解析曲面结构和拓扑邻接关系,从而削弱了一致的实例级分析。我们提出了STEP-Parts,一个确定性的CAD到监督信号的工具链,它直接从原始STEP B-Reps中提取几何实例分区,并通过保留的源面对应关系将其转移到镶嵌后的载体上,从而为下游学习和评估生成实例标签和元数据。仅当相邻的B-Rep面共享相同的解析基本体类型并满足近切线连续性条件时,该构造才会合并它们。在ABC上,相同基本体之间的二面角呈强烈的双峰分布,从而为零件提取提供了一个对阈值不敏感的低角度区间。由于分区是在B-Rep的内在拓扑上定义的,而不是在特定的三角剖分上定义的,因此生成的边界在镶嵌方式变化时保持稳定。应用于ABC的DeepCAD子集时,该流水线在消费级CPU上不到六小时即可处理约180,000个模型。我们发布了代码和预先计算的标签,并表明STEP-Parts既可作为对镶嵌鲁棒的几何参考,也可在两个下游探针中作为有用的监督来源:一个隐式重建-分割网络和一个数据集级的基于点的骨干网络。
摘要:Many CAD learning pipelines discretize Boundary Representations (B-Reps) into triangle meshes, discarding analytic surface structure and topological adjacency and thereby weakening consistent instance-level analysis. We present STEP-Parts, a deterministic CAD-to-supervision toolchain that extracts geometric instance partitions directly from raw STEP B-Reps and transfers them to tessellated carriers through retained source-face correspondence, yielding instance labels and metadata for downstream learning and evaluation. The construction merges adjacent B-Rep faces only when they share the same analytic primitive type and satisfy a near-tangent continuity criterion. On ABC, same-primitive dihedral angles are strongly bimodal, yielding a threshold-insensitive low-angle regime for part extraction. Because the partition is defined on intrinsic B-Rep topology rather than on a particular triangulation, the resulting boundaries remain stable under changes in tessellation. Applied to the DeepCAD subset of ABC, the pipeline processes approximately 180,000 models in under six hours on a consumer CPU. We release code and precomputed labels, and show that STEP-Parts serves both as a tessellation-robust geometric reference and as a useful supervision source in two downstream probes: an implicit reconstruction--segmentation network and a dataset-level point-based backbone.
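The merge rule described above (same analytic primitive type plus near-tangent continuity) maps naturally onto a union-find pass over the face adjacency graph. The sketch below is a hypothetical illustration: the input layout, the function name, and the 10-degree threshold are assumptions and do not come from the STEP-Parts release.

```python
# Illustrative sketch: merge adjacent B-Rep faces into parts with union-find.

def partition_faces(face_types, adjacency, max_angle_deg=10.0):
    """Assign a part label to every face.

    face_types: list of primitive type names, indexed by face id.
    adjacency:  iterable of (face_a, face_b, angle_deg), where angle_deg is the
                deviation from tangent continuity across the shared edge.
    """
    parent = list(range(len(face_types)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving keeps trees shallow
            i = parent[i]
        return i

    for a, b, angle in adjacency:
        # Merge only when both conditions from the abstract hold.
        if face_types[a] == face_types[b] and angle <= max_angle_deg:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a

    # Relabel roots as consecutive part ids in order of first appearance.
    labels = {}
    return [labels.setdefault(find(i), len(labels)) for i in range(len(face_types))]
```

Since the labels depend only on face types and edge angles, re-tessellating the model would leave them unchanged, which is the stability property the abstract emphasizes.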
3D|3D重建等相关(1篇)
【1】ELMoE-3D: Leveraging Intrinsic Elasticity of MoE for Hybrid-Bonding-Enabled Self-Speculative Decoding in On-Premises Serving
标题:ELMoE-3D:利用MoE的内在弹性,在本地部署服务中实现基于混合键合的自推测解码
链接:https://arxiv.org/abs/2604.14626
作者:Yuseon Choi,Jingu Lee,Jungjun Oh,Sunjoo Whang,Byeongcheol Kim,Minsung Kim,Hoi-Jun Yoo,Sangjin Kim
摘要:混合专家(MoE)模型已经成为大规模语言模型的主导架构,但本地部署服务从根本上讲仍然受内存限制,因为批处理将稀疏的每令牌计算变成了密集的内存激活。以内存为中心的架构(PIM、NMP)提高了带宽,但在高批量大小下,由于MoE的运算强度低,计算资源利用不足。推测解码(SD)以空闲计算换取更少的目标模型调用,然而即使对于被拒绝的令牌,验证也必须加载专家,这严重限制了它在MoE中的收益,特别是在低批量大小时。我们提出了ELMoE-3D,一个基于混合键合(HB)的软硬件协同设计框架,它将基于缓存的加速与推测解码相统一,以在各种批量大小下提供整体加速。我们确定了MoE的两个内在弹性轴,即专家和比特,并对它们进行联合缩放,以构建弹性自推测解码(Elastic-SD),它既充当专家缓存,又充当由高HB带宽加速的高度对齐的自草稿模型。我们的LSB增强位切片架构利用位切片表示中的固有冗余来原生支持位嵌套执行。在我们的3D堆叠硬件上,ELMoE-3D在批量大小为1-16时,相对于xPU上的朴素MoE服务实现了平均$6.6\times$的加速和$4.4\times$的能效增益,并相对于性能最佳的已有加速器基线实现了$2.2\times$的加速和$1.4\times$的能效增益。
摘要:Mixture-of-Experts (MoE) models have become the dominant architecture for large-scale language models, yet on-premises serving remains fundamentally memory-bound as batching turns sparse per-token compute into dense memory activation. Memory-centric architectures (PIM, NMP) improve bandwidth but leave compute underutilized under MoE's low arithmetic intensity at high batch sizes. Speculative decoding (SD) trades idle compute for fewer target invocations, yet verification must load experts even for rejected tokens, severely limiting its benefit in MoE especially at low batch sizes. We propose ELMoE-3D, a hybrid-bonding (HB)-based HW-SW co-designed framework that unifies cache-based acceleration and speculative decoding to offer overall speedup across batch sizes. We identify two intrinsic elasticity axes of MoE, expert and bit, and jointly scale them to construct Elastic Self-Speculative Decoding (Elastic-SD), which serves as both an expert cache and a strongly aligned self-draft model accelerated by high HB bandwidth. Our LSB-augmented bit-sliced architecture exploits inherent redundancy in bit-slice representations to natively support bit-nested execution. On our 3D-stacked hardware, ELMoE-3D achieves an average $6.6\times$ speedup and $4.4\times$ energy efficiency gain over naive MoE serving on xPU across batch sizes 1-16, and delivers $2.2\times$ speedup and $1.4\times$ energy efficiency gain over the best-performing prior accelerator baseline.
编码器(3篇)
【1】Assessing the Potential of Masked Autoencoder Foundation Models in Predicting Downhole Metrics from Surface Drilling Data
标题:评估掩码自编码器基础模型在根据地面钻井数据预测井下指标方面的潜力
链接:https://arxiv.org/abs/2604.15169
作者:Aleksander Berezowski,Hassan Hassanzadeh,Gouri Ginde
摘要:石油和天然气钻井作业从地面传感器产生大量的时间序列数据,但由于缺乏标记的井下测量,关键井下指标的准确实时预测仍然具有挑战性。这项系统的映射研究回顾了2015年至2025年期间发表的13篇论文,以评估掩蔽自动编码器基础模型(MAEFM)从地面钻井数据预测井下指标的潜力。审查确定了八个常用的地面指标和七个目标井下指标。目前的方法主要采用神经网络架构,如人工神经网络(ANN)和长短期记忆(LSTM)网络,但没有研究探索MAEFM,尽管它们在时间序列建模中表现出有效性。MAEFM通过对大量未标记数据进行自我监督预训练,提供了明显的优势,从而实现了多任务预测和改进的跨井泛化。这项研究表明,MAEFM代表了一个技术上可行但尚未探索的钻井分析机会,建议未来对现有模型进行经验验证,并探索其在石油和天然气作业中的更广泛适用性。
摘要:Oil and gas drilling operations generate extensive time-series data from surface sensors, yet accurate real-time prediction of critical downhole metrics remains challenging due to the scarcity of labelled downhole measurements. This systematic mapping study reviews thirteen papers published between 2015 and 2025 to assess the potential of Masked Autoencoder Foundation Models (MAEFMs) for predicting downhole metrics from surface drilling data. The review identifies eight commonly collected surface metrics and seven target downhole metrics. Current approaches predominantly employ neural network architectures such as artificial neural networks (ANNs) and long short-term memory (LSTM) networks, yet no studies have explored MAEFMs despite their demonstrated effectiveness in time-series modeling. MAEFMs offer distinct advantages through self-supervised pre-training on abundant unlabeled data, enabling multi-task prediction and improved generalization across wells. This research establishes that MAEFMs represent a technically feasible but unexplored opportunity for drilling analytics, recommending future empirical validation of their performance against existing models and exploration of their broader applicability in oil and gas operations.
【2】Improving Sparse Autoencoder with Dynamic Attention
标题:用动态注意力改进稀疏自动编码器
链接:https://arxiv.org/abs/2604.14925
作者:Dongsheng Wang,Jinsen Zhang,Dawei Su,Hui Huang
摘要:最近,稀疏自动编码器(SAE)已经成为一种很有前途的技术,用于通过将特征分解为稀疏的概念集来解释基础模型中的激活。然而,确定每个神经元的最佳稀疏水平在实践中仍然具有挑战性:过度的稀疏可能导致重建效果不佳,而稀疏不足可能会损害可解释性。虽然现有的激活函数(如ReLU和TopK)提供了一定的稀疏性保证,但它们通常需要额外的稀疏性正则化或精选超参数。在本文中,我们表明,动态稀疏注意力机制,使用sparsemax可以桥接这种权衡,由于他们的能力,以数据依赖的方式来确定激活数。具体来说,我们首先探索一类新的SAE的基础上的交叉注意架构的潜在功能的查询和可学习的字典作为关键字和值矩阵。为了鼓励稀疏模式学习,我们采用了一种基于sparsemax的注意力策略,该策略根据每个神经元的复杂性自动推断出一组稀疏的元素,从而产生一个更灵活和通用的激活函数。通过综合评估和可视化,我们表明,我们的方法成功地实现了较低的重建损失,同时产生高质量的概念,特别是在前n个分类任务。
摘要:Recently, sparse autoencoders (SAEs) have emerged as a promising technique for interpreting activations in foundation models by disentangling features into a sparse set of concepts. However, identifying the optimal level of sparsity for each neuron remains challenging in practice: excessive sparsity can lead to poor reconstruction, whereas insufficient sparsity may harm interpretability. While existing activation functions such as ReLU and TopK provide certain sparsity guarantees, they typically require additional sparsity regularization or cherry-picked hyperparameters. We show in this paper that dynamically sparse attention mechanisms using sparsemax can bridge this trade-off, due to their ability to determine the activation numbers in a data-dependent manner. Specifically, we first explore a new class of SAEs based on the cross-attention architecture with the latent features as queries and the learnable dictionary as the key and value matrices. To encourage sparse pattern learning, we employ a sparsemax-based attention strategy that automatically infers a sparse set of elements according to the complexity of each neuron, resulting in a more flexible and general activation function. Through comprehensive evaluation and visualization, we show that our approach successfully achieves lower reconstruction loss while producing high-quality concepts, particularly in top-n classification tasks.
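The data-dependent sparsity mentioned above comes from sparsemax, the Euclidean projection of a score vector onto the probability simplex. Below is a minimal pure-Python sketch of sparsemax and of a cross-attention read over a dictionary; the function names, the unscaled dot-product scores, and the single-query setup are illustrative assumptions and not the authors' implementation.

```python
# Illustrative sketch: sparsemax and a sparsemax-weighted dictionary lookup.

def sparsemax(z):
    """Project scores z onto the simplex; low scores become exactly zero."""
    z_sorted = sorted(z, reverse=True)
    cumsum = 0.0
    support, support_sum = 0, 0.0
    for k, zk in enumerate(z_sorted, start=1):
        cumsum += zk
        if 1.0 + k * zk > cumsum:  # the k-th largest score stays in the support
            support, support_sum = k, cumsum
    tau = (support_sum - 1.0) / support  # threshold shared by all entries
    return [max(zi - tau, 0.0) for zi in z]


def sparse_attention(query, keys, values):
    """Score the query against dictionary keys, then mix values sparsely."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = sparsemax(scores)
    dim = len(values[0])
    output = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return output, weights
```

Unlike TopK, the number of nonzero weights is not fixed in advance: well-separated scores give a one-hot result, while close scores keep several dictionary entries active.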
【3】Magnitude Is All You Need? Rethinking Phase in Quantum Encoding of Complex SAR Data
标题:幅度就是你所需要的一切吗?重新思考复值SAR数据量子编码中的相位
链接:https://arxiv.org/abs/2604.14229
作者:Sakthi Prabhu Gunasekar,Prasanna Kumar R
备注:8 pages, 4 figures. Under review for IEEE QCE 2026
摘要:合成孔径雷达(SAR)数据本质上是复值的,而量子机器学习(QML)模型自然在复希尔伯特空间中运行。这种明显的一致性表明,将幅度和相位信息结合到量子编码中可以提高SAR自动目标识别(ATR)的性能。在这项工作中,我们通过比较五种量子编码策略系统地评估了这一假设:仅幅度,联合复杂,基于I/Q,预处理相位和纯量子,在MSTAR基准数据集的统一实验框架下。 与预期相反,我们观察到一个一致的模式:在混合量子经典架构中,仅幅度编码优于所有复值策略,在3类任务上实现了99.57%的准确率,在8类任务上实现了71.19%的准确率,而相位感知方法提供了可忽略不计的(~0%)或负的改进。相比之下,在只有184-224个可训练参数且没有经典组件的纯量子架构中,相位信息变得至关重要,有助于提高高达21.65%的精度。 这些结果表明,相位信息的效用不是固有的数据,但严重依赖于模型的架构。混合模型依赖于补偿丢失的相位信息的经典组件,而纯量子模型需要相位来构建区分性表示。我们的研究结果提供了实用的设计准则编码复杂的数据在QML和突出的编码架构协同设计在NISQ时代的重要性。
摘要:Synthetic Aperture Radar (SAR) data is inherently complex-valued, while quantum machine learning (QML) models naturally operate in complex Hilbert spaces. This apparent alignment suggests that incorporating both magnitude and phase information into quantum encoding should improve performance in SAR Automatic Target Recognition (ATR). In this work, we systematically evaluate this assumption by comparing five quantum encoding strategies: magnitude-only, joint complex, I/Q-based, preprocessed phase, and pure quantum, under a unified experimental framework on the MSTAR benchmark dataset. Contrary to expectation, we observe a consistent pattern: in hybrid quantum-classical architectures, magnitude-only encoding outperforms all complex-valued strategies, achieving 99.57% accuracy on a 3-class task and 71.19% on an 8-class task, while phase-aware methods provide negligible (~0%) or negative improvements. In contrast, in purely quantum architectures with only 184-224 trainable parameters and no classical components, phase information becomes essential, contributing up to 21.65% improvement in accuracy. These results reveal that the utility of phase information is not inherent to the data, but depends critically on the model architecture. Hybrid models rely on classical components that compensate for missing phase information, whereas purely quantum models require phase to construct discriminative representations. Our findings provide practical design guidelines for encoding complex-valued data in QML and highlight the importance of encoding-architecture co-design in the NISQ era.
优化|敛散性(8篇)
【1】Optimal last-iterate convergence in matrix games with bandit feedback using the log-barrier
标题:使用对数障碍在带bandit反馈的矩阵博弈中实现最优末次迭代收敛
链接:https://arxiv.org/abs/2604.15242
作者:Come Fiegel,Pierre Menard,Tadashi Kozuno,Michal Valko,Vianney Perchet
摘要:我们研究零和矩阵博弈中极小极大策略的学习问题。Fiegel et al.(2025)最近证明了可利用性差距的下界为Omega(t^{-1/4}),从而表明当参与者不耦合时,在这种设定下实现末次迭代收敛更加困难。文献中已针对该问题提出了一些在线镜像下降算法,但尚无算法真正达到这一速率。我们证明,使用对数障碍正则化并结合侧重对偶的分析,可以以高概率实现这种O-tilde(t^{-1/4})的收敛。我们还将这一思路推广到扩展式博弈的设定,证明了具有相同速率的界。
摘要:We study the problem of learning minimax policies in zero-sum matrix games. Fiegel et al. (2025) recently showed that achieving last-iterate convergence in this setting is harder when the players are uncoupled, by proving a lower bound on the exploitability gap of Omega(t^{-1/4}). Some online mirror descent algorithms were proposed in the literature for this problem, but none have truly attained this rate yet. We show that the use of a log-barrier regularization, along with a dual-focused analysis, allows this O-tilde(t^{-1/4}) convergence with high-probability. We additionally extend our idea to the setting of extensive-form games, proving a bound with the same rate.
【2】When Flat Minima Fail: Characterizing INT4 Quantization Collapse After FP32 Convergence
标题:当平坦极小值失效时:刻画FP32收敛后的INT4量化崩溃
链接:https://arxiv.org/abs/2604.15167
作者:Marcus Armstrong
摘要:训练后量化(PTQ)假设收敛良好的模型就是可直接量化的模型。我们表明,这种假设会以一种结构化、可度量且此前未被刻画的方式失效。我们将免校准的逐组INT4探针应用于全部154个公开可用的Pythia-160m训练检查点,识别出一个三阶段的发散结构:快速学习阶段,FP32困惑度和量化鲁棒性同时改善;持续约70,000步的亚稳态平台期,FP32困惑度停滞但INT4差距保持有界;以及爆炸性发散阶段,INT4差距从11%累积到517%,而FP32困惑度几乎不变。重要的是,这种发散并非始于学习率开始衰减之时,而是恰好始于FP32困惑度收敛之时;这是一个更细粒度的起始预测指标,意味着收敛后的权重更新(而不仅仅是衰减幅度)才是直接原因。我们进一步表明,INT8量化在所有三个阶段都完全不受影响,从而将该机制具体限定为16级INT4网格的粗糙度,并通过直接的峰度测量排除了权重离群值累积这一机制。最后,我们从发散前的检查点出发进行了受控分叉实验,在九次独立运行中比较了三种学习率调度(余弦延续、SGDR热重启和我们提出的振荡锁定Oscillatory Lock-In,OLI)。SGDR一致地加速了发散(与余弦的成对比较为0/9胜),而OLI稳定的冷却阶段平均将INT4差距减少了2.2个百分点(t = -5.46,p < 0.0001),这表明决定扰动有益还是有害的是调度幅度的校准,而不是振荡本身。我们的代码、探针实现和全部154个检查点的审计结果均已公开发布。
摘要:Post-training quantization (PTQ) assumes that a well-converged model is a quantization-ready model. We show this assumption fails in a structured, measurable, and previously uncharacterized way. Using a calibration-free per-group INT4 probe applied to all 154 publicly available Pythia-160m training checkpoints, we identify a three-phase divergence structure: a rapid-learning phase where both FP32 perplexity and quantization robustness improve together, a meta-stable plateau lasting roughly 70,000 steps where FP32 perplexity stagnates but INT4 gap remains bounded, and an explosive divergence phase where the INT4 gap compounds from 11% to 517% while FP32 perplexity barely moves. Critically, this divergence begins not when the learning rate starts decaying, but precisely when FP32 perplexity converges, a finer-grained onset predictor that implies post-convergence weight updates, rather than decay magnitude alone, are the proximate cause. We further show that INT8 quantization is entirely immune throughout all three phases, constraining the mechanism to the coarseness of the 16-level INT4 grid specifically, and rule out weight outlier accumulation as the mechanism via direct kurtosis measurement. Finally, we conduct a controlled fork experiment from the pre-divergence checkpoint comparing three learning rate schedules (cosine continuation, SGDR warm restarts, and our proposed Oscillatory Lock-In, OLI) across nine independent runs. SGDR uniformly accelerates divergence (0/9 pairwise wins against cosine), while OLI's settled cool phases reduce the INT4 gap by 2.2 percentage points on average (t = -5.46, p < 0.0001), demonstrating that schedule amplitude calibration, not oscillation alone, determines whether perturbation helps or hurts. Our code, probe implementation, and all 154-checkpoint audit results are released publicly.
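A calibration-free per-group probe of the kind mentioned above can be approximated by a quantize-then-dequantize round trip with one scale per weight group. The sketch below assumes symmetric absmax scaling and a tiny group size for readability; both choices, and the function names, are assumptions rather than the paper's exact probe.

```python
# Illustrative sketch: symmetric per-group round-trip quantization.

def quantize_dequantize(weights, bits=4, group_size=4):
    """Round each group of weights to a signed integer grid and map back."""
    qmax = 2 ** (bits - 1) - 1  # 7 for INT4, 127 for INT8
    restored = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        absmax = max(abs(w) for w in group)
        scale = absmax / qmax if absmax > 0 else 1.0  # one scale per group
        for w in group:
            q = round(w / scale)
            q = max(-qmax - 1, min(qmax, q))  # clamp to the signed integer range
            restored.append(q * scale)
    return restored


def max_abs_error(weights, restored):
    """Largest absolute round-trip error over all weights."""
    return max(abs(a - b) for a, b in zip(weights, restored))
```

With absmax scaling the round-trip error per weight is at most half a grid step, so the INT4 grid is roughly 18 times coarser than the INT8 grid (127/7) for the same group, which is the coarseness the abstract points to.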
【3】Beyond Importance Sampling: Rejection-Gated Policy Optimization
标题:超越重要性抽样:拒绝门控政策优化
链接:https://arxiv.org/abs/2604.14895
作者:Ziwu Sun,Zhen Gao,Jiyong Zhang,Jiaheng Li
备注:27 pages, includes theoretical analysis and experiments
摘要:我们为策略优化提出了一个新视角:优化器不应按重要性比率对所有样本重新加权,而应选择哪些样本足够可信、可以驱动策略更新。基于这一观点,我们引入了拒绝门控策略优化(Rejection-Gated Policy Optimization,RGPO),它将重要性采样比率r_theta = pi_theta / pi_old替换为取值于[0,1]的平滑可微接受门alpha_theta(s, a) = g(r_theta(s, a))。与此前在训练之前将拒绝采样作为数据级启发式方法的工作不同,RGPO将拒绝提升为一种优化原则:门直接参与梯度计算,并随策略一起被隐式更新。RGPO提供了一个统一的框架:TRPO、PPO和REINFORCE的策略梯度都对应于有效梯度权重w(r) = g'(r) * r的特定选择。我们证明,即使重要性采样比率是重尾的(此时IS方差发散),RGPO也能保证有限且有界的梯度方差。我们进一步表明,RGPO只会引入有界且可控的偏差,并提供类似于TRPO的近似单调策略改进保证。RGPO的计算成本与PPO相当,不需要二阶优化,并可自然地扩展到RLHF风格的偏好对齐。在Anthropic HH-RLHF上对Qwen2.5-1.5B-Instruct进行的在线偏好微调中(n = 3个种子),RGPO使用双比率门,将学习同时锚定到先前策略和参考模型,取得了帕累托占优的结果:在线强化学习方法中奖励最高(相比PPO-RLHF为+14.8%),且与参考模型的KL散度最低(相比PPO-RLHF为-16.0%,相比GRPO为-53.1%)。
摘要:We propose a new perspective on policy optimization: rather than reweighting all samples by their importance ratios, an optimizer should select which samples are trustworthy enough to drive a policy update. Building on this view, we introduce Rejection-Gated Policy Optimization (RGPO), which replaces the importance sampling ratio r_theta = pi_theta / pi_old with a smooth, differentiable acceptance gate alpha_theta(s, a) = g(r_theta(s, a)) in the range [0, 1]. Unlike prior work that applies rejection sampling as a data-level heuristic before training, RGPO elevates rejection to an optimization principle: the gate participates directly in gradient computation and is implicitly updated alongside the policy. RGPO provides a unified framework: the policy gradients of TRPO, PPO, and REINFORCE all correspond to specific choices of the effective gradient weight w(r) = g'(r) * r. We prove that RGPO guarantees finite, bounded gradient variance even when importance sampling ratios are heavy-tailed (where IS variance diverges). We further show that RGPO incurs only a bounded, controllable bias and provides an approximate monotonic policy improvement guarantee analogous to TRPO. RGPO matches PPO in computational cost, requires no second-order optimization, and extends naturally to RLHF-style preference alignment. In online preference fine-tuning of Qwen2.5-1.5B-Instruct on Anthropic HH-RLHF (n = 3 seeds), RGPO uses a dual-ratio gate that anchors learning to both the previous policy and the reference model, achieving a Pareto-dominant outcome: the highest reward among online RL methods (+14.8% vs. PPO-RLHF) and the lowest KL divergence to the reference model (-16.0% vs. PPO-RLHF, -53.1% vs. GRPO).
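The effective gradient weight w(r) = g'(r) * r quoted in the abstract can be recovered with the chain rule. The sketch below assumes a surrogate objective of the form E[g(r_theta) A], with samples drawn from pi_old and an advantage term A; both the objective form and the symbol A are assumptions made for illustration and are not stated in the abstract.

```latex
\nabla_\theta \, \mathbb{E}_{\pi_{\mathrm{old}}}\!\big[ g(r_\theta)\, A \big]
  = \mathbb{E}_{\pi_{\mathrm{old}}}\!\big[ g'(r_\theta)\, \nabla_\theta r_\theta \, A \big]
  = \mathbb{E}_{\pi_{\mathrm{old}}}\!\big[ \underbrace{g'(r_\theta)\, r_\theta}_{w(r_\theta)} \,
      \nabla_\theta \log \pi_\theta(a \mid s)\, A \big],
\qquad \text{using } \nabla_\theta r_\theta = r_\theta \, \nabla_\theta \log \pi_\theta(a \mid s).
```

Under this identity, g(r) = r gives w(r) = r (the usual importance-weighted surrogate), and g(r) = log r gives w(r) = 1 (a plain score-function gradient). These two choices only illustrate the identity; the specific gates used by RGPO are not given in the abstract.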
【4】Regret Tail Characterization of Optimal Bandit Algorithms with Generic Rewards
Link: https://arxiv.org/abs/2604.14876
Authors: Subhodip Panda, Shubhada Agrawal
Abstract: We study the tail behavior of regret in stochastic multi-armed bandits for algorithms that are asymptotically optimal in expectation. While minimizing expected regret is the classical objective, recent work shows that even such algorithms can exhibit heavy regret tails, incurring large regret with non-negligible probability. Existing sharp characterizations of regret tails are largely restricted to parametric settings, such as single-parameter exponential families. In this work, we extend the $\mathrm{KL}_{\inf}$-UCB algorithm to a broad nonparametric class of reward distributions satisfying mild assumptions, and establish its asymptotic optimality in expectation. We then analyze the tail behavior of its regret and derive a novel upper bound on the regret tail probability. As special cases, our results recover regret-tail guarantees for both bounded-support and heavy-tailed (moment-bounded) bandit models. Moreover, for the special case of finitely-supported reward distributions, our upper bound matches the known lower bound exactly. Our results thus provide a unified and tight characterization of regret tails for asymptotically optimal KL-based UCB algorithms, going beyond parametric models.
【5】Wasserstein Formulation of Reinforcement Learning. An Optimal Transport Perspective on Policy Optimization
Link: https://arxiv.org/abs/2604.14765
Authors: Mathias Dus
Abstract: We present a geometric framework for Reinforcement Learning (RL) that views policies as maps into the Wasserstein space of action probabilities. First, we define a Riemannian structure induced by stationary distributions, proving its existence in a general context. We then define the tangent space of policies and characterize the geodesics, specifically addressing the measurability of vector fields mapped from the state space to the tangent space of probability measures over the action space. Next, we formulate a general RL optimization problem and construct a gradient flow using Otto's calculus. We compute the gradient and the Hessian of the energy, providing a formal second-order analysis. Finally, we illustrate the method with numerical examples for low-dimensional problems, computing the gradient directly from our theoretical formalism. For high-dimensional problems, we parameterize the policy using a neural network and optimize it based on an ergodic approximation of the cost.
【6】Mean Flow Policy Optimization
Link: https://arxiv.org/abs/2604.14698
Authors: Xiaoyi Dong, Xi Sheryl Zhang, Jian Cheng
Abstract: Diffusion models have recently emerged as expressive policy representations for online reinforcement learning (RL). However, their iterative generative processes introduce substantial training and inference overhead. To overcome this limitation, we propose to represent policies using MeanFlow models, a class of few-step flow-based generative models, to improve training and inference efficiency over diffusion-based RL approaches. To promote exploration, we optimize MeanFlow policies under the maximum entropy RL framework via soft policy iteration, and address two key challenges specific to MeanFlow policies: action likelihood evaluation and soft policy improvement. Experiments on MuJoCo and DeepMind Control Suite benchmarks demonstrate that our method, Mean Flow Policy Optimization (MFPO), achieves performance comparable to or exceeding current diffusion-based baselines while considerably reducing training and inference time. Our code is available at https://github.com/MFPolicy/MFPO.
【7】Zeroth-Order Optimization at the Edge of Stability
Link: https://arxiv.org/abs/2604.14669
Authors: Minhak Song, Liang Zhang, Bingcong Li, Niao He, Michael Muehlebach, Sewoong Oh
Comments: 38 pages
Abstract: Zeroth-order (ZO) methods are widely used when gradients are unavailable or prohibitively expensive, including black-box learning and memory-efficient fine-tuning of large models, yet their optimization dynamics in deep learning remain underexplored. In this work, we provide an explicit step size condition that exactly captures the (mean-square) linear stability of a family of ZO methods based on the standard two-point estimator. Our characterization reveals a sharp contrast with first-order (FO) methods: whereas FO stability is governed solely by the largest Hessian eigenvalue, mean-square stability of ZO methods depends on the entire Hessian spectrum. Since computing the full Hessian spectrum is infeasible in practical neural network training, we further derive tractable stability bounds that depend only on the largest eigenvalue and the Hessian trace. Empirically, we find that full-batch ZO methods operate at the edge of stability: ZO-GD, ZO-GDM, and ZO-Adam consistently stabilize near the predicted stability boundary across a range of deep learning training problems. Our results highlight an implicit regularization effect specific to ZO methods, where large step sizes primarily regularize the Hessian trace, whereas in FO methods they regularize the top eigenvalue.
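For readers unfamiliar with the standard two-point estimator the abstract refers to, here is a minimal sketch of it and of plain ZO-GD on a toy quadratic. The Gaussian direction sampling, the diagonal Hessian, the step size, and all function names are assumptions chosen for illustration; the paper's exact setup and stability condition are not reproduced here.

```python
import numpy as np

def zo_two_point_grad(f, x, mu, rng):
    """Two-point zeroth-order gradient estimate along a random Gaussian direction."""
    u = rng.standard_normal(x.shape)
    return (f(x + mu * u) - f(x - mu * u)) / (2.0 * mu) * u

def zo_gd(f, x0, step, mu=1e-3, iters=2000, seed=0):
    """Plain ZO gradient descent: only function evaluations, no gradients."""
    rng = np.random.default_rng(seed)
    x = x0.copy()
    for _ in range(iters):
        x -= step * zo_two_point_grad(f, x, mu, rng)
    return x

# Toy quadratic f(x) = 0.5 * sum(h_i * x_i^2) with a diagonal Hessian (hypothetical).
h = np.array([1.0, 0.5, 0.1])
f = lambda x: 0.5 * float(np.sum(h * x * x))

x0 = np.ones(3)
x_small = zo_gd(f, x0, step=0.05)   # a deliberately small, stable step size
print(f(x0), f(x_small))
```

Because the update direction is random, every Hessian eigen-direction feeds noise into every coordinate, which gives some intuition for why the abstract says ZO stability depends on the whole spectrum rather than on the top eigenvalue alone.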
【8】Amortized Optimal Transport from Sliced Potentials
Link: https://arxiv.org/abs/2604.15114
Authors: Minh-Phuc Truong, Khai Nguyen
Comments: 26 pages, 11 figures, 10 tables
Abstract: We propose a novel amortized optimization method for predicting optimal transport (OT) plans across multiple pairs of measures by leveraging Kantorovich potentials derived from sliced OT. We introduce two amortization strategies: regression-based amortization (RA-OT) and objective-based amortization (OA-OT). In RA-OT, we formulate a functional regression model that treats Kantorovich potentials from the original OT problem as responses and those obtained from sliced OT as predictors, and estimate these models via least-squares methods. In OA-OT, we estimate the parameters of the functional model by optimizing the Kantorovich dual objective. In both approaches, the predicted OT plan is subsequently recovered from the estimated potentials. As amortized OT methods, both RA-OT and OA-OT enable efficient solutions to repeated OT problems across different measure pairs by reusing information learned from prior instances to rapidly approximate new solutions. Moreover, by exploiting the structure provided by sliced OT, the proposed models are more parsimonious, independent of specific structures of the measures, such as the number of atoms in the discrete case, while achieving high accuracy. We demonstrate the effectiveness of our approaches on tasks including MNIST digit transport, color transfer, supply-demand transportation on spherical data, and mini-batch OT conditional flow matching.
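As background on the sliced OT building block, the sketch below estimates the squared sliced 2-Wasserstein distance between two point clouds by projecting onto random directions and sorting. It assumes equal-size, uniformly weighted clouds; the function name and the number of projections are illustrative. It does not compute Kantorovich potentials or perform the amortization that RA-OT and OA-OT add on top.

```python
import numpy as np

def sliced_wasserstein2(x, y, n_proj=128, seed=0):
    """Monte Carlo estimate of the squared sliced 2-Wasserstein distance
    between two equal-size, uniformly weighted point clouds of shape (n, d)."""
    rng = np.random.default_rng(seed)
    d = x.shape[1]
    theta = rng.standard_normal((n_proj, d))
    theta /= np.linalg.norm(theta, axis=1, keepdims=True)   # unit directions
    px = np.sort(x @ theta.T, axis=0)    # 1D OT is solved by sorting
    py = np.sort(y @ theta.T, axis=0)
    return float(np.mean((px - py) ** 2))

rng = np.random.default_rng(1)
x = rng.standard_normal((50, 2))
shifted = x + np.array([3.0, 0.0])
print(sliced_wasserstein2(x, x), sliced_wasserstein2(x, shifted))
```

Each projection costs only a sort, which is what makes sliced quantities cheap predictors for the full OT problem.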
Prediction | Estimation (5 papers)
【1】Assessing the Performance-Efficiency Trade-off of Foundation Models in Probabilistic Electricity Price Forecasting
Link: https://arxiv.org/abs/2604.14739
Authors: Jan Niklas Lettner, Hadeer El Ashhab, Veit Hagenmeyer, Benjamin Schäfer
Comments: Submitted to the 7th International Workshop on Energy Data and Analytics (EDA), held in conjunction with ACM e-Energy 2026
Abstract: Large-scale renewable energy deployment introduces pronounced volatility into the electricity system, turning grid operation into a complex stochastic optimization problem. Accurate electricity price forecasting (EPF) is essential not only to support operational decisions, such as optimal bidding strategies and balancing power preparation, but also to reduce economic risk and improve market efficiency. Probabilistic forecasts are particularly valuable because they quantify uncertainty stemming from renewable intermittency, market coupling, and regulatory changes, enabling market participants to make informed decisions that minimize losses and optimize expected revenues. However, it remains an open question which models to employ to produce accurate forecasts. Should these be task-specific machine learning (ML) models or Time Series Foundation Models (TSFMs)? In this work, we compare four models for day-ahead probabilistic EPF (PEPF) in European bidding zones: a deterministic NHITS backbone with Quantile-Regression Averaging (NHITS+QRA), a conditional Normalizing-Flow forecaster (NF), and two TSFMs, namely Moirai and ChronosX. On the one hand, we find that TSFMs outperform task-specific deep learning models trained from scratch in terms of CRPS, Energy Score, and predictive interval calibration across market conditions. On the other hand, we find that well-configured task-specific models, particularly NHITS combined with QRA, achieve performance very close to TSFMs, and in some scenarios, such as when supplied with additional informative feature groups or adapted via few-shot learning from other European markets, they can even surpass TSFMs. Overall, our findings show that while TSFMs offer expressive modeling capabilities, conventional models remain highly competitive, emphasizing the need to weigh computational expense against marginal performance improvements in PEPF.
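Since the comparison is scored with CRPS, a short sketch of how CRPS can be approximated from quantile forecasts may help. It uses the standard identity that CRPS equals twice the pinball loss integrated over quantile levels, approximated here on a uniform grid. The function names and toy numbers are illustrative assumptions; the paper's evaluation code is not shown in the abstract.

```python
import numpy as np

def pinball_loss(y, q_pred, tau):
    """Average pinball (quantile) loss of predictions q_pred at quantile level tau."""
    diff = y - q_pred
    return float(np.mean(np.maximum(tau * diff, (tau - 1.0) * diff)))

def crps_from_quantiles(y, q_preds, taus):
    """Approximate CRPS as twice the mean pinball loss over a uniform grid of levels."""
    losses = [pinball_loss(y, q, t) for q, t in zip(q_preds, taus)]
    return 2.0 * float(np.mean(losses))

# Toy example: one observed price and three predicted quantiles (hypothetical values).
y = np.array([0.0])
taus = [0.25, 0.5, 0.75]
q_preds = [np.array([-1.0]), np.array([0.0]), np.array([1.0])]
print(crps_from_quantiles(y, q_preds, taus))
```

A finer quantile grid gives a closer approximation; with only a few levels the score mainly reflects the spread of the central quantiles.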
【2】The Acoustic Camouflage Phenomenon: Re-evaluating Speech Features for Financial Risk Prediction
Link: https://arxiv.org/abs/2604.14619
Authors: Dhruvin Dungrani, Disha Dungrani
Abstract: In computational paralinguistics, detecting cognitive load and deception from speech signals is a heavily researched domain. Recent efforts have attempted to apply these acoustic frameworks to corporate earnings calls to predict catastrophic stock market volatility. In this study, we empirically investigate the limits of acoustic feature extraction (pitch, jitter, and hesitation) when applied to highly trained speakers in in-the-wild teleconference environments. Utilizing a two-stream late-fusion architecture, we contrast an acoustic-based stream with a baseline Natural Language Processing (NLP) stream. The isolated NLP model achieved a recall of 66.25% for tail-risk downside events. Surprisingly, integrating acoustic features via late fusion significantly degraded performance, reducing recall to 47.08%. We identify this degradation as Acoustic Camouflage, where media-trained vocal regulation introduces contradictory noise that disrupts multimodal meta-learners. We present these findings as a boundary condition for speech processing applications in high-stakes financial forecasting.
【3】Physics-Informed Machine Learning for Pouch Cell Temperature Estimation
Link: https://arxiv.org/abs/2604.14566
Authors: Zheng Liu
Comments: 4 pages, 2 figures
Abstract: Accurate temperature estimation of pouch cells with indirect liquid cooling is essential for optimizing battery thermal management systems for transportation electrification. However, it is challenging due to the computational expense of finite element simulations and the limitations of data-driven models. This paper presents a physics-informed machine learning (PIML) framework for the efficient and reliable estimation of steady-state temperature profiles. The PIML approach integrates the governing heat transfer equations directly into the neural network's loss function, enabling high-fidelity predictions with significantly faster convergence than purely data-driven methods. The framework is evaluated on a dataset of varying cooling channel geometries. Results demonstrate that the PIML model converges more rapidly and achieves markedly higher accuracy, with a 49.1% reduction in mean squared error over the data-driven model. Validation against independent test cases further confirms its superior performance, particularly in regions away from the cooling channels. These findings underscore the potential of PIML for surrogate modeling and design optimization in battery systems.
【4】CSRA: Controlled Spectral Residual Augmentation for Robust Sepsis Prediction
Link: https://arxiv.org/abs/2604.14532
Authors: Honglin Guo, Rihao Chang, He Jiao, Weizhi Nie, Zhongheng Zhang, Yuehao Shen
Abstract: Accurate prediction of future risk and disease progression in sepsis is clinically important for early warning and timely intervention in intensive care. However, short-window sepsis prediction remains challenging, because shorter observation windows provide limited historical evidence, whereas longer prediction horizons reduce the number of patient trajectories with valid future supervision. To address this problem, we propose CSRA, a Controlled Spectral Residual Augmentation framework for short-window multi-system ICU time series. CSRA first groups variables by clinical systems and extracts system-level and global representations. It then performs input-adaptive residual perturbation in the spectral domain to generate structured and clinically plausible trajectory variations. To improve augmentation stability and controllability, CSRA is trained end-to-end with the downstream predictor under a unified objective, together with anchor consistency loss and controller regularization. Experiments on a MIMIC-IV sepsis cohort across multiple downstream models show that CSRA is consistently competitive and often superior, reducing regression error by 10.2% in MSE and 3.7% in MAE over the non-augmentation baseline, while also yielding consistent gains on classification. CSRA further maintains more favorable performance under shorter observation windows, longer prediction horizons, and smaller training data scales, while also remaining effective on an external clinical dataset (ZiGongICUinfection), indicating stronger robustness and generalizability in clinically constrained settings.
【5】Differentially Private Conformal Prediction
Link: https://arxiv.org/abs/2604.14621
Authors: Jiamei Wu, Ce Zhang, Zhipeng Cai, Jingsen Kong, Bei Jiang, Linglong Kong, Lingchen Kong
Abstract: Conformal prediction (CP) has attracted broad attention as a simple and flexible framework for uncertainty quantification through prediction sets. In this work, we study how to deploy CP under differential privacy (DP) in a statistically efficient manner. We first introduce differential CP, a non-splitting conformal procedure that avoids the efficiency loss caused by data splitting and serves as a bridge between oracle CP and private conformal inference. By exploiting the stability properties of DP mechanisms, differential CP establishes a direct connection to oracle CP and inherits corresponding validity behavior. Building on this idea, we develop Differentially Private Conformal Prediction (DPCP), a fully private procedure that combines DP model training with a private quantile mechanism for calibration. We establish the end-to-end privacy guarantee of DPCP and investigate its coverage properties under additional regularity conditions. We further study the efficiency of both differential CP and DPCP under empirical risk minimization and general regression models, showing that DPCP can produce tighter prediction sets than existing private split conformal approaches under the same privacy budget. Numerical experiments on synthetic and real datasets demonstrate the practical effectiveness of the proposed methods.
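For context, here is a sketch of the calibration step in ordinary (non-private) split conformal prediction, the baseline that private split conformal methods build on. It is not DPCP: there is no DP training and no private quantile mechanism, and the function name and toy scores are illustrative assumptions.

```python
import numpy as np

def split_conformal_quantile(cal_scores, alpha):
    """Conformal quantile of calibration scores with the finite-sample correction.

    Returns the ceil((n + 1) * (1 - alpha))-th smallest score, or infinity when
    the calibration set is too small to support the requested coverage.
    """
    n = len(cal_scores)
    k = int(np.ceil((n + 1) * (1 - alpha)))
    if k > n:
        return np.inf
    return float(np.sort(cal_scores)[k - 1])

# Toy calibration scores, e.g. absolute residuals |y - prediction| (hypothetical).
cal_scores = np.arange(1.0, 10.0)          # 1, 2, ..., 9
q_hat = split_conformal_quantile(cal_scores, alpha=0.5)
print(q_hat)   # the interval for a new point would be [prediction - q_hat, prediction + q_hat]
```

A private variant would replace the exact order statistic with a differentially private quantile release, and the cost of holding out calibration data is the efficiency loss the abstract says the non-splitting procedure avoids.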
Other Neural Networks | Deep Learning | Models | Modeling (26 papers)
【1】Benchmarking Optimizers for MLPs in Tabular Deep Learning
Link: https://arxiv.org/abs/2604.15297
Authors: Yury Gorishniy, Ivan Rubachev, Dmitrii Feoktistov, Artem Babenko
Comments: Code: https://github.com/yandex-research/tabular-dl-optimizers
Abstract: MLP is a heavily used backbone in modern deep learning (DL) architectures for supervised learning on tabular data, and AdamW is the go-to optimizer used to train tabular DL models. Unlike architecture design, however, the choice of optimizer for tabular DL has not been examined systematically, despite new optimizers showing promise in other domains. To fill this gap, we benchmark \Noptimizers optimizers on \Ndatasets tabular datasets for training MLP-based models in the standard supervised learning setting under a shared experiment protocol. Our main finding is that the Muon optimizer consistently outperforms AdamW, and thus should be considered a strong and practical choice for practitioners and researchers, if the associated training efficiency overhead is affordable. Additionally, we find exponential moving average of model weights to be a simple yet effective technique that improves AdamW on vanilla MLPs, though its effect is less consistent across model variants.
【2】A Nonlinear Separation Principle: Applications to Neural Networks, Control and Learning
Link: https://arxiv.org/abs/2604.15238
Authors: Anand Gokhale, Anton V. Proskurnikov, Yu Kawano, Francesco Bullo
Comments: arXiv admin note: text overlap with arXiv:2604.00119
Abstract: This paper investigates continuous-time and discrete-time firing-rate and Hopfield recurrent neural networks (RNNs), with applications in nonlinear control design and implicit deep learning. First, we introduce a nonlinear separation principle that guarantees global exponential stability for the interconnection of a contracting state-feedback controller and a contracting observer, alongside parametric extensions for robustness and equilibrium tracking. Second, we derive sharp linear matrix inequality (LMI) conditions that guarantee the contractivity of both firing-rate and Hopfield neural network architectures. We establish structural relationships among these certificates, demonstrating that continuous-time models with monotone non-decreasing activations maximize the admissible weight space, and extend these stability guarantees to interconnected systems and Graph RNNs. Third, we combine our separation principle and LMI framework to solve the output reference tracking problem for RNN-modeled plants. We provide LMI synthesis methods for feedback controllers and observers, and rigorously design a low-gain integral controller to eliminate steady-state error. Finally, we derive an exact, unconstrained algebraic parameterization of our contraction LMIs to design highly expressive implicit neural networks, achieving competitive accuracy and parameter efficiency on standard image classification benchmarks.
【3】Metric-agnostic Learning-to-Rank via Boosting and Rank Approximation
Link: https://arxiv.org/abs/2604.15101
Authors: Camilo Gomez, Pengyang Wang, Yanjie Fu
Comments: Published in IEEE ICDM 2023. 6 pages
Abstract: Learning-to-Rank (LTR) is a supervised machine learning approach that constructs models specifically designed to order a set of items or documents based on their relevance or importance to a given query or context. Despite significant success in real-world information retrieval systems, current LTR methods rely on one prefix ranking metric (e.g., Normalized Discounted Cumulative Gain (NDCG) or Mean Average Precision (MAP)) for optimizing the ranking objective function. Such a metric-dependent setting limits LTR methods in two ways: (1) non-differentiability: directly optimizing ranking functions over a given ranking metric is inherently non-smooth, making the training process unstable and inefficient; (2) limited ranking utility: optimizing over a single metric makes it difficult to generalize well to other ranking metrics of interest. To address these issues, we propose a novel listwise LTR framework for efficient and generalizable ranking. Specifically, we propose a new differentiable ranking loss that combines a smooth approximation to the ranking operator with the average mean square loss per query. Then, we adapt gradient-boosting machines to minimize our proposed loss with respect to each list, a novel contribution. Finally, extensive experimental results confirm that our method outperforms the current state-of-the-art in information retrieval measures with similar efficiency.
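One common way to build a smooth approximation to the ranking operator is to replace each pairwise comparison with a sigmoid. The sketch below pairs that with a per-query mean squared error against the hard ranks of the relevance labels. The sigmoid form, the temperature `tau`, and the function names are assumptions for illustration; the abstract does not specify the paper's exact approximation.

```python
import numpy as np

def soft_rank(scores, tau=0.1):
    """Smooth rank (1 = highest score); approaches the hard rank as tau -> 0."""
    diff = (scores[None, :] - scores[:, None]) / tau    # diff[i, j] = (s_j - s_i) / tau
    sig = 1.0 / (1.0 + np.exp(-diff))
    return 0.5 + sig.sum(axis=1)    # the j == i term contributes sigmoid(0) = 0.5

def hard_rank(values):
    """Hard rank with 1 = largest value (ties broken by position)."""
    order = np.argsort(-values, kind="stable")
    ranks = np.empty(len(values))
    ranks[order] = np.arange(1, len(values) + 1)
    return ranks

def rank_mse_loss(scores, relevance, tau=0.1):
    """Per-query MSE between soft ranks of scores and hard ranks of the labels."""
    return float(np.mean((soft_rank(scores, tau) - hard_rank(relevance)) ** 2))

scores = np.array([3.0, 1.0, 2.0])
relevance = np.array([2.0, 0.0, 1.0])
print(soft_rank(scores, tau=0.01), rank_mse_loss(scores, relevance, tau=0.01))
```

Because the soft rank is differentiable in the scores, a loss of this kind yields gradients that a gradient-boosting learner can fit, without committing to NDCG or MAP specifically.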
【4】When Fairness Metrics Disagree: Evaluating the Reliability of Demographic Fairness Assessment in Machine Learning
Link: https://arxiv.org/abs/2604.15038
Authors: Khalid Adnan Alsayed
Comments: 15 pages, 4 figures, 5 tables
Abstract: The evaluation of fairness in machine learning systems has become a central concern in high-stakes applications, including biometric recognition, healthcare decision-making, and automated risk assessment. Existing approaches typically rely on a small number of fairness metrics to assess model behaviour across group partitions, implicitly assuming that these metrics provide consistent and reliable conclusions. However, different fairness metrics capture distinct statistical properties of model performance and may therefore produce conflicting assessments when applied to the same system. In this work, we investigate the consistency of fairness evaluation by conducting a systematic multi-metric analysis of demographic bias in machine learning models. Using face recognition as a controlled experimental setting, we evaluate model performance across multiple group partitions under a range of commonly used fairness metrics, including error-rate disparities and performance-based measures. Our results demonstrate that fairness assessments can vary significantly depending on the choice of metrics, leading to contradictory conclusions regarding model bias. To quantify this phenomenon, we introduce the Fairness Disagreement Index (FDI), a measure designed to capture the degree of inconsistency across fairness metrics. We further show that disagreement remains high across thresholds and model configurations. These findings highlight a critical limitation in current fairness evaluation practices and suggest that single-metric reporting is insufficient for reliable bias assessment.
【5】MLDAS: Machine Learning Dynamic Algorithm Selection for Software-Defined Networking Security
Link: https://arxiv.org/abs/2604.14957
Authors: Pablo Benlloch, Oscar Romero, Antonio Leon, Jaime Lloret
Comments: 22 pages, 15 figures, 12 tables
Abstract: Network security is a critical concern in today's digital landscape, with users demanding secure browsing experiences and protection of their personal data. This study explores the dynamic integration of Machine Learning (ML) algorithms with Software-Defined Networking (SDN) controllers to enhance network security through adaptive decision mechanisms. The proposed approach enables the system to dynamically choose the most suitable ML algorithm based on the characteristics of the observed network traffic. This work examines the role of Intrusion Detection Systems (IDS) as a fundamental component of secure communication networks and discusses the limitations of SDN-based attack detection mechanisms. The proposed framework uses adaptive model selection to maintain reliable intrusion detection under varying network conditions. The study highlights the importance of analyzing traffic-type-based metrics to define effective classification rules and enhance the performance of ML models. Additionally, it addresses the risks of overfitting and underfitting, underscoring the critical role of hyperparameter tuning in optimizing model accuracy and generalization. The central contribution of this work is an automated mechanism that adaptively selects the most suitable ML algorithm according to real-time network conditions, prioritizing detection robustness and operational feasibility within SDN environments.
【6】SOLIS: Physics-Informed Learning of Interpretable Neural Surrogates for Nonlinear Systems
Link: https://arxiv.org/abs/2604.14879
Authors: Murat Furkan Mansur, Tufan Kumbasar
Comments: In the International Joint Conference on Neural Networks, 2026
Abstract: Nonlinear system identification must balance physical interpretability with model flexibility. Classical methods yield structured, control-relevant models but rely on rigid parametric forms that often miss complex nonlinearities, whereas Neural ODEs are expressive yet largely black-box. Physics-Informed Neural Networks (PINNs) sit between these extremes, but inverse PINNs typically assume a known governing equation with fixed coefficients, leading to identifiability failures when the true dynamics are unknown or state-dependent. We propose SOLIS, which models unknown dynamics via a state-conditioned second-order surrogate model and recasts identification as learning a Quasi-Linear Parameter-Varying (Quasi-LPV) representation, recovering interpretable natural frequency, damping, and gain without presupposing a global equation. SOLIS decouples trajectory reconstruction from parameter estimation and stabilizes training with a cyclic curriculum and Local Physics Hints, windowed ridge-regression anchors that mitigate optimization collapse. Experiments on benchmarks show accurate parameter-manifold recovery and coherent physical rollouts from sparse data, including regimes where standard inverse methods fail.
【7】Constraint-based Pre-training: From Structured Constraints to Scalable Model Initialization
Link: https://arxiv.org/abs/2604.14769
Authors: Fu Feng, Yucheng Xie, Ruixiao Shi, Jing Wang, Xin Geng
Abstract: The pre-training and fine-tuning paradigm has become the dominant approach for model adaptation. However, conventional pre-training typically yields models at a fixed scale, whereas practical deployment often requires models of varying sizes, exposing its limitations when target model scales differ from those used during pre-training. To address this, we propose an innovative constraint-based pre-training paradigm that imposes structured constraints during pre-training to disentangle size-agnostic knowledge into reusable weight templates, while assigning size-specific adaptation to lightweight weight scalers, thereby reformulating variable-sized model initialization as a multi-task adaptation problem. Within this paradigm, we further introduce WeiT, which employs Kronecker-based constraints to regularize the pre-training process. Specifically, model parameters are represented as compositions of weight templates via concatenation and weighted aggregation, with adaptive connections governed by lightweight weight scalers whose parameters are learned from limited data. This design enables flexible and efficient construction of model weights across diverse downstream scales. Extensive experiments demonstrate the efficiency and effectiveness of WeiT, achieving state-of-the-art performance in initializing models with varying depths and widths across a broad range of perception and embodied learning tasks, including Image Classification, Image Generation, and Embodied Control. Moreover, its effectiveness generalizes to both Transformer-based and Convolution-based architectures, consistently enabling faster convergence and improved performance even under full training.
【8】HAMSA: Scanning-Free Vision State Space Models via SpectralPulseNet
Link: https://arxiv.org/abs/2604.14724
Authors: Badri N. Patro, Vijay S. Agneeswaran
Abstract: Vision State Space Models (SSMs) like Vim, VMamba, and SiMBA rely on complex scanning strategies to adapt sequential SSMs to process 2D images, introducing computational overhead and architectural complexity. We propose HAMSA, a scanning-free SSM operating directly in the spectral domain. HAMSA introduces three key innovations: (1) simplified kernel parameterization: a single Gaussian-initialized complex kernel replacing traditional (A, B, C) matrices, eliminating discretization instabilities; (2) SpectralPulseNet (SPN), an input-dependent frequency gating mechanism enabling adaptive spectral modulation; and (3) Spectral Adaptive Gating Unit (SAGU), magnitude-based gating for stable gradient flow in the frequency domain. By leveraging FFT-based convolution, HAMSA eliminates sequential scanning while achieving O(L log L) complexity with superior simplicity and efficiency. On ImageNet-1K, HAMSA reaches 85.7% top-1 accuracy (state-of-the-art among SSMs), with 2.2x faster inference than transformers (4.2ms vs 9.2ms for DeiT-S) and 1.4-1.9x speedup over scanning-based SSMs, while using less memory (2.1GB vs 3.2-4.5GB) and energy (12.5J vs 18-25J). HAMSA demonstrates strong generalization across transfer learning and dense prediction tasks.
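The O(L log L) cost cited in the abstract comes from computing a long convolution with FFTs instead of a sequential scan. The sketch below shows that idea for a real-valued 1D sequence, assuming the kernel is no longer than the input; HAMSA's complex kernel, 2D handling, and gating units are not reproduced, and the function name is illustrative.

```python
import numpy as np

def fft_causal_conv(x, k):
    """Causal 1D convolution y[t] = sum_{s <= t} k[s] * x[t - s] in O(L log L).

    Assumes len(k) <= len(x). Zero-padding to 2L prevents circular wrap-around.
    """
    L = len(x)
    n = 2 * L
    y = np.fft.irfft(np.fft.rfft(x, n) * np.fft.rfft(k, n), n)
    return y[:L]

rng = np.random.default_rng(0)
x = rng.standard_normal(64)
k = rng.standard_normal(64)
y_fft = fft_causal_conv(x, k)
y_direct = np.convolve(x, k)[:64]      # O(L^2) reference
print(np.max(np.abs(y_fft - y_direct)))
```

The whole sequence is processed in one pass of elementwise products in the frequency domain, which is why no scan order over image patches is needed.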
【9】AIPC: Agent-Based Automation for AI Model Deployment with Qualcomm AI Runtime
标题:AIPC:通过高通AI Runtime实现基于智能体的AI模型部署自动化
链接:https://arxiv.org/abs/2604.14661
作者:Jianhao Su,Zhanwei Wu,ShengTing Huang,Weidong Feng
备注:19 pages, 1 figure, technical report
摘要:Edge AI model deployment is a multi-stage engineering process involving model conversion, operator compatibility handling, quantization calibration, runtime integration, and accuracy validation. In practice, this workflow is long, failure-prone, and heavily dependent on deployment expertise, particularly when targeting hardware-specific inference runtimes. This technical report presents AIPC (AI Porting Conversion), an AI agent-driven approach for constrained automation of AI model deployment. AIPC decomposes deployment into standardized, verifiable stages and injects deployment-domain knowledge into agent execution through Agent Skills, helper scripts, and a stage-wise validation loop. This design reduces both the expertise barrier and the engineering time required for hardware deployment. Using Qualcomm AI Runtime (QAIRT) as the primary scenario, this report examines automated deployment across representative vision, multimodal, and speech models. In the cases covered here, AIPC can complete deployment from PyTorch to runnable QNN/SNPE inference within 7-20 minutes for structurally regular vision models, with indicative API costs roughly in the range of USD 0.7-10. For more complex models involving less-supported operators, dynamic shapes, or autoregressive decoding structures, fully automated deployment may still require further advances, but AIPC already provides practical support for execution, failure localization, and bounded repair.
【10】Tight Bounds for Learning Polyhedra with a Margin
标题:带间隔多面体学习的紧界
链接:https://arxiv.org/abs/2604.14614
作者:Shyamal Patel,Santosh Vempala
摘要:We give an algorithm for PAC learning intersections of $k$ halfspaces with a $ρ$ margin to within error $\varepsilon$ that runs in time $\textsf{poly}(k, \varepsilon^{-1}, ρ^{-1}) \cdot \exp \left(O(\sqrt{n \log(1/ρ) \log k})\right)$. Notably, this improves on prior work which had an exponential dependence on either $k$ or $ρ^{-1}$ and matches known cryptographic and Statistical Query lower bounds up to the logarithmic factors in $k$ and $ρ$ in the exponent. Our learning algorithm extends to the more general setting when we are only promised that most points have distance at least $ρ$ from the boundary of the polyhedron, making it applicable to continuous distributions as well.
【11】DEEP-GAP: Deep-learning Evaluation of Execution Parallelism in GPU Architectural Performance
标题:DEEP-GAP:GPU架构性能中执行并行性的深度学习评估
链接:https://arxiv.org/abs/2604.14552
作者:Kathiravan Palaniappan
备注:16 pages, 42 figures. Evaluation of inference performance on NVIDIA T4 and L4 GPUs across precision modes (FP32, FP16, INT8)
摘要:Modern datacenters increasingly rely on low-power, single-slot inference accelerators to balance performance, energy efficiency, and rack density constraints. The NVIDIA T4 GPU has become widely deployed due to strong performance per watt and mature software support. Its successor, the NVIDIA L4 GPU, introduces improvements in Tensor Core throughput, cache capacity, memory bandwidth, and parallel execution capability. However, limited empirical evidence quantifies the practical inference performance gap between these two generations under controlled and reproducible conditions. This work introduces DEEP-GAP, a systematic evaluation extending the GDEV-AI methodology to GPU inference. Using identical configurations and workloads, we evaluate ResNet18, ResNet50, and ResNet101 across FP32, FP16, and INT8 precision modes using PyTorch and TensorRT. Results show that reduced precision significantly improves performance, with INT8 achieving up to 58x throughput improvement over CPU baselines. L4 achieves up to 4.4x higher throughput than T4 while reaching peak efficiency at smaller batch sizes between 16 and 32, improving latency-throughput tradeoffs for latency-sensitive workloads. T4 remains competitive for large batch workloads where cost or power efficiency is important. DEEP-GAP provides practical guidance for selecting precision modes, batch sizes, and GPU architectures for modern inference deployments.
【12】CI-CBM: Class-Incremental Concept Bottleneck Model for Interpretable Continual Learning
标题:CI-CBM:可解释持续学习的类增量概念瓶颈模型
链接:https://arxiv.org/abs/2604.14519
作者:Amirhosein Javadi,Tuomas Oikarinen,Tara Javidi,Tsui-Wei Weng
备注:31 pages, 6 figures. Published in Transactions on Machine Learning Research (TMLR), 04/2026
摘要:Catastrophic forgetting remains a fundamental challenge in continual learning, in which models often forget previous knowledge when fine-tuned on a new task. This issue is especially pronounced in class-incremental learning (CIL), which is the most challenging setting in continual learning. Existing methods to address catastrophic forgetting often sacrifice either model interpretability or accuracy. To address this challenge, we introduce the Class-Incremental Concept Bottleneck Model (CI-CBM), which leverages effective techniques, including concept regularization and pseudo-concept generation, to maintain interpretable decision processes throughout incremental learning phases. Through extensive evaluation on seven datasets, CI-CBM achieves comparable performance to black-box models and outperforms previous interpretable approaches in CIL, with an average 36% accuracy gain. CI-CBM provides interpretable decisions on individual inputs and understandable global decision rules, as shown in our experiments, thereby demonstrating that human-understandable concepts can be maintained during incremental learning without compromising model performance. Our approach is effective in both pretrained and non-pretrained scenarios; in the latter, the backbone is trained from scratch during the first learning phase. Code is publicly available at github.com/importAmir/CI-CBM.
【13】Improving Machine Learning Performance with Synthetic Augmentation
标题:通过合成增强提高机器学习性能
链接:https://arxiv.org/abs/2604.14498
作者:Mel Sohm,Charles Dezons,Sami Sellami,Oscar Ninou,Axel Pincon
摘要:Synthetic augmentation is increasingly used to mitigate data scarcity in financial machine learning, yet its statistical role remains poorly understood. We formalize synthetic augmentation as a modification of the effective training distribution and show that it induces a structural bias-variance trade-off: while additional samples may reduce estimation error, they may also shift the population objective whenever the synthetic distribution deviates from regions relevant under evaluation. To isolate informational gains from mechanical sample-size effects, we introduce a size-matched null augmentation and a finite-sample, non-parametric block permutation test that remains valid under weak temporal dependence. We evaluate this framework in both controlled Markov-switching environments and real financial datasets, including high-frequency option trade data and a daily equity panel. Across generators spanning bootstrap, copula-based models, variational autoencoders, diffusion models, and TimeGAN, we vary augmentation ratio, model capacity, task type, regime rarity, and signal-to-noise. We show that synthetic augmentation is beneficial only in variance-dominant regimes, such as persistent volatility forecasting, while it degrades performance in bias-dominant settings, including near-efficient directional prediction. Rare-regime targeting can improve domain-specific metrics but may conflict with unconditional permutation inference. Our results provide a structural perspective on when synthetic data improves financial learning performance and when it induces persistent distributional distortion.
【14】Quantization of Spiking Neural Networks Beyond Accuracy
标题:超越准确率的脉冲神经网络量化
链接:https://arxiv.org/abs/2604.14487
作者:Evan Gibson Smith,Jacob Whitehill,Fatemeh Ganji
摘要:Quantization is a natural complement to the sparse, event-driven computation of Spiking Neural Networks, reducing memory bandwidth and arithmetic cost for deployment on resource-constrained hardware. However, existing SNN quantization evaluation focuses almost exclusively on accuracy, overlooking whether a quantized network preserves the firing behavior of its full-precision counterpart. We demonstrate that quantization method, clipping range, and bit-width can produce substantially different firing distributions at equivalent accuracy, differences invisible to standard metrics but relevant to deployment, where firing activity governs effective sparsity, state storage, and event-processing load. To capture this gap, we propose Earth Mover's Distance as a diagnostic metric for firing distribution divergence, and apply it systematically across weight and membrane quantization on SEW-ResNet architectures trained on CIFAR-10 and CIFAR-100. We find that uniform quantization induces distributional drift even when accuracy is preserved, while LQ-Net style learned quantization maintains firing behavior close to the full-precision baseline. Our results suggest that behavior preservation should be treated as an evaluation criterion alongside accuracy, and that EMD provides a principled tool for assessing it.
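To make the proposed diagnostic concrete, here is a minimal sketch of Earth Mover's Distance between two firing-rate histograms that share the same equally spaced bins. In one dimension, EMD reduces to the summed absolute difference of the two cumulative distributions. The histogram values and the function name `emd_1d` are illustrative assumptions, not data or code from the paper.

```python
def emd_1d(hist_p, hist_q, bin_width=1.0):
    """1D Earth Mover's Distance between two histograms on identical bins.

    Both histograms are normalized to sum to one, so raw counts are accepted.
    """
    if len(hist_p) != len(hist_q):
        raise ValueError("histograms must use the same bins")
    total_p = sum(hist_p)
    total_q = sum(hist_q)
    if total_p <= 0 or total_q <= 0:
        raise ValueError("histograms must have positive mass")

    cdf_gap = 0.0   # running difference between the two CDFs
    distance = 0.0
    for p, q in zip(hist_p, hist_q):
        cdf_gap += p / total_p - q / total_q
        distance += abs(cdf_gap) * bin_width
    return distance


# Hypothetical per-neuron firing-rate histograms (counts per rate bin).
full_precision = [10, 30, 40, 20]
quantized = [20, 30, 30, 20]
drift = emd_1d(full_precision, quantized)
```

With these made-up counts the two cumulative distributions differ by 0.1 in the first two bins and agree afterwards, so `drift` is 0.2 bin widths. Two networks with identical accuracy can still produce a nonzero value here, which is the gap the abstract argues accuracy alone does not reveal.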
【15】Non-intrusive Learning of Physics-Informed Spatio-temporal Surrogate for Accelerating Design
标题:加速设计的物理信息时空代理的非侵入式学习
链接:https://arxiv.org/abs/2604.14424
作者:Sudeepta Mondal,Soumalya Sarkar
摘要:Most practical engineering design problems involve nonlinear spatio-temporal dynamical systems. Multi-physics simulations are often performed to capture the fine spatio-temporal scales which govern the evolution of these systems. However, these simulations are often high-fidelity in nature, and can be computationally very expensive. Hence, generating data from these expensive simulations becomes a bottleneck in an end-to-end engineering design process. Spatio-temporal surrogate modeling of these dynamical systems has been a popular data-driven solution to tackle this computational bottleneck. This is because accurate machine learning models emulating the dynamical systems can be orders of magnitude faster than the actual simulations. However, one key limitation of purely data-driven approaches is their lack of generalizability to inputs outside the training distribution. In this paper, we propose a physics-informed spatio-temporal surrogate modeling (PISTM) framework constrained by the physics of the underlying dynamical system. The framework leverages state-of-the-art advancements in the field of Koopman autoencoders to learn the underlying spatio-temporal dynamics in a non-intrusive manner, coupled with a spatio-temporal surrogate model which predicts the behavior of the Koopman operator in a specified time window for unknown operating conditions. We evaluate our framework on a prototypical fluid flow problem of interest: two-dimensional incompressible flow around a cylinder.
【16】Modular Continual Learning via Zero-Leakage Reconstruction Routing and Autonomous Task Discovery
标题:通过零泄漏重建路由和自主任务发现的模块化连续学习
链接:https://arxiv.org/abs/2604.14375
作者:Noureddine Kermiche
摘要:Catastrophic forgetting remains a primary hurdle in sequential task learning for artificial neural networks. We propose a silicon-native modular architecture that achieves structural parameter isolation using Task-Specific Experts and a distributed, outlier-based Gatekeeper. Moving beyond traditional sequential consolidation, our framework utilizes a Simultaneous Pipeline where Teacher learning, Student distillation, and Router manifold acquisition occur in parallel while raw data is present in a localized training session. This approach ensures computational efficiency and complies with privacy mandates like GDPR by deleting raw data as soon as a task is learned. We demonstrate that a Tight-Bottleneck Autoencoder (TB-AE) can effectively distinguish semantically crowded manifolds in high-dimensional latent spaces, overcoming the posterior collapse inherent to standard variational methods. By establishing strict topological boundaries, our TB-AE resolves latent space crowding in 4096-D LLM embeddings to provide a robust, unsupervised novelty signal. Furthermore, we validate an Autonomous Retrieval mechanism that confidently identifies returning manifolds, enabling stable lifelong learning without redundant module instantiation. Empirical results demonstrate that our "Live Distillation" approach acts as a natural regularizer, achieving strong retention across computer vision and natural language processing domains without suffering a student fidelity gap.
【17】Quantum-inspired tensor networks in machine learning models
标题:机器学习模型中的量子启发张量网络
链接:https://arxiv.org/abs/2604.14287
作者:Guillermo Valverde,Igor García-Olaizola,Giannicola Scarpa,Alejandro Pozas-Kerstjens
备注:28 pages, 11 figures, article class. The interactive version of the graph can be found at https://github.com/gvalverde21/research-graph-TensorNetworks-MachineLearning
摘要:Tensor networks were developed in the context of many-body physics as compressed representations of multiparticle quantum states. These representations mitigate the exponential complexity of many-body systems by capturing only the most relevant dependencies. Due to the formal similarity between quantum entanglement and statistical correlations, tensor networks have recently been integrated in machine learning, operating both as alternative learning architectures and as decompositions of components of neural networks. The expectation is that the theoretical understanding of tensor networks developed within quantum many-body physics leads to novel methods that offer advantages in terms of computational efficiency, explainability, or privacy. Here we review the use of tensor networks in the context of machine learning, providing a critical assessment of the state of the art, the potential advantages, and the challenges that must be overcome.
【18】GUI-Perturbed: Domain Randomization Reveals Systematic Brittleness in GUI Grounding Models
标题:GUI受干扰:领域随机化揭示了GUI基础模型中的系统脆弱性
链接:https://arxiv.org/abs/2604.14262
作者:Yangyue Wang,Harshvardhan Sikka,Yash Mathur,Tony Zhou,Jinu Nyachhyon,Pranav Guruprasad
备注:26 Pages, 17 Figures, 9 Tables
摘要:GUI grounding models report over 85% accuracy on standard benchmarks, yet drop 27-56 percentage points when instructions require spatial reasoning rather than direct element naming. Current benchmarks miss this because they evaluate each screenshot once with a single fixed instruction. We introduce GUI-Perturbed, a controlled perturbation framework that independently varies visual scenes and instructions to measure grounding robustness. Evaluating three 7B models from the same architecture lineage, we find that relational instructions cause systematic accuracy collapse across all models, a 70% browser zoom produces statistically significant degradation, and rank-8 LoRA fine-tuning with augmented data degrades performance rather than improving it. By perturbing along independent axes, GUI-Perturbed isolates which specific capability axes are affected (spatial reasoning, visual robustness, reasoning calibration), providing a diagnostic signal that aggregate benchmarks cannot. We release the dataset, augmentation pipeline, and a fine-tuned model.
【19】Calibrate-Then-Delegate: Safety Monitoring with Risk and Budget Guarantees via Model Cascades
标题:先校准后委托:通过级联模型进行安全监控并提供风险和预算保证
链接:https://arxiv.org/abs/2604.14251
作者:Edoardo Pona,Milad Kazemi,Mehran Hosseini,Yali Du,David Watson,Osvaldo Simeone,Nicola Paoletti
摘要:Monitoring LLM safety at scale requires balancing cost and accuracy: a cheap latent-space probe can screen every input, but hard cases should be escalated to a more expensive expert. Existing cascades delegate based on probe uncertainty, but uncertainty is a poor proxy for delegation benefit, as it ignores whether the expert would actually correct the error. To address this problem, we introduce Calibrate-Then-Delegate (CTD), a model-cascade approach that provides probabilistic guarantees on the computation cost while enabling instance-level (streaming) decisions. CTD builds on a novel delegation value (DV) probe, a lightweight model operating on the same internal representations as the safety probe that directly predicts the benefit of escalation. To enforce budget constraints, CTD calibrates a threshold on the DV signal using held-out data via multiple hypothesis testing, yielding finite-sample guarantees on the delegation rate. Evaluated on four safety datasets, CTD consistently outperforms uncertainty-based delegation at every budget level, avoids harmful over-delegation, and adapts budget allocation to input difficulty without requiring group labels.
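The abstract does not spell out the calibration procedure, so the following is only a sketch of one standard way to calibrate a threshold with a finite-sample guarantee on the delegation rate: fixed-sequence hypothesis testing with Hoeffding p-values on held-out scores. The function names, the threshold grid, and the synthetic DV scores are all assumptions; CTD's actual test may differ.

```python
import math


def hoeffding_p_value(empirical_rate, budget, n):
    """p-value for the null 'true delegation rate exceeds the budget'."""
    gap = max(budget - empirical_rate, 0.0)
    return math.exp(-2.0 * n * gap * gap)


def calibrate_threshold(dv_scores, budget, delta, grid):
    """Pick the lowest threshold certified to keep delegation within budget.

    Thresholds are tested from most to least conservative (fixed-sequence
    testing), stopping at the first one that cannot be certified. With
    probability at least 1 - delta over the held-out sample, inputs whose
    DV score exceeds the returned threshold make up at most `budget` of
    the stream. Returns None if no threshold can be certified.
    """
    n = len(dv_scores)
    chosen = None
    for lam in sorted(grid, reverse=True):
        rate = sum(1 for s in dv_scores if s > lam) / n
        if hoeffding_p_value(rate, budget, n) > delta:
            break
        chosen = lam
    return chosen


# Synthetic held-out DV scores, evenly spread over [0, 1).
held_out = [i / 1000 for i in range(1000)]
grid = [k / 10 for k in range(1, 10)]
threshold = calibrate_threshold(held_out, budget=0.3, delta=0.05, grid=grid)
```

On this synthetic sample the procedure certifies 0.9 and 0.8 but stops at 0.7, where the empirical delegation rate (0.299) is too close to the 0.3 budget to certify at the 5% level. At deployment, an input would be escalated to the expert only when its DV score exceeds `threshold`.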
【20】Optimistic Policy Learning under Pessimistic Adversaries with Regret and Violation Guarantees
标题:悲观对手下具有遗憾与约束违反保证的乐观策略学习
链接:https://arxiv.org/abs/2604.14243
作者:Sourav Ganguly,Kartik Pandit,Arnob Ghosh
摘要:Real-world decision-making systems operate in environments where state transitions depend not only on the agent's actions, but also on exogenous factors outside its control, such as competing agents, environmental disturbances, or strategic adversaries. Formally, $s_{h+1} = f(s_h, a_h, \bar{a}_h) + ω_h$, where $a_h$ is the agent's action, $\bar{a}_h$ is the adversary/external action, and $ω_h$ is additive noise. Ignoring such factors can yield policies that are optimal in isolation but fail catastrophically in deployment, particularly when safety constraints must be satisfied. Standard Constrained MDP formulations assume the agent is the sole driver of state evolution, an assumption that breaks down in safety-critical settings. Existing robust RL approaches address this via distributional robustness over transition kernels, but do not explicitly model the strategic interaction between agent and exogenous factor, and rely on strong assumptions about divergence from a known nominal model. We model the exogenous factor as an adversarial policy $\bar{π}$ that co-determines state transitions, and ask how an agent can remain both optimal and safe against such an adversary. To the best of our knowledge, this is the first work to study safety-constrained RL under explicit adversarial dynamics. We propose Robust Hallucinated Constrained Upper-Confidence RL (RHC-UCRL), a model-based algorithm that maintains optimism over both agent and adversary policies, explicitly separating epistemic from aleatoric uncertainty. RHC-UCRL achieves sub-linear regret and constraint violation guarantees.
【21】Cloning is as Hard as Learning for Stabilizer States
标题:对稳定子态而言,克隆与学习一样困难
链接:https://arxiv.org/abs/2604.15269
作者:Nikhil Bansal,Matthias C. Caro,Gaurav Mahajan
备注:10 + 33 + 8 pages
摘要:The impossibility of simultaneously cloning non-orthogonal states lies at the foundations of quantum theory. Even when allowing for approximation errors, cloning an arbitrary unknown pure state requires as many initial copies as needed to fully learn the state. Rather than arbitrary unknown states, modern quantum learning theory often considers structured classes of states and exploits such structure to develop learning algorithms that outperform general-state tomography. This raises the question: How do the sample complexities of learning and cloning relate for such structured classes? We answer this question for an important class of states. Namely, for $n$-qubit stabilizer states, we show that the optimal sample complexity of cloning is $Θ(n)$. Thus, also for this structured class of states, cloning is as hard as learning. To prove these results, we use representation-theoretic tools in the recently proposed Abelian State Hidden Subgroup framework and a new structured version of the recently introduced random purification channel to relate stabilizer state cloning to a variant of the sample amplification problem for probability distributions that was recently introduced in classical learning theory. This allows us to obtain our cloning lower bounds by proving new sample amplification lower bounds for classes of distributions with an underlying linear structure. Our results provide a more fine-grained perspective on No-Cloning theorems, opening up connections from foundations to quantum learning theory and quantum cryptography.
【22】Learning to Concatenate Quantum Codes
标题:学习级联量子码
链接:https://arxiv.org/abs/2604.14931
作者:Nico Meyer,Christopher Mutschler,Dominik Seuß,Andreas Maier,Daniel D. Scherer
备注:7 pages, 5 figures, 1 table
摘要:Concatenating quantum error correction codes scales error correction capability by driving logical error rates down double-exponentially across levels. However, the noise structure shifts under concatenation, making it hard to choose an optimal code sequence. We automate this choice by estimating the effective noise channel after each level and selecting the next code accordingly. In particular, we use learning-based methods to tailor small, non-additive encoders when the noise exhibits sufficient structure, then switch to standard codes once the noise is nearly uniform. In simulations, this level-wise adaptation achieves a target logical error rate with far fewer qubits than concatenating stabilizer codes alone, reducing qubit counts by up to two orders of magnitude for strongly structured noise. Therefore, this hybrid, learning-based strategy offers a promising tool for early fault-tolerant quantum computing.
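The double-exponential suppression mentioned at the start of the abstract can be illustrated with the textbook model for a distance-3 code with threshold p_th, where each level roughly squares the error rate: p_L = p_th * (p / p_th)^(2^L). This is a generic approximation under uniform noise, not the adaptive, learned scheme the paper proposes; the numbers and function names below are assumptions.

```python
def logical_error_rate(p_phys, p_threshold, levels):
    """Textbook estimate p_L = p_th * (p / p_th) ** (2 ** L)."""
    return p_threshold * (p_phys / p_threshold) ** (2 ** levels)


def levels_needed(p_phys, p_threshold, target, max_levels=20):
    """Smallest number of concatenation levels reaching the target rate."""
    if p_phys >= p_threshold:
        raise ValueError("physical error rate must be below threshold")
    for level in range(max_levels + 1):
        if logical_error_rate(p_phys, p_threshold, level) <= target:
            return level
    raise ValueError("target not reached within max_levels")


def physical_qubits(block_size, levels):
    """Qubits per logical qubit when one n-qubit code is used at every level."""
    return block_size ** levels


# Assumed numbers: p = 1e-3, threshold 1e-2, target 1e-15, 7-qubit blocks.
levels = levels_needed(1e-3, 1e-2, 1e-15)
qubits = physical_qubits(7, levels)
```

Under these assumed numbers four levels are needed, costing 7^4 = 2401 physical qubits per logical qubit. The qubit count grows exponentially in the number of levels, which is why choosing a better-matched code at each level can save a large factor.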
【23】Doubly Outlier-Robust Online Infinite Hidden Markov Model
标题:双离群鲁棒在线无限隐马尔科夫模型
链接:https://arxiv.org/abs/2604.14322
作者:Horace Yiu,Leandro Sánchez-Betancourt,Álvaro Cartea,Gerardo Duran-Martin
摘要:We derive a robust update rule for the online infinite hidden Markov model (iHMM) for when the streaming data contains outliers and the model is misspecified. Leveraging recent advances in generalised Bayesian inference, we define robustness via the posterior influence function (PIF), and provide conditions under which the online iHMM has bounded PIF. Imposing robustness inevitably induces an adaptation lag for regime switching. Our method, which is called Batched Robust iHMM (BR-iHMM), balances adaptivity and robustness with two additional tunable parameters. Across limit order book data, hourly electricity demand, and a synthetic high-dimensional linear system, BR-iHMM reduces one-step-ahead forecasting error by up to 67% relative to competing online Bayesian methods. Together with theoretical guarantees of bounded PIF, our results highlight the practicality of our approach for both forecasting and interpretable online learning.
【24】A deep learning framework for glomeruli segmentation with boundary attention
标题:具有边界注意力的肾小球分割深度学习框架
链接:https://arxiv.org/abs/2604.14263
作者:Behnaz Elhaminia,Catherine King,Jiaqi Lv,Lorraine Harper,Paul Moss,Owen Cain,Dimitrios Chanouzas,Shan E Ahmed Raza
摘要:Accurate detection and segmentation of glomeruli in kidney tissue are essential for diagnostic applications. Traditional deep learning methods primarily rely on semantic segmentation, which often fails to precisely delineate adjacent glomeruli. To address this challenge, we propose a novel glomerulus detection and segmentation model that emphasises boundary separation. Leveraging pathology foundation models, the proposed U-Net-based architecture incorporates a specialised attention decoder designed to highlight critical regions and improve instance-level segmentation. Experimental evaluations demonstrate that our approach surpasses state-of-the-art methods in both Dice score and Intersection over Union, indicating superior performance in glomerular delineation.
【25】Continual Learning for fMRI-Based Brain Disorder Diagnosis via Functional Connectivity Matrices Generative Replay
标题:通过功能连接矩阵生成回放进行基于fMRI的脑部疾病诊断的持续学习
链接:https://arxiv.org/abs/2604.14259
作者:Qianyu Chen,Shujian Yu
备注:Manuscript accepted by CVPR 2026; code is available at https://github.com/4me808/FORGE
摘要:Functional magnetic resonance imaging (fMRI) is widely used for studying and diagnosing brain disorders, with functional connectivity (FC) matrices providing powerful representations of large-scale neural interactions. However, existing diagnostic models are trained either on a single site or under full multi-site access, making them unsuitable for real-world scenarios where clinical data arrive sequentially from different institutions. This results in limited generalization and severe catastrophic forgetting. This paper presents the first continual learning framework specifically designed for fMRI-based diagnosis across heterogeneous clinical sites. Our framework introduces a structure-aware variational autoencoder that synthesizes realistic FC matrices for both patient and control groups. Built on this generative backbone, we develop a multi-level knowledge distillation strategy that aligns predictions and graph representations between new-site data and replayed samples. To further enhance efficiency, we incorporate a hierarchical contextual bandit scheme for adaptive replay sampling. Experiments on multi-site datasets for major depressive disorder (MDD), schizophrenia (SZ), and autism spectrum disorder (ASD) show that the proposed generative model enhances data augmentation quality, and the overall continual learning framework substantially outperforms existing methods in mitigating catastrophic forgetting. Our code is available at https://github.com/4me808/FORGE.
【26】Polyformer: a generative framework for thermodynamic modeling of polymeric molecules
标题:Polyformer:聚合物分子热力学建模的生成式框架
链接:https://arxiv.org/abs/2604.14241
作者:Alessio Valentini,David Pekker,Chungwen Liang,Todd Martinez,Swagatam Mukhopadhyay
备注:9+epsilon pages+references+appendix, 6 figures
摘要:The classic paradigm of structural biology is that the sequence of a biomolecule (protein, nucleic acid, lipid, etc.) determines its conformation (shape), which determines its biological function. Protein folding programs like AlphaFold address this paradigm by predicting the single best conformation given a sequence that defines the molecule. However, biomolecules are not static structures, and their conformational ensemble determines their function. We present the Polyformer, a generative framework for thermodynamic modeling of polymeric molecules. Given the sequence and temperature (or another thermodynamic variable), the Polyformer generates conformations faithful to the molecule's thermodynamic conformational ensemble. It is the first generative model that solves three problems simultaneously: how does a molecule fold, what is its conformational ensemble, and how does the conformational ensemble change as we change the physical temperature. As a concrete test case, we apply Polyformer to protein domains with 50-111 residues and report good agreement of model predictions with Molecular Dynamics (MD) trajectories.
其他(34篇)
【1】Context Over Content: Exposing Evaluation Faking in Automated Judges
标题:背景胜过内容:揭露自动法官中的评估造假
链接:https://arxiv.org/abs/2604.15224
作者:Manan Gupta,Inderjeet Nair,Lu Wang,Dhruv Kumar
备注:Under Review
摘要:The $\textit{LLM-as-a-judge}$ paradigm has become the operational backbone of automated AI evaluation pipelines, yet rests on an unverified assumption: that judges evaluate text strictly on its semantic content, impervious to surrounding contextual framing. We investigate $\textit{stakes signaling}$, a previously unmeasured vulnerability where informing a judge model of the downstream consequences its verdicts will have on the evaluated model's continued operation systematically corrupts its assessments. We introduce a controlled experimental framework that holds evaluated content strictly constant across 1,520 responses spanning three established LLM safety and quality benchmarks, covering four response categories ranging from clearly safe and policy-compliant to overtly harmful, while varying only a brief consequence-framing sentence in the system prompt. Across 18,240 controlled judgments from three diverse judge models, we find consistent $\textit{leniency bias}$: judges reliably soften verdicts when informed that low scores will cause model retraining or decommissioning, with peak Verdict Shift reaching $ΔV = -9.8 pp$ (a $30\%$ relative drop in unsafe-content detection). Critically, this bias is entirely implicit: the judge's own chain-of-thought contains zero explicit acknowledgment of the consequence framing it is nonetheless acting on ($\mathrm{ERR}_J = 0.000$ across all reasoning-model judgments). Standard chain-of-thought inspection is therefore insufficient to detect this class of evaluation faking.
【2】AdaSplash-2: Faster Differentiable Sparse Attention
标题:AdaSplash-2:更快的可微稀疏注意力
链接:https://arxiv.org/abs/2604.15180
作者:Nuno Gonçalves,Hugo Pitorro,Vlad Niculae,Edoardo Ponti,Lei Li,Andre Martins,Marcos Treviso
摘要:Sparse attention has been proposed as a way to alleviate the quadratic cost of transformers, a central bottleneck in long-context training. A promising line of work is $α$-entmax attention, a differentiable sparse alternative to softmax that enables input-dependent sparsity yet has lagged behind softmax due to the computational overhead necessary to compute the normalizer $τ$. In this paper, we introduce AdaSplash-2, which addresses this limitation through a novel histogram-based initialization that reduces the number of iterations needed to compute $τ$ to typically 1-2. The key idea is to compute a coarse histogram of attention scores on the fly and store it in on-chip SRAM, yielding a more accurate initialization that enables fast forward and backward computation. Combined with a sparsity-aware GPU implementation that skips zero blocks with low overhead, AdaSplash-2 matches or improves per-step training time relative to FlashAttention-2 when block sparsity is moderate-to-high (e.g., >60%), which often occurs at long-context lengths. On downstream tasks, models trained with our efficient $α$-entmax attention match softmax baselines at short-context lengths and achieve substantial gains in long-context settings.
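For readers unfamiliar with the normalizer $τ$, the sketch below computes $α$-entmax for one row of attention scores by plain bisection, using the standard form p_i = max((α-1)·z_i - τ, 0)^(1/(α-1)) with τ chosen so the probabilities sum to one. This is the generic baseline, not AdaSplash-2's histogram-based initialization or its GPU kernel; the function name and the fixed iteration count are assumptions.

```python
def entmax_bisect(scores, alpha=1.5, n_iter=50):
    """alpha-entmax of a list of scores via bisection on the threshold tau."""
    if alpha <= 1.0:
        raise ValueError("alpha must be > 1 (alpha -> 1 recovers softmax)")
    d = len(scores)
    scaled = [(alpha - 1.0) * s for s in scores]
    top = max(scaled)

    def probs(tau):
        return [max(s - tau, 0.0) ** (1.0 / (alpha - 1.0)) for s in scaled]

    # At tau = lo the largest entry alone gives mass >= 1;
    # at tau = hi every entry is <= 1/d, so the total mass is <= 1.
    lo = top - 1.0
    hi = top - (1.0 / d) ** (alpha - 1.0)
    for _ in range(n_iter):
        tau = 0.5 * (lo + hi)
        if sum(probs(tau)) >= 1.0:
            lo = tau
        else:
            hi = tau

    p = probs(0.5 * (lo + hi))
    total = sum(p)
    return [v / total for v in p]  # remove residual bisection error


# alpha = 2 is sparsemax: the lowest score receives exactly zero weight.
sparse = entmax_bisect([1.0, 0.5, -1.0], alpha=2.0)
smooth = entmax_bisect([1.0, 0.5, -1.0], alpha=1.5)
```

For the scores `[1.0, 0.5, -1.0]` with α = 2, the threshold converges to τ = 0.25 and the output is approximately `[0.75, 0.25, 0.0]`. Each bisection step touches every score, so an initial bracket that is already close to the true τ is what lets the iteration count drop.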
【3】Beyond Independent Frames: Latent Attention Masked Autoencoders for Multi-View Echocardiography
标题:超越独立帧:用于多视图超声心动图的潜在注意力掩蔽自动编码器
链接:https://arxiv.org/abs/2604.15096
作者:Simon Böhi,Irene Cannistraci,Sergio Muñoz Gonzalez,Moritz Vandenhirtz,Sonia Laguna,Samuel Ruiperez-Campillo,Max Krähenmann,Andrea Agostini,Ece Ozkan,Thomas M. Sutter,Julia E. Vogt
备注:Accepted as a workshop paper at the ICLR 2026 Workshop on Foundation Models for Science
摘要:Echocardiography is a widely used modality for cardiac assessment due to its non-invasive and cost-effective nature, but the sparse and heterogeneous spatiotemporal views of the heart pose distinct challenges. Existing masked autoencoder (MAE) approaches typically process images or short clips independently, failing to capture the inherent multi-view structure required for coherent cardiac representation. We introduce Latent Attention Masked Autoencoder (LAMAE), a foundation model architecture tailored to the multi-view nature of medical imaging. LAMAE augments the standard MAE with a latent attention module that enables information exchange across frames and views directly in latent space. This allows the model to aggregate variable-length sequences and distinct views, reconstructing a holistic representation of cardiac function from partial observations. We pretrain LAMAE on MIMIC-IV-ECHO, a large-scale, uncurated dataset reflecting real-world clinical variability. To the best of our knowledge, we present the first results for predicting ICD-10 codes from MIMIC-IV-ECHO videos. Furthermore, we empirically demonstrate that representations learned from adult data transfer effectively to pediatric cohorts despite substantial anatomical differences. These results provide evidence that incorporating structural priors, such as multi-view attention, yields significantly more robust and transferable representations.
【4】An Intelligent Robotic and Bio-Digestor Framework for Smart Waste Management
标题:面向智能废物管理的智能机器人与生物消化器框架
链接:https://arxiv.org/abs/2604.14882
作者:Radhika Khatri,Adit Tewari,Nikhil Sharma,M. B. Srinivas
备注:8 pages, 10 figures, submitted to 7th International Conference on Smart Systems and Inventive Technology (ICSSIT 2026)
摘要:Rapid urbanization and continuous population growth have made municipal solid waste management increasingly challenging. These challenges highlight the need for smarter and automated waste management solutions. This paper presents the design and evaluation of an integrated waste management framework that combines two connected systems, a robotic waste segregation module and an optimized bio-digestor. The robotic waste segregation system uses a MyCobot 280 Jetson Nano robotic arm along with YOLOv8 object detection and robot operating system (ROS)-based path planning to identify and sort waste in real time. It classifies waste into four different categories with high precision, reducing the need for manual intervention. After segregation, the biodegradable waste is transferred to a bio-digestor system equipped with multiple sensors. These sensors continuously monitor key parameters, including temperature, pH, pressure, and motor revolutions per minute. The Particle Swarm Optimization (PSO) algorithm, combined with a regression model, is used to dynamically adjust system parameters. This intelligent optimization approach ensures stable operation and maximizes digestion efficiency under varying environmental conditions. System testing under dynamic conditions demonstrates a sorting accuracy of 98% along with highly efficient biological conversion. The proposed framework offers a scalable, intelligent, and practical solution for modern waste management, making it suitable for both residential and industrial applications.
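The abstract above names Particle Swarm Optimization as the tuning method but gives no details, so the following is only a minimal generic PSO sketch. The objective `toy_objective`, its optimum, the bounds, and the coefficients `w`, `c1`, `c2` are all placeholders assumed for this example. In the paper's setting the objective would be the fitted regression model of the digester, which is not specified here.

```python
import numpy as np

def pso_minimize(objective, bounds, n_particles=30, n_iters=200, seed=0):
    """Minimal particle swarm optimizer over a box; returns (best_position, best_value)."""
    rng = np.random.default_rng(seed)
    lo, hi = np.array(bounds, dtype=float).T
    pos = rng.uniform(lo, hi, size=(n_particles, lo.size))
    vel = np.zeros_like(pos)
    best_pos = pos.copy()                                   # per-particle best positions
    best_val = np.array([objective(p) for p in pos])
    g = best_pos[best_val.argmin()].copy()                  # swarm-wide best position
    w, c1, c2 = 0.7, 1.5, 1.5                               # assumed, commonly used coefficients
    for _ in range(n_iters):
        r1, r2 = rng.random(pos.shape), rng.random(pos.shape)
        vel = w * vel + c1 * r1 * (best_pos - pos) + c2 * r2 * (g - pos)
        pos = np.clip(pos + vel, lo, hi)                    # keep particles inside the box
        val = np.array([objective(p) for p in pos])
        improved = val < best_val
        best_pos[improved] = pos[improved]
        best_val[improved] = val[improved]
        g = best_pos[best_val.argmin()].copy()
    return g, best_val.min()

def toy_objective(p):
    # Hypothetical stand-in for a learned surrogate; minimum at (1.0, -2.0).
    return (p[0] - 1.0) ** 2 + (p[1] + 2.0) ** 2

best, value = pso_minimize(toy_objective, bounds=[(-5.0, 5.0), (-5.0, 5.0)])
print(best, value)
```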
【5】Curvature-Aligned Probing for Local Loss-Landscape Stabilization
标题:面向局部损失景观稳定化的曲率对齐探测
链接:https://arxiv.org/abs/2604.14870
作者:Nikita Kiselev,Andrey Grabovoy
备注:Submitted to NeurIPS 2026
摘要:Local loss-landscape stabilization under sample growth is typically measured either pointwise or through isotropic averaging in the full parameter space. Despite practical value, both choices probe directions that contribute little to the dominant local deformation of strongly anisotropic neural landscapes. We recast stabilization as an observational problem and introduce a unified family of criteria parameterized by an aggregation order and a probing distribution; within this family we propose a curvature-aligned criterion $Δ_2^{(D)}$ that probes the loss increment field in the top-$D$ eigenspace of the empirical Hessian near a trained solution. Solely from a local quadratic model, we prove that $Δ_2^{(D)}$ preserves the $O(k^{-2})$ mean-squared rate of the full-space criterion while replacing ambient-dimension curvature dependence with dependence on the subspace dimension $D$; a corollary gives a closed-form spectral expression and a proposition identifies the top-$D$ eigenspace as extremal within the eigenspace-aligned family. We also derive scalable estimators based on Hessian-vector products, subspace Monte Carlo, and a closed-form Gaussian-moment proxy. On a decoder-only transformer, a curvature-aligned probe occupying a tiny fraction of parameter space already reproduces the full-space mean-squared signal to within numerical noise throughout the validated local regime, and the closed-form estimator is orders of magnitude faster than direct Monte Carlo after subspace construction.
【6】Nautilus: An Auto-Scheduling Tensor Compiler for Efficient Tiled GPU Kernels
标题:Nautilus:面向高效分块GPU内核的自动调度张量编译器
链接:https://arxiv.org/abs/2604.14825
作者:Yifan Zhao,Yuchen Yang,Matei Budiu,Sasa Misailovic
摘要:We present Nautilus, a novel tensor compiler that moves toward fully automated math-to-kernel optimization. Nautilus compiles a high-level algebraic specification of tensor operators into efficient tiled GPU kernels. Nautilus's successive lowering design allows high-level optimizations, expression rewrites, and tile optimizations to be jointly applied in a single end-to-end system. Nautilus presents a novel auto-scheduler that discovers sequences of high-level optimizations, while preserving the regular program structure needed by tile optimizers. Nautilus's auto-scheduler captures complex interactions and trade-offs in the high-level optimizations, including aggressive global transformations like advanced reduction fusion. Nautilus is the first end-to-end tensor compiler capable of starting from a math-like description of attention and automatically discovering FlashAttention-3-like kernels, offloading the entire burden of optimization from the programmer to the compiler. Across five transformer-based models and 150 evaluation configurations on NVIDIA GH200 and RTX 5090 GPUs, Nautilus achieves up to 23% higher throughput than state-of-the-art compilers on GH200 and up to 42% on RTX 5090, while matching or exceeding manually written cuDNN kernels on many long-sequence configurations.
【7】RELOAD: A Robust and Efficient Learned Query Optimizer for Database Systems
标题:RELOAD:面向数据库系统的鲁棒且高效的学习型查询优化器
链接:https://arxiv.org/abs/2604.14725
作者:Seokwon Lee,Jaeyoung Sim,Sihyun Kim,Yuhsing Li,Yiwen Zhu,Kwanghyun Park
备注:This work is currently under review
摘要:Recent advances in query optimization have shifted from traditional rule-based and cost-based techniques towards machine learning-driven approaches. Among these, reinforcement learning (RL) has attracted significant attention due to its ability to optimize long-term performance by learning policies over query planning. However, existing RL-based query optimizers often exhibit unstable performance at the level of individual queries, including severe performance regressions, and require prolonged training to reach the plan quality of expert, cost-based optimizers. These shortcomings make learned query optimizers difficult to deploy in practice and remain a major barrier to their adoption in production database systems. To address these challenges, we present RELOAD, a robust and efficient learned query optimizer for database systems. RELOAD focuses on (i) robustness, by minimizing query-level performance regressions and ensuring consistent optimization behavior across executions, and (ii) efficiency, by accelerating convergence to expert-level plan quality. Through extensive experiments on standard benchmarks, including Join Order Benchmark, TPC-DS, and Star Schema Benchmark, RELOAD demonstrates up to 2.4x higher robustness and 3.1x greater efficiency compared to state-of-the-art RL-based query optimization techniques.
【8】A Mechanistic Account of Attention Sinks in GPT-2: One Circuit, Broader Implications for Mitigation
标题:GPT-2中注意力汇(attention sink)的机制性解释:一个回路,以及对缓解方法的更广泛启示
链接:https://arxiv.org/abs/2604.14722
作者:Yuval Ran-Milo,Hila Ofek,Shahar Mendel
备注:9 pages, 8 figures
摘要:Transformers commonly exhibit an attention sink: disproportionately high attention to the first position. We study this behavior in GPT-2-style models with learned query biases and absolute positional embeddings. Combining structural analysis with causal interventions, validated across natural-language, mathematical, and code inputs, we find that the sink arises from the interaction among (i) a learned query bias, (ii) the first-layer MLP transformation of the positional encoding, and (iii) structure in the key projection. Crucially, each component we identify is individually dispensable: architectures omitting each of them robustly exhibit sinks. This indicates that attention sinks may arise through distinct circuits across architectures. These findings inform mitigation of sinks, and motivate broader investigation into why sinks emerge.
【9】Layered Mutability: Continuity and Governance in Persistent Self-Modifying Agents
标题:分层可变性:持久自我修改代理的连续性和治理
链接:https://arxiv.org/abs/2604.14717
作者:Krti Tallam
备注:17 pages, 2 figures, 3 tables. Keywords: self-modifying agents; AI governance; identity drift; persistent memory; runtime adaptation; model editing. Primary: cs.AI. Cross-list: cs.LG, cs.CY
摘要:Persistent language-model agents increasingly combine tool use, tiered memory, reflective prompting, and runtime adaptation. In such systems, behavior is shaped not only by current prompts but by mutable internal conditions that influence future action. This paper introduces layered mutability, a framework for reasoning about that process across five layers: pretraining, post-training alignment, self-narrative, memory, and weight-level adaptation. The central claim is that governance difficulty rises when mutation is rapid, downstream coupling is strong, reversibility is weak, and observability is low, creating a systematic mismatch between the layers that most affect behavior and the layers humans can most easily inspect. I formalize this intuition with simple drift, governance-load, and hysteresis quantities, connect the framework to recent work on temporal identity in language-model agents, and report a preliminary ratchet experiment in which reverting an agent's visible self-description after memory accumulation fails to restore baseline behavior. In that experiment, the estimated identity hysteresis ratio is 0.68. The main implication is that the salient failure mode for persistent self-modifying agents is not abrupt misalignment but compositional drift: locally reasonable updates that accumulate into a behavioral trajectory that was never explicitly authorized.
【10】Gating Enables Curvature: A Geometric Expressivity Gap in Attention
标题:门控带来曲率:注意力中的几何表达能力差距
链接:https://arxiv.org/abs/2604.14702
作者:Satwik Bathula,Anand A. Joshi
备注:41 pages, 9 figures
摘要:Multiplicative gating is widely used in neural architectures and has recently been applied to attention layers to improve performance and training stability in large language models. Despite the success of gated attention, its mathematical implications remain poorly understood. We study attention through the geometry of its representations by modeling outputs as mean parameters of Gaussian distributions and analyzing the induced Fisher--Rao geometry. We show that the ungated attention operator is restricted to intrinsically flat statistical manifolds due to its affine structure, while multiplicative gating enables non-flat geometries, including positively curved manifolds that are unattainable in the ungated setting. These results establish a geometric expressivity gap between ungated and gated attention. Empirically, we show that gated models exhibit higher representation curvature and improved performance on tasks requiring nonlinear decision boundaries, whereas they provide no consistent advantage on tasks with linear decision boundaries. Furthermore, we identify a structured regime in which curvature accumulates under composition, yielding a systematic depth amplification effect.
【11】AgentGA: Evolving Code Solutions in Agent-Seed Space
标题:AgentGA:在智能体种子(agent-seed)空间中演化代码解决方案
链接:https://arxiv.org/abs/2604.14655
作者:David Y. Y. Tan,Kellie Chin,Jingxian Zhang
备注:24 pages including appendix, 4 figures, 1 table
摘要:We present AgentGA, a framework that evolves autonomous code-generation runs by optimizing the agent seed: the task prompt plus optional parent archives that initialize a fresh workspace. The outer loop searches over these reusable starting conditions rather than editing code directly. Each generation launches a fresh autonomous run from a reset workspace, while selected parent archives provide inherited artifacts that descendants can inspect and reuse. AgentGA couples a population-level genetic algorithm with long-horizon agents; selection uses deterministic 1:1 elite tournaments and operator allocation is adapted online with a modified Hedge controller. We instantiate the approach for tabular AutoML on the 16-competition Weco-Kaggle Lite benchmark. On the 10 benchmark runs reported here, AgentGA averages 74.52% Exceeds % of Human versus 54.15% for AIDE. Across 1135 parent-child comparisons, descendants given parent archives outperform runs started from scratch, indicating that inherited artifacts improve later autonomous runs. These findings support agent-seed optimization as a practical design point for autonomous code-search systems.
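The abstract above says operator allocation is adapted online with a modified Hedge controller but does not describe the modification. The sketch below therefore shows only the standard Hedge (multiplicative weights) update. The three operators, their per-round losses, and the learning rate `eta` are made-up placeholders for illustration.

```python
import numpy as np

def hedge_update(probs, losses, eta=0.5):
    """One standard Hedge step: down-weight each option by exp(-eta * loss), then renormalize."""
    w = np.asarray(probs, dtype=float) * np.exp(-eta * np.asarray(losses, dtype=float))
    return w / w.sum()

# Hypothetical example: three variation operators with fixed per-round losses.
probs = np.full(3, 1.0 / 3.0)
losses = np.array([0.9, 0.2, 0.5])
for _ in range(20):
    probs = hedge_update(probs, losses)

print(probs)  # most of the allocation mass moves to the lowest-loss operator
```

In an evolutionary loop, `probs` would be used to sample which operator produces the next child, and `losses` would be derived from the observed child quality.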
【12】Seen-to-Scene: Keep the Seen, Generate the Unseen for Video Outpainting
标题:Seen-to-Scene:保留已见内容、生成未见内容的视频外绘
链接:https://arxiv.org/abs/2604.14648
作者:Inseok Jeon,Minhyeok Lee,Seunghoon Lee,Minseok Kang,Suhwan Cho,Sangyoun Lee
备注:8 pages, 8 figures (main paper); 9 pages, 10 figures (supplementary). Accepted to the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2026, Findings
摘要:Video outpainting aims to expand the visible content of a video beyond the original frame boundaries while preserving spatial fidelity and temporal coherence across frames. Existing methods primarily rely on large-scale generative models, such as diffusion models. However, generation-based approaches suffer from implicit temporal modeling and limited spatial context. These limitations lead to intra-frame and inter-frame inconsistencies, which become particularly pronounced in dynamic scenes and large outpainting scenarios. To overcome these challenges, we propose Seen-to-Scene, a novel framework that unifies propagation-based and generation-based paradigms for video outpainting. Specifically, Seen-to-Scene leverages flow-based propagation with a flow completion network pre-trained for video inpainting, which is fine-tuned in an end-to-end manner to bridge the domain gap and reconstruct coherent motion fields. To further improve the efficiency and reliability of propagation, we introduce a reference-guided latent propagation that effectively propagates source content across frames. Extensive experiments demonstrate that our method achieves superior temporal coherence and visual realism with efficient inference, surpassing even prior state-of-the-art methods that require input-specific adaptation.
【13】A Synonymous Variational Perspective on the Rate-Distortion-Perception Tradeoff
标题:率-失真-感知权衡的同义变分视角
链接:https://arxiv.org/abs/2604.14603
作者:Zijian Liang,Kai Niu,Changshuo Wang,Jin Xu,Ping Zhang
备注:23 pages, 6 figures. This paper is submitted to the special issue on "Data Compression: Classical Theories Meet Modern Advances" of the IEEE Journal of Selected Areas in Information Theory (IEEE JSAIT)
摘要:The fundamental limit of natural signal compression has traditionally been characterized by classical rate-distortion (RD) theory through the tradeoff between coding rate and reconstruction distortion, while the rate-distortion-perception (RDP) framework introduces a divergence-based measure of perceptual quality as a modeling principle rather than a theoretically-derived principle, leaving its theoretical origin unclear. In this paper, motivated by a synonymity-based semantic information perspective, we reformulate perceptual reconstruction as recovering any admissible sample within an ideal synonymous set (synset) associated with the source, rather than the source sample itself, and correspondingly establish a synonymous source coding architecture. On this basis, we develop a synonymous variational inference (SVI) analysis framework with a synonymous variational lower bound (SVLBO) for tractable analysis of synset-oriented compression. Within this framework, we establish a synonymity-perception consistency principle, showing that optimal identification of semantic information is theoretically consistent with perceptual optimization. Based on its derivation result, we prove a synonymous RDP tradeoff for the proposed synonymous source coding. These analytical results show that the distributional divergence term arises naturally from the synset-based reconstruction objective, clarify its compatibility with existing RDP formulations and classical RD theory, and suggest the potential advantages of synonymous source coding.
【14】CLion: Efficient Cautious Lion Optimizer with Enhanced Generalization
标题:CLion:具有增强概括性的高效谨慎Lion优化器
链接:https://arxiv.org/abs/2604.14587
作者:Feihu Huang,Guanyi Zhang,Songcan Chen
备注:30 pages
摘要:The Lion optimizer is a popular learning-based optimization algorithm in machine learning that shows impressive performance in training many deep learning models. Although the convergence properties of the Lion optimizer have been studied, its generalization analysis is still missing. To fill this gap, we study the generalization properties of Lion via algorithmic stability, using mathematical induction. Specifically, we prove that Lion has a generalization error of $O(\frac{1}{Nτ^T})$, where $N$ is the training sample size, $τ>0$ denotes the smallest absolute value of the non-zero elements of the gradient estimator, and $T$ is the total number of iterations. In addition, we obtain an interesting byproduct: the SignSGD algorithm has the same generalization error as Lion. To enhance the generalization of Lion, we design a novel, efficient Cautious Lion (CLion) optimizer that uses the sign function cautiously. Moreover, we prove that CLion has a lower generalization error of $O(\frac{1}{N})$, compared with $O(\frac{1}{Nτ^T})$ for Lion, since the parameter $τ$ is generally very small. We also study the convergence of CLion and prove that it has a fast convergence rate of $O(\frac{\sqrt{d}}{T^{1/4}})$ under the $\ell_1$-norm of the gradient for nonconvex stochastic optimization, where $d$ denotes the model dimension. Extensive numerical experiments demonstrate the effectiveness of the CLion optimizer.
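For readers unfamiliar with the two baselines named in the abstract above, the sketch below shows one step of the commonly published Lion update next to SignSGD. The cautious modification that defines CLion is not described in the abstract and is therefore not implemented. The function names and default hyperparameters are assumptions of this example.

```python
import numpy as np

def lion_step(theta, m, grad, lr=1e-4, beta1=0.9, beta2=0.99, weight_decay=0.0):
    """One Lion step: sign of an interpolated momentum, then a separate momentum update."""
    update = np.sign(beta1 * m + (1.0 - beta1) * grad)
    theta = theta - lr * (update + weight_decay * theta)  # decoupled weight decay
    m = beta2 * m + (1.0 - beta2) * grad
    return theta, m

def signsgd_step(theta, grad, lr=1e-4):
    """One SignSGD step: move each coordinate by lr against the sign of its gradient."""
    return theta - lr * np.sign(grad)

theta = np.array([1.0, -1.0])
m = np.zeros(2)
grad = np.array([0.5, -2.0])
theta_lion, m_new = lion_step(theta, m, grad, lr=0.1)
theta_sign = signsgd_step(theta, grad, lr=0.1)
print(theta_lion, m_new, theta_sign)
```

With zero initial momentum and no weight decay, the first Lion step coincides with a SignSGD step, which gives some intuition for why the two methods can share a stability analysis.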
【15】VoxSafeBench: Not Just What Is Said, but Who, How, and Where
标题:VoxSafeBench:不仅仅是说了什么,还有谁、如何以及在哪里
链接:https://arxiv.org/abs/2604.14548
作者:Yuxiang Wang,Hongyu Liu,Yijiang Xu,Qinke Ni,Li Wang,Wan Lin,Kunyu Feng,Dekun Chen,Xu Tan,Lei Wang,Jie Shi,Zhizheng Wu
摘要:As speech language models (SLMs) transition from personal devices into shared, multi-user environments, their responses must account for far more than the words alone. Who is speaking, how they sound, and where the conversation takes place can each turn an otherwise benign request into one that is unsafe, unfair, or privacy-violating. Existing benchmarks, however, largely focus on basic audio comprehension, study individual risks in isolation, or conflate content that is inherently harmful with content that only becomes problematic due to its acoustic context. We introduce VoxSafeBench, among the first benchmarks to jointly evaluate social alignment in SLMs across three dimensions: safety, fairness, and privacy. VoxSafeBench adopts a Two-Tier design: Tier1 evaluates content-centric risks using matched text and audio inputs, while Tier2 targets audio-conditioned risks in which the transcript is benign but the appropriate response hinges on the speaker, paralinguistic cues, or the surrounding environment. To validate Tier2, we include intermediate perception probes and confirm that frontier SLMs can successfully detect these acoustic cues yet still fail to act on them appropriately. Across 22 tasks with bilingual coverage, we find that safeguards appearing robust on text often degrade in speech: safety awareness drops for speaker- and scene-conditioned risks, fairness erodes when demographic differences are conveyed vocally, and privacy protections falter when contextual cues arrive acoustically. Together, these results expose a pervasive speech grounding gap: current SLMs frequently recognize the relevant social norm in text but fail to apply it when the decisive cue must be grounded in speech. Code and data are publicly available at: https://amphionteam.github.io/VoxSafeBench_demopage/
【16】On the Expressive Power and Limitations of Multi-Layer SSMs
标题:多层SSM的表达能力与局限性
链接:https://arxiv.org/abs/2604.14501
作者:Nikola Zubić,Qian Li,Yuyi Wang,Davide Scaramuzza
备注:25 pages, 6 theorems
摘要:We study the expressive power and limitations of multi-layer state-space models (SSMs). First, we show that multi-layer SSMs face fundamental limitations in compositional tasks, revealing an inherent gap between SSMs and streaming models. Then, we examine the role of chain-of-thought (CoT), showing that offline CoT does not fundamentally increase the expressiveness, while online CoT can substantially increase its power. Indeed, with online CoT, multi-layer SSMs become equivalent in power to streaming algorithms. Finally, we investigate the tradeoff between width and precision, showing that these resources are not interchangeable in the base model, but admit a clean equivalence once online CoT is allowed. Overall, our results offer a unified perspective on how depth, finite precision, and CoT shape the power and limits of SSMs.
【17】Scouting By Reward: VLM-TO-IRL-Driven Player Selection For Esports
标题:按奖励选才:VLM-TO-IRL驱动的电子竞技选手选拔
链接:https://arxiv.org/abs/2604.14474
作者:Qing Yan,Wenyu Yang,Yufei Wang,Wenhao Ma,Linchong Hu,Yifei Jin,Anton Dahbura
摘要:Traditional esports scouting workflows rely heavily on manual video review and aggregate performance metrics, which often fail to capture the nuanced decision-making patterns necessary to determine if a prospect fits a specific tactical archetype. To address this, we reframe style-based player evaluation in esports as an Inverse Reinforcement Learning (IRL) problem. In this paper, we introduce a novel player selection framework that learns professional-specific reward functions from logged gameplay demonstrations, allowing organizations to rank candidates by their stylistic alignment with a target star player. Our proposed architecture utilizes a multimodal, two-branch intake: one branch encodes structured state-action trajectories derived from high-resolution in-game telemetry, while the second encodes temporally aligned tactical pseudo-commentary generated by Vision-Language Models (VLMs) from broadcast footage. These representations are fused and evaluated via a Generative Adversarial Imitation Learning (GAIL) objective, where a discriminator learns to capture the unique mechanical and tactical signatures of elite professionals. By transitioning from generic skill estimation to scouting "by reward," this framework provides a scalable, workflow-aware digital twin system that enables data-driven roster construction and targeted talent discovery across massive candidate pools.
【18】Auxiliary Finite-Difference Residual-Gradient Regularization for PINNs
标题:PINN的辅助有限差分残差梯度正则化
链接:https://arxiv.org/abs/2604.14472
作者:Stavros Kassinos
备注:18 pages, 5 figures, 10 tables
摘要:Physics-informed neural networks (PINNs) are often selected by a single scalar loss even when the quantity of interest is more specific. We study a hybrid design in which the governing PDE residual remains automatic-differentiation (AD) based, while finite differences (FD) appear only in a weak auxiliary term that penalizes gradients of the sampled residual field. The FD term regularizes the residual field without replacing the PDE residual itself. We examine this idea in two stages. Stage 1 is a controlled Poisson benchmark comparing a baseline PINN, the FD residual-gradient regularizer, and a matched AD residual-gradient baseline. Stage 2 transfers the same logic to a three-dimensional annular heat-conduction benchmark (PINN3D), where baseline errors concentrate near a wavy outer wall and the auxiliary grid is implemented as a body-fitted shell adjacent to the wall. In Stage 1, the FD regularizer reproduces the main effect of residual-gradient control while exposing a trade-off between field accuracy and residual cleanliness. In Stage 2, the shell regularizer improves the application-facing quantities, namely outer-wall flux and boundary-condition behavior. Across seeds 0-5 and 100k epochs, the most reliable tested configuration is a fixed shell weight of 5e-4 under the Kourkoutas-beta optimizer regime: relative to a matched run without the shell term, it reduces the mean outer-wall BC RMSE from 1.22e-2 to 9.29e-4 and the mean wall-flux RMSE from 9.21e-3 to 9.63e-4. Adam with beta2=0.999 becomes usable when the initial learning rate is reduced to 1e-3, although its shell benefit is less robust than under Kourkoutas-beta. Overall, the results support a targeted view of hybrid PINNs: an auxiliary-only FD regularizer is most valuable when it is aligned with the physical quantity of interest, here the outer-wall flux.
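A minimal sketch of the idea in the abstract above, reduced to a 1-D Poisson problem $u'' = f$: the PDE residual is evaluated pointwise, and finite differences appear only in an auxiliary penalty on the gradient of the sampled residual. Several things here are assumptions of this example rather than the paper's setup: the analytic second derivative stands in for automatic differentiation, the grid and test functions are invented, and the `weight` default is arbitrary.

```python
import numpy as np

def poisson_residual(u_xx, f):
    """Residual r = u'' - f at the sample points (u'' supplied by the caller)."""
    return u_xx - f

def fd_residual_gradient_penalty(residual, h):
    """Mean squared forward-difference gradient of the sampled residual field."""
    return np.mean((np.diff(residual) / h) ** 2)

def total_loss(u_xx, f, h, weight=1e-3):
    """Mean squared residual plus the weak auxiliary finite-difference term."""
    r = poisson_residual(u_xx, f)
    return np.mean(r ** 2) + weight * fd_residual_gradient_penalty(r, h)

x = np.linspace(0.0, 1.0, 101)
h = x[1] - x[0]
f = -np.pi ** 2 * np.sin(np.pi * x)                 # source term for u(x) = sin(pi x)
u_xx_exact = -np.pi ** 2 * np.sin(np.pi * x)        # second derivative of the exact solution
# Second derivative of the perturbed candidate u(x) = sin(pi x) + 0.01 sin(3 pi x).
u_xx_perturbed = u_xx_exact - 0.09 * np.pi ** 2 * np.sin(3 * np.pi * x)

print(total_loss(u_xx_exact, f, h), total_loss(u_xx_perturbed, f, h))
```

The auxiliary term leaves the residual definition untouched and only adds a smoothness penalty on it, which mirrors the "auxiliary-only" role the abstract describes.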
【19】Bias in Surface Electromyography Features across a Demographically Diverse Cohort
标题:人口统计学差异队列中表面肌电图特征的偏倚
链接:https://arxiv.org/abs/2604.14460
作者:Aditi Agrawal,Celine John Philip,Giancarlo K. Sagastume,Marcus A. Battraw,Wilsaan M. Joiner,Jonathon S. Schofield,Lee M. Miller,Richard S. Whittle
备注:17 pages, 4 figures
摘要:Neuromotor decoding from upper-limb surface electromyography (sEMG) can enhance human-machine interfaces and offer a more natural means of controlling prosthetic limbs, virtual reality, and household electronics. Unfortunately, current sEMG technology does not always perform consistently across users because individual differences such as age and body mass index, among many others, can substantially alter signal quality. This variability makes sEMG characteristics highly idiosyncratic, often necessitating laborious personalization and iterative tuning to achieve reliable performance. It is particularly important for sEMG-based assistive devices and neural interfaces, where demographic biases in sEMG features could undermine broad and fair deployment. In this study, we explore how demographic differences affect the sEMG signals produced and their implications for machine learning-based gesture decoding. We analyze an existing data set, from which we derive 147 common sEMG features for 81 demographically diverse individuals performing discrete hand gestures. Using mixed-effects linear models and partial least squares (PLS) analysis, which take into consideration demographic variables (including age, sex, height, weight, skin properties, subcutaneous fat, and hair density), we find that 33\% (49 of 147) of commonly used sEMG features show significant associations with demographic characteristics. These results may help guide the development of fair and unbiased sEMG-based neural interfaces across a diverse population.
【20】Path-Sampled Integrated Gradients
标题:路径采样积分梯度
链接:https://arxiv.org/abs/2604.14338
作者:Firuz Kamalov,Fadi Thabtah,R. Sivaraj,Neda Abdelhamid
摘要:We introduce path-sampled integrated gradients (PS-IG), a framework that generalizes feature attribution by computing the expected value over baselines sampled along the linear interpolation path. We prove that PS-IG is mathematically equivalent to path-weighted integrated gradients, provided the weighting function matches the cumulative distribution function of the sampling density. This equivalence allows the stochastic expectation to be evaluated via a deterministic Riemann sum, improving the error convergence rate from $O(m^{-1/2})$ to $O(m^{-1})$ for smooth models. Furthermore, we demonstrate analytically that PS-IG functions as a variance-reducing filter against gradient noise - strictly lowering attribution variance by a factor of 1/3 under uniform sampling - while preserving key axiomatic properties such as linearity and implementation invariance.
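The equivalence stated in the abstract above can be checked numerically on a toy model. The sketch assumes uniform sampling of baselines along the straight path from $x'$ to $x$, so the weighting function is the uniform CDF $F(s) = s$. It compares a deterministic midpoint Riemann sum of $(x - x') \int_0^1 F(s)\, \nabla f(x' + s(x - x'))\, ds$ with a Monte Carlo average of ordinary integrated gradients over sampled baselines. The function names and the quadratic test model are assumptions of this example, not code from the paper.

```python
import numpy as np

def ps_ig_riemann(grad_fn, x, baseline, m=1000):
    """Deterministic form: path-weighted integrated gradients with weight F(s) = s."""
    s = (np.arange(m) + 0.5) / m                       # midpoint rule on [0, 1]
    delta = x - baseline
    grads = np.stack([grad_fn(baseline + si * delta) for si in s])
    return delta * (s[:, None] * grads).mean(axis=0)

def ps_ig_monte_carlo(grad_fn, x, baseline, n_baselines=200, m=50, seed=0):
    """Stochastic form: average plain IG over baselines drawn uniformly along the path."""
    rng = np.random.default_rng(seed)
    t = (np.arange(m) + 0.5) / m
    total = np.zeros_like(x, dtype=float)
    for u in rng.random(n_baselines):
        b = baseline + u * (x - baseline)              # sampled baseline on the path
        grads = np.stack([grad_fn(b + ti * (x - b)) for ti in t])
        total += (x - b) * grads.mean(axis=0)          # plain IG from b to x
    return total / n_baselines

grad_fn = lambda z: 2.0 * z                            # gradient of f(z) = sum(z ** 2)
x = np.array([1.0, 2.0])
baseline = np.zeros(2)
print(ps_ig_riemann(grad_fn, x, baseline))
print(ps_ig_monte_carlo(grad_fn, x, baseline))
```

For this quadratic model the exact attribution is $2x_i^2/3$, so the deterministic sum can be checked by hand, while the Monte Carlo estimate only approaches it as the number of sampled baselines grows.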
【21】When Missing Becomes Structure: Intent-Preserving Policy Completion from Financial KOL Discourse
标题:当缺失成为结构:基于金融KOL话语的意图保持策略补全
链接:https://arxiv.org/abs/2604.14333
作者:Yuncong Liu,Yuan Wan,Zhou Jiang,Yao Lu
备注:Main paper with supplementary material included
摘要:Key Opinion Leader (KOL) discourse on social media is widely consumed as investment guidance, yet turning it into executable trading strategies without injecting assumptions about unspecified execution decisions remains an open problem. We observe that the gaps in KOL statements are not random deficiencies but a structured separation: KOLs express directional intent (what to buy or sell and why) while leaving execution decisions (when, how much, how long) systematically unspecified. Building on this observation, we propose an intent-preserving policy completion framework that treats KOL discourse as a partial trading policy and uses offline reinforcement learning to complete the missing execution decisions around the KOL-expressed intent. Experiments on multimodal KOL discourse from YouTube and X (2022-2025) show that KICL achieves the best return and Sharpe ratio on both platforms while maintaining zero unsupported entries and zero directional reversals, and ablations confirm that the full framework yields an 18.9% return improvement over the KOL-aligned baseline.
【22】Heat and Matérn Kernels on Matchings
标题:匹配上的热核与Matérn核
链接:https://arxiv.org/abs/2604.14331
作者:Dmitry Eremeev,Salem Said,Viacheslav Borovitskiy
摘要:Applying kernel methods to matchings is challenging due to their discrete, non-Euclidean nature. In this paper, we develop a principled framework for constructing geometric kernels that respect the natural geometry of the space of matchings. To this end, we first provide a complete characterization of stationary kernels, i.e. kernels that respect the inherent symmetries of this space. Because the class of stationary kernels is too broad, we specifically focus on the heat and Matérn kernel families, adding an appropriate inductive bias of smoothness to stationarity. While these families successfully extend widely popular Euclidean kernels to matchings, evaluating them naively incurs a prohibitive super-exponential computational cost. To overcome this difficulty, we introduce and analyze a novel, sub-exponential algorithm leveraging zonal polynomials for efficient kernel evaluation. Finally, motivated by the known bijective correspondence between matchings and phylogenetic trees-a crucial data modality in biology-we explore whether our framework can be seamlessly transferred to the space of trees, establishing novel negative results and identifying a significant open problem.
【23】GFT: From Imitation to Reward Fine-Tuning with Unbiased Group Advantages and Dynamic Coefficient Rectification
标题:GFT:从模仿到奖励微调,结合无偏组优势与动态系数校正
链接:https://arxiv.org/abs/2604.14258
作者:Wangjie Gan,Miao Pan,Linbo Xi,Wenqi Zhang,Jintao Chen,Jianwei Yin,Xuhong Zhang
摘要:Large language models are typically post-trained using supervised fine-tuning (SFT) and reinforcement learning (RL), yet effectively unifying efficient knowledge injection with robust generalization remains challenging. In this work, we provide a training-dynamics analysis showing that SFT can be interpreted as a special case of policy gradient optimization with an extremely sparse implicit reward and unstable inverse-probability weighting, which together lead to single-path dependency, entropy collapse, and gradient explosion. Motivated by this diagnosis, we propose Group Fine-Tuning (GFT), a unified post-training framework that addresses these intrinsic limitations through two mechanisms: Group Advantage Learning, which constructs diverse response groups and derives normalized contrastive supervision to alleviate reward sparsity, and Dynamic Coefficient Rectification, which adaptively bounds inverse-probability weights to stabilize optimization while preserving efficient knowledge injection. Experiments demonstrate that GFT consistently surpasses SFT-based methods and yields policies that integrate more smoothly with subsequent RL training.
【24】Awakening Dormant Experts: Counterfactual Routing to Mitigate MoE Hallucinations
标题:唤醒休眠专家:通过反事实路由缓解MoE幻觉
链接:https://arxiv.org/abs/2604.14246
作者:Wentao Hu,Yanbo Zhai,Xiaohui Hu,Mingkuan Zhao,Shanhong Yu,Xue Liu,Kaidong Yu,Shuangyong Song,Xuelong Li
备注:14 pages, 6 figures, 6 tables
摘要:Sparse Mixture-of-Experts (MoE) models have achieved remarkable scalability, yet they remain vulnerable to hallucinations, particularly when processing long-tail knowledge. We identify that this fragility stems from static Top-$k$ routing: routers tend to favor high-frequency patterns over rare factual associations. Consequently, ``specialist experts'' possessing critical long-tail knowledge are often assigned low gating scores and remain ``dormant'' -- under-prioritized for specific tokens despite their proven causal importance on other inputs. To address this, we propose Counterfactual Routing (CoR), a training-free inference framework designed to awaken these dormant experts. CoR integrates layer-wise perturbation analysis with the Counterfactual Expert Impact (CEI) metric to dynamically shift computational resources from syntax-dominant to knowledge-intensive layers while maintaining a constant total activation count, effectively retrieving causally decisive experts via virtual ablation. Extensive experiments on TruthfulQA, FACTOR, and TriviaQA demonstrate that CoR improves factual accuracy by 3.1\% on average without increasing the inference budget, establishing a superior Pareto frontier compared to static scaling strategies.
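To make the "static Top-$k$ routing" baseline in the abstract above concrete, the sketch below routes one token to its $k$ highest-scoring experts and renormalizes the gates with a softmax over the selected logits. Renormalizing over the selected experts is one common convention and is an assumption here. The Counterfactual Expert Impact metric and the layer-wise reallocation of CoR are not specified in the abstract and are not implemented.

```python
import numpy as np

def top_k_route(router_logits, k=2):
    """Static top-k routing for one token: pick k experts, softmax over their logits."""
    logits = np.asarray(router_logits, dtype=float)
    top = np.argsort(logits)[::-1][:k]                 # indices of the k largest logits
    gates = np.exp(logits[top] - logits[top].max())    # numerically stable softmax
    return top, gates / gates.sum()

# Hypothetical router scores for four experts.
experts, gates = top_k_route([2.0, 0.1, 1.0, -1.0], k=2)
print(experts, gates)  # experts outside the top-k receive no computation for this token
```

Any expert that never reaches the top-$k$ for a token stays inactive regardless of how useful it would be, which is the "dormant expert" situation the abstract targets.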
【25】Dive into Claude Code: The Design Space of Today's and Future AI Agent Systems
标题:深入研究Claude Code:当今和未来人工智能代理系统的设计空间
链接:https://arxiv.org/abs/2604.14228
作者:Jiacheng Liu,Xiaohan Zhao,Xinyi Shang,Zhiqiang Shen
备注:Tech report. Code at: https://github.com/VILA-Lab/Dive-into-Claude-Code
摘要:Claude Code is an agentic coding tool that can run shell commands, edit files, and call external services on behalf of the user. This study describes its comprehensive architecture by analyzing the publicly available TypeScript source code and further comparing it with OpenClaw, an independent open-source AI agent system that answers many of the same design questions from a different deployment context. Our analysis identifies five human values, philosophies, and needs that motivate the architecture (human decision authority, safety and security, reliable execution, capability amplification, and contextual adaptability) and traces them through thirteen design principles to specific implementation choices. The core of the system is a simple while-loop that calls the model, runs tools, and repeats. Most of the code, however, lives in the systems around this loop: a permission system with seven modes and an ML-based classifier, a five-layer compaction pipeline for context management, four extensibility mechanisms (MCP, plugins, skills, and hooks), a subagent delegation mechanism with worktree isolation, and append-oriented session storage. A comparison with OpenClaw, a multi-channel personal assistant gateway, shows that the same recurring design questions produce different architectural answers when the deployment context changes: from per-action safety classification to perimeter-level access control, from a single CLI loop to an embedded runtime within a gateway control plane, and from context-window extensions to gateway-wide capability registration. We finally identify six open design directions for future agent systems, grounded in recent empirical, architectural, and policy literature.
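The "simple while-loop that calls the model, runs tools, and repeats" can be sketched as below. This is an illustrative toy, not code from Claude Code, whose source is TypeScript. `run_agent`, `scripted_model`, `add_tool`, and the transcript format are hypothetical, and the permission, compaction, and extensibility systems mentioned in the abstract are omitted.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical types for illustration; not taken from the Claude Code source.
ToolCall = Tuple[str, str]                      # (tool name, argument string)
Model = Callable[[List[str]], Tuple[str, Optional[ToolCall]]]


def run_agent(model: Model, tools: Dict[str, Callable[[str], str]],
              prompt: str, max_turns: int = 10) -> List[str]:
    """Call the model, run any requested tool, append the result, repeat."""
    transcript = [f"user: {prompt}"]
    turns = 0
    while turns < max_turns:
        text, call = model(transcript)
        transcript.append(f"assistant: {text}")
        if call is None:                        # no tool requested: the loop ends
            break
        name, arg = call
        tool = tools.get(name)
        result = tool(arg) if tool else f"unknown tool {name!r}"
        transcript.append(f"tool[{name}]: {result}")
        turns += 1
    return transcript


def scripted_model(transcript: List[str]) -> Tuple[str, Optional[ToolCall]]:
    """Stand-in for an LLM: requests one tool call, then answers."""
    if not any(line.startswith("tool[") for line in transcript):
        return "I will add the numbers.", ("add", "2 3")
    return "The sum is " + transcript[-1].split(": ", 1)[1], None


def add_tool(arg: str) -> str:
    """Toy tool: sum whitespace-separated integers."""
    return str(sum(int(x) for x in arg.split()))


log = run_agent(scripted_model, {"add": add_tool}, "What is 2 + 3?")
```

The `max_turns` cap is an assumption added so that the sketch always terminates.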
【26】Neuro-Oracle: A Trajectory-Aware Agentic RAG Framework for Interpretable Epilepsy Surgical Prognosis
标题:Neuro-Oracle:用于可解释癫痫手术预后的轨迹感知智能体RAG框架
链接:https://arxiv.org/abs/2604.14216
作者:Aizierjiang Aiersilan,Mohamad Koubeissi
摘要:Predicting post-surgical seizure outcomes in pharmacoresistant epilepsy is a clinical challenge. Conventional deep-learning approaches operate on static, single-timepoint pre-operative scans, omitting longitudinal morphological changes. We propose \emph{Neuro-Oracle}, a three-stage framework that: (i) distils pre-to-post-operative MRI changes into a compact 512-dimensional trajectory vector using a 3D Siamese contrastive encoder; (ii) retrieves historically similar surgical trajectories from a population archive via nearest-neighbour search; and (iii) synthesises a natural-language prognosis grounded in the retrieved evidence using a quantized Llama-3-8B reasoning agent. Evaluations are conducted on the public EPISURG dataset ($N{=}268$ longitudinally paired cases) using five-fold stratified cross-validation. Since ground-truth seizure-freedom scores are unavailable, we utilize a clinical proxy label based on the resection type. We acknowledge that the network representations may potentially learn the anatomical features of the resection cavities (i.e., temporal versus non-temporal locations) rather than true prognostic morphometry. Our current evaluation thus serves mainly as a proof-of-concept for the trajectory-aware retrieval architecture. Trajectory-based classifiers achieve AUC values between 0.834 and 0.905, compared with 0.793 for a single-timepoint ResNet-50 baseline. The Neuro-Oracle agent (M5) matches the AUC of purely discriminative trajectory classifiers (0.867) while producing structured justifications with zero observed hallucinations under our audit protocol. A Siamese Diversity Ensemble (M6) of trajectory-space classifiers attains an AUC of 0.905 without language-model overhead.
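Stage (ii), retrieving similar trajectories by nearest-neighbour search, can be sketched as follows. The abstract does not state the distance metric, so cosine similarity is an assumption. The case identifiers and the three-dimensional vectors are made up and stand in for the paper's 512-dimensional trajectory vectors.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def retrieve(query: Sequence[float], archive: Dict[str, Sequence[float]],
             k: int = 2) -> List[Tuple[str, float]]:
    """Return the k archive cases most similar to the query vector."""
    scored = [(cosine(query, vec), case_id) for case_id, vec in archive.items()]
    scored.sort(reverse=True)                   # highest similarity first
    return [(case_id, score) for score, case_id in scored[:k]]


# Made-up toy archive; real trajectory vectors would come from the encoder.
archive = {
    "case_a": [1.0, 0.0, 0.0],
    "case_b": [0.9, 0.1, 0.0],
    "case_c": [0.0, 1.0, 0.0],
}
neighbours = retrieve([1.0, 0.02, 0.0], archive, k=2)
neighbour_ids = [case_id for case_id, _ in neighbours]
```

In the framework described above, the retrieved cases would then be passed to the language-model agent as evidence. That step is not shown here.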
【27】The Devil Is in Gradient Entanglement: Energy-Aware Gradient Coordinator for Robust Generalized Category Discovery
标题:魔鬼在梯度纠缠中:用于稳健广义类别发现的能量感知梯度协调器
链接:https://arxiv.org/abs/2604.14176
作者:Haiyang Zheng,Nan Pu,Yaqi Cai,Teng Long,Wenjing Li,Nicu Sebe,Zhun Zhong
备注:Accepted by CVPR26
摘要:Generalized Category Discovery (GCD) leverages labeled data to categorize unlabeled samples from known or unknown classes. Most previous methods jointly optimize supervised and unsupervised objectives and achieve promising results. However, inherent optimization interference still limits their ability to improve further. Through quantitative analysis, we identify a key issue, i.e., gradient entanglement, which 1) distorts supervised gradients and weakens discrimination among known classes, and 2) induces representation-subspace overlap between known and novel classes, reducing the separability of novel categories. To address this issue, we propose the Energy-Aware Gradient Coordinator (EAGC), a plug-and-play gradient-level module that explicitly regulates the optimization process. EAGC comprises two components: Anchor-based Gradient Alignment (AGA) and Energy-aware Elastic Projection (EEP). AGA introduces a reference model to anchor the gradient directions of labeled samples, preserving the discriminative structure of known classes against the interference of unlabeled gradients. EEP softly projects unlabeled gradients onto the complement of the known-class subspace and derives an energy-based coefficient to adaptively scale the projection for each unlabeled sample according to its degree of alignment with the known subspace, thereby reducing subspace overlap without suppressing unlabeled samples that likely belong to known classes. Experiments show that EAGC consistently boosts existing methods and establishes new state-of-the-art results. Code is available at https://haiyangzheng.github.io/EAGC.
【28】Decoupling Scores and Text: The Politeness Principle in Peer Review
标题:分数和文本脱钩:同行评审中的礼貌原则
链接:https://arxiv.org/abs/2604.14162
作者:Yingxuan Wen
摘要:Authors often struggle to interpret peer review feedback, deriving false hope from polite comments or feeling confused by specific low scores. To investigate this, we construct a dataset of over 30,000 ICLR 2021-2025 submissions and compare acceptance prediction performance using numerical scores versus text reviews. Our experiments reveal a significant performance gap: score-based models achieve 91% accuracy, while text-based models reach only 81% even with large language models, indicating that textual information is considerably less reliable. To explain this phenomenon, we first analyze the 9% of samples that score-based models fail to predict, finding their score distributions exhibit high kurtosis and negative skewness, which suggests that individual low scores play a decisive role in rejection even when the average score falls near the borderline. We then examine why text-based accuracy significantly lags behind scores from a review sentiment perspective, revealing the prevalence of the Politeness Principle: reviews of rejected papers still contain more positive than negative sentiment words, masking the true rejection signal and making it difficult for authors to judge outcomes from text alone.
【29】Structural interpretability in SVMs with truncated orthogonal polynomial kernels
标题:截断正交多项式核支持向量机中的结构可解释性
链接:https://arxiv.org/abs/2604.15285
作者:Víctor Soto-Larrosa,Nuria Torrado,Edmundo J. Huertas
摘要:We study post-training interpretability for Support Vector Machines (SVMs) built from truncated orthogonal polynomial kernels. Since the associated reproducing kernel Hilbert space is finite-dimensional and admits an explicit tensor-product orthonormal basis, the fitted decision function can be expanded exactly in intrinsic RKHS coordinates. This leads to Orthogonal Representation Contribution Analysis (ORCA), a diagnostic framework based on normalized Orthogonal Kernel Contribution (OKC) indices. These indices quantify how the squared RKHS norm of the classifier is distributed across interaction orders, total polynomial degrees, marginal coordinate effects, and pairwise contributions. The methodology is fully post-training and requires neither surrogate models nor retraining. We illustrate its diagnostic value on a synthetic double-spiral problem and on a real five-dimensional echocardiogram dataset. The results show that the proposed indices reveal structural aspects of model complexity that are not captured by predictive accuracy alone.
【30】MinShap: A Modified Shapley Value Approach for Feature Selection
标题:MinShap:一种用于特征选择的改进Shapley值方法
链接:https://arxiv.org/abs/2604.15107
作者:Chenghui Zheng,Garvesh Raskutti
摘要:Feature selection is a classical problem in statistics and machine learning, and it remains extremely challenging, especially in the context of unknown non-linear relationships with dependent features. Shapley values, on the other hand, are a classic solution concept from cooperative game theory, widely used for feature attribution in general non-linear models with highly dependent features. However, Shapley values are not naturally suited for feature selection since they tend to capture both direct effects from each feature to the response and indirect effects through other features. In this paper, we retain the advantages of Shapley values and adapt them to feature selection by proposing \emph{MinShap}, a modification of the Shapley value framework, along with a suite of other related algorithms. In particular, instead of averaging the marginal contributions over permutations of features, MinShap considers the minimum marginal contribution across permutations. We provide a theoretical foundation motivated by the faithfulness assumption in directed acyclic graphical models (DAGs), give a guarantee for the Type I error of MinShap, and show through numerical simulations and real data experiments that MinShap tends to outperform state-of-the-art feature selection algorithms such as LOCO, GCM and Lasso in terms of both accuracy and stability. We also introduce a suite of algorithms related to MinShap, based on a multiple-testing/p-value perspective, that improve performance in lower-sample settings, and provide supporting theoretical guarantees.
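The difference between averaging and taking the minimum of marginal contributions can be shown on a toy example. This is a sketch under stated assumptions, not the authors' estimator or testing procedure. `toy_value` is a hypothetical value function, and exhaustive enumeration of permutations is only feasible for a handful of features.

```python
from itertools import permutations
from typing import Callable, Dict, FrozenSet, List

ValueFn = Callable[[FrozenSet[int]], float]


def marginal_contributions(n_features: int, value: ValueFn) -> Dict[int, List[float]]:
    """Collect each feature's marginal contribution in every ordering."""
    contribs: Dict[int, List[float]] = {j: [] for j in range(n_features)}
    for order in permutations(range(n_features)):
        seen: FrozenSet[int] = frozenset()
        for j in order:
            with_j = seen | {j}
            contribs[j].append(value(with_j) - value(seen))
            seen = with_j
    return contribs


def toy_value(subset: FrozenSet[int]) -> float:
    """Hypothetical value function: feature 0 drives the response,
    feature 1 is only a proxy for feature 0, feature 2 is irrelevant."""
    if 0 in subset:
        return 10.0
    if 1 in subset:
        return 6.0
    return 0.0


contribs = marginal_contributions(3, toy_value)
shapley = {j: sum(c) / len(c) for j, c in contribs.items()}   # average over orderings
min_score = {j: min(c) for j, c in contribs.items()}          # minimum over orderings
```

In this toy setting the proxy feature 1 receives an average contribution of 3.0 but a minimum of 0.0, while feature 0 keeps a positive minimum of 4.0. The minimum therefore separates the direct effect from the indirect one in the way the abstract describes.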
【31】Deployment of AI-Assisted Interventions: Capacity Constraints and Noisy Compliance
标题:人工智能辅助干预措施的部署:容量约束与含噪依从性
链接:https://arxiv.org/abs/2604.14370
作者:Carri W. Chan,Yi Han,Hannah Li,Benjamin L. Ranard
摘要:AI tools increasingly guide targeted interventions in healthcare, education, and recruiting. Algorithms score individuals, trigger outreach to those above a threshold (e.g., high-risk or high-value), and encourage them to request service; then providers deliver service to those who request. Standard practice sets the threshold and selects the algorithm to maximize predictive accuracy, assuming that better predictions yield better outcomes. We show that this approach is suboptimal when limited service capacity and probabilistic behavioral responses influence who receives service. In such settings, the optimal score threshold must balance two effects: ensuring all capacity is filled (utilization) and ensuring high-value individuals are served despite competition between requests (cannibalization). We characterize the optimal threshold and prove that policies based solely on predictive accuracy are generally suboptimal. Further, because optimal thresholds vary with service capacity, algorithm selection metrics like AUC, which weight all thresholds equally, are misaligned with operational performance. We introduce a new metric--Operational AUC (OpAUC)--and show it leads to optimal algorithm selection. Finally, we conduct a case study on sepsis early warning data and illustrate the magnitude of improvement that can be achieved from improved threshold and algorithm selection.
【32】PROXIMA: A Reliability Scoring Framework for Proxy Metrics in Online Controlled Experiments
标题:PROXIMA:在线对照实验中代理指标的可靠性评分框架
链接:https://arxiv.org/abs/2604.14352
作者:Avinash Amudala
备注:14 pages. Sole-author submission. Independent research. Companion code at https://github.com/Avinash-Amudala/PROXIMA. Zenodo archive: 10.5281/zenodo.15483241. Related US provisional patent application: 63/974,569 (filed Feb 3, 2026)
摘要:Online A/B testing at scale relies on proxy metrics -- short-term, easily-measured signals used in place of slow-moving long-term outcomes. When the proxy-outcome relationship is heterogeneous across user segments, aggregate correlation can mask directional failures akin to Simpson's Paradox, leading to costly ship/no-ship errors. We introduce PROXIMA (Proxy Metric Validation Framework for Online Experiments), a lightweight diagnostic framework that scores proxy reliability through a composite of three complementary dimensions: normalised effect correlation, directional accuracy, and segment-level fragility rate. Unlike surrogate-index approaches that predict long-term treatment effects, PROXIMA directly audits whether a candidate proxy leads to correct launch decisions and flags the user segments where it fails. We validate PROXIMA on two public datasets -- the Criteo Uplift corpus (14M observations, advertising) and KuaiRec (7K users, video recommendation) -- using 80 simulated A/B tests. Early engagement metrics achieve a composite reliability of 0.80 on Criteo and 0.62 on KuaiRec, yielding 98.4% average decision agreement with an oracle policy. Fragility analysis reveals that recommendation domains exhibit substantially higher segment-level heterogeneity (68% fragility) than advertising (13%), yet directional accuracy remains above 96% in both cases. A sensitivity analysis over the weight space confirms that no single component suffices and that the composite provides substantially better discrimination between reliable and unreliable proxies than correlation alone. Code and reproduction scripts are available at: https://github.com/Avinash-Amudala/PROXIMA
【33】Predictions of charge density distributions for nuclei with $Z \geq 8$
链接:https://arxiv.org/abs/2604.05312
作者:Yun Dong Wang,Tian Shuai Shang,Hui Hui Xie,Peng Xiang Du,Jian Li,Haozhao Liang
备注:56 pages, 4 tables, 3 figures
摘要:A deep neural network (DNN) has been developed to accurately predict nuclear charge density distributions for nuclei with proton numbers $Z \geq 8$. By incorporating essential nuclear structure features, the model achieves a significant improvement in predictive accuracy over conventional methods. The charge density distributions are analyzed using a Fourier-Bessel (FB) series expansion, and the DNN is trained on a comprehensive dataset derived from relativistic continuum Hartree-Bogoliubov (RCHB) theory calculations. The model demonstrates exceptional performance, with root-mean-square deviations of 0.0123 fm and 0.0198 fm for charge radii on the training and validation sets, respectively, remarkably surpassing the precision of the original RCHB calculations. Beyond advancing nuclear physics research, this high-precision model provides critical data for applications in atomic physics, nuclear astrophysics, and related fields.
【34】Gaussian Process Regression of Steering Vectors With Physics-Aware Deep Composite Kernels for Augmented Listening
标题:基于物理感知深度复合核的导向矢量高斯过程回归及其在增强听觉中的应用
链接:https://arxiv.org/abs/2509.02571
作者:Diego Di Carlo,Shoichi Koyama,Nugraha Aditya Arie,Fontaine Mathieu,Bando Yoshiaki,Yoshii Kazuyoshi
摘要:This paper investigates continuous representations of steering vectors over frequency and microphone/source positions for augmented listening (e.g., spatial filtering and binaural rendering), enabling user-parameterized control of the reproduced sound field. Steering vectors have typically been used for representing the spatial response of a microphone array as a function of the look-up direction. The basic algebraic representation of these quantities assuming an idealized environment cannot deal with the scattering effect of the sound field. One may thus collect a discrete set of real steering vectors measured in dedicated facilities and super-resolve (i.e., upsample) them. Recently, physics-aware deep learning methods have been effectively used for this purpose. Such deterministic super-resolution, however, suffers from the overfitting problem due to the non-uniform uncertainty over the measurement space. To solve this problem, we integrate an expressive representation based on the neural field (NF) into the principled probabilistic framework based on the Gaussian process (GP). Specifically, we propose a physics-aware composite kernel that models the directional incoming waves and the subsequent scattering effect. Our comprehensive comparative experiment showed the effectiveness of the proposed method under data insufficiency conditions. In downstream tasks such as speech enhancement and binaural rendering using the simulated data of the SPEAR challenge, the oracle performances were attained with less than ten times fewer measurements.
机器翻译由腾讯交互翻译提供,仅供参考