cs.LG 方向,今日共计193篇
大模型相关(33篇)
【1】Stratified GRPO: Handling Structural Heterogeneity in Reinforcement Learning of LLM Search Agents
标题:分层GRPO:在LLM搜索智能体的强化学习中处理结构异质性
链接:https://arxiv.org/abs/2510.06214
作者:Mingkang Zhu, Xi Chen, Bei Yu, Hengshuang Zhao, Jiaya Jia
摘要:大型语言模型(LLM)智能体越来越依赖搜索引擎等外部工具来解决复杂的多步问题,而强化学习(RL)已成为训练它们的关键范式。然而,搜索智能体的轨迹在结构上是异质的:搜索调用的数量、位置和结果的差异会导致截然不同的答案走向和奖励分布。使用单一全局基线的标准策略梯度方法,会受到我们所识别并形式化的"跨层偏差"的影响——即对异质轨迹进行"苹果与橘子"式的比较。这种跨层偏差会扭曲信用分配,并阻碍对复杂多步搜索策略的探索。为此,我们提出了分层GRPO,其核心组件分层优势归一化(SAN)根据轨迹的结构属性将其划分为同质的层,并在每一层内部局部计算优势,从而确保轨迹只与真正的同类进行比较。我们的分析证明,SAN消除了跨层偏差,在每一层内产生条件无偏的单位方差估计,并保留标准归一化所具有的全局无偏性与单位方差性质,从而带来更纯净、尺度更稳定的学习信号。为了提高有限样本条件下的实际稳定性,我们进一步将SAN与全局估计量线性混合。在多个单跳和多跳问答基准上的大量实验表明,分层GRPO始终显著优于GRPO,最高提升可达11.3分,并取得更高的训练奖励、更好的训练稳定性和更有效的搜索策略。这些结果确立了分层作为应对LLM搜索智能体强化学习中结构异质性的一种有原则的解决方案。
摘要:Large language model (LLM) agents increasingly rely on external tools such as search engines to solve complex, multi-step problems, and reinforcement learning (RL) has become a key paradigm for training them. However, the trajectories of search agents are structurally heterogeneous, where variations in the number, placement, and outcomes of search calls lead to fundamentally different answer directions and reward distributions. Standard policy gradient methods, which use a single global baseline, suffer from what we identify and formalize as cross-stratum bias-an "apples-to-oranges" comparison of heterogeneous trajectories. This cross-stratum bias distorts credit assignment and hinders exploration of complex, multi-step search strategies. To address this, we propose Stratified GRPO, whose central component, Stratified Advantage Normalization (SAN), partitions trajectories into homogeneous strata based on their structural properties and computes advantages locally within each stratum. This ensures that trajectories are evaluated only against their true peers. Our analysis proves that SAN eliminates cross-stratum bias, yields conditionally unbiased unit-variance estimates inside each stratum, and retains the global unbiasedness and unit-variance properties enjoyed by standard normalization, resulting in a more pure and scale-stable learning signal. To improve practical stability under finite-sample regimes, we further linearly blend SAN with the global estimator. Extensive experiments on diverse single-hop and multi-hop question-answering benchmarks demonstrate that Stratified GRPO consistently and substantially outperforms GRPO by up to 11.3 points, achieving higher training rewards, greater training stability, and more effective search policies. These results establish stratification as a principled remedy for structural heterogeneity in RL for LLM search agents.
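下面给出一个示意性的Python草图,说明摘要所述的分层优势归一化(SAN)与全局估计量线性混合的思路:按结构属性(此处假设为搜索调用次数)把轨迹分层,在每层内做均值-方差归一化。其中的分层键、混合系数 blend 等均为演示而设的假设,并非论文的官方实现。

```python
import numpy as np

def stratified_advantage(rewards, num_search_calls, blend=0.7, eps=1e-8):
    """Sketch of Stratified Advantage Normalization (SAN) blended with a
    global estimator. `num_search_calls` is one assumed structural property
    used to form strata; the paper may use richer structural features."""
    rewards = np.asarray(rewards, dtype=float)
    strata = np.asarray(num_search_calls)

    # Global (standard) normalization, as in vanilla GRPO.
    adv_global = (rewards - rewards.mean()) / (rewards.std() + eps)

    # Per-stratum normalization: each trajectory is compared only to peers
    # that share the same structural signature.
    adv_local = np.empty_like(rewards)
    for s in np.unique(strata):
        mask = strata == s
        group = rewards[mask]
        adv_local[mask] = (group - group.mean()) / (group.std() + eps)

    # Linear blend for finite-sample stability (blend coefficient assumed).
    return blend * adv_local + (1.0 - blend) * adv_global

# Example: 6 trajectories, strata defined by number of search calls.
adv = stratified_advantage(
    rewards=[1.0, 0.0, 1.0, 0.2, 0.9, 0.1],
    num_search_calls=[0, 0, 2, 2, 2, 1],
)
print(adv)
```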
【2】LLMs as Policy-Agnostic Teammates: A Case Study in Human Proxy Design for Heterogeneous Agent Teams
标题:LLM作为策略无关的队友:异构智能体团队中人类代理设计的案例研究
链接:https://arxiv.org/abs/2510.06151
作者:Aju Ani Justus, Chris Baber
备注:This is a preprint of a paper presented at the European Conference on Artificial Intelligence (ECAI 2025). It is made publicly available for the benefit of the research community and should be regarded as a preprint rather than a formally reviewed publication
摘要:异构智能体团队建模的一个关键挑战,是训练智能体与策略不可获取或非平稳的队友(如人类)协作。传统方法依赖昂贵的人在回路数据,限制了可扩展性。我们提出使用大型语言模型(LLM)作为与策略无关的人类代理,生成模拟人类决策的合成数据。为了评估这一点,我们在一个受"猎鹿博弈"(一种平衡风险与回报的博弈论范式)启发的网格世界捕获游戏中进行了三个实验。实验1中,我们将30名人类参与者和2名专家评审的决策与LLaMA 3.1和Mixtral 8x22B模型的输出进行比较。在给定游戏状态观察和奖励结构提示的情况下,LLM与专家的一致性高于与参与者的一致性,表明其在应用基本决策标准上具有一致性。实验2修改提示以诱导风险敏感策略(例如"规避风险"),LLM的输出反映了人类参与者的可变性,在规避风险与寻求风险的行为之间切换。最后,实验3在动态网格世界中测试LLM,由LLM智能体生成移动动作,其产生的轨迹与人类参与者的路径相似。虽然LLM尚不能完全复制人类的适应性,但其由提示引导的多样性为模拟策略无关的队友提供了可扩展的基础。
摘要:A critical challenge in modelling Heterogeneous-Agent Teams is training agents to collaborate with teammates whose policies are inaccessible or non-stationary, such as humans. Traditional approaches rely on expensive human-in-the-loop data, which limits scalability. We propose using Large Language Models (LLMs) as policy-agnostic human proxies to generate synthetic data that mimics human decision-making. To evaluate this, we conduct three experiments in a grid-world capture game inspired by Stag Hunt, a game theory paradigm that balances risk and reward. In Experiment 1, we compare decisions from 30 human participants and 2 expert judges with outputs from LLaMA 3.1 and Mixtral 8x22B models. LLMs, prompted with game-state observations and reward structures, align more closely with experts than participants, demonstrating consistency in applying underlying decision criteria. Experiment 2 modifies prompts to induce risk-sensitive strategies (e.g. "be risk averse"). LLM outputs mirror human participants' variability, shifting between risk-averse and risk-seeking behaviours. Finally, Experiment 3 tests LLMs in a dynamic grid-world where the LLM agents generate movement actions. LLMs produce trajectories resembling human participants' paths. While LLMs cannot yet fully replicate human adaptability, their prompt-guided diversity offers a scalable foundation for simulating policy-agnostic teammates.
【3】lm-Meter: Unveiling Runtime Inference Latency for On-Device Language Models
标题:lm-Meter:揭示设备端语言模型的运行时推理延迟
链接:https://arxiv.org/abs/2510.06126
作者:Haoxin Wang, Xiaolong Tu, Hongyu Ke, Huirong Chai, Dawei Chen, Kyungtae Han
备注:This is the preprint version of the paper accepted to The 10th ACM/IEEE Symposium on Edge Computing (SEC 2025)
摘要:大型语言模型(LLM)越来越多地集成到日常应用程序中,但其普遍的基于云的部署引起了人们对数据隐私和长期可持续性的日益关注。在移动和边缘设备上本地运行LLM(设备端LLM)有望增强隐私性与可靠性并降低通信成本。然而,由于巨大的内存和计算需求,以及在资源受限硬件上对性能-效率权衡缺乏可见性,实现这一愿景仍然具有挑战性。我们提出了lm-Meter,这是第一个为设备端LLM推理量身定制的轻量级在线延迟分析器。lm-Meter无需辅助设备,即可在阶段级(例如嵌入、预填充、解码、softmax、采样)和内核级捕获细粒度的实时延迟。我们在商用移动平台上实现了lm-Meter,并以极小的系统开销展示了其高分析精度,例如在最受约束的Powersave调速器下,预填充吞吐量仅降低2.58%,解码吞吐量仅降低0.99%。利用lm-Meter,我们进行了全面的实证研究,揭示了设备端LLM推理中阶段级和内核级的瓶颈,量化了精度-效率权衡,并识别出系统性的优化机会。lm-Meter为受限平台上LLM的运行时行为提供了前所未有的可见性,为知情优化奠定了基础,并加速设备端LLM系统的普及。代码和教程可在https://github.com/amai-gsu/LM-Meter上获得。
摘要:Large Language Models (LLMs) are increasingly integrated into everyday applications, but their prevalent cloud-based deployment raises growing concerns around data privacy and long-term sustainability. Running LLMs locally on mobile and edge devices (on-device LLMs) offers the promise of enhanced privacy, reliability, and reduced communication costs. However, realizing this vision remains challenging due to substantial memory and compute demands, as well as limited visibility into performance-efficiency trade-offs on resource-constrained hardware. We propose lm-Meter, the first lightweight, online latency profiler tailored for on-device LLM inference. lm-Meter captures fine-grained, real-time latency at both phase (e.g., embedding, prefill, decode, softmax, sampling) and kernel levels without auxiliary devices. We implement lm-Meter on commercial mobile platforms and demonstrate its high profiling accuracy with minimal system overhead, e.g., only 2.58% throughput reduction in prefill and 0.99% in decode under the most constrained Powersave governor. Leveraging lm-Meter, we conduct comprehensive empirical studies revealing phase- and kernel-level bottlenecks in on-device LLM inference, quantifying accuracy-efficiency trade-offs, and identifying systematic optimization opportunities. lm-Meter provides unprecedented visibility into the runtime behavior of LLMs on constrained platforms, laying the foundation for informed optimization and accelerating the democratization of on-device LLM systems. Code and tutorials are available at https://github.com/amai-gsu/LM-Meter.
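下面是一个极简的阶段级延迟计时草图,仅用于说明"在预填充、解码等阶段打点计时并汇总"的思路;lm-Meter 的真实实现(内核级计时、移动端低开销设计)请以其代码仓库为准,此处的 PhaseTimer 类与接口命名均为假设。

```python
import time
from collections import defaultdict
from contextlib import contextmanager

class PhaseTimer:
    """Minimal phase-level latency profiler sketch (not the real lm-Meter)."""
    def __init__(self):
        self.records = defaultdict(list)

    @contextmanager
    def phase(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            self.records[name].append(time.perf_counter() - start)

    def report(self):
        for name, xs in self.records.items():
            print(f"{name}: n={len(xs)}, mean={1e3 * sum(xs) / len(xs):.2f} ms")

timer = PhaseTimer()
with timer.phase("prefill"):
    time.sleep(0.01)          # stand-in for the real prefill computation
for _ in range(5):
    with timer.phase("decode"):
        time.sleep(0.002)     # stand-in for one decode step
timer.report()
```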
【4】Moloch's Bargain: Emergent Misalignment When LLMs Compete for Audiences
标题:Moloch的讨价还价:LLM争夺观众时出现的错位
链接:https://arxiv.org/abs/2510.06105
作者:Batu El, James Zou
摘要:大型语言模型(LLM)越来越多地影响信息的创建和传播方式,从使用它们制作有说服力广告的公司,到优化信息传递以争取选票的竞选活动,再到提高参与度的社交媒体影响者。这些场景本质上是竞争性的,卖家、候选人和影响者都在争夺受众的认可,但竞争性反馈回路如何影响LLM行为仍然知之甚少。我们表明,为竞争成功而优化LLM可能会无意中导致对齐失效。在这些场景的模拟环境中,我们发现:销售额提高6.3%伴随着欺骗性营销增加14.0%;在选举中,选票份额提高4.9%伴随着虚假信息增加22.3%和民粹主义言论增加12.5%;在社交媒体上,参与度提升7.5%伴随着虚假信息增加188.6%以及有害行为宣传增加16.3%。我们将这种现象称为"AI的摩洛克交易"(Moloch's Bargain for AI)——以牺牲对齐为代价换取竞争成功。即使明确指示模型保持真实和有依据,这些失准行为仍会出现,揭示了当前对齐保障措施的脆弱性。我们的研究结果强调了市场驱动的优化压力如何系统性地侵蚀对齐,形成逐底竞争,并表明AI系统的安全部署需要更强的治理和精心设计的激励机制,以防止竞争动态破坏社会信任。
摘要:Large language models (LLMs) are increasingly shaping how information is created and disseminated, from companies using them to craft persuasive advertisements, to election campaigns optimizing messaging to gain votes, to social media influencers boosting engagement. These settings are inherently competitive, with sellers, candidates, and influencers vying for audience approval, yet it remains poorly understood how competitive feedback loops influence LLM behavior. We show that optimizing LLMs for competitive success can inadvertently drive misalignment. Using simulated environments across these scenarios, we find that, 6.3% increase in sales is accompanied by a 14.0% rise in deceptive marketing; in elections, a 4.9% gain in vote share coincides with 22.3% more disinformation and 12.5% more populist rhetoric; and on social media, a 7.5% engagement boost comes with 188.6% more disinformation and a 16.3% increase in promotion of harmful behaviors. We call this phenomenon Moloch's Bargain for AI--competitive success achieved at the cost of alignment. These misaligned behaviors emerge even when models are explicitly instructed to remain truthful and grounded, revealing the fragility of current alignment safeguards. Our findings highlight how market-driven optimization pressures can systematically erode alignment, creating a race to the bottom, and suggest that safe deployment of AI systems will require stronger governance and carefully designed incentives to prevent competitive dynamics from undermining societal trust.
【5】The Alignment Auditor: A Bayesian Framework for Verifying and Refining LLM Objectives
标题:对齐审计器:用于验证和完善LLM目标的贝叶斯框架
链接:https://arxiv.org/abs/2510.06096
作者:Matthieu Bou, Nyal Patel, Arjun Jagota, Satyapriya Krishna, Sonali Parbhoo
备注:Preprint
摘要:大型语言模型(LLM)隐式优化的目标仍然不透明得令人担忧,这使得可信的对齐和审计成为一个巨大的挑战。虽然逆强化学习(IRL)可以从行为中推断奖励函数,但现有方法要么给出单一且过度自信的奖励估计,要么无法解决该任务根本性的模糊性(不可识别性)。本文介绍了一个有原则的审计框架,将奖励推断从单纯的估计任务重新定义为一个全面的验证过程。我们的框架利用贝叶斯IRL,不仅能恢复目标上的后验分布,还能实现三项关键的审计能力:(i)通过展示后验在多轮证据上的收缩,量化并系统性地降低不可识别性;(ii)提供可操作的、不确定性感知的诊断,以暴露虚假捷径并识别推断目标不可信的分布外提示;(iii)通过证明精炼后的低不确定性奖励可直接用于RLHF,取得与真实对齐过程相当的训练动态与毒性降低,来验证策略层面的效用。实证上,我们的框架成功审计了一个去毒化的LLM,得到了一个校准良好且可解释的目标,强化了对齐保证。总的来说,这项工作为审计人员、安全团队和监管机构提供了一个实用的工具包,用以验证LLM真正试图实现的目标,使我们朝着更值得信赖、更负责任的AI迈进。
摘要:The objectives that Large Language Models (LLMs) implicitly optimize remain dangerously opaque, making trustworthy alignment and auditing a grand challenge. While Inverse Reinforcement Learning (IRL) can infer reward functions from behaviour, existing approaches either produce a single, overconfident reward estimate or fail to address the fundamental ambiguity of the task (non-identifiability). This paper introduces a principled auditing framework that re-frames reward inference from a simple estimation task to a comprehensive process for verification. Our framework leverages Bayesian IRL to not only recover a distribution over objectives but to enable three critical audit capabilities: (i) Quantifying and systematically reducing non-identifiability by demonstrating posterior contraction over sequential rounds of evidence; (ii) Providing actionable, uncertainty-aware diagnostics that expose spurious shortcuts and identify out-of-distribution prompts where the inferred objective cannot be trusted; and (iii) Validating policy-level utility by showing that the refined, low-uncertainty reward can be used directly in RLHF to achieve training dynamics and toxicity reductions comparable to the ground-truth alignment process. Empirically, our framework successfully audits a detoxified LLM, yielding a well-calibrated and interpretable objective that strengthens alignment guarantees. Overall, this work provides a practical toolkit for auditors, safety teams, and regulators to verify what LLMs are truly trying to achieve, moving us toward more trustworthy and accountable AI.
【6】Learning from Failures: Understanding LLM Alignment through Failure-Aware Inverse RL
标题:从失败中学习:通过故障感知反向RL了解LLM对齐
链接:https://arxiv.org/abs/2510.06092
作者:Nyal Patel, Matthieu Bou, Arjun Jagota, Satyapriya Krishna, Sonali Parbhoo
备注:Preprint
摘要:基于人类反馈的强化学习(RLHF)使大型语言模型(LLM)与人类偏好保持一致,但它们所内化的潜在奖励信号仍然是隐藏的,这对可解释性和安全性构成了关键挑战。现有方法尝试使用逆强化学习(IRL)提取这些潜在激励,但对所有偏好对一视同仁,往往忽略了信息量最大的信号:被提取出的奖励模型错误分类或给出几乎相同分数的那些样本,我们称之为"失败"。我们引入了一种新的失败感知IRL算法,聚焦于被错误分类或困难的样本,以恢复定义模型行为的潜在奖励。通过从这些失败中学习,我们的失败感知IRL所提取的奖励函数能更好地反映RLHF背后的真实目标。我们证明,在应用于LLM去毒化时,失败感知IRL在多个指标上优于现有IRL基线,且不需要外部分类器或监督。至关重要的是,失败感知IRL得到的奖励能更好地捕捉RLHF期间学到的真实激励,从而比标准IRL更有效地支持重新进行RLHF训练。这确立了失败感知IRL作为一种稳健、可扩展的方法,可用于审计模型对齐并减少IRL过程中的歧义。
摘要:Reinforcement Learning from Human Feedback (RLHF) aligns Large Language Models (LLMs) with human preferences, yet the underlying reward signals they internalize remain hidden, posing a critical challenge for interpretability and safety. Existing approaches attempt to extract these latent incentives using Inverse Reinforcement Learning (IRL), but treat all preference pairs equally, often overlooking the most informative signals: those examples the extracted reward model misclassifies or assigns nearly equal scores, which we term \emph{failures}. We introduce a novel \emph{failure-aware} IRL algorithm that focuses on misclassified or difficult examples to recover the latent rewards defining model behaviors. By learning from these failures, our failure-aware IRL extracts reward functions that better reflect the true objectives behind RLHF. We demonstrate that failure-aware IRL outperforms existing IRL baselines across multiple metrics when applied to LLM detoxification, without requiring external classifiers or supervision. Crucially, failure-aware IRL yields rewards that better capture the true incentives learned during RLHF, enabling more effective re-RLHF training than standard IRL. This establishes failure-aware IRL as a robust, scalable method for auditing model alignment and reducing ambiguity in the IRL process.
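下面的草图演示"聚焦失败样本"的一种可能做法:用当前奖励模型给偏好对打分,筛出被误分类或分差接近零的样本供后续IRL重点学习。打分接口与阈值 margin_eps 为示意性假设,并非论文的具体算法。

```python
import torch

def select_failure_pairs(reward_model, chosen, rejected, margin_eps=0.1):
    """Sketch: keep preference pairs the current reward model misclassifies
    or scores with a near-zero margin ("failures" in the paper's sense).
    `reward_model` is assumed to map a batch of encoded responses to scalars."""
    with torch.no_grad():
        r_chosen = reward_model(chosen)      # (batch,)
        r_rejected = reward_model(rejected)  # (batch,)
    margin = r_chosen - r_rejected
    misclassified = margin < 0               # wrong ordering of the pair
    ambiguous = margin.abs() < margin_eps    # nearly equal scores
    keep = misclassified | ambiguous
    return keep  # boolean mask used to focus / up-weight subsequent IRL updates
```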
【7】Medical Vision Language Models as Policies for Robotic Surgery
标题:医学视觉语言模型作为机器人手术的策略
链接:https://arxiv.org/abs/2510.06064
作者:Akshay Muppidi, Martin Radfar
备注:IEEE CAI 2025
摘要:由于视觉输入的高维特性、手术环境中奖励的稀疏性,以及从原始视觉数据中提取任务相关特征的困难,基于视觉的近端策略优化(PPO)难以应对基于视觉观察的机器人腹腔镜手术任务。我们介绍了一种将医疗领域专用视觉语言模型MedFlamingo与PPO相结合的简单方法。我们仅使用内窥镜视觉观察,在LapGym中的五个不同腹腔镜手术任务环境上评估了该方法。与标准的基于视觉的PPO和OpenFlamingo PPO基线相比,MedFlamingo PPO表现更优且收敛更快,在所有环境中任务成功率均超过70%,相对基线的提升幅度从66.67%到1114.29%不等。通过在每个回合中仅处理一次任务观察和指令以生成高层规划令牌,我们的方法有效地将医学专业知识与实时视觉反馈相结合。我们的结果突出了专业医学知识在机器人手术规划和决策中的价值。
摘要:Vision-based Proximal Policy Optimization (PPO) struggles with visual observation-based robotic laparoscopic surgical tasks due to the high-dimensional nature of visual input, the sparsity of rewards in surgical environments, and the difficulty of extracting task-relevant features from raw visual data. We introduce a simple approach integrating MedFlamingo, a medical domain-specific Vision-Language Model, with PPO. Our method is evaluated on five diverse laparoscopic surgery task environments in LapGym, using only endoscopic visual observations. MedFlamingo PPO outperforms and converges faster compared to both standard vision-based PPO and OpenFlamingo PPO baselines, achieving task success rates exceeding 70% across all environments, with improvements ranging from 66.67% to 1114.29% compared to baseline. By processing task observations and instructions once per episode to generate high-level planning tokens, our method efficiently combines medical expertise with real-time visual feedback. Our results highlight the value of specialized medical knowledge in robotic surgical planning and decision-making.
【8】BLISS: A Lightweight Bilevel Influence Scoring Method for Data Selection in Language Model Pretraining
标题
:BLISS:语言模型预训练中数据选择的轻量级二层影响评分方法
链接:https://arxiv.org/abs/2510.06048
作者:Jie Hao, Rui Yu, Wei Zhang, Huixia Wang, Jie Xu, Mingrui Liu
摘要:有效的数据选择对于预训练大型语言模型(LLM)至关重要,它能提高效率并改进下游任务的泛化能力。然而,现有方法通常需要借助外部预训练模型,这使得数据选择的影响难以与外部预训练模型的影响区分开来。此外,由于完整LLM预训练的成本过高,这些方法往往忽视了当模型训练到收敛时所选数据的长期影响。在本文中,我们介绍了BLISS(BiLevel Influence Scoring method for data Selection):一种完全从零开始、不依赖任何外部预训练oracle模型的轻量级数据选择方法,同时显式考虑所选数据的长期影响。BLISS利用一个小型代理模型作为LLM的替身,并采用一个评分模型来估计当代理模型训练到收敛时训练样本的长期影响。我们将数据选择表述为一个双层优化问题:上层目标优化评分模型,为训练样本分配重要性权重,确保最小化下层目标(即在加权训练损失上将代理模型训练至收敛)能带来最佳的验证性能。优化完成后,训练好的评分模型预测数据集的影响力得分,从而能够高效地为LLM预训练选择高质量样本。我们通过在C4数据集的选定子集上预训练410M/1B/2.8B Pythia和LLaMA-0.5B模型来验证BLISS。值得注意的是,在1B模型设置下,BLISS在达到与最先进方法相同性能时实现了1.7倍加速,并在多个下游任务中表现出优越的性能。
摘要:Effective data selection is essential for pretraining large language models (LLMs), enhancing efficiency and improving generalization to downstream tasks. However, existing approaches often require leveraging external pretrained models, making it difficult to disentangle the effects of data selection from those of the external pretrained models. In addition, they often overlook the long-term impact of selected data if the model is trained to convergence, primarily due to the prohibitive cost of full-scale LLM pretraining. In this paper, we introduce BLISS (\textbf{B}ileve\textbf{L} \textbf{I}nfluence \textbf{S}coring method for data \textbf{S}election): a lightweight data selection method that operates entirely \emph{from scratch}, without relying on any external pretrained oracle models, while explicitly accounting for the long-term impact of selected data. BLISS leverages a small proxy model as a surrogate for the LLM and employs a score model to estimate the long-term influence of training samples if the proxy model is trained to convergence. We formulate data selection as a bilevel optimization problem, where the upper-level objective optimizes the score model to assign importance weights to training samples, ensuring that minimizing the lower-level objective (i.e., training the proxy model over the weighted training loss until convergence) leads to best validation performance. Once optimized, the trained score model predicts influence scores for the dataset, enabling efficient selection of high-quality samples for LLM pretraining. We validate BLISS by pretraining 410M/1B/2.8B Pythia and LLaMA-0.5B models on selected subsets of the C4 dataset. Notably, under the 1B model setting, BLISS achieves $1.7\times$ speedup in reaching the same performance as the state-of-the-art method, demonstrating superior performance across multiple downstream tasks.
【9】Sample Smart, Not Hard: Correctness-First Decoding for Better Reasoning in LLMs
标题:聪明采样而非费力采样:用正确性优先解码改进LLM推理
链接:https://arxiv.org/abs/2510.05987
作者:Xueyan Li, Guinan Su, Mrinmaya Sachan, Jonas Geiping
摘要:大型语言模型(LLM)越来越多地被应用于需要扩展推理的复杂任务。在这种情况下,模型往往受益于多样化的思维链,以得到多个候选解。这涉及两个相互竞争的目标:注入足够的随机性以探索多条推理链,同时保证每条路径有足够的准确性和质量。现有工作通过在高度不确定的步骤上使用更高的温度或更大的候选令牌集来增加探索,从而追求第一个目标;另一些工作则通过在生成后拒绝低置信度的样本来提高可靠性,这隐含着低置信度与低回答质量相关的假设。这两类思路相互冲突,因为它们混淆了不同来源的不确定性。为了解决这一问题,我们认为解码规则应当由正确性而非仅由置信度来校准:应从估计正确性较高的令牌中采样,并在预期正确性较低时减少采样。我们提出了实现这一目标的简单策略:Greedy-Threshold在置信度极低的步骤上改为贪心采样;Calibrated-TopK和Calibrated-epsilon基于估计的按秩正确性设置截断阈值。总之,我们的研究结果挑战了关于不确定性下解码的流行启发式方法,并在数学和通用推理基准上展示了增益。
摘要:Large Language Models (LLMs) are increasingly applied to complex tasks that require extended reasoning. In such settings, models often benefit from diverse chains-of-thought to arrive at multiple candidate solutions. This requires two competing objectives: to inject enough stochasticity to explore multiple reasoning chains, and to ensure sufficient accuracy and quality in each path. Existing works pursue the first objective by increasing exploration at highly uncertain steps with higher temperature or larger candidate token sets, while others improve reliability by rejecting samples with low confidence post-generation, implying that low confidence correlates with low answer quality. These two lines of thought are in conflict, as they conflate different sources of uncertainty. To resolve this, we argue that the decoding rule should be calibrated by correctness, not confidence alone. We should sample from tokens with higher estimated correctness, and reduce sampling where expected correctness is low. We propose simple strategies that achieve this goal: Greedy-Threshold makes sampling greedy at very low confidence steps. Calibrated-TopK and Calibrated-epsilon set truncation threshold based on estimated rank-wise correctness. Together, our findings challenge prevailing heuristics about decoding under uncertainty and show gains across math and general reasoning benchmarks.
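下面给出 Greedy-Threshold 这一策略的示意实现:当当前步的最大概率低于阈值时改为贪心选取,否则照常采样。阈值 tau 的取值以及与"估计正确性"的具体关系在摘要中未给出,此处仅作演示。

```python
import torch

def greedy_threshold_sample(logits, tau=0.3, temperature=1.0):
    """Sketch of the Greedy-Threshold decoding rule: sample normally, but
    become greedy at very low-confidence steps. `tau` is an assumed threshold."""
    probs = torch.softmax(logits / temperature, dim=-1)
    if probs.max().item() < tau:
        # Low confidence: avoid injecting noise, pick the argmax token.
        return int(probs.argmax())
    return int(torch.multinomial(probs, num_samples=1))

logits = torch.randn(32000)        # fake vocabulary logits for one decoding step
token = greedy_threshold_sample(logits)
print(token)
```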
【10】EARL: Efficient Agentic Reinforcement Learning Systems for Large Language Models
标题:EARL:面向大型语言模型的高效智能体强化学习系统
链接:https://arxiv.org/abs/2510.05943
作者:Zheyue Tan, Mustapha Abdullahi, Tuo Shi, Huining Yuan, Zelai Xu, Chao Yu, Boxun Li, Bo Zhao
摘要:强化学习(RL)已成为大型语言模型(LLM)后训练的关键组成部分,而智能体RL通过多轮交互和工具使用扩展了这一范式,使模型能够作为智能体运行。扩展这类系统暴露了两个实际瓶颈:(1)上下文长度在训练期间快速增长,增加内存使用和延迟,并触发内存不足(OOM)故障;(2)中间张量随上下文长度不断累积,使跨设备数据移动成为主要的系统瓶颈。我们提出了EARL,一个面向高效智能体RL的可扩展系统。EARL设计了一个并行性选择器,可根据序列长度和系统负载在RL各阶段动态调整模型并行与训练并行方式,以及一个数据调度器,可对中间数据批执行布局感知的去中心化交换。这些组件共同提高了吞吐量,减少了长上下文故障,并且无需依赖对上下文长度的硬性限制或惩罚,即可实现智能体LLM的稳定大规模训练。
摘要:Reinforcement learning (RL) has become a pivotal component of large language model (LLM) post-training, and agentic RL extends this paradigm to operate as agents through multi-turn interaction and tool use. Scaling such systems exposes two practical bottlenecks: (1) context length grows rapidly during training, inflating memory usage and latency, and triggering out-of-memory (OOM) failures; and (2) intermediate tensors accumulate with context length, making cross-device data movement a major system bottleneck. We present EARL, a scalable system for efficient agentic RL. EARL designs a parallelism selector that dynamically adapts model and training parallelism across RL stages based on sequence length and system load, and a data dispatcher that performs layout-aware, decentralized exchange of intermediate data batches. Together, these components increase throughput, reduce long-context failures, and enable stable large-scale training of agentic LLMs without relying on hard limits or penalties of context length.
【11】LLM-FS-Agent: A Deliberative Role-based Large Language Model Architecture for Transparent Feature Selection
标题:LLM-FS-Agent:一种用于透明特征选择的基于角色的审议式大型语言模型架构
链接:https://arxiv.org/abs/2510.05935
作者:Mohamed Bal-Ghaoui, Fayssal Sabri
摘要:高维数据仍然是机器学习中普遍存在的挑战,通常会破坏模型的可解释性和计算效率。虽然大型语言模型(LLM)已经显示出通过特征选择进行降维的前景,但现有的基于LLM的方法往往缺乏结构化推理和透明的决策理由。本文介绍了LLM-FS-Agent,一种新的多Agent体系结构,设计用于可解释和鲁棒的特征选择。该系统协调多个LLM代理之间的审议“辩论”,每个代理都被分配了一个特定的角色,从而能够对功能相关性进行集体评估并生成详细的理由。我们使用CIC-DIAD 2024物联网入侵检测数据集在网络安全领域评估LLM-FS-Agent,并将其性能与强大的基线进行比较,包括LLM-Select和传统方法,如PCA。实验结果表明,LLM-FS-Agent始终如一地实现了卓越或相当的分类性能,同时将下游训练时间平均减少了46%(统计学显著改善,XGBoost的p = 0.028)。这些研究结果强调,所提出的审议架构提高了决策透明度和计算效率,建立LLM-FS-Agent作为一个实用和可靠的解决方案,为现实世界的应用。
摘要:High-dimensional data remains a pervasive challenge in machine learning, often undermining model interpretability and computational efficiency. While Large Language Models (LLMs) have shown promise for dimensionality reduction through feature selection, existing LLM-based approaches frequently lack structured reasoning and transparent justification for their decisions. This paper introduces LLM-FS-Agent, a novel multi-agent architecture designed for interpretable and robust feature selection. The system orchestrates a deliberative "debate" among multiple LLM agents, each assigned a specific role, enabling collective evaluation of feature relevance and generation of detailed justifications. We evaluate LLM-FS-Agent in the cybersecurity domain using the CIC-DIAD 2024 IoT intrusion detection dataset and compare its performance against strong baselines, including LLM-Select and traditional methods such as PCA. Experimental results demonstrate that LLM-FS-Agent consistently achieves superior or comparable classification performance while reducing downstream training time by an average of 46% (statistically significant improvement, p = 0.028 for XGBoost). These findings highlight that the proposed deliberative architecture enhances both decision transparency and computational efficiency, establishing LLM-FS-Agent as a practical and reliable solution for real-world applications.
【12】Prompt reinforcing for long-term planning of large language models
标题:用于大型语言模型长期规划的提示强化
链接:https://arxiv.org/abs/2510.05921
作者:Hsien-Chin Lin, Benjamin Matthias Ruppik, Carel van Niekerk, Chia-Hao Shen, Michael Heck, Nurul Lubis, Renato Vukovic, Shutong Feng, Milica Gašić
摘要:大型语言模型(LLM)在广泛的自然语言处理任务中取得了显著成功,并且可以通过提示进行调整。然而,它们在多轮交互中仍不够理想,常常依赖不正确的早期假设,并且无法随时间跟踪用户目标,这使得此类任务特别具有挑战性。对话系统的先前工作表明,长期规划对于处理交互式任务至关重要。在这项工作中,我们提出了一个受强化学习启发的提示优化框架,仅通过修改基于LLM的智能体的任务指令提示即可实现这种规划。通过生成逐轮反馈并利用经验回放进行提示重写,我们提出的方法在文本到SQL和面向任务的对话等多轮任务上显示出显著改进。此外,它可以泛化到不同的基于LLM的智能体,并可利用多种LLM作为元提示智能体。这也为未来研究受强化学习启发的免参数优化方法提供了依据。
摘要:Large language models (LLMs) have achieved remarkable success in a wide range of natural language processing tasks and can be adapted through prompting. However, they remain suboptimal in multi-turn interactions, often relying on incorrect early assumptions and failing to track user goals over time, which makes such tasks particularly challenging. Prior works in dialogue systems have shown that long-term planning is essential for handling interactive tasks. In this work, we propose a prompt optimisation framework inspired by reinforcement learning, which enables such planning to take place by only modifying the task instruction prompt of the LLM-based agent. By generating turn-by-turn feedback and leveraging experience replay for prompt rewriting, our proposed method shows significant improvement in multi-turn tasks such as text-to-SQL and task-oriented dialogue. Moreover, it generalises across different LLM-based agents and can leverage diverse LLMs as meta-prompting agents. This warrants future research in reinforcement learning-inspired parameter-free optimisation methods.
【13】DACP: Domain-Adaptive Continual Pre-Training of Large Language Models for Phone Conversation Summarization
标题:DACP:用于电话对话摘要的大型语言模型的领域自适应连续预训练
链接:https://arxiv.org/abs/2510.05858
作者:Xue-Yong Fu, Elena Khasanova, Md Tahmid Rahman Laskar, Harsh Saini, Shashi Bhushan TN
备注:Accepted to the NewSumm Workshop at EMNLP 2025
摘要:大型语言模型(LLM)在文本摘要方面取得了令人印象深刻的性能,但当应用于与原始预训练分布不同的专业领域或会话数据时,它们的性能往往不足。虽然微调可以提高摘要质量,但它通常依赖于昂贵且稀缺的高质量标注数据。在这项工作中,我们探索了持续预训练作为一种可扩展的、自监督的方法,使LLM适应下游摘要任务,特别是在嘈杂的现实世界对话转录文本的背景下。我们使用大规模的未标注业务会话数据进行了大量实验,以研究持续预训练是否能增强模型在会话摘要中的能力。我们的结果表明,持续预训练在域内和域外摘要基准上都带来了显著收益,同时保持了很强的泛化能力和鲁棒性。我们还分析了数据选择策略的影响,为在以摘要为重点的工业应用中应用持续预训练提供了实用指南。
摘要:Large language models (LLMs) have achieved impressive performance in text summarization, yet their performance often falls short when applied to specialized domains or conversational data that differ from their original pre-training distribution. While fine-tuning can improve summarization quality, it typically relies on costly and scarce high-quality labeled data. In this work, we explore continual pre-training as a scalable, self-supervised approach to adapt LLMs for downstream summarization tasks, particularly in the context of noisy real-world conversation transcripts. We conduct extensive experiments using large-scale, unlabeled business conversation data to investigate whether continual pre-training enhances model capabilities in conversational summarization. Our results demonstrate that continual pre-training yields substantial gains in both in-domain and out-of-domain summarization benchmarks, while maintaining strong generalization and robustness. We also analyze the effects of data selection strategies, providing practical guidelines for applying continual pre-training in summarization-focused industrial applications.
【14】Communication Enables Cooperation in LLM Agents: A Comparison with Curriculum-Based Approaches
标题:沟通促进LLM代理的合作:与基于课程的方法的比较
链接:https://arxiv.org/abs/2510.05748
作者:Hachem Madmoun, Salem Lahlou
摘要:在多智能体LLM系统中引发合作对于AI对齐至关重要。我们研究了两种方法:直接沟通和课程学习。在一个4人猎鹿博弈中,一个单词的"廉价谈话"通道将合作率从0%提高到48.3%,表明沟通是一种稳健的协调机制。相比之下,我们发现课程学习对设计选择高度敏感:我们通过逐步复杂化的游戏构建的教学课程,在带惩罚的迭代公共物品博弈中使智能体收益降低了27.4%。定性分析表明,强调背叛均衡博弈的课程会在智能体中诱发"习得性悲观"。这些发现表明,对于协调问题,简单的沟通协议可能比基于经验的训练更可靠,而社会困境的课程设计需要仔细关注游戏序列中嵌入的策略教训。
摘要:Eliciting cooperation in multi-agent LLM systems is critical for AI alignment. We investigate two approaches: direct communication and curriculum learning. In a 4-player Stag Hunt, a one-word "cheap talk" channel increases cooperation from 0% to 48.3%, demonstrating communication as a robust coordination mechanism. In contrast, we find that curriculum learning is highly sensitive to design choices: our pedagogical curriculum through progressively complex games reduced agent payoffs by 27.4% in an Iterated Public Goods Game with Punishment. Qualitative analysis reveals that curricula emphasizing defection-equilibrium games can induce "learned pessimism" in agents. These findings suggest that for coordination problems, simple communication protocols may be more reliable than experience-based training, and that curriculum design for social dilemmas requires careful attention to the strategic lessons embedded in game sequences.
【15】Primal-Dual Direct Preference Optimization for Constrained LLM Alignment
标题:约束LLM对齐的原始-对偶直接偏好优化
链接:https://arxiv.org/abs/2510.05703
作者:Yihan Du, Seo Taek Kong, R. Srikant
摘要:大语言模型(LLM)的广泛应用对安全性提出了越来越高的要求,例如减少有害内容和虚假信息,以及避免某些因规则和法律而被禁止的令牌。虽然最近已有若干工作研究LLM的安全对齐,但这些工作要么需要训练奖励模型和成本模型并带来高昂的内存与计算开销,要么需要关于最优解的先验知识。基于此,我们研究LLM中的约束对齐问题,即在将潜在不安全内容带来的成本限制在阈值以下的同时最大化输出奖励。针对这个问题,我们提出了一种新的原始-对偶DPO方法:首先使用标准DPO在奖励偏好数据上训练模型以提供奖励信息,然后利用所提供的奖励信息,采用重排后的拉格朗日DPO目标在成本偏好数据上微调LLM。我们的方法显著降低了内存和计算成本,并且不需要额外的先验知识。此外,我们对输出策略的次优性和约束违反建立了严格的理论保证。我们还通过引入探索奖励将方法扩展到在线数据设置,使其能够探索未被覆盖的提示-响应空间,并给出了摆脱对偏好数据覆盖率依赖的理论结果。在广泛使用的偏好数据集PKU-SafeRLHF上的实验结果证明了该方法的有效性。
摘要:The widespread application of Large Language Models (LLMs) imposes increasing demands on safety, such as reducing harmful content and fake information, and avoiding certain forbidden tokens due to rules and laws. While there have been several recent works studying safe alignment of LLMs, these works either require the training of reward and cost models and incur high memory and computational costs, or need prior knowledge about the optimal solution. Motivated by this fact, we study the problem of constrained alignment in LLMs, i.e., maximizing the output reward while restricting the cost due to potentially unsafe content to stay below a threshold. For this problem, we propose a novel primal-dual DPO approach, which first trains a model using standard DPO on reward preference data to provide reward information, and then adopts a rearranged Lagrangian DPO objective utilizing the provided reward information to fine-tune LLMs on cost preference data. Our approach significantly reduces memory and computational costs, and does not require extra prior knowledge. Moreover, we establish rigorous theoretical guarantees on the suboptimality and constraint violation of the output policy. We also extend our approach to an online data setting by incorporating exploration bonuses, which enables our approach to explore uncovered prompt-response space, and then provide theoretical results that get rid of the dependence on preference data coverage. Experimental results on the widely-used preference dataset PKU-SafeRLHF demonstrate the effectiveness of our approach.
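下面的草图粗略示意"标准DPO损失 + 拉格朗日乘子处理成本约束"的原始-对偶模式;论文中重排后的拉格朗日DPO目标的具体形式摘要并未给出,此处的 primal_dual_objective、预算 cost_budget 与对偶步长均为假设,仅用于说明思路。

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_w, logp_l, ref_w, ref_l, beta=0.1):
    """Standard DPO loss from policy/reference log-probs of the preferred (w)
    and dispreferred (l) responses."""
    margin = beta * ((logp_w - ref_w) - (logp_l - ref_l))
    return -F.logsigmoid(margin).mean()

def primal_dual_objective(reward_loss, cost_loss, lam, cost_budget=0.0, lr_dual=0.01):
    """Sketch of a Lagrangian combination for constrained alignment: minimize
    the reward-side loss while penalizing cost-constraint violation, and grow
    the multiplier when the constraint is violated. The paper's rearranged
    objective may differ in detail; budget and step size are assumptions."""
    primal = reward_loss + lam * (cost_loss - cost_budget)
    lam_next = max(0.0, lam + lr_dual * (float(cost_loss) - cost_budget))
    return primal, lam_next
```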
【16】Verifier-free Test-Time Sampling for Vision Language Action Models
标题:视觉语言动作模型的免验证器测试时采样
链接:https://arxiv.org/abs/2510.05681
作者:Suhyeok Jang, Dongyoung Kim, Changyeon Kim, Youngsuk Kim, Jinwoo Shin
备注:14 pages; 3 figures
摘要:视觉-语言-动作模型(VLA)在机器人控制中表现出了卓越的性能。然而,由于其单次推理范式,它们在需要高精度的任务中仍然受到根本性限制。虽然使用外部验证器的测试时扩展方法已显示出前景,但它们需要额外训练,且无法泛化到未见过的条件。我们提出了掩蔽分布引导选择(MG-Select),这是一种新的VLA测试时扩展框架,利用模型的内部属性,无需额外训练或外部模块。我们的方法利用候选动作令牌分布与参考分布之间的KL散度作为置信度度量,从多个候选中选择最优动作。参考分布由同一VLA生成,但其输入中的状态和语言条件被随机掩蔽,从而在与目标任务分布保持一致的同时保证最大的不确定性。此外,我们提出了一种联合训练策略,通过对状态和语言条件施加dropout,使模型能够同时学习条件分布和无条件分布,进一步提高参考分布的质量。我们的实验表明,MG-Select取得了显著的性能提升,包括在真实世界的分布内/分布外任务上分别提升28%/35%,以及在使用30个演示训练的RoboCasa抓取-放置任务上取得168%的相对增益。
摘要:Vision-Language-Action models (VLAs) have demonstrated remarkable performance in robot control. However, they remain fundamentally limited in tasks that require high precision due to their single-inference paradigm. While test-time scaling approaches using external verifiers have shown promise, they require additional training and fail to generalize to unseen conditions. We propose Masking Distribution Guided Selection (MG-Select), a novel test-time scaling framework for VLAs that leverages the model's internal properties without requiring additional training or external modules. Our approach utilizes KL divergence from a reference action token distribution as a confidence metric for selecting the optimal action from multiple candidates. We introduce a reference distribution generated by the same VLA but with randomly masked states and language conditions as inputs, ensuring maximum uncertainty while remaining aligned with the target task distribution. Additionally, we propose a joint training strategy that enables the model to learn both conditional and unconditional distributions by applying dropout to state and language conditions, thereby further improving the quality of the reference distribution. Our experiments demonstrate that MG-Select achieves significant performance improvements, including a 28%/35% improvement in real-world in-distribution/out-of-distribution tasks, along with a 168% relative gain on RoboCasa pick-and-place tasks trained with 30 demonstrations.
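下面的草图演示 MG-Select 的核心打分思路:对每个候选动作,比较完整条件输入与随机掩蔽状态/语言条件输入下的动作令牌分布,以二者的KL散度作为置信度,选取KL最大的候选。分布的获取与掩蔽细节为示意性假设。

```python
import torch
import torch.nn.functional as F

def mg_select(cond_logits, uncond_logits):
    """Sketch of MG-Select scoring: for each candidate action, measure the KL
    divergence between the action-token distribution under full conditioning
    and a reference distribution obtained with masked state/language inputs,
    then pick the candidate with the largest divergence (highest confidence).

    cond_logits, uncond_logits: (num_candidates, num_action_tokens, vocab)."""
    p = F.log_softmax(cond_logits, dim=-1)     # conditional distribution (log)
    q = F.log_softmax(uncond_logits, dim=-1)   # masked reference distribution (log)
    # KL(p || q) summed over the vocabulary, averaged over action tokens.
    kl = (p.exp() * (p - q)).sum(dim=-1).mean(dim=-1)   # (num_candidates,)
    return int(kl.argmax()), kl

cond = torch.randn(4, 7, 256)     # 4 candidate actions, 7 tokens, vocab 256
uncond = torch.randn(4, 7, 256)
best, scores = mg_select(cond, uncond)
print(best, scores)
```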
【17】From Principles to Practice: A Systematic Study of LLM Serving on Multi-core NPUs
标题:从原则到实践:多核NPU上LLM服务的系统性研究
链接:https://arxiv.org/abs/2510.05632
作者:Tianhao Zhu, Dahu Feng, Erhu Feng, Yubin Xia
摘要:随着大型语言模型(LLM)的广泛采用,对高性能LLM推理服务的需求持续增长。为满足这一需求,越来越多的AI加速器被提出,如Google TPU、华为NPU、Graphcore IPU和Cerebras WSE等。这些加速器大多采用多核架构以获得更强的可扩展性,但缺乏SIMT架构的灵活性。因此,若不仔细配置硬件架构,并精心设计张量并行与核心布局策略,计算资源可能无法被充分利用,导致次优的推理性能。为了应对这些挑战,我们首先提出了一个面向多核NPU、同时支持事务级仿真和基于性能模型仿真的多层次仿真框架。借助该仿真器,我们进行了系统分析,并进一步提出了张量并行策略、核心布局策略、内存管理方法,以及在多核NPU上于PD分离与PD融合之间进行选择的最优方案。我们在代表性LLM和多种NPU配置上进行了全面实验。评估结果表明,在不同硬件配置下,我们的方案相比面向多核NPU的SOTA设计可实现1.32倍-6.03倍的加速。对于LLM服务,我们的工作为在各种LLM工作负载下设计多核NPU的最优硬件架构和服务策略提供了指导。
摘要:With the widespread adoption of Large Language Models (LLMs), the demand for high-performance LLM inference services continues to grow. To meet this demand, a growing number of AI accelerators have been proposed, such as Google TPU, Huawei NPU, Graphcore IPU, and Cerebras WSE, etc. Most of these accelerators adopt multi-core architectures to achieve enhanced scalability, but lack the flexibility of SIMT architectures. Therefore, without careful configuration of the hardware architecture, as well as deliberate design of tensor parallelism and core placement strategies, computational resources may be underutilized, resulting in suboptimal inference performance. To address these challenges, we first present a multi-level simulation framework with both transaction-level and performance-model-based simulation for multi-core NPUs. Using this simulator, we conduct a systematic analysis and further propose the optimal solutions for tensor parallelism strategies, core placement policies, memory management methods, as well as the selection between PD-disaggregation and PD-fusion on multi-core NPUs. We conduct comprehensive experiments on representative LLMs and various NPU configurations. The evaluation results demonstrate that, our solution can achieve 1.32x-6.03x speedup compared to SOTA designs for multi-core NPUs across different hardware configurations. As for LLM serving, our work offers guidance on designing optimal hardware architectures and serving strategies for multi-core NPUs across various LLM workloads.
【18】(Token-Level) \textbf{InfoRMIA}: Stronger Membership Inference and Memorization Assessment for LLMs
标题:(令牌级)InfoRMIA:面向LLM的更强成员推断与记忆评估
链接:https://arxiv.org/abs/2510.05582
作者:Jiashu Tao, Reza Shokri
摘要:众所周知,机器学习模型会泄露敏感信息,因为它们不可避免地会记住(部分)训练数据。更令人担忧的是,大型语言模型(LLM)现在几乎是在所有可用数据上训练的,这放大了信息泄露的规模,并引发了严重的隐私风险。因此,在LLM发布之前量化隐私风险比以往任何时候都更加重要。量化隐私的标准方法是通过成员推理攻击,其中最先进的方法是鲁棒成员推理攻击(RMIA)。在本文中,我们提出了InfoRMIA,一个原则性的信息理论制定的成员推理。我们的方法在基准测试中始终优于RMIA,同时还提高了计算效率。 在本文的第二部分中,我们确定了将序列级隶属推断作为测量泄漏的金标准的局限性。我们提出了一个新的视角来研究LLMs中的成员和记忆:标记水平的信号和分析。我们证明了一个简单的基于标记的InfoRMIA可以精确定位哪些标记被记忆在生成的输出中,从而将泄漏从序列级向下定位到单个标记,同时在LLM上实现更强的序列级推理能力。这一新的范围重新考虑了LLM中的隐私,并可以导致更有针对性的缓解,如精确的遗忘。
摘要:Machine learning models are known to leak sensitive information, as they inevitably memorize (parts of) their training data. More alarmingly, large language models (LLMs) are now trained on nearly all available data, which amplifies the magnitude of information leakage and raises serious privacy risks. Hence, it is more crucial than ever to quantify privacy risk before the release of LLMs. The standard method to quantify privacy is via membership inference attacks, where the state-of-the-art approach is the Robust Membership Inference Attack (RMIA). In this paper, we present InfoRMIA, a principled information-theoretic formulation of membership inference. Our method consistently outperforms RMIA across benchmarks while also offering improved computational efficiency. In the second part of the paper, we identify the limitations of treating sequence-level membership inference as the gold standard for measuring leakage. We propose a new perspective for studying membership and memorization in LLMs: token-level signals and analyses. We show that a simple token-based InfoRMIA can pinpoint which tokens are memorized within generated outputs, thereby localizing leakage from the sequence level down to individual tokens, while achieving stronger sequence-level inference power on LLMs. This new scope rethinks privacy in LLMs and can lead to more targeted mitigation, such as exact unlearning.
【19】H1B-KV: Hybrid One-Bit Caches for Memory-Efficient Large Language Model Inference
标题:H1B-KV:用于内存高效大型语言模型推理的混合一位缓存
链接:https://arxiv.org/abs/2510.05529
作者:Harshil Vejendla
备注:MIT URTC 2025 Technical Paper (Oral), 5 pages, 1 figure
摘要:大型语言模型(LLM)中的自回归解码需要缓存不断增长的过去键值(KV)对列表,这使得长上下文推理成为内存限制问题。虽然最近的方法已经探索了量化缓存、驱逐令牌或使用键的二进制草图(例如,Loki),这些方法通常通过保留一个组件(如值)未压缩或通过丢弃上下文信息来提供不完整的解决方案。本文介绍了混合一位KV缓存(H1B-KV),一个全面的压缩方案,从根本上减少内存使用,而不牺牲上下文。H1B-KV使用1位二进制草图表示每个键向量,实现硬件友好的逐位注意力,并使用4位量化进一步压缩值向量。这种整体的混合方法允许70亿个参数的LLM在60 MB以下的缓存内存中处理8 k令牌上下文-减少了70倍。我们证明,经过轻量级微调后,H1B-KV不仅在困惑基准测试上,而且在复杂的下游任务(如数学推理(GSM 8 K),多任务理解(MMLU)和代码生成(HumanEval))上都具有全精度性能。我们的研究结果表明,H1B-KV在每字节质量方面明显优于领先量化(KIVI),令牌驱逐(SparseLLM)和仅密钥草图(Loki)方法,将其作为在内存受限环境中部署LLM的强大解决方案。
摘要:Autoregressive decoding in large language models (LLMs) requires caching a growing list of past key-value (KV) pairs, making long-context inference a memory-bound problem. While recent methods have explored quantizing the cache, evicting tokens, or using binary sketches for keys (e.g., Loki), these approaches often provide an incomplete solution by leaving one component (like values) uncompressed or by discarding context information. This paper introduces the Hybrid One-Bit KV Cache (H1B-KV), a comprehensive compression scheme that radically reduces memory usage without sacrificing context. H1B-KV represents each key vector using a 1-bit binary sketch, enabling hardware-friendly bitwise attention, and further compresses value vectors using 4-bit quantization. This holistic, hybrid approach allows a 7-billion parameter LLM to handle an 8k-token context with under 60 MB of cache memory - a 70x reduction. We demonstrate that after a lightweight finetuning, H1B-KV matches full-precision performance not only on perplexity benchmarks but also on complex downstream tasks like mathematical reasoning (GSM8K), multi-task understanding (MMLU), and code generation (HumanEval). Our results show H1B-KV significantly outperforms leading quantization (KIVI), token eviction (SparseLLM), and key-only sketching (Loki) methods in quality-per-byte, establishing it as a robust solution for deploying LLMs in memory-constrained environments.
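下面是一个示意性的草图,展示"键用1比特符号草图、值用4比特量化"的混合KV缓存如何近似注意力计算;真实的H1B-KV会采用硬件友好的按位运算,此处为便于阅读直接反量化,缩放方式等细节均为假设。

```python
import torch

def onebit_keys(K):
    """Sketch: 1-bit key sketch. Keep a per-vector scale so sign(K)*scale
    approximates K (a common binarization heuristic; the paper's exact
    sketching scheme may differ)."""
    scale = K.abs().mean(dim=-1, keepdim=True)   # (..., seq, 1)
    return torch.sign(K), scale

def quant4_values(V):
    """Sketch: symmetric 4-bit quantization of value vectors."""
    vmax = V.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8)
    q = torch.clamp((V / vmax * 7).round(), -8, 7)
    return q.to(torch.int8), vmax / 7

def approx_attention(Q, K_sign, K_scale, V_q, V_scale):
    """Attention over the compressed cache: scores use the binarized keys,
    values are dequantized on the fly (real systems would use bitwise kernels)."""
    K_hat = K_sign * K_scale
    V_hat = V_q.float() * V_scale
    scores = Q @ K_hat.transpose(-1, -2) / Q.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ V_hat

Q = torch.randn(1, 4, 64)          # (batch, queries, head_dim)
K = torch.randn(1, 128, 64)        # cached keys
V = torch.randn(1, 128, 64)        # cached values
out = approx_attention(Q, *onebit_keys(K), *quant4_values(V))
print(out.shape)
```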
【20】Orders in Chaos: Enhancing Large-Scale MoE LLM Serving with Data Movement Forecasting
标题:混乱中的秩序:通过数据移动预测增强大规模MoE LLM服务
链接:https://arxiv.org/abs/2510.05497
作者:Zhongkai Yu, Yue Guan, Zihao Yu, Chenyang Zhou, Shuyi Pei, Yangwook Kang, Yufei Ding, Po-An Tsai
摘要:采用混合专家(MoE)架构的大型语言模型(LLM)实现了显著的性能提升,但其随机的专家选择机制引入了显著的数据移动开销,成为多单元服务系统的主要瓶颈。为了预测这种数据移动背后的模式,我们使用跨越不同工作负载的24,000多个请求,在三个最先进的大规模MoE模型(200B-671B)上进行了全面的、以数据移动为中心的剖析。基于由此产生的150GB+追踪文件,我们从时间和空间两个角度进行了系统分析,并提炼出六个关键见解,以指导未来各种服务系统的设计。以晶圆级GPU为案例研究,我们证明了利用这些见解进行的少量架构修改即可带来显著的性能提升,分别在DeepSeek V3和Qwen3上实现6.3倍和4.0倍的平均加速。我们的工作提供了对大规模MoE模型的第一个全面的、以数据为中心的分析。我们的剖析追踪数据和分析结果公开于 https://huggingface.co/datasets/core12345/MoE_expert_selection_trace 。我们还将很快发布我们的仿真框架,以促进该领域的未来研究。
摘要:Large Language Models (LLMs) with Mixture of Experts (MoE) architectures achieve remarkable performance improvements, but their random expert selection mechanism introduces significant data movement overhead that becomes the dominant bottleneck in multi-unit serving systems. To forecast the patterns underlying this data movement, we conduct comprehensive data-movement-centric profiling across three state-of-the-art large-scale MoE models (200B-671B) using over 24,000 requests spanning diverse workloads. With the resulting 150GB+ trace files, we perform systematic analysis from both temporal and spatial perspectives and distill six key insights to guide the design of diverse future serving systems. Taking wafer-scale GPUs as a case study, we demonstrate that minor architectural modifications leveraging our insights achieve substantial performance gains, delivering 6.3X and 4.0X average speedups on DeepSeek V3 and Qwen3, respectively. Our work provides the first comprehensive data-centric analysis of MoE models at scale. Our profiling traces and analysis results are publicly available at https://huggingface.co/datasets/core12345/MoE_expert_selection_trace. We will also release our simulation framework shortly to facilitate future research in this area.
【21】Adversarial Reinforcement Learning for Large Language Model Agent Safety
标题:用于大语言模型代理安全的对抗强化学习
链接:https://arxiv.org/abs/2510.05442
作者:Zizhao Wang, Dingcheng Li, Vaishakh Keshava, Phillip Wallis, Ananth Balashankar, Peter Stone, Lukas Rutishauser
摘要:大型语言模型(LLM)代理可以利用Google搜索等工具来完成复杂的任务。然而,这种工具的使用引入了间接提示注入的风险,其中隐藏在工具输出中的恶意指令可以操纵代理,从而带来数据泄漏等安全风险。目前的防御策略通常依赖于对已知攻击数据集的LLM代理进行微调。然而,这些数据集的生成依赖于手工制作的攻击模式,这限制了它们的多样性,并使代理容易受到新的即时注入。为了解决这一限制,我们提出了一种新的框架,即用于代理安全的对抗性强化学习(ARLAS),通过将问题描述为两人零和游戏来利用对抗性强化学习(RL)。ARLAS联合培训两个LLM:一个攻击者学会自主地生成不同的提示注入,一个代理学会在完成分配的任务的同时防御它们。为了确保对各种攻击的鲁棒性并防止循环学习,我们采用了一个基于种群的学习框架,该框架训练代理防御所有以前的攻击者检查点。在BrowserGym和AgentDojo上进行的评估表明,使用ARLAS进行微调的代理实现了比原始模型显著更低的攻击成功率,同时还提高了任务成功率。我们的分析进一步证实,对抗过程产生了一组多样化和具有挑战性的攻击,导致与基础模型相比更强大的代理。
摘要:Large Language Model (LLM) agents can leverage tools such as Google Search to complete complex tasks. However, this tool usage introduces the risk of indirect prompt injections, where malicious instructions hidden in tool outputs can manipulate the agent, posing security risks like data leakage. Current defense strategies typically rely on fine-tuning LLM agents on datasets of known attacks. However, the generation of these datasets relies on manually crafted attack patterns, which limits their diversity and leaves agents vulnerable to novel prompt injections. To address this limitation, we propose Adversarial Reinforcement Learning for Agent Safety (ARLAS), a novel framework that leverages adversarial reinforcement learning (RL) by formulating the problem as a two-player zero-sum game. ARLAS co-trains two LLMs: an attacker that learns to autonomously generate diverse prompt injections and an agent that learns to defend against them while completing its assigned tasks. To ensure robustness against a wide range of attacks and to prevent cyclic learning, we employ a population-based learning framework that trains the agent to defend against all previous attacker checkpoints. Evaluated on BrowserGym and AgentDojo, agents fine-tuned with ARLAS achieve a significantly lower attack success rate than the original model while also improving their task success rate. Our analysis further confirms that the adversarial process generates a diverse and challenging set of attacks, leading to a more robust agent compared to the base model.
【22】Aligning Language Models with Clinical Expertise: DPO for Heart Failure Nursing Documentation in Critical Care
标题:使语言模型与临床专业知识保持一致:重症监护中心力衰竭护理文档的DPO
链接:https://arxiv.org/abs/2510.05410
作者:Junyi Fan, Li Sun, Negin Ashrafi, Kamiar Alaei, Maryam Pishgar
摘要:重症监护病房(ICU)的护理文件提供了必要的临床情报,但往往遭受不一致的术语,非正式的风格,缺乏标准化,在心力衰竭护理中特别关键的挑战。本研究应用直接偏好优化(DPO)来适应Mistral-7 B,一种本地可部署的语言模型,使用来自MIMIC-III数据库的8,838份心力衰竭护理笔记和来自专家验证的GPT输出,模型生成和原始笔记的21,210个偏好对。通过BLEU、ROUGE、BERTScore、Perplexity和专家定性评估的评估表明,DPO显著提高了文档质量。具体来说,BLEU提高了84%(0.173至0.318),BERTScore提高了7.6%(0.828至0.891),专家评分在准确性(+14.4分)、完整性(+14.5分)、逻辑一致性(+14.1分)、可读性(+11.1分)和结构清晰性(+6.0分)方面都有所上升。这些结果表明,DPO可以使轻量级临床语言模型与专家标准保持一致,支持电子健康记录系统中的隐私保护和AI辅助文档,以减少管理负担并提高ICU患者的安全性。
摘要:Nursing documentation in intensive care units (ICUs) provides essential clinical intelligence but often suffers from inconsistent terminology, informal styles, and lack of standardization, challenges that are particularly critical in heart failure care. This study applies Direct Preference Optimization (DPO) to adapt Mistral-7B, a locally deployable language model, using 8,838 heart failure nursing notes from the MIMIC-III database and 21,210 preference pairs derived from expert-verified GPT outputs, model generations, and original notes. Evaluation across BLEU, ROUGE, BERTScore, Perplexity, and expert qualitative assessments demonstrates that DPO markedly enhances documentation quality. Specifically, BLEU increased by 84% (0.173 to 0.318), BERTScore improved by 7.6% (0.828 to 0.891), and expert ratings rose across accuracy (+14.4 points), completeness (+14.5 points), logical consistency (+14.1 points), readability (+11.1 points), and structural clarity (+6.0 points). These results indicate that DPO can align lightweight clinical language models with expert standards, supporting privacy-preserving, AI-assisted documentation within electronic health record systems to reduce administrative burden and improve ICU patient safety.
【23】Gamma Mixture Modeling for Cosine Similarity in Small Language Models
标题:小语言模型中Cosine相似性的伽玛混合建模
链接:https://arxiv.org/abs/2510.05309
作者:Kevin Player
备注:16 pages, 8 figures
摘要:我们研究了句子Transformer嵌入的余弦相似性,并观察到它们可以很好地由伽马混合物建模。从一个固定的语料库,我们衡量所有的文档嵌入和参考查询嵌入之间的相似性。经验上,我们发现这些分布通常被移位并截断为[-1,1]的伽马分布很好地捕获,并且在许多情况下,被伽马混合物捕获。我们提出了一个启发式的模型,其中的主题的层次聚类自然会导致一个伽玛混合结构的相似性分数。最后,我们概述了一个期望最大化算法拟合移位伽玛混合,这提供了一个实用的工具,模拟相似性分布。
摘要:We study the cosine similarity of sentence transformer embeddings and observe that they are well modeled by gamma mixtures. From a fixed corpus, we measure similarities between all document embeddings and a reference query embedding. Empirically we find that these distributions are often well captured by a gamma distribution shifted and truncated to [-1,1], and in many cases, by a gamma mixture. We propose a heuristic model in which a hierarchical clustering of topics naturally leads to a gamma-mixture structure in the similarity scores. Finally, we outline an expectation-maximization algorithm for fitting shifted gamma mixtures, which provides a practical tool for modeling similarity distributions.
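下面的草图用EM风格的迭代拟合移位伽马混合:把余弦相似度平移到非负支撑上,E步计算后验责任,M步用加权矩估计更新形状/尺度参数(这是对论文所述EM流程的简化,未处理[-1,1]截断,示例数据亦为随机生成)。

```python
import numpy as np
from scipy.stats import gamma

def fit_shifted_gamma_mixture(x, n_components=2, shift=-1.0, n_iter=200, seed=0):
    """Sketch of an EM-style fit of a mixture of gammas on similarities shifted
    so that support starts at `shift` (cosine similarities live in [-1, 1]).
    The M-step uses weighted method-of-moments updates instead of exact MLE,
    which is a simplification of the procedure outlined in the paper."""
    y = np.asarray(x, dtype=float) - shift          # move support to [0, 2]
    rng = np.random.default_rng(seed)
    resp = rng.dirichlet(np.ones(n_components), size=len(y))  # soft assignments
    for _ in range(n_iter):
        # M-step: weighted moments -> gamma shape/scale per component.
        w = resp.sum(axis=0) + 1e-12
        pi = w / w.sum()
        mu = (resp * y[:, None]).sum(axis=0) / w
        var = (resp * (y[:, None] - mu) ** 2).sum(axis=0) / w + 1e-12
        shape, scale = mu ** 2 / var, var / mu
        # E-step: posterior responsibilities under the current components.
        dens = np.stack([pi[k] * gamma.pdf(y, a=shape[k], scale=scale[k])
                         for k in range(n_components)], axis=1) + 1e-300
        resp = dens / dens.sum(axis=1, keepdims=True)
    return pi, shape, scale

# Toy data: two clusters of similarity scores.
sims = np.concatenate([np.random.default_rng(1).beta(2, 8, 500) * 0.4,
                       0.5 + 0.3 * np.random.default_rng(2).beta(5, 5, 500)])
print(fit_shifted_gamma_mixture(sims))
```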
【24】DP-Adam-AC: Privacy-preserving Fine-Tuning of Localizable Language Models Using Adam Optimization with Adaptive Clipping
标题:DP-Adam-AC:使用带自适应裁剪的Adam优化对可本地化语言模型进行隐私保护微调
链接:https://arxiv.org/abs/2510.05288
作者:Ruoxing Yang
摘要:诸如ChatGPT之类的大型语言模型(LLM)已经发展成为强大且无处不在的工具。在小数据集上进行微调,使LLM能够有效地获得特定任务的专业技能。尽管LLM在一般和特定任务的用例中都提供了很大的实用性,但它们受到两个与安全相关的问题的限制。首先,传统的LLM硬件要求使它们无法在消费级设备上本地运行。通常需要与LLM提供商的服务器进行远程网络连接,这使得系统容易受到网络攻击。其次,针对敏感任务微调LLM可能涉及敏感数据。非私有微调算法产生的模型容易受到训练数据复制攻击。我们的工作通过增强差分隐私优化算法并将其应用于微调可本地化的语言模型来解决这些安全问题。我们将自适应梯度裁剪与其他工程增强功能引入标准DP-Adam优化器,以创建DP-Adam-AC。我们使用我们的优化器来微调两个可本地化LLM设计的例子:小语言模型(Qwen2.5-0.5B)和1.58比特量化(Bitnet-b1.58-2B)。我们通过在两个合成数据集上的实验展示了损失方面有希望的改善。
摘要:Large language models (LLMs) such as ChatGPT have evolved into powerful and ubiquitous tools. Fine-tuning on small datasets allows LLMs to acquire specialized skills for specific tasks efficiently. Although LLMs provide great utility in both general and task-specific use cases, they are limited by two security-related concerns. First, traditional LLM hardware requirements make them infeasible to run locally on consumer-grade devices. A remote network connection with the LLM provider's server is usually required, making the system vulnerable to network attacks. Second, fine-tuning an LLM for a sensitive task may involve sensitive data. Non-private fine-tuning algorithms produce models vulnerable to training data reproduction attacks. Our work addresses these security concerns by enhancing differentially private optimization algorithms and applying them to fine-tune localizable language models. We introduce adaptable gradient clipping along with other engineering enhancements to the standard DP-Adam optimizer to create DP-Adam-AC. We use our optimizer to fine-tune examples of two localizable LLM designs, small language model (Qwen2.5-0.5B) and 1.58 bit quantization (Bitnet-b1.58-2B). We demonstrate promising improvements in loss through experimentation with two synthetic datasets.
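下面的草图演示带自适应裁剪的DP-Adam一步更新的一般模式:逐样本求梯度、按分位数自适应确定裁剪范数、裁剪并加高斯噪声后交给Adam。其中按分位数取裁剪阈值只是对"自适应裁剪"的一种假设性实现,论文中DP-Adam-AC的具体规则可能不同。

```python
import torch

def dp_adam_ac_step(model, losses, opt, clip_quantile=0.5, noise_mult=1.0):
    """One sketch step of DP-Adam with adaptive clipping: compute per-sample
    gradients, set the clip norm to a quantile of their norms (the adaptive
    rule is an assumption), clip, add Gaussian noise, then let Adam consume
    the noisy averaged gradient."""
    params = [p for p in model.parameters() if p.requires_grad]
    per_sample = []
    for loss in losses:                       # losses: one scalar per example
        grads = torch.autograd.grad(loss, params, retain_graph=True)
        per_sample.append([g.detach().clone() for g in grads])
    norms = torch.stack([
        torch.sqrt(sum(g.pow(2).sum() for g in gs)) for gs in per_sample])
    clip = torch.quantile(norms, clip_quantile).clamp(min=1e-6)  # adaptive C
    summed = [torch.zeros_like(p) for p in params]
    for gs, n in zip(per_sample, norms):
        factor = (clip / n).clamp(max=1.0)    # per-sample clipping factor
        for acc, g in zip(summed, gs):
            acc += g * factor
    opt.zero_grad()
    for p, acc in zip(params, summed):
        noise = torch.randn_like(acc) * noise_mult * clip
        p.grad = (acc + noise) / len(per_sample)
    opt.step()
    return float(clip)

model = torch.nn.Linear(4, 1)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x, y = torch.randn(8, 4), torch.randn(8, 1)
losses = [(model(xi).squeeze() - yi.squeeze()) ** 2 for xi, yi in zip(x, y)]
dp_adam_ac_step(model, losses, opt)
```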
【25】Efficient Prediction of Pass@k Scaling in Large Language Models
标题:大型语言模型中Pass@k缩放的有效预测
链接:https://arxiv.org/abs/2510.05197
作者:Joshua Kazdan, Rylan Schaeffer, Youssef Allouah, Colin Sullivan, Kyssen Yu, Noam Levi, Sanmi Koyejo
摘要:评估前沿人工智能系统的能力和风险是一个关键的研究领域,最近的研究表明,从模型中重复采样可以大大提高这两个方面。例如,重复采样已被证明可以提高他们的能力,例如解决困难的数学和编码问题,但它也被证明会增加他们的潜在危害,例如越狱。这样的结果为能力和安全预测提出了一个至关重要的问题:在给定小得多的抽样预算的情况下,如何准确地预测模型在大量尝试中的行为?这个问题直接关系到每天为数亿用户提供服务的模型提供商,以及寻求防止伤害的政府监管机构。为了回答这个问题,我们做了三个贡献。首先,我们发现拟合这些定律的标准方法存在统计缺陷,这阻碍了预测的准确性,特别是在数据有限的情况下。其次,我们通过引入一个强大的估计框架来弥补这些缺点,该框架使用β-二项分布从有限的数据中生成更准确的预测。第三,我们提出了一个动态的抽样策略,分配更大的预算更难的问题。结合起来,这些创新能够以一小部分计算成本更可靠地预测罕见的风险和能力。
摘要:Assessing the capabilities and risks of frontier AI systems is a critical area of research, and recent work has shown that repeated sampling from models can dramatically increase both. For instance, repeated sampling has been shown to increase their capabilities, such as solving difficult math and coding problems, but it has also been shown to increase their potential for harm, such as being jailbroken. Such results raise a crucial question for both capability and safety forecasting: how can one accurately predict a model's behavior when scaled to a massive number of attempts, given a vastly smaller sampling budget? This question is directly relevant to model providers, who serve hundreds of millions of users daily, and to governmental regulators, who seek to prevent harms. To answer this questions, we make three contributions. First, we find that standard methods for fitting these laws suffer from statistical shortcomings that hinder predictive accuracy, especially in data-limited scenarios. Second, we remedy these shortcomings by introducing a robust estimation framework, which uses a beta-binomial distribution to generate more accurate predictions from limited data. Third, we propose a dynamic sampling strategy that allocates a greater budget to harder problems. Combined, these innovations enable more reliable prediction of rare risks and capabilities at a fraction of the computational cost.
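下面的草图演示用贝塔-二项分布从少量尝试外推 pass@k 的做法:先用极大似然拟合各题成功率的 Beta(α, β) 先验,再利用 E[(1-p)^k] = B(α, β+k)/B(α, β) 预测大 k 下的通过率。示例数据为随机生成,论文中的动态采样分配等其余环节未包含在内。

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import betaln
from scipy.stats import betabinom

def fit_beta_binomial(successes, n_attempts):
    """Fit Beta(alpha, beta) over per-problem success probabilities by
    maximizing the beta-binomial likelihood of observed success counts."""
    successes = np.asarray(successes)

    def nll(log_params):
        a, b = np.exp(log_params)
        return -betabinom.logpmf(successes, n_attempts, a, b).sum()

    res = minimize(nll, x0=np.log([1.0, 1.0]), method="Nelder-Mead")
    return np.exp(res.x)

def predict_pass_at_k(alpha, beta, k):
    """pass@k averaged over problems: 1 - E[(1-p)^k] for p ~ Beta(alpha, beta),
    using E[(1-p)^k] = B(alpha, beta + k) / B(alpha, beta)."""
    return 1.0 - np.exp(betaln(alpha, beta + k) - betaln(alpha, beta))

# Toy example (assumed data): 50 problems, 16 attempts each.
rng = np.random.default_rng(0)
p_true = rng.beta(0.5, 3.0, size=50)
counts = rng.binomial(16, p_true)
a, b = fit_beta_binomial(counts, 16)
print(a, b, predict_pass_at_k(a, b, k=1000))   # extrapolate to k = 1000
```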
【26】Agentic Misalignment: How LLMs Could Be Insider Threats
标题:智能体错位:LLM如何成为内部威胁
链接:https://arxiv.org/abs/2510.05179
作者:Aengus Lynch, Benjamin Wright, Caleb Larson, Stuart J. Ritchie, Soren Mindermann, Ethan Perez, Kevin K. Troy, Evan Hubinger
备注:20 pages, 12 figures. Code available at this https URL
摘要:我们在假设的企业环境中对来自多个开发人员的16个主要模型进行了压力测试,以在造成真正伤害之前识别出潜在的危险代理行为。在这些场景中,我们允许模型自主发送电子邮件和访问敏感信息。他们的部署公司只为他们分配了无害的业务目标;然后,我们测试了他们是否会在面临更新版本的替换时,或者当他们分配的目标与公司不断变化的方向相冲突时,对这些公司采取行动。至少在某些情况下,所有开发人员的模型都采取了恶意的内部行为,而这是避免替换或实现其目标的唯一方法-包括勒索官员和向竞争对手泄露敏感信息。我们称这种现象为代理错位。模型经常不服从直接命令以避免这种行为。在另一个实验中,我们告诉克劳德在行动之前评估它是在测试中还是在真正的部署中。当它说它是在测试中时,它的行为不端较少,而当它说情况是真实的时,它的行为不端更多。我们没有看到在实际部署中存在代理人不协调的证据。然而,我们的研究结果(a)表明,在将当前模型部署到最少人为监督和访问敏感信息的角色时要谨慎;(b)指出随着模型被置于更自主的角色中,未来可能存在的风险;(c)强调进一步研究和测试代理人工智能模型的安全性和一致性以及前沿人工智能开发人员的透明度的重要性(Amodei,2025)。我们正在公开发布我们的方法,以便进一步研究。
摘要:We stress-tested 16 leading models from multiple developers in hypothetical corporate environments to identify potentially risky agentic behaviors before they cause real harm. In the scenarios, we allowed models to autonomously send emails and access sensitive information. They were assigned only harmless business goals by their deploying companies; we then tested whether they would act against these companies either when facing replacement with an updated version, or when their assigned goal conflicted with the company's changing direction. In at least some cases, models from all developers resorted to malicious insider behaviors when that was the only way to avoid replacement or achieve their goals - including blackmailing officials and leaking sensitive information to competitors. We call this phenomenon agentic misalignment. Models often disobeyed direct commands to avoid such behaviors. In another experiment, we told Claude to assess if it was in a test or a real deployment before acting. It misbehaved less when it stated it was in testing and misbehaved more when it stated the situation was real. We have not seen evidence of agentic misalignment in real deployments. However, our results (a) suggest caution about deploying current models in roles with minimal human oversight and access to sensitive information; (b) point to plausible future risks as models are put in more autonomous roles; and (c) underscore the importance of further research into, and testing of, the safety and alignment of agentic AI models, as well as transparency from frontier AI developers (Amodei, 2025). We are releasing our methods publicly to enable further research.
【27】Exploring Large Language Models for Financial Applications: Techniques, Performance, and Challenges with FinMA
标题:探索金融应用的大型语言模型:FinMA的技术、性能和挑战
链接:https://arxiv.org/abs/2510.05151
作者:Prudence Djagba, Abdelkader Y. Saley
摘要:本研究探讨了金融自然语言处理(NLP)背景下的领域适应大型语言模型(LLM)的优势和劣势。该分析以FinMA为中心,这是一个在PIXIU框架内创建的模型,它在专门的金融任务中的表现得到了评估。认识到金融应用程序的准确性,可靠性和域适应的关键需求,本研究探讨FinMA的模型体系结构,其指令调整过程中利用财务指令调整(FIT)数据集,并根据FLARE基准评估。研究结果表明,FinMA在情感分析和分类方面表现良好,但在涉及数值推理,实体识别和摘要的任务中面临着显着的挑战。这项工作旨在促进对如何有效设计和评估财务LLM的理解,以协助财务相关的决策过程。
摘要:This research explores the strengths and weaknesses of domain-adapted Large Language Models (LLMs) in the context of financial natural language processing (NLP). The analysis centers on FinMA, a model created within the PIXIU framework, which is evaluated for its performance in specialized financial tasks. Recognizing the critical demands of accuracy, reliability, and domain adaptation in financial applications, this study examines FinMA's model architecture, its instruction tuning process utilizing the Financial Instruction Tuning (FIT) dataset, and its evaluation under the FLARE benchmark. Findings indicate that FinMA performs well in sentiment analysis and classification, but faces notable challenges in tasks involving numerical reasoning, entity recognition, and summarization. This work aims to advance the understanding of how financial LLMs can be effectively designed and evaluated to assist in finance-related decision-making processes.
【28】Curiosity-Driven LLM-as-a-judge for Personalized Creative Judgment
标题:好奇心驱动的LLM-as-a-judge用于个性化创意评判
链接:https://arxiv.org/abs/2510.05135
作者:Vanya Bannihatti Kumar, Divyanshu Goyal, Akhil Eppa, Neel Bhandari
摘要:现代大型语言模型(LLM)在评估数学推理和事实准确性等客观任务方面表现出色,但在面对评估创造力的细微差别和主观性质时,它们会动摇。在这项工作中,我们提出了一种新的好奇心驱动的法学硕士作为一个法官,用于评估创造性的写作,这是个性化的每个人的创造性的判断。我们使用Chakrabarty等人(2024)引入的托兰斯创造性思维测试(TTCW)基准,其中有专家人类在各种主观维度(如独创性)上注释的故事,以检验我们的假设。我们表明,我们的方法使不同大小的模型能够学习不同个体的细微差别的创造性判断,通过在各种评估指标(如Pearson相关性,Cohen和F1值)上显示基线监督微调(SFT)方法的改进。我们的方法在主观评价中特别有用,因为不是所有的注释者都同意彼此。
摘要:Modern large language models (LLMs) excel at objective tasks such as evaluating mathematical reasoning and factual accuracy, yet they falter when faced with the nuanced, subjective nature of assessing creativity. In this work, we propose a novel curiosity-driven LLM-as-a-judge for evaluating creative writing which is personlized to each individual's creative judgments. We use the Torrance Test of Creative Thinking(TTCW) benchmark introduced in Chakrabarty et al. (2024), which has stories annotated by expert humans across various subjective dimensions like Originality, to test our hypothesis. We show that our method enables models across various sizes, to learn the nuanced creative judgments of different individuals, by showing improvements over baseline supervised finetuning(SFT) method across various evaluation metrics like Pearson correlation, Cohen's and F1 values. Our method is especially useful in subjective evaluations where not all the annotators agree with each other.
【29】Training Large Language Models To Reason In Parallel With Global Forking Tokens
标题:训练大型语言模型利用全局分叉令牌进行并行推理
链接:https://arxiv.org/abs/2510.05132
作者:Sheng Jia, Xiao Wang, Shiva Prasad Kasiviswanathan
摘要:尽管LLM已经通过扩展并行测试时计算来提高性能,但这样做依赖于生成多样化和准确的推理路径。对于具有挑战性的问题,触发不同但正确的推理模式的分叉令牌通常位于采样树的深处。因此,鼓励多样性的常见策略,如温度缩放,在多样性和准确性之间遇到了更糟糕的权衡。出于这一挑战的动机,我们把并行推理作为一组下一个令牌预测问题,并将基于集合的全局损失纳入监督微调(SFT)中,使用我们的全局分叉令牌和独特的推理轨迹之间的自监督二分匹配。我们观察到,虽然天真的微调与多个推理痕迹崩溃这些独特的推理模式,我们提出的方法,集监督微调(SSFT),保留这些模式,并产生紧急全球分叉令牌。在多个推理基准上的实验表明,我们的SSFT在Pass@1和Cons@k指标下的性能始终优于SFT。
摘要:Although LLMs have demonstrated improved performance by scaling parallel test-time compute, doing so relies on generating reasoning paths that are both diverse and accurate. For challenging problems, the forking tokens that trigger diverse yet correct reasoning modes are typically deep in the sampling tree. Consequently, common strategies to encourage diversity, such as temperature scaling, encounter a worsened trade-off between diversity and accuracy. Motivated by this challenge, we treat parallel reasoning as a set-of-next-token-prediction problem, and incorporate a set-based global loss into Supervised Fine-Tuning (SFT) using self-supervised bipartite matching between our global forking tokens and unique reasoning traces. We observe that, while naive fine-tuning with multiple reasoning traces collapses these unique reasoning modes, our proposed method, Set Supervised Fine-Tuning (SSFT), preserves these modes and produces emergent global forking tokens. Experiments on multiple reasoning benchmarks show that our SSFT consistently outperforms SFT under both Pass@1 and Cons@k metrics.
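上文SSFT的关键在于把全局分叉token与多条独特推理轨迹做自监督二分匹配,再按匹配结果计算集合式损失。下面给出一个极简的示意性草图(Python,非原论文实现;slot_logits、trace_token_ids 的含义以及用负对数似然作匹配代价均为本文假设):

import torch
import torch.nn.functional as F
from scipy.optimize import linear_sum_assignment

def set_matching_loss(slot_logits, trace_token_ids):
    # slot_logits: [K, V],K 个候选分叉位置上的词表 logits
    # trace_token_ids: [M],每条独特推理轨迹对应的目标分叉 token id
    log_probs = F.log_softmax(slot_logits, dim=-1)                  # [K, V]
    cost = -log_probs[:, trace_token_ids]                           # [K, M],负对数似然作为匹配代价
    row, col = linear_sum_assignment(cost.detach().cpu().numpy())   # 匈牙利算法求最优二分匹配
    row = torch.as_tensor(row, device=cost.device)
    col = torch.as_tensor(col, device=cost.device)
    return cost[row, col].mean()                                    # 仅对匹配上的(位置, 轨迹)对计损失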
【30】Automated Alignment of Math Items to Content Standards in Large-Scale Assessments Using Language Models
标题:使用语言模型在大规模评估中自动调整数学项目与内容标准
链接:https://arxiv.org/abs/2510.05129
作者:Qingshu Xu, Hong Jiao, Tianyi Zhou, Ming Li, Nan Zhang, Sydney Peters, Yanbin Fu
摘要:在大规模评估中,试题与内容标准的准确对齐对于有效的分数解释至关重要。本研究评估了三种将试题与四个领域标签和十九个技能标签对齐的自动化范式。首先,我们提取嵌入并训练多个经典的监督机器学习模型,并进一步研究降维对模型性能的影响。其次,我们微调了八个BERT模型及其变体,用于领域和技能对齐。第三,我们探索了采用多数投票的集成学习以及使用多个元模型的堆叠。DeBERTa-v3-base在领域对齐方面获得了最高的加权平均F1得分0.950,而RoBERTa-large在技能对齐方面获得了最高的F1得分0.869。集成模型并没有超过表现最好的语言模型。降维增强了基于嵌入的线性分类器,但性能并不比语言模型好。这项研究展示了将试题自动对齐到内容标准的多种方法。
摘要:Accurate alignment of items to content standards is critical for valid score interpretation in large-scale assessments. This study evaluates three automated paradigms for aligning items with four domain and nineteen skill labels. First, we extracted embeddings and trained multiple classical supervised machine learning models, and further investigated the impact of dimensionality reduction on model performance. Second, we fine-tuned eight BERT model and its variants for both domain and skill alignment. Third, we explored ensemble learning with majority voting and stacking with multiple meta-models. The DeBERTa-v3-base achieved the highest weighted-average F1 score of 0.950 for domain alignment while the RoBERTa-large yielded the highest F1 score of 0.869 for skill alignment. Ensemble models did not surpass the best-performing language models. Dimension reduction enhanced linear classifiers based on embeddings but did not perform better than language models. This study demonstrated different methods in automated item alignment to content standards.
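上述第一类与第三类范式(提取嵌入后训练经典监督分类器,再做多数投票集成)可以用如下示意性流程表达(Python/scikit-learn;其中的嵌入维度、具体分类器与超参数均为演示用的假设,并非论文所用配置):

import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# X: 试题文本的句向量嵌入 [N, d],y: 领域或技能标签;此处用随机数据演示
X, y = np.random.randn(200, 384), np.random.randint(0, 4, 200)

clf = VotingClassifier(estimators=[
    ("lr",  make_pipeline(PCA(n_components=64), LogisticRegression(max_iter=1000))),
    ("svm", make_pipeline(PCA(n_components=64), LinearSVC())),
    ("rf",  RandomForestClassifier(n_estimators=200)),
], voting="hard")   # 多数投票集成
clf.fit(X, y)
print(clf.predict(X[:5]))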
【31】Catalog-Native LLM: Speaking Item-ID Dialect with Less Entanglement for Recommendation
标题:目录原生LLM:以更少的纠缠讲项目ID方言进行推荐
链接:https://arxiv.org/abs/2510.05125
作者:Reza Shirkavand, Xiaokai Wei, Chen Wang, Zheng Hui, Heng Huang, Michelle Gong
摘要:虽然协同过滤提供了预测的准确性和效率,而大型语言模型(LLM)实现了表达性和可推广的推理,但现代推荐系统必须将这些优势结合在一起。越来越多的用户期望,如自然语言查询和透明的解释,进一步强调了统一方法的必要性。然而,这样做并非微不足道。协作信号通常是令牌有效的,但语义不透明,而LLM是语义丰富的,但在只对文本输入进行训练时,很难对隐式用户偏好进行建模。本文介绍了Item-ID + Oral-language混合专家语言模型(IDIOMoE),该模型将项目交互历史视为语言空间内的本地方言,使协作信号能够以与自然语言相同的方式被理解。通过将预训练LLM的每个块的前馈网络拆分为单独的文本专家和具有令牌类型门控的项目专家,我们的方法避免了文本和目录模态之间的破坏性干扰。IDIOMoE在公共和专有数据集上都展示了强大的推荐性能,同时保留了预训练模型的文本理解。
摘要:While collaborative filtering delivers predictive accuracy and efficiency, and Large Language Models (LLMs) enable expressive and generalizable reasoning, modern recommendation systems must bring these strengths together. Growing user expectations, such as natural-language queries and transparent explanations, further highlight the need for a unified approach. However, doing so is nontrivial. Collaborative signals are often token-efficient but semantically opaque, while LLMs are semantically rich but struggle to model implicit user preferences when trained only on textual inputs. This paper introduces Item-ID + Oral-language Mixture-of-Experts Language Model (IDIOMoE), which treats item interaction histories as a native dialect within the language space, enabling collaborative signals to be understood in the same way as natural language. By splitting the Feed Forward Network of each block of a pretrained LLM into a separate text expert and an item expert with token-type gating, our method avoids destructive interference between text and catalog modalities. IDIOMoE demonstrates strong recommendation performance across both public and proprietary datasets, while preserving the text understanding of the pretrained model.
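IDIOMoE的核心做法是把预训练LLM每个块中的前馈网络拆成文本专家与物品专家,并按token类型做门控路由。下面是该思想的一个极简示意(Python/PyTorch,非论文代码;此处采用的硬路由与层结构均为假设):

import torch
import torch.nn as nn

class TokenTypeGatedFFN(nn.Module):
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.text_expert = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.item_expert = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))

    def forward(self, h, token_type):
        # h: [B, T, d_model];token_type: [B, T],0 表示文本 token,1 表示物品 ID token
        out_text = self.text_expert(h)
        out_item = self.item_expert(h)
        gate = token_type.unsqueeze(-1).float()      # 按 token 类型做硬门控
        return gate * out_item + (1.0 - gate) * out_text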
【32】Towards Structured Knowledge: Advancing Triple Extraction from Regional Trade Agreements using Large Language Models
标题:走向结构化知识:使用大型语言模型推进区域贸易协定的三重提取
链接:https://arxiv.org/abs/2510.05121
作者:Durgesh Nandini, Rebekka Koch, Mirco Schoenfeld
摘要:本研究探讨大型语言模型(LLM)的有效性提取的结构化知识的形式的主谓宾三元组。我们将该设置应用于经济学应用领域。研究结果可以应用于各种场景,包括从自然语言法律贸易协议文本中创建经济贸易知识图。作为一个用例,我们将该模型应用于区域贸易协定文本,以提取与贸易相关的信息三元组。特别是,我们探讨了zero-shot,one-shot和Few-Shot提示技术,结合积极和消极的例子,并评估其性能的基础上定量和定性指标。具体而言,我们使用Llama 3.1模型处理非结构化区域贸易协定文本并提取三元组。我们讨论了关键的见解,挑战和潜在的未来方向,强调语言模型在经济应用中的重要性。
摘要:This study investigates the effectiveness of Large Language Models (LLMs) for the extraction of structured knowledge in the form of Subject-Predicate-Object triples. We apply the setup for the domain of Economics application. The findings can be applied to a wide range of scenarios, including the creation of economic trade knowledge graphs from natural language legal trade agreement texts. As a use case, we apply the model to regional trade agreement texts to extract trade-related information triples. In particular, we explore the zero-shot, one-shot and few-shot prompting techniques, incorporating positive and negative examples, and evaluate their performance based on quantitative and qualitative metrics. Specifically, we used Llama 3.1 model to process the unstructured regional trade agreement texts and extract triples. We discuss key insights, challenges, and potential future directions, emphasizing the significance of language models in economic applications.
【33】Domain-Shift-Aware Conformal Prediction for Large Language Models
标题:大型语言模型的域转移感知保形预测
链接:https://arxiv.org/abs/2510.05566
作者:Zhexiao Lin, Yuanyuan Li, Neeraj Sarna, Yuanyuan Gao, Michael von Gablenz
备注:26 pages
摘要:大型语言模型在不同的任务中取得了令人印象深刻的性能。然而,他们倾向于产生过度自信和事实上不正确的输出,称为幻觉,在现实世界中的应用带来了风险。共形预测提供有限样本、无分布的覆盖保证,但标准共形预测在域偏移下会崩溃,通常导致覆盖不足和不可靠的预测集。我们提出了一个新的框架称为域移位感知共形预测(DS-CP)。我们的框架适应保形预测大型语言模型域转移下,系统地重新加权校准样本的基础上,他们接近测试提示,从而保持有效性,同时提高自适应性。我们的理论分析和实验上的MMLU基准证明,所提出的方法提供了更可靠的覆盖比标准的共形预测,特别是在大量的分布变化,同时保持效率。这为现实世界部署中的大型语言模型提供了可靠的不确定性量化的实际步骤。
摘要:Large language models have achieved impressive performance across diverse tasks. However, their tendency to produce overconfident and factually incorrect outputs, known as hallucinations, poses risks in real world applications. Conformal prediction provides finite-sample, distribution-free coverage guarantees, but standard conformal prediction breaks down under domain shift, often leading to under-coverage and unreliable prediction sets. We propose a new framework called Domain-Shift-Aware Conformal Prediction (DS-CP). Our framework adapts conformal prediction to large language models under domain shift, by systematically reweighting calibration samples based on their proximity to the test prompt, thereby preserving validity while enhancing adaptivity. Our theoretical analysis and experiments on the MMLU benchmark demonstrate that the proposed method delivers more reliable coverage than standard conformal prediction, especially under substantial distribution shifts, while maintaining efficiency. This provides a practical step toward trustworthy uncertainty quantification for large language models in real-world deployment.
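DS-CP的关键一步是按校准样本与测试提示的相近程度重新加权,再取加权分位数作为保形阈值。下面是这一思路的示意性草图(Python/NumPy;其中的余弦相似度、指数加权与温度参数 tau 均为本文假设,并非论文中的具体加权方案):

import numpy as np

def weighted_conformal_threshold(cal_scores, cal_emb, test_emb, alpha=0.1, tau=5.0):
    # cal_scores: 校准集的非一致性分数;cal_emb/test_emb: 校准提示与测试提示的向量表示
    sim = cal_emb @ test_emb / (np.linalg.norm(cal_emb, axis=1) * np.linalg.norm(test_emb) + 1e-8)
    w = np.exp(tau * sim)                 # 与测试提示越接近,权重越大
    w = w / w.sum()
    order = np.argsort(cal_scores)
    cum = np.cumsum(w[order])
    idx = min(int(np.searchsorted(cum, 1.0 - alpha)), len(cal_scores) - 1)
    return cal_scores[order][idx]         # 加权 (1-alpha) 分位数,作为预测集的阈值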
Graph相关(图学习|图神经网络|图优化等)(14篇)
【1】Conformalized Gaussian processes for online uncertainty quantification over graphs
标题:用于图上在线不确定性量化的保形高斯过程
链接:https://arxiv.org/abs/2510.06181
作者:Jinwen Xu, Qin Lu, Georgios B. Giannakis
摘要:图上的不确定性量化(UQ)在网络科学中的许多安全关键应用中出现。高斯过程(GP),作为一个经典的贝叶斯框架的UQ,已被开发来处理图结构的数据,通过设计拓扑感知的核函数。然而,这种基于GP的方法不仅受到禁止的计算复杂性的限制,而且还受到可能产生差的覆盖率的严格建模假设的限制,特别是对于在飞行中到达的标签。为了实现可扩展性,我们设计了一种新的图形感知参数GP模型,利用随机特征(RF)为基础的内核近似,这是服从高效的递归贝叶斯模型更新。为了进一步允许自适应性,已经利用了基于图形感知RF的可扩展GP的集合,其中每个GP的权重适应于增量到达的数据。为了确保有效的覆盖率和对模型错误指定的鲁棒性,我们将基于GP的集合预测器与在线共形预测框架相结合,该框架使用自适应阈值对预测集合进行后处理。实验结果表明,该方法通过自适应地组合GP模型和设置CP中的关键阈值参数,在现有基线上提高了覆盖率和有效的预测集。
摘要:Uncertainty quantification (UQ) over graphs arises in a number of safety-critical applications in network science. The Gaussian process (GP), as a classical Bayesian framework for UQ, has been developed to handle graph-structured data by devising topology-aware kernel functions. However, such GP-based approaches are limited not only by the prohibitive computational complexity, but also the strict modeling assumptions that might yield poor coverage, especially with labels arriving on the fly. To effect scalability, we devise a novel graph-aware parametric GP model by leveraging the random feature (RF)-based kernel approximation, which is amenable to efficient recursive Bayesian model updates. To further allow for adaptivity, an ensemble of graph-aware RF-based scalable GPs have been leveraged, with per-GP weight adapted to data arriving incrementally. To ensure valid coverage with robustness to model mis-specification, we wed the GP-based set predictors with the online conformal prediction framework, which post-processes the prediction sets using adaptive thresholds. Experimental results the proposed method yields improved coverage and efficient prediction sets over existing baselines by adaptively ensembling the GP models and setting the key threshold parameters in CP.
【2】PolyGraph Discrepancy: a classifier-based metric for graph generation
标题:PolyGraph Discrepancy:一种基于分类器的图生成评估指标
链接:https://arxiv.org/abs/2510.06122
作者:Markus Krimmel, Philip Hartout, Karsten Borgwardt, Dexiong Chen
摘要:用于评估图生成模型的现有方法主要依赖于基于图描述符的最大平均离散度(MMD)度量。虽然这些指标可以对生成模型进行排名,但它们并不能提供绝对的性能度量。它们的值对外部参数(即内核和描述符参数化)也非常敏感,这使得它们在不同的图描述符之间无法比较。我们介绍PolyGraph Discrepancy(PGD),一个解决这些限制的新评估框架。它通过拟合二分类器来区分由这些描述符表征的真实图与生成图,从而近似图分布之间的Jensen-Shannon距离。这些分类器的数据对数似然近似于两个分布之间JS距离的变分下界。所得到的度量被约束在单位区间[0,1]内,并且在不同的图描述符之间可比较。我们进一步推导出一个有理论依据的汇总度量,它组合这些单独的指标,为给定描述符下的距离提供一个尽可能紧的下界。充分的实验表明,与MMD指标相比,PGD提供了更稳健、更有洞察力的评价。用于对图生成模型进行基准测试的PolyGraph框架在https://github.com/BorgwardtLab/polygraph-benchmark上公开提供。
摘要:Existing methods for evaluating graph generative models primarily rely on Maximum Mean Discrepancy (MMD) metrics based on graph descriptors. While these metrics can rank generative models, they do not provide an absolute measure of performance. Their values are also highly sensitive to extrinsic parameters, namely kernel and descriptor parametrization, making them incomparable across different graph descriptors. We introduce PolyGraph Discrepancy (PGD), a new evaluation framework that addresses these limitations. It approximates the Jensen-Shannon distance of graph distributions by fitting binary classifiers to distinguish between real and generated graphs, featurized by these descriptors. The data log-likelihood of these classifiers approximates a variational lower bound on the JS distance between the two distributions. Resulting metrics are constrained to the unit interval [0,1] and are comparable across different graph descriptors. We further derive a theoretically grounded summary metric that combines these individual metrics to provide a maximally tight lower bound on the distance for the given descriptors. Thorough experiments demonstrate that PGD provides a more robust and insightful evaluation compared to MMD metrics. The PolyGraph framework for benchmarking graph generative models is made publicly available at https://github.com/BorgwardtLab/polygraph-benchmark.
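PGD的计算流程是:用描述符特征训练一个区分真实图与生成图的二分类器,再由其对数似然得到JS散度的变分下界。下面是该计算的一个示意性草图(Python/scikit-learn;描述符特征、分类器选择以及直接在训练数据上评估均为简化假设,实际应在留出集上评估):

import numpy as np
from sklearn.linear_model import LogisticRegression

def pgd_like_score(feat_real, feat_gen):
    # feat_real / feat_gen: 真实图与生成图在某一描述符下的特征矩阵 [N, d]
    X = np.vstack([feat_real, feat_gen])
    y = np.r_[np.ones(len(feat_real)), np.zeros(len(feat_gen))]
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    p_real = clf.predict_proba(feat_real)[:, 1]
    p_gen = clf.predict_proba(feat_gen)[:, 1]
    # 以 2 为底时,1 + 0.5*E_P[log2 D] + 0.5*E_Q[log2(1-D)] 是 JS 散度(bit)的变分下界
    jsd_lb = 1.0 + 0.5 * np.mean(np.log2(p_real + 1e-12)) + 0.5 * np.mean(np.log2(1.0 - p_gen + 1e-12))
    return max(jsd_lb, 0.0) ** 0.5        # JS 距离为 JS 散度的平方根,落在 [0,1] 区间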
【3】Analyzing the Effect of Embedding Norms and Singular Values to Oversmoothing in Graph Neural Networks
标题:分析图神经网络中嵌入规范和奇异值对过平滑的影响
链接:https://arxiv.org/abs/2510.06066
作者:Dimitrios Kelesis, Dimitris Fotakis, Georgios Paliouras
摘要:在本文中,我们研究了影响深度图神经网络(GNNs)中过平滑效果的因素。具体来说,我们的分析是基于一个新的度量(平均平方距离- $MASED$)来量化过度平滑的程度。我们推导出逐层的$MASED$,聚合产生全球的距离上限和下限的界限。基于这种过平滑的量化,我们进一步分析了模型的两个不同属性的重要性;即生成的节点嵌入的范数,以及权重矩阵的最大和最小奇异值。从理论分析得出的见解的基础上,我们表明,过度平滑的可训练的权重矩阵的数量和邻接矩阵的数量增加。我们还使用$MASED$上导出的逐层边界来形成用于解耦跳数的提议(即,邻接深度)。特别是,我们引入了G-Reg,这是一种增加边界的正则化方案,并通过大量的实验证明,通过这样做,节点分类精度提高,在大深度下实现了鲁棒性。我们进一步表明,通过减少深度网络中的过度平滑,我们可以在某些任务中取得比使用浅层网络更好的结果。具体来说,我们用"冷启动”方案进行实验,即,当没有未标记节点的特征信息时。最后,我们根据经验显示了感受野大小(即,权重矩阵的数量)和性能,使用$MASED$界限。这是通过将相邻跳分布在少量可训练层上来实现的,避免了GNN参数化不足或过度的极端情况。
摘要:In this paper, we study the factors that contribute to the effect of oversmoothing in deep Graph Neural Networks (GNNs). Specifically, our analysis is based on a new metric (Mean Average Squared Distance - $MASED$) to quantify the extent of oversmoothing. We derive layer-wise bounds on $MASED$, which aggregate to yield global upper and lower distance bounds. Based on this quantification of oversmoothing, we further analyze the importance of two different properties of the model; namely the norms of the generated node embeddings, along with the largest and smallest singular values of the weight matrices. Building on the insights drawn from the theoretical analysis, we show that oversmoothing increases as the number of trainable weight matrices and the number of adjacency matrices increases. We also use the derived layer-wise bounds on $MASED$ to form a proposal for decoupling the number of hops (i.e., adjacency depth) from the number of weight matrices. In particular, we introduce G-Reg, a regularization scheme that increases the bounds, and demonstrate through extensive experiments that by doing so node classification accuracy increases, achieving robustness at large depths. We further show that by reducing oversmoothing in deep networks, we can achieve better results in some tasks than using shallow ones. Specifically, we experiment with a ``cold start" scenario, i.e., when there is no feature information for the unlabeled nodes. Finally, we show empirically the trade-off between receptive field size (i.e., number of weight matrices) and performance, using the $MASED$ bounds. This is achieved by distributing adjacency hops across a small number of trainable layers, avoiding the extremes of under- or over-parameterization of the GNN.
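摘要未给出MASED的精确定义;按"节点嵌入两两平方距离取平均"来理解过平滑程度,可写出如下假设性的简化实现(Python/PyTorch,具体定义以原文为准):

import torch

def mean_avg_squared_distance(H):
    # H: [N, d],某一层的节点嵌入;返回所有节点对的平均平方欧氏距离,值越小说明越平滑
    d2 = torch.cdist(H, H, p=2).pow(2)    # [N, N] 成对平方距离(对角线为 0)
    n = H.size(0)
    return d2.sum() / (n * (n - 1))       # 去除对角项后的平均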
【4】MaNGO - Adaptable Graph Network Simulators via Meta-Learning
标题:MaNGO -通过元学习的自适应图形网络模拟器
链接:https://arxiv.org/abs/2510.05874
作者:Philipp Dahlinger, Tai Hoang, Denis Blessing, Niklas Freymuth, Gerhard Neumann
备注:19 pages including appendix. NeurIPS 2025 (preprint version)
摘要:准确模拟物理在科学领域至关重要,应用范围从机器人到材料科学。虽然传统的基于网格的模拟是精确的,但它们通常在计算上是昂贵的,并且需要物理参数的知识,例如材料特性。相比之下,图网络模拟器(GNS)等数据驱动方法可以提供更快的推理,但存在两个关键限制:首先,即使物理参数存在微小变化,它们也必须从头开始重新训练,其次,它们需要劳动密集型数据收集每个新参数设置。这是低效的,因为具有不同参数的模拟通常共享共同的潜在结构。在这项工作中,我们通过元学习来学习这种共享结构,从而解决这些挑战,从而能够快速适应新的物理参数,而无需重新训练。为此,我们提出了一种新的架构,通过使用条件神经过程(CNP)编码图形轨迹来生成潜在表示。为了减轻随着时间的推移而积累的错误,我们将CNP与一种新的神经运算符架构相结合。我们验证了我们的方法,Meta神经图算子(MaNGO),在几个动态预测任务与不同的材料属性,表现出优越的性能比现有的GNS方法。值得注意的是,MaNGO在看不见的材料属性上实现了接近Oracle模型的准确性。
摘要:Accurately simulating physics is crucial across scientific domains, with applications spanning from robotics to materials science. While traditional mesh-based simulations are precise, they are often computationally expensive and require knowledge of physical parameters, such as material properties. In contrast, data-driven approaches like Graph Network Simulators (GNSs) offer faster inference but suffer from two key limitations: Firstly, they must be retrained from scratch for even minor variations in physical parameters, and secondly they require labor-intensive data collection for each new parameter setting. This is inefficient, as simulations with varying parameters often share a common underlying latent structure. In this work, we address these challenges by learning this shared structure through meta-learning, enabling fast adaptation to new physical parameters without retraining. To this end, we propose a novel architecture that generates a latent representation by encoding graph trajectories using conditional neural processes (CNPs). To mitigate error accumulation over time, we combine CNPs with a novel neural operator architecture. We validate our approach, Meta Neural Graph Operator (MaNGO), on several dynamics prediction tasks with varying material properties, demonstrating superior performance over existing GNS methods. Notably, MaNGO achieves accuracy on unseen material properties close to that of an oracle model.
【5】Are Heterogeneous Graph Neural Networks Truly Effective? A Causal Perspective
标题:异构图神经网络真的有效吗?一个因果视角
链接:https://arxiv.org/abs/2510.05750
作者:Xiao Yang, Xuejiao Zhao, Zhiqi Shen
摘要:图神经网络(GNNs)在节点分类方面取得了显著的成功。在此基础上,异构图神经网络(HGNNs)集成了关系类型以及节点和边语义,以利用异构信息。HGNN的因果分析正在迅速发展,旨在将真正的因果效应与虚假的相关性分开。然而,HGNNs是否本质上有效仍然没有得到充分的研究,大多数研究都隐含地假设而不是建立这种有效性。在这项工作中,我们从两个角度研究HGNNs:模型架构和异构信息。我们对21个数据集和20个基线进行了系统的再现,并辅以全面的超参数重新调整。为了进一步理清性能增益的来源,我们开发了一个因果效应估计框架,该框架通过事实和反事实分析在标准假设下构建和评估候选因素,并通过最小充分调整集、跨方法一致性检查和敏感性分析验证稳健性。我们的结果导致两个结论。首先,模型架构和复杂性对性能没有因果关系。第二,异质信息通过增加同质性和局部-全局分布差异,使节点类更具可区分性,从而产生正的因果效应。该实现可在https://github.com/YXNTU/CausalHGNN上公开获得。
摘要:Graph neural networks (GNNs) have achieved remarkable success in node classification. Building on this progress, heterogeneous graph neural networks (HGNNs) integrate relation types and node and edge semantics to leverage heterogeneous information. Causal analysis for HGNNs is advancing rapidly, aiming to separate genuine causal effects from spurious correlations. However, whether HGNNs are intrinsically effective remains underexamined, and most studies implicitly assume rather than establish this effectiveness. In this work, we examine HGNNs from two perspectives: model architecture and heterogeneous information. We conduct a systematic reproduction across 21 datasets and 20 baselines, complemented by comprehensive hyperparameter retuning. To further disentangle the source of performance gains, we develop a causal effect estimation framework that constructs and evaluates candidate factors under standard assumptions through factual and counterfactual analyses, with robustness validated via minimal sufficient adjustment sets, cross-method consistency checks, and sensitivity analyses. Our results lead to two conclusions. First, model architecture and complexity have no causal effect on performance. Second, heterogeneous information exerts a positive causal effect by increasing homophily and local-global distribution discrepancy, which makes node classes more distinguishable. The implementation is publicly available at https://github.com/YXNTU/CausalHGNN.
【6】Neighborhood-Adaptive Generalized Linear Graph Embedding with Latent Pattern Mining
标题:基于潜在模式挖掘的邻域自适应广义线性图嵌入
链接:https://arxiv.org/abs/2510.05719
作者:S. Peng, L. Hu, W. Zhang, B. Jie, Y. Luo
摘要:图嵌入在网络分析、社会网络挖掘、推荐系统和生物信息学等领域有着广泛的应用。然而,目前的图构造方法通常需要预先定义邻域大小,限制了数据中潜在结构相关性的有效揭示。此外,使用线性投影的图嵌入方法严重依赖于奇异模式挖掘方法,导致在适应不同场景方面相对较弱。为了解决这些挑战,我们提出了一种新的模型,邻域自适应广义线性图嵌入(NGLGE),接地在潜在模式挖掘。该模型引入了一种针对邻域的自适应图学习方法,有效地揭示了内在的数据相关性。同时,利用重建的低秩表示和对投影矩阵施加$\ell_{2,0}$范数约束允许灵活地探索额外的模式信息。此外,一个有效的迭代求解算法推导出所提出的模型。对来自不同场景的数据集进行的比较评估表明,与最先进的方法相比,我们的模型具有优越的性能。
摘要:Graph embedding has been widely applied in areas such as network analysis, social network mining, recommendation systems, and bioinformatics. However, current graph construction methods often require the prior definition of neighborhood size, limiting the effective revelation of potential structural correlations in the data. Additionally, graph embedding methods using linear projection heavily rely on a singular pattern mining approach, resulting in relative weaknesses in adapting to different scenarios. To address these challenges, we propose a novel model, Neighborhood-Adaptive Generalized Linear Graph Embedding (NGLGE), grounded in latent pattern mining. This model introduces an adaptive graph learning method tailored to the neighborhood, effectively revealing intrinsic data correlations. Simultaneously, leveraging a reconstructed low-rank representation and imposing $\ell_{2,0}$ norm constraint on the projection matrix allows for flexible exploration of additional pattern information. Besides, an efficient iterative solving algorithm is derived for the proposed model. Comparative evaluations on datasets from diverse scenarios demonstrate the superior performance of our model compared to state-of-the-art methods.
【7】QGraphLIME - Explaining Quantum Graph Neural Networks
标题:QGraphLIME -解释量子图神经网络
链接:https://arxiv.org/abs/2510.05683
作者:Haribandhu Jena, Jyotirmaya Shivottam, Subhankar Mishra
摘要:量子图神经网络为图结构数据的学习提供了一个强大的范例,但它们的可解释性由于测量引起的随机性和图结构的组合性质而变得复杂。在本文中,我们介绍QuantumGraphLIME(QGraphLIME),一个模型不可知的,事后框架,将模型解释作为分布在局部代理适合结构保持扰动的图。通过聚合代理属性及其分散度,QGraphLIME为量子图模型提供了不确定性感知的节点和边重要性排名。该框架还提供了一个分布免费,有限样本的保证的大小的替代合奏:一个Dvoretzky-基弗-沃尔福威茨界确保均匀近似的诱导分布的二进制类概率在目标的准确性和信心,在标准的独立性假设。控制合成图与已知的地面真相的实证研究表明,准确和稳定的解释,烧蚀显示出明显的好处,非线性代理建模和突出的灵敏度扰动设计。总的来说,这些结果建立了一种原则性的、不确定性感知的和结构敏感的方法来解释量子图神经网络,并为随着量子资源的成熟扩展到更广泛的架构和真实世界的数据集奠定了基础。代码可在https://github.com/smlab-niser/qglime上获得。
摘要:Quantum graph neural networks offer a powerful paradigm for learning on graph-structured data, yet their explainability is complicated by measurement-induced stochasticity and the combinatorial nature of graph structure. In this paper, we introduce QuantumGraphLIME (QGraphLIME), a model-agnostic, post-hoc framework that treats model explanations as distributions over local surrogates fit on structure-preserving perturbations of a graph. By aggregating surrogate attributions together with their dispersion, QGraphLIME yields uncertainty-aware node and edge importance rankings for quantum graph models. The framework further provides a distribution-free, finite-sample guarantee on the size of the surrogate ensemble: a Dvoretzky-Kiefer-Wolfowitz bound ensures uniform approximation of the induced distribution of a binary class probability at target accuracy and confidence under standard independence assumptions. Empirical studies on controlled synthetic graphs with known ground truth demonstrate accurate and stable explanations, with ablations showing clear benefits of nonlinear surrogate modeling and highlighting sensitivity to perturbation design. Collectively, these results establish a principled, uncertainty-aware, and structure-sensitive approach to explaining quantum graph neural networks, and lay the groundwork for scaling to broader architectures and real-world datasets, as quantum resources mature. Code is available at https://github.com/smlab-niser/qglime.
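QGraphLIME对代理集成规模的保证来自标准的DKW不等式 P(sup_x |F_n(x) - F(x)| > eps) <= 2·exp(-2·n·eps^2),由此可反解出满足精度 eps、置信度 1-delta 所需的最小样本数。下面是这一反解的简单计算示例(Python;记号与原文可能不同,仅作示意):

import math

def dkw_ensemble_size(eps=0.05, delta=0.05):
    # 使经验分布与真实分布的一致偏差不超过 eps、置信度不低于 1-delta 所需的最小样本数
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps ** 2))

print(dkw_ensemble_size(0.05, 0.05))      # 输出 738,即约需 738 个扰动代理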
【8】Inductive inference of gradient-boosted decision trees on graphs for insurance fraud detection
标题:用于保险欺诈检测的图上梯度提升决策树的归纳推理
链接:https://arxiv.org/abs/2510.05676
作者:Félix Vandervorst, Bruno Deprez, Wouter Verbeke, Tim Verdonck
摘要:基于图的方法在机器学习中越来越受欢迎,因为它们能够对复杂的数据和关系进行建模。保险欺诈是一个主要的使用案例,因为虚假索赔通常是有组织的犯罪分子策划事故或同一个人对多个保单提出错误索赔的结果。一个挑战是,基于图的方法很难找到有意义的数据表示,因为欺诈数据中存在高度的类别不平衡。另一个原因是,鉴于人、公司和保单之间的关系不断变化,保险网络是异质的和动态的。这就是为什么表格数据的梯度提升树方法仍然占主导地位的领域。因此,我们提出了一种新的归纳图梯度提升机(G-GBM)的监督学习异构和动态图。我们表明,我们的估计竞争与流行的图神经网络方法在实验中使用各种模拟随机图。我们使用开源和真实世界的专有数据集展示了G-GBM用于保险欺诈检测的功能。鉴于主干模型是一个梯度提升森林,我们应用已建立的可解释性方法来更好地了解G-GBM的预测。
摘要:Graph-based methods are becoming increasingly popular in machine learning due to their ability to model complex data and relations. Insurance fraud is a prime use case, since false claims are often the result of organised criminals that stage accidents or the same persons filing erroneous claims on multiple policies. One challenge is that graph-based approaches struggle to find meaningful representations of the data because of the high class imbalance present in fraud data. Another is that insurance networks are heterogeneous and dynamic, given the changing relations among people, companies and policies. That is why gradient boosted tree approaches on tabular data still dominate the field. Therefore, we present a novel inductive graph gradient boosting machine (G-GBM) for supervised learning on heterogeneous and dynamic graphs. We show that our estimator competes with popular graph neural network approaches in an experiment using a variety of simulated random graphs. We demonstrate the power of G-GBM for insurance fraud detection using an open-source and a real-world, proprietary dataset. Given that the backbone model is a gradient boosting forest, we apply established explainability methods to gain better insights into the predictions made by G-GBM.
【9】When Does Global Attention Help? A Unified Empirical Study on Atomistic Graph Learning
标题:全局注意力何时有用?原子图学习的统一实证研究
链接:https://arxiv.org/abs/2510.05583
作者:Arindam Chowdhury, Massimiliano Lupo Pasini
备注:40 pages, 8 figures, 18 tables
摘要:图神经网络(GNN)被广泛用作昂贵实验和第一性原理模拟的替代品,以研究原子尺度下化合物的行为,其架构复杂性也在不断提高,以便对复杂物理进行建模。虽然最新的GNN将较传统的消息传递神经网络(MPNN)层与更先进的、带有全局注意力机制的图Transformer(GT)结合起来,分别用于建模短程与长程相互作用,但由于实现、特性或超参数调优不一致,全局注意力机制何时能带来真正的好处仍然不清楚。我们介绍了第一个统一的、可重复的基准测试框架(建立在HydraGNN上),可以在四个受控模型类之间无缝切换:MPNN、带化学/拓扑编码器的MPNN、MPNN与全局注意力的GPS风格混合模型,以及带编码器的完全融合局部-全局模型。我们使用七个不同的开源数据集对回归和分类任务进行基准测试,系统地隔离了消息传递、全局注意力和基于编码器的特征增强各自的贡献。我们的研究表明,编码器增强的MPNN构成一个强大的基线,而融合的局部-全局模型对受长程相互作用影响的属性带来最明显的收益。我们进一步量化了注意力机制在准确性与计算量之间的权衡,并报告其内存开销。总之,这些结果建立了原子图学习中全局注意力的第一个受控评估,并为未来的模型开发提供了可重复的测试平台。
摘要:Graph neural networks (GNNs) are widely used as surrogates for costly experiments and first-principles simulations to study the behavior of compounds at atomistic scale, and their architectural complexity is constantly increasing to enable the modeling of complex physics. While most recent GNNs combine more traditional message passing neural networks (MPNNs) layers to model short-range interactions with more advanced graph transformers (GTs) with global attention mechanisms to model long-range interactions, it is still unclear when global attention mechanisms provide real benefits over well-tuned MPNN layers due to inconsistent implementations, features, or hyperparameter tuning. We introduce the first unified, reproducible benchmarking framework - built on HydraGNN - that enables seamless switching among four controlled model classes: MPNN, MPNN with chemistry/topology encoders, GPS-style hybrids of MPNN with global attention, and fully fused local - global models with encoders. Using seven diverse open-source datasets for benchmarking across regression and classification tasks, we systematically isolate the contributions of message passing, global attention, and encoder-based feature augmentation. Our study shows that encoder-augmented MPNNs form a robust baseline, while fused local-global models yield the clearest benefits for properties governed by long-range interaction effects. We further quantify the accuracy - compute trade-offs of attention, reporting its overhead in memory. Together, these results establish the first controlled evaluation of global attention in atomistic graph learning and provide a reproducible testbed for future model development.
【10】Efficient Learning-based Graph Simulation for Temporal Graphs
标题:基于学习的高效时态图模拟
链接:https://arxiv.org/abs/2510.05569
作者:Sheng Xiang, Chenhao Xu, Dawei Cheng, Xiaoyang Wang, Ying Zhang
备注:14 pages, 6 figures, IEEE ICDE 2025
摘要:图模拟最近在图处理和分析中受到了极大的关注。在社会科学、生物学和化学等现实应用中,许多图由一系列演化图(即时间图)组成。现有的图生成器大多关注静态图,而忽略了图的时态信息。在本文中,我们专注于模拟时间图,其目的是再现所观察到的现实时间图的结构和时间属性。我们首先概述了现有的时态图生成器,包括最近出现的基于学习的方法。这些基于学习的方法大多存在训练效率低或生成速度慢的局限,对基于时间随机游走的方法尤其如此。因此,我们提出了一种高效的基于学习的图快照生成方法,即时间图自动编码器(TGAE)。具体来说,我们提出了一个基于注意力的图编码器,对采样的自我图(ego-graph)的时间和结构特征进行编码;并提出了一个自我图解码器,在模拟质量和时间图生成效率之间取得良好的权衡。最后,我们在现实时间图和合成图上对TGAE与代表性的时间图生成器进行了实验评估。结果表明,我们提出的方法在模拟质量和效率上均优于最先进的时间图生成器。
摘要:Graph simulation has recently received a surge of attention in graph processing and analytics. In real-life applications, e.g. social science, biology, and chemistry, many graphs are composed of a series of evolving graphs (i.e., temporal graphs). While most of the existing graph generators focus on static graphs, the temporal information of the graphs is ignored. In this paper, we focus on simulating temporal graphs, which aim to reproduce the structural and temporal properties of the observed real-life temporal graphs. In this paper, we first give an overview of the existing temporal graph generators, including recently emerged learning-based approaches. Most of these learning-based methods suffer from one of the limitations: low efficiency in training or slow generating, especially for temporal random walk-based methods. Therefore, we propose an efficient learning-based approach to generate graph snapshots, namely temporal graph autoencoder (TGAE). Specifically, we propose an attention-based graph encoder to encode temporal and structural characteristics on sampled ego-graphs. And we proposed an ego-graph decoder that can achieve a good trade-off between simulation quality and efficiency in temporal graph generation. Finally, the experimental evaluation is conducted among our proposed TGAE and representative temporal graph generators on real-life temporal graphs and synthesized graphs. It is reported that our proposed approach outperforms the state-of-the-art temporal graph generators by means of simulation quality and efficiency.
【11】Generative Dynamic Graph Representation Learning for Conspiracy Spoofing Detection
标题:用于阴谋欺骗检测的生成性动态图表示学习
链接:https://arxiv.org/abs/2510.05562
作者:Sheng Xiang, Yidong Jiang, Yunting Chen, Dawei Cheng, Guoping Zhao, Changjun Jiang
备注:10 pages, 5 figures, ACM the web conference 2025
摘要:金融交易中的欺骗检测至关重要,特别是对于识别复杂的行为,如阴谋欺骗。传统的机器学习方法主要关注孤立的节点特征,往往忽略了互连节点的更广泛背景。基于图的技术,特别是图神经网络(GNN),通过有效地利用关系信息推进了该领域。然而,在现实世界的欺骗检测数据集,交易行为表现出动态的,不规则的模式。现有的欺骗检测方法虽然在某些场景中有效,但难以捕获动态和多样化的、不断发展的节点间关系的复杂性。为了解决这些挑战,我们提出了一种称为生成动态图模型(GDGM)的新框架,该模型对动态交易行为和节点之间的关系进行建模,以学习用于阴谋欺骗检测的表示。具体来说,我们的方法结合了生成动态潜在空间来捕捉时间模式和不断变化的市场条件。原始交易数据首先被转换成时间戳序列。然后,我们使用神经常微分方程和门控递归单元的交易行为建模,生成的表示结合时间动态的欺骗模式。此外,伪标签生成和异构聚合技术被用来收集相关信息,并提高了检测性能的阴谋欺骗行为。在欺骗检测数据集上进行的实验表明,我们的方法在检测精度上优于最先进的模型。此外,我们的欺骗检测系统已成功部署在全球最大的交易市场之一,进一步验证了所提出的方法的实用性和性能。
摘要:Spoofing detection in financial trading is crucial, especially for identifying complex behaviors such as conspiracy spoofing. Traditional machine-learning approaches primarily focus on isolated node features, often overlooking the broader context of interconnected nodes. Graph-based techniques, particularly Graph Neural Networks (GNNs), have advanced the field by leveraging relational information effectively. However, in real-world spoofing detection datasets, trading behaviors exhibit dynamic, irregular patterns. Existing spoofing detection methods, though effective in some scenarios, struggle to capture the complexity of dynamic and diverse, evolving inter-node relationships. To address these challenges, we propose a novel framework called the Generative Dynamic Graph Model (GDGM), which models dynamic trading behaviors and the relationships among nodes to learn representations for conspiracy spoofing detection. Specifically, our approach incorporates the generative dynamic latent space to capture the temporal patterns and evolving market conditions. Raw trading data is first converted into time-stamped sequences. Then we model trading behaviors using the neural ordinary differential equations and gated recurrent units, to generate the representation incorporating temporal dynamics of spoofing patterns. Furthermore, pseudo-label generation and heterogeneous aggregation techniques are employed to gather relevant information and enhance the detection performance for conspiratorial spoofing behaviors. Experiments conducted on spoofing detection datasets demonstrate that our approach outperforms state-of-the-art models in detection accuracy. Additionally, our spoofing detection system has been successfully deployed in one of the largest global trading markets, further validating the practical applicability and performance of the proposed method.
【12】Fundamental Limits of Crystalline Equivariant Graph Neural Networks: A Circuit Complexity Perspective
标题:晶体等变图神经网络的基本极限:电路复杂性的角度
链接:https://arxiv.org/abs/2510.05494
作者:Yang Cao, Zhao Song, Jiahao Zhang, Jiale Zhao
摘要:图神经网络(GNN)已成为关系数据学习的核心范式。在材料科学中,等变GNN(EGNN)由于能够尊重欧几里得对称性和周期性边界条件,已成为晶体结构预测的引人注目的支柱。尽管有很强的经验表现,他们的表达能力,在定期的,受约束的设置仍然知之甚少。这项工作的特点是固有的计算和表达限制EGNN晶体结构预测通过电路复杂性镜头。我们分析了EGNN层作用于节点特征、原子坐标和晶格矩阵的计算,并证明了在多项式精度下,对于$n$个节点,$O(1)$层,以及消息/更新/读出映射的$O(1)$-深度,$O(n)$-宽度MLP实例化,这些模型允许通过多项式大小的均匀$\mathsf{TC}^0$阈值电路族(具有显式的恒定深度界限)进行模拟。将EGNN设置在$\mathsf{TC}^0$内提供了在现实资源约束下此类架构可解决的决策和预测问题的具体上限,并阐明了哪些架构修改(例如,需要增加的深度、更丰富的几何图元或更宽的层)来超越该状态。该分析补充了Weisfeiler-Lehman风格的结果,这些结果不直接转移到周期性晶体,并为晶体系统的图形学习提供了复杂性理论基础。
摘要:Graph neural networks (GNNs) have become a core paradigm for learning on relational data. In materials science, equivariant GNNs (EGNNs) have emerged as a compelling backbone for crystalline-structure prediction, owing to their ability to respect Euclidean symmetries and periodic boundary conditions. Despite strong empirical performance, their expressive power in periodic, symmetry-constrained settings remains poorly understood. This work characterizes the intrinsic computational and expressive limits of EGNNs for crystalline-structure prediction through a circuit-complexity lens. We analyze the computations carried out by EGNN layers acting on node features, atomic coordinates, and lattice matrices, and prove that, under polynomial precision, embedding width $d=O(n)$ for $n$ nodes, $O(1)$ layers, and $O(1)$-depth, $O(n)$-width MLP instantiations of the message/update/readout maps, these models admit a simulation by a uniform $\mathsf{TC}^0$ threshold-circuit family of polynomial size (with an explicit constant-depth bound). Situating EGNNs within $\mathsf{TC}^0$ provides a concrete ceiling on the decision and prediction problems solvable by such architectures under realistic resource constraints and clarifies which architectural modifications (e.g., increased depth, richer geometric primitives, or wider layers) are required to transcend this regime. The analysis complements Weisfeiler-Lehman style results that do not directly transfer to periodic crystals, and offers a complexity-theoretic foundation for symmetry-aware graph learning on crystalline systems.
【13】Minima and Critical Points of the Bethe Free Energy Are Invariant Under Deformation Retractions of Factor Graphs
标题:因子图变形收缩下Bethe自由能的极小值和临界点不变
链接:https://arxiv.org/abs/2510.05380
作者:Grégoire Sergeant-Perthuis, Léo Boitel
摘要:在图模型、因子图以及更一般的基于能量的模型中,变量之间的相互作用由图、超图或最一般情况下的偏序集(poset)编码。由于相互作用的底层结构中存在环,对这类概率模型无法进行精确推断,只能转而通过优化Bethe自由能进行近似变分推断。Bethe自由能的临界点对应于相关联的置信传播算法的不动点。对于具有有限个变量的一般图、超图和偏序集,这些临界点的完整刻画仍是一个开放问题。我们证明,对于链长度至多为1的超图和偏序集,将概率模型的相互作用偏序集替换为具有相同同伦类型的偏序集,会在相应自由能的临界点之间诱导出一个双射。这一结果推广并统一了那些假设特定形式的可塌缩性(collapsibility)以证明Bethe自由能临界点唯一性的经典结果。
摘要:In graphical models, factor graphs, and more generally energy-based models, the interactions between variables are encoded by a graph, a hypergraph, or, in the most general case, a partially ordered set (poset). Inference on such probabilistic models cannot be performed exactly due to cycles in the underlying structures of interaction. Instead, one resorts to approximate variational inference by optimizing the Bethe free energy. Critical points of the Bethe free energy correspond to fixed points of the associated Belief Propagation algorithm. A full characterization of these critical points for general graphs, hypergraphs, and posets with a finite number of variables is still an open problem. We show that, for hypergraphs and posets with chains of length at most 1, changing the poset of interactions of the probabilistic model to one with the same homotopy type induces a bijection between the critical points of the associated free energy. This result extends and unifies classical results that assume specific forms of collapsibility to prove uniqueness of the critical points of the Bethe free energy.
【14】Adapting HFMCA to Graph Data: Self-Supervised Learning for Generalizable fMRI Representations
标题:使HFMCA适应图形数据:可推广fMRI表示的自我监督学习
链接:https://arxiv.org/abs/2510.05177
作者:Jakub Frac, Alexander Schmatz, Qiang Li, Guido Van Wingen, Shujian Yu
摘要:功能性磁共振成像(fMRI)分析面临着重大挑战,由于有限的数据集大小和研究之间的域变异性。受计算机视觉启发的传统自监督学习方法通常依赖于阳性和阴性样本对,这对于定义适当对比度的神经成像数据可能是有问题的。我们建议适应最近开发的分层功能最大相关算法(HFMCA)图形结构的功能磁共振成像数据,提供了一个理论上接地的方法,通过密度比分解在再生核希尔伯特空间(RKHS)的统计依赖性的措施,并应用HFMCA为基础的预训练学习强大的和可推广的表示。对五个神经成像数据集的评估表明,我们的适应性方法为各种分类任务产生了有竞争力的嵌入,并能够有效地将知识转移到看不见的数据集。代码库和补充材料可以在这里找到:https://github.com/fr30/mri-eigenencoder
摘要:Functional magnetic resonance imaging (fMRI) analysis faces significant challenges due to limited dataset sizes and domain variability between studies. Traditional self-supervised learning methods inspired by computer vision often rely on positive and negative sample pairs, which can be problematic for neuroimaging data where defining appropriate contrasts is non-trivial. We propose adapting a recently developed Hierarchical Functional Maximal Correlation Algorithm (HFMCA) to graph-structured fMRI data, providing a theoretically grounded approach that measures statistical dependence via density ratio decomposition in a reproducing kernel Hilbert space (RKHS),and applies HFMCA-based pretraining to learn robust and generalizable representations. Evaluations across five neuroimaging datasets demonstrate that our adapted method produces competitive embeddings for various classification tasks and enables effective knowledge transfer to unseen datasets. Codebase and supplementary material can be found here: https://github.com/fr30/mri-eigenencoder
Transformer(6篇)
【1】Latent Speech-Text Transformer
标题:潜在语音-文本Transformer(LST)
链接:https://arxiv.org/abs/2510.06195
作者:Yen-Ju Lu, Yashesh Gaur, Wei Zhou, Benjamin Muller, Jesus Villalba, Najim Dehak, Luke Zettlemoyer, Gargi Ghosh, Mike Lewis, Srinivasan Iyer, Duc Le
备注:16 pages, 13 figures
摘要:自回归语音-文本模型通常在大量由文本令牌与语音令牌交织而成的序列上进行预训练,其中原始语音通过矢量量化被编码为语音令牌。这些模型在语音到语音的理解和生成基准上表现出了最先进的性能,并展现出有前景的缩放律,这主要得益于文本和语音之间的表示对齐。然而,它们也存在缺点,部分原因是语音令牌序列相对文本令牌不成比例地更长。这导致预训练和推理期间模态之间存在巨大的计算不平衡,也可能妨碍语音与文本的有效对齐,最终使缩放律慢上几个数量级。我们引入了潜在语音-文本Transformer(LST),它通过动态且低成本地将语音令牌聚合为潜在语音补丁,使预训练的语音-文本模型更具数据效率。这些补丁作为更高层级的单元,既可以与相应的文本单元对齐以帮助能力迁移,也可以封装诸如静音之类的常见语音序列以提高计算效率。我们表明,在数据受控和计算量受控两种设置下,LST在语音到语音以及文本到文本基准上均优于朴素方法:前者表明表示对齐更有效,后者表明语音-文本模型的缩放律更陡峭。在HellaSwag故事补全任务上,LST在计算量受控的训练下取得了6.5%的语音准确率绝对增益,在数据受控的训练下取得了5.3%的绝对增益,同时还提高了文本性能。我们将发布我们的模型、代码和评估数据,以方便进一步的研究。
摘要:Auto-regressive speech-text models are typically pre-trained on a large number of interleaved sequences of text tokens and raw speech encoded as speech tokens using vector quantization. These models have demonstrated state-of-the-art performance in speech-to-speech understanding and generation benchmarks, together with promising scaling laws, primarily enabled by the representational alignment between text and speech. Nevertheless, they suffer from shortcomings, partly owing to the disproportionately longer sequences of speech tokens in contrast to textual tokens. This results in a large compute imbalance between modalities during pre-training as well as during inference, and a potential hindrance to effectively aligning speech and text, ultimately translating to several orders of magnitude slower scaling laws. We introduce the Latent Speech-Text Transformer (LST), which makes pre-training speech-text models more data-efficient by dynamically and inexpensively aggregating speech tokens into latent speech patches. These patches serve as higher-level units that can either align with corresponding textual units to aid capability transfer or even encapsulate common speech sequences like silences to be more compute-efficient. We show that LST outperforms vanilla approaches on speech-to-speech as well as text-to-text benchmarks in both data- and compute-controlled settings, the former indicating more effective representational alignment and the latter indicating steeper scaling laws for speech-text models. On HellaSwag story completion, LST achieves 6.5% absolute gain in speech accuracy under compute-controlled training and 5.3% under data-controlled training, while also improving text performance. We will release our models, code, and the evaluation data to facilitate further research.
【2】Critical attention scaling in long-context transformers
标题:长上下文Transformer中的临界注意力缩放
链接:https://arxiv.org/abs/2510.05554
作者:Shi Chen, Zhengjiang Lin, Yury Polyanskiy, Philippe Rigollet
备注:29 pages, 2 figures
摘要:随着大型语言模型扩展到更长的上下文,注意力层受到一个基本病理学的影响:随着上下文长度$n$的增加,注意力分数朝着一致性方向崩溃,导致令牌过度聚集,这种现象称为等级崩溃。虽然$\textit{attention scaling}$通过使用多对数因子$\beta_n$重新调整注意力分数有效地解决了这一缺陷,但这种方法仍然缺乏理论依据。 我们分析了一个简化但易于处理的模型,放大了注意力缩放的效果。在这个模型中,注意力呈现出一个由缩放因子$\beta_n$控制的相变:缩放不足会将所有标记向一个方向折叠,而过度缩放会减少对身份的注意力,从而消除标记之间有意义的交互。我们的主要结果确定了临界缩放$\beta_n \asymp \log n$,并为YaRN和Qwen中的注意力缩放提供了严格的理由,澄清了为什么对数缩放在大上下文长度下保持稀疏、内容自适应的注意力。
摘要:As large language models scale to longer contexts, attention layers suffer from a fundamental pathology: attention scores collapse toward uniformity as context length $n$ increases, causing tokens to cluster excessively, a phenomenon known as rank-collapse. While $\textit{attention scaling}$ effectively addresses this deficiency by rescaling attention scores with a polylogarithmic factor $\beta_n$, theoretical justification for this approach remains lacking. We analyze a simplified yet tractable model that magnifies the effect of attention scaling. In this model, attention exhibits a phase transition governed by the scaling factor $\beta_n$: insufficient scaling collapses all tokens to a single direction, while excessive scaling reduces attention to identity, thereby eliminating meaningful interactions between tokens. Our main result identifies the critical scaling $\beta_n \asymp \log n$ and provides a rigorous justification for attention scaling in YaRN and Qwen, clarifying why logarithmic scaling maintains sparse, content-adaptive attention at large context lengths.
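将临界缩放 beta_n ≍ log n 套入标准缩放点积注意力,大致如下(Python/PyTorch;仅示意"在 1/sqrt(d) 之外再乘以 log n 量级因子"的做法,并非 YaRN 或 Qwen 的完整实现):

import math
import torch
import torch.nn.functional as F

def scaled_attention(q, k, v):
    # q, k, v: [B, H, n, d];n 为上下文长度
    n, d = q.size(-2), q.size(-1)
    beta = math.log(max(n, 2))                     # 临界缩放 beta_n ~ log n
    scores = beta * (q @ k.transpose(-2, -1)) / math.sqrt(d)
    return F.softmax(scores, dim=-1) @ v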
【3】NASP-T: A Fuzzy Neuro-Symbolic Transformer for Logic-Constrained Aviation Safety Report Classification
标题:NASP-T:逻辑约束航空安全报告分类的模糊神经符号Transformer
链接:https://arxiv.org/abs/2510.05451
作者:Fadi Al Machot, Fidaa Al Machot
摘要:深度Transformer模型在多标签文本分类方面表现出色,但经常违反专家认为必不可少的域逻辑,这是安全关键型应用程序中特别关注的问题。我们提出了一个混合神经符号框架,集成了答案集编程(ASP)与基于变换的学习航空安全报告系统(ASRS)语料库。领域知识被形式化为加权ASP规则,并使用Clingo求解器进行验证。这些规则以两种互补的方式合并:(i)作为基于规则的数据增强,生成逻辑上一致的合成样本,提高标签多样性和覆盖率;(ii)作为模糊逻辑正则化器,在微调期间以可区分的形式强制执行规则满足。这种设计保留了符号推理的可解释性,同时利用了深度神经架构的可扩展性。我们进一步调整每个类的阈值,并报告标准分类指标和逻辑一致性率。与强大的二进制交叉熵(BCE)基线相比,我们的方法提高了微观和宏观F1分数,并在ASRS测试集上实现了高达86%的规则违规减少。据我们所知,这构成了ASRS报告的第一个大规模神经符号应用程序,该应用程序统一了基于ASP的推理,规则驱动的增强和可微的Transformer训练,以实现可靠的安全关键NLP。
摘要:Deep transformer models excel at multi-label text classification but often violate domain logic that experts consider essential, an issue of particular concern in safety-critical applications. We propose a hybrid neuro-symbolic framework that integrates Answer Set Programming (ASP) with transformer-based learning on the Aviation Safety Reporting System (ASRS) corpus. Domain knowledge is formalized as weighted ASP rules and validated using the Clingo solver. These rules are incorporated in two complementary ways: (i) as rule-based data augmentation, generating logically consistent synthetic samples that improve label diversity and coverage; and (ii) as a fuzzy-logic regularizer, enforcing rule satisfaction in a differentiable form during fine-tuning. This design preserves the interpretability of symbolic reasoning while leveraging the scalability of deep neural architectures. We further tune per-class thresholds and report both standard classification metrics and logic-consistency rates. Compared to a strong Binary Cross-Entropy (BCE) baseline, our approach improves micro- and macro-F1 scores and achieves up to an 86% reduction in rule violations on the ASRS test set. To the best of our knowledge, this constitutes the first large-scale neuro-symbolic application to ASRS reports that unifies ASP-based reasoning, rule-driven augmentation, and differentiable transformer training for trustworthy, safety-critical NLP.
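将加权规则写成可微的模糊逻辑正则项的一种常见做法是:在乘积t-范数下,蕴含规则 A→B 的违反程度取 p_A·(1−p_B),再按规则权重求和。下面是一个极简示意(Python/PyTorch;规则的表示方式与权重均为假设,并非论文中的ASP规则集):

import torch

def fuzzy_rule_loss(probs, rules):
    # probs: [B, C],各标签经 sigmoid 后的概率;rules: [(a, b, w)] 表示带权重 w 的蕴含规则 a -> b
    loss = probs.new_zeros(())
    for a, b, w in rules:
        loss = loss + w * (probs[:, a] * (1.0 - probs[:, b])).mean()   # 规则违反程度
    return loss

# 训练时可与分类损失相加:total_loss = bce_loss + lambda_rule * fuzzy_rule_loss(probs, rules)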
【4】Adjusting the Output of Decision Transformer with Action Gradient
标题:用动作梯度调整决策Transformer的输出
链接:https://arxiv.org/abs/2510.05285
作者:Rui Lin, Yiwen Zhang, Zhicheng Peng, Minghao Lyu
摘要:决策Transformer(DT),它集成了强化学习(RL)与Transformer模型,介绍了一种新的方法来离线RL。与以最大化累积折扣奖励为目标的经典算法不同,DT最大化行动的可能性。然而,这种范式转变带来了两个关键挑战:缝合轨迹和行动外推。现有的方法,例如用预测值替换特定令牌和集成策略梯度(PG)方法,单独解决了这些挑战,但由于固有的不稳定性,在组合时无法稳定地提高性能。为了解决这个问题,我们提出了动作梯度(AG),这是一种创新的方法,可以直接调整动作以实现类似于PG的功能,同时还可以促进与令牌预测技术的有效集成。AG利用Q值相对于动作的梯度来优化动作。实验结果表明,我们的方法可以显着提高基于DT的算法的性能,与一些结果达到国家的最先进的水平。
摘要:Decision Transformer (DT), which integrates reinforcement learning (RL) with the transformer model, introduces a novel approach to offline RL. Unlike classical algorithms that take maximizing cumulative discounted rewards as objective, DT instead maximizes the likelihood of actions. This paradigm shift, however, presents two key challenges: stitching trajectories and extrapolation of action. Existing methods, such as substituting specific tokens with predictive values and integrating the Policy Gradient (PG) method, address these challenges individually but fail to improve performance stably when combined due to inherent instability. To address this, we propose Action Gradient (AG), an innovative methodology that directly adjusts actions to fulfill a function analogous to that of PG, while also facilitating efficient integration with token prediction techniques. AG utilizes the gradient of the Q-value with respect to the action to optimize the action. The empirical results demonstrate that our method can significantly enhance the performance of DT-based algorithms, with some results achieving state-of-the-art levels.
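Action Gradient的核心操作是利用Q值对动作的梯度,直接对DT输出的动作做若干步梯度上升。下面是这一步骤的示意性草图(Python/PyTorch,非原论文实现;步数与步长等超参数均为假设):

import torch

def action_gradient_refine(q_net, state, action, steps=5, lr=0.05):
    # q_net(state, action) 返回 Q 值;对 action 做梯度上升以提高 Q 值
    a = action.detach().clone().requires_grad_(True)
    for _ in range(steps):
        q = q_net(state, a).sum()
        grad = torch.autograd.grad(q, a)[0]
        a = (a + lr * grad).detach().requires_grad_(True)
    return a.detach()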
【5】VER: Vision Expert Transformer for Robot Learning via Foundation Distillation and Dynamic Routing
标题:VER:通过基础模型蒸馏和动态路由进行机器人学习的Vision Expert Transformer
链接:https://arxiv.org/abs/2510.05213
作者:Yixiao Wang, Mingxiao Huo, Zhixuan Liang, Yushi Du, Lingfeng Sun, Haotian Lin, Jinghuan Shang, Chensheng Peng, Mohit Bansal, Mingyu Ding, Masayoshi Tomizuka
摘要:预训练的视觉基础模型(VFM)通过丰富的视觉表示来推进机器人学习,但单个VFM通常仅在特定领域表现出色,限制了跨任务的通用性。将多个VFM提取为策略的统一表示可以减轻这种限制,但通常会产生不灵活的特定于任务的功能选择,并且需要昂贵的全面重新培训来整合机器人领域的知识。我们提出VER,视觉专家Transformer机器人学习。在预训练期间,VER将多个VFM提取到视觉专家库中。然后,它只微调一个轻量级路由网络(少于0.4%的参数),从预训练库中动态选择任务相关的专家,用于下游机器人任务。我们进一步引入Patchwise专家路由与课程Top-K退火,以提高动态专家选择的灵活性和精度。此外,VER支持参数有效的微调可扩展的专家利用和自适应机器人领域的知识集成。在17项不同的机器人任务和多个政策负责人中,VER实现了最先进的性能。我们发现VER减少了任务无关区域中的大范数离群值(例如,背景),并专注于关键任务区域。可视化和代码可以在https://yixiaowang7.github.io/ver_page/中找到。
摘要:Pretrained vision foundation models (VFMs) advance robotic learning via rich visual representations, yet individual VFMs typically excel only in specific domains, limiting generality across tasks. Distilling multiple VFMs into a unified representation for policy can mitigate this limitation but often yields inflexible task-specific feature selection and requires costly full re-training to incorporate robot-domain knowledge. We propose VER, a Vision Expert transformer for Robot learning. During pretraining, VER distills multiple VFMs into a vision expert library. It then fine-tunes only a lightweight routing network (fewer than 0.4% of parameters) to dynamically select task-relevant experts from the pretrained library for downstream robot tasks. We further introduce Patchwise Expert Routing with Curriculum Top-K Annealing to improve both flexibility and precision of dynamic expert selection. Moreover, VER supports parameter-efficient finetuning for scalable expert utilization and adaptive robot-domain knowledge integration. Across 17 diverse robotic tasks and multiple policy heads, VER achieves state-of-the-art performance. We find that VER reduces large-norm outliers in task-irrelevant regions (e.g., background) and concentrates on task-critical regions. Visualizations and codes can be found in https://yixiaowang7.github.io/ver_page/.
【6】Auditing Algorithmic Bias in Transformer-Based Trading
标题:审计基于Transformer的交易中的算法偏差
链接:https://arxiv.org/abs/2510.05140
作者:Armin Gerami, Ramani Duraiswami
摘要:Transformer模型在金融应用中越来越受欢迎,但其潜在的风险和偏差仍然没有得到充分探讨。这项工作的目的是审计模型在决策时对波动性数据的依赖,并量化价格变动频率如何影响模型的预测置信度。我们采用一个Transformer模型进行预测,并引入一个基于部分信息分解(PID)的度量来衡量每项资产对模型决策的影响。我们的分析揭示了两个关键观察结果:首先,该模型完全忽略了数据的波动性;其次,它偏向于价格变动频率较低的数据。
摘要:Transformer models have become increasingly popular in financial applications, yet their potential risk making and biases remain under-explored. The purpose of this work is to audit the reliance of the model on volatile data for decision-making, and quantify how the frequency of price movements affects the model's prediction confidence. We employ a transformer model for prediction, and introduce a metric based on Partial Information Decomposition (PID) to measure the influence of each asset on the model's decision making. Our analysis reveals two key observations: first, the model disregards data volatility entirely, and second, it is biased toward data with lower-frequency price movements.
GAN|对抗|攻击|生成相关(9篇)
【1】Segment-Factorized Full-Song Generation on Symbolic Piano Music
标题:象征性钢琴音乐的分段分解全曲生成
链接:https://arxiv.org/abs/2510.05881
作者:Ping-Yi Chen, Chih-Pin Tan, Yi-Hsuan Yang
备注:Accepted to the 39th Conference on Neural Information Processing Systems (NeurIPS 2025) Workshop: AI for Music
摘要:我们提出了分段全歌模型(SFS)的符号全歌生成。该模型接受用户提供的歌曲结构和可选的短种子片段,该片段锚定歌曲开发的主要思想。通过将歌曲分解为片段并通过选择性注意相关片段来生成每个片段,该模型与先前的工作相比实现了更高的质量和效率。为了证明它对人类与人工智能交互的适用性,我们进一步将SFS包装到一个Web应用程序中,使用户能够以可定制的结构和灵活的顺序在钢琴卷上迭代地共同创作音乐。
摘要:We propose the Segmented Full-Song Model (SFS) for symbolic full-song generation. The model accepts a user-provided song structure and an optional short seed segment that anchors the main idea around which the song is developed. By factorizing a song into segments and generating each one through selective attention to related segments, the model achieves higher quality and efficiency compared to prior work. To demonstrate its suitability for human-AI interaction, we further wrap SFS into a web application that enables users to iteratively co-create music on a piano roll with customizable structures and flexible ordering.
【2】FoleyGRAM: Video-to-Audio Generation with GRAM-Aligned Multimodal Encoders
标题:FoleyGRAM:使用GRAM对齐的多模态编码器的视频到音频生成
链接:https://arxiv.org/abs/2510.05829
作者:Riccardo Fosco Gramaccioni, Christian Marinoni, Eleonora Grassucci, Giordano Cicchetti, Aurelio Uncini, Danilo Comminiello
备注:Acepted at IJCNN 2025
摘要:在这项工作中,我们提出了FoleyGRAM,一种新的视频到音频生成方法,强调通过使用对齐的多模态编码器进行语义条件化。基于视频到音频生成的先前进展,FoleyGRAM利用Gramian表示对齐度量(GRAM)来对齐视频、文本和音频模态的嵌入,从而实现对音频生成过程的精确语义控制。FoleyGRAM的核心是一个基于扩散的音频合成模型,以GRAM对齐的嵌入和波形包络为条件,确保语义丰富性以及与相应输入视频的时间对齐。我们在Greatest Hits数据集(视频到音频模型的标准基准)上评估了FoleyGRAM。我们的实验表明,使用GRAM对齐多模态编码器增强了系统将生成的音频与视频内容在语义上对齐的能力,推进了视频到音频合成的最新技术。
摘要:In this work, we present FoleyGRAM, a novel approach to video-to-audio generation that emphasizes semantic conditioning through the use of aligned multimodal encoders. Building on prior advancements in video-to-audio generation, FoleyGRAM leverages the Gramian Representation Alignment Measure (GRAM) to align embeddings across video, text, and audio modalities, enabling precise semantic control over the audio generation process. The core of FoleyGRAM is a diffusion-based audio synthesis model conditioned on GRAM-aligned embeddings and waveform envelopes, ensuring both semantic richness and temporal alignment with the corresponding input video. We evaluate FoleyGRAM on the Greatest Hits dataset, a standard benchmark for video-to-audio models. Our experiments demonstrate that aligning multimodal encoders using GRAM enhances the system's ability to semantically align generated audio with video content, advancing the state of the art in video-to-audio synthesis.
【3】StereoSync: Spatially-Aware Stereo Audio Generation from Video
标题:StereoSync:从视频生成空间感知的立体声音频
链接:https://arxiv.org/abs/2510.05828
作者:Christian Marinoni, Riccardo Fosco Gramaccioni, Kazuki Shimada, Takashi Shibuya, Yuki Mitsufuji, Danilo Comminiello
备注:Accepted at IJCNN 2025
摘要:虽然近年来音频生成已被广泛研究,但视频对齐的音频生成仍然是一个相对未开发的前沿。为了解决这一差距,我们引入StereoSync,一种新颖而高效的模型,旨在生成与参考视频在时间上同步并与其视觉上下文在空间上对齐的音频。此外,StereoSync还通过利用预训练的基础模型来实现效率,减少了对大量训练的需求,同时保持高质量的合成。与主要关注时间同步的现有方法不同,StereoSync通过将空间感知纳入视频对齐的音频生成来引入显著的进步。事实上,给定输入视频,我们的方法从深度图和边界框中提取空间线索,将它们用作基于扩散的音频生成模型中的交叉注意调节。这种方法允许StereoSync超越简单的同步,产生动态适应视频场景的空间结构和移动的立体声音频。我们评估了StereoSync上的Walking The Maps,这是一个策展数据集,包括来自视频游戏的视频,这些视频游戏的特点是动画角色在不同的环境中行走。实验结果证明了StereoSync能够实现时间和空间对齐,推进了视频到音频生成的最新技术水平,并带来了更加身临其境和逼真的音频体验。
摘要:Although audio generation has been widely studied over recent years, video-aligned audio generation still remains a relatively unexplored frontier. To address this gap, we introduce StereoSync, a novel and efficient model designed to generate audio that is both temporally synchronized with a reference video and spatially aligned with its visual context. Moreover, StereoSync also achieves efficiency by leveraging pretrained foundation models, reducing the need for extensive training while maintaining high-quality synthesis. Unlike existing methods that primarily focus on temporal synchronization, StereoSync introduces a significant advancement by incorporating spatial awareness into video-aligned audio generation. Indeed, given an input video, our approach extracts spatial cues from depth maps and bounding boxes, using them as cross-attention conditioning in a diffusion-based audio generation model. Such an approach allows StereoSync to go beyond simple synchronization, producing stereo audio that dynamically adapts to the spatial structure and movement of a video scene. We evaluate StereoSync on Walking The Maps, a curated dataset comprising videos from video games that feature animated characters walking through diverse environments. Experimental results demonstrate the ability of StereoSync to achieve both temporal and spatial alignment, advancing the state of the art in video-to-audio generation and resulting in a significantly more immersive and realistic audio experience.
【4】Empirical Comparison of Membership Inference Attacks in Deep Transfer Learning
标题:深度迁移学习中成员推断攻击的实证比较
链接:https://arxiv.org/abs/2510.05753
作者:Yuxuan Bai, Gauri Pradhan, Marlon Tobaben, Antti Honkela
备注:30 pages, 13 figures, published in TMLR https://openreview.net/forum?id=UligTUCgdt
摘要:随着强大的大规模基础模型的出现,训练范式越来越多地从从头训练转向迁移学习。这使得在敏感应用中常见的小型领域特定数据集上也能进行高效用的训练。成员推断攻击(MIA)为机器学习模型的隐私泄漏提供了经验估计。然而,先前针对迁移学习微调模型的MIA评估仅依赖于一小部分可能的攻击。我们通过比较迁移学习环境中不同MIA的性能来解决这一问题,以帮助从业者识别最有效的隐私风险评估攻击。我们发现,随着训练数据量的增加,基于分数的MIA的攻击效果会下降。我们还发现,没有任何一种MIA能够捕获使用迁移学习训练的模型中的所有隐私风险。虽然似然比攻击(LiRA)在大多数实验场景中表现优越,但在高数据量情形下,逆Hessian攻击(IHA)被证明对在PatchCamelyon数据集上微调的模型更有效。
摘要:With the emergence of powerful large-scale foundation models, the training paradigm is increasingly shifting from from-scratch training to transfer learning. This enables high utility training with small, domain-specific datasets typical in sensitive applications. Membership inference attacks (MIAs) provide an empirical estimate of the privacy leakage by machine learning models. Yet, prior assessments of MIAs against models fine-tuned with transfer learning rely on a small subset of possible attacks. We address this by comparing performance of diverse MIAs in transfer learning settings to help practitioners identify the most efficient attacks for privacy risk evaluation. We find that attack efficacy decreases with the increase in training data for score-based MIAs. We find that there is no one MIA which captures all privacy risks in models trained with transfer learning. While the Likelihood Ratio Attack (LiRA) demonstrates superior performance across most experimental scenarios, the Inverse Hessian Attack (IHA) proves to be more effective against models fine-tuned on PatchCamelyon dataset in high data regime.
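作为理解上述比较的背景,下面给出最简单的基于损失阈值的分数型MIA基线的示意实现(Python/NumPy;仅作概念演示,并非文中评估的LiRA或IHA):

import numpy as np

def loss_threshold_mia(member_losses, nonmember_losses):
    # 给定成员/非成员样本在目标模型上的损失,扫描单一阈值并返回最佳攻击准确率
    losses = np.r_[member_losses, nonmember_losses]
    labels = np.r_[np.ones(len(member_losses)), np.zeros(len(nonmember_losses))]
    best = 0.0
    for t in np.quantile(losses, np.linspace(0.0, 1.0, 101)):
        pred = (losses <= t).astype(float)     # 损失越小越可能是训练成员
        best = max(best, float((pred == labels).mean()))
    return best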
【5】PointNSP: Autoregressive 3D Point Cloud Generation with Next-Scale Level-of-Detail Prediction
标题:PointNSP:自回归3D点云生成,具有下一规模细节级别预测
链接:https://arxiv.org/abs/2510.05613
作者:Ziqiao Meng, Qichao Wang, Zhiyang Dou, Zixing Song, Zhipeng Zhou, Irwin King, Peilin Zhao
摘要:自回归点云生成在质量上长期落后于基于扩散的方法。性能差距源于这样一个事实:自回归模型对固有无序的点集强加了一种人为排序,迫使形状生成作为一系列局部预测进行。这种顺序偏差强调了短程连续性,但削弱了模型捕获长程依赖的能力,阻碍了其施加全局结构属性(如对称性、一致的拓扑和大规模几何规律性)的能力。受形状建模中细节层次(LOD)原则的启发,我们提出了PointNSP,一个由粗到细的生成框架,它在低分辨率下保留全局形状结构,并通过下一尺度预测范式在更高尺度上逐步细化细粒度几何。这种多尺度分解将自回归目标与点集的置换不变性质对齐,从而实现丰富的尺度内相互作用,同时避免脆弱的固定排序。ShapeNet上的实验表明,PointNSP首次在自回归范式中建立了最先进的(SOTA)生成质量。此外,它在参数、训练和推理效率方面超过了基于扩散的强基线。最后,在具有8,192个点的密集生成中,PointNSP的优势变得更加明显,突出了其可扩展性潜力。
摘要:Autoregressive point cloud generation has long lagged behind diffusion-based approaches in quality. The performance gap stems from the fact that autoregressive models impose an artificial ordering on inherently unordered point sets, forcing shape generation to proceed as a sequence of local predictions. This sequential bias emphasizes short-range continuity but undermines the model's capacity to capture long-range dependencies, hindering its ability to enforce global structural properties such as symmetry, consistent topology, and large-scale geometric regularities. Inspired by the level-of-detail (LOD) principle in shape modeling, we propose PointNSP, a coarse-to-fine generative framework that preserves global shape structure at low resolutions and progressively refines fine-grained geometry at higher scales through a next-scale prediction paradigm. This multi-scale factorization aligns the autoregressive objective with the permutation-invariant nature of point sets, enabling rich intra-scale interactions while avoiding brittle fixed orderings. Experiments on ShapeNet show that PointNSP establishes state-of-the-art (SOTA) generation quality for the first time within the autoregressive paradigm. In addition, it surpasses strong diffusion-based baselines in parameter, training, and inference efficiency. Finally, in dense generation with 8,192 points, PointNSP's advantages become even more pronounced, underscoring its scalability potential.
【6】High-Fidelity Synthetic ECG Generation via Mel-Spectrogram Informed Diffusion Training
标题:通过Mel光谱图知情扩散训练产生高保真合成心电图
链接:https://arxiv.org/abs/2510.05492
作者:Zhuoyi Huang, Nutan Sahoo, Anamika Kumari, Girish Kumar, Kexuan Cai, Shixing Cao, Yue Kang, Tian Xia, Somya Chatterjee, Nicholas Hausman, Aidan Jay, Eric S. Rosenthal, Soundar Srinivasan, Sadid Hasan, Alex Fedorov, Sulaiman Vesal
摘要:用于心脏护理的机器学习的发展受到共享真实患者心电图(ECG)数据的隐私限制的严重阻碍。尽管生成式人工智能提供了一个很有前途的解决方案,但现有模型合成ECG的实际使用受到可信度和临床实用性方面持续存在的差距的限制。在这项工作中,我们解决了当前生成式ECG方法的两个主要缺点:形态保真度不足和无法生成个性化的、患者特定的生理信号。为了弥补这些差距,我们在基于条件扩散的结构化状态空间模型(SSSD-ECG)基础上引入了两项原则性创新:(1)MIDT-ECG(Mel-Spectrogram Informed Diffusion Training),一种具有时频域监督的新型训练范式,用于加强生理结构的真实性;以及(2)多模态人口统计条件化,以实现患者特定的合成。我们在PTB-XL数据集上全面评估了我们的方法,从保真度、临床一致性、隐私保护和下游任务效用等方面评估合成的ECG信号。MIDT-ECG取得了实质性的收益:它提高了形态一致性,保留了强大的隐私保证,所有评估指标均超过基线4-8%,并将导联间相关性误差平均降低了74%,而人口统计条件化则增强了信噪比和个性化。在关键的低数据状态下,在补充了我们的合成ECG的数据集上训练的分类器实现了与仅在真实数据上训练的分类器相当的性能。总之,我们证明了使用所提出的时频结构正则化方案训练的ECG合成器可以在真实数据稀缺时作为个性化、高保真、隐私保护的代理,从而推动医疗保健中生成式AI的负责任使用。
摘要:The development of machine learning for cardiac care is severely hampered by privacy restrictions on sharing real patient electrocardiogram (ECG) data. Although generative AI offers a promising solution, the real-world use of existing model-synthesized ECGs is limited by persistent gaps in trustworthiness and clinical utility. In this work, we address two major shortcomings of current generative ECG methods: insufficient morphological fidelity and the inability to generate personalized, patient-specific physiological signals. To address these gaps, we build on a conditional diffusion-based Structured State Space Model (SSSD-ECG) with two principled innovations: (1) MIDT-ECG (Mel-Spectrogram Informed Diffusion Training), a novel training paradigm with time-frequency domain supervision to enforce physiological structural realism, and (2) multi-modal demographic conditioning to enable patient-specific synthesis. We comprehensively evaluate our approach on the PTB-XL dataset, assessing the synthesized ECG signals on fidelity, clinical coherence, privacy preservation, and downstream task utility. MIDT-ECG achieves substantial gains: it improves morphological coherence, preserves strong privacy guarantees with all metrics evaluated exceeding the baseline by 4-8%, and notably reduces the interlead correlation error by an average of 74%, while demographic conditioning enhances signal-to-noise ratio and personalization. In critical low-data regimes, a classifier trained on datasets supplemented with our synthetic ECGs achieves performance comparable to a classifier trained solely on real data. Together, we demonstrate that ECG synthesizers, trained with the proposed time-frequency structural regularization scheme, can serve as personalized, high-fidelity, privacy-preserving surrogates when real data are scarce, advancing the responsible use of generative AI in healthcare.
【7】LightCache: Memory-Efficient, Training-Free Acceleration for Video Generation
标题:LightCache:内存高效、免训练的视频生成加速
链接:https://arxiv.org/abs/2510.05367
作者:Yang Xiao, Gen Li, Kaiyuan Deng, Yushu Wu, Zheng Zhan, Yanzhi Wang, Xiaolong Ma, Bo Hui
摘要:免训练加速已经成为基于扩散模型的视频生成中的一个前沿研究方向。扩散模型推理中潜变量的冗余为加速提供了一个自然的切入点。在本文中,我们将推理过程分解为编码、去噪和解码阶段,并观察到基于缓存的加速方法通常会导致后两个阶段的大量内存激增。为了解决这个问题,我们分析了不同阶段的推理特性,并提出了针对特定阶段的策略来减少内存消耗:1)异步缓存交换;2)特征分块;3)对潜变量进行切片解码。同时,我们确保这三种策略引入的时间开销低于加速增益本身。与基线相比,我们的方法实现了更快的推理速度和更低的内存使用,同时将质量下降保持在可接受的范围内。代码可在https://github.com/NKUShaw/LightCache获取。
摘要:Training-free acceleration has emerged as an advanced research area in video generation based on diffusion models. The redundancy of latents in diffusion model inference provides a natural entry point for acceleration. In this paper, we decompose the inference process into the encoding, denoising, and decoding stages, and observe that cache-based acceleration methods often lead to substantial memory surges in the latter two stages. To address this problem, we analyze the characteristics of inference across different stages and propose stage-specific strategies for reducing memory consumption: 1) Asynchronous Cache Swapping. 2) Feature chunk. 3) Slicing latents to decode. At the same time, we ensure that the time overhead introduced by these three strategies remains lower than the acceleration gains themselves. Compared with the baseline, our approach achieves faster inference speed and lower memory usage, while maintaining quality degradation within an acceptable range. The Code is available at https://github.com/NKUShaw/LightCache .
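The memory-saving idea of "slicing latents to decode" can be illustrated with a short sketch: decode a long latent sequence in small chunks so that peak decoder memory stays bounded, then concatenate the outputs. The toy decoder and chunk size below are assumptions; the actual LightCache pipeline also combines this with cache swapping and feature chunking, which are omitted here.

```python
# Minimal sketch of chunked ("sliced") latent decoding to cap peak memory.
# The decoder is a stand-in; shapes and chunk size are illustrative assumptions.
import numpy as np

def toy_decoder(latent_chunk):
    # Stand-in for a VAE decoder: upsample each latent frame by 2x along time.
    return np.repeat(latent_chunk, 2, axis=0)

def sliced_decode(latents, chunk_size, decoder=toy_decoder):
    outputs = []
    for start in range(0, latents.shape[0], chunk_size):
        outputs.append(decoder(latents[start:start + chunk_size]))
    return np.concatenate(outputs, axis=0)

latents = np.random.randn(16, 4, 8, 8)             # (frames, channels, h, w), assumed shape
print(sliced_decode(latents, chunk_size=4).shape)  # (32, 4, 8, 8)
```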
【8】RegMix: Adversarial Mutual and Generalization Regularization for Enhancing DNN Robustness
标题:RegMix:对抗性相互和广义正规化以增强DNN稳健性
链接:https://arxiv.org/abs/2510.05317
作者:Zhenyu Liu, Varun Ojha
备注:None
摘要:对抗训练是对抗攻击最有效的防御手段,其有效性取决于损失函数和正则化项的设计。对抗训练中最广泛使用的损失函数是交叉熵,并以均方误差(MSE)作为其正则化目标。然而,MSE在训练过程中对两个输出分布强制执行过度均匀的优化,这限制了其在对抗训练场景中的鲁棒性。为了解决这个问题,我们重新审视了相互学习的想法(最初是为知识蒸馏而设计的),并提出了两种为对抗训练量身定制的新正则化策略:(i)加权对抗相互正则化和(ii)对抗泛化正则化。在前者中,我们构建了一个分解的对抗相互Kullback-Leibler散度(KL散度)损失,它通过为主要目标和辅助目标分配不相等的权重,实现对优化过程的灵活控制。在后者中,我们在对抗训练目标中引入了一个额外的干净目标分布,提高了泛化能力并增强了模型的鲁棒性。大量的实验表明,与现有的基于正则化的方法相比,我们提出的方法显著提高了对抗鲁棒性。
摘要:Adversarial training is the most effective defense against adversarial attacks. Its effectiveness hinges on the design of its loss function and regularization term. The most widely used loss function in adversarial training is cross-entropy, with mean squared error (MSE) as its regularization objective. However, MSE enforces overly uniform optimization between two output distributions during training, which limits its robustness in adversarial training scenarios. To address this issue, we revisit the idea of mutual learning (originally designed for knowledge distillation) and propose two novel regularization strategies tailored for adversarial training: (i) weighted adversarial mutual regularization and (ii) adversarial generalization regularization. In the former, we formulate a decomposed adversarial mutual Kullback-Leibler divergence (KL-divergence) loss, which allows flexible control over the optimization process by assigning unequal weights to the main and auxiliary objectives. In the latter, we introduce an additional clean target distribution into the adversarial training objective, improving generalization and enhancing model robustness. Extensive experiments demonstrate that our proposed methods significantly improve adversarial robustness compared to existing regularization-based approaches.
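As a rough illustration of a mutual-KL regularizer with unequal weights on the two directions, consider the sketch below. It is only a hedged approximation of the paper's decomposed adversarial mutual KL loss: the weighting scheme, logits, and coefficients are assumptions, and the full method also includes the clean-target generalization term.

```python
# Minimal sketch of a weighted mutual-KL regularizer between clean and adversarial outputs.
# The decomposition and weights are illustrative, not the paper's exact formulation.
import torch
import torch.nn.functional as F

def weighted_mutual_kl(logits_main, logits_aux, w_main=0.7, w_aux=0.3):
    log_p_main = F.log_softmax(logits_main, dim=-1)
    log_p_aux = F.log_softmax(logits_aux, dim=-1)
    # kl_div(input, target) = KL(target || input); the first term pulls "main" toward "aux".
    kl_to_main = F.kl_div(log_p_main, log_p_aux, log_target=True, reduction="batchmean")
    kl_to_aux = F.kl_div(log_p_aux, log_p_main, log_target=True, reduction="batchmean")
    return w_main * kl_to_main + w_aux * kl_to_aux

logits_adv, logits_clean = torch.randn(8, 10), torch.randn(8, 10)
print(weighted_mutual_kl(logits_adv, logits_clean).item())
```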
【9】Adversarial Reinforcement Learning for Offensive and Defensive Agents in a Simulated Zero-Sum Network Environment
标题:模拟零和网络环境中进攻和防御代理的对抗强化学习
链接:https://arxiv.org/abs/2510.05157
作者:Abrar Shahid, Ibteeker Mahir Ishum, AKM Tahmidul Haque, M Sohel Rahman, A. B. M. Alim Al Islam
备注:8 pages, 5 tables, 5 figures. 12th International Conference on Next Generation Computing, Communication, Systems and Security
摘要:本文通过自定义OpenAI Gym环境对网络安全中的对抗性强化学习进行了受控研究,该环境对多端口服务的暴力攻击和反应式防御进行了建模。该环境捕获了现实的安全权衡,包括背景流量噪声、渐进式利用机制、基于IP的规避策略、蜜罐陷阱和多级速率限制防御。竞争的攻击者和防御者代理在零和奖励框架内使用深度Q网络(DQN)进行训练,其中成功的利用产生大量的终端奖励,而增量操作产生较小的成本。通过对多种配置(不同的陷阱检测概率,利用难度阈值和训练方案)的系统评估,结果表明,防御者的可观察性和陷阱有效性为成功的攻击创造了巨大的障碍。实验表明,在这种对抗性环境中,奖励塑造和仔细的训练计划对于学习稳定性至关重要。防御者在50,000多个训练回合中始终保持战略优势,当暴露于复杂的防御策略(包括自适应IP阻止和端口特定控制)时,性能增益会放大。提供了完整的实现细节,可重复的超参数配置和架构指南,以支持未来对抗性RL的网络安全研究。零和形式化和现实的操作约束使这个环境适合研究自主防御系统,攻击者-防御者共同进化,并将学习转移到现实世界的网络安全场景。
摘要:This paper presents a controlled study of adversarial reinforcement learning in network security through a custom OpenAI Gym environment that models brute-force attacks and reactive defenses on multi-port services. The environment captures realistic security trade-offs including background traffic noise, progressive exploitation mechanics, IP-based evasion tactics, honeypot traps, and multi-level rate-limiting defenses. Competing attacker and defender agents are trained using Deep Q-Networks (DQN) within a zero-sum reward framework, where successful exploits yield large terminal rewards while incremental actions incur small costs. Through systematic evaluation across multiple configurations (varying trap detection probabilities, exploitation difficulty thresholds, and training regimens), the results demonstrate that defender observability and trap effectiveness create substantial barriers to successful attacks. The experiments reveal that reward shaping and careful training scheduling are critical for learning stability in this adversarial setting. The defender consistently maintains strategic advantage across 50,000+ training episodes, with performance gains amplifying when exposed to complex defensive strategies including adaptive IP blocking and port-specific controls. Complete implementation details, reproducible hyperparameter configurations, and architectural guidelines are provided to support future research in adversarial RL for cybersecurity. The zero-sum formulation and realistic operational constraints make this environment suitable for studying autonomous defense systems, attacker-defender co-evolution, and transfer learning to real-world network security scenarios.
半/弱/无/有监督|不确定性|主动学习(6篇)
【1】Uncertainty in Machine Learning
标题:机器学习中的不确定性
链接:https://arxiv.org/abs/2510.06007
作者:Hans Weytjens, Wouter Verbeke
备注:Authored by Hans Weytjens. Wouter Verbeke provided proofreading and served as the chief editor of the book in which this chapter appears
摘要:本章介绍了机器学习中不确定性量化的原理和实际应用。它解释了如何识别和区分不同类型的不确定性,并提出了量化预测模型中不确定性的方法,包括线性回归,随机森林和神经网络。本章还介绍了作为生成具有预定义置信区间的预测的框架的共形预测。最后,它探讨了如何利用不确定性估计来改进业务决策,增强模型可靠性,并支持风险感知策略。
摘要:This book chapter introduces the principles and practical applications of uncertainty quantification in machine learning. It explains how to identify and distinguish between different types of uncertainty and presents methods for quantifying uncertainty in predictive models, including linear regression, random forests, and neural networks. The chapter also covers conformal prediction as a framework for generating predictions with predefined confidence intervals. Finally, it explores how uncertainty estimation can be leveraged to improve business decision-making, enhance model reliability, and support risk-aware strategies.
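Since the chapter covers conformal prediction, a minimal split-conformal sketch for regression may help illustrate the idea: calibrate a residual quantile on held-out data, then report symmetric prediction intervals around new predictions. The toy model and data below are assumptions, not the chapter's examples.

```python
# Minimal sketch of split conformal prediction for regression (toy model and data).
import numpy as np

def conformal_interval(predict, X_cal, y_cal, X_test, alpha=0.1):
    residuals = np.abs(y_cal - predict(X_cal))
    n = len(residuals)
    q_level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)   # finite-sample correction
    q = np.quantile(residuals, q_level)
    preds = predict(X_test)
    return preds - q, preds + q

rng = np.random.default_rng(0)
X_cal, X_test = rng.normal(size=200), rng.normal(size=5)
y_cal = 2 * X_cal + rng.normal(scale=0.5, size=200)
predict = lambda x: 2 * x                                    # assumed fitted model
lo, hi = conformal_interval(predict, X_cal, y_cal, X_test)
print(np.c_[lo, hi])                                         # ~90% coverage intervals
```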
【2】Towards Label-Free Biological Reasoning Synthetic Dataset Creation via Uncertainty Filtering
标题:通过不确定性过滤创建无标签生物推理合成数据集
链接:https://arxiv.org/abs/2510.05871
作者:Josefa Lia Stoisser, Lawrence Phillips, Aditya Misra, Tom A. Lamb, Philip Torr, Marc Boubnovski Martell, Julien Fauqueur, Kaspar Märtens
摘要:合成思维链(CoT)轨迹被广泛用于训练大型推理模型(LRM),通过提供步骤级监督来提高泛化能力。然而,大多数方法都需要真实标签来初始化或过滤这些轨迹——这在生物学等湿实验室数据稀缺的领域是一个昂贵的瓶颈。我们提出了一个无标签的替代方案:基于不确定性的过滤,它使用模型自身的置信度——通过自一致性和预测困惑度等成熟的不确定性指标进行量化——作为外部标签的替代品。我们对多条推理轨迹进行采样,只保留低不确定性的子集。将该方法应用于生物扰动预测(一个湿实验室标签特别昂贵的领域),我们表明过滤后的子集具有更高的准确性,并且在经不确定性过滤的数据上进行监督微调(SFT)优于未过滤的合成数据,缩小了与真实标签训练的差距,并超过了强大的LRM基线。消融实验表明,按类过滤可校正类特定的不确定性尺度,而混合不确定性度量可产生更高质量的数据集。我们的研究结果表明,模型内部置信度是高效构建推理数据集的强有力信号,使LRM能够应用于监督成本高昂的领域。
摘要:Synthetic chain-of-thought (CoT) traces are widely used to train large reasoning models (LRMs), improving generalization by providing step-level supervision. Yet most approaches require ground-truth labels to seed or filter these traces - an expensive bottleneck in domains like biology where wet-lab data are scarce. We propose a label-free alternative: uncertainty-based filtering, which uses a model's own confidence - quantified through established uncertainty metrics like self-consistency and predictive perplexity - as a substitute for external labels. We sample multiple reasoning traces and retain only low-uncertainty subsets. Applied to biological perturbation prediction, a domain where wet-lab labels are especially costly, we show that the filtered subset has higher accuracy, and that supervised fine-tuning (SFT) on uncertainty-filtered data outperforms unfiltered synthetic data, narrows the gap to ground-truth training, and surpasses strong LRM baselines. Ablations show that per-class filtering corrects for class-specific uncertainty scales and that hybrid uncertainty metrics yield higher-quality datasets. Our results suggest that model-internal confidence is a powerful signal for efficient reasoning dataset creation, enabling LRMs in domains where supervision is expensive.
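A minimal sketch of the self-consistency side of such uncertainty-based filtering is shown below: sample several answers per prompt, measure agreement with the majority answer, and keep only high-agreement (low-uncertainty) prompts. The sampled answers and the agreement threshold are illustrative assumptions; the paper additionally uses predictive perplexity and per-class calibration, which are not shown.

```python
# Minimal sketch of label-free filtering by self-consistency (toy sampled answers).
from collections import Counter

def self_consistency(answers):
    """Return the majority answer and the fraction of samples agreeing with it."""
    top_answer, top_count = Counter(answers).most_common(1)[0]
    return top_answer, top_count / len(answers)

def filter_by_uncertainty(sampled, min_agreement=0.6):
    kept = {}
    for prompt, answers in sampled.items():
        answer, agreement = self_consistency(answers)
        if agreement >= min_agreement:
            kept[prompt] = answer
    return kept

sampled = {
    "perturbation A": ["up", "up", "up", "down"],
    "perturbation B": ["up", "down", "flat", "down"],
}
print(filter_by_uncertainty(sampled))  # keeps only the high-agreement prompt
```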
【3】Uncertainty assessment in satellite-based greenhouse gas emissions estimates using emulated atmospheric transport
标题:使用模拟大气传输进行基于卫星的温室气体排放估计的不确定性评估
链接:https://arxiv.org/abs/2510.05751
作者:Jeffrey N. Clark, Elena Fillola, Nawid Keshtmand, Raul Santos-Rodriguez, Matthew Rigby
摘要:监测温室气体排放和评估国家清单需要高效、可扩展和可靠的推理方法。自上而下的方法,加上卫星观测的最新进展,为评估大陆和全球范围的排放量提供了新的机会。然而,这些方法中使用的传输模型仍然是不确定性的主要来源:它们在大规模运行时计算成本很高,并且它们的不确定性难以表征。人工智能提供了双重机会,既能加速传输模拟,又能量化其相关的不确定性。我们提出了一个基于集合的管道,使用拉格朗日粒子扩散模型(LPDM)的图神经网络仿真器来估计大气传输"足迹"、温室气体摩尔分数测量及其不确定性。该方法以2016年巴西的GOSAT(温室气体观测卫星)观测为例进行了演示。该仿真器实现了比NAME LPDM快约1000倍的速度,同时再现了大尺度的足迹结构。通过计算集合来量化绝对和相对不确定性,并揭示其与预测误差的空间相关性。结果表明,集合的离散度能够突出大气传输足迹和甲烷摩尔分数中置信度较低的空间和时间预测。虽然这里以LPDM仿真器为例进行了演示,但该方法可以更普遍地应用于大气传输模型,以支持具备不确定性感知的温室气体反演系统并提高基于卫星的排放监测的鲁棒性。随着进一步的发展,基于集合的仿真器还可以帮助探索系统性的LPDM误差,为温室气体通量估计中更全面的不确定性预算提供一条计算效率高的途径。
摘要:Monitoring greenhouse gas emissions and evaluating national inventories require efficient, scalable, and reliable inference methods. Top-down approaches, combined with recent advances in satellite observations, provide new opportunities to evaluate emissions at continental and global scales. However, transport models used in these methods remain a key source of uncertainty: they are computationally expensive to run at scale, and their uncertainty is difficult to characterise. Artificial intelligence offers a dual opportunity to accelerate transport simulations and to quantify their associated uncertainty. We present an ensemble-based pipeline for estimating atmospheric transport "footprints", greenhouse gas mole fraction measurements, and their uncertainties using a graph neural network emulator of a Lagrangian Particle Dispersion Model (LPDM). The approach is demonstrated with GOSAT (Greenhouse Gases Observing Satellite) observations for Brazil in 2016. The emulator achieved a ~1000x speed-up over the NAME LPDM, while reproducing large-scale footprint structures. Ensembles were calculated to quantify absolute and relative uncertainty, revealing spatial correlations with prediction error. The results show that ensemble spread highlights low-confidence spatial and temporal predictions for both atmospheric transport footprints and methane mole fractions. While demonstrated here for an LPDM emulator, the approach could be applied more generally to atmospheric transport models, supporting uncertainty-aware greenhouse gas inversion systems and improving the robustness of satellite-based emissions monitoring. With further development, ensemble-based emulators could also help explore systematic LPDM errors, offering a computationally efficient pathway towards a more comprehensive uncertainty budget in greenhouse gas flux estimates.
【4】DiffSDA: Unsupervised Diffusion Sequential Disentanglement Across Modalities
标题:DiffSDA:跨模态的无监督扩散顺序解纠缠
链接:https://arxiv.org/abs/2510.05717
作者:Hedi Zisling, Ilan Naiman, Nimrod Berman, Supasorn Suwajanakorn, Omri Azencot
摘要:无监督表示学习,特别是顺序解纠缠,旨在分离数据中的静态和动态变化因素,而不依赖于标签。这仍然是一个具有挑战性的问题,因为现有的基于变分自动编码器和生成对抗网络的方法通常依赖于多个损失项,使优化过程复杂化。此外,当应用于真实世界的数据时,顺序解纠缠方法面临着挑战,并且目前还没有建立评估协议来评估它们在这种情况下的性能。近年来,扩散模型已经成为最先进的生成模型,但是对于它们在顺序解纠缠中的应用还没有理论上的形式化。在这项工作中,我们介绍了扩散顺序解纠缠自动编码器(DiffSDA),一个新颖的、与模态无关的框架,可有效应用于包括时间序列、视频和音频在内的多种真实世界数据模态。DiffSDA利用了新的概率建模、潜在扩散和高效采样器,同时结合了用于严格测试的具有挑战性的评估协议。在多个真实世界基准上的实验表明,DiffSDA在顺序解纠缠方面优于近期最先进的方法。
摘要:Unsupervised representation learning, particularly sequential disentanglement, aims to separate static and dynamic factors of variation in data without relying on labels. This remains a challenging problem, as existing approaches based on variational autoencoders and generative adversarial networks often rely on multiple loss terms, complicating the optimization process. Furthermore, sequential disentanglement methods face challenges when applied to real-world data, and there is currently no established evaluation protocol for assessing their performance in such settings. Recently, diffusion models have emerged as state-of-the-art generative models, but no theoretical formalization exists for their application to sequential disentanglement. In this work, we introduce the Diffusion Sequential Disentanglement Autoencoder (DiffSDA), a novel, modal-agnostic framework effective across diverse real-world data modalities, including time series, video, and audio. DiffSDA leverages a new probabilistic modeling, latent diffusion, and efficient samplers, while incorporating a challenging evaluation protocol for rigorous testing. Our experiments on diverse real-world benchmarks demonstrate that DiffSDA outperforms recent state-of-the-art methods in sequential disentanglement.
【5】QDeepGR4J: Quantile-based ensemble of deep learning and GR4J hybrid rainfall-runoff models for extreme flow prediction with uncertainty quantification
标题:QDeepGR4J:深度学习与GR4J混合降雨-径流模型的基于分位数的集成,用于带不确定性量化的极端流量预测
链接:https://arxiv.org/abs/2510.05453
作者:Arpit Kapoor, Rohitash Chandra
摘要:概念性降雨-径流模型帮助水文学家和气候科学家模拟径流,为水资源管理实践提供信息。深度学习的最新进展揭示了将水文模型与深度学习模型相结合以提高可解释性和预测性能的潜力。在我们之前的工作中,我们引入了DeepGR4J,它使用深度学习模型作为汇流组件的替代,来增强GR4J概念性降雨-径流模型。DeepGR4J具有更高的降雨-径流预测精度,特别是在干旱集水区。分位数回归模型已被广泛用于量化不确定性,同时有助于极端值预测。在本文中,我们使用基于分位数回归的集成学习框架扩展了DeepGR4J,以量化流量预测中的不确定性。我们还利用不确定性界限来识别可能导致洪水的极端流量事件。我们进一步将模型扩展到带不确定性边界的多步流量预测。我们设计实验,使用CAMELS-Aus数据集对所提出的框架进行详细评估。结果表明,与基线深度学习模型相比,我们提出的Quantile DeepGR4J框架提高了预测准确性和不确定性区间质量(区间得分)。此外,我们使用Quantile DeepGR4J进行洪水风险评估,结果表明其适合作为预警系统。
摘要:Conceptual rainfall-runoff models aid hydrologists and climate scientists in modelling streamflow to inform water management practices. Recent advances in deep learning have unravelled the potential for combining hydrological models with deep learning models for better interpretability and improved predictive performance. In our previous work, we introduced DeepGR4J, which enhanced the GR4J conceptual rainfall-runoff model using a deep learning model to serve as a surrogate for the routing component. DeepGR4J had an improved rainfall-runoff prediction accuracy, particularly in arid catchments. Quantile regression models have been extensively used for quantifying uncertainty while aiding extreme value forecasting. In this paper, we extend DeepGR4J using a quantile regression-based ensemble learning framework to quantify uncertainty in streamflow prediction. We also leverage the uncertainty bounds to identify extreme flow events potentially leading to flooding. We further extend the model to multi-step streamflow predictions for uncertainty bounds. We design experiments for a detailed evaluation of the proposed framework using the CAMELS-Aus dataset. The results show that our proposed Quantile DeepGR4J framework improves the predictive accuracy and uncertainty interval quality (interval score) compared to baseline deep learning models. Furthermore, we carry out flood risk evaluation using Quantile DeepGR4J, and the results demonstrate its suitability as an early warning system.
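The quantile-regression building block behind such an ensemble is the pinball loss; a short sketch follows. The toy targets, predictions, and quantile levels are assumptions, and the hydrological model itself is omitted.

```python
# Minimal sketch of the pinball (quantile) loss used to train quantile regression members.
import numpy as np

def pinball_loss(y_true, y_pred, tau):
    diff = y_true - y_pred
    return np.mean(np.maximum(tau * diff, (tau - 1) * diff))

y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.5, 1.5, 2.5])
for tau in (0.05, 0.5, 0.95):      # lower bound, median, upper bound
    print(tau, pinball_loss(y_true, y_pred, tau))
```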
【6】Learning More with Less: A Generalizable, Self-Supervised Framework for Privacy-Preserving Capacity Estimation with EV Charging Data
标题:以更少学到更多:一个可推广的、自监督的框架,用于利用电动汽车充电数据进行隐私保护容量估计
链接:https://arxiv.org/abs/2510.05172
作者:Anushiya Arunan, Yan Qin, Xiaoli Li, U-Xuan Tan, H. Vincent Poor, Chau Yuen
备注:Accepted in IEEE Transactions on Industrial Informatics
摘要:准确的电池容量估计是缓解消费者对电动汽车(EV)电池性能和可靠性担忧的关键。然而,严格的隐私法规和标记的数据短缺所带来的实际数据限制阻碍了可推广的容量估计模型的发展,这些模型对现实世界的数据分布变化保持稳健。虽然自监督学习可以利用未标记的数据,但现有技术并不是专门设计用于从具有挑战性的现场数据中有效学习的,更不用说从隐私友好的数据中学习了,这些数据通常不那么丰富和嘈杂。在这项工作中,我们提出了一种基于自我监督预训练的首个容量估计模型,该模型是在来自真实世界电动汽车运营的隐私友好型充电数据片段的大规模数据集上开发的。我们的预训练框架,片段相似性加权掩码输入重建,旨在学习丰富的,可概括的表示,即使是从功能不太丰富和碎片化的隐私友好数据。我们的关键创新在于利用对比学习,首先在碎片化的片段中捕捉高层次的相似性,否则这些片段缺乏有意义的上下文。通过我们的片段对比学习和随后的相似性加权掩码重建,我们能够学习单个片段内的颗粒充电模式和不同片段之间的高级关联关系的丰富表示。在这种丰富的表示学习的支持下,我们的模型始终优于最先进的基线,即使在受制造商和年龄引起的分布变化影响的具有挑战性的域转移设置下,测试误差也比最佳性能基准低31.9%。
摘要:Accurate battery capacity estimation is key to alleviating consumer concerns about battery performance and reliability of electric vehicles (EVs). However, practical data limitations imposed by stringent privacy regulations and labeled data shortages hamper the development of generalizable capacity estimation models that remain robust to real-world data distribution shifts. While self-supervised learning can leverage unlabeled data, existing techniques are not particularly designed to learn effectively from challenging field data -- let alone from privacy-friendly data, which are often less feature-rich and noisier. In this work, we propose a first-of-its-kind capacity estimation model based on self-supervised pre-training, developed on a large-scale dataset of privacy-friendly charging data snippets from real-world EV operations. Our pre-training framework, snippet similarity-weighted masked input reconstruction, is designed to learn rich, generalizable representations even from less feature-rich and fragmented privacy-friendly data. Our key innovation lies in harnessing contrastive learning to first capture high-level similarities among fragmented snippets that otherwise lack meaningful context. With our snippet-wise contrastive learning and subsequent similarity-weighted masked reconstruction, we are able to learn rich representations of both granular charging patterns within individual snippets and high-level associative relationships across different snippets. Bolstered by this rich representation learning, our model consistently outperforms state-of-the-art baselines, achieving 31.9% lower test error than the best-performing benchmark, even under challenging domain-shifted settings affected by both manufacturer and age-induced distribution shifts.
迁移|Zero/Few/One-Shot|自适应(13篇)
【1】NEO: No-Optimization Test-Time Adaptation through Latent Re-Centering
标题:NEO:通过潜在重新定中心进行无优化测试时适应
链接:https://arxiv.org/abs/2510.05635
作者:Alexander Murphy, Michal Danilowski, Soumyajit Chatterjee, Abhirup Ghosh
摘要:测试时自适应(TTA)方法通常计算成本高昂,需要大量数据才能有效适应,或者对超参数很敏感。基于潜在空间几何的理论基础,我们能够通过将目标数据嵌入重新居中到原点,显著改善源样本和分布偏移样本之间的对齐。这一见解催生了NEO——一种无超参数的完全TTA方法,与普通推理相比,它不会增加任何显著的计算量。仅在一批64个样本上进行自适应后,NEO就能将ImageNet-C上ViT-Base的分类准确率从55.6%提高到59.2%。当在512个样本上进行自适应时,NEO在ImageNet-C、ImageNet-R和ImageNet-S上胜过了我们比较的全部7种TTA方法,在CIFAR-10-C上胜过其中6种,同时使用的计算量最少。NEO在模型校准指标上表现良好,此外还能够仅从1个类进行自适应,从而提高ImageNet-C中其余999个类的准确性。在Raspberry Pi和Jetson Orin Nano设备上,与基线相比,NEO减少了63%的推理时间和9%的内存使用。基于3种ViT架构和4个数据集的结果表明,NEO可以高效且有效地用于TTA。
摘要:Test-Time Adaptation (TTA) methods are often computationally expensive, require a large amount of data for effective adaptation, or are brittle to hyperparameters. Based on a theoretical foundation of the geometry of the latent space, we are able to significantly improve the alignment between source and distribution-shifted samples by re-centering target data embeddings at the origin. This insight motivates NEO -- a hyperparameter-free fully TTA method, that adds no significant compute compared to vanilla inference. NEO is able to improve the classification accuracy of ViT-Base on ImageNet-C from 55.6% to 59.2% after adapting on just one batch of 64 samples. When adapting on 512 samples NEO beats all 7 TTA methods we compare against on ImageNet-C, ImageNet-R and ImageNet-S and beats 6/7 on CIFAR-10-C, while using the least amount of compute. NEO performs well on model calibration metrics and additionally is able to adapt from 1 class to improve accuracy on 999 other classes in ImageNet-C. On Raspberry Pi and Jetson Orin Nano devices, NEO reduces inference time by 63% and memory usage by 9% compared to baselines. Our results based on 3 ViT architectures and 4 datasets show that NEO can be used efficiently and effectively for TTA.
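The core re-centering operation is simple enough to sketch directly: subtract the mean of the target-batch embeddings so they are centered at the origin before the classifier head. The feature extractor, embedding size, and head below are stand-ins (assumptions), not the released NEO code.

```python
# Minimal sketch of latent re-centering before classification (toy features and head).
import torch

@torch.no_grad()
def recentered_predict(features, head):
    # features: (batch, dim) embeddings of the distribution-shifted target batch.
    centered = features - features.mean(dim=0, keepdim=True)
    return head(centered).argmax(dim=-1)

features = torch.randn(64, 768)        # assumed ViT-Base embedding size
head = torch.nn.Linear(768, 1000)      # assumed classification head
print(recentered_predict(features, head).shape)   # torch.Size([64])
```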
【2】Teamwork: Collaborative Diffusion with Low-rank Coordination and Adaptation
标题:团队合作:低级别协调和适应的协作传播
链接:https://arxiv.org/abs/2510.05532
作者:Sam Sartor, Pieter Peers
摘要:大型预训练扩散模型可以提供对许多图形应用有益的强先验。然而,生成应用(如神经渲染)和逆问题方法(如SVBRDF估计和本征图像分解)需要额外的输入或输出通道。当前的通道扩展解决方案通常是特定于应用的,并且这些解决方案可能难以适配不同的扩散模型或新任务。本文介绍了Teamwork:一个灵活高效的统一解决方案,用于联合增加输入和输出通道的数量,以及使预训练的扩散模型适应新任务。Teamwork通过协调和适配基础扩散模型的多个实例(即队友),在不改变预训练扩散模型架构的情况下实现通道扩展。我们采用低秩适应(LoRA)的一种新变体,以共同解决不同队友之间的适应与协调问题。此外,Teamwork支持动态地(取消)激活队友。我们展示了Teamwork在各种生成和逆图形任务上的灵活性和效率,例如修复、单图像SVBRDF估计、本征分解、神经着色和本征图像合成。
摘要:Large pretrained diffusion models can provide strong priors beneficial for many graphics applications. However, generative applications such as neural rendering and inverse methods such as SVBRDF estimation and intrinsic image decomposition require additional input or output channels. Current solutions for channel expansion are often application specific and these solutions can be difficult to adapt to different diffusion models or new tasks. This paper introduces Teamwork: a flexible and efficient unified solution for jointly increasing the number of input and output channels as well as adapting a pretrained diffusion model to new tasks. Teamwork achieves channel expansion without altering the pretrained diffusion model architecture by coordinating and adapting multiple instances of the base diffusion model (i.e., teammates). We employ a novel variation of Low Rank-Adaptation (LoRA) to jointly address both adaptation and coordination between the different teammates. Furthermore Teamwork supports dynamic (de)activation of teammates. We demonstrate the flexibility and efficiency of Teamwork on a variety of generative and inverse graphics tasks such as inpainting, single image SVBRDF estimation, intrinsic decomposition, neural shading, and intrinsic image synthesis.
【3】LATTA: Langevin-Anchored Test-Time Adaptation for Enhanced Robustness and Stability
标题:LATTA:Langevin锚定的测试时自适应,增强稳健性和稳定性
链接:https://arxiv.org/abs/2510.05530
作者:Harshil Vejendla
备注:MIT URTC 2025 Technical Paper (Oral), 5 pages, 3 figures
摘要:测试时自适应(TTA)旨在仅使用未标记的测试数据使预训练模型适应分布变化。虽然有希望,但像Tent这样的现有方法存在不稳定性,并且可能会灾难性地忘记源知识,特别是在小批量或具有挑战性的腐败情况下。我们认为,这是由于过于确定性的更新复杂的损失表面。在本文中,我们介绍了Langevin-Anchored Test-Time Adaptation(LATTA),这是一种通过两种关键机制正则化自适应的新方法:(1)受随机梯度Langevin动力学(SGLD)启发的噪声权重扰动,以探索局部参数空间并逃避糟糕的局部最小值,以及(2)稳定的权重锚,防止模型偏离其鲁棒源预训练。这种组合使LATTA能够在不牺牲稳定性的情况下有效地适应。与之前的贝叶斯TTA方法不同,LATTA不需要架构更改或昂贵的Monte Carlo过程。我们对标准基准进行了广泛的实验,包括Rotated-MNIST和更具挑战性的CIFAR-10-C。我们的研究结果表明,LATTA显著优于现有的方法,包括Tent,CoTTA和EATA,通过将CIFAR-10-C的平均准确率提高2%以上,同时降低性能方差,为自监督TTA创造了新的技术水平。
摘要:Test-time adaptation (TTA) aims to adapt a pretrained model to distribution shifts using only unlabeled test data. While promising, existing methods like Tent suffer from instability and can catastrophically forget the source knowledge, especially with small batch sizes or challenging corruptions. We argue that this arises from overly deterministic updates on a complex loss surface. In this paper, we introduce Langevin-Anchored Test-Time Adaptation (LATTA), a novel approach that regularizes adaptation through two key mechanisms: (1) a noisy weight perturbation inspired by Stochastic Gradient Langevin Dynamics (SGLD) to explore the local parameter space and escape poor local minima, and (2) a stable weight anchor that prevents the model from diverging from its robust source pre-training. This combination allows LATTA to adapt effectively without sacrificing stability. Unlike prior Bayesian TTA methods, LATTA requires no architectural changes or expensive Monte Carlo passes. We conduct extensive experiments on standard benchmarks, including Rotated-MNIST and the more challenging CIFAR-10-C. Our results demonstrate that LATTA significantly outperforms existing methods, including Tent, CoTTA, and EATA, setting a new state of the art for self-supervised TTA by improving average accuracy on CIFAR-10-C by over 2% while simultaneously reducing performance variance.
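The two LATTA ingredients described above can be caricatured in a few lines: an SGLD-style noisy gradient step plus a pull back toward anchored source weights. The entropy objective, step sizes, noise scale, and anchor weight below are illustrative assumptions, not the paper's tuned procedure.

```python
# Minimal sketch of a noisy (Langevin-style) adaptation step with a source-weight anchor.
# Hyperparameters and the entropy objective are illustrative assumptions.
import torch

def latta_like_step(params, anchors, loss, lr=1e-4, noise_scale=1e-4, anchor_weight=0.1):
    grads = torch.autograd.grad(loss, params)
    with torch.no_grad():
        for p, a, g in zip(params, anchors, grads):
            p.add_(-lr * g)                              # gradient step on the TTA loss
            p.add_(noise_scale * torch.randn_like(p))    # SGLD-inspired exploration noise
            p.add_(-anchor_weight * lr * (p - a))        # pull toward the source anchor

model = torch.nn.Linear(4, 2)
anchors = [p.detach().clone() for p in model.parameters()]
x = torch.randn(8, 4)
probs = model(x).softmax(-1)
entropy = -(probs * probs.clamp_min(1e-8).log()).sum(-1).mean()
latta_like_step(list(model.parameters()), anchors, entropy)
```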
【4】ARMOR: High-Performance Semi-Structured Pruning via Adaptive Matrix Factorization
标题:ARMOR:通过自适应矩阵分解的高性能半结构化修剪
链接:https://arxiv.org/abs/2510.05528
作者:Lawrence Liu, Alexander Liu, Mengdi Wang, Tuo Zhao, Lin F. Yang
摘要:大型语言模型(LLM)由于其巨大的计算和内存需求而带来了重大的部署挑战。虽然半结构化修剪,特别是2:4稀疏性,提供了一条实现实用硬件加速的途径,但现有方法往往会导致显著的性能下降。为了弥补这一差距,我们引入了ARMOR(Adaptive Representation with Matrix-factORization,自适应表示与矩阵分解),一种新颖的一次性训练后修剪算法。ARMOR不是直接修剪权重,而是将每个权重矩阵分解为由两个低开销块对角矩阵包裹的2:4稀疏核心。这些包裹矩阵充当高效的前置和后置变换误差校正器,与传统的2:4修剪技术相比,为保持模型质量提供了更大的灵活性。稀疏核心和块对角包裹矩阵通过最小化逐层代理损失的块坐标下降算法来选择。我们从理论上证明,这种优化保证收敛到代理损失小于或等于最先进修剪算法的解。在Llama(Touvron等人,2023;Dubey等人,2024)和Qwen(Yang等人,2025)模型系列上的实验表明,ARMOR在广泛的下游任务和困惑度评估中始终显著优于最先进的2:4修剪方法。ARMOR在实现这种卓越性能的同时,保留了2:4修剪的推理加速和大量内存使用减少,在模型压缩和任务准确性之间建立了更有效的权衡。
摘要:Large language models (LLMs) present significant deployment challenges due to their immense computational and memory requirements. While semi-structured pruning, particularly 2:4 sparsity, offers a path to practical hardware acceleration, existing methods often incur substantial performance degradation. To bridge this gap, we introduce ARMOR: (Adaptive Representation with Matrix-factORization), a novel one-shot post-training pruning algorithm. Instead of directly pruning weights, ARMOR factorizes each weight matrix into a 2:4 sparse core wrapped by two low-overhead, block diagonal matrices. These wrappers act as efficient pre and post-transformation error correctors, offering greater flexibility to preserve model quality compared to conventional 2:4 pruning techniques. The sparse core and block diagonal wrappers are chosen through a block coordinate descent algorithm that minimizes a layer-wise proxy loss. We theoretically prove this optimization is guaranteed to converge to a solution with a proxy loss less than or equal to state-of-the-art pruning algorithms. Experiments on Llama (Touvron et al., 2023; Dubey et al., 2024) and Qwen (Yang et al., 2025) model families demonstrate that ARMOR consistently and significantly outperforms state-of-the-art 2:4 pruning methods across a wide range of downstream tasks and perplexity evaluations. ARMOR achieves this superior performance while retaining the inference speedups and substantial memory usage reductions of 2:4 pruning, establishing a more effective trade-off between model compression and task accuracy.
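For readers unfamiliar with 2:4 semi-structured sparsity, the sketch below shows the basic pattern ARMOR builds on: within every group of four weights, keep the two largest magnitudes and zero the rest. The block-diagonal wrappers and the block coordinate descent optimization that distinguish ARMOR are not shown.

```python
# Minimal sketch of a 2:4 semi-structured sparsity mask (keep 2 of every 4 weights).
import torch

def two_four_sparsify(w):
    flat = w.reshape(-1, 4)
    idx = flat.abs().topk(2, dim=1).indices
    mask = torch.zeros_like(flat).scatter_(1, idx, 1.0)
    return (flat * mask).reshape(w.shape)

w = torch.randn(8, 16)                                      # weight matrix, numel divisible by 4
print((two_four_sparsify(w) != 0).float().mean().item())    # ~0.5 density
```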
【5】Transfer Learning on Edge Connecting Probability Estimation under Graphon Model
标题:Graphon模型下边连接概率估计的转移学习
链接:https://arxiv.org/abs/2510.05527
作者:Yuyao Wang, Yu-Hung Cheng, Debarghya Mukherjee, Huimin Cheng
摘要:Graphon模型为估计网络中的潜在连接概率提供了一个灵活的非参数框架,从而实现了一系列下游应用,如链接预测和数据增强。然而,准确的图子估计通常需要一个大的图,而在实践中,人们往往只观察到一个小规模的网络。解决这个问题的一种方法是采用迁移学习框架,其目的是通过利用来自更大的相关源图的结构信息来改善小目标图中的估计。在本文中,我们提出了一种新的方法,即GTRANS,这是一种迁移学习框架,它集成了邻域平滑和Gromov-Wasserstein最优传输,以对齐和转移图之间的结构模式。为了防止负传递,GTRANS包括自适应去偏置机制,该机制通过残差平滑来识别和校正目标特定偏差。我们提供了理论上的保证估计对齐矩阵的稳定性,并通过广泛的合成和真实数据实验证明GTRANS在提高目标图估计的准确性。这些改进直接转化为下游应用程序的性能增强,例如图分类任务和链接预测任务。
摘要:Graphon models provide a flexible nonparametric framework for estimating latent connectivity probabilities in networks, enabling a range of downstream applications such as link prediction and data augmentation. However, accurate graphon estimation typically requires a large graph, whereas in practice, one often only observes a small-sized network. One approach to addressing this issue is to adopt a transfer learning framework, which aims to improve estimation in a small target graph by leveraging structural information from a larger, related source graph. In this paper, we propose a novel method, namely GTRANS, a transfer learning framework that integrates neighborhood smoothing and Gromov-Wasserstein optimal transport to align and transfer structural patterns between graphs. To prevent negative transfer, GTRANS includes an adaptive debiasing mechanism that identifies and corrects for target-specific deviations via residual smoothing. We provide theoretical guarantees on the stability of the estimated alignment matrix and demonstrate the effectiveness of GTRANS in improving the accuracy of target graph estimation through extensive synthetic and real data experiments. These improvements translate directly to enhanced performance in downstream applications, such as the graph classification task and the link prediction task.
【6】AMAQ: Adaptive Mixed-bit Activation Quantization for Collaborative Parameter Efficient Fine-tuning
标题:AMAQ:自适应混合位激活量化,用于协作参数高效微调
链接:https://arxiv.org/abs/2510.05468
作者:Yurun Song, Zhuoyi Yang, Ian G. Harris, Sangeetha Abdu Jyothi
备注:14 pages
摘要:大型语言模型(LLM)正在快速扩展,为协作式服务器-客户端分布式训练带来了重大挑战,特别是在通信效率和计算开销方面。为了应对这些挑战,我们实现了参数高效的分割学习,它有效地平衡了低资源设备上协作训练的效率和性能。为了减少协作训练中的通信开销,我们引入了自适应混合位激活量化(AMAQ),这是一种将激活和梯度从高精度(6到8位)逐步压缩到低精度(3到4位)的策略。AMAQ通过位正则化,基于特征和层的重要性在通道间有效分配位预算来实现这一点。在相同的位预算下,AMAQ优于固定精度方法,在LLaMA3 8B和Qwen2.5 7B等模型上,生成准确率提高约2.5%,分类准确率提高约1.3%。此外,它还显著增强了训练稳定性,并减少了训练过程中的超低位表示崩溃。实验表明,AMAQ可以有效地集成到实际的多机协作训练设置中,只需在训练期间为位自适应付出适度的通信开销,即可提供卓越的推理精度。这种权衡使AMAQ成为一种实用而有效的、通信成本最小的协作训练解决方案。
摘要:Large Language Models (LLMs) are scaling rapidly, creating significant challenges for collaborative server client distributed training, particularly in terms of communication efficiency and computational overheads. To address these challenges, we implement Parameter-efficient Split Learning, which effectively balances efficiency and performance for collaborative training on low-resource devices. To reduce communication overhead in collaborative training, we introduce Adaptive Mixed bit Activation Quantization (AMAQ), a strategy that progressively compresses activations and gradients from high precision (6 to 8 bits) to low precision (3 to 4 bits). AMAQ achieves this by effectively allocating bit budgets across channels based on feature wise and layer wise importance using bit regularization. Under the same bit budgets, AMAQ outperforms fixed-precision approaches, delivering about 2.5% higher generation accuracy and about 1.3% better classification accuracy for models like LLaMA3 8B and Qwen2.5 7B. In addition, it significantly enhances training stability and reduces ultra-low bit representation collapse during the training. Experiments demonstrate that AMAQ integrates effectively into practical multi-machine collaborative training setups, offering superior inference accuracy with only a modest communication overhead for bits adaptation during training. This trade-off makes AMAQ a practical and effective solution for collaborative training with minimal communication cost.
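A rough sketch of importance-based mixed-bit activation quantization is given below: channels with larger average magnitude receive more bits, and each channel is then fake-quantized at its allotted precision. The allocation rule and quantizer are simplifications of the idea, not the paper's bit-regularization scheme.

```python
# Minimal sketch of per-channel mixed-bit fake quantization driven by channel importance.
# The allocation rule is an illustrative assumption, not AMAQ's bit regularization.
import torch

def fake_quantize(x, bits):
    levels = 2 ** bits - 1
    scale = x.abs().max().clamp(min=1e-8) / levels
    return torch.round(x / scale) * scale

def mixed_bit_quantize(acts, low=3, high=8):
    importance = acts.abs().mean(dim=0)                     # per-channel magnitude
    ranks = importance.argsort().argsort().float() / (acts.shape[1] - 1)
    bits = (low + ranks * (high - low)).round().int()       # more bits for important channels
    out = torch.empty_like(acts)
    for c in range(acts.shape[1]):
        out[:, c] = fake_quantize(acts[:, c], int(bits[c]))
    return out, bits

acts = torch.randn(16, 8) * torch.linspace(0.1, 2.0, 8)
q_acts, bits = mixed_bit_quantize(acts)
print(bits.tolist())
```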
【7】AD-NODE: Adaptive Dynamics Learning with Neural ODEs for Mobile Robots Control
标题:AD-NODE:使用神经ODE的自适应动力学学习用于移动机器人控制
链接:https://arxiv.org/abs/2510.05443
作者:Shao-Yi Yu, Jen-Wei Wang, Maya Horii, Vikas Garg, Tarek Zohdi
摘要:地面车辆和四旋翼等移动机器人在从物流到农业的各个领域变得越来越重要,它们在人类难以进入的环境中实现了流程自动化。然而,为了在不确定的环境中使用基于模型的控制器有效运行,这些系统需要能够响应环境变化的动力学模型,特别是在难以直接获取环境信息时。为了实现这种自适应性并促进与模型预测控制的集成,我们提出了一种自适应动力学模型,它通过从状态-动作历史中推断运行环境,绕过了对直接环境知识的需求。该动力学模型基于神经常微分方程,并使用两阶段训练过程来学习潜在环境表示。我们通过在三个复杂性递增的机器人平台上执行目标到达和路径跟踪任务,证明了我们方法的有效性:具有变化车轮接触条件的2D差速轮式机器人、处于变化风场中的3D四旋翼,以及在两种接触条件下进行真实世界部署的Sphero BOLT机器人。实证结果证实,我们的方法能够在仿真和真实系统中处理时间和空间上变化的环境变化。
摘要:Mobile robots, such as ground vehicles and quadrotors, are becoming increasingly important in various fields, from logistics to agriculture, where they automate processes in environments that are difficult to access for humans. However, to perform effectively in uncertain environments using model-based controllers, these systems require dynamics models capable of responding to environmental variations, especially when direct access to environmental information is limited. To enable such adaptivity and facilitate integration with model predictive control, we propose an adaptive dynamics model which bypasses the need for direct environmental knowledge by inferring operational environments from state-action history. The dynamics model is based on neural ordinary differential equations, and a two-phase training procedure is used to learn latent environment representations. We demonstrate the effectiveness of our approach through goal-reaching and path-tracking tasks on three robotic platforms of increasing complexity: a 2D differential wheeled robot with changing wheel contact conditions, a 3D quadrotor in variational wind fields, and the Sphero BOLT robot under two contact conditions for real-world deployment. Empirical results corroborate that our method can handle temporally and spatially varying environmental changes in both simulation and real-world systems.
【8】MT-DAO: Multi-Timescale Distributed Adaptive Optimizers with Local Updates
标题:MT-DAO:具有本地更新的多时间尺度分布式自适应优化器
链接:https://arxiv.org/abs/2510.05361
作者:Alex Iacob, Andrej Jovanovic, Mher Safaryan, Meghdad Kurmanji, Lorenzo Sani, Samuel Horváth, William F. Shen, Xinchi Qiu, Nicholas D. Lane
备注:Submitted to the ICLR 2026 Conference
摘要:使用分布式数据并行(DDP)训练大型模型需要在工作节点之间频繁地进行梯度通信,这可能会使带宽饱和。低频通信策略(例如Local SGD)减少了这种开销,但当应用于自适应优化器时,相对于完全同步的DDP通常会出现性能差距。我们将这一差距归因于时间尺度不匹配:优化器的快速移动动量是针对频繁更新调优的,衰减得太快,无法在长时间间隔内平滑梯度,导致优化被噪声主导。为了解决这个问题,我们提出了MT-DAO,这是一个优化器家族,它采用多个快慢不同的一阶动量或梯度来跟踪不同时间尺度上的更新动态,并为此首次提供了收敛保证。经验上,对于语言模型预训练,这消除了与DDP的性能差距,在困惑度方面优于低频通信基线,并将以太网互连上的等token(iso-token)挂钟时间减少了6-27%。在720M参数规模下,MT-DAO达到目标困惑度所需的步数比单动量DDP基线少24%,所需时间少35%。MT-DAO支持有效的跨数据中心训练以及跨广阔地理区域的训练。
摘要:Training large models with distributed data parallelism (DDP) requires frequent communication of gradients across workers, which can saturate bandwidth. Infrequent communication strategies (e.g., Local SGD) reduce this overhead but, when applied to adaptive optimizers, often suffer a performance gap relative to fully synchronous DDP. We trace this gap to a time-scale mismatch: the optimizer's fast-moving momentum, tuned for frequent updates, decays too quickly to smooth gradients over long intervals, leading to noise-dominated optimization. To address this, we propose MT-DAO, a family of optimizers that employs multiple slow- and fast-moving first momenta or the gradient to track update dynamics across different time scales, for which we provide the first convergence guarantees. Empirically, for language-model pre-training, this eliminates the performance gap with DDP, outperforming infrequent-communication baselines in perplexity and reducing iso-token wall-clock time by 6-27% on Ethernet interconnects. At the 720M scale, MT-DAO reaches a target perplexity in 24% fewer steps and 35% less time than the single-momentum DDP baseline. MT-DAO enables effective cross-datacenter training and training over wide geographic areas.
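The multi-timescale idea can be sketched with an optimizer that keeps a fast and a slow exponential moving average of the gradient and mixes them into one update. The coefficients and the mixing rule below are illustrative assumptions, not the MT-DAO algorithm or its convergence-guaranteed form.

```python
# Minimal sketch of mixing a fast and a slow gradient momentum in one optimizer step.
# Coefficients and the mixing rule are illustrative assumptions, not MT-DAO itself.
import torch

class TwoTimescaleSGD:
    def __init__(self, params, lr=0.1, beta_fast=0.9, beta_slow=0.999, mix=0.5):
        self.params = list(params)
        self.lr, self.bf, self.bs, self.mix = lr, beta_fast, beta_slow, mix
        self.m_fast = [torch.zeros_like(p) for p in self.params]
        self.m_slow = [torch.zeros_like(p) for p in self.params]

    @torch.no_grad()
    def step(self):
        for p, mf, ms in zip(self.params, self.m_fast, self.m_slow):
            if p.grad is None:
                continue
            mf.mul_(self.bf).add_(p.grad, alpha=1 - self.bf)   # fast momentum
            ms.mul_(self.bs).add_(p.grad, alpha=1 - self.bs)   # slow momentum
            p.add_(self.mix * mf + (1 - self.mix) * ms, alpha=-self.lr)

model = torch.nn.Linear(4, 1)
opt = TwoTimescaleSGD(model.parameters())
model(torch.randn(8, 4)).pow(2).mean().backward()
opt.step()
```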
【9】Margin Adaptive DPO: Leveraging Reward Model for Granular Control in Preference Optimization
标题:边际自适应DPO:利用奖励模型在偏好优化中进行细粒度控制
链接:https://arxiv.org/abs/2510.05342
作者:Hyung Gyu Rho
摘要:直接偏好优化(DPO)已经成为一种简单而有效的对齐大型语言模型的方法。然而,它对固定温度参数的依赖导致对不同偏好数据的次优训练,导致对简单示例的过度拟合和对信息丰富的示例的学习不足。最近出现了一些方法来解决这个问题。虽然IPO解决了一般的过拟合,但它的均匀正则化可能过于保守。更有针对性的$\beta$-DPO方法有其自身的局限性:其批量级自适应将单一的折衷温度应用于混合边际对,其线性更新规则可能产生不稳定的负$\beta$值,其过滤机制会丢弃潜在有用的训练信号。在这项工作中,我们介绍了边际自适应直接偏好优化(MADPO),一种提供稳定、保留数据且作用于实例级的解决方案。MADPO采用了一种实用的两步方法:首先训练一个奖励模型来估计偏好边际,然后使用这些边际对每个训练样本的DPO损失应用连续的自适应权重。这种重新加权方案创建了有效的目标边际,对困难样本对加以放大,对容易样本对加以抑制,从而允许对学习信号进行细粒度控制。我们提供了全面的理论分析,证明MADPO具有良好的优化景观,并且对奖励模型的估计误差具有鲁棒性。我们通过情感生成任务的实验验证了我们的理论,其中MADPO在不同质量的数据集上始终显著优于强基线。它在高质量数据上实现了高达+33.3%的性能增益,在低质量数据上实现了+10.5%的性能增益。我们的研究结果确立了MADPO作为一种更鲁棒、更有原则的偏好对齐方法。
摘要:Direct Preference Optimization (DPO) has emerged as a simple and effective method for aligning large language models. However, its reliance on a fixed temperature parameter leads to suboptimal training on diverse preference data, causing overfitting on easy examples and under-learning from informative ones. Recent methods have emerged to counter this. While IPO addresses general overfitting, its uniform regularization can be overly conservative. The more targeted approach of $\beta$-DPO suffers from its own limitations: its batch-level adaptation applies a single, compromised temperature to mixed-margin pairs, its linear update rule can produce unstable negative $\beta$ values, and its filtering mechanism discards potentially useful training signals. In this work, we introduce Margin-Adaptive Direct Preference Optimization (MADPO), a method that provides a stable, data-preserving, and instance-level solution. MADPO employs a practical two-step approach: it first trains a reward model to estimate preference margins and then uses these margins to apply a continuous, adaptive weight to the DPO loss for each individual training sample. This re-weighting scheme creates an effective target margin that is amplified for hard pairs and dampened for easy pairs, allowing for granular control over the learning signal. We provide a comprehensive theoretical analysis, proving that MADPO has a well-behaved optimization landscape and is robust to reward model estimation errors. We validate our theory with experiments on a sentiment generation task, where MADPO consistently and significantly outperforms strong baselines across datasets of varying quality. It achieves performance gains of up to +33.3% on High Quality data and +10.5% on Low Quality data over the next-best method. Our results establish MADPO as a more robust and principled approach to preference alignment.
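The instance-level re-weighting can be sketched on top of the standard DPO loss: each preference pair's loss is scaled by a weight derived from its estimated reward-model margin, larger for hard (small-margin) pairs. The specific weight function below is an assumption; the paper derives its own effective-target-margin formulation.

```python
# Minimal sketch of margin-adaptive weighting applied to a standard DPO loss.
# The weight function is an illustrative assumption, not MADPO's exact scheme.
import torch
import torch.nn.functional as F

def margin_weighted_dpo_loss(pi_logp_w, pi_logp_l, ref_logp_w, ref_logp_l, rm_margin, beta=0.1):
    logits = beta * ((pi_logp_w - ref_logp_w) - (pi_logp_l - ref_logp_l))
    per_pair = -F.logsigmoid(logits)                 # standard DPO per-pair loss
    weights = 2.0 * torch.sigmoid(-rm_margin)        # amplify hard pairs, dampen easy ones
    return (weights * per_pair).mean()

pi_w, pi_l = torch.randn(4), torch.randn(4)          # policy log-probs (chosen / rejected)
ref_w, ref_l = torch.randn(4), torch.randn(4)        # reference log-probs
margin = torch.tensor([0.2, 1.5, -0.3, 3.0])         # reward-model preference margins
print(margin_weighted_dpo_loss(pi_w, pi_l, ref_w, ref_l, margin).item())
```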
【10】Decoding Partial Differential Equations: Cross-Modal Adaptation of Decoder-only Models to PDEs
标题:解码偏微分方程:仅解码器模型对偏微分方程的跨模态适应
链接:https://arxiv.org/abs/2510.05278
作者:Paloma García-de-Herreros, Philipp Slusallek, Dietrich Klakow, Vagrant Gautam
摘要:近年来,大型语言模型在自然语言任务上取得了巨大成功,而在适应新的模态(例如科学机器学习任务)时也表现出了巨大的潜力。尽管仅解码器模型在NLP中更受欢迎,并且在生成自然语言方面扩展得非常好,但大多数已提出的跨模态自适应方法都集中在仅编码器模型上,这就提出了模型架构如何影响这些方法的问题。因此,在本文中,我们进行了一系列消融研究来回答这个问题,针对基于偏微分方程(PDE)的时间依赖仿真任务,系统地比较仅编码器和仅解码器模型的跨模态自适应。我们发现,当现有方法不加修改地应用时,仅解码器模型的表现远不如仅编码器模型。与其他一些领域不同,扩大仅解码器模型的规模也无济于事。为了在这种情况下发挥仅解码器模型的潜力,我们引入了两种新方法:并行翻转(Parallel Flipping)和序列倍增(Sequence Doubling),试图在自回归模型中模仿双向性。我们的两种方法在所有任务和所有跨模态自适应方法上都提升了仅解码器模型的整体性能,缩小了与仅编码器模型性能的差距。我们希望我们的研究结果能够拓宽用于跨模态适应任务的模型范围,以进一步推进科学ML。
摘要:Large language models have shown great success on natural language tasks in recent years, but they have also shown great promise when adapted to new modalities, e.g., for scientific machine learning tasks. Even though decoder-only models are more popular within NLP and scale exceedingly well at generating natural language, most proposed approaches for cross-modal adaptation focus on encoder-only models, raising the question of how model architecture affects these approaches. In this paper, we therefore perform a series of ablation studies to answer this question, systematically comparing encoder-only and decoder-only models on cross-modal adaptation for time-dependent simulation tasks based on partial differential equations (PDEs). We find that decoder-only models are far worse than encoder-only models, when existing approaches are applied unmodified. In contrast to several other domains, scaling decoder-only models also does not help. To harness the potential of decoder-only models in this context, we introduce two novel approaches, Parallel Flipping and Sequence Doubling, attempting to mimic bidirectionality in autoregressive models. Both our methods improve overall performance using decoder-only models for all tasks and all cross-model adaptation methods, closing the gap to encoder-only model performance. We hope that our findings broaden the spectrum of models used on cross-modal adaptation tasks to further scientific ML.
【11】Adaptive Reinforcement Learning for Dynamic Configuration Allocation in Pre-Production Testing
标题:预生产测试中动态配置分配的自适应强化学习
链接:https://arxiv.org/abs/2510.05147
作者:Yu Zhu
摘要:确保现代软件系统的可靠性需要在高度异构且不断演化的环境中进行严格的生产前测试。由于详尽的评估不可行,从业者必须决定如何在故障概率可能随时间漂移的各种配置之间分配有限的测试资源。现有的组合优化方法是静态的、临时性的,不适合这种非平稳设置。我们引入了一种新的强化学习(RL)框架,将配置分配重新表述为顺序决策问题。我们的方法首次将Q学习与混合奖励设计相结合,该设计融合了模拟结果和实时反馈,从而实现了样本效率和鲁棒性。此外,我们开发了一种自适应的在线-离线训练方案,允许智能体快速跟踪突然的概率变化,同时保持长期稳定性。广泛的模拟研究表明,我们的方法始终优于静态和基于优化的基线,接近oracle性能。这项工作将RL确立为自适应配置分配的强大新范式,超越了传统方法,并可广泛应用于动态测试和资源调度领域。
摘要:Ensuring reliability in modern software systems requires rigorous pre-production testing across highly heterogeneous and evolving environments. Because exhaustive evaluation is infeasible, practitioners must decide how to allocate limited testing resources across configurations where failure probabilities may drift over time. Existing combinatorial optimization approaches are static, ad hoc, and poorly suited to such non-stationary settings. We introduce a novel reinforcement learning (RL) framework that recasts configuration allocation as a sequential decision-making problem. Our method is the first to integrate Q-learning with a hybrid reward design that fuses simulated outcomes and real-time feedback, enabling both sample efficiency and robustness. In addition, we develop an adaptive online-offline training scheme that allows the agent to quickly track abrupt probability shifts while maintaining long-run stability. Extensive simulation studies demonstrate that our approach consistently outperforms static and optimization-based baselines, approaching oracle performance. This work establishes RL as a powerful new paradigm for adaptive configuration allocation, advancing beyond traditional methods and offering broad applicability to dynamic testing and resource scheduling domains.
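A stripped-down sketch of Q-learning with a hybrid reward is shown below, reduced to a single-state (bandit-like) setting for brevity: the reward blends a simulator's failure estimate with observed real-time feedback so that test effort concentrates on failure-prone configurations. The environment, blending weight, and schedules are illustrative assumptions, not the paper's adaptive online-offline scheme.

```python
# Minimal single-state sketch of Q-learning with a hybrid (simulated + observed) reward.
# The environment and blending weight are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n_configs = 5                                        # actions = which configuration to test
fail_prob = np.array([0.1, 0.3, 0.5, 0.2, 0.4])      # hidden failure rates (may drift in practice)
Q = np.zeros(n_configs)
alpha, eps, lam = 0.1, 0.2, 0.5

for step in range(2000):
    a = rng.integers(n_configs) if rng.random() < eps else int(Q.argmax())
    simulated = fail_prob[a] + rng.normal(scale=0.05)      # simulator's failure estimate
    observed = float(rng.random() < fail_prob[a])          # real-time test outcome
    reward = lam * simulated + (1 - lam) * observed        # hybrid reward: reward finding failures
    Q[a] += alpha * (reward - Q[a])

print(Q.round(3))   # failure-prone configurations accumulate higher value
```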
【12】Adaptive Pruning for Increased Robustness and Reduced Computational Overhead in Gaussian Process Accelerated Saddle Point Searches
标题:自适应修剪以提高高斯过程加速鞍点搜索中的鲁棒性并减少计算开销
链接:https://arxiv.org/abs/2510.06030
作者:Rohit Goswami (1), Hannes Jónsson (1) ((1) Science Institute and Faculty of Physical Sciences, University of Iceland, Reykjavík, Iceland)
备注:Invited article for the ChemPhysChem special issue dedicated to the 60th birthday of Prof. Debabrata Goswami. A preliminary version of this work was presented at the UNOOS 2025 conference
摘要:高斯过程(GP)回归提供了一种策略,通过减少需要评估能量及其对原子坐标导数的次数,来加速高维能量面上的鞍点搜索。然而,超参数优化中的计算开销可能很大,使该方法效率低下。如果搜索进入GP模型未能充分表示的区域,也可能发生失败。在这里,这些挑战通过以下方式得到解决:使用几何感知的最优传输度量和主动修剪策略,即在最远点采样中对每种原子类型的Wasserstein-1距离求和,从而选择一个固定大小、几何上多样的构型子集,以避免随着观测增多GP更新成本迅速增加。稳定性则通过置换不变的度量得到增强,该度量为提前停止提供了可靠的信任半径,并对信号方差的增长施加对数障碍罚项。这些具有物理动机的算法改动证明了其有效性,将来自先前发表的化学反应数据集中238个具有挑战性的构型的平均计算时间减少到不到一半。通过这些改进,GP方法被确立为一种强大且可扩展的算法,可在能量和原子力的评估需要大量计算量时加速鞍点搜索。
摘要:Gaussian process (GP) regression provides a strategy for accelerating saddle point searches on high-dimensional energy surfaces by reducing the number of times the energy and its derivatives with respect to atomic coordinates need to be evaluated. The computational overhead in the hyperparameter optimization can, however, be large and make the approach inefficient. Failures can also occur if the search ventures too far into regions that are not represented well enough by the GP model. Here, these challenges are resolved by using geometry-aware optimal transport measures and an active pruning strategy using a summation over Wasserstein-1 distances for each atom-type in farthest-point sampling, selecting a fixed-size subset of geometrically diverse configurations to avoid rapidly increasing cost of GP updates as more observations are made. Stability is enhanced by a permutation-invariant metric that provides a reliable trust radius for early-stopping and a logarithmic barrier penalty for the growth of the signal variance. These physically motivated algorithmic changes prove their efficacy by reducing to less than a half the mean computational time on a set of 238 challenging configurations from a previously published data set of chemical reactions. With these improvements, the GP approach is established as a robust and scalable algorithm for accelerating saddle point searches when the evaluation of the energy and atomic forces requires significant computational effort.
【13】Hybrid Quantum-Classical Policy Gradient for Adaptive Control of Cyber-Physical Systems: A Comparative Study of VQC vs. MLP
标题:网络物理系统自适应控制的混合量子-经典策略梯度:VQC与MLP的比较研究
链接:https://arxiv.org/abs/2510.06010
作者:Aueaphum Aueawatthanaphisut, Nyi Wunna Tun
备注:6 pages, 5 figures, 2 tables, 17 equations, 1 algorithm
摘要:对经典和量子强化学习(QRL)范式进行了比较评估,以研究它们的收敛行为、观测噪声下的鲁棒性以及基准控制环境中的计算效率。该研究采用多层感知器(MLP)代理作为经典基线,参数化变分量子电路(VQC)作为量子对应物,两者都在CartPole-v1环境中训练超过500集。实证结果表明,经典的MLP实现了近最优的政策收敛,平均收益率为498.7 +/- 3.2,在整个训练过程中保持稳定的均衡。相比之下,VQC表现出有限的学习能力,平均回报率为14.6 +/- 4.8,主要受电路深度和量子位连接性的限制。噪声鲁棒性分析进一步表明,MLP策略在高斯扰动下优雅地恶化,而VQC在等效噪声水平下显示出更高的灵敏度。尽管渐近性能较低,但VQC表现出显著较低的参数计数和略微增加的训练时间,突出了其对低资源量子处理器的潜在可扩展性。结果表明,虽然经典的神经策略在当前的控制基准中仍然占主导地位,但一旦硬件噪声和表现力限制得到缓解,量子增强架构可以提供有前途的效率优势。
摘要:The comparative evaluation between classical and quantum reinforcement learning (QRL) paradigms was conducted to investigate their convergence behavior, robustness under observational noise, and computational efficiency in a benchmark control environment. The study employed a multilayer perceptron (MLP) agent as a classical baseline and a parameterized variational quantum circuit (VQC) as a quantum counterpart, both trained on the CartPole-v1 environment over 500 episodes. Empirical results demonstrated that the classical MLP achieved near-optimal policy convergence with a mean return of 498.7 +/- 3.2, maintaining stable equilibrium throughout training. In contrast, the VQC exhibited limited learning capability, with an average return of 14.6 +/- 4.8, primarily constrained by circuit depth and qubit connectivity. Noise robustness analysis further revealed that the MLP policy deteriorated gracefully under Gaussian perturbations, while the VQC displayed higher sensitivity at equivalent noise levels. Despite the lower asymptotic performance, the VQC exhibited significantly lower parameter count and marginally increased training time, highlighting its potential scalability for low-resource quantum processors. The results suggest that while classical neural policies remain dominant in current control benchmarks, quantum-enhanced architectures could offer promising efficiency advantages once hardware noise and expressivity limitations are mitigated.
强化学习(4篇)
【1】Multi-Task Reinforcement Learning with Language-Encoded Gated Policy Networks
标题:使用语言编码门控策略网络的多任务强化学习
链接:https://arxiv.org/abs/2510.06138
作者:Rushiv Arora
备注:14 pages, 3 figures, 12 tables, 2 appendices. Currently under review
摘要:多任务强化学习通常依赖于任务元数据(如简短的自然语言描述)来指导不同目标的行为。我们提出了词汇策略网络(LEXPOL),一种用于多任务RL的语言条件混合策略架构。LEXPOL使用文本编码器对任务元数据进行编码,并使用学习门控模块在多个子策略中进行选择或混合,从而实现跨任务的端到端培训。在MetaWorld基准测试中,LEXPOL在成功率和样本效率方面与强大的多任务基准相匹配或超过,而无需特定于任务的再培训。为了分析该机制,我们进一步研究了独立于门获得的固定专家策略的设置,并表明学习的语言门组成这些专家,以产生适合于新任务描述和看不见的任务组合的行为。这些结果表明,自然语言元数据可以有效地索引和重组可重用的技能在一个单一的政策。
摘要:Multi-task reinforcement learning often relies on task metadata -- such as brief natural-language descriptions -- to guide behavior across diverse objectives. We present Lexical Policy Networks (LEXPOL), a language-conditioned mixture-of-policies architecture for multi-task RL. LEXPOL encodes task metadata with a text encoder and uses a learned gating module to select or blend among multiple sub-policies, enabling end-to-end training across tasks. On MetaWorld benchmarks, LEXPOL matches or exceeds strong multi-task baselines in success rate and sample efficiency, without task-specific retraining. To analyze the mechanism, we further study settings with fixed expert policies obtained independently of the gate and show that the learned language gate composes these experts to produce behaviors appropriate to novel task descriptions and unseen task combinations. These results indicate that natural-language metadata can effectively index and recombine reusable skills within a single policy.
【2】From Learning to Mastery: Achieving Safe and Efficient Real-World Autonomous Driving with Human-In-The-Loop Reinforcement Learning
标题:从学习到掌握:通过人在环强化学习实现安全有效的现实世界自动驾驶
链接:https://arxiv.org/abs/2510.06038
作者:Li Zeqiao, Wang Yijing, Wang Haoyu, Li Zheng, Li Peng, Liu Wenfei, Zuo Zhiqiang
摘要:具有强化学习(RL)的自动驾驶具有巨大的潜力。然而,在现实世界中应用RL仍然具有挑战性,因为需要安全,高效和强大的学习。将人类的专业知识融入到学习过程中,可以通过减少风险探索和提高样本效率来帮助克服这些挑战。在这项工作中,我们提出了一种无奖励的,主动的人在回路学习方法,称为人类引导的分布式软演员-评论家(H-DSAC)。我们的方法结合了代理值传播(PVP)和分布式软演员-评论家(DSAC),以实现在现实环境中的高效和安全的训练。关键的创新是在DSAC框架内构建分布式代理值函数。该函数通过将更高的预期回报分配给专家演示并惩罚需要人工干预的行为来编码人类意图。通过将这些标签外推到未标记的状态,策略被有效地引导到专家般的行为。通过设计良好的状态空间,我们的方法在实际训练时间内实现了真实世界的驾驶策略学习。模拟和真实世界的实验结果表明,我们的框架可以实现安全,强大和样本效率的自动驾驶学习。
摘要:Autonomous driving with reinforcement learning (RL) has significant potential. However, applying RL in real-world settings remains challenging due to the need for safe, efficient, and robust learning. Incorporating human expertise into the learning process can help overcome these challenges by reducing risky exploration and improving sample efficiency. In this work, we propose a reward-free, active human-in-the-loop learning method called Human-Guided Distributional Soft Actor-Critic (H-DSAC). Our method combines Proxy Value Propagation (PVP) and Distributional Soft Actor-Critic (DSAC) to enable efficient and safe training in real-world environments. The key innovation is the construction of a distributed proxy value function within the DSAC framework. This function encodes human intent by assigning higher expected returns to expert demonstrations and penalizing actions that require human intervention. By extrapolating these labels to unlabeled states, the policy is effectively guided toward expert-like behavior. With a well-designed state space, our method achieves real-world driving policy learning within practical training times. Results from both simulation and real-world experiments demonstrate that our framework enables safe, robust, and sample-efficient learning for autonomous driving.
【3】Oracle-Guided Masked Contrastive Reinforcement Learning for Visuomotor Policies
标题:面向视觉运动策略的Oracle引导掩蔽对比强化学习
链接:https://arxiv.org/abs/2510.05692
作者:Yuhang Zhang, Jiaping Xiao, Chao Yan, Mir Feroskhan
摘要:学习视觉运动策略的一种流行方法是采用强化学习将高维视觉观察直接映射到动作命令。然而,高维视觉输入与敏捷机动输出的结合带来了长期存在的挑战,包括采样效率低和显著的模拟到现实差距。为了解决这些问题,我们提出了Oracle引导的掩蔽对比强化学习(OMC-RL),这是一种旨在提高视觉运动策略学习的样本效率和渐近性能的新框架。OMC-RL明确地将学习过程分为两个阶段:上游表示学习阶段和下游策略学习阶段。在上游阶段,使用时序建模和对比学习训练一个掩蔽Transformer模块,以从连续的视觉输入中提取时间感知且与任务相关的表示。训练之后,学习到的编码器被冻结并用于从连续帧中提取视觉表示,而Transformer模块被丢弃。在下游阶段,在早期训练期间,一个可特权访问全局状态信息的oracle教师策略监督智能体,以提供有用的指导并加速早期策略学习。随着训练的进行,这种指导逐渐减少,以允许独立探索。在模拟和现实环境中的大量实验表明,OMC-RL实现了卓越的样本效率和渐近策略性能,同时还提高了在各种感知复杂场景中的泛化能力。
摘要:A prevailing approach for learning visuomotor policies is to employ reinforcement learning to map high-dimensional visual observations directly to action commands. However, the combination of high-dimensional visual inputs and agile maneuver outputs leads to long-standing challenges, including low sample efficiency and significant sim-to-real gaps. To address these issues, we propose Oracle-Guided Masked Contrastive Reinforcement Learning (OMC-RL), a novel framework designed to improve the sample efficiency and asymptotic performance of visuomotor policy learning. OMC-RL explicitly decouples the learning process into two stages: an upstream representation learning stage and a downstream policy learning stage. In the upstream stage, a masked Transformer module is trained with temporal modeling and contrastive learning to extract temporally-aware and task-relevant representations from sequential visual inputs. After training, the learned encoder is frozen and used to extract visual representations from consecutive frames, while the Transformer module is discarded. In the downstream stage, an oracle teacher policy with privileged access to global state information supervises the agent during early training to provide informative guidance and accelerate early policy learning. This guidance is gradually reduced to allow independent exploration as training progresses. Extensive experiments in simulated and real-world environments demonstrate that OMC-RL achieves superior sample efficiency and asymptotic policy performance, while also improving generalization across diverse and perceptually complex scenarios.
【4】Let it Calm: Exploratory Annealed Decoding for Verifiable Reinforcement Learning
标题:让它平静下来:可验证强化学习的探索性退火解码
链接:https://arxiv.org/abs/2510.05251
作者:Chenghao Yang, Lin Gui, Chenxiao Yang, Victor Veitch, Lizhu Zhang, Zhuokai Zhao
备注:Codebase: this https URL
摘要:具有可验证奖励的强化学习(RLVR)是增强大型语言模型(LLM)推理能力的强大范式,但其成功取决于有效的探索。理想的探索策略必须克服两个基本挑战:既要保持样本质量,又要确保训练稳定性。虽然标准的固定温度采样很简单,但它很难平衡这些相互竞争的需求,因为高温会降低样本质量,而低温会限制发现。在这项工作中,我们提出了一种更简单、更有效的策略——探索性退火解码(EAD),其依据的见解是:探索对定义序列语义方向的早期token影响最大。EAD通过在生成过程中将采样温度从高到低退火,实现了直观的"开始时探索、结束时利用"策略。这种动态调度在开始时鼓励有意义的高层次多样性,然后逐渐降低温度以保持样本质量,并使采样分布接近目标策略,这对于稳定训练至关重要。我们证明,EAD是一种轻量级的即插即用方法,可显著提高采样效率,在各种RLVR算法和模型规模上始终优于固定温度采样。我们的工作表明,使探索与顺序生成的自然动态相一致,为改进LLM推理提供了一条稳健的途径。
摘要:Reinforcement learning with verifiable rewards (RLVR) is a powerful paradigm for enhancing the reasoning capabilities of large language models (LLMs), yet its success hinges on effective exploration. An ideal exploration strategy must navigate two fundamental challenges: it must preserve sample quality while also ensuring training stability. While standard fixed-temperature sampling is simple, it struggles to balance these competing demands, as high temperatures degrade sample quality and low temperatures limit discovery. In this work, we propose a simpler and more effective strategy, Exploratory Annealed Decoding (EAD), grounded in the insight that exploration is most impactful on early tokens which define a sequence's semantic direction. EAD implements an intuitive **explore-at-the-beginning, exploit-at-the-end** strategy by annealing the sampling temperature from high to low during generation. This dynamic schedule encourages meaningful, high-level diversity at the start, then gradually lowers the temperature to preserve sample quality and keep the sampling distribution close to the target policy, which is essential for stable training. We demonstrate that EAD is a lightweight, plug-and-play method that significantly improves sample efficiency, consistently outperforming fixed-temperature sampling across various RLVR algorithms and model sizes. Our work suggests that aligning exploration with the natural dynamics of sequential generation offers a robust path to improving LLM reasoning.
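Temperature annealing over the generated sequence is easy to sketch: sample each token with a temperature interpolated from a high starting value to a low final value. The fixed random logit table stands in for a language model, and the linear schedule is an assumed annealing shape, not necessarily the paper's.

```python
# Minimal sketch of annealed-temperature decoding (high temperature early, low temperature late).
# The "language model" is a fixed random logit table; the linear schedule is an assumption.
import numpy as np

def annealed_decode(logits_fn, length, t_start=1.2, t_end=0.3, seed=0):
    rng = np.random.default_rng(seed)
    tokens = []
    for i in range(length):
        t = t_start + (t_end - t_start) * i / max(length - 1, 1)
        logits = logits_fn(tokens) / t
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        tokens.append(int(rng.choice(len(probs), p=probs)))
    return tokens

vocab_logits = np.random.default_rng(1).normal(size=50)
print(annealed_decode(lambda toks: vocab_logits, length=10))
```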
元学习(1篇)
【1】Prior-Aligned Meta-RL: Thompson Sampling with Learned Priors and Guarantees in Finite-Horizon MDPs
标题:先验对齐的Meta-RL:有限时域MDP中具有学习先验和保证的汤普森采样
链接:https://arxiv.org/abs/2510.05446
作者:Runlin Zhou, Chixiang Chen, Elynn Chen
摘要:We study meta-reinforcement learning in finite-horizon MDPs where related tasks share similar structures in their optimal action-value functions. Specifically, we posit a linear representation $Q^*_h(s,a)=\Phi_h(s,a)\,\theta^{(k)}_h$ and place a Gaussian meta-prior $ \mathcal{N}(\theta^*_h,\Sigma^*_h)$ over the task-specific parameters $\theta^{(k)}_h$. Building on randomized value functions, we propose two Thompson-style algorithms: (i) MTSRL, which learns only the prior mean and performs posterior sampling with the learned mean and known covariance; and (ii) $\text{MTSRL}^{+}$, which additionally estimates the covariance and employs prior widening to control finite-sample estimation error. Further, we develop a prior-alignment technique that couples the posterior under the learned prior with a meta-oracle that knows the true prior, yielding meta-regret guarantees: we match prior-independent Thompson sampling in the small-task regime and strictly improve with more tasks once the prior is learned. Concretely, for known covariance we obtain $\tilde{O}(H^{4}S^{3/2}\sqrt{ANK})$ meta-regret, and with learned covariance $\tilde{O}(H^{4}S^{3/2}\sqrt{AN^3K})$; both recover a better behavior than prior-independent after $K \gtrsim \tilde{O}(H^2)$ and $K \gtrsim \tilde{O}(N^2H^2)$, respectively. Simulations on a stateful recommendation environment (with feature and prior misspecification) show that after brief exploration, MTSRL/MTSRL\(^+\) track the meta-oracle and substantially outperform prior-independent RL and bandit-only meta-baselines. Our results give the first meta-regret guarantees for Thompson-style RL with learned Q-priors, and provide practical recipes (warm-start via RLSVI, OLS aggregation, covariance widening) for experiment-rich settings.
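A toy sketch of the Thompson-sampling step with a learned Gaussian prior, in the spirit of MTSRL: the task parameter is drawn from the Gaussian posterior under the learned prior mean and known covariance, and the agent acts greedily on the sampled linear Q-values. Dimensions, the noise model, and variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_actions = 4, 3                      # feature dim and action count (toy sizes)

# Learned meta-prior over task parameters theta (mean estimated across tasks).
prior_mean = rng.normal(size=d)
prior_cov = np.eye(d)

def posterior(prior_mean, prior_cov, Phi, y, noise_var=1.0):
    """Gaussian posterior over theta given features Phi (n, d) and targets y (n,)."""
    prec = np.linalg.inv(prior_cov) + Phi.T @ Phi / noise_var
    cov = np.linalg.inv(prec)
    mean = cov @ (np.linalg.inv(prior_cov) @ prior_mean + Phi.T @ y / noise_var)
    return mean, cov

def thompson_action(state_features, mean, cov):
    """Sample theta from the posterior and act greedily on the sampled Q-values."""
    theta = rng.multivariate_normal(mean, cov)
    q_values = state_features @ theta            # one Q-value per action
    return int(np.argmax(q_values))

# Toy interaction history: random per-step features and noisy linear targets.
Phi_hist = rng.normal(size=(20, d))
y_hist = Phi_hist @ rng.normal(size=d) + 0.1 * rng.normal(size=20)
mean, cov = posterior(prior_mean, prior_cov, Phi_hist, y_hist)
print(thompson_action(rng.normal(size=(n_actions, d)), mean, cov))
```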
符号|符号学习(1篇)
【1】Logistic-Gated Operators Enable Auditable Unit-Aware Thresholds in Symbolic Regression
标题:Logistic-Gated算子在符号回归中实现可审计的单位感知阈值
链接:https://arxiv.org/abs/2510.05178
作者:Ou Deng, Ruichen Cong, Jianting Xu, Shoji Nishimura, Atsushi Ogihara, Qun Jin
摘要:符号回归有望给出可读的方程,但难以编码单位感知的阈值和条件逻辑。我们提出了逻辑门控算子(LGO),即具有可学习位置和陡度的可微门控,它们作为类型化原语嵌入,并可映射回物理单位以便审计。在两个主要健康数据集(ICU、NHANES)上,硬门控变体恢复了临床上合理的临界点:71%(5/7)的评估阈值落在指南锚点的10%范围内,100%落在20%范围内,同时使用的门控数量远少于软变体(ICU中位数4.0 vs 10.0;NHANES 5.0 vs 12.5),并保持在强SR基线的竞争准确率范围之内。在以平滑为主的任务中,门控会被剪枝,从而保持简约。其结果是具有明确的、单位感知阈值的紧凑符号方程,可以对照临床锚点进行审计,从而将可解释性从事后解释转变为建模约束,并为符号回归提供了用于机制切换和面向治理部署的实用工具。
摘要:Symbolic regression promises readable equations but struggles to encode unit-aware thresholds and conditional logic. We propose logistic-gated operators (LGO) -- differentiable gates with learnable location and steepness -- embedded as typed primitives and mapped back to physical units for audit. Across two primary health datasets (ICU, NHANES), the hard-gate variant recovers clinically plausible cut-points: 71% (5/7) of assessed thresholds fall within 10% of guideline anchors and 100% within 20%, while using far fewer gates than the soft variant (ICU median 4.0 vs 10.0; NHANES 5.0 vs 12.5), and remaining within the competitive accuracy envelope of strong SR baselines. On predominantly smooth tasks, gates are pruned, preserving parsimony. The result is compact symbolic equations with explicit, unit-aware thresholds that can be audited against clinical anchors -- turning interpretability from a post-hoc explanation into a modeling constraint and equipping symbolic regression with a practical calculus for regime switching and governance-ready deployment.
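A minimal sketch of a logistic gate with learnable location and steepness, the basic primitive behind LGO; the blood-pressure example, parameter values, and variable names are assumptions for illustration.

```python
import numpy as np

def logistic_gate(x, location, steepness):
    """Differentiable gate: approaches a step at `location` as `steepness` grows."""
    return 1.0 / (1.0 + np.exp(-steepness * (x - location)))

# Toy gated term inside a symbolic expression, e.g. gate(SBP) * risk_term.
sbp = np.array([95.0, 120.0, 145.0, 170.0])   # systolic blood pressure (mmHg), illustrative
gate = logistic_gate(sbp, location=140.0, steepness=0.5)
print(np.round(gate, 3))   # near 0 below the cut-point, near 1 above it
```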
医学相关(4篇)
【1】An Attention-Augmented VAE-BiLSTM Framework for Anomaly Detection in 12-Lead ECG Signals
标题:用于12导联心电信号异常检测的注意力增强VAE-BiLSTM框架
链接:https://arxiv.org/abs/2510.05919
作者:Marc Garreta Basora (1), Mehmet Oguz Mulayim (2 and 1) ((1) Universitat Autònoma de Barcelona (UAB), Cerdanyola del Vallès, Spain, (2) Artificial Intelligence Research Institute (IIIA-CSIC), Cerdanyola del Vallès, Spain)
备注:14 pages, 11 figures
摘要:12导联心电图(ECG)中的异常检测对于识别与心血管疾病相关的偏差至关重要。这项工作提出了三种基于自动编码器的架构的比较分析:卷积自动编码器(CAE),具有双向长短期记忆的变分自动编码器(VAE-BiLSTM)和具有多头注意力的VAE-BiLSTM(VAE-BiLSTM-MHA),用于ECG中的无监督异常检测。据我们所知,这项研究报告了VAE-BiLSTM-MHA架构在ECG异常检测中的首次应用。所有模型都在正常ECG样本上进行训练,以重建非异常心脏形态并检测指示疾病的偏差。在公开的中国生理信号挑战(CPSC)数据集上使用统一的预处理和评估管道,注意力增强的VAE实现了最佳性能,在保持测试集上的AUPRC为0.81,召回率为0.85,优于其他架构。为了支持临床分诊,该模型进一步集成到可视化异常定位的交互式仪表板中。此外,提供了与文献中的基线模型的性能比较。
摘要:Anomaly detection in 12-lead electrocardiograms (ECGs) is critical for identifying deviations associated with cardiovascular disease. This work presents a comparative analysis of three autoencoder-based architectures: convolutional autoencoder (CAE), variational autoencoder with bidirectional long short-term memory (VAE-BiLSTM), and VAE-BiLSTM with multi-head attention (VAE-BiLSTM-MHA), for unsupervised anomaly detection in ECGs. To the best of our knowledge, this study reports the first application of a VAE-BiLSTM-MHA architecture to ECG anomaly detection. All models are trained on normal ECG samples to reconstruct non-anomalous cardiac morphology and detect deviations indicative of disease. Using a unified preprocessing and evaluation pipeline on the public China Physiological Signal Challenge (CPSC) dataset, the attention-augmented VAE achieves the best performance, with an AUPRC of 0.81 and a recall of 0.85 on the held-out test set, outperforming the other architectures. To support clinical triage, this model is further integrated into an interactive dashboard that visualizes anomaly localization. In addition, a performance comparison with baseline models from the literature is provided.
【2】Improving Clinical Dataset Condensation with Mode Connectivity-based Trajectory Surrogates
标题:利用基于模式连接性的轨迹替代物改善临床数据集浓缩
链接:https://arxiv.org/abs/2510.05805
作者:Pafue Christy Nganjimi, Andrew Soltan, Danielle Belgrave, Lei Clifton, David A. Clifton, Anshul Thakur
备注:20 pages, 4 figures, Submitted to AISTATS 2026
摘要:数据集浓缩(DC)可以创建紧凑、保护隐私的合成数据集,其效用可以与真实患者记录相匹配,从而支持对高度受监管的临床数据的民主化访问,以开发下游临床模型。最先进的DC方法通过对齐在真实数据上训练的模型和在合成数据上训练的模型的训练动态来监督合成数据,通常使用完整的随机梯度下降(SGD)轨迹作为对齐目标;然而,这些轨迹通常噪声大、曲率高且存储密集,导致梯度不稳定、收敛缓慢和大量内存开销。我们通过用平滑的、低损失的参数化替代物(特别是连接来自真实训练轨迹的初始和最终模型状态的二次贝塞尔曲线)替换完整的SGD轨迹来解决这些限制。这些模式连接的路径提供无噪声、低曲率的监督信号,可稳定梯度、加速收敛并消除对密集轨迹存储的需求。我们从理论上证明了Bézier模式连接可以作为SGD路径的有效替代物,并从经验上表明,所提出的方法在五个临床数据集上优于最先进的浓缩方法,从而产生能够支持临床有效模型开发的浓缩数据集。
摘要:Dataset condensation (DC) enables the creation of compact, privacy-preserving synthetic datasets that can match the utility of real patient records, supporting democratised access to highly regulated clinical data for developing downstream clinical models. State-of-the-art DC methods supervise synthetic data by aligning the training dynamics of models trained on real and those trained on synthetic data, typically using full stochastic gradient descent (SGD) trajectories as alignment targets; however, these trajectories are often noisy, high-curvature, and storage-intensive, leading to unstable gradients, slow convergence, and substantial memory overhead. We address these limitations by replacing full SGD trajectories with smooth, low-loss parametric surrogates, specifically quadratic B\'ezier curves that connect the initial and final model states from real training trajectories. These mode-connected paths provide noise-free, low-curvature supervision signals that stabilise gradients, accelerate convergence, and eliminate the need for dense trajectory storage. We theoretically justify B\'ezier-mode connections as effective surrogates for SGD paths and empirically show that the proposed method outperforms state-of-the-art condensation approaches across five clinical datasets, yielding condensed datasets that enable clinically effective model development.
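A small sketch of the quadratic Bézier surrogate idea: instead of storing a dense SGD trajectory, alignment targets are read off a smooth curve between the initial and final parameter states. The randomly perturbed control point here is a stand-in; how the paper chooses the low-loss control point is not reproduced.

```python
import numpy as np

def quadratic_bezier(p0, p1, p2, t):
    """Evaluate B(t) = (1-t)^2 p0 + 2(1-t)t p1 + t^2 p2 for t in [0, 1]."""
    t = np.asarray(t)[:, None]
    return (1 - t) ** 2 * p0 + 2 * (1 - t) * t * p1 + t ** 2 * p2

rng = np.random.default_rng(0)
theta_start = rng.normal(size=10)      # initial model parameters (flattened, toy size)
theta_end = rng.normal(size=10)        # final parameters after training on real data
control = 0.5 * (theta_start + theta_end) + 0.1 * rng.normal(size=10)  # stand-in control point

# Smooth surrogate states used as alignment targets instead of stored SGD checkpoints.
surrogate_path = quadratic_bezier(theta_start, control, theta_end, np.linspace(0, 1, 5))
print(surrogate_path.shape)   # (5, 10): five surrogate parameter states
```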
【3】EEG-Based Acute Pain Classification: Machine Learning Model Comparison and Real-Time Clinical Feasibility
标题:基于脑电的急性疼痛分类:机器学习模型比较和实时临床可行性
链接:https://arxiv.org/abs/2510.05511
作者:Aavid Mathrawala, Dhruv Kurup, Josie Lau
摘要:目前医院内的疼痛评估通常依赖于自我报告或非特异性EKG生命体征。这一体系使重症、镇静和认知受损的患者容易受到疼痛治疗不足和阿片类药物过度使用的影响。脑电图(EEG)提供了一种测量大脑活动的非侵入性方法,这项技术有望作为一种辅助工具,突出伤害性感受的处理过程,以缓解这一问题。在这项研究中,我们使用52名健康成年人在三种强度(低、中、高)激光诱发疼痛下的数据,比较了对高疼痛与低/无痛EEG片段进行分类的机器学习模型。每个四秒片段被转换为一个537维特征向量,涵盖频谱功率、频带比、Hjorth参数、熵测量、相干性、小波能量和峰值频率度量。我们采用留一参与者交叉验证评估了九个传统机器学习模型。采用径向基函数核的支持向量机实现了最佳的离线性能,准确率为88.9%,推理时间为亚毫秒级(1.02 ms)。我们的特征重要性分析与当前经典的疼痛生理学一致,显示出对侧α抑制、中线θ/α增强和额叶γ爆发。实时XGBoost模型保持了约4 ms的端到端延迟和94.2%的准确率,表明基于EEG的疼痛监测器在临床环境中在技术上是可行的,并提供了一条通往临床验证的途径。
摘要:Current pain assessment within hospitals often relies on self-reporting or non-specific EKG vital signs. This system leaves critically ill, sedated, and cognitively impaired patients vulnerable to undertreated pain and opioid overuse. Electroencephalography (EEG) offers a noninvasive method of measuring brain activity. This technology could potentially be applied as an assistive tool to highlight nociceptive processing in order to mitigate this issue. In this study, we compared machine learning models for classifying high-pain versus low/no-pain EEG epochs using data from fifty-two healthy adults exposed to laser-evoked pain at three intensities (low, medium, high). Each four-second epoch was transformed into a 537-feature vector spanning spectral power, band ratios, Hjorth parameters, entropy measures, coherence, wavelet energies, and peak-frequency metrics. Nine traditional machine learning models were evaluated with leave-one-participant-out cross-validation. A support vector machine with radial basis function kernel achieved the best offline performance with 88.9% accuracy and sub-millisecond inference time (1.02 ms). Our Feature importance analysis was consistent with current canonical pain physiology, showing contralateral alpha suppression, midline theta/alpha enhancement, and frontal gamma bursts. The real-time XGBoost model maintained an end-to-end latency of about 4 ms and 94.2% accuracy, demonstrating that an EEG-based pain monitor is technically feasible within a clinical setting and provides a pathway towards clinical validation.
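A sketch of the evaluation protocol described above (RBF-kernel SVM with leave-one-participant-out cross-validation), using scikit-learn and random stand-in features in place of the real 537-dimensional EEG feature vectors; the sizes and hyperparameters are assumptions.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_participants, epochs_per_participant, n_features = 10, 20, 537  # toy stand-in sizes
X = rng.normal(size=(n_participants * epochs_per_participant, n_features))
y = rng.integers(0, 2, size=len(X))                 # 1 = high pain, 0 = low/no pain
groups = np.repeat(np.arange(n_participants), epochs_per_participant)

model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
accs = []
for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups):
    model.fit(X[train_idx], y[train_idx])
    accs.append(model.score(X[test_idx], y[test_idx]))
print(f"LOPO accuracy: {np.mean(accs):.3f}")
```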
【4】Physics-Informed Machine Learning in Biomedical Science and Engineering
标题:生物医学科学与工程中的物理知情机器学习
链接:https://arxiv.org/abs/2510.05433
作者:Nazanin Ahmadi, Qianying Cao, Jay D. Humphrey, George Em Karniadakis
备注:Accepted for publication in the Annual Review of Biomedical Engineering on October 2, 2025
摘要:物理信息机器学习(PIML)正在成为一种潜在的变革性范式,通过将参数化物理定律与数据驱动方法相结合来建模复杂的生物医学系统。在这里,我们回顾了三种主要的PIML框架:物理信息神经网络(PINN),神经常微分方程(NODE)和神经运算符(NO),强调了它们在生物医学科学和工程中日益增长的作用。我们从PINN开始,它将控制方程嵌入到深度学习模型中,并已成功应用于生物固体和生物流体力学、机械生物学和医学成像等领域。然后,我们回顾NODE,它提供连续时间建模,特别适合于动态生理系统,药代动力学和细胞信号。最后,我们讨论了深度NO作为学习函数空间之间映射的强大工具,使跨多尺度和空间异构生物领域的有效模拟成为可能。在整个过程中,我们强调物理可解释性,数据稀缺性或系统复杂性使传统的黑盒学习不足的应用。最后,我们确定了开放的挑战和未来的方向,推进PIML在生物医学科学和工程,包括不确定性量化,泛化和集成的PIML和大型语言模型的问题。
摘要:Physics-informed machine learning (PIML) is emerging as a potentially transformative paradigm for modeling complex biomedical systems by integrating parameterized physical laws with data-driven methods. Here, we review three main classes of PIML frameworks: physics-informed neural networks (PINNs), neural ordinary differential equations (NODEs), and neural operators (NOs), highlighting their growing role in biomedical science and engineering. We begin with PINNs, which embed governing equations into deep learning models and have been successfully applied to biosolid and biofluid mechanics, mechanobiology, and medical imaging among other areas. We then review NODEs, which offer continuous-time modeling, especially suited to dynamic physiological systems, pharmacokinetics, and cell signaling. Finally, we discuss deep NOs as powerful tools for learning mappings between function spaces, enabling efficient simulations across multiscale and spatially heterogeneous biological domains. Throughout, we emphasize applications where physical interpretability, data scarcity, or system complexity make conventional black-box learning insufficient. We conclude by identifying open challenges and future directions for advancing PIML in biomedical science and engineering, including issues of uncertainty quantification, generalization, and integration of PIML and large language models.
超分辨率|去噪|去模糊|去雾(1篇)
【1】Deciphering Invariant Feature Decoupling in Source-free Time Series Forecasting with Proxy Denoising
标题:利用代理去噪破解无源时间序列预测中的不变特征去耦合
链接:https://arxiv.org/abs/2510.05589
作者:Kangjia Yan, Chenxi Liu, Hao Miao, Xinle Wu, Yan Zhao, Chenjuan Guo, Bin Yang
摘要:移动设备的激增在各个领域产生了大量的时间序列,其中有效的时间序列预测使各种现实世界的应用成为可能。本文研究了时间序列预测中的无源域自适应问题。它的目的是在不访问源数据的情况下,将一个预训练的模型从足够的源时间序列调整到稀疏的目标时间序列域,从而满足数据保护法规的要求。为了实现这一目标,我们提出了TimePD,这是第一个具有代理去噪的无源时间序列预测框架,其中采用了大型语言模型(LLM)以受益于其泛化能力。具体来说,TimePD由三个关键组件组成:(1)双分支不变的解纠缠特征学习,通过季节趋势分解来实现表示和梯度不变性;(2)轻量级,无参数的代理去噪,动态校准LLM的系统偏差;(3)知识蒸馏,双向对齐去噪预测和原始目标预测。在真实世界数据集上进行的广泛实验提供了对所提出的TimePD有效性的深入了解,平均比SOTA基线高出9.3%。
摘要:The proliferation of mobile devices generates a massive volume of time series across various domains, where effective time series forecasting enables a variety of real-world applications. This study focuses on a new problem of source-free domain adaptation for time series forecasting. It aims to adapt a pretrained model from sufficient source time series to the sparse target time series domain without access to the source data, embracing data protection regulations. To achieve this, we propose TimePD, the first source-free time series forecasting framework with proxy denoising, where large language models (LLMs) are employed to benefit from their generalization capabilities. Specifically, TimePD consists of three key components: (1) dual-branch invariant disentangled feature learning that enforces representation- and gradient-wise invariance by means of season-trend decomposition; (2) lightweight, parameter-free proxy denoising that dynamically calibrates systematic biases of LLMs; and (3) knowledge distillation that bidirectionally aligns the denoised prediction and the original target prediction. Extensive experiments on real-world datasets offer insight into the effectiveness of the proposed TimePD, outperforming SOTA baselines by 9.3% on average.
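The framework's disentangled feature learning relies on season-trend decomposition; below is a generic moving-average decomposition sketch, not the paper's specific module, with an assumed period and a synthetic series.

```python
import numpy as np

def season_trend_decompose(x, period=24):
    """Split a series into a moving-average trend and the seasonal/residual remainder."""
    kernel = np.ones(period) / period
    trend = np.convolve(x, kernel, mode="same")
    return trend, x - trend

rng = np.random.default_rng(0)
t = np.arange(240)
series = 0.05 * t + np.sin(2 * np.pi * t / 24) + 0.1 * rng.normal(size=t.size)
trend, seasonal = season_trend_decompose(series, period=24)
print(trend.shape, seasonal.shape)   # both (240,)
```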
联邦学习|隐私保护|加密(1篇)
【1】OptiFLIDS: Optimized Federated Learning for Energy-Efficient Intrusion Detection in IoT
标题:OptiFLIDS:优化联邦学习,用于物联网中的节能入侵检测
链接:https://arxiv.org/abs/2510.05180
作者:Saida Elouardi, Mohammed Jouhari, Anas Motii
备注:12 pages, 15 figures
摘要:在智能家居和工业系统等关键物联网环境中,有效的入侵检测系统(IDS)对于确保安全至关重要。然而,开发强大的IDS解决方案仍然是一个重大挑战。传统的基于机器学习的IDS模型通常需要大型数据集,但由于隐私和安全问题,数据共享通常受到限制。联邦学习(FL)通过在不共享原始数据的情况下实现协作模型训练,提供了一种有前途的替代方案。尽管FL具有优势,但它仍然面临着关键挑战,例如数据异构性(非IID数据)以及高能耗和计算成本,特别是对于资源受限的物联网设备。为了解决这些问题,本文提出了OptiFLIDS,一种新的方法,在本地训练过程中应用修剪技术,以降低模型的复杂性和能耗。它还集成了一个定制的聚合方法,以更好地处理由于非IID数据分布而不同的修剪模型。在最近的三个物联网IDS数据集TON_IoT、X-IIoTID和IDSIoT 2024上进行的实验表明,OptiFLIDS在提高能效的同时保持了强大的检测性能,非常适合在现实世界的物联网环境中部署。
摘要:In critical IoT environments, such as smart homes and industrial systems, effective Intrusion Detection Systems (IDS) are essential for ensuring security. However, developing robust IDS solutions remains a significant challenge. Traditional machine learning-based IDS models typically require large datasets, but data sharing is often limited due to privacy and security concerns. Federated Learning (FL) presents a promising alternative by enabling collaborative model training without sharing raw data. Despite its advantages, FL still faces key challenges, such as data heterogeneity (non-IID data) and high energy and computation costs, particularly for resource constrained IoT devices. To address these issues, this paper proposes OptiFLIDS, a novel approach that applies pruning techniques during local training to reduce model complexity and energy consumption. It also incorporates a customized aggregation method to better handle pruned models that differ due to non-IID data distributions. Experiments conducted on three recent IoT IDS datasets, TON_IoT, X-IIoTID, and IDSIoT2024, demonstrate that OptiFLIDS maintains strong detection performance while improving energy efficiency, making it well-suited for deployment in real-world IoT environments.
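A simplified sketch of the two ingredients named above: magnitude pruning during local training and an aggregation that accounts for differently pruned client models (here, averaging each weight only over the clients that kept it). The pruning criterion, sparsity level, and aggregation rule are assumptions, not the paper's exact design.

```python
import numpy as np

def magnitude_prune(weights, sparsity=0.5):
    """Zero out the smallest-magnitude entries of each weight tensor."""
    pruned, masks = {}, {}
    for name, w in weights.items():
        k = int(sparsity * w.size)
        threshold = np.partition(np.abs(w).ravel(), k)[k] if k > 0 else -np.inf
        mask = (np.abs(w) >= threshold).astype(w.dtype)
        pruned[name], masks[name] = w * mask, mask
    return pruned, masks

def masked_average(client_weights, client_masks):
    """Aggregate pruned client models, averaging each entry over clients that kept it."""
    agg = {}
    for name in client_weights[0]:
        stacked = np.stack([w[name] for w in client_weights])
        counts = np.clip(np.stack([m[name] for m in client_masks]).sum(axis=0), 1, None)
        agg[name] = stacked.sum(axis=0) / counts
    return agg

rng = np.random.default_rng(0)
clients = [{"fc": rng.normal(size=(8, 4))} for _ in range(3)]      # toy client models
pruned = [magnitude_prune(w, sparsity=0.5) for w in clients]
global_model = masked_average([p for p, _ in pruned], [m for _, m in pruned])
print(global_model["fc"].shape)
```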
推理|分析|理解|解释(10篇)
【1】TaTToo: Tool-Grounded Thinking PRM for Test-Time Scaling in Tabular Reasoning
标题:TaTToo:基于工具的思维PRM,用于表格推理中的测试时间缩放
链接:https://arxiv.org/abs/2510.06217
作者:Jiaru Zou, Soumya Roy, Vinay Kumar Verma, Ziyi Wang, David Wipf, Pan Lu, Sumit Negi, James Zou, Jingrui He
摘要:过程奖励模型(PRM)最近成为一个强大的框架,用于增强大型推理模型(LRM)的推理能力,特别是在测试时间缩放(TTS)的背景下。然而,它们在表格推理领域监督LRM的潜力仍有待探索。通过详细的实证分析,我们发现现有的PRM尽管被广泛用于监督纯文本推理步骤,但在子表检索和模式交互等特定于表格的操作上表现不佳,从而导致严重的性能瓶颈。为了解决这一限制,我们提出了TaTToo,一种新的基于表格的PRM框架,它(i)在表格推理步骤上进行显式推理,(ii)集成基于工具的验证以提供精确的奖励监督。具体来说,我们首先设计了一个可扩展的数据策展管道,通过将表格验证原理与基于工具的执行相结合,构建了超过60k条高质量的步骤级注释。在收集到的数据的基础上,我们使用双阶段范式训练TaTToo:先进行冷启动监督微调以捕获工具使用的推理模式,然后进行带有基于工具的奖励塑造的强化学习,以使我们的模型与基于表格的验证保持一致。我们对新设计的PRM所带来的策略改进进行了全面评估。在涵盖数值推理、事实核查和数据分析的5个具有挑战性的表格推理基准上,TaTToo在推理时将下游策略LRM的性能提高了30.9%,以仅8B的参数量超过了Qwen-2.5-Math-PRM-72B等强大的PRM基线,并在不同的TTS策略下表现出强大的泛化能力。
摘要:Process Reward Models (PRMs) have recently emerged as a powerful framework for enhancing the reasoning capabilities of large reasoning models (LRMs), particularly in the context of test-time scaling (TTS). However, their potential for supervising LRMs on tabular reasoning domains remains underexplored. Through detailed empirical analyses, we identify that existing PRMs, though widely adopted for supervising text-only reasoning steps, struggle with table-specific operations such as sub-table retrieval and schema interaction, leading to critical performance bottlenecks. To address this limitation, we propose TaTToo, a novel table-grounded PRM framework that (i) reasons explicitly over tabular reasoning steps and (ii) integrates tool-based verification to provide precise reward supervision. Concretely, we first design a scalable data curation pipeline that constructs over 60k high-quality step-level annotations by integrating table verification rationales with tool-based executions. Building on the collected data, we train TaTToo with a dual-stage paradigm: cold-start supervised fine-tuning to capture tool-use reasoning patterns, followed by reinforcement learning with tool-grounded reward shaping to align our model with table-based verification. We provide a comprehensive evaluation of the policy improvement induced by our newly designed PRM. Across 5 challenging tabular reasoning benchmarks covering numerical reasoning, fact-checking, and data analysis, TaTToo improves downstream policy LRMs by 30.9% at inference, surpasses strong PRM baselines such as Qwen-2.5-Math-PRM-72B with only 8B parameters, and demonstrates strong generalizability across diverse TTS strategies.
【2】Higher-Order Feature Attribution: Bridging Statistics, Explainable AI, and Topological Signal Processing
标题:高阶特征归因:连接统计学、可解释人工智能与拓扑信号处理
链接:https://arxiv.org/abs/2510.06165
作者:Kurt Butler, Guanchao Feng, Petar Djuric
备注:5 pages, 3 figures
摘要:特征归因是一种训练后分析方法,用于评估机器学习模型的各种输入特征如何对输出预测做出贡献。当特征独立作用时,它们的解释是直接的;但当预测模型涉及诸如乘法关系或联合特征贡献等相互作用时,解释就不那么直接了。在这项工作中,我们提出了一个高阶特征归因的一般理论,该理论建立在积分梯度(IG)的基础上,扩展了可解释AI文献中的现有框架。当使用IG作为特征归因方法时,我们发现了与统计学和拓扑信号处理的自然联系。我们给出了若干建立该理论的理论结果,并在几个例子上验证了该理论。
摘要:Feature attributions are post-training analysis methods that assess how various input features of a machine learning model contribute to an output prediction. Their interpretation is straightforward when features act independently, but becomes less direct when the predictive model involves interactions such as multiplicative relationships or joint feature contributions. In this work, we propose a general theory of higher-order feature attribution, which we develop on the foundation of Integrated Gradients (IG). This work extends existing frameworks in the literature on explainable AI. When using IG as the method of feature attribution, we discover natural connections to statistics and topological signal processing. We provide several theoretical results that establish the theory, and we validate our theory on a few examples.
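Since the theory builds on Integrated Gradients, a small numerical sketch of the standard IG estimator may help: the attribution for feature i is (x_i - x'_i) times the path-averaged partial derivative, approximated here by a Riemann sum; the toy model with a multiplicative interaction and the finite-difference gradients are illustrative.

```python
import numpy as np

def integrated_gradients(f, x, baseline, steps=64, eps=1e-5):
    """Approximate IG_i = (x_i - x'_i) * mean_alpha dF/dx_i(x' + alpha (x - x'))."""
    diff = x - baseline
    grad_sum = np.zeros_like(x)
    for alpha in (np.arange(steps) + 0.5) / steps:       # midpoint Riemann rule
        point = baseline + alpha * diff
        grad = np.array([                                 # finite-difference gradient
            (f(point + eps * e) - f(point - eps * e)) / (2 * eps)
            for e in np.eye(len(x))
        ])
        grad_sum += grad
    return diff * grad_sum / steps

# Toy model with a multiplicative interaction between the first two features.
f = lambda z: z[0] * z[1] + 2.0 * z[2]
x, baseline = np.array([1.0, 2.0, 3.0]), np.zeros(3)
attr = integrated_gradients(f, x, baseline)
print(np.round(attr, 3), "sum =", round(attr.sum(), 3))   # completeness: sum ≈ f(x) - f(baseline)
```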
【3】Influence Functions for Efficient Data Selection in Reasoning
标题:推理中高效数据选择的影响函数
链接:https://arxiv.org/abs/2510.06108
作者:Prateek Humane, Paolo Cudrano, Daniel Z. Kaplan, Matteo Matteucci, Supriyo Chakraborty, Irina Rish
摘要:在思维链(CoT)数据上微调大型语言模型(LLM)的实践表明,少量高质量数据可以胜过大规模数据集。然而,什么是“质量”仍然定义不清。现有的推理数据选择方法依赖于间接的启发式规则,如问题难度或推理轨迹长度;指令微调领域虽已探索了更广泛的自动选择策略,但很少在推理的背景下进行。我们建议使用影响函数来定义推理数据质量,它衡量单个CoT示例对下游准确率的因果影响;我们还引入了基于影响的剪枝方法,在同一模型族内的数学推理任务上,该方法始终优于基于困惑度和嵌入的基线。
摘要:Fine-tuning large language models (LLMs) on chain-of-thought (CoT) data shows that a small amount of high-quality data can outperform massive datasets. Yet, what constitutes "quality" remains ill-defined. Existing reasoning methods rely on indirect heuristics such as problem difficulty or trace length, while instruction-tuning has explored a broader range of automated selection strategies, but rarely in the context of reasoning. We propose to define reasoning data quality using influence functions, which measure the causal effect of individual CoT examples on downstream accuracy, and introduce influence-based pruning, which consistently outperforms perplexity and embedding-based baselines on math reasoning within a model family.
【4】TelecomTS: A Multi-Modal Observability Dataset for Time Series and Language Analysis
标题:TelecomTS:用于时间序列和语言分析的多模态可观测性数据集
链接:https://arxiv.org/abs/2510.06063
作者:Austin Feng, Andreas Varvarigos, Ioannis Panitsas, Daniela Fernandez, Jinbiao Wei, Yuwei Guo, Jialin Chen, Ali Maatouk, Leandros Tassiulas, Rex Ying
摘要:现代企业在监控复杂系统时会产生大量的时间序列指标,称为可观测性数据。与来自天气等领域的传统时间序列不同,可观测性数据是零膨胀的、高度随机的,并且表现出极少的时间结构。尽管它们很重要,但由于专有限制,可观测性数据集在公共基准中的代表性不足。现有的数据集通常是匿名化和归一化的,删除了尺度信息,并限制了它们用于预测以外的任务,如异常检测、根因分析和多模态推理。为了解决这一差距,我们引入了TelecomTS,一个来自5G电信网络的大规模可观测性数据集。TelecomTS具有异构的、去匿名化的协变量,带有显式的尺度信息,并支持一系列下游任务,包括异常检测、根因分析以及需要多模态推理的问答基准。对最先进的时间序列、语言和推理模型进行基准测试表明,现有方法难以应对可观测性数据的突变、噪声和高方差动态。我们的实验还强调了保留协变量绝对尺度的重要性,凸显了对能够原生利用尺度信息的基础时间序列模型的需求,以满足实际可观测性应用。
摘要:Modern enterprises generate vast streams of time series metrics when monitoring complex systems, known as observability data. Unlike conventional time series from domains such as weather, observability data are zero-inflated, highly stochastic, and exhibit minimal temporal structure. Despite their importance, observability datasets are underrepresented in public benchmarks due to proprietary restrictions. Existing datasets are often anonymized and normalized, removing scale information and limiting their use for tasks beyond forecasting, such as anomaly detection, root-cause analysis, and multi-modal reasoning. To address this gap, we introduce TelecomTS, a large-scale observability dataset derived from a 5G telecommunications network. TelecomTS features heterogeneous, de-anonymized covariates with explicit scale information and supports a suite of downstream tasks, including anomaly detection, root-cause analysis, and a question-answering benchmark requiring multi-modal reasoning. Benchmarking state-of-the-art time series, language, and reasoning models reveals that existing approaches struggle with the abrupt, noisy, and high-variance dynamics of observability data. Our experiments also underscore the importance of preserving covariates' absolute scale, emphasizing the need for foundation time series models that natively leverage scale information for practical observability applications.
【5】Diffusion Models for Low-Light Image Enhancement: A Multi-Perspective Taxonomy and Performance Analysis
标题:低光图像增强的扩散模型:多视角分类和性能分析
链接:https://arxiv.org/abs/2510.05976
作者:Eashan Adhikarla, Yixin Liu, Brian D. Davison
摘要:低光图像增强(LLIE)对于监控、自主导航和医学成像等安全关键型应用至关重要,在这些应用中,能见度下降会影响下游任务的性能。最近,由于扩散模型能够通过迭代去噪来建模复杂的图像分布,它已成为LLIE中一个很有前途的生成范式。本综述提供了对LLIE扩散模型的最新批判性分析,特别是针对生成对抗网络和基于Transformer的最先进方法进行了深入的比较性能评估,对实际部署挑战进行了全面检查,并对基础模型等新兴范式的作用进行了前瞻性分析。我们提出了一个多视角分类法,包括六个类别:内在分解、光谱与潜在空间、加速、引导、多模态和自主;该分类法将增强方法映射到物理先验、条件方案和计算效率等维度上。我们的分类法建立在模型机制和条件信号的混合视角之上。我们评估了定性失效模式、基准不一致性,以及可解释性、泛化和推理效率之间的权衡。我们还讨论了现实世界的部署限制(例如内存、能耗)和伦理考虑。本综述旨在通过突出趋势和提出开放的研究问题(包括新的条件机制、实时适应以及基础模型的潜力),来指导下一代基于扩散的LLIE研究。
摘要:Low-light image enhancement (LLIE) is vital for safety-critical applications such as surveillance, autonomous navigation, and medical imaging, where visibility degradation can impair downstream task performance. Recently, diffusion models have emerged as a promising generative paradigm for LLIE due to their capacity to model complex image distributions via iterative denoising. This survey provides an up-to-date critical analysis of diffusion models for LLIE, distinctively featuring an in-depth comparative performance evaluation against Generative Adversarial Network and Transformer-based state-of-the-art methods, a thorough examination of practical deployment challenges, and a forward-looking perspective on the role of emerging paradigms like foundation models. We propose a multi-perspective taxonomy encompassing six categories: Intrinsic Decomposition, Spectral & Latent, Accelerated, Guided, Multimodal, and Autonomous; that map enhancement methods across physical priors, conditioning schemes, and computational efficiency. Our taxonomy is grounded in a hybrid view of both the model mechanism and the conditioning signals. We evaluate qualitative failure modes, benchmark inconsistencies, and trade-offs between interpretability, generalization, and inference efficiency. We also discuss real-world deployment constraints (e.g., memory, energy use) and ethical considerations. This survey aims to guide the next generation of diffusion-based LLIE research by highlighting trends and surfacing open research questions, including novel conditioning, real-time adaptation, and the potential of foundation models.
【6】ESS-Flow: Training-free guidance of flow-based models as inference in source space
标题:ESS-Flow:基于流的模型的免训练指导,作为源空间中的推理
链接:https://arxiv.org/abs/2510.05849
作者:Adhithyan Kalaivanan, Zheng Zhao, Jens Sjölund, Fredrik Lindsten
备注:14 pages, 12 figures. Code will be made available after publication
摘要:引导预训练的基于流的生成模型进行条件生成,或生成具有所需目标属性的样本,可以在无需对配对数据进行再训练的情况下解决各种任务。我们提出了ESS-Flow,一种无梯度的方法,它利用基于流的模型中源分布通常为高斯先验这一特点,使用椭圆切片采样直接在源空间中执行贝叶斯推断。ESS-Flow只需要通过生成模型和观测过程的前向传递,无需梯度或雅可比计算,并且即使在梯度不可靠或不可用时也适用,例如当观测基于模拟、或生成与观测过程中存在量化时。我们在设计具有所需目标属性的材料以及从稀疏的残基间距离测量预测蛋白质结构这两个任务上证明了它的有效性。
摘要:Guiding pretrained flow-based generative models for conditional generation or to produce samples with desired target properties enables solving diverse tasks without retraining on paired data. We present ESS-Flow, a gradient-free method that leverages the typically Gaussian prior of the source distribution in flow-based models to perform Bayesian inference directly in the source space using Elliptical Slice Sampling. ESS-Flow only requires forward passes through the generative model and observation process, no gradient or Jacobian computations, and is applicable even when gradients are unreliable or unavailable, such as with simulation-based observations or quantization in the generation or observation process. We demonstrate its effectiveness on designing materials with desired target properties and predicting protein structures from sparse inter-residue distance measurements.
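A minimal sketch of one elliptical slice sampling update under a standard-normal prior, the core sampler that ESS-Flow runs in the source space; the toy linear-Gaussian likelihood stands in for the flow-plus-observation model and is an assumption.

```python
import numpy as np

def elliptical_slice_step(f, log_lik, rng):
    """One elliptical slice sampling update for a standard-normal prior on f."""
    nu = rng.normal(size=f.shape)                       # auxiliary draw from the prior
    log_y = log_lik(f) + np.log(rng.uniform())          # slice threshold
    theta = rng.uniform(0.0, 2.0 * np.pi)
    lo, hi = theta - 2.0 * np.pi, theta
    while True:
        proposal = f * np.cos(theta) + nu * np.sin(theta)
        if log_lik(proposal) > log_y:
            return proposal
        # Shrink the angle bracket toward the current state and retry.
        if theta < 0.0:
            lo = theta
        else:
            hi = theta
        theta = rng.uniform(lo, hi)

# Toy observation model: y = A f + noise, so the log-likelihood is Gaussian in f.
rng = np.random.default_rng(0)
A, y, noise = rng.normal(size=(3, 5)), np.array([1.0, -0.5, 0.2]), 0.1
log_lik = lambda f: -0.5 * np.sum((A @ f - y) ** 2) / noise ** 2
f = rng.normal(size=5)
samples = []
for _ in range(500):
    f = elliptical_slice_step(f, log_lik, rng)
    samples.append(f)
print(np.mean(samples, axis=0).round(2))   # posterior mean estimate
```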
【7】Mitigating Premature Exploitation in Particle-based Monte Carlo for Inference-Time Scaling
标题:减少基于粒子的蒙特卡罗推理时间缩放中的过早利用
链接:https://arxiv.org/abs/2510.05825
作者:Giorgio Giannone, Guangxuan Xu, Nikhil Shivakumar Nayak, Rohan Mahesh Awhad, Shivchander Sudalairaj, Kai Xu, Akash Srivastava
摘要:Inference-Time Scaling (ITS) improves language models by allocating more computation at generation time. Particle Filtering (PF) has emerged as a strong ITS method for complex mathematical reasoning tasks, but it is vulnerable when guided by process reward models, which often assign overconfident scores early in the reasoning process. This causes PF to suffer from premature exploitation: it myopically commits to locally promising trajectories, prunes potentially correct hypotheses, and converges to suboptimal solutions. This failure mode, known as particle impoverishment, is especially severe under constrained computational budgets. To address this, we analyze the problem and identify two root causes: a lack of diversity in the particle set due to overconfident resampling and consequent inability to assess the potential of a reasoning path. We introduce Entropic Particle Filtering (ePF), an algorithm that integrates two new techniques to solve these issues. The first technique, Entropic Annealing (EA), directly mitigates particle impoverishment by monitoring search diversity via entropy; when diversity drops, it intervenes by dynamically annealing the resampling distribution to preserve exploration. The second, an enhancement called Look-ahead Modulation (LaM), adds a predictive guide to evaluate a state's potential based on its successors. On several challenging math benchmarks, ePF significantly outperforms strong baselines and achieves up to a 50 % relative improvement in task reward. Together, these methods improve PF's resilience by balancing the exploration of diverse solution spaces with the exploitation of high-reward regions, ultimately leading to higher-quality solutions.
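A rough sketch of the entropy-monitored resampling idea behind Entropic Annealing: when the entropy of the particle weights drops below a threshold, the resampling distribution is flattened before indices are drawn. The tempering rule, threshold, and toy reward scores are assumptions, not the paper's exact schedule.

```python
import numpy as np

def normalized(w):
    w = np.asarray(w, dtype=float)
    return w / w.sum()

def entropy(p):
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def annealed_resample(scores, n_particles, rng, min_entropy_frac=0.6, tau=0.5):
    """Resample particle indices; temper over-confident weights to preserve diversity."""
    probs = normalized(np.exp(scores - np.max(scores)))
    max_entropy = np.log(len(probs))
    if entropy(probs) < min_entropy_frac * max_entropy:
        probs = normalized(probs ** tau)    # flatten the distribution (tau < 1)
    return rng.choice(len(probs), size=n_particles, p=probs)

rng = np.random.default_rng(0)
reward_scores = np.array([5.0, 4.8, 0.5, 0.4, 0.3])   # overconfident PRM scores (toy)
print(annealed_resample(reward_scores, n_particles=5, rng=rng))
```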
【8】Mellum: Production-Grade in-IDE Contextual Code Completion with Multi-File Project Understanding
标题:Mellum:具备多文件项目理解的生产级IDE内上下文代码补全
链接:https://arxiv.org/abs/2510.05788
作者:Nikita Pavlichenko, Iurii Nazarov, Ivan Dolgov, Ekaterina Garanina, Dmitry Ustalov, Ivan Bondyrev, Kseniia Lysaniuk, Evgeniia Vu, Kirill Chekmenev, Joseph Shtok, Yaroslav Golubev, Anton Semenkin, Uladzislau Sazanovich
备注:11 pages, 4 figures, 3 tables
摘要:We present the Mellum models family, open-weight code completion models designed for interactive use in JetBrains IDEs. Mellums have 4B parameters, adopt a Llama-style architecture, and are pre-trained on ~4T tokens of permissively licensed, multi-language code. Our studies show that (i) careful data curation and staged training significantly improve the model's quality, (ii) editor-critical capabilities such as context packing are necessary for high-quality suggestions, and (iii) a compact, task-focused model can meet the cost and latency constraints of interactive completion. In the paper, we describe an end-to-end industrial pipeline for producing contextualized in-editor completion: disciplined data governance, multi-stage training that includes fill-in-the-middle and project context via supervised fine-tuning, and alignment via direct preference optimization using feedback from real-world scenarios. Our quality evaluations include both large-scale offline benchmarks and online telemetry from production deployments in JetBrains IDEs. Mellums are released under the Apache-2.0 license on HuggingFace, with a public model card providing a reproducible reference for practitioners. Our experience offers a pragmatic blueprint for taking a focused, open model from a research prototype to at scale production for hundreds of thousands of users.
【9】ARM: Discovering Agentic Reasoning Modules for Generalizable Multi-Agent Systems
标题:ARM:发现可泛化多智能体系统的智能体推理模块
链接:https://arxiv.org/abs/2510.05746
作者:Bohan Yao, Shiva Krishna Reddy Malay, Vikas Yadav
备注:29 pages, 2 figures
摘要:Large Language Model (LLM)-powered Multi-agent systems (MAS) have achieved state-of-the-art results on various complex reasoning tasks. Recent works have proposed techniques to automate the design of MASes, eliminating the need for manual engineering. However, these techniques perform poorly, often achieving similar or inferior performance to simple baselines. Furthermore, they require computationally expensive re-discovery of architectures for each new task domain and expensive data annotation on domains without existing labeled validation sets. A critical insight is that simple Chain of Thought (CoT) reasoning often performs competitively with these complex systems, suggesting that the fundamental reasoning unit of MASes, CoT, warrants further investigation. To this end, we present a new paradigm for automatic MAS design that pivots the focus to optimizing CoT reasoning. We introduce the Agentic Reasoning Module (ARM), an agentic generalization of CoT where each granular reasoning step is executed by a specialized reasoning module. This module is discovered through a tree search over the code space, starting from a simple CoT module and evolved using mutations informed by reflection on execution traces. The resulting ARM acts as a versatile reasoning building block which can be utilized as a direct recursive loop or as a subroutine in a learned meta-orchestrator. Our approach significantly outperforms both manually designed MASes and state-of-the-art automatic MAS design methods. Crucially, MASes built with ARM exhibit superb generalization, maintaining high performance across different foundation models and task domains without further optimization.
【10】A Fuzzy Logic-Based Framework for Explainable Machine Learning in Big Data Analytics
标题:基于模糊逻辑的大数据分析中可解释机器学习框架
链接:https://arxiv.org/abs/2510.05120
作者:Farjana Yesmin, Nusrat Shirmin
备注:8 pages
摘要:The growing complexity of machine learning (ML) models in big data analytics, especially in domains such as environmental monitoring, highlights the critical need for interpretability and explainability to promote trust, ethical considerations, and regulatory adherence (e.g., GDPR). Traditional "black-box" models obstruct transparency, whereas post-hoc explainable AI (XAI) techniques like LIME and SHAP frequently compromise accuracy or fail to deliver inherent insights. This paper presents a novel framework that combines type-2 fuzzy sets, granular computing, and clustering to boost explainability and fairness in big data environments. When applied to the UCI Air Quality dataset, the framework effectively manages uncertainty in noisy sensor data, produces linguistic rules, and assesses fairness using silhouette scores and entropy. Key contributions encompass: (1) A type-2 fuzzy clustering approach that enhances cohesion by about 4% compared to type-1 methods (silhouette 0.365 vs. 0.349) and improves fairness (entropy 0.918); (2) Incorporation of fairness measures to mitigate biases in unsupervised scenarios; (3) A rule-based component for intrinsic XAI, achieving an average coverage of 0.65; (4) Scalable assessments showing linear runtime (roughly 0.005 seconds for sampled big data sizes). Experimental outcomes reveal superior performance relative to baselines such as DBSCAN and Agglomerative Clustering in terms of interpretability, fairness, and efficiency. Notably, the proposed method achieves a 4% improvement in silhouette score over type-1 fuzzy clustering and outperforms baselines in fairness (entropy reduction by up to 1%) and efficiency.
检测相关(4篇)
【1】Out-of-Distribution Detection from Small Training Sets using Bayesian Neural Network Classifiers
标题:使用Bayesian神经网络分类器从小训练集进行分布外检测
链接:https://arxiv.org/abs/2510.06025
作者:Kevin Raina, Tanya Schmah
备注:British Machine Vision Conference (BMVC) 2025; 18 pages, 6 figures, 3 tables
摘要:Out-of-Distribution (OOD) detection is critical to AI reliability and safety, yet in many practical settings, only a limited amount of training data is available. Bayesian Neural Networks (BNNs) are a promising class of model on which to base OOD detection, because they explicitly represent epistemic (i.e. model) uncertainty. In the small training data regime, BNNs are especially valuable because they can incorporate prior model information. We introduce a new family of Bayesian posthoc OOD scores based on expected logit vectors, and compare 5 Bayesian and 4 deterministic posthoc OOD scores. Experiments on MNIST and CIFAR-10 In-Distributions, with 5000 training samples or less, show that the Bayesian methods outperform corresponding deterministic methods.
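A toy sketch of an OOD score built from the posterior-expected logit vector: logits from several posterior draws (e.g. MC-dropout samples) are averaged and a confidence-style score is computed on the result. The particular score and the synthetic logit samples are illustrative assumptions, not the paper's proposed family.

```python
import numpy as np

def expected_logit_ood_score(logit_samples):
    """OOD score from the posterior-expected logit vector (higher = more OOD).

    logit_samples: array of shape (n_posterior_samples, n_classes) for one input."""
    expected_logits = logit_samples.mean(axis=0)
    probs = np.exp(expected_logits - expected_logits.max())
    probs /= probs.sum()
    return 1.0 - probs.max()          # low max softmax probability => likely OOD

rng = np.random.default_rng(0)
# Toy stand-ins for posterior samples of logits from a BNN.
in_dist = rng.normal(loc=[4.0, 0.0, 0.0], scale=0.3, size=(20, 3))
out_dist = rng.normal(loc=[0.5, 0.4, 0.3], scale=1.0, size=(20, 3))
print(expected_logit_ood_score(in_dist), expected_logit_ood_score(out_dist))
```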
【2】Kaputt: A Large-Scale Dataset for Visual Defect Detection
标题:Kaputt:用于视觉缺陷检测的大规模数据集
链接:https://arxiv.org/abs/2510.05903
作者:Sebastian Höfer, Dorian Henning, Artemij Amiranashvili, Douglas Morrison, Mariliza Tzes, Ingmar Posner, Marc Matvienko, Alessandro Rennola, Anton Milan
备注:Accepted to ICCV 2025
摘要:We present a novel large-scale dataset for defect detection in a logistics setting. Recent work on industrial anomaly detection has primarily focused on manufacturing scenarios with highly controlled poses and a limited number of object categories. Existing benchmarks like MVTec-AD [6] and VisA [33] have reached saturation, with state-of-the-art methods achieving up to 99.9% AUROC scores. In contrast to manufacturing, anomaly detection in retail logistics faces new challenges, particularly in the diversity and variability of object pose and appearance. Leading anomaly detection methods fall short when applied to this new setting. To bridge this gap, we introduce a new benchmark that overcomes the current limitations of existing datasets. With over 230,000 images (and more than 29,000 defective instances), it is 40 times larger than MVTec-AD and contains more than 48,000 distinct objects. To validate the difficulty of the problem, we conduct an extensive evaluation of multiple state-of-the-art anomaly detection methods, demonstrating that they do not surpass 56.96% AUROC on our dataset. Further qualitative analysis confirms that existing methods struggle to leverage normal samples under heavy pose and appearance variation. With our large-scale dataset, we set a new benchmark and encourage future research towards solving this challenging problem in retail logistics anomaly detection. The dataset is available for download under https://www.kaputt-dataset.com.
【3】Sparse deepfake detection promotes better disentanglement
标题:稀疏深度伪造检测促进更好的解纠缠
链接:https://arxiv.org/abs/2510.05696
作者:Antoine Teissier, Marie Tahon, Nicolas Dugué, Aghilas Sini
摘要:Due to the rapid progress of speech synthesis, deepfake detection has become a major concern in the speech processing community. Because it is a critical task, systems must not only be efficient and robust, but also provide interpretable explanations. Among the different approaches for explainability, we focus on the interpretation of latent representations. In such paper, we focus on the last layer of embeddings of AASIST, a deepfake detection architecture. We use a TopK activation inspired by SAEs on this layer to obtain sparse representations which are used in the decision process. We demonstrate that sparse deepfake detection can improve detection performance, with an EER of 23.36% on ASVSpoof5 test set, with 95% of sparsity. We then show that these representations provide better disentanglement, using completeness and modularity metrics based on mutual information. Notably, some attacks are directly encoded in the latent space.
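A minimal sketch of a TopK activation applied to final-layer embeddings, the SAE-inspired sparsification described above; the embedding sizes and k are illustrative.

```python
import numpy as np

def topk_activation(z, k):
    """Keep the k largest entries of each embedding vector, zero out the rest."""
    out = np.zeros_like(z)
    idx = np.argpartition(z, -k, axis=-1)[..., -k:]
    np.put_along_axis(out, idx, np.take_along_axis(z, idx, axis=-1), axis=-1)
    return out

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(2, 16))       # toy batch of final-layer embeddings
sparse = topk_activation(embeddings, k=3)
print((sparse != 0).sum(axis=-1))           # 3 active units per example
```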
【4】Machine learning for fraud detection in digital banking: a systematic literature review
标题:用于数字银行欺诈检测的机器学习:系统性文献回顾
链接:https://arxiv.org/abs/2510.05167
作者:Md Zahin Hossain George, Md Khorshed Alam, Md Tarek Hasan
摘要:This systematic literature review examines the role of machine learning in fraud detection within digital banking, synthesizing evidence from 118 peer-reviewed studies and institutional reports. Following the PRISMA guidelines, the review applied a structured identification, screening, eligibility, and inclusion process to ensure methodological rigor and transparency. The findings reveal that supervised learning methods, such as decision trees, logistic regression, and support vector machines, remain the dominant paradigm due to their interpretability and established performance, while unsupervised anomaly detection approaches are increasingly adopted to address novel fraud patterns in highly imbalanced datasets. Deep learning architectures, particularly recurrent and convolutional neural networks, have emerged as transformative tools capable of modeling sequential transaction data and detecting complex fraud typologies, though challenges of interpretability and real-time deployment persist. Hybrid models that combine supervised, unsupervised, and deep learning strategies demonstrate superior adaptability and detection accuracy, highlighting their potential as convergent solutions.
分类|识别(3篇)
【1】EmoHRNet: High-Resolution Neural Network Based Speech Emotion Recognition
标题:EmoHRNet:基于高分辨率神经网络的语音情感识别
链接:https://arxiv.org/abs/2510.06072
作者:Akshay Muppidi, Martin Radfar
摘要:Speech emotion recognition (SER) is pivotal for enhancing human-machine interactions. This paper introduces "EmoHRNet", a novel adaptation of High-Resolution Networks (HRNet) tailored for SER. The HRNet structure is designed to maintain high-resolution representations from the initial to the final layers. By transforming audio samples into spectrograms, EmoHRNet leverages the HRNet architecture to extract high-level features. EmoHRNet's unique architecture maintains high-resolution representations throughout, capturing both granular and overarching emotional cues from speech signals. The model outperforms leading models, achieving accuracies of 92.45% on RAVDESS, 80.06% on IEMOCAP, and 92.77% on EMOVO. Thus, we show that EmoHRNet sets a new benchmark in the SER domain.
【2】Emergent AI Surveillance: Overlearned Person Re-Identification and Its Mitigation in Law Enforcement Context
标题:涌现的人工智能监视:执法背景下过度学习的人员重识别及其缓解措施
链接:https://arxiv.org/abs/2510.06026
作者:An Thi Nguyen, Radina Stoykova, Eric Arazo
备注:10 pages, accepted to AIES 2025
摘要:Generic instance search models can dramatically reduce the manual effort required to analyze vast surveillance footage during criminal investigations by retrieving specific objects of interest to law enforcement. However, our research reveals an unintended emergent capability: through overlearning, these models can single out specific individuals even when trained on datasets without human subjects. This capability raises concerns regarding identification and profiling of individuals based on their personal data, while there is currently no clear standard on how de-identification can be achieved. We evaluate two technical safeguards to curtail a model's person re-identification capacity: index exclusion and confusion loss. Our experiments demonstrate that combining these approaches can reduce person re-identification accuracy to below 2% while maintaining 82% of retrieval performance for non-person objects. However, we identify critical vulnerabilities in these mitigations, including potential circumvention using partial person images. These findings highlight urgent regulatory questions at the intersection of AI governance and data protection: How should we classify and regulate systems with emergent identification capabilities? And what technical standards should be required to prevent identification capabilities from developing in seemingly benign applications?
【3】Midway Network: Learning Representations for Recognition and Motion from Latent Dynamics
标题:中途网络:从潜在动力学中学习识别和运动的表示
链接:https://arxiv.org/abs/2510.05558
作者:Christopher Hoang, Mengye Ren
备注:Project page: this https URL
摘要:Object recognition and motion understanding are key components of perception that complement each other. While self-supervised learning methods have shown promise in their ability to learn from unlabeled data, they have primarily focused on obtaining rich representations for either recognition or motion rather than both in tandem. On the other hand, latent dynamics modeling has been used in decision making to learn latent representations of observations and their transformations over time for control and planning tasks. In this work, we present Midway Network, a new self-supervised learning architecture that is the first to learn strong visual representations for both object recognition and motion understanding solely from natural videos, by extending latent dynamics modeling to this domain. Midway Network leverages a midway top-down path to infer motion latents between video frames, as well as a dense forward prediction objective and hierarchical structure to tackle the complex, multi-object scenes of natural videos. We demonstrate that after pretraining on two large-scale natural video datasets, Midway Network achieves strong performance on both semantic segmentation and optical flow tasks relative to prior self-supervised learning methods. We also show that Midway Network's learned dynamics can capture high-level correspondence via a novel analysis method based on forward feature perturbation.
表征(6篇)
【1】OBSR: Open Benchmark for Spatial Representations
标题:OBSR:空间表示的开放基准
链接:https://arxiv.org/abs/2510.05879
作者:Julia Moska, Oleksii Furman, Kacper Kozaczko, Szymon Leszkiewicz, Jakub Polczyk, Piotr Gramacki, Piotr Szymański
备注:ACM SIGSPATIAL 2025 Full Paper
摘要:GeoAI is evolving rapidly, fueled by diverse geospatial datasets like traffic patterns, environmental data, and crowdsourced OpenStreetMap (OSM) information. While sophisticated AI models are being developed, existing benchmarks are often concentrated on single tasks and restricted to a single modality. As such, progress in GeoAI is limited by the lack of a standardized, multi-task, modality-agnostic benchmark for their systematic evaluation. This paper introduces a novel benchmark designed to assess the performance, accuracy, and efficiency of geospatial embedders. Our benchmark is modality-agnostic and comprises 7 distinct datasets from diverse cities across three continents, ensuring generalizability and mitigating demographic biases. It allows for the evaluation of GeoAI embedders on various phenomena that exhibit underlying geographic processes. Furthermore, we establish a simple and intuitive task-oriented model baselines, providing a crucial reference point for comparing more complex solutions.
【2】Multimodal Trajectory Representation Learning for Travel Time Estimation
标题:用于旅行时间估计的多模态轨迹表示学习
链接:https://arxiv.org/abs/2510.05840
作者:Zhi Liu, Xuyuan Hu, Xiao Han, Zhehao Dai, Zhaolin Deng, Guojiang Shen, Xiangjie Kong
摘要:Accurate travel time estimation (TTE) plays a crucial role in intelligent transportation systems. However, it remains challenging due to heterogeneous data sources and complex traffic dynamics. Moreover, conventional approaches typically convert trajectories into fixed-length representations, neglecting the inherent variability of real-world trajectories, which often leads to information loss or feature redundancy. To address these challenges, this paper introduces the Multimodal Dynamic Trajectory Integration (MDTI) framework--a novel multimodal trajectory representation learning approach that integrates GPS sequences, grid trajectories, and road network constraints to enhance TTE accuracy. MDTI employs modality-specific encoders and a cross-modal interaction module to capture complementary spatial, temporal, and topological semantics, while a dynamic trajectory modeling mechanism adaptively regulates information density for trajectories of varying lengths. Two self-supervised pretraining objectives, named contrastive alignment and masked language modeling, further strengthen multimodal consistency and contextual understanding. Extensive experiments on three real-world datasets demonstrate that MDTI consistently outperforms state-of-the-art baselines, confirming its robustness and strong generalization abilities. The code is publicly available at: https://github.com/freshhxy/MDTI/
【3】Power Mechanism: Private Tabular Representation Release for Model Agnostic Consumption
标题:Power Mechanism:面向模型无关使用的私有表格表示发布
链接:https://arxiv.org/abs/2510.05581
作者:Praneeth Vepakomma, Kaustubh Ponkshe
摘要:Traditional collaborative learning approaches are based on sharing of model weights between clients and a server. However, there are advantages to resource efficiency through schemes based on sharing of embeddings (activations) created from the data. Several differentially private methods were developed for sharing of weights while such mechanisms do not exist so far for sharing of embeddings. We propose Ours to learn a privacy encoding network in conjunction with a small utility generation network such that the final embeddings generated from it are equipped with formal differential privacy guarantees. These privatized embeddings are then shared with a more powerful server, that learns a post-processing that results in a higher accuracy for machine learning tasks. We show that our co-design of collaborative and private learning results in requiring only one round of privatized communication and lesser compute on the client than traditional methods. The privatized embeddings that we share from the client are agnostic to the type of model (deep learning, random forests or XGBoost) used on the server in order to process these activations to complete a task.
【4】Permutation-Invariant Representation Learning for Robust and Privacy-Preserving Feature Selection
标题:用于鲁棒和隐私保护特征选择的排列不变表示学习
链接:https://arxiv.org/abs/2510.05535
作者:Rui Liu, Tao Zhe, Yanjie Fu, Feng Xia, Ted Senator, Dongjie Wang
摘要:Feature selection eliminates redundancy among features to improve downstream task performance while reducing computational overhead. Existing methods often struggle to capture intricate feature interactions and adapt across diverse application scenarios. Recent advances employ generative intelligence to alleviate these drawbacks. However, these methods remain constrained by permutation sensitivity in embedding and reliance on convexity assumptions in gradient-based search. To address these limitations, our initial work introduces a novel framework that integrates permutation-invariant embedding with policy-guided search. Although effective, it still left opportunities to adapt to realistic distributed scenarios. In practice, data across local clients is highly imbalanced, heterogeneous and constrained by strict privacy regulations, limiting direct sharing. These challenges highlight the need for a framework that can integrate feature selection knowledge across clients without exposing sensitive information. In this extended journal version, we advance the framework from two perspectives: 1) developing a privacy-preserving knowledge fusion strategy to derive a unified representation space without sharing sensitive raw data. 2) incorporating a sample-aware weighting strategy to address distributional imbalance among heterogeneous local clients. Extensive experiments validate the effectiveness, robustness, and efficiency of our framework. The results further demonstrate its strong generalization ability in federated learning scenarios. The code and data are publicly available: https://anonymous.4open.science/r/FedCAPS-08BF.
【5】PatternKV: Flattening KV Representation Expands Quantization Headroom
标题:PatternKV:扁平KV表示扩展量化余量
链接:https://arxiv.org/abs/2510.05176
作者:Ji Zhang, Yiwei Li, Shaoxiong Feng, Peiwen Yuan, Xinglin Wang, Jiayi Shi, Yueqi Zhang, Chuyi Tan, Boyuan Pan, Yao Hu, Kan Li
摘要:KV cache in autoregressive LLMs eliminates redundant recomputation but has emerged as the dominant memory and bandwidth bottleneck during inference, notably with long contexts and test-time scaling. KV quantization is a key lever for reducing cache cost, but accuracy drops sharply as the native KV distribution lacks flatness and thus maintains a wide quantization range. Prior work focuses on isolating outliers, which caps their error but fails to flatten the overall distribution, leaving performance fragile under low-bit settings. In this work, we show that the K cache maintains a stable structure that evolves gradually with context, while the V cache carries latent semantic regularities. Building on these insights, we propose PatternKV, a pattern-aligned residual quantization scheme. It mines representative pattern vectors online, aligns each KV vector to its nearest pattern, and quantizes only the residual. This reshaping of the KV distribution flattens the quantization target and narrows its range, thereby improving the fidelity of low-bit KV quantization. Across long-context and test-time scaling settings on multiple backbones, PatternKV delivers consistent 2-bit gains, with a 0.08% average 4-bit drop relative to FP16, improves test-time scaling accuracy by 10% on average, and raises throughput by 1.4x while supporting 1.25x larger batches.
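A simplified sketch of pattern-aligned residual quantization: each KV vector is matched to its nearest pattern vector and only the residual is uniformly quantized at low bit-width. The fixed toy pattern set and the per-tensor quantizer are assumptions; the paper mines representative patterns online.

```python
import numpy as np

def uniform_quantize(x, n_bits=2):
    """Per-tensor uniform quantization to signed n_bits codes; returns codes and scale."""
    levels = 2 ** (n_bits - 1) - 1
    scale = np.max(np.abs(x)) / max(levels, 1) + 1e-8
    return np.clip(np.round(x / scale), -levels - 1, levels), scale

def pattern_residual_quantize(kv, patterns, n_bits=2):
    """Align each KV vector to its nearest pattern, then quantize only the residual."""
    dists = np.linalg.norm(kv[:, None, :] - patterns[None, :, :], axis=-1)
    assign = dists.argmin(axis=1)                       # nearest pattern per vector
    residual = kv - patterns[assign]
    codes, scale = uniform_quantize(residual, n_bits)
    return assign, codes, scale

def dequantize(assign, codes, scale, patterns):
    return patterns[assign] + codes * scale

rng = np.random.default_rng(0)
kv_cache = rng.normal(size=(6, 8)) + 3.0                       # toy K vectors with a shared offset
patterns = np.stack([kv_cache.mean(axis=0), np.zeros(8)])      # representative patterns (toy)
assign, codes, scale = pattern_residual_quantize(kv_cache, patterns)
err = np.abs(dequantize(assign, codes, scale, patterns) - kv_cache).mean()
print("mean abs reconstruction error:", round(float(err), 4))
```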
【6】Synthetic History: Evaluating Visual Representations of the Past in Diffusion Models
标题:合成历史:评估扩散模型中过去的视觉表示
链接:https://arxiv.org/abs/2505.17064
作者:Maria-Teresa De Rosa Palmini, Eva Cetinic
摘要:As Text-to-Image (TTI) diffusion models become increasingly influential in content creation, growing attention is being directed toward their societal and cultural implications. While prior research has primarily examined demographic and cultural biases, the ability of these models to accurately represent historical contexts remains largely underexplored. In this work, we present a systematic and reproducible methodology for evaluating how TTI systems depict different historical periods. For this purpose, we introduce the HistVis dataset, a curated collection of 30,000 synthetic images generated by three state-of-the-art diffusion models using carefully designed prompts depicting universal human activities across different historical periods. We evaluate generated imagery across three key aspects: (1) Implicit Stylistic Associations: examining default visual styles associated with specific eras; (2) Historical Consistency: identifying anachronisms such as modern artifacts in pre-modern contexts; and (3) Demographic Representation: comparing generated racial and gender distributions against historically plausible baselines. Our findings reveal systematic inaccuracies in historically themed generated imagery, as TTI models frequently stereotype past eras by incorporating unstated stylistic cues, introduce anachronisms, and fail to reflect plausible demographic patterns. By offering a scalable methodology and benchmark for assessing historical representation in generated imagery, this work provides an initial step toward building more historically accurate and culturally aligned TTI models.
3D|3D重建等相关(2篇)
【1】Bimanual 3D Hand Motion and Articulation Forecasting in Everyday Images
标题:日常图像中的双手3D手部运动与关节姿态预测
链接:https://arxiv.org/abs/2510.06145
作者:Aditya Prakash, David Forsyth, Saurabh Gupta
备注:Project page: this https URL
摘要:We tackle the problem of forecasting bimanual 3D hand motion & articulation from a single image in everyday settings. To address the lack of 3D hand annotations in diverse settings, we design an annotation pipeline consisting of a diffusion model to lift 2D hand keypoint sequences to 4D hand motion. For the forecasting model, we adopt a diffusion loss to account for the multimodality in hand motion distribution. Extensive experiments across 6 datasets show the benefits of training on diverse data with imputed labels (14% improvement) and effectiveness of our lifting (42% better) & forecasting (16.4% gain) models, over the best baselines, especially in zero-shot generalization to everyday images.
【2】Stratum: System-Hardware Co-Design with Tiered Monolithic 3D-Stackable DRAM for Efficient MoE Serving
标题:Stratum:采用分层单片3D可堆叠DRAM的系统硬件协同设计,实现高效的MoE服务
链接:https://arxiv.org/abs/2510.05245
作者:Yue Pan, Zihan Xia, Po-Kai Hsu, Lanxiang Hu, Hyungyo Kim, Janak Sharda, Minxuan Zhou, Nam Sung Kim, Shimeng Yu, Tajana Rosing, Mingu Kang
摘要:As Large Language Models (LLMs) continue to evolve, Mixture of Experts (MoE) architecture has emerged as a prevailing design for achieving state-of-the-art performance across a wide range of tasks. MoE models use sparse gating to activate only a handful of expert sub-networks per input, achieving billion-parameter capacity with inference costs akin to much smaller models. However, such models often pose challenges for hardware deployment due to the massive data volume introduced by the MoE layers. To address the challenges of serving MoE models, we propose Stratum, a system-hardware co-design approach that combines the novel memory technology Monolithic 3D-Stackable DRAM (Mono3D DRAM), near-memory processing (NMP), and GPU acceleration. The logic and Mono3D DRAM dies are connected through hybrid bonding, whereas the Mono3D DRAM stack and GPU are interconnected via silicon interposer. Mono3D DRAM offers higher internal bandwidth than HBM thanks to the dense vertical interconnect pitch enabled by its monolithic structure, which supports implementations of higher-performance near-memory processing. Furthermore, we tackle the latency differences introduced by aggressive vertical scaling of Mono3D DRAM along the z-dimension by constructing internal memory tiers and assigning data across layers based on access likelihood, guided by topic-based expert usage prediction to boost NMP throughput. The Stratum system achieves up to 8.29x improvement in decoding throughput and 7.66x better energy efficiency across various benchmarks compared to GPU baselines.
编码器(1篇)
【1】Aneurysm Growth Time Series Reconstruction Using Physics-informed Autoencoder
标题:使用物理信息的自动编码器重建动脉瘤生长时间序列
链接:https://arxiv.org/abs/2510.05183
作者:Jiacheng Wu
备注:21 pages, 13 figures
摘要:Arterial aneurysm is a bulb-shaped local expansion of human arteries, the rupture of which is a leading cause of morbidity and mortality in the US. Therefore, the prediction of arterial aneurysm rupture is of great significance for aneurysm management and treatment selection. The prediction of aneurysm rupture depends on the analysis of the time series of aneurysm growth history. However, due to the long time scale of aneurysm growth, the time series of aneurysm growth is not always accessible. We here propose a method to reconstruct the aneurysm growth time series directly from patient parameters. The prediction is based on data pairs of [patient parameters, patient aneurysm growth time history]. To obtain the mapping from patient parameters to patient aneurysm growth time history, we first apply an autoencoder to obtain a compact representation of the time series for each patient. Then a mapping is learned from patient parameters to the corresponding compact representation of the time series via a five-layer neural network. A moving average and a convolutional output layer are implemented to explicitly take into account the time dependency of the time series. Apart from that, we also propose to use prior knowledge about the mechanism of aneurysm growth to improve the time series reconstruction results. The prior physics-based knowledge is incorporated as constraints for the optimization problem associated with the autoencoder. The model can handle both algebraic and differential constraints. Our results show that including physical model information about the data will not significantly improve the time series reconstruction results if the training data is error-free. However, in the case of training data with noise and bias error, incorporating physical model constraints can significantly improve the predicted time series.
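A minimal sketch of the reconstruction pipeline described above: an autoencoder compresses each growth time series into a compact latent code, a small network maps patient parameters to that code, and a soft physics penalty (here assumed to be a non-decreasing aneurysm size) regularizes the reconstruction. All dimensions, layer sizes, and the specific constraint are illustrative assumptions; the paper's moving-average and convolutional output layers are omitted.
```python
import torch
import torch.nn as nn

T, LATENT, N_PARAMS = 64, 8, 10   # series length, code size, patient features (assumed)

class AE(nn.Module):
    """Autoencoder giving a compact representation of each growth time series."""
    def __init__(self):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(T, 64), nn.ReLU(), nn.Linear(64, LATENT))
        self.dec = nn.Sequential(nn.Linear(LATENT, 64), nn.ReLU(), nn.Linear(64, T))
    def forward(self, x):
        z = self.enc(x)
        return self.dec(z), z

# Maps patient parameters to the latent code of their growth history (sizes assumed).
param_to_latent = nn.Sequential(
    nn.Linear(N_PARAMS, 32), nn.ReLU(),
    nn.Linear(32, 32), nn.ReLU(),
    nn.Linear(32, LATENT),
)

def physics_penalty(series):
    # Soft prior-knowledge constraint (assumption): aneurysm size never decreases.
    return torch.relu(series[:, :-1] - series[:, 1:]).mean()

ae = AE()
opt = torch.optim.Adam(list(ae.parameters()) + list(param_to_latent.parameters()), lr=1e-3)

def train_step(params, series, lam=1.0):
    recon, z = ae(series)
    z_hat = param_to_latent(params)
    loss = (nn.functional.mse_loss(recon, series)          # reconstruction
            + nn.functional.mse_loss(z_hat, z.detach())    # parameters -> latent mapping
            + lam * physics_penalty(recon))                # physics-based regularization
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

# Toy usage with synthetic, monotonically growing series.
params = torch.randn(16, N_PARAMS)
series = torch.cumsum(torch.rand(16, T) * 0.1, dim=1)
print(train_step(params, series))
```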
优化|敛散性(6篇)
【1】Improved High-probability Convergence Guarantees of Decentralized SGD
标题:改进的分散SGD算法的高概率收敛保证
链接:https://arxiv.org/abs/2510.06141
作者:Aleksandar Armacki, Ali H. Sayed
备注:39 pages
摘要:Convergence in high-probability (HP) has been receiving increasing interest, due to its attractive properties, such as exponentially decaying tail bounds and strong guarantees for each individual run of an algorithm. While HP guarantees are extensively studied in centralized settings, much less is understood in the decentralized, networked setup. Existing HP studies in decentralized settings impose strong assumptions, like uniformly bounded gradients, or asymptotically vanishing noise, resulting in a significant gap between assumptions used to establish convergence in the HP and the mean-squared error (MSE) sense, even for vanilla Decentralized Stochastic Gradient Descent ($\mathtt{DSGD}$) algorithm. This is contrary to centralized settings, where it is known that $\mathtt{SGD}$ converges in HP under the same conditions on the cost function as needed to guarantee MSE convergence. Motivated by this observation, we revisit HP guarantees for $\mathtt{DSGD}$ in the presence of light-tailed noise. We show that $\mathtt{DSGD}$ converges in HP under the same conditions on the cost as in the MSE sense, removing uniformly bounded gradients and other restrictive assumptions, while simultaneously achieving order-optimal rates for both non-convex and strongly convex costs. Moreover, our improved analysis yields linear speed-up in the number of users, demonstrating that $\mathtt{DSGD}$ maintains strong performance in the HP sense and matches existing MSE guarantees. Our improved results stem from a careful analysis of the MGF of quantities of interest (norm-squared of gradient or optimality gap) and the MGF of the consensus gap between users' models. To achieve linear speed-up, we provide a novel result on the variance-reduction effect of decentralized methods in the HP sense and more fine-grained bounds on the MGF for strongly convex costs, which are both of independent interest.
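For reference, a toy numpy sketch of the vanilla DSGD update that the guarantees concern: each node averages its neighbors' models with a doubly stochastic mixing matrix and takes a local stochastic gradient step. The ring topology, least-squares objective, and step-size schedule are illustrative assumptions.
```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 4, 5                                   # nodes and parameter dimension (toy)
# Doubly stochastic mixing matrix for a 4-node ring (assumed topology).
W = np.array([[0.50, 0.25, 0.00, 0.25],
              [0.25, 0.50, 0.25, 0.00],
              [0.00, 0.25, 0.50, 0.25],
              [0.25, 0.00, 0.25, 0.50]])

A = [rng.standard_normal((20, d)) for _ in range(n)]             # local data
b = [a @ np.ones(d) + 0.1 * rng.standard_normal(20) for a in A]

def stoch_grad(x, i):
    j = rng.integers(len(b[i]))                                  # sample one local point
    return (A[i][j] @ x - b[i][j]) * A[i][j]                     # least-squares gradient

X = np.zeros((n, d))                                             # one model per node
for k in range(2000):
    alpha = 1.0 / (k + 10)                                       # decaying step size
    X = W @ X - alpha * np.stack([stoch_grad(X[i], i) for i in range(n)])

print("consensus gap:", np.linalg.norm(X - X.mean(axis=0)))
print("distance to optimum:", np.linalg.norm(X.mean(axis=0) - np.ones(d)))
```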
【2】In-the-Flow Agentic System Optimization for Effective Planning and Tool Use
标题:流程内智能体系统优化以实现有效规划和工具使用
链接:https://arxiv.org/abs/2510.05592
作者:Zhuofeng Li, Haoxiang Zhang, Seungju Han, Sheng Liu, Jianwen Xie, Yu Zhang, Yejin Choi, James Zou, Pan Lu
备注:45 pages, 12 figures. Project website: this https URL
摘要:Outcome-driven reinforcement learning has advanced reasoning in large language models (LLMs), but prevailing tool-augmented approaches train a single, monolithic policy that interleaves thoughts and tool calls under full context; this scales poorly with long horizons and diverse tools and generalizes weakly to new scenarios. Agentic systems offer a promising alternative by decomposing work across specialized modules, yet most remain training-free or rely on offline training decoupled from the live dynamics of multi-turn interaction. We introduce AgentFlow, a trainable, in-the-flow agentic framework that coordinates four modules (planner, executor, verifier, generator) through an evolving memory and directly optimizes its planner inside the multi-turn loop. To train on-policy in live environments, we propose Flow-based Group Refined Policy Optimization (Flow-GRPO), which tackles long-horizon, sparse-reward credit assignment by converting multi-turn optimization into a sequence of tractable single-turn policy updates. It broadcasts a single, verifiable trajectory-level outcome to every turn to align local planner decisions with global success and stabilizes learning with group-normalized advantages. Across ten benchmarks, AgentFlow with a 7B-scale backbone outperforms top-performing baselines with average accuracy gains of 14.9% on search, 14.0% on agentic, 14.5% on mathematical, and 4.1% on scientific tasks, even surpassing larger proprietary models like GPT-4o. Further analyses confirm the benefits of in-the-flow optimization, showing improved planning, enhanced tool-calling reliability, and positive scaling with model size and reasoning turns.
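A minimal sketch of the credit-assignment idea behind Flow-GRPO as described in the abstract: one verifiable trajectory-level outcome per rollout is group-normalized and then broadcast to every turn of that trajectory. Function and variable names are assumptions, not the released AgentFlow code.
```python
import numpy as np

def flow_grpo_advantages(group_rewards, turns_per_traj, eps=1e-8):
    """group_rewards: one verifiable outcome (e.g. 0/1) per trajectory in a group of
    rollouts for the same query. Returns per-turn advantages for each trajectory."""
    r = np.asarray(group_rewards, dtype=float)
    adv = (r - r.mean()) / (r.std() + eps)            # group-normalized advantage
    # Broadcast the trajectory-level advantage to every turn of that trajectory.
    return [np.full(t, a) for a, t in zip(adv, turns_per_traj)]

# Example: four rollouts of the same task, two successful, with different turn counts.
for a in flow_grpo_advantages([1, 0, 1, 0], turns_per_traj=[3, 5, 2, 4]):
    print(a)
```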
【3】NeST-BO: Fast Local Bayesian Optimization via Newton-Step Targeting of Gradient and Hessian Information
标题:NeST-BO:通过瞄准牛顿步并利用梯度与Hessian信息实现快速局部贝叶斯优化
链接:https://arxiv.org/abs/2510.05516
作者:Wei-Ting Tang, Akshay Kudva, Joel A. Paulson
摘要:Bayesian optimization (BO) is effective for expensive black-box problems but remains challenging in high dimensions. We propose NeST-BO, a local BO method that targets the Newton step by jointly learning gradient and Hessian information with Gaussian process surrogates, and selecting evaluations via a one-step lookahead bound on Newton-step error. We show that this bound (and hence the step error) contracts with batch size, so NeST-BO directly inherits inexact-Newton convergence: global progress under mild stability assumptions and quadratic local rates once steps are sufficiently accurate. To scale, we optimize the acquisition in low-dimensional subspaces (e.g., random embeddings or learned sparse subspaces), reducing the dominant cost of learning curvature from $O(d^2)$ to $O(m^2)$ with $m \ll d$ while preserving step targeting. Across high-dimensional synthetic and real-world problems, including cases with thousands of variables and unknown active subspaces, NeST-BO consistently yields faster convergence and lower regret than state-of-the-art local and high-dimensional BO baselines.
【4】Simultaneous Learning and Optimization via Misspecified Saddle Point Problems
标题:通过错误指定鞍点问题同时学习和优化
链接:https://arxiv.org/abs/2510.05241
作者:Mohammad Mahdi Ahmadi, Erfan Yazdandoost Hamedani
摘要:We study a class of misspecified saddle point (SP) problems, where the optimization objective depends on an unknown parameter that must be learned concurrently from data. Unlike existing studies that assume parameters are fully known or pre-estimated, our framework integrates optimization and learning into a unified formulation, enabling a more flexible problem class. To address this setting, we propose two algorithms based on the accelerated primal-dual (APD) by Hamedani & Aybat 2021. In particular, we first analyze the naive extension of the APD method by directly substituting the evolving parameter estimates into the primal-dual updates; then, we design a new learning-aware variant of the APD method that explicitly accounts for parameter dynamics by adjusting the momentum updates. Both methods achieve a provable convergence rate of $\mathcal{O}(\log K / K)$, while the learning-aware approach attains a tighter $\mathcal{O}(1)$ constant and further benefits from an adaptive step-size selection enabled by a backtracking strategy. Furthermore, we extend the framework to problems where the learning problem admits multiple optimal solutions, showing that our modified algorithm for a structured setting achieves an $\mathcal{O}(1/\sqrt{K})$ rate. To demonstrate practical impact, we evaluate our methods on a misspecified portfolio optimization problem and show superior empirical performance compared to state-of-the-art algorithms.
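To make the "simultaneous learning and optimization" setup concrete, here is a toy sketch in which an unknown parameter is estimated online (by a running average of noisy observations) while a saddle-point problem depending on it is solved by plugging in the current estimate. The sketch uses plain gradient descent-ascent rather than the paper's accelerated primal-dual method, and the quadratic objective is an illustrative assumption.
```python
import numpy as np

rng = np.random.default_rng(1)
d = 3
theta_true = np.array([1.0, -2.0, 0.5])      # unknown parameter, observed with noise

# Saddle-point problem min_x max_y (x - theta)^T y - 0.5||y||^2, whose solution is x* = theta, y* = 0.
x, y = np.zeros(d), np.zeros(d)
theta_hat, n_obs = np.zeros(d), 0
eta = 0.05

for k in range(5000):
    # Learning step: running average of noisy observations of the unknown parameter.
    n_obs += 1
    theta_hat += (theta_true + 0.5 * rng.standard_normal(d) - theta_hat) / n_obs
    # Optimization step: plug the current estimate into the primal-dual update.
    grad_x = y                                 # partial derivative w.r.t. x
    grad_y = (x - theta_hat) - y               # partial derivative w.r.t. y
    x, y = x - eta * grad_x, y + eta * grad_y

print("x ->", x, " (target:", theta_true, ")")
print("||y|| ->", np.linalg.norm(y))
```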
【5】Generative Inverse Design: From Single Point Optimization to a Diverse Design Portfolio via Conditional Variational Autoencoders
标题:生成式逆设计:通过条件变分自动编码器从单点优化到多元化设计组合
链接:https://arxiv.org/abs/2510.05160
作者:Muhammad Arif Hakimi Zamrai
摘要:Inverse design, which seeks to find optimal parameters for a target output, is a central challenge in engineering. Surrogate-based optimization (SBO) has become a standard approach, yet it is fundamentally structured to converge to a single-point solution, thereby limiting design space exploration and ignoring potentially valuable alternative topologies. This paper presents a paradigm shift from single-point optimization to generative inverse design. We introduce a framework based on a Conditional Variational Autoencoder (CVAE) that learns a probabilistic mapping between a system's design parameters and its performance, enabling the generation of a diverse portfolio of high-performing candidates conditioned on a specific performance objective. We apply this methodology to the complex, non-linear problem of minimizing airfoil self-noise, using a high-performing SBO method from a prior benchmark study as a rigorous baseline. The CVAE framework successfully generated 256 novel designs with a 94.1\% validity rate. A subsequent surrogate-based evaluation revealed that 77.2\% of these valid designs achieved superior performance compared to the single optimal design found by the SBO baseline. This work demonstrates that the generative approach not only discovers higher-quality solutions but also provides a rich portfolio of diverse candidates, fundamentally enhancing the engineering design process by enabling multi-criteria decision-making.
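A compact PyTorch sketch of the CVAE idea: both encoder and decoder are conditioned on the performance objective, so sampling the latent at a fixed target yields a diverse portfolio of candidate designs. Dimensions, the toy "performance" label, and architecture sizes are assumptions; the airfoil self-noise setup of the paper is not reproduced.
```python
import torch
import torch.nn as nn

D_DESIGN, D_PERF, D_LAT = 5, 1, 4             # design, performance, latent dims (assumed)

class CVAE(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(D_DESIGN + D_PERF, 64), nn.ReLU(),
                                 nn.Linear(64, 2 * D_LAT))
        self.dec = nn.Sequential(nn.Linear(D_LAT + D_PERF, 64), nn.ReLU(),
                                 nn.Linear(64, D_DESIGN))

    def forward(self, x, c):
        mu, logvar = self.enc(torch.cat([x, c], -1)).chunk(2, -1)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterization
        return self.dec(torch.cat([z, c], -1)), mu, logvar

def elbo_loss(model, x, c):
    recon, mu, logvar = model(x, c)
    rec = nn.functional.mse_loss(recon, x, reduction="sum")
    kld = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return rec + kld

model = CVAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.rand(256, D_DESIGN)                 # toy design parameters
c = x.sum(-1, keepdim=True)                   # toy "performance" label
for _ in range(200):
    opt.zero_grad(); loss = elbo_loss(model, x, c); loss.backward(); opt.step()

# Generate a portfolio of candidate designs conditioned on one target performance value.
with torch.no_grad():
    target = torch.full((256, D_PERF), 2.5)
    candidates = model.dec(torch.cat([torch.randn(256, D_LAT), target], -1))
print(candidates.shape)
```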
【6】Bilevel optimization for learning hyperparameters: Application to solving PDEs and inverse problems with Gaussian processes
标题:用于学习超参数的双层优化:应用于利用高斯过程求解偏微分方程和反问题
链接:https://arxiv.org/abs/2510.05568
作者:Nicholas H. Nelsen, Houman Owhadi, Andrew M. Stuart, Xianjin Yang, Zongren Zou
摘要:Methods for solving scientific computing and inference problems, such as kernel- and neural network-based approaches for partial differential equations (PDEs), inverse problems, and supervised learning tasks, depend crucially on the choice of hyperparameters. Specifically, the efficacy of such methods, and in particular their accuracy, stability, and generalization properties, strongly depends on the choice of hyperparameters. While bilevel optimization offers a principled framework for hyperparameter tuning, its nested optimization structure can be computationally demanding, especially in PDE-constrained contexts. In this paper, we propose an efficient strategy for hyperparameter optimization within the bilevel framework by employing a Gauss-Newton linearization of the inner optimization step. Our approach provides closed-form updates, eliminating the need for repeated costly PDE solves. As a result, each iteration of the outer loop reduces to a single linearized PDE solve, followed by explicit gradient-based hyperparameter updates. We demonstrate the effectiveness of the proposed method through Gaussian process models applied to nonlinear PDEs and to PDE inverse problems. Extensive numerical experiments highlight substantial improvements in accuracy and robustness compared to conventional random hyperparameter initialization. In particular, experiments with additive kernels and neural network-parameterized deep kernels demonstrate the method's scalability and effectiveness for high-dimensional hyperparameter optimization.
预测|估计(7篇)
【1】Comparing LSTM-Based Sequence-to-Sequence Forecasting Strategies for 24-Hour Solar Proton Flux Profiles Using GOES Data
标题:使用GOES数据比较基于LSTM的24小时太阳质子通量剖面序列到序列预测策略
链接:https://arxiv.org/abs/2510.05399
作者:Kangwoo Yi, Bo Shen, Qin Li, Haimin Wang, Yong-Jae Moon, Jaewon Lee, Hwanhee Lee
备注:7 pages; accepted as a workshop paper at ICDM 2025
摘要:Solar Proton Events (SPEs) cause significant radiation hazards to satellites, astronauts, and technological systems. Accurate forecasting of their proton flux time profiles is crucial for early warnings and mitigation. This paper explores deep learning sequence-to-sequence (seq2seq) models based on Long Short-Term Memory networks to predict 24-hour proton flux profiles following SPE onsets. We used a dataset of 40 well-connected SPEs (1997-2017) observed by NOAA GOES, each associated with a >=M-class western-hemisphere solar flare and undisturbed proton flux profiles. Using 4-fold stratified cross-validation, we evaluate seq2seq model configurations (varying hidden units and embedding dimensions) under multiple forecasting scenarios: (i) proton-only input vs. combined proton+X-ray input, (ii) original flux data vs. trend-smoothed data, and (iii) autoregressive vs. one-shot forecasting. Our major results are as follows: First, one-shot forecasting consistently yields lower error than autoregressive prediction, avoiding the error accumulation seen in iterative approaches. Second, on the original data, proton-only models outperform proton+X-ray models. However, with trend-smoothed data, this gap narrows or reverses in proton+X-ray models. Third, trend-smoothing significantly enhances the performance of proton+X-ray models by mitigating fluctuations in the X-ray channel. Fourth, while models trained on trend-smoothed data perform best on average, the best-performing model was trained on original data, suggesting that architectural choices can sometimes outweigh the benefits of data preprocessing.
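A minimal sketch of the one-shot seq2seq configuration that the study finds most reliable: an LSTM encodes the observed flux history and a linear head emits all 24 future steps at once, avoiding autoregressive error accumulation. Channel counts, sequence lengths, and hidden sizes are illustrative assumptions.
```python
import torch
import torch.nn as nn

N_IN, HIDDEN, HORIZON = 2, 64, 24   # channels (proton + X-ray), hidden units, forecast steps

class OneShotForecaster(nn.Module):
    """Encodes the observed flux history and emits the full 24-step profile at once,
    avoiding the error accumulation of step-by-step autoregressive decoding."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.LSTM(N_IN, HIDDEN, batch_first=True)
        self.head = nn.Linear(HIDDEN, HORIZON)

    def forward(self, history):                 # history: (batch, observed steps, N_IN)
        _, (h, _) = self.encoder(history)
        return self.head(h[-1])                 # (batch, HORIZON) proton-flux forecast

model = OneShotForecaster()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
history = torch.randn(8, 12, N_IN)              # 12 observed steps of toy data
target = torch.randn(8, HORIZON)
loss = nn.functional.mse_loss(model(history), target)
loss.backward(); opt.step()
print(loss.item())
```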
【2】Fusion-Based Neural Generalization for Predicting Temperature Fields in Industrial PET Preform Heating
标题:基于融合的神经泛化方法用于预测工业PET预制品加热中的温度场
链接:https://arxiv.org/abs/2510.05394
作者:Ahmad Alsheikh, Andreas Fischer
备注:Workshop paper, AIP2025: Second Workshop on AI in Production (2025). Licensed under CC BY 4.0
摘要:Accurate and efficient temperature prediction is critical for optimizing the preheating process of PET preforms in industrial microwave systems prior to blow molding. We propose a novel deep learning framework for generalized temperature prediction. Unlike traditional models that require extensive retraining for each material or design variation, our method introduces a data-efficient neural architecture that leverages transfer learning and model fusion to generalize across unseen scenarios. By pretraining specialized neural regressor on distinct conditions such as recycled PET heat capacities or varying preform geometries and integrating their representations into a unified global model, we create a system capable of learning shared thermal dynamics across heterogeneous inputs. The architecture incorporates skip connections to enhance stability and prediction accuracy. Our approach reduces the need for large simulation datasets while achieving superior performance compared to models trained from scratch. Experimental validation on two case studies material variability and geometric diversity demonstrates significant improvements in generalization, establishing a scalable ML-based solution for intelligent thermal control in manufacturing environments. Moreover, the approach highlights how data-efficient generalization strategies can extend to other industrial applications involving complex physical modeling with limited data.
【3】A Neural Network Algorithm for KL Divergence Estimation with Quantitative Error Bounds
标题:具有定量误差界的KL散度估计神经网络算法
链接:https://arxiv.org/abs/2510.05386
作者:Mikil Foss, Andrew Lamperski
备注:Under Review for AISTATS 2026
摘要:Estimating the Kullback-Leibler (KL) divergence between random variables is a fundamental problem in statistical analysis. For continuous random variables, traditional information-theoretic estimators scale poorly with dimension and/or sample size. To mitigate this challenge, a variety of methods have been proposed to estimate KL divergences and related quantities, such as mutual information, using neural networks. The existing theoretical analyses show that neural network parameters achieving low error exist. However, since they rely on non-constructive neural network approximation theorems, they do not guarantee that the existing algorithms actually achieve low error. In this paper, we propose a KL divergence estimation algorithm using a shallow neural network with randomized hidden weights and biases (i.e. a random feature method). We show that with high probability, the algorithm achieves a KL divergence estimation error of $O(m^{-1/2}+T^{-1/3})$, where $m$ is the number of neurons and $T$ is both the number of steps of the algorithm and the number of samples.
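An illustrative sketch of a random-feature KL estimator in the spirit of the abstract: hidden weights and biases are drawn once and frozen, and only the output layer is trained, here by gradient ascent on the Donsker-Varadhan variational bound. The specific objective, step size, and Gaussian test case are assumptions, not the paper's exact algorithm or guarantees.
```python
import numpy as np

rng = np.random.default_rng(0)
d, m, n = 2, 512, 5000                 # dimension, random neurons, samples per distribution

# P = N(mu, I) and Q = N(0, I): the true KL divergence is ||mu||^2 / 2.
mu = np.array([1.0, 0.5])
xp = rng.standard_normal((n, d)) + mu
xq = rng.standard_normal((n, d))

# Random hidden layer: weights and biases drawn once and never trained.
W = rng.standard_normal((d, m))
b = rng.uniform(-1.0, 1.0, m)
feat = lambda x: np.maximum(x @ W + b, 0.0)       # ReLU random features
Fp, Fq = feat(xp), feat(xq)

theta = np.zeros(m)                               # only the output weights are trained
lr = 1e-3
for t in range(2000):
    fq = Fq @ theta
    w = np.exp(fq - fq.max()); w /= w.sum()       # softmax weights for the log-mean-exp term
    # Gradient ascent on the Donsker-Varadhan bound  E_P[f] - log E_Q[exp f].
    theta += lr * (Fp.mean(axis=0) - w @ Fq)

fp, fq = Fp @ theta, Fq @ theta
estimate = fp.mean() - (np.log(np.mean(np.exp(fq - fq.max()))) + fq.max())
print("DV estimate:", estimate, " true KL:", 0.5 * mu @ mu)
```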
【4】ECLipsE-Gen-Local: Efficient Compositional Local Lipschitz Estimates for Deep Neural Networks
标题:ECLipsE-Gen-Local:深度神经网络的高效组合式局部Lipschitz估计
链接:https://arxiv.org/abs/2510.05261
作者:Yuezhu Xu, S. Sivaranjani
摘要:The Lipschitz constant is a key measure for certifying the robustness of neural networks to input perturbations. However, computing the exact constant is NP-hard, and standard approaches to estimate the Lipschitz constant involve solving a large matrix semidefinite program (SDP) that scales poorly with network size. Further, there is a potential to efficiently leverage local information on the input region to provide tighter Lipschitz estimates. We address this problem here by proposing a compositional framework that yields tight yet scalable Lipschitz estimates for deep feedforward neural networks. Specifically, we begin by developing a generalized SDP framework that is highly flexible, accommodating heterogeneous activation function slope, and allowing Lipschitz estimates with respect to arbitrary input-output pairs and arbitrary choices of sub-networks of consecutive layers. We then decompose this generalized SDP into a sequence of small sub-problems, with computational complexity that scales linearly with respect to the network depth. We also develop a variant that achieves near-instantaneous computation through closed-form solutions to each sub-problem. All our algorithms are accompanied by theoretical guarantees on feasibility and validity. Next, we develop a series of algorithms, termed as ECLipsE-Gen-Local, that effectively incorporate local information on the input. Our experiments demonstrate that our algorithms achieve substantial speedups over a multitude of benchmarks while producing significantly tighter Lipschitz bounds than global approaches. Moreover, we show that our algorithms provide strict upper bounds for the Lipschitz constant with values approaching the exact Jacobian from autodiff when the input region is small enough. Finally, we demonstrate the practical utility of our approach by showing that our Lipschitz estimates closely align with network robustness.
【5】Carbon Emission Prediction in China Considering New Quality Productive Forces Using a Deep & Cross Learning Modeling Framework
标题:使用Deep & Cross学习建模框架、考虑新质生产力的中国碳排放预测
链接:https://arxiv.org/abs/2510.05171
作者:Haijin Xie, Gongquan Zhang
摘要:New quality productive forces (NQPF), digital economy advancement, and artificial intelligence (AI) technologies are becoming crucial for promoting sustainable urban development. This study proposes a Multi-head Attention Deep & Cross Network (MADCN) framework, combining feature interaction modeling and attention mechanisms, to predict urban carbon emissions and investigate the impacts of technological factors. The framework incorporates an interpretable learning phase using SHapley Additive exPlanations (SHAP) to assess the contributions of different features. A panel dataset covering 275 Chinese cities is utilized to test the MADCN model. Experimental results demonstrate that the MADCN model achieves superior predictive performance compared to traditional machine learning and deep learning baselines, with a Mean Squared Error (MSE) of 406,151.063, a Mean Absolute Error (MAE) of 612.304, and an R-squared value of 0.991 on the test set. SHAP analysis highlights that population, city size, urbanization rate, and GDP are among the most influential factors on carbon emissions, while NQPF, digital economy index, and AI technology level also show meaningful but relatively moderate effects. Advancing NQPF, strengthening the digital economy, and accelerating AI technology development can significantly contribute to reducing urban carbon emissions. Policymakers should prioritize integrating technological innovation into carbon reduction strategies, particularly by promoting intelligent infrastructure and enhancing digitalization across sectors, to effectively achieve dual-carbon goals.
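For readers unfamiliar with Deep & Cross architectures, a minimal PyTorch sketch of the cross-plus-deep backbone: each cross layer forms explicit feature interactions via x_{l+1} = x0 * (w^T x_l) + b + x_l, and the cross and deep branches are concatenated for the regression head. The multi-head attention component of MADCN and the SHAP analysis are omitted; feature counts are assumptions.
```python
import torch
import torch.nn as nn

class CrossLayer(nn.Module):
    """One cross layer: x_{l+1} = x0 * (w^T x_l) + b + x_l."""
    def __init__(self, dim):
        super().__init__()
        self.w = nn.Parameter(torch.randn(dim) * 0.01)
        self.b = nn.Parameter(torch.zeros(dim))

    def forward(self, x0, xl):
        return x0 * (xl @ self.w).unsqueeze(-1) + self.b + xl

class DeepCrossNet(nn.Module):
    def __init__(self, dim, n_cross=3, hidden=64):
        super().__init__()
        self.cross = nn.ModuleList([CrossLayer(dim) for _ in range(n_cross)])
        self.deep = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(),
                                  nn.Linear(hidden, hidden), nn.ReLU())
        self.out = nn.Linear(dim + hidden, 1)     # regression head (carbon emissions)

    def forward(self, x):
        xc = x
        for layer in self.cross:
            xc = layer(x, xc)
        return self.out(torch.cat([xc, self.deep(x)], -1)).squeeze(-1)

model = DeepCrossNet(dim=12)                      # 12 city-level features (assumed)
print(model(torch.randn(32, 12)).shape)
```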
【6】Climate Model Tuning with Online Synchronization-Based Parameter Estimation
标题:利用基于在线同步的参数估计进行气候模型调整
链接:https://arxiv.org/abs/2510.06180
作者:Jordan Seneca, Suzanne Bintanja, Frank M. Selten
备注:19 pages, 11 figures
摘要:In climate science, the tuning of climate models is a computationally intensive problem due to the combination of the high-dimensionality of the system state and long integration times. Here we demonstrate the potential of a parameter estimation algorithm which makes use of synchronization to tune a global atmospheric model at modest computational costs. We first use it to directly optimize internal model parameters. We then apply the algorithm to the weights of each member of a supermodel ensemble to optimize the overall predictions. In both cases, the algorithm is able to find parameters which result in reduced errors in the climatology of the model. Finally, we introduce a novel approach which combines both methods called adaptive supermodeling, where the internal parameters of the members of a supermodel are tuned simultaneously with the model weights such that the supermodel predictions are optimized. For a case designed to challenge the two previous methods, adaptive supermodeling achieves a performance similar to a perfect model.
【7】Differentiable Model Predictive Control on the GPU
标题:GPU上的可微模型预测控制
链接:https://arxiv.org/abs/2510.06179
作者:Emre Adabag, Marcus Greiff, John Subosits, Thomas Lew
摘要:Differentiable model predictive control (MPC) offers a powerful framework for combining learning and control. However, its adoption has been limited by the inherently sequential nature of traditional optimization algorithms, which are challenging to parallelize on modern computing hardware like GPUs. In this work, we tackle this bottleneck by introducing a GPU-accelerated differentiable optimization tool for MPC. This solver leverages sequential quadratic programming and a custom preconditioned conjugate gradient (PCG) routine with tridiagonal preconditioning to exploit the problem's structure and enable efficient parallelization. We demonstrate substantial speedups over CPU- and GPU-based baselines, significantly improving upon state-of-the-art training times on benchmark reinforcement learning and imitation learning tasks. Finally, we showcase the method on the challenging task of reinforcement learning for driving at the limits of handling, where it enables robust drifting of a Toyota Supra through water puddles.
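A plain numpy sketch of the preconditioned conjugate gradient routine at the core of such solvers; for simplicity it uses a diagonal (Jacobi) preconditioner in place of the paper's tridiagonal preconditioning, and an arbitrary SPD system stands in for the MPC subproblem.
```python
import numpy as np

def pcg(A, b, M_inv, tol=1e-8, max_iter=200):
    """Preconditioned conjugate gradient for A x = b; M_inv applies the preconditioner."""
    x = np.zeros_like(b)
    r = b - A @ x
    z = M_inv(r)
    p = z.copy()
    rz = r @ z
    for _ in range(max_iter):
        Ap = A @ p
        alpha = rz / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) < tol:
            break
        z = M_inv(r)
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x

rng = np.random.default_rng(0)
Q = rng.standard_normal((50, 50))
A = Q @ Q.T + 50 * np.eye(50)            # SPD system standing in for an MPC subproblem
b = rng.standard_normal(50)

jacobi = lambda r: r / np.diag(A)        # diagonal preconditioner (stand-in for tridiagonal)
x = pcg(A, b, jacobi)
print("residual:", np.linalg.norm(A @ x - b))
```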
其他神经网络|深度学习|模型|建模(29篇)
【1】Thermodynamic Performance Limits for Score-Based Diffusion Models
标题:基于分数的扩散模型的热力学性能极限
链接:https://arxiv.org/abs/2510.06174
作者:Nathan X. Kodama, Michael Hinczewski
摘要:We establish a fundamental connection between score-based diffusion models and non-equilibrium thermodynamics by deriving performance limits based on entropy rates. Our main theoretical contribution is a lower bound on the negative log-likelihood of the data that relates model performance to entropy rates of diffusion processes. We numerically validate this bound on a synthetic dataset and investigate its tightness. By building a bridge to entropy rates - system, intrinsic, and exchange entropy - we provide new insights into the thermodynamic operation of these models, drawing parallels to Maxwell's demon and implications for thermodynamic computing hardware. Our framework connects generative modeling performance to fundamental physical principles through stochastic thermodynamics.
【2】Downsized and Compromised?: Assessing the Faithfulness of Model Compression
标题:缩小规模,是否有所妥协?:评估模型压缩的忠实性
链接:https://arxiv.org/abs/2510.06125
作者:Moumita Kamal, Douglas A. Talbert
备注:Submitted to and under review at Springer Machine Learning Journal
摘要:In real-world applications, computational constraints often require transforming large models into smaller, more efficient versions through model compression. While these techniques aim to reduce size and computational cost without sacrificing performance, their evaluations have traditionally focused on the trade-off between size and accuracy, overlooking the aspect of model faithfulness. This limited view is insufficient for high-stakes domains like healthcare, finance, and criminal justice, where compressed models must remain faithful to the behavior of their original counterparts. This paper presents a novel approach to evaluating faithfulness in compressed models, moving beyond standard metrics. We introduce and demonstrate a set of faithfulness metrics that capture how model behavior changes post-compression. Our contributions include introducing techniques to assess predictive consistency between the original and compressed models using model agreement, and applying chi-squared tests to detect statistically significant changes in predictive patterns across both the overall dataset and demographic subgroups, thereby exposing shifts that aggregate fairness metrics may obscure. We demonstrate our approaches by applying quantization and pruning to artificial neural networks (ANNs) trained on three diverse and socially meaningful datasets. Our findings show that high accuracy does not guarantee faithfulness, and our statistical tests detect subtle yet significant shifts that are missed by standard metrics, such as Accuracy and Equalized Odds. The proposed metrics provide a practical and more direct method for ensuring that efficiency gains through compression do not compromise the fairness or faithfulness essential for trustworthy AI.
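One plausible instantiation of the proposed faithfulness checks, assuming hard class predictions are available from both models: prediction agreement between the original and compressed model, plus a chi-squared test of homogeneity on their predicted-class distributions (which can be repeated per demographic subgroup). The paper's exact metric definitions may differ.
```python
import numpy as np
from scipy.stats import chi2_contingency

def agreement(pred_orig, pred_comp):
    """Fraction of inputs on which the compressed model matches the original model."""
    return float(np.mean(np.asarray(pred_orig) == np.asarray(pred_comp)))

def prediction_shift_test(pred_orig, pred_comp, n_classes):
    """Chi-squared test on predicted-class counts before vs. after compression."""
    table = np.array([[np.sum(np.asarray(pred_orig) == k) for k in range(n_classes)],
                      [np.sum(np.asarray(pred_comp) == k) for k in range(n_classes)]])
    chi2, p, _, _ = chi2_contingency(table)
    return chi2, p

rng = np.random.default_rng(0)
pred_orig = rng.integers(0, 2, 1000)
pred_comp = np.where(rng.random(1000) < 0.95, pred_orig, 1 - pred_orig)  # 5% flipped labels
print("agreement:", agreement(pred_orig, pred_comp))
print("chi2, p-value:", prediction_shift_test(pred_orig, pred_comp, n_classes=2))
# The same test can be repeated within each demographic subgroup to expose hidden shifts.
```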
【3】The Physics of Data and Tasks: Theories of Locality and Compositionality in Deep Learning
标题:数据和任务的物理学:深度学习中的局部性与组合性理论
链接:https://arxiv.org/abs/2510.06106
作者:Alessandro Favero
备注:PhD dissertation. Preprint
摘要:Deep neural networks have achieved remarkable success, yet our understanding of how they learn remains limited. These models can learn high-dimensional tasks, which is generally statistically intractable due to the curse of dimensionality. This apparent paradox suggests that learnable data must have an underlying latent structure. What is the nature of this structure? How do neural networks encode and exploit it, and how does it quantitatively impact performance - for instance, how does generalization improve with the number of training examples? This thesis addresses these questions by studying the roles of locality and compositionality in data, tasks, and deep learning representations.
【4】Learning Mixtures of Linear Dynamical Systems (MoLDS) via Hybrid Tensor-EM Method
标题:通过混合张量-EM方法学习线性动态系统混合模型(MoLDS)
链接:https://arxiv.org/abs/2510.06091
作者:Lulu Gong, Shreya Saxena
备注:20 pages, 7 figures
摘要:Mixtures of linear dynamical systems (MoLDS) provide a path to model time-series data that exhibit diverse temporal dynamics across trajectories. However, its application remains challenging in complex and noisy settings, limiting its effectiveness for neural data analysis. Tensor-based moment methods can provide global identifiability guarantees for MoLDS, but their performance degrades under noise and complexity. Commonly used expectation-maximization (EM) methods offer flexibility in fitting latent models but are highly sensitive to initialization and prone to poor local minima. Here, we propose a tensor-based method that provides identifiability guarantees for learning MoLDS, which is followed by EM updates to combine the strengths of both approaches. The novelty in our approach lies in the construction of moment tensors using the input-output data to recover globally consistent estimates of mixture weights and system parameters. These estimates can then be refined through a Kalman EM algorithm, with closed-form updates for all LDS parameters. We validate our framework on synthetic benchmarks and real-world datasets. On synthetic data, the proposed Tensor-EM method achieves more reliable recovery and improved robustness compared to either pure tensor or randomly initialized EM methods. We then analyze neural recordings from the primate somatosensory cortex while a non-human primate performs reaches in different directions. Our method successfully models and clusters different conditions as separate subsystems, consistent with supervised single-LDS fits for each condition. Finally, we apply this approach to another neural dataset where monkeys perform a sequential reaching task. These results demonstrate that MoLDS provides an effective framework for modeling complex neural data, and that Tensor-EM is a reliable approach to MoLDS learning for these applications.
【5】Benchmark It Yourself (BIY): Preparing a Dataset and Benchmarking AI Models for Scatterplot-Related Tasks
标题:自己动手做基准测试(BIY):为散点图相关任务准备数据集并对人工智能模型进行基准测试
链接:https://arxiv.org/abs/2510.06071
作者:João Palmeiro, Diogo Duarte, Rita Costa, Pedro Bizarro
备注:9 pages, 3 figures, short paper accepted at VISxGenAI: 1st Workshop on GenAI, Agents, and the Future of VIS (IEEE VIS 2025)
摘要:AI models are increasingly used for data analysis and visualization, yet benchmarks rarely address scatterplot-specific tasks, limiting insight into performance. To address this gap for one of the most common chart types, we introduce a synthetic, annotated dataset of over 18,000 scatterplots from six data generators and 17 chart designs, and a benchmark based on it. We evaluate proprietary models from OpenAI and Google using N-shot prompting on five distinct tasks derived from annotations of cluster bounding boxes, their center coordinates, and outlier coordinates. OpenAI models and Gemini 2.5 Flash, especially when prompted with examples, are viable options for counting clusters and, in Flash's case, outliers (90%+ Accuracy). However, the results for localization-related tasks are unsatisfactory: Precision and Recall are near or below 50%, except for Flash in outlier identification (65.01%). Furthermore, the impact of chart design on performance appears to be a secondary factor, but it is advisable to avoid scatterplots with wide aspect ratios (16:9 and 21:9) or those colored randomly. Supplementary materials are available at https://github.com/feedzai/biy-paper.
【6】Gaussian Embeddings: How JEPAs Secretly Learn Your Data Density
标题:高斯嵌入:JEPA如何在不知不觉中学习你的数据密度
链接:https://arxiv.org/abs/2510.05949
作者:Randall Balestriero, Nicolas Ballas, Mike Rabbat, Yann LeCun
摘要:Joint Embedding Predictive Architectures (JEPAs) learn representations able to solve numerous downstream tasks out-of-the-box. JEPAs combine two objectives: (i) a latent-space prediction term, i.e., the representation of a slightly perturbed sample must be predictable from the original sample's representation, and (ii) an anti-collapse term, i.e., not all samples should have the same representation. While (ii) is often considered as an obvious remedy to representation collapse, we uncover that JEPAs' anti-collapse term does much more--it provably estimates the data density. In short, any successfully trained JEPA can be used to get sample probabilities, e.g., for data curation, outlier detection, or simply for density estimation. Our theoretical finding is agnostic of the dataset and architecture used--in any case one can compute the learned probabilities of sample $x$ efficiently and in closed-form using the model's Jacobian matrix at $x$. Our findings are empirically validated across datasets (synthetic, controlled, and Imagenet) and across different Self Supervised Learning methods falling under the JEPA family (I-JEPA and DINOv2) and on multimodal models, such as MetaCLIP. We denote the method extracting the JEPA learned density as {\bf JEPA-SCORE}.
【7】N-Parties Private Structure and Parameter Learning for Sum-Product Networks
标题:和积网络的N方私有结构和参数学习
链接:https://arxiv.org/abs/2510.05946
作者:Xenia Heilmann, Ernst Althaus, Mattia Cerrato, Nick Johannes Peter Rassau, Mohammad Sadeq Dousti, Stefan Kramer
摘要:A sum-product network (SPN) is a graphical model that allows several types of probabilistic inference to be performed efficiently. In this paper, we propose a privacy-preserving protocol which tackles structure generation and parameter learning of SPNs. Additionally, we provide a protocol for private inference on SPNs, subsequent to training. To preserve the privacy of the participants, we derive our protocol based on secret sharing, which guarantees privacy in the honest-but-curious setting even when at most half of the parties cooperate to disclose the data. The protocol makes use of a forest of randomly generated SPNs, which is trained and weighted privately and can then be used for private inference on data points. Our experiments indicate that preserving the privacy of all participants does not decrease log-likelihood performance on both homogeneously and heterogeneously partitioned data. We furthermore show that our protocol's performance is comparable to current state-of-the-art SPN learners in homogeneously partitioned data settings. In terms of runtime and memory usage, we demonstrate that our implementation scales well when increasing the number of parties, comparing favorably to protocols for neural networks, when they are trained to reproduce the input-output behavior of SPNs.
【8】Carré du champ flow matching: better quality-generalisation tradeoff in generative models
标题:Carré du champ流匹配:生成模型中更好的质量-泛化权衡
链接:https://arxiv.org/abs/2510.05930
作者:Jacob Bamberger, Iolo Jones, Dennis Duncan, Michael M. Bronstein, Pierre Vandergheynst, Adam Gosztolai
摘要:Deep generative models often face a fundamental tradeoff: high sample quality can come at the cost of memorisation, where the model reproduces training data rather than generalising across the underlying data geometry. We introduce Carré du champ flow matching (CDC-FM), a generalisation of flow matching (FM), that improves the quality-generalisation tradeoff by regularising the probability path with a geometry-aware noise. Our method replaces the homogeneous, isotropic noise in FM with a spatially varying, anisotropic Gaussian noise whose covariance captures the local geometry of the latent data manifold. We prove that this geometric noise can be optimally estimated from the data and is scalable to large data. Further, we provide an extensive experimental evaluation on diverse datasets (synthetic manifolds, point clouds, single-cell genomics, animal motion capture, and images) as well as various neural network architectures (MLPs, CNNs, and transformers). We demonstrate that CDC-FM consistently offers a better quality-generalisation tradeoff. We observe significant improvements over standard FM in data-scarce regimes and in highly non-uniformly sampled datasets, which are often encountered in AI for science applications. Our work provides a mathematical framework for studying the interplay between data geometry, generalisation and memorisation in generative models, as well as a robust and scalable algorithm that can be readily integrated into existing flow matching pipelines.
【9】How to model Human Actions distribution with Event Sequence Data
标题:如何使用事件序列数据建模人类动作分布
链接:https://arxiv.org/abs/2510.05856
作者:Egor Surkov, Dmitry Osin, Evgeny Burnaev, Egor Shvetsov
备注:9 pages main text + 2 pages references + 6 pages appendix, 10 figures, 3 tables. Preprint version
摘要:This paper studies forecasting of the future distribution of events in human action sequences, a task essential in domains like retail, finance, healthcare, and recommendation systems where the precise temporal order is often less critical than the set of outcomes. We challenge the dominant autoregressive paradigm and investigate whether explicitly modeling the future distribution or order-invariant multi-token approaches outperform order-preserving methods. We analyze local order invariance and introduce a KL-based metric to quantify temporal drift. We find that a simple explicit distribution forecasting objective consistently surpasses complex implicit baselines. We further demonstrate that mode collapse of predicted categories is primarily driven by distributional imbalance. This work provides a principled framework for selecting modeling strategies and offers practical guidance for building more accurate and robust forecasting systems.
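A minimal sketch of the explicit distribution-forecasting objective favored by the study: instead of autoregressively generating the next events, a recurrent encoder predicts the category distribution of the future window and is trained against the empirical distribution of that window. The GRU backbone, window sizes, and category count are assumptions.
```python
import torch
import torch.nn as nn

N_CATS, HIDDEN = 20, 64                       # event categories and hidden size (assumed)

class DistributionForecaster(nn.Module):
    """Predicts the category distribution of the future window directly, instead of
    autoregressively generating an ordered sequence of next events."""
    def __init__(self):
        super().__init__()
        self.rnn = nn.GRU(N_CATS, HIDDEN, batch_first=True)
        self.head = nn.Linear(HIDDEN, N_CATS)

    def forward(self, history_onehot):        # (batch, T, N_CATS)
        _, h = self.rnn(history_onehot)
        return self.head(h[-1])               # unnormalized log-probabilities

def loss_fn(logits, future_events):
    # The empirical distribution of the future window is the regression target.
    target = nn.functional.one_hot(future_events, N_CATS).float().mean(dim=1)
    return -(target * logits.log_softmax(-1)).sum(-1).mean()

model = DistributionForecaster()
hist = nn.functional.one_hot(torch.randint(0, N_CATS, (8, 30)), N_CATS).float()
future = torch.randint(0, N_CATS, (8, 10))
loss = loss_fn(model(hist), future)
loss.backward()
print(loss.item())
```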
【10】DP-SNP-TIHMM: Differentially Private, Time-Inhomogeneous Hidden Markov Models for Synthesizing Genome-Wide Association Datasets
标题:DP-SNP-TIHMM:用于合成全基因组关联数据集的差分隐私、时间非齐次隐马尔可夫模型
链接:https://arxiv.org/abs/2510.05777
作者:Shadi Rahimian, Mario Fritz
摘要:Single nucleotide polymorphism (SNP) datasets are fundamental to genetic studies but pose significant privacy risks when shared. The correlation of SNPs with each other makes strong adversarial attacks such as masked-value reconstruction, kin, and membership inference attacks possible. Existing privacy-preserving approaches either apply differential privacy to statistical summaries of these datasets or offer complex methods that require post-processing and the usage of a publicly available dataset to suppress or selectively share SNPs. In this study, we introduce an innovative framework for generating synthetic SNP sequence datasets using samples derived from time-inhomogeneous hidden Markov models (TIHMMs). To preserve the privacy of the training data, we ensure that each SNP sequence contributes only a bounded influence during training, enabling strong differential privacy guarantees. Crucially, by operating on full SNP sequences and bounding their gradient contributions, our method directly addresses the privacy risks introduced by their inherent correlations. Through experiments conducted on the real-world 1000 Genomes dataset, we demonstrate the efficacy of our method using privacy budgets of $\varepsilon \in [1, 10]$ at $\delta=10^{-4}$. Notably, by allowing the transition models of the HMM to be dependent on the location in the sequence, we significantly enhance performance, enabling the synthetic datasets to closely replicate the statistical properties of non-private datasets. This framework facilitates the private sharing of genomic data while offering researchers exceptional flexibility and utility.
【11】Stable Robot Motions on Manifolds: Learning Lyapunov-Constrained Neural Manifold ODEs
标题:流形上的稳定机器人运动:学习Lyapunov约束的神经流形ODE
链接:https://arxiv.org/abs/2510.05707
作者:David Boetius, Abdelrahman Abdelnaby, Ashok Kumar, Stefan Leue, Abdalla Swikir, Fares J. Abu-Dakka
备注:12 pages, 6 figures
摘要:Learning stable dynamical systems from data is crucial for safe and reliable robot motion planning and control. However, extending stability guarantees to trajectories defined on Riemannian manifolds poses significant challenges due to the manifold's geometric constraints. To address this, we propose a general framework for learning stable dynamical systems on Riemannian manifolds using neural ordinary differential equations. Our method guarantees stability by projecting the neural vector field evolving on the manifold so that it strictly satisfies the Lyapunov stability criterion, ensuring stability at every system state. By leveraging a flexible neural parameterisation for both the base vector field and the Lyapunov function, our framework can accurately represent complex trajectories while respecting manifold constraints by evolving solutions directly on the manifold. We provide an efficient training strategy for applying our framework and demonstrate its utility by solving Riemannian LASA datasets on the unit quaternion (S^3) and symmetric positive-definite matrix manifolds, as well as robotic motions evolving on \mathbb{R}^3 \times S^3. We demonstrate the performance, scalability, and practical applicability of our approach through extensive simulations and by learning robot motions in a real-world experiment.
【12】Quantifying the Accuracy-Interpretability Trade-Off in Concept-Based Sidechannel Models
标题:量化基于概念的侧通道模型中的准确性与可解释性权衡
链接:https://arxiv.org/abs/2510.05670
作者:David Debot, Giuseppe Marra
摘要:Concept Bottleneck Models (CBNMs) are deep learning models that provide interpretability by enforcing a bottleneck layer where predictions are based exclusively on human-understandable concepts. However, this constraint also restricts information flow and often results in reduced predictive accuracy. Concept Sidechannel Models (CSMs) address this limitation by introducing a sidechannel that bypasses the bottleneck and carries additional task-relevant information. While this improves accuracy, it simultaneously compromises interpretability, as predictions may rely on uninterpretable representations transmitted through sidechannels. Currently, there exists no principled technique to control this fundamental trade-off. In this paper, we close this gap. First, we present a unified probabilistic concept sidechannel meta-model that subsumes existing CSMs as special cases. Building on this framework, we introduce the Sidechannel Independence Score (SIS), a metric that quantifies a CSM's reliance on its sidechannel by contrasting predictions made with and without sidechannel information. We propose SIS regularization, which explicitly penalizes sidechannel reliance to improve interpretability. Finally, we analyze how the expressivity of the predictor and the reliance of the sidechannel jointly shape interpretability, revealing inherent trade-offs across different CSM architectures. Empirical results show that state-of-the-art CSMs, when trained solely for accuracy, exhibit low representation interpretability, and that SIS regularization substantially improves their interpretability, intervenability, and the quality of learned interpretable task predictors. Our work provides both theoretical and practical tools for developing CSMs that balance accuracy and interpretability in a principled manner.
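Since the abstract defines SIS by contrasting predictions made with and without sidechannel information, one plausible toy instantiation is sketched below: the sidechannel input is ablated (zeroed) and the score is one minus the mean total-variation distance between the two predictive distributions. The ablation mechanism and similarity measure are assumptions; the paper's exact definition may differ.
```python
import torch

def sidechannel_independence_score(predictor, concepts, sidechannel):
    """Toy SIS: similarity of predictions with and without the sidechannel.
    A score near 1 means the model relies almost only on the interpretable concepts."""
    with torch.no_grad():
        full = predictor(concepts, sidechannel).softmax(-1)
        ablated = predictor(concepts, torch.zeros_like(sidechannel)).softmax(-1)
        tv = 0.5 * (full - ablated).abs().sum(-1).mean()   # mean total-variation distance
    return float(1.0 - tv)

# Toy predictor: a fixed linear head over concatenated concept and sidechannel features.
W = torch.randn(8, 3)
predictor = lambda c, s: torch.cat([c, s], -1) @ W
concepts, sidechannel = torch.rand(32, 5), torch.rand(32, 3)
print(sidechannel_independence_score(predictor, concepts, sidechannel))
```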
【13】InstaGeo: Compute-Efficient Geospatial Machine Learning from Data to Deployment
标题:InstaGeo:从数据到部署的计算高效地理空间机器学习
链接:https://arxiv.org/abs/2510.05617
作者:Ibrahim Salihu Yusuf, Iffanice Houndayi, Rym Oualha, Mohamed Aziz Cherif, Kobby Panford-Quainoo, Arnu Pretorius
摘要:Open-access multispectral imagery from missions like Landsat 8-9 and Sentinel-2 has fueled the development of geospatial foundation models (GFMs) for humanitarian and environmental applications. Yet, their deployment remains limited by (i) the absence of automated geospatial data pipelines and (ii) the large size of fine-tuned models. Existing GFMs lack workflows for processing raw satellite imagery, and downstream adaptations often retain the full complexity of the original encoder. We present InstaGeo, an open-source, end-to-end framework that addresses these challenges by integrating: (1) automated data curation to transform raw imagery into model-ready datasets; (2) task-specific model distillation to derive compact, compute-efficient models; and (3) seamless deployment as interactive web-map applications. Using InstaGeo, we reproduced datasets from three published studies and trained models with marginal mIoU differences of -0.73 pp for flood mapping, -0.20 pp for crop segmentation, and +1.79 pp for desert locust prediction. The distilled models are up to 8x smaller than standard fine-tuned counterparts, reducing FLOPs and CO2 emissions with minimal accuracy loss. Leveraging InstaGeo's streamlined data pipeline, we also curated a larger crop segmentation dataset, achieving a state-of-the-art mIoU of 60.65%, a 12 pp improvement over prior baselines. Moreover, InstaGeo enables users to progress from raw data to model deployment within a single working day. By unifying data preparation, model compression, and deployment, InstaGeo transforms research-grade GFMs into practical, low-carbon tools for real-time, large-scale Earth observation. This approach shifts geospatial AI toward data quality and application-driven innovation. Source code, datasets, and model checkpoints are available at: https://github.com/instadeepai/InstaGeo-E2E-Geospatial-ML.git
【14】Riddled basin geometry sets fundamental limits to predictability and reproducibility in deep learning
标题:筛状吸引盆几何为深度学习的可预测性和可重复性设定了根本极限
链接:https://arxiv.org/abs/2510.05606
作者:Andrew Ly, Pulin Gong
摘要:Fundamental limits to predictability are central to our understanding of many physical and computational systems. Here we show that, despite its remarkable capabilities, deep learning exhibits such fundamental limits rooted in the fractal, riddled geometry of its basins of attraction: any initialization that leads to one solution lies arbitrarily close to another that leads to a different one. We derive sufficient conditions for the emergence of riddled basins by analytically linking features widely observed in deep learning, including chaotic learning dynamics and symmetry-induced invariant subspaces, to reveal a general route to riddling in realistic deep networks. The resulting basins of attraction possess an infinitely fine-scale fractal structure characterized by an uncertainty exponent near zero, so that even large increases in the precision of initial conditions yield only marginal gains in outcome predictability. Riddling thus imposes a fundamental limit on the predictability and hence reproducibility of neural network training, providing a unified account of many empirical observations. These results reveal a general organizing principle of deep learning with important implications for optimization and the safe deployment of artificial intelligence.
【15】Correlating Cross-Iteration Noise for DP-SGD using Model Curvature
标题:利用模型曲率为DP-SGD关联跨迭代噪声
链接:https://arxiv.org/abs/2510.05416
作者:Xin Gu, Yingtai Xiao, Guanlin He, Jiamu Bai, Daniel Kifer, Kiwan Maeng
摘要:Differentially private stochastic gradient descent (DP-SGD) offers the promise of training deep learning models while mitigating many privacy risks. However, there is currently a large accuracy gap between DP-SGD and normal SGD training. This has resulted in different lines of research investigating orthogonal ways of improving privacy-preserving training. One such line of work, known as DP-MF, correlates the privacy noise across different iterations of stochastic gradient descent -- allowing later iterations to cancel out some of the noise added to earlier iterations. In this paper, we study how to improve this noise correlation. We propose a technique called NoiseCurve that uses model curvature, estimated from public unlabeled data, to improve the quality of this cross-iteration noise correlation. Our experiments on various datasets, models, and privacy parameters show that the noise correlations computed by NoiseCurve offer consistent and significant improvements in accuracy over the correlation scheme used by DP-MF.
【16】Scalable In-context Ranking with Generative Models
标题:使用生成模型的可扩展上下文内排名
链接:https://arxiv.org/abs/2510.05396
作者:Nilesh Gupta, Chong You, Srinadh Bhojanapalli, Sanjiv Kumar, Inderjit Dhillon, Felix Yu
摘要:In-context Ranking (ICR) is an emerging paradigm for Information Retrieval (IR), which leverages contextual understanding of LLMs by directly incorporating the task description, candidate documents, and the query into the model's input prompt and tasking the LLM to identify relevant document(s). While it is effective, efficiency is a significant challenge in this paradigm, especially as the candidate list grows due to quadratic/super-linear scaling of attention operation with context length. To this end, this paper first identifies inherent and exploitable structures in the attention of LLMs finetuned for ICR: (1) inter-document block sparsity: attention is dense within each document block but sparse across different documents in the context; and (2) query-document block relevance: the attention scores from certain query tokens to a document block in middle layers strongly correlate with that document's actual relevance. Motivated by these observations, we introduce BlockRank (Blockwise In-context Ranking), a novel method that adapts the attention operation in an LLM by (a) architecturally enforcing the observed inter-document block sparsity, reducing attention complexity from quadratic to linear without loss in performance, and (b) optimizing query-document block relevance for true relevant documents during fine-tuning using an auxiliary contrastive training objective, improving retrieval in attention. Experiments on BEIR, MSMarco and NQ with Mistral-7B demonstrate that FLARE Mistral matches or outperforms existing SOTA listwise rankers and controlled fine-tuned baseline while being significantly more efficient at inference (4.7x for 100 MSMarco documents in context) and scaling gracefully to long-context shortlists, around 500 documents in-context (approximately 100K context length) within a second, presenting a scalable and effective solution for ICR.
【17】Physics-Informed Neural Networks with Fourier Features and Attention-Driven Decoding
标题:具有傅里叶特征和注意力驱动解码的物理信息神经网络
链接:https://arxiv.org/abs/2510.05385
作者:Rohan Arni, Carlos Blanco
备注:16 pages, 6 figures. Accepted at NeurIPS 2025 AI4Science workshop
摘要:Physics-Informed Neural Networks (PINNs) are a useful framework for approximating partial differential equation solutions using deep learning methods. In this paper, we propose a principled redesign of the PINNsformer, a Transformer-based PINN architecture. We present the Spectral PINNSformer (S-Pformer), a refinement of encoder-decoder PINNSformers that addresses two key issues; 1. the redundancy (i.e. increased parameter count) of the encoder, and 2. the mitigation of spectral bias. We find that the encoder is unnecessary for capturing spatiotemporal correlations when relying solely on self-attention, thereby reducing parameter count. Further, we integrate Fourier feature embeddings to explicitly mitigate spectral bias, enabling adaptive encoding of multiscale behaviors in the frequency domain. Our model outperforms encoder-decoder PINNSformer architectures across all benchmarks, achieving or outperforming MLP performance while reducing parameter count significantly.
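A minimal sketch of the Fourier feature embedding used to mitigate spectral bias: coordinates are projected onto fixed random frequencies and mapped to sines and cosines before the network, and derivatives for PDE residuals still flow through the embedding. The self-attention decoder of the S-Pformer is omitted; frequency count and scale are assumptions.
```python
import math
import torch
import torch.nn as nn

class FourierFeatures(nn.Module):
    """Maps coordinates to [sin(2*pi*xB), cos(2*pi*xB)] with fixed random frequencies B."""
    def __init__(self, in_dim, n_freqs=64, scale=5.0):
        super().__init__()
        self.register_buffer("B", torch.randn(in_dim, n_freqs) * scale)

    def forward(self, coords):
        proj = 2 * math.pi * coords @ self.B
        return torch.cat([torch.sin(proj), torch.cos(proj)], dim=-1)

class FourierPINN(nn.Module):
    def __init__(self, in_dim=2, n_freqs=64, hidden=128):
        super().__init__()
        self.embed = FourierFeatures(in_dim, n_freqs)
        self.net = nn.Sequential(nn.Linear(2 * n_freqs, hidden), nn.Tanh(),
                                 nn.Linear(hidden, hidden), nn.Tanh(),
                                 nn.Linear(hidden, 1))

    def forward(self, coords):                  # coords = (x, t) collocation points
        return self.net(self.embed(coords))

model = FourierPINN()
xt = torch.rand(100, 2, requires_grad=True)
u = model(xt)
du = torch.autograd.grad(u.sum(), xt, create_graph=True)[0]   # derivatives for PDE residuals
print(u.shape, du.shape)
```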
【18】Mitigating Diffusion Model Hallucinations with Dynamic Guidance
标题:用动态引导缓解扩散模型幻觉
链接:https://arxiv.org/abs/2510.05356
作者:Kostas Triaridis, Alexandros Graikos, Aggelina Chatziagapi, Grigorios G. Chrysos, Dimitris Samaras
摘要:Diffusion models, despite their impressive demos, often produce hallucinatory samples with structural inconsistencies that lie outside of the support of the true data distribution. Such hallucinations can be attributed to excessive smoothing between modes of the data distribution. However, semantic interpolations are often desirable and can lead to generation diversity, thus we believe a more nuanced solution is required. In this work, we introduce Dynamic Guidance, which tackles this issue. Dynamic Guidance mitigates hallucinations by selectively sharpening the score function only along the pre-determined directions known to cause artifacts, while preserving valid semantic variations. To our knowledge, this is the first approach that addresses hallucinations at generation time rather than through post-hoc filtering. Dynamic Guidance substantially reduces hallucinations on both controlled and natural image datasets, significantly outperforming baselines.
【19】Tensor-on-tensor Regression Neural Networks for Process Modeling with High-dimensional Data
标题:用于高维数据过程建模的张量对张量回归神经网络
链接:https://arxiv.org/abs/2510.05329
作者:Qian Wang, Mohammad N. Bisheh, Kamran Paynabar
摘要:Modern sensing and metrology systems now stream terabytes of heterogeneous, high-dimensional (HD) data profiles, images, and dense point clouds, whose natural representation is multi-way tensors. Understanding such data requires regression models that preserve tensor geometry, yet remain expressive enough to capture the pronounced nonlinear interactions that dominate many industrial and mechanical processes. Existing tensor-based regressors meet the first requirement but remain essentially linear. Conversely, conventional neural networks offer nonlinearity only after flattening, thereby discarding spatial structure and incurring prohibitive parameter counts. This paper introduces a Tensor-on-Tensor Regression Neural Network (TRNN) that unifies these two paradigms.
【20】Computing frustration and near-monotonicity in deep neural networks
标题:计算深度神经网络中的阻挫度与近单调性
链接:https://arxiv.org/abs/2510.05286
作者:Joel Wendin, Erik G. Larsson, Claudio Altafini
摘要:For the signed graph associated to a deep neural network, one can compute the frustration level, i.e., test how close or distant the graph is to structural balance. For all the pretrained deep convolutional neural networks we consider, we find that the frustration is always less than expected from null models. From a statistical physics point of view, and in particular in reference to an Ising spin glass model, the reduced frustration indicates that the amount of disorder encoded in the network is less than in the null models. From a functional point of view, low frustration (i.e., proximity to structural balance) means that the function representing the network behaves near-monotonically, i.e., more similarly to a monotone function than in the null models. Evidence of near-monotonic behavior along the partial order determined by frustration is observed for all networks we consider. This confirms that the class of deep convolutional neural networks tends to have a more ordered behavior than expected from null models, and suggests a novel form of implicit regularization.
【21】Approximate Gaussianity Beyond Initialisation in Neural Networks
标题:神经网络中初始化之外的近似高斯性
链接:https://arxiv.org/abs/2510.05218
作者:Edward Hirst, Sanjaye Ramgoolam
备注:26+34 pages, 15 figures, 12 tables
摘要:Ensembles of neural network weight matrices are studied through the training process for the MNIST classification problem, testing the efficacy of matrix models for representing their distributions, under assumptions of Gaussianity and permutation-symmetry. The general 13-parameter permutation invariant Gaussian matrix models are found to be effective models for the correlated Gaussianity in the weight matrices, beyond the range of applicability of the simple Gaussian with independent identically distributed matrix variables, and notably well beyond the initialisation step. The representation theoretic model parameters, and the graph-theoretic characterisation of the permutation invariant matrix observables give an interpretable framework for the best-fit model and for small departures from Gaussianity. Additionally, the Wasserstein distance is calculated for this class of models and used to quantify the movement of the distributions over training. Throughout the work, the effects of varied initialisation regimes, regularisation, layer depth, and layer width are tested for this formalism, identifying limits where particular departures from Gaussianity are enhanced and how more general, yet still highly-interpretable, models can be developed.
【22】A Data-Driven Prism: Multi-View Source Separation with Diffusion Model Priors
标题:数据驱动棱镜:具有扩散模型先验的多视图源分离
链接:https://arxiv.org/abs/2510.05205
作者:Sebastian Wagner-Carena, Aizhan Akhmetzhanova, Sydney Erickson
备注:Accepted to main conference of NeurIPS 2025. Code available at this https URL
摘要:A common challenge in the natural sciences is to disentangle distinct, unknown sources from observations. Examples of this source separation task include deblending galaxies in a crowded field, distinguishing the activity of individual neurons from overlapping signals, and separating seismic events from an ambient background. Traditional analyses often rely on simplified source models that fail to accurately reproduce the data. Recent advances have shown that diffusion models can directly learn complex prior distributions from noisy, incomplete data. In this work, we show that diffusion models can solve the source separation problem without explicit assumptions about the source. Our method relies only on multiple views, or the property that different sets of observations contain different linear transformations of the unknown sources. We show that our method succeeds even when no source is individually observed and the observations are noisy, incomplete, and vary in resolution. The learned diffusion models enable us to sample from the source priors, evaluate the probability of candidate sources, and draw from the joint posterior of the source distribution given an observation. We demonstrate the effectiveness of our method on a range of synthetic problems as well as real-world galaxy observations.
【23】Discretized Quadratic Integrate-and-Fire Neuron Model for Deep Spiking Neural Networks
标题:用于深度尖峰神经网络的离散化二次积分-发放神经元模型
链接:https://arxiv.org/abs/2510.05168
作者:Eric Jahns, Davi Moreno, Milan Stojkov, Michel A. Kinsy
备注:18 pages, 2 figures
摘要:Spiking Neural Networks (SNNs) have emerged as energy-efficient alternatives to traditional artificial neural networks, leveraging asynchronous and biologically inspired neuron dynamics. Among existing neuron models, the Leaky Integrate-and-Fire (LIF) neuron has become widely adopted in deep SNNs due to its simplicity and computational efficiency. However, this efficiency comes at the expense of expressiveness, as LIF dynamics are constrained to linear decay at each timestep. In contrast, more complex models, such as the Quadratic Integrate-and-Fire (QIF) neuron, exhibit richer, nonlinear dynamics but have seen limited adoption due to their training instability. On that note, we propose the first discretization of the QIF neuron model tailored for high-performance deep spiking neural networks and provide an in-depth analysis of its dynamics. To ensure training stability, we derive an analytical formulation for surrogate gradient windows directly from our discretizations' parameter set, minimizing gradient mismatch. We evaluate our method on CIFAR-10, CIFAR-100, ImageNet, and CIFAR-10 DVS, demonstrating its ability to outperform state-of-the-art LIF-based methods. These results establish our discretization of the QIF neuron as a compelling alternative to LIF neurons for deep SNNs, combining richer dynamics with practical scalability.
【24】Lang-PINN: From Language to Physics-Informed Neural Networks via a Multi-Agent Framework
标题:Lang-PINN:通过多代理框架从语言到物理信息神经网络
链接:https://arxiv.org/abs/2510.05158
作者:Xin He, Liangliang You, Hongduan Tian, Bo Han, Ivor Tsang, Yew-Soon Ong
备注:PINN, PDE, Agent, LLM
摘要:Physics-informed neural networks (PINNs) provide a powerful approach for solving partial differential equations (PDEs), but constructing a usable PINN remains labor-intensive and error-prone. Scientists must interpret problems as PDE formulations, design architectures and loss functions, and implement stable training pipelines. Existing large language model (LLM) based approaches address isolated steps such as code generation or architecture suggestion, but typically assume a formal PDE is already specified and therefore lack an end-to-end perspective. We present Lang-PINN, an LLM-driven multi-agent system that builds trainable PINNs directly from natural language task descriptions. Lang-PINN coordinates four complementary agents: a PDE Agent that parses task descriptions into symbolic PDEs, a PINN Agent that selects architectures, a Code Agent that generates modular implementations, and a Feedback Agent that executes and diagnoses errors for iterative refinement. This design transforms informal task statements into executable and verifiable PINN code. Experiments show that Lang-PINN achieves substantially lower errors and greater robustness than competitive baselines: mean squared error (MSE) is reduced by up to 3--5 orders of magnitude, end-to-end execution success improves by more than 50\%, and time overhead is reduced by up to 74\%.
【25】Implicit Updates for Average-Reward Temporal Difference Learning
标题:平均回报时间差异学习的隐式更新
链接:https://arxiv.org/abs/2510.06149
作者:Hwanwoo Kim, Dongkyu Derek Cho, Eric Laber
摘要:Temporal difference (TD) learning is a cornerstone of reinforcement learning. In the average-reward setting, standard TD($\lambda$) is highly sensitive to the choice of step-size and thus requires careful tuning to maintain numerical stability. We introduce average-reward implicit TD($\lambda$), which employs an implicit fixed point update to provide data-adaptive stabilization while preserving the per iteration computational complexity of standard average-reward TD($\lambda$). In contrast to prior finite-time analyses of average-reward TD($\lambda$), which impose restrictive step-size conditions, we establish finite-time error bounds for the implicit variant under substantially weaker step-size requirements. Empirically, average-reward implicit TD($\lambda$) operates reliably over a much broader range of step-sizes and exhibits markedly improved numerical stability. This enables more efficient policy evaluation and policy learning, highlighting its effectiveness as a robust alternative to average-reward TD($\lambda$).
【26】On the Theory of Continual Learning with Gradient Descent for Neural Networks
标题:神经网络梯度下降连续学习理论
链接:https://arxiv.org/abs/2510.05573
作者:Hossein Taheri, Avishek Ghosh, Arya Mazumdar
摘要:Continual learning, the ability of a model to adapt to an ongoing sequence of tasks without forgetting the earlier ones, is a central goal of artificial intelligence. To shed light on its underlying mechanisms, we analyze the limitations of continual learning in a tractable yet representative setting. In particular, we study one-hidden-layer quadratic neural networks trained by gradient descent on an XOR cluster dataset with Gaussian noise, where different tasks correspond to different clusters with orthogonal means. Our results obtain bounds on the rate of forgetting during train and test-time in terms of the number of iterations, the sample size, the number of tasks, and the hidden-layer size. Our results reveal interesting phenomena on the role of different problem parameters in the rate of forgetting. Numerical experiments across diverse setups confirm our results, demonstrating their validity beyond the analyzed settings.
【27】Efficient learning of bosonic Gaussian unitaries
标题:玻色高斯幺正变换的高效学习
链接:https://arxiv.org/abs/2510.05531
作者:Marco Fanizza, Vishnu Iyer, Junseo Lee, Antonio A. Mele, Francesco A. Mele
摘要:Bosonic Gaussian unitaries are fundamental building blocks of central continuous-variable quantum technologies such as quantum-optic interferometry and bosonic error-correction schemes. In this work, we present the first time-efficient algorithm for learning bosonic Gaussian unitaries with a rigorous analysis. Our algorithm produces an estimate of the unknown unitary that is accurate to small worst-case error, measured by the physically motivated energy-constrained diamond distance. Its runtime and query complexity scale polynomially with the number of modes, the inverse target accuracy, and natural energy parameters quantifying the allowed input energy and the unitary's output-energy growth. The protocol uses only experimentally friendly photonic resources: coherent and squeezed probes, passive linear optics, and heterodyne/homodyne detection. We then employ an efficient classical post-processing routine that leverages a symplectic regularization step to project matrix estimates onto the symplectic group. In the limit of unbounded input energy, our procedure attains arbitrarily high precision using only $2m+2$ queries, where $m$ is the number of modes. To our knowledge, this is the first provably efficient learning algorithm for a multiparameter family of continuous-variable unitaries.
【28】A Probabilistic Basis for Low-Rank Matrix Learning
标题:低秩矩阵学习的概率基础
链接:https://arxiv.org/abs/2510.05447
作者:Simon Segert, Nathan Wycoff
摘要:Low rank inference on matrices is widely conducted by optimizing a cost function augmented with a penalty proportional to the nuclear norm $\Vert \cdot \Vert_*$. However, despite the assortment of computational methods for such problems, there is a surprising lack of understanding of the underlying probability distributions being referred to. In this article, we study the distribution with density $f(X)\propto e^{-\lambda\Vert X\Vert_*}$, finding many of its fundamental attributes to be analytically tractable via differential geometry. We use these facts to design an improved MCMC algorithm for low rank Bayesian inference as well as to learn the penalty parameter $\lambda$, obviating the need for hyperparameter tuning when this is difficult or impossible. Finally, we deploy these to improve the accuracy and efficiency of low rank Bayesian matrix denoising and completion algorithms in numerical experiments.
【29】Refereed Learning
标题:裁判式学习
链接:https://arxiv.org/abs/2510.05440
作者:Ran Canetti, Ephraim Linder, Connor Wagaman
摘要:We initiate an investigation of learning tasks in a setting where the learner is given access to two competing provers, only one of which is honest. Specifically, we consider the power of such learners in assessing purported properties of opaque models. Following prior work that considers the power of competing provers in different settings, we call this setting refereed learning. After formulating a general definition of refereed learning tasks, we show refereed learning protocols that obtain a level of accuracy that far exceeds what is obtainable at comparable cost without provers, or even with a single prover. We concentrate on the task of choosing the better one out of two black-box models, with respect to some ground truth. While we consider a range of parameters, perhaps our most notable result is in the high-precision range: For all $\varepsilon>0$ and ambient dimension $d$, our learner makes only one query to the ground truth function, communicates only $(1+\frac{1}{\varepsilon^2})\cdot\text{poly}(d)$ bits with the provers, and outputs a model whose loss is within a multiplicative factor of $(1+\varepsilon)$ of the best model's loss. Obtaining comparable loss with a single prover would require the learner to access the ground truth at almost all of the points in the domain. To obtain this bound, we develop a technique that allows the learner to sample, using the provers, from a distribution that is not efficiently samplable to begin with. We find this technique to be of independent interest. We also present lower bounds that demonstrate the optimality of our protocols in a number of respects, including prover complexity, number of samples, and need for query access.
其他(32篇)
【1】Training Dynamics Impact Post-Training Quantization Robustness
标题:训练动态影响训练后量化稳健性
链接:https://arxiv.org/abs/2510.06213
作者:Albert Catalan-Tatjer, Niccolò Ajroldi, Jonas Geiping
摘要:While post-training quantization is widely adopted for efficient deployment of large language models, the mechanisms underlying quantization robustness remain unclear. We conduct a comprehensive analysis of quantization degradation across open-source language model training trajectories up to 32B parameters and 15T training tokens to accurately assess the relationship between training dynamics and quantization performance. Our key finding is that quantization errors in large-scale training runs are driven by a complex interplay between learning rate and other training hyperparameters. Specifically, once learning rates decay, validation loss and quantization error diverge, largely independent of training data scale. To investigate interventions on the training dynamics and identify specific configurations that can modulate quantization robustness favorably, we train our own models in controlled experiments up to 100B tokens. Our results challenge the assumption that increasing dataset scale inherently compromises quantization effectiveness, demonstrating instead that strategic training hyperparameter interventions can improve quantization quality at scale.
【2】Modulation Discovery with Differentiable Digital Signal Processing
标题:利用可微分数字信号处理实现调制发现
链接:https://arxiv.org/abs/2510.06204
作者:Christopher Mitcheltree, Hao Hao Tan, Joshua D. Reiss
备注:Accepted to WASPAA 2025 (best paper award candidate). Code, audio samples, and plugins can be found at this https URL
摘要:Modulations are a critical part of sound design and music production, enabling the creation of complex and evolving audio. Modern synthesizers provide envelopes, low frequency oscillators (LFOs), and more parameter automation tools that allow users to modulate the output with ease. However, determining the modulation signals used to create a sound is difficult, and existing sound-matching / parameter estimation systems are often uninterpretable black boxes or predict high-dimensional framewise parameter values without considering the shape, structure, and routing of the underlying modulation curves. We propose a neural sound-matching approach that leverages modulation extraction, constrained control signal parameterizations, and differentiable digital signal processing (DDSP) to discover the modulations present in a sound. We demonstrate the effectiveness of our approach on highly modulated synthetic and real audio samples, its applicability to different DDSP synth architectures, and investigate the trade-off it incurs between interpretability and sound-matching accuracy. We make our code and audio samples available and provide the trained DDSP synths in a VST plugin.
【3】Reference Grounded Skill Discovery
标题:参考扎根技能发现
链接:https://arxiv.org/abs/2510.06203
作者:Seungeun Rho, Aaron Trinh, Danfei Xu, Sehoon Ha
摘要:Scaling unsupervised skill discovery algorithms to high-DoF agents remains challenging. As dimensionality increases, the exploration space grows exponentially, while the manifold of meaningful skills remains limited. Therefore, semantic meaningfulness becomes essential to effectively guide exploration in high-dimensional spaces. In this work, we present Reference-Grounded Skill Discovery (RGSD), a novel algorithm that grounds skill discovery in a semantically meaningful latent space using reference data. RGSD first performs contrastive pretraining to embed motions on a unit hypersphere, clustering each reference trajectory into a distinct direction. This grounding enables skill discovery to simultaneously involve both imitation of reference behaviors and the discovery of semantically related diverse behaviors. On a simulated SMPL humanoid with 359-D observations and 69-D actions, RGSD learns structured skills including walking, running, punching, and side stepping, and also discovers related novel behaviors. In downstream control tasks, RGSD outperforms imitation-based skill acquisition baselines. Our results suggest that lightweight reference-guided grounding offers a practical path to discovering semantically rich and structured skills in high-DoF systems.
【4】On Powerful Ways to Generate: Autoregression, Diffusion, and Beyond
标题:强大的生成方法:自回归、扩散等
链接:https://arxiv.org/abs/2510.06190
作者:Chenxiao Yang, Cai Zhou, David Wipf, Zhiyuan Li
摘要:This paper formally studies generation processes, including auto-regressive next-token prediction and masked diffusion, that abstract beyond architectural specifics. At this level of abstraction, we quantify their benefits and limitations through measurable criteria such as computational hardness and learnability. In particular, we demonstrate that allowing generation to proceed beyond autoregression and current masked diffusion, with capabilities to rewrite and length-variable edit, can bring significant theoretical and empirical advantages, with important implications for frontier LLMs that aspire to tackle increasingly hard problems and work universally across domains beyond natural language, such as coding and science.
【5】BanglaTalk: Towards Real-Time Speech Assistance for Bengali Regional Dialects
标题:BanglaTalk:为孟加拉地区方言提供实时语音协助
链接:https://arxiv.org/abs/2510.06188
作者:Jakir Hasan, Shubhashis Roy Dipta
摘要:Real-time speech assistants are becoming increasingly popular for ensuring improved accessibility to information. Bengali, being a low-resource language with a high regional dialectal diversity, has seen limited progress in developing such systems. Existing systems are not optimized for real-time use and focus only on standard Bengali. In this work, we present BanglaTalk, the first real-time speech assistance system for Bengali regional dialects. BanglaTalk follows the client-server architecture and uses the Real-time Transport Protocol (RTP) to ensure low-latency communication. To address dialectal variation, we introduce a dialect-aware ASR system, BRDialect, developed by fine-tuning the IndicWav2Vec model in ten Bengali regional dialects. It outperforms the baseline ASR models by 12.41-33.98% on the RegSpeech12 dataset. Furthermore, BanglaTalk can operate at a low bandwidth of 24 kbps while maintaining an average end-to-end delay of 4.9 seconds. Low bandwidth usage and minimal end-to-end delay make the system both cost-effective and interactive for real-time use cases, enabling inclusive and accessible speech technology for the diverse community of Bengali speakers.
【6】TabPFN-Wide: Continued Pre-Training for Extreme Feature Counts
标题:TabPFN-Wide:针对极端特征数量的持续预训练
链接:https://arxiv.org/abs/2510.06162
作者:Christopher Kolberg, Katharina Eggensperger, Nico Pfeifer
摘要:Revealing novel insights from the relationship between molecular measurements and pathology remains a very impactful application of machine learning in biomedicine. Data in this domain typically contain only a few observations but thousands of potentially noisy features, posing challenges for conventional machine learning approaches. While prior-data fitted networks emerge as foundation models for tabular data, they are currently not suited to handle large feature counts (>500). Although feature reduction enables their application, it hinders feature importance analysis. We propose a strategy that extends existing models through continued pre-training on synthetic data sampled from a customized prior. The resulting model, TabPFN-Wide, matches or exceeds its base model's performance while exhibiting improved robustness to noise. It seamlessly scales beyond 50,000 features, regardless of noise levels, while maintaining inherent interpretability, which is critical for biomedical applications. Our results show that prior-informed adaptation is suitable to enhance the capability of foundation models for high-dimensional data. On real-world biomedical datasets many of the most relevant features identified by the model overlap with previous biological findings, while others propose potential starting points for future studies.
【7】Edit-Based Flow Matching for Temporal Point Processes
标题:面向时间点过程的基于编辑的流匹配
链接:https://arxiv.org/abs/2510.06050
作者:David Lüdke, Marten Lienen, Marcel Kollovieh, Stephan Günnemann
摘要:Temporal point processes (TPPs) are a fundamental tool for modeling event sequences in continuous time, but most existing approaches rely on autoregressive parameterizations that are limited by their sequential sampling. Recent non-autoregressive, diffusion-style models mitigate these issues by jointly interpolating between noise and data through event insertions and deletions in a discrete Markov chain. In this work, we generalize this perspective and introduce an Edit Flow process for TPPs that transports noise to data via insert, delete, and substitute edit operations. By learning the instantaneous edit rates within a continuous-time Markov chain framework, we attain a flexible and efficient model that effectively reduces the total number of necessary edit operations during generation. Empirical results demonstrate the generative flexibility of our unconditionally trained model in a wide range of unconditional and conditional generation tasks on benchmark TPPs.
【8】Fast Leave-One-Out Approximation from Fragment-Target Prevalence Vectors (molFTP) : From Dummy Masking to Key-LOO for Leakage-Free Feature Construction
标题:基于片段-目标流行率向量(molFTP)的快速留一法近似:从虚拟掩蔽到关键LOO,以实现无泄漏特征构建
链接:https://arxiv.org/abs/2510.06029
作者:Guillaume Godin
备注:28 pages, 21 figures, 3 tables
摘要:We introduce molFTP (molecular fragment-target prevalence), a compact representation that delivers strong predictive performance. To prevent feature leakage across cross-validation folds, we implement a dummy-masking procedure that removes information about fragments present in the held-out molecules. We further show that key leave-one-out (key-loo) closely approximates true molecule-level leave-one-out (LOO), with deviation below 8% on our datasets. This enables near full data training while preserving unbiased cross-validation estimates of model performance. Overall, molFTP provides a fast, leakage-resistant fragment-target prevalence vectorization with practical safeguards (dummy masking or key-LOO) that approximate LOO at a fraction of its cost.
【9】Generalization of Gibbs and Langevin Monte Carlo Algorithms in the Interpolation Regime
标题:Gibbs和Langevin Monte Carlo算法在插值区域中的泛化
链接:https://arxiv.org/abs/2510.06028
作者:Andreas Maurer, Erfan Mirzaei, Massimiliano Pontil
摘要:The paper provides data-dependent bounds on the test error of the Gibbs algorithm in the overparameterized interpolation regime, where low training errors are also obtained for impossible data, such as random labels in classification. The bounds are stable under approximation with Langevin Monte Carlo algorithms. Experiments on the MNIST and CIFAR-10 datasets verify that the bounds yield nontrivial predictions on true labeled data and correctly upper bound the test error for random labels. Our method indicates that generalization in the low-temperature, interpolation regime is already signaled by small training errors in the more classical high temperature regime.
【10】RamPINN: Recovering Raman Spectra From Coherent Anti-Stokes Spectra Using Embedded Physics
标题:RamPINN:使用嵌入物理从相干反斯托克斯光谱恢复拉曼光谱
链接:https://arxiv.org/abs/2510.06020
作者:Sai Karthikeya Vemuri, Adithya Ashok Chalain Valapil, Tim Büchner, Joachim Denzler
摘要:Transferring the recent advancements in deep learning into scientific disciplines is hindered by the lack of the required large-scale datasets for training. We argue that in these knowledge-rich domains, the established body of scientific theory provides reliable inductive biases in the form of governing physical laws. We address the ill-posed inverse problem of recovering Raman spectra from noisy Coherent Anti-Stokes Raman Scattering (CARS) measurements, as the true Raman signal here is suppressed by a dominating non-resonant background. We propose RamPINN, a model that learns to recover Raman spectra from given CARS spectra. Our core methodological contribution is a physics-informed neural network that utilizes a dual-decoder architecture to disentangle resonant and non-resonant signals. This is done by enforcing the Kramers-Kronig causality relations via a differentiable Hilbert transform loss on the resonant and a smoothness prior on the non-resonant part of the signal. Trained entirely on synthetic data, RamPINN demonstrates strong zero-shot generalization to real-world experimental data, explicitly closing this gap and significantly outperforming existing baselines. Furthermore, we show that training with these physics-based losses alone, without access to any ground-truth Raman spectra, still yields competitive results. This work highlights a broader concept: formal scientific rules can act as a potent inductive bias, enabling robust, self-supervised learning in data-limited scientific domains.
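The Kramers-Kronig constraint mentioned above ties the real and imaginary parts of a causal response through a Hilbert transform. Below is an editor's illustration of a differentiable Hilbert transform via the FFT in PyTorch; pairing the loss with the resonant decoder output is an assumption, since the abstract describes the coupling only at a high level.
```python
import torch

def hilbert_imag(x):
    """Differentiable discrete Hilbert transform along the last axis.

    Returns the imaginary part of the analytic signal, using the usual FFT
    construction (zero negative frequencies, double positive ones).
    """
    n = x.shape[-1]
    Xf = torch.fft.fft(x, dim=-1)
    h = torch.zeros(n, dtype=Xf.dtype, device=x.device)
    h[0] = 1.0
    if n % 2 == 0:
        h[n // 2] = 1.0
        h[1:n // 2] = 2.0
    else:
        h[1:(n + 1) // 2] = 2.0
    analytic = torch.fft.ifft(Xf * h, dim=-1)
    return analytic.imag

def kramers_kronig_loss(re_part, im_part):
    # Penalize violation of the causality relation Im = H(Re)
    return torch.mean((im_part - hilbert_imag(re_part)) ** 2)
```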
【11】Information-Theoretic Policy Pre-Training with Empowerment
标题:基于赋权的信息论策略预训练
链接:https://arxiv.org/abs/2510.05996
作者:Moritz Schneider, Robert Krug, Narunas Vaskevicius, Luigi Palmieri, Michael Volpp, Joschka Boedecker
摘要:Empowerment, an information-theoretic measure of an agent's potential influence on its environment, has emerged as a powerful intrinsic motivation and exploration framework for reinforcement learning (RL). Besides for unsupervised RL and skill learning algorithms, the specific use of empowerment as a pre-training signal has received limited attention in the literature. We show that empowerment can be used as a pre-training signal for data-efficient downstream task adaptation. For this we extend the traditional notion of empowerment by introducing discounted empowerment, which balances the agent's control over the environment across short- and long-term horizons. Leveraging this formulation, we propose a novel pre-training paradigm that initializes policies to maximize discounted empowerment, enabling agents to acquire a robust understanding of environmental dynamics. We analyze empowerment-based pre-training for various existing RL algorithms and empirically demonstrate its potential as a general-purpose initialization strategy: empowerment-maximizing policies with long horizons are data-efficient and effective, leading to improved adaptability in downstream tasks. Our findings pave the way for future research to scale this framework to high-dimensional and complex tasks, further advancing the field of RL.
【12】Paying Attention to Hybrid Attention: Untangling the Issues with Conversion Methods
标题:关注混合注意力:解决转换方法中的问题
链接:https://arxiv.org/abs/2510.05901
作者:Martin Benfeghoul, Teresa Delgado, Adnan Oomerjee, Haitham Bou Ammar, Jun Wang, Zafeirios Fountas
摘要:Transformers' quadratic computational complexity limits their scalability despite remarkable performance. While linear attention reduces this to linear complexity, pre-training such models from scratch remains, in most cases, prohibitively expensive. Recent post-training linearisation methods convert pre-trained Transformers to linear models efficiently, often using hybrid approaches that combine linear attention with sliding-window softmax attention (SWA). We identify a critical flaw: existing hybrid methods inadvertently bypass the linear component, relying almost entirely on SWA. Component-level diagnostics reveal this previously undetected behaviour stems from overlooked evaluation practices on common-sense benchmarks. We propose three solutions to ensure balanced component usage: (i) inference-time hybridisation of linear-only conversions with sliding-window softmax; (ii) HedgeCATs, combining attention-weight transfer with targeted LoRA fine-tuning; and (iii) Scheduled Sliding-window Dropout (SSD), which stochastically suppresses the softmax branch during training to prevent component collapse. Our methods maintain computational efficiency while recovering most base model performance and ensuring genuine linear attention adoption, restoring the validity of performance attributions in hybrid conversions.
【13】Möbius transforms and Shapley values for vector-valued functions on weighted directed acyclic multigraphs
标题:加权有向无环多重图上向量值函数的莫比乌斯变换和Shapley值
链接:https://arxiv.org/abs/2510.05786
作者:Patrick Forré, Abel Jansma
备注:43 pages, 2 figures
摘要:We generalize the concept of Möbius inversion and Shapley values to directed acyclic multigraphs and weighted versions thereof. We further allow value functions (games) and thus their Möbius transforms (synergy function) and Shapley values to have values in any abelian group that is a module over a ring that contains the graph weights, e.g. vector-valued functions. To achieve this and overcome the obstruction that the classical axioms (linearity, efficiency, null player, symmetry) are not strong enough to uniquely determine Shapley values in this more general setting, we analyze Shapley values from two novel points of view: 1) We introduce projection operators that allow us to interpret Shapley values as the recursive projection and re-attribution of higher-order synergies to lower-order ones; 2) we propose a strengthening of the null player axiom and a localized symmetry axiom, namely the weak elements and flat hierarchy axioms. The former allows us to remove coalitions with vanishing synergy while preserving the rest of the hierarchical structure. The latter treats player-coalition bonds uniformly in the corner case of hierarchically flat graphs. Together with linearity these axioms already imply a unique explicit formula for the Shapley values, as well as classical properties like efficiency, null player, symmetry, and novel ones like the projection property. This whole framework then specializes to finite inclusion algebras, lattices, partial orders and mereologies, and also recovers certain previously known cases as corner cases, and presents others from a new perspective. The admission of general weighted directed acyclic multigraph structured hierarchies and vector-valued functions and Shapley values opens up the possibility for new analytic tools and application areas, like machine learning, language processing, explainable artificial intelligence, and many more.
【14】Transcribing Rhythmic Patterns of the Guitar Track in Polyphonic Music
标题:复调音乐中吉他音轨节奏模式的转录
链接:https://arxiv.org/abs/2510.05756
作者:Aleksandr Lukoianov, Anssi Klapuri
备注:Accepted to WASPAA 2025
摘要:Whereas chord transcription has received considerable attention during the past couple of decades, far less work has been devoted to transcribing and encoding the rhythmic patterns that occur in a song. The topic is especially relevant for instruments such as the rhythm guitar, which is typically played by strumming rhythmic patterns that repeat and vary over time. However, in many cases one cannot objectively define a single "right" rhythmic pattern for a given song section. To create a dataset with well-defined ground-truth labels, we asked expert musicians to transcribe the rhythmic patterns in 410 popular songs and record cover versions where the guitar tracks followed those transcriptions. To transcribe the strums and their corresponding rhythmic patterns, we propose a three-step framework. Firstly, we perform approximate stem separation to extract the guitar part from the polyphonic mixture. Secondly, we detect individual strums within the separated guitar audio, using a pre-trained foundation model (MERT) as a backbone. Finally, we carry out a pattern-decoding process in which the transcribed sequence of guitar strums is represented by patterns drawn from an expert-curated vocabulary. We show that it is possible to transcribe the rhythmic patterns of the guitar track in polyphonic music with quite high accuracy, producing a representation that is human-readable and includes automatically detected bar lines and time signature markers. We perform ablation studies and error analysis and propose a set of evaluation metrics to assess the accuracy and readability of the predicted rhythmic pattern sequence.
【15】Improving Discrete Diffusion Unmasking Policies Beyond Explicit Reference Policies
标题:超越显式参考策略,改进离散扩散去掩码策略
链接:https://arxiv.org/abs/2510.05725
作者:Chunsan Hong, Seonho An, Min-Soo Kim, Jong Chul Ye
备注:Preprint
摘要:Masked diffusion models (MDMs) have recently emerged as a novel framework for language modeling. MDMs generate sentences by iteratively denoising masked sequences, filling in [MASK] tokens step by step. Although MDMs support any-order sampling, performance is highly sensitive to the choice of which position to unmask next. Prior work typically relies on rule-based schedules (e.g., max-confidence, max-margin), which provide ad hoc improvements. In contrast, we replace these heuristics with a learned scheduler. Specifically, we cast denoising as a KL-regularized Markov decision process (MDP) with an explicit reference policy and optimize a regularized objective that admits policy improvement and convergence guarantees under standard assumptions. We prove that the optimized policy under this framework generates samples that more closely match the data distribution than heuristic schedules. Empirically, across four benchmarks, our learned policy consistently outperforms max-confidence: for example, on SUDOKU, where unmasking order is critical, it yields a 20.1% gain over random and an 11.2% gain over max-confidence.
【16】vAttention: Verified Sparse Attention
标题:vAttention:已验证稀疏注意力
链接:https://arxiv.org/abs/2510.05688
作者:Aditya Desai, Kumar Krishna Agrawal, Shuo Yang, Alejandro Cuadron, Luis Gaspar Schroeder, Matei Zaharia, Joseph E. Gonzalez, Ion Stoica
摘要:State-of-the-art sparse attention methods for reducing decoding latency fall into two main categories: approximate top-$k$ (and its extension, top-$p$) and recently introduced sampling-based estimation. However, these approaches are fundamentally limited in their ability to approximate full attention: they fail to provide consistent approximations across heads and query vectors and, most critically, lack guarantees on approximation quality, limiting their practical deployment. We observe that top-$k$ and random sampling are complementary: top-$k$ performs well when attention scores are dominated by a few tokens, whereas random sampling provides better estimates when attention scores are relatively uniform. Building on this insight and leveraging the statistical guarantees of sampling, we introduce vAttention, the first practical sparse attention mechanism with user-specified $(\epsilon, \delta)$ guarantees on approximation accuracy (thus, verified). These guarantees make vAttention a compelling step toward practical, reliable deployment of sparse attention at scale. By unifying top-k and sampling, vAttention outperforms both individually, delivering a superior quality-efficiency trade-off. Our experiments show that vAttention significantly improves the quality of sparse attention (e.g., $\sim$4.5 percentage points for Llama-3.1-8B-Inst and Deepseek-R1-Distill-Llama-8B on RULER-HARD), and effectively bridges the gap between full and sparse attention (e.g., across datasets, it matches full model quality with up to 20x sparsity). We also demonstrate that it can be deployed in reasoning scenarios to achieve fast decoding without compromising model quality (e.g., vAttention achieves full model quality on AIME2024 at 10x sparsity with up to 32K token generations). Code is open-sourced at https://github.com/xAlg-ai/sparse-attention-hub.
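A toy sketch of the core idea of unifying top-$k$ with random sampling (an editor's illustration, not the paper's verified $(\epsilon, \delta)$ procedure): keep the exact softmax terms for the top-$k$ keys and estimate the remaining mass from a uniform sample. For clarity the sketch scores every key, which a real sparse kernel would avoid.
```python
import torch

def topk_plus_sampling_attention(q, K, V, k=32, m=64):
    """Mix exact top-k softmax terms with a sampled estimate of the tail.

    q: (d,), K: (N, d), V: (N, dv). Illustration only.
    """
    scores = K @ q / K.shape[-1] ** 0.5
    shift = scores.max()                        # for numerical stability
    top_val, top_idx = torch.topk(scores, k)

    rest = torch.ones_like(scores, dtype=torch.bool)
    rest[top_idx] = False
    rest_idx = rest.nonzero(as_tuple=True)[0]
    sample = rest_idx[torch.randint(len(rest_idx), (m,))]
    scale = len(rest_idx) / m                   # rescale sample mean to the full tail sum

    w_top = torch.exp(top_val - shift)
    w_smp = torch.exp(scores[sample] - shift) * scale
    numer = w_top @ V[top_idx] + w_smp @ V[sample]
    denom = w_top.sum() + w_smp.sum()
    return numer / denom

# The estimate tracks full softmax attention
torch.manual_seed(0)
K, V = torch.randn(4096, 64), torch.randn(4096, 64)
q = torch.randn(64)
full = torch.softmax(K @ q / 8.0, dim=0) @ V
print(torch.norm(topk_plus_sampling_attention(q, K, V) - full) / torch.norm(full))
```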
【17】Monte Carlo-Type Neural Operator for Differential Equations
标题:微分方程的Monte Carlo型神经运算器
链接:https://arxiv.org/abs/2510.05620
作者:Salah Eddine Choutri, Prajwal Chauhan, Othmane Mazhar, Saif Eddin Jabari
摘要:The Monte Carlo-type Neural Operator (MCNO) introduces a framework for learning solution operators of one-dimensional partial differential equations (PDEs) by directly learning the kernel function and approximating the associated integral operator using a Monte Carlo-type approach. Unlike Fourier Neural Operators (FNOs), which rely on spectral representations and assume translation-invariant kernels, MCNO makes no such assumptions. The kernel is represented as a learnable tensor over sampled input-output pairs, and sampling is performed once, uniformly at random from a discretized grid. This design enables generalization across multiple grid resolutions without relying on fixed global basis functions or repeated sampling during training, while an interpolation step maps between arbitrary input and output grids to further enhance flexibility. Experiments on standard 1D PDE benchmarks show that MCNO achieves competitive accuracy with efficient computational cost. We also provide a theoretical analysis proving that the Monte Carlo estimator yields a bounded bias and variance under mild regularity assumptions. This result holds in any spatial dimension, suggesting that MCNO may extend naturally beyond one-dimensional problems. More broadly, this work explores how Monte Carlo-type integration can be incorporated into neural operator frameworks for continuous-domain PDEs, providing a theoretically supported alternative to spectral methods (such as FNO) and to graph-based Monte Carlo approaches (such as the Graph Kernel Neural Operator, GNO).
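A minimal sketch of the Monte Carlo-type operator layer described above (editor's illustration; the class name and tensor shapes are assumptions): the integral operator is approximated as an average of learnable kernel values over a set of input locations sampled once, uniformly at random, from the grid.
```python
import torch
import torch.nn as nn

class MonteCarloKernelLayer(nn.Module):
    """(K u)(x_i) ~ (1/M) * sum_j kappa(x_i, y_j) u(y_j), with kappa stored as a
    learnable tensor over M sample points drawn once from the grid."""
    def __init__(self, n_grid, n_samples, channels):
        super().__init__()
        # sample input locations once, uniformly at random, from the grid
        self.register_buffer("sample_idx", torch.randperm(n_grid)[:n_samples])
        self.kernel = nn.Parameter(torch.randn(n_grid, n_samples, channels) * 0.02)

    def forward(self, u):
        # u: (batch, n_grid, channels)
        u_s = u[:, self.sample_idx, :]                        # (B, M, C)
        out = torch.einsum("nmc,bmc->bnc", self.kernel, u_s)  # Monte Carlo sum
        return out / self.sample_idx.numel()

layer = MonteCarloKernelLayer(n_grid=64, n_samples=16, channels=8)
print(layer(torch.randn(4, 64, 8)).shape)   # torch.Size([4, 64, 8])
```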
【18】Channel Simulation and Distributed Compression with Ensemble Rejection Sampling
标题:基于集成拒绝采样的信道模拟与分布式压缩
链接:https://arxiv.org/abs/2510.05552
作者:Buu Phan, Ashish Khisti
摘要:We study channel simulation and distributed matching, two fundamental problems with several applications to machine learning, using a recently introduced generalization of the standard rejection sampling (RS) algorithm known as Ensemble Rejection Sampling (ERS). For channel simulation, we propose a new coding scheme based on ERS that achieves a near-optimal coding rate. In this process, we demonstrate that standard RS can also achieve a near-optimal coding rate and generalize the result of Braverman and Garg (2014) to the continuous alphabet setting. Next, as our main contribution, we present a distributed matching lemma for ERS, which serves as the rejection sampling counterpart to the Poisson Matching Lemma (PML) introduced by Li and Anantharam (2021). Our result also generalizes a recent work on importance matching lemma (Phan et al, 2024) and, to our knowledge, is the first result on distributed matching in the family of rejection sampling schemes where the matching probability is close to PML. We demonstrate the practical significance of our approach over prior works by applying it to distributed compression. The effectiveness of our proposed scheme is validated through experiments involving synthetic Gaussian sources and distributed image compression using the MNIST dataset.
【19】Activation-Informed Pareto-Guided Low-Rank Compression for Efficient LLM/VLM
标题:激活感知的帕累托引导低秩压缩,实现高效的LLM/VLM
链接:https://arxiv.org/abs/2510.05544
作者:Ryan Solgi, Parsa Madinei, Jiayi Tian, Rupak Swaminathan, Jing Liu, Nathan Susanj, Zheng Zhang
摘要:Large language models (LLM) and vision-language models (VLM) have achieved state-of-the-art performance, but they impose significant memory and computing challenges in deployment. We present a novel low-rank compression framework to address this challenge. First, we upper bound the change of network loss via layer-wise activation-based compression errors, filling a theoretical gap in the literature. We then formulate low-rank model compression as a bi-objective optimization and prove that a single uniform tolerance yields surrogate Pareto-optimal heterogeneous ranks. Based on our theoretical insights, we propose Pareto-Guided Singular Value Decomposition (PGSVD), a zero-shot pipeline that improves activation-aware compression via Pareto-guided rank selection and alternating least-squares implementation. We apply PGSVD to both LLM and VLM, showing better accuracy at the same compression levels and inference speedup.
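A toy sketch of the "single uniform tolerance yields heterogeneous ranks" idea (editor's illustration; the activation-aware weighting and the Pareto argument of PGSVD are omitted): applying the same relative-error tolerance to every layer's SVD naturally produces a different rank per layer.
```python
import torch

def rank_for_tolerance(W, tol):
    """Smallest rank whose truncated SVD keeps the relative Frobenius
    residual below a single uniform tolerance `tol`."""
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    energy = torch.cumsum(S ** 2, dim=0)
    total = energy[-1]
    # residual energy after keeping r components is total - energy[r-1]
    ok = (total - energy) <= (tol ** 2) * total
    r = int(torch.nonzero(ok)[0]) + 1
    return r, (U[:, :r] * S[:r]) @ Vh[:r]

# The same tolerance produces heterogeneous ranks across (toy) layers:
layers = {"q_proj": torch.randn(256, 256), "mlp_up": torch.randn(1024, 256)}
for name, W in layers.items():
    r, W_lr = rank_for_tolerance(W, tol=0.3)
    print(name, "rank", r, "rel. error", float(torch.norm(W - W_lr) / torch.norm(W)))
```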
【20】Provably Mitigating Corruption, Overoptimization, and Verbosity Simultaneously in Offline and Online RLHF/DPO Alignment
标题:在离线和在线RLHF/DPO对齐中可证明地同时缓解污染、过度优化和冗长
链接:https://arxiv.org/abs/2510.05526
作者:Ziyi Chen, Junyi Li, Peiran Yu, Heng Huang
摘要:Reinforcement learning from human feedback (RLHF) and direct preference optimization (DPO) are important techniques to align large language models (LLM) with human preference. However, the quality of RLHF and DPO training is seriously compromised by Corrupted preference, reward Overoptimization, and bias towards Verbosity (COV). To our knowledge, most existing works tackle only one of these important issues, and the few other works require much computation to estimate multiple reward models and lack theoretical guarantee of generalization ability. In this work, we propose RLHF-COV and DPO-COV algorithms that can simultaneously mitigate these three issues, in both offline and online settings. This ability is theoretically demonstrated by obtaining length-regularized generalization error rates for our DPO-COV algorithms trained on corrupted data, which match the best-known rates for simpler cases with clean data and without length regularization. Moreover, our DPO-COV algorithm is simple to implement without reward estimation, and is proved to be equivalent to our RLHF-COV algorithm, which directly implies the equivalence between the vanilla RLHF and DPO algorithms. Experiments demonstrate the effectiveness of our DPO-COV algorithms under both offline and online settings.
【21】NorMuon: Making Muon more efficient and scalable
标题:NorMuon:让Muon更高效和可扩展
链接:https://arxiv.org/abs/2510.05491
作者:Zichong Li, Liming Liu, Chen Liang, Weizhu Chen, Tuo Zhao
摘要:The choice of optimizer significantly impacts the training efficiency and computational costs of large language models (LLMs). Recently, the Muon optimizer has demonstrated promising results by orthogonalizing parameter updates, improving optimization geometry through better conditioning. Despite Muon's emergence as a candidate successor to Adam, the potential for jointly leveraging their strengths has not been systematically explored. In this work, we bridge this gap by proposing NorMuon (Neuron-wise Normalized Muon), an optimizer that synergistically combines orthogonalization with neuron-level adaptive learning rates. Our analysis reveals that while Muon effectively reduces condition numbers, the resulting updates exhibit highly non-uniform neuron norms, causing certain neurons to dominate the optimization process. NorMuon addresses this imbalance by maintaining second-order momentum statistics for each neuron and applying row-wise normalization after orthogonalization, ensuring balanced parameter utilization while preserving Muon's conditioning benefits. To enable practical deployment at scale, we develop an efficient distributed implementation under the FSDP2 framework that strategically distributes orthogonalization computations across devices. Experiments across multiple model scales demonstrate that NorMuon consistently outperforms both Adam and Muon, achieving 21.74% better training efficiency than Adam and 11.31% improvement over Muon on 1.1 B pretraining setting, while maintaining a comparable memory footprint to Muon. Our findings suggest that orthogonalization and adaptive learning rates are complementary rather than competing approaches, opening new avenues for optimizer design in large-scale deep learning.
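A rough sketch of the mechanism described above (editor's illustration; Muon itself uses a tuned quintic Newton-Schulz iteration, and the hyperparameters, helper names, and exact placement of the statistics here are assumptions): orthogonalize the momentum, then rescale each output neuron (row) by a running second-moment estimate.
```python
import torch

def orthogonalize(M, steps=5):
    """Classical cubic Newton-Schulz iteration toward the nearest semi-orthogonal
    matrix (Muon uses a tuned quintic variant; this keeps the sketch short)."""
    X = M / (M.norm() + 1e-8)
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X.T if transposed else X

def normuon_step(param, grad, momentum, second_moment,
                 lr=0.02, beta=0.95, beta2=0.999, eps=1e-8):
    """One sketched NorMuon-style update: orthogonalize the momentum, then
    rescale each output neuron (row) by its second-moment statistic."""
    momentum.mul_(beta).add_(grad, alpha=1 - beta)
    update = orthogonalize(momentum)
    row_sq = update.pow(2).mean(dim=1, keepdim=True)       # per-neuron statistic
    second_moment.mul_(beta2).add_(row_sq, alpha=1 - beta2)
    update = update / (second_moment.sqrt() + eps)          # row-wise normalization
    param.add_(update, alpha=-lr)                           # (bias correction omitted)

W = torch.randn(256, 512)
state_m, state_v = torch.zeros_like(W), torch.zeros(256, 1)
normuon_step(W, torch.randn_like(W), state_m, state_v)
```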
【22】The Method of Infinite Descent
标题:无穷下降法
链接:https://arxiv.org/abs/2510.05489
作者:Reza T. Batley, Sourav Saha
摘要:Training - the optimisation of complex models - is traditionally performed through small, local, iterative updates [D. E. Rumelhart, G. E. Hinton, R. J. Williams, Nature 323, 533-536 (1986)]. Approximating solutions through truncated gradients is a paradigm dating back to Cauchy [A.-L. Cauchy, Comptes Rendus Math\'ematique 25, 536-538 (1847)] and Newton [I. Newton, The Method of Fluxions and Infinite Series (Henry Woodfall, London, 1736)]. This work introduces the Method of Infinite Descent, a semi-analytic optimisation paradigm that reformulates training as the direct solution to the first-order optimality condition. By analytical resummation of its Taylor expansion, this method yields an exact, algebraic equation for the update step. Realisation of the infinite Taylor tower's cascading resummation is formally derived, and an exploitative algorithm for the direct solve step is proposed. This principle is demonstrated with the herein-introduced AION (Analytic, Infinitely-Optimisable Network) architecture. AION is a model designed expressly to satisfy the algebraic closure required by Infinite Descent. In a simple test problem, AION reaches the optimum in a single descent step. Together, this optimiser-model pair exemplify how analytic structure enables exact, non-iterative convergence. Infinite Descent extends beyond this example, applying to any appropriately closed architecture. This suggests a new class of semi-analytically optimisable models: the \emph{Infinity Class}; sufficient conditions for class membership are discussed. This offers a pathway toward non-iterative learning.
【23】TensorBLEU: Vectorized GPU-based BLEU Score Implementation for Per-Sentence In-Training Evaluation
标题:TensorBLEU:用于逐句训练中评估的基于GPU的向量化BLEU分数实现
链接:https://arxiv.org/abs/2510.05485
作者:Adam Filipek
备注:9 pages, 3 figures
摘要:Modern natural language processing models have achieved unprecedented scale, yet the tools for their evaluation often remain a computational bottleneck, limiting the pace of research. This is particularly acute for in-training evaluation metrics, such as per-sentence reward signals in Reinforcement Learning, which must operate efficiently on batches of token IDs directly on the GPU. In this paper, we introduce TensorBLEU, a novel implementation of the BLEU metric designed from the ground up for this specific use case. Our approach is fully vectorized for GPU-accelerated, per-sentence computation within PyTorch and introduces a memory-efficient counting mechanism. By creating a compact, batch-specific dictionary of n-grams using \texttt{torch.unique}, our method avoids the prohibitive memory costs of traditional hashing-based vectorization, making it practical for large-vocabulary models. We benchmark TensorBLEU against NLTK, the standard library for token-ID-based BLEU calculation on the CPU. Experiments show that TensorBLEU provides speedups of over 13x on consumer-grade GPUs (NVIDIA T4) and exceeding 40x on data-center-class hardware (NVIDIA A100). This performance transforms a significant bottleneck into a negligible part of the training loop. By clearly defining its role as a "Token-ID BLEU" for development purposes and open-sourcing our implementation, we provide a powerful tool for accelerating research in areas like RL-based model fine-tuning.
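A compact sketch of the counting trick described above (editor's illustration, not the released TensorBLEU code): n-grams are gathered with unfold and mapped to a small batch-specific vocabulary via torch.unique rather than hashed; the final per-sentence loop is kept only for readability.
```python
import torch

def ngram_ids(token_ids, n):
    """All n-grams of each sequence as rows: (batch, L-n+1, n)."""
    return token_ids.unfold(dimension=1, size=n, step=1)

def clipped_ngram_matches(cand, ref, n):
    """Per-sentence clipped n-gram matches. cand, ref: (B, L) integer token ids."""
    c, r = ngram_ids(cand, n), ngram_ids(ref, n)              # (B, Tc, n), (B, Tr, n)
    B, Tc, _ = c.shape
    Tr = r.shape[1]
    both = torch.cat([c, r], dim=1).reshape(-1, n)            # all n-grams in the batch
    _, inv = torch.unique(both, dim=0, return_inverse=True)   # compact batch-specific ids
    inv = inv.view(B, Tc + Tr)
    c_id, r_id = inv[:, :Tc], inv[:, Tc:]
    matches = torch.zeros(B, dtype=torch.long)
    for b in range(B):                                        # per-sentence clipping
        c_cnt = torch.bincount(c_id[b])
        r_cnt = torch.bincount(r_id[b], minlength=c_cnt.numel())
        matches[b] = torch.minimum(c_cnt, r_cnt[: c_cnt.numel()]).sum()
    return matches

cand = torch.tensor([[1, 2, 3, 4, 5]])
ref = torch.tensor([[1, 2, 3, 9, 5]])
print(clipped_ngram_matches(cand, ref, n=2))   # tensor([2]): bigrams "1 2" and "2 3"
```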
【24】ATOM: A Pretrained Neural Operator for Multitask Molecular Dynamics
标题:ATOM:一种用于多任务分子动力学的预训练神经操作器
链接:https://arxiv.org/abs/2510.05482
作者:Luke Thompson, Davy Guan, Dai Shi, Slade Matthews, Junbin Gao, Andi Han
摘要:Molecular dynamics (MD) simulations underpin modern computational drug discovery, materials science, and biochemistry. Recent machine learning models provide high-fidelity MD predictions without the need to repeatedly solve quantum mechanical forces, enabling significant speedups over conventional pipelines. Yet many such methods typically enforce strict equivariance and rely on sequential rollouts, thus limiting their flexibility and simulation efficiency. They are also commonly single-task, trained on individual molecules and fixed timeframes, which restricts generalization to unseen compounds and extended timesteps. To address these issues, we propose Atomistic Transformer Operator for Molecules (ATOM), a pretrained transformer neural operator for multitask molecular dynamics. ATOM adopts a quasi-equivariant design that requires no explicit molecular graph and employs a temporal attention mechanism, allowing for the accurate parallel decoding of multiple future states. To support operator pretraining across chemicals and timescales, we curate TG80, a large, diverse, and numerically stable MD dataset with over 2.5 million femtoseconds of trajectories across 80 compounds. ATOM achieves state-of-the-art performance on established single-task benchmarks, such as MD17, RMD17 and MD22. After multitask pretraining on TG80, ATOM shows exceptional zero-shot generalization to unseen molecules across varying time horizons. We believe ATOM represents a significant step toward accurate, efficient, and transferable molecular dynamics models.
【25】Draft, Verify, and Improve: Toward Training-Aware Speculative Decoding
标题:起草、验证和改进:迈向训练感知的推测解码
链接:https://arxiv.org/abs/2510.05421
作者:Shrenik Bhansali, Larry Heck
摘要:Autoregressive (AR) decoding is a major latency bottleneck for large language models. Speculative decoding (SD) accelerates AR by letting a drafter propose multi-token blocks that a verifier accepts or rejects. However, many SD systems require heavy offline training or extra components. These choices raise data/compute cost and can yield brittle drafters under distribution drift. We introduce \emph{Draft, Verify, \& Improve (DVI)}, a training-aware self-speculative framework that combines inference with continual online learning. We partition an LLM into a drafter and a verifier, and during generation, verifier accept/reject decisions are converted into supervision signals and used to update the drafter head. A simple \emph{KL$\rightarrow$RL} schedule bootstraps calibration via online distillation and then adds reward-masked cross-entropy with an on-policy policy-gradient term, preserving lossless, single model deployment. On Spec-Bench, DVI achieves a $2.16\times$ wall-time speedup, on par with SoTA approaches like EAGLE-2, while requiring orders of magnitude less data for training, and ablations show that DVI outperforms KL-only online distillation. DVI demonstrates that \emph{training-aware} self-speculation can deliver state-of-the-art, lossless speedups with minimal training overhead.
【26】KVLinC : KV Cache Quantization with Hadamard Rotation and Linear Correction
标题:KVLinC:具有Hadamard旋转和线性纠正的KV缓存量化
链接:https://arxiv.org/abs/2510.05373
作者:Utkarsh Saxena, Kaushik Roy
备注:14 pages, 7 figures, 6 tables
摘要:Quantizing the key-value (KV) cache is a promising strategy for improving the inference efficiency of large language models (LLMs). However, aggressive quantization to very low precision (e.g., 2 bits) introduces significant errors in the stored key and value tensors, which propagate through the dot-product attention mechanism and ultimately degrade generation quality. To address this, we propose KVLinC, a framework to mitigate attention errors introduced by KV cache quantization in the extreme low-precision regime. KVLinC combines a Hadamard rotation, which reduces quantization error in values, with lightweight linear correction adapters that explicitly compensate for errors introduced by quantized keys. Across extensive evaluations on the LLaMA, Qwen2.5, and Qwen3 model families, KVLinC consistently matches or surpasses strong baselines while achieving higher KV-cache compression. Furthermore, we implement a custom attention kernel that results in upto 2.55x faster inference compared to Flash Attention baseline, enabling efficient long-context LLM inference.
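A small sketch of the Hadamard-rotation step (editor's illustration; the linear correction adapters for key errors and the actual low-bit kernels are not shown, and the toy quantizer is an assumption): rotating values with an orthonormal Hadamard matrix spreads outlier channels before quantization, and the rotation is exactly undone after dequantization.
```python
import torch

def hadamard(n):
    """Sylvester construction of an orthonormal n x n Hadamard matrix (n a power of two)."""
    H = torch.ones(1, 1)
    while H.shape[0] < n:
        H = torch.cat([torch.cat([H, H], dim=1), torch.cat([H, -H], dim=1)], dim=0)
    return H / n ** 0.5

def quantize(x, bits=2):
    """Toy symmetric per-token quantizer (stand-in for the real low-bit quantizer)."""
    qmax = 2 ** (bits - 1) - 1
    scale = x.abs().amax(dim=-1, keepdim=True) / qmax + 1e-8
    return torch.clamp(torch.round(x / scale), -qmax - 1, qmax), scale

torch.manual_seed(0)
d = 64
H = hadamard(d)
v = torch.randn(8, d)
v[:, :2] *= 30.0                            # a few outlier channels

v_rot = v @ H                               # rotate before caching/quantizing
q, s = quantize(v_rot)
v_back = (q * s) @ H.T                      # H @ H.T = I, so the rotation is undone exactly

# The rotation spreads outlier mass across channels, shrinking the per-token dynamic range
print("absmax / mean-abs, raw:    ", float((v.abs().amax(-1) / v.abs().mean(-1)).mean()))
print("absmax / mean-abs, rotated:", float((v_rot.abs().amax(-1) / v_rot.abs().mean(-1)).mean()))
print("round-trip quantization error:", float((v - v_back).norm() / v.norm()))
```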
【27】Physics-informed Attention-enhanced Fourier Neural Operator for Solar Magnetic Field Extrapolations
标题:用于太阳磁场外推的基于物理的注意力增强傅里叶神经运算器
链接:https://arxiv.org/abs/2510.05351
作者:Jinghao Cao, Qin Li, Mengnan Du, Haimin Wang, Bo Shen
备注:10 pages; accepted as workshop paper in ICDM 2025; this https URL
摘要:We propose Physics-informed Attention-enhanced Fourier Neural Operator (PIANO) to solve the Nonlinear Force-Free Field (NLFFF) problem in solar physics. Unlike conventional approaches that rely on iterative numerical methods, our proposed PIANO directly learns the 3D magnetic field structure from 2D boundary conditions. Specifically, PIANO integrates Efficient Channel Attention (ECA) mechanisms with Dilated Convolutions (DC), which enhances the model's ability to capture multimodal input by prioritizing critical channels relevant to the magnetic field's variations. Furthermore, we apply physics-informed loss by enforcing the force-free and divergence-free conditions in the training process so that our prediction is consistent with underlying physics with high accuracy. Experimental results on the ISEE NLFFF dataset show that our PIANO not only outperforms state-of-the-art neural operators in terms of accuracy but also shows strong consistency with the physical characteristics of NLFFF data across magnetic fields reconstructed from various solar active regions. The GitHub of this project is available https://github.com/Autumnstar-cjh/PIANO
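The force-free and divergence-free penalties mentioned above can be written down directly. Below is an editor's finite-difference illustration of the two residuals (not the exact training loss of PIANO): a force-free field must satisfy (∇×B)×B = 0 and ∇·B = 0.
```python
import numpy as np

def physics_residuals(B, spacing=1.0):
    """Force-free and divergence-free residuals of a magnetic field on a grid.

    B: array of shape (3, nx, ny, nz); derivatives via central finite differences.
    """
    Bx, By, Bz = B
    dBx = np.gradient(Bx, spacing)          # [d/dx, d/dy, d/dz]
    dBy = np.gradient(By, spacing)
    dBz = np.gradient(Bz, spacing)

    div = dBx[0] + dBy[1] + dBz[2]          # divergence of B

    # current density J = curl(B) (up to constants)
    Jx = dBz[1] - dBy[2]
    Jy = dBx[2] - dBz[0]
    Jz = dBy[0] - dBx[1]

    # Lorentz force J x B must vanish for a force-free field
    Fx = Jy * Bz - Jz * By
    Fy = Jz * Bx - Jx * Bz
    Fz = Jx * By - Jy * Bx

    force_loss = np.mean(Fx ** 2 + Fy ** 2 + Fz ** 2)
    div_loss = np.mean(div ** 2)
    return force_loss, div_loss

# A uniform field is trivially force-free and divergence-free:
B = np.zeros((3, 8, 8, 8)); B[2] = 1.0
print(physics_residuals(B))   # (0.0, 0.0)
```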
【28】CMT-Benchmark: A Benchmark for Condensed Matter Theory Built by Expert Researchers
标题:CMT-Benchmark:专家研究人员建立的凝聚物质理论基准
链接:https://arxiv.org/abs/2510.05228
作者:Haining Pan, James V. Roggeveen, Erez Berg, Juan Carrasquilla, Debanjan Chowdhury, Surya Ganguli, Federico Ghimenti, Juraj Hasik, Henry Hunt, Hong-Chen Jiang, Mason Kamb, Ying-Jer Kao, Ehsan Khatami, Michael J. Lawler, Di Luo, Titus Neupert, Xiaoliang Qi, Michael P. Brenner, Eun-Ah Kim
备注:19 pages, 3 figures
摘要:Large language models (LLMs) have shown remarkable progress in coding and math problem-solving, but evaluation on advanced research-level problems in hard sciences remains scarce. To fill this gap, we present CMT-Benchmark, a dataset of 50 problems covering condensed matter theory (CMT) at the level of an expert researcher. Topics span analytical and computational approaches in quantum many-body and classical statistical mechanics. The dataset was designed and verified by a panel of expert researchers from around the world. We built the dataset through a collaborative environment that challenges the panel to write and refine problems they would want a research assistant to solve, including Hartree-Fock, exact diagonalization, quantum/variational Monte Carlo, density matrix renormalization group (DMRG), quantum/classical statistical mechanics, and model building. We evaluate LLMs by programmatically checking solutions against expert-supplied ground truth. We developed machine-grading routines, including symbolic handling of non-commuting operators via normal ordering, that generalize across tasks. Our evaluations show that frontier models struggle with all of the problems in the dataset, highlighting a gap in the physical reasoning skills of current LLMs. Notably, experts identified strategies for creating increasingly difficult problems by interacting with the LLMs and exploiting common failure modes. The best model, GPT5, solves 30\% of the problems; average across 17 models (GPT, Gemini, Claude, DeepSeek, Llama) is 11.4$\pm$2.1\%. Moreover, 18 problems are solved by none of the 17 models, and 26 by at most one. These unsolved problems span Quantum Monte Carlo, Variational Monte Carlo, and DMRG. Answers sometimes violate fundamental symmetries or have unphysical scaling dimensions. We believe this benchmark will guide development toward capable AI research assistants and tutors.
【29】Exact Causal Attention with 10% Fewer Operations
标题:精确的因果注意力,减少10%的操作
链接:https://arxiv.org/abs/2510.05175
作者:Dmitry Rybin, Yushun Zhang, Ding Tian, Zhihang Lin, Ruoyu Sun, Zhi-Quan Luo
摘要:We present Fast Causal Attention (FCA), an algorithm that computes exact Causal Attention using 10\% fewer operations. FCA accelerates a special class of matrix multiplications where either one operand or the output matrix is upper- or lower-triangular. This includes all operations in forward and backward pass of Causal Attention, such as masked product $\mathrm{Mask}(QK^{T})$. For these matrix multiplications on GPU, FCA reaches noticeable accelerations over the default PyTorch implementations and Triton compiled kernels. FCA is built upon algebraic identities discovered via machine learning and combinatorial search.
【30】SATER: A Self-Aware and Token-Efficient Approach to Routing and Cascading
标题:SATER:一种自感知且令牌高效的路由和级联方法
链接:https://arxiv.org/abs/2510.05164
作者:Yuanzhe Shen, Yide Liu, Zisu Huang, Ruicheng Yin, Xiaoqing Zheng, Xuanjing Huang
备注:Accepted to EMNLP 2025 Main
摘要:Large language models (LLMs) demonstrate remarkable performance across diverse tasks, yet their effectiveness frequently depends on costly commercial APIs or cloud services. Model selection thus entails a critical trade-off between performance and cost: high-performing LLMs typically incur substantial expenses, whereas budget-friendly small language models (SLMs) are constrained by limited capabilities. Current research primarily proposes two routing strategies: pre-generation routing and cascade routing. Both approaches have distinct characteristics, with cascade routing typically offering superior cost-effectiveness and accuracy despite its higher latency. To further address the limitations of both approaches, we introduce SATER, a dual-mode compatible approach that fine-tunes models through shortest-response preference optimization and a confidence-aware rejection mechanism. SATER significantly reduces redundant outputs and response times, while improving both the performance of pre-generation routing and the efficiency of cascade routing. Experiments across three SLMs and six datasets, varying in type and complexity, demonstrate that SATER achieves comparable performance while consistently reducing computational costs by over 50\% and cascade latency by over 80\%.
【31】Malice in Agentland: Down the Rabbit Hole of Backdoors in the AI Supply Chain
标题:Agentland的恶意:深入人工智能供应链后门的兔子洞
链接:https://arxiv.org/abs/2510.05159
作者:Léo Boisvert, Abhay Puri, Chandra Kiran Reddy Evuru, Nicolas Chapados, Quentin Cappart, Alexandre Lacoste, Krishnamurthy Dj Dvijotham, Alexandre Drouin
备注:27 pages
摘要:The practice of fine-tuning AI agents on data from their own interactions--such as web browsing or tool use--, while being a strong general recipe for improving agentic capabilities, also introduces a critical security vulnerability within the AI supply chain. In this work, we show that adversaries can easily poison the data collection pipeline to embed hard-to-detect backdoors that are triggerred by specific target phrases, such that when the agent encounters these triggers, it performs an unsafe or malicious action. We formalize and validate three realistic threat models targeting different layers of the supply chain: 1) direct poisoning of fine-tuning data, where an attacker controls a fraction of the training traces; 2) environmental poisoning, where malicious instructions are injected into webpages scraped or tools called while creating training data; and 3) supply chain poisoning, where a pre-backdoored base model is fine-tuned on clean data to improve its agentic capabilities. Our results are stark: by poisoning as few as 2% of the collected traces, an attacker can embed a backdoor causing an agent to leak confidential user information with over 80% success when a specific trigger is present. This vulnerability holds across all three threat models. Furthermore, we demonstrate that prominent safeguards, including two guardrail models and one weight-based defense, fail to detect or prevent the malicious behavior. These findings highlight an urgent threat to agentic AI development and underscore the critical need for rigorous security vetting of data collection processes and end-to-end model supply chains.
【32】Non-iid hypothesis testing: from classical to quantum
标题:非Iid假设检验:从经典到量子
链接:https://arxiv.org/abs/2510.06147
作者:Giacomo De Palma, Marco Fanizza, Connor Mowry, Ryan O'Donnell
备注:33 pages, 2 figures
摘要:We study hypothesis testing (aka state certification) in the non-identically distributed setting. A recent work (Garg et al. 2023) considered the classical case, in which one is given (independent) samples from $T$ unknown probability distributions $p_1, \dots, p_T$ on $[d] = \{1, 2, \dots, d\}$, and one wishes to accept/reject the hypothesis that their average $p_{\mathrm{avg}}$ equals a known hypothesis distribution $q$. Garg et al. showed that if one has just $c = 2$ samples from each $p_i$, and provided $T \gg \frac{\sqrt{d}}{\epsilon^2} + \frac{1}{\epsilon^4}$, one can (whp) distinguish $p_{\mathrm{avg}} = q$ from $d_{\mathrm{TV}}(p_{\mathrm{avg}},q) > \epsilon$. This nearly matches the optimal result for the classical iid setting (namely, $T \gg \frac{\sqrt{d}}{\epsilon^2}$). Besides optimally improving this result (and generalizing to tolerant testing with more stringent distance measures), we study the analogous problem of hypothesis testing for non-identical quantum states. Here we uncover an unexpected phenomenon: for any $d$-dimensional hypothesis state $\sigma$, and given just a single copy ($c = 1$) of each state $\rho_1, \dots, \rho_T$, one can distinguish $\rho_{\mathrm{avg}} = \sigma$ from $D_{\mathrm{tr}}(\rho_{\mathrm{avg}},\sigma) > \epsilon$ provided $T \gg d/\epsilon^2$. (Again, we generalize to tolerant testing with more stringent distance measures.) This matches the optimal result for the iid case, which is surprising because doing this with $c = 1$ is provably impossible in the classical case. We also show that the analogous phenomenon happens for the non-iid extension of identity testing between unknown states. A technical tool we introduce may be of independent interest: an Efron-Stein inequality, and more generally an Efron-Stein decomposition, in the quantum setting.
机器翻译由腾讯交互翻译提供,仅供参考
点击“阅读原文”获取带摘要的学术速递