Click "Read the original" to visit arxivdaily.com, covering CS, Physics, Mathematics, Economics, Statistics, Finance, Biology, and Electrical Engineering, with search, bookmarking, and more!
cs.LG: 104 papers today
Large Models (18 papers)
【1】Value-Aware Numerical Representations for Transformer Language Models
Link: https://arxiv.org/abs/2601.09706
Authors: Andreea Dutulescu, Stefan Ruseti, Mihai Dascalu
Abstract: Transformer-based language models often achieve strong results on mathematical reasoning benchmarks while remaining fragile on basic numerical understanding and arithmetic operations. A central limitation is that numbers are processed as symbolic tokens whose embeddings do not explicitly encode numerical value, leading to systematic errors. We introduce a value-aware numerical representation that augments standard tokenized inputs with a dedicated prefix token whose embedding is explicitly conditioned on the underlying numerical value. This mechanism injects magnitude information directly into the model's input space while remaining compatible with existing tokenizers and decoder-only Transformer architectures. Evaluation on arithmetic tasks shows that the proposed approach outperforms baselines across numerical formats, tasks, and operand lengths. These results indicate that explicitly encoding numerical value is an effective and efficient way to improve fundamental numerical robustness in language models.
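The prefix-token mechanism can be sketched roughly as follows. This is a minimal illustration with made-up dimensions, a fixed random projection standing in for learned parameters, and magnitude features (sign and log-magnitude) of our own choosing; the paper's exact conditioning function is not specified in the abstract.

```python
import math
import numpy as np

# Hypothetical embedding dimension and projections; in the paper these
# would be learned jointly with the model.
DIM = 8
W_VALUE = np.random.default_rng(0).normal(size=(DIM, 2))
EMB_TABLE = np.random.default_rng(1).normal(size=(1000, DIM))

def value_embedding(value):
    """Condition a dedicated 'prefix token' embedding on the raw numeric value."""
    features = np.array([math.copysign(1.0, value), math.log1p(abs(value))])
    return W_VALUE @ features

def embed_with_value_prefix(number_token_ids, value):
    """Prepend the value-conditioned embedding to the number's token embeddings."""
    token_embs = EMB_TABLE[number_token_ids]
    return np.vstack([value_embedding(value), token_embs])

# A two-token number preceded by its value-aware prefix: 3 rows total.
seq = embed_with_value_prefix([17, 42], 3.14)
```

The key property is that numbers with different magnitudes receive different prefix embeddings even when their symbolic tokenizations are similar.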
【2】Routing with Generated Data: Annotation-Free LLM Skill Estimation and Expert Selection
Link: https://arxiv.org/abs/2601.09692
Authors: Tianyi Niu, Justin Chih-Yao Chen, Genta Indra Winata, Shi-Xiong Zhang, Supriyo Chakraborty, Sambit Sahu, Yue Zhang, Elias Stengel-Eskin, Mohit Bansal
Notes: Code: https://github.com/tianyiniu/RoutingGenData
Abstract: Large Language Model (LLM) routers dynamically select optimal models for given inputs. Existing approaches typically assume access to ground-truth labeled data, which is often unavailable in practice, especially when user request distributions are heterogeneous and unknown. We introduce Routing with Generated Data (RGD), a challenging setting in which routers are trained exclusively on generated queries and answers produced from high-level task descriptions by generator LLMs. We evaluate query-answer routers (using both queries and labels) and query-only routers across four diverse benchmarks and 12 models, finding that query-answer routers degrade faster than query-only routers as generator quality decreases. Our analysis reveals two crucial characteristics of effective generators: they must accurately respond to their own questions, and their questions must produce sufficient performance differentiation among the model pool. We then show how filtering for these characteristics can improve the quality of generated data. We further propose CASCAL, a novel query-only router that estimates model correctness through consensus voting and identifies model-specific skill niches via hierarchical clustering. CASCAL is substantially more robust to generator quality, outperforming the best query-answer router by 4.6% absolute accuracy when trained on weak generator data.
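The consensus-voting idea behind CASCAL's label-free correctness estimation can be sketched as follows. This is our reading of the abstract, not the authors' implementation: the majority answer across the model pool serves as a pseudo-label, and each model is scored against those pseudo-labels.

```python
from collections import Counter

def consensus_accuracy(answers_by_model):
    """Estimate each model's correctness rate without ground truth.

    For each query, the majority answer across the pool is treated as a
    pseudo-label; each model's score is its agreement with those labels.
    """
    models = list(answers_by_model)
    n_queries = len(answers_by_model[models[0]])
    pseudo_labels = []
    for q in range(n_queries):
        votes = Counter(answers_by_model[m][q] for m in models)
        pseudo_labels.append(votes.most_common(1)[0][0])
    return {
        m: sum(a == p for a, p in zip(answers_by_model[m], pseudo_labels)) / n_queries
        for m in models
    }

# Toy pool of three models answering three generated queries.
answers = {
    "A": ["x", "y", "z"],
    "B": ["x", "y", "w"],
    "C": ["x", "q", "z"],
}
scores = consensus_accuracy(answers)
```

In CASCAL these per-model estimates would additionally be broken out by skill niche via hierarchical clustering of queries, which this sketch omits.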
【3】LLM for Large-Scale Optimization Model Auto-Formulation: A Lightweight Few-Shot Learning Approach
Link: https://arxiv.org/abs/2601.09635
Authors: Kuo Liang, Yuhang Lu, Jianming Mao, Shuyi Sun, Chunwei Yang, Congcong Zeng, Xiao Jin, Hanzhang Qin, Ruihao Zhu, Chung-Piaw Teo
Notes: Updated version of https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5329027
Abstract: Large-scale optimization is a key backbone of modern business decision-making. However, building these models is often labor-intensive and time-consuming. We address this by proposing LEAN-LLM-OPT, a LightwEight AgeNtic workflow construction framework for LLM-assisted large-scale OPTimization auto-formulation. LEAN-LLM-OPT takes as input a problem description together with associated datasets and orchestrates a team of LLM agents to produce an optimization formulation. Specifically, upon receiving a query, two upstream LLM agents dynamically construct a workflow that specifies, step-by-step, how optimization models for similar problems can be formulated. A downstream LLM agent then follows this workflow to generate the final output. Leveraging LLMs' text-processing capabilities and common modeling practices, the workflow decomposes the modeling task into a sequence of structured sub-tasks and offloads mechanical data-handling operations to auxiliary tools. This design alleviates the downstream agent's burden related to planning and data handling, allowing it to focus on the most challenging components that cannot be readily standardized. Extensive simulations show that LEAN-LLM-OPT, instantiated with GPT-4.1 and the open source gpt-oss-20B, achieves strong performance on large-scale optimization modeling tasks and is competitive with state-of-the-art approaches. In addition, in a Singapore Airlines choice-based revenue management use case, LEAN-LLM-OPT demonstrates practical value by achieving leading performance across a range of scenarios. Along the way, we introduce Large-Scale-OR and Air-NRM, the first comprehensive benchmarks for large-scale optimization auto-formulation. The code and data of this work are available at https://github.com/CoraLiang01/lean-llm-opt.
【4】From Prompt to Protocol: Fast Charging Batteries with Large Language Models
Link: https://arxiv.org/abs/2601.09626
Authors: Ge Lei, Ferran Brosa Planella, Sterling G. Baird, Samuel J. Cooper
Abstract: Efficiently optimizing battery charging protocols is challenging because each evaluation is slow, costly, and non-differentiable. Many existing approaches address this difficulty by heavily constraining the protocol search space, which limits the diversity of protocols that can be explored, preventing the discovery of higher-performing solutions. We introduce two gradient-free, LLM-driven closed-loop methods: Prompt-to-Optimizer (P2O), which uses an LLM to propose the code for small neural-network-based protocols, which are then trained by an inner loop, and Prompt-to-Protocol (P2P), which simply writes an explicit function for the current and its scalar parameters. Across our case studies, LLM-guided P2O outperforms neural networks designed by Bayesian optimization, evolutionary algorithms, and random search. In a realistic fast charging scenario, both P2O and P2P yield around a 4.2 percent improvement in state of health (capacity retention based health metric under fast charging cycling) over a state-of-the-art multi-step constant current (CC) baseline, with P2P achieving this under matched evaluation budgets (same number of protocol evaluations). These results demonstrate that LLMs can expand the space of protocol functional forms, incorporate language-based constraints, and enable efficient optimization in high-cost experimental settings.
【5】Private LLM Inference on Consumer Blackwell GPUs: A Practical Guide for Cost-Effective Local Deployment in SMEs
Link: https://arxiv.org/abs/2601.09527
Authors: Jonathan Knoop, Hendrik Holtmann
Notes: 15 pages, 18 tables, 7 figures. Includes link to GitHub repository and Docker image for reproducibility
Abstract: SMEs increasingly seek alternatives to cloud LLM APIs, which raise data privacy concerns. Dedicated cloud GPU instances offer improved privacy but with limited guarantees and ongoing costs, while professional on-premise hardware (A100, H100) remains prohibitively expensive. We present a systematic evaluation of NVIDIA's Blackwell consumer GPUs (RTX 5060 Ti, 5070 Ti, 5090) for production LLM inference, benchmarking four open-weight models (Qwen3-8B, Gemma3-12B, Gemma3-27B, GPT-OSS-20B) across 79 configurations spanning quantization formats (BF16, W4A16, NVFP4, MXFP4), context lengths (8k-64k), and three workloads: RAG, multi-LoRA agentic serving, and high-concurrency APIs. The RTX 5090 delivers 3.5-4.6x higher throughput than the 5060 Ti with 21x lower latency for RAG, but budget GPUs achieve the highest throughput-per-dollar for API workloads with sub-second latency. NVFP4 quantization provides 1.6x throughput over BF16 with 41% energy reduction and only 2-4% quality loss. Self-hosted inference costs $0.001-0.04 per million tokens (electricity only), which is 40-200x cheaper than budget-tier cloud APIs, with hardware breaking even in under four months at moderate volume (30M tokens/day). Our results show that consumer GPUs can reliably replace cloud inference for most SME workloads, except latency-critical long-context RAG, where high-end GPUs remain essential. We provide deployment guidance and release all benchmark data for reproducible SME-scale deployments.
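The break-even claim is simple arithmetic. The sketch below uses assumed round numbers (a $2000 GPU box, $0.60 per million cloud tokens, $0.01 per million tokens of electricity); these are illustrative placeholders consistent with the abstract's ranges, not figures from the paper.

```python
def breakeven_days(hardware_cost, tokens_per_day, cloud_price_per_m, self_price_per_m):
    """Days until the hardware cost is recovered by per-token savings
    of self-hosting versus a cloud API."""
    daily_cloud = tokens_per_day / 1e6 * cloud_price_per_m
    daily_self = tokens_per_day / 1e6 * self_price_per_m
    return hardware_cost / (daily_cloud - daily_self)

# Assumed: $2000 hardware, 30M tokens/day, $0.60/M cloud vs $0.01/M electricity.
days = breakeven_days(2000, 30e6, 0.60, 0.01)
```

Under these assumptions the box pays for itself in roughly 113 days, in line with the "under four months at 30M tokens/day" claim.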
【6】CLARE: Continual Learning for Vision-Language-Action Models via Autonomous Adapter Routing and Expansion
Link: https://arxiv.org/abs/2601.09512
Authors: Ralf Römer, Yi Zhang, Angela P. Schoellig
Notes: Project page: https://tum-lsy.github.io/clare. 9 pages, 5 figures
Abstract: To teach robots complex manipulation tasks, it is now a common practice to fine-tune a pre-trained vision-language-action model (VLA) on task-specific data. However, since this recipe updates existing representations, it is unsuitable for long-term operation in the real world, where robots must continually adapt to new tasks and environments while retaining the knowledge they have already acquired. Existing continual learning methods for robotics commonly require storing previous data (exemplars), struggle with long task sequences, or rely on task identifiers for deployment. To address these limitations, we propose CLARE, a general, parameter-efficient framework for exemplar-free continual learning with VLAs. CLARE introduces lightweight modular adapters into selected feedforward layers and autonomously expands the model only where necessary when learning a new task, guided by layer-wise feature similarity. During deployment, an autoencoder-based routing mechanism dynamically activates the most relevant adapters without requiring task labels. Through extensive experiments on the LIBERO benchmark, we show that CLARE achieves high performance on new tasks without catastrophic forgetting of earlier tasks, significantly outperforming even exemplar-based methods. Code and data are available at https://tum-lsy.github.io/clare.
【7】Enhancing Spatial Reasoning in Large Language Models for Metal-Organic Frameworks Structure Prediction
Link: https://arxiv.org/abs/2601.09285
Authors: Mianzhi Pan, JianFei Li, Peishuo Liu, Botian Wang, Yawen Ouyang, Yiming Rong, Hao Zhou, Jianbing Zhang
Abstract: Metal-organic frameworks (MOFs) are porous crystalline materials with broad applications such as carbon capture and drug delivery, yet accurately predicting their 3D structures remains a significant challenge. While Large Language Models (LLMs) have shown promise in generating crystals, their application to MOFs is hindered by MOFs' high atomic complexity. Inspired by the success of block-wise paradigms in deep generative models, we pioneer the use of LLMs in this domain by introducing MOF-LLM, the first LLM framework specifically adapted for block-level MOF structure prediction. To effectively harness LLMs for this modular assembly task, our training paradigm integrates spatial-aware continual pre-training (CPT), structural supervised fine-tuning (SFT), and matching-driven reinforcement learning (RL). By incorporating explicit spatial priors and optimizing structural stability via Soft Adaptive Policy Optimization (SAPO), our approach substantially enhances the spatial reasoning capability of a Qwen-3 8B model for accurate MOF structure prediction. Comprehensive experiments demonstrate that MOF-LLM outperforms state-of-the-art denoising-based and LLM-based methods while exhibiting superior sampling efficiency.
【8】LatencyPrism: Online Non-intrusive Latency Sculpting for SLO-Guaranteed LLM Inference
Link: https://arxiv.org/abs/2601.09258
Authors: Du Yin, Jiayi Ren, Xiayu Sun, Tianyao Zhou, Haizhu Zhou, Ruiyan Ma, Danyang Zhang
Notes: 12 pages, 6 figures
Abstract: LLM inference latency critically determines user experience and operational costs, directly impacting throughput under SLO constraints. Even brief latency spikes degrade service quality despite acceptable average performance. However, distributed inference environments featuring diverse software frameworks and XPU architectures combined with dynamic workloads make latency analysis challenging. Constrained by intrusive designs that necessitate service restarts or even suspension, and by hardware-bound implementations that fail to adapt to heterogeneous inference environments, existing AI profiling methods are often inadequate for real-time production analysis. We present LatencyPrism, the first zero-intrusion multi-platform latency sculpting system. It aims to break down the inference latency across the pipeline, proactively alert on inference latency anomalies, and guarantee adherence to SLOs, all without requiring code modifications or service restarts. LatencyPrism has been deployed across thousands of XPUs for over six months. It enables low-overhead real-time monitoring at batch level with alerts triggered in milliseconds. This approach distinguishes between workload-driven latency variations and anomalies indicating underlying issues with an F1-score of 0.98. We also conduct extensive experiments and investigations into root cause analysis to demonstrate LatencyPrism's capability.
【9】$D^2Prune$: Sparsifying Large Language Models via Dual Taylor Expansion and Attention Distribution Awareness
Link: https://arxiv.org/abs/2601.09176
Authors: Lang Xiong, Ning Liu, Ao Ren, Yuheng Bai, Haining Fang, BinYan Zhang, Zhe Jiang, Yujuan Tan, Duo Liu
Abstract: Large language models (LLMs) face significant deployment challenges due to their massive computational demands. While pruning offers a promising compression solution, existing methods suffer from two critical limitations: (1) they neglect activation distribution shifts between calibration data and test data, resulting in inaccurate error estimations; (2) they overlook the long-tail distribution characteristics of activations in the attention module. To address these limitations, this paper proposes a novel pruning method, $D^2Prune$. First, we propose a dual Taylor expansion-based method that jointly models weight and activation perturbations for precise error estimation, leading to precise pruning mask selection and weight updating and facilitating error minimization during pruning. Second, we propose an attention-aware dynamic update strategy that preserves the long-tail attention pattern by jointly minimizing the KL divergence of attention distributions and the reconstruction error. Extensive experiments show that $D^2Prune$ consistently outperforms SOTA methods across various LLMs (e.g., OPT-125M, LLaMA2/3, and Qwen3). Moreover, the dynamic attention update mechanism also generalizes well to ViT-based vision models like DeiT, achieving superior accuracy on ImageNet-1K.
【10】BalDRO: A Distributionally Robust Optimization based Framework for Large Language Model Unlearning
Link: https://arxiv.org/abs/2601.09172
Authors: Pengyang Shao, Naixin Zhai, Lei Chen, Yonghui Yang, Fengbin Zhu, Xun Yang, Meng Wang
Abstract: As Large Language Models (LLMs) increasingly shape online content, removing targeted information from well-trained LLMs (also known as LLM unlearning) has become critical for web governance. A key challenge lies in sample-wise imbalance within the forget set: different samples exhibit widely varying unlearning difficulty, leading to asynchronous forgetting where some knowledge remains insufficiently erased while others become over-forgotten. To address this, we propose BalDRO, a novel and efficient framework for balanced LLM unlearning. BalDRO formulates unlearning as a min-sup process: an inner step identifies a worst-case data distribution that emphasizes hard-to-unlearn samples, while an outer step updates model parameters under this distribution. We instantiate BalDRO via two efficient variants: BalDRO-G, a discrete GroupDRO-based approximation focusing on high-loss subsets, and BalDRO-DV, a continuous Donsker-Varadhan dual method enabling smooth adaptive weighting within standard training pipelines. Experiments on TOFU and MUSE show that BalDRO significantly improves both forgetting quality and model utility over existing methods, and we release code for reproducibility.
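The inner "sup" step of the min-sup formulation can be illustrated with a GroupDRO-style exponentiated-gradient update on the group distribution, which is how we read the BalDRO-G variant; the step size, losses, and number of iterations below are made up for illustration.

```python
import numpy as np

def groupdro_weights(group_losses, weights, step=0.1):
    """One inner 'sup' step: shift probability mass toward high-loss groups
    via exponentiated-gradient ascent, then renormalize."""
    w = weights * np.exp(step * np.asarray(group_losses))
    return w / w.sum()

# Toy setting: group 0 is hard to unlearn (largest forget loss).
losses = [2.0, 0.5, 0.1]
w = np.ones(3) / 3
for _ in range(50):
    w = groupdro_weights(losses, w, step=0.5)
# The outer step would then minimize the w-weighted loss over model parameters.
```

Repeated inner steps concentrate weight on the hardest group, which is the mechanism that counteracts asynchronous forgetting.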
【11】Interpretable Probability Estimation with LLMs via Shapley Reconstruction
Link: https://arxiv.org/abs/2601.09151
Authors: Yang Nan, Qihao Wen, Jiahao Wang, Pengfei He, Ravi Tandon, Yong Ge, Han Xu
Abstract: Large Language Models (LLMs) demonstrate potential to estimate the probability of uncertain events, by leveraging their extensive knowledge and reasoning capabilities. This ability can be applied to support intelligent decision-making across diverse fields, such as financial forecasting and preventive healthcare. However, directly prompting LLMs for probability estimation faces significant challenges: their outputs are often noisy, and the underlying predicting process is opaque. In this paper, we propose PRISM: Probability Reconstruction via Shapley Measures, a framework that brings transparency and precision to LLM-based probability estimation. PRISM decomposes an LLM's prediction by quantifying the marginal contribution of each input factor using Shapley values. These factor-level contributions are then aggregated to reconstruct a calibrated final estimate. In our experiments, we demonstrate PRISM improves predictive accuracy over direct prompting and other baselines, across multiple domains including finance, healthcare, and agriculture. Beyond performance, PRISM provides a transparent prediction pipeline: our case studies visualize how individual factors shape the final estimate, helping build trust in LLM-based decision support systems.
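The decompose-then-reconstruct step can be sketched with exact Shapley values over a small factor set. Here `predict` is a toy additive estimator standing in for the LLM call, and the factor names are invented; the point is the efficiency property, by which the factor contributions sum back to the full-input estimate minus the baseline.

```python
from itertools import combinations
from math import factorial

def shapley_values(factors, predict):
    """Exact Shapley attribution: each factor's value is its average
    marginal contribution to `predict` over all subsets of the others."""
    n = len(factors)
    values = {}
    for f in factors:
        rest = [g for g in factors if g != f]
        total = 0.0
        for k in range(n):
            for subset in combinations(rest, k):
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                total += weight * (predict(set(subset) | {f}) - predict(set(subset)))
        values[f] = total
    return values

# Toy additive 'estimator' standing in for the LLM (factor names invented).
effects = {"rainfall": 0.2, "inflation": -0.1, "sentiment": 0.05}
predict = lambda s: 0.5 + sum(effects[f] for f in s)
phi = shapley_values(list(effects), predict)
```

Exact enumeration is exponential in the number of factors, so a practical pipeline with many factors would use sampled approximations; the abstract does not say which estimator PRISM uses.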
【12】EvasionBench: Detecting Evasive Answers in Financial Q&A via Multi-Model Consensus and LLM-as-Judge
Link: https://arxiv.org/abs/2601.09142
Authors: Shijian Ma, Yan Lin, Yi Yang
Notes: Shijian Ma and Yan Lin contributed equally. Corresponding author: Yan Lin
Abstract: Detecting evasive answers in earnings calls is critical for financial transparency, yet progress is hindered by the lack of large-scale benchmarks. We introduce EvasionBench, comprising 30,000 training samples and 1,000 human-annotated test samples (Cohen's Kappa 0.835) across three evasion levels. Our key contribution is a multi-model annotation framework leveraging a core insight: disagreement between frontier LLMs signals hard examples most valuable for training. We mine boundary cases where two strong annotators conflict, using a judge to resolve labels. This approach outperforms single-model distillation by 2.4 percent, with judge-resolved samples improving generalization despite higher training loss (0.421 vs 0.393), evidence that disagreement mining acts as implicit regularization. Our trained model Eva-4B (4B parameters) achieves 81.3 percent accuracy, outperforming its base by 25 percentage points and approaching frontier LLM performance at a fraction of inference cost.
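The disagreement-mining flow we infer from the abstract is straightforward to sketch: agreements between the two annotator models are kept as-is, and conflicting (hard) cases are routed to a judge for resolution. The label names and the toy judge below are our own placeholders.

```python
def mine_hard_examples(labels_a, labels_b, judge):
    """Split samples into consensus-labeled and judge-resolved sets."""
    easy, hard = [], []
    for i, (a, b) in enumerate(zip(labels_a, labels_b)):
        if a == b:
            easy.append((i, a))          # both annotators agree
        else:
            hard.append((i, judge(i, a, b)))  # boundary case, judge decides
    return easy, hard

# Hypothetical three-level evasion labels from two annotator models.
easy, hard = mine_hard_examples(
    ["evasive", "direct", "partial"],
    ["evasive", "partial", "partial"],
    judge=lambda i, a, b: b,  # toy judge: always side with annotator B
)
```

The judge-resolved `hard` set is what the abstract credits with the implicit-regularization effect during training.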
【13】Fine Grained Evaluation of LLMs-as-Judges
Link: https://arxiv.org/abs/2601.08919
Authors: Sourav Saha, Mandar Mitra
Abstract: A good deal of recent research has focused on how Large Language Models (LLMs) may be used as `judges' in place of humans to evaluate the quality of the output produced by various text / image processing systems. Within this broader context, a number of studies have investigated the specific question of how effectively LLMs can be used as relevance assessors for the standard ad hoc task in Information Retrieval (IR). We extend these studies by looking at additional questions. Most importantly, we use a Wikipedia based test collection created by the INEX initiative, and prompt LLMs to not only judge whether documents are relevant / non-relevant, but to highlight relevant passages in documents that it regards as useful. The human relevance assessors involved in creating this collection were given analogous instructions, i.e., they were asked to highlight all passages within a document that respond to the information need expressed in a query. This enables us to evaluate the quality of LLMs as judges not only at the document level, but to also quantify how often these `judges' are right for the right reasons. Our findings suggest that LLMs-as-judges work best under human supervision.
【14】Spectral Generative Flow Models: A Physics-Inspired Replacement for Vectorized Large Language Models
Link: https://arxiv.org/abs/2601.08893
Authors: Andrew Kiruluta
Abstract: We introduce Spectral Generative Flow Models (SGFMs), a physics-inspired alternative to transformer-based large language models. Instead of representing text or video as sequences of discrete tokens processed by attention, SGFMs treat generation as the evolution of a continuous field governed by constrained stochastic dynamics in a multiscale wavelet basis. This formulation replaces global attention with local operators, spectral projections, and Navier-Stokes-like transport, yielding a generative mechanism grounded in continuity, geometry, and physical structure. Our framework provides three key innovations: (i) a field-theoretic ontology in which text and video are unified as trajectories of a stochastic partial differential equation; (ii) a wavelet-domain representation that induces sparsity, scale separation, and computational efficiency; and (iii) a constrained stochastic flow that enforces stability, coherence, and uncertainty propagation. Together, these components define a generative architecture that departs fundamentally from autoregressive modeling and diffusion-based approaches. SGFMs offer a principled path toward long-range coherence, multimodal generality, and physically structured inductive bias in next-generation generative models.
【15】Directional Attractors in LLM Reasoning: How Similarity Retrieval Steers Iterative Summarization Based Reasoning
Link: https://arxiv.org/abs/2601.08846
Authors: Cagatay Tekin, Charbel Barakat, Luis Joseph Luna Limgenco
Notes: 6 pages, 2 figures. Code available at: github.com/cagopat/InftyThink-with-Cross-Chain-Memory
Abstract: Iterative summarization based reasoning frameworks such as InftyThink enable long-horizon reasoning in large language models (LLMs) by controlling context growth, but they repeatedly regenerate similar reasoning strategies across tasks. We introduce InftyThink with Cross-Chain Memory, an extension that augments iterative reasoning with an embedding-based semantic cache of previously successful reasoning patterns. At each reasoning step, the model retrieves and conditions on the most semantically similar stored lemmas, guiding inference without expanding the context window indiscriminately. Experiments on MATH500, AIME2024, and GPQA-Diamond demonstrate that semantic lemma retrieval improves accuracy in structured domains while exposing failure modes in tests that include heterogeneous domains. Geometric analyses of reasoning trajectories reveal that cache retrieval induces directional biases in embedding space, leading to consistent fix (improve baseline accuracy) and break (degradation in baseline accuracy) attractors. Our results highlight both the benefits and limits of similarity-based memory for self-improving LLM reasoning.
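The semantic-cache retrieval step can be sketched as a cosine-similarity lookup over stored (embedding, lemma) pairs. The two-dimensional embeddings and lemma strings below are toy stand-ins; a real cache would use a sentence-embedding model.

```python
import numpy as np

def retrieve(cache, query_emb, top_k=1):
    """Return the top-k cached lemmas most similar to the query embedding."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    scored = sorted(cache, key=lambda item: cos(item[0], query_emb), reverse=True)
    return [lemma for _, lemma in scored[:top_k]]

# Toy cache of previously successful reasoning patterns.
cache = [
    (np.array([1.0, 0.0]), "factor the quadratic"),
    (np.array([0.0, 1.0]), "apply AM-GM"),
]
best = retrieve(cache, np.array([0.9, 0.1]))
```

Only the retrieved lemmas are added to the prompt at each step, which is how the method conditions on past patterns without growing the context window indiscriminately.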
【16】Emissions and Performance Trade-off Between Small and Large Language Models
Link: https://arxiv.org/abs/2601.08844
Authors: Anandita Garg, Uma Gaba, Deepan Muthirayan, Anish Roy Chowdhury
Notes: 6 pages. Accepted as a full paper to the 3rd International Conference on Foundation and Large Language Models (IEEE FLLM) 2025
Abstract: The advent of Large Language Models (LLMs) has raised concerns about their enormous carbon footprint, starting with energy-intensive training and continuing through repeated inference. This study investigates the potential of using fine-tuned Small Language Models (SLMs) as a sustainable alternative for predefined tasks. Here, we present a comparative analysis of the performance-emissions trade-off between LLMs and fine-tuned SLMs across selected tasks under Natural Language Processing, Reasoning and Programming. Our results show that in four out of the six selected tasks, SLMs maintained comparable performances for a significant reduction in carbon emissions during inference. Our findings demonstrate the viability of smaller models in mitigating the environmental impact of resource-heavy LLMs, thus advancing towards sustainable, green AI.
【17】Rubric-Conditioned LLM Grading: Alignment, Uncertainty, and Robustness
Link: https://arxiv.org/abs/2601.08843
Authors: Haotian Deng, Chris Farber, Jiyoon Lee, David Tang
Abstract: Automated short-answer grading (ASAG) remains a challenging task due to the linguistic variability of student responses and the need for nuanced, rubric-aligned partial credit. While Large Language Models (LLMs) offer a promising solution, their reliability as automated judges in rubric-based settings requires rigorous assessment. In this paper, we systematically evaluate the performance of LLM-judges for rubric-based short-answer grading. We investigate three key aspects: the alignment of LLM grading with expert judgment across varying rubric complexities, the trade-off between uncertainty and accuracy facilitated by a consensus-based deferral mechanism, and the model's robustness under random input perturbations and adversarial attacks. Using the SciEntsBank benchmark and Qwen 2.5-72B, we find that alignment is strong for binary tasks but degrades with increased rubric granularity. Our "Trust Curve" analysis demonstrates a clear trade-off where filtering low-confidence predictions improves accuracy on the remaining subset. Additionally, robustness experiments reveal that while the model is resilient to prompt injection, it is sensitive to synonym substitutions. Our work provides critical insights into the capabilities and limitations of rubric-conditioned LLM judges, highlighting the importance of uncertainty estimation and robustness testing for reliable deployment.
【18】CrowdLLM: Building LLM-Based Digital Populations Augmented with Generative Models
标题:CrowdLLM:利用生成模型构建基于LLM的数字人群
链接:https://arxiv.org/abs/2512.07890
作者:Ryan Feng Lin,Keyu Tian,Hanming Zheng,Congjing Zhang,Li Zeng,Shuai Huang
摘要:大型语言模型(LLM)的出现引发了人们对构建基于LLM的数字人群的浓厚兴趣,这类数字人群可应用于社会模拟、众包、营销和推荐系统等诸多场景。数字人群可以降低招募人类参与者的成本,并缓解与人类受试者研究相关的诸多顾虑。然而,研究发现,大多数现有工作仅依赖LLM,无法充分再现真实人群的准确性和多样性。为解决这一局限,我们提出了CrowdLLM,它将预训练LLM与生成模型相结合,以增强数字人群的多样性和保真度。我们对CrowdLLM进行了理论分析,论证其在构建高性价比、具备充分代表性、可扩展且可媲美真实人群质量的数字人群方面的巨大潜力。我们还在多个领域(如众包、投票、用户评分)开展了综合实验和模拟研究,结果表明CrowdLLM在对人类数据的准确性和分布保真度方面均取得了令人鼓舞的表现。
摘要:The emergence of large language models (LLMs) has sparked much interest in creating LLM-based digital populations that can be applied to many applications such as social simulation, crowdsourcing, marketing, and recommendation systems. A digital population can reduce the cost of recruiting human participants and alleviate many concerns related to human subject study. However, research has found that most of the existing works rely solely on LLMs and could not sufficiently capture the accuracy and diversity of a real human population. To address this limitation, we propose CrowdLLM that integrates pretrained LLMs and generative models to enhance the diversity and fidelity of the digital population. We conduct theoretical analysis of CrowdLLM regarding its great potential in creating cost-effective, sufficiently representative, scalable digital populations that can match the quality of a real crowd. Comprehensive experiments are also conducted across multiple domains (e.g., crowdsourcing, voting, user rating) and simulation studies which demonstrate that CrowdLLM achieves promising performance in both accuracy and distributional fidelity to human data.
Graph相关(图学习|图神经网络|图优化等)(2篇)
【1】FairGU: Fairness-aware Graph Unlearning in Social Network
标题:FairGU:社交网络中公平感知的图遗忘学习
链接:https://arxiv.org/abs/2601.09469
作者:Renqiang Luo,Yongshuai Yang,Huafei Huang,Qing Qing,Mingliang Hou,Ziqi Xu,Yi Yu,Jingjing Zhou,Feng Xia
备注:9 pages, 2 figs, WWW 2026 accepted
摘要:图遗忘学习(graph unlearning)已成为支撑可持续、保护隐私的社交网络的关键机制,使模型能够消除被删除节点的影响,从而更好地保护用户信息。然而,我们观察到,现有的图遗忘技术对敏感属性的保护不足,与传统图学习方法相比往往导致算法公平性下降。为解决这一问题,我们提出了FairGU,一个公平感知的图遗忘框架,旨在在遗忘过程中同时保持效用和公平性。FairGU将专门的公平感知模块与有效的数据保护策略相结合,确保在删除节点时敏感属性既不会被无意放大,也不会在结构上被暴露。通过在多个真实世界数据集上的大量实验,我们证明FairGU在准确性和公平性指标上均持续优于最先进的图遗忘方法和公平性增强的图学习基线。我们的研究结果揭示了当前遗忘实践中此前被忽视的风险,并将FairGU确立为面向下一代社会可持续网络系统的稳健且公平的解决方案。代码可在 https://github.com/LuoRenqiang/FairGU 获取。
摘要:Graph unlearning has emerged as a critical mechanism for supporting sustainable and privacy-preserving social networks, enabling models to remove the influence of deleted nodes and thereby better safeguard user information. However, we observe that existing graph unlearning techniques insufficiently protect sensitive attributes, often leading to degraded algorithmic fairness compared with traditional graph learning methods. To address this gap, we introduce FairGU, a fairness-aware graph unlearning framework designed to preserve both utility and fairness during the unlearning process. FairGU integrates a dedicated fairness-aware module with effective data protection strategies, ensuring that sensitive attributes are neither inadvertently amplified nor structurally exposed when nodes are removed. Through extensive experiments on multiple real-world datasets, we demonstrate that FairGU consistently outperforms state-of-the-art graph unlearning methods and fairness-enhanced graph learning baselines in terms of both accuracy and fairness metrics. Our findings highlight a previously overlooked risk in current unlearning practices and establish FairGU as a robust and equitable solution for the next generation of socially sustainable networked systems. The codes are available at https://github.com/LuoRenqiang/FairGU.
【2】HGATSolver: A Heterogeneous Graph Attention Solver for Fluid-Structure Interaction
标题:HGATSolver:用于流固耦合的异构图注意力求解器
链接:https://arxiv.org/abs/2601.09251
作者:Qin-Yi Zhang,Hong Wang,Siyao Liu,Haichuan Lin,Linying Cao,Xiao-Hu Zhou,Chen Chen,Shuangyi Wang,Zeng-Guang Hou
摘要:流固耦合(FSI)系统涉及流体和固体两个不同的物理域,它们由不同的偏微分方程控制,并在动态界面上耦合。虽然基于学习的求解器为昂贵的数值模拟提供了有前景的替代方案,但现有方法难以在统一框架内捕获FSI的异构动力学。界面耦合导致的跨域响应不一致,以及流体与固体区域之间学习难度的差异,进一步加剧了这一挑战,导致预测过程中的不稳定。为应对这些挑战,我们提出异构图注意力求解器(HGATSolver)。HGATSolver将系统编码为异构图,通过为流体、固体和界面区域设置不同的节点与边类型,将物理结构直接嵌入模型,从而为每个物理域定制专门的消息传递机制。为稳定显式时间步进,我们引入一种新颖的物理条件门控机制,作为可学习的自适应松弛因子。此外,域间梯度平衡损失(Inter-domain Gradient-Balancing Loss)基于预测不确定性动态平衡跨域的优化目标。在两个构建的FSI基准和一个公开数据集上的大量实验表明,HGATSolver达到了最先进的性能,为耦合多物理场系统的代理建模建立了有效框架。
摘要:Fluid-structure interaction (FSI) systems involve distinct physical domains, fluid and solid, governed by different partial differential equations and coupled at a dynamic interface. While learning-based solvers offer a promising alternative to costly numerical simulations, existing methods struggle to capture the heterogeneous dynamics of FSI within a unified framework. This challenge is further exacerbated by inconsistencies in response across domains due to interface coupling and by disparities in learning difficulty across fluid and solid regions, leading to instability during prediction. To address these challenges, we propose the Heterogeneous Graph Attention Solver (HGATSolver). HGATSolver encodes the system as a heterogeneous graph, embedding physical structure directly into the model via distinct node and edge types for fluid, solid, and interface regions. This enables specialized message-passing mechanisms tailored to each physical domain. To stabilize explicit time stepping, we introduce a novel physics-conditioned gating mechanism that serves as a learnable, adaptive relaxation factor. Furthermore, an Inter-domain Gradient-Balancing Loss dynamically balances the optimization objectives across domains based on predictive uncertainty. Extensive experiments on two constructed FSI benchmarks and a public dataset demonstrate that HGATSolver achieves state-of-the-art performance, establishing an effective framework for surrogate modeling of coupled multi-physics systems.
Transformer(7篇)
【1】Energy-Entropy Regularization: The True Power of Minimal Looped Transformers
标题:能量-熵正则化:最小循环Transformer的真正力量
链接:https://arxiv.org/abs/2601.09588
作者:Wai-Lun Lam
备注:19 pages, 2 figures
摘要:最近的研究表明,与标准深度架构相比,循环Transformer具有更强的推理能力。由于损失地形高度非凸且不规则,目前在基准任务上训练单头循环架构的方法经常失败或只能取得次优性能。在这些设置中,优化往往停滞在损失地形的不良局部极小值和鞍点,使模型无法找到全局最小点。这些单头循环Transformer模型的内部机制仍然鲜为人知,从头训练它们依旧是一项重大挑战。在本文中,我们提出了一种利用Tsallis熵和哈密顿动力学来改变损失地形几何的新训练框架。通过将参数更新视为物理流,我们成功训练了一个模型维数$d = 8$的单头循环Transformer,解决了输入序列长度为1000个token的归纳头(induction head)任务。这一成功揭示了其卓越推理能力背后的内部机制。
摘要:Recent research suggests that looped Transformers have superior reasoning capabilities compared to standard deep architectures. Current approaches to training single-head looped architectures on benchmark tasks frequently fail or yield suboptimal performance due to a highly non-convex and irregular loss landscape. In these settings, optimization often stagnates in poor local minima and saddle points of the loss landscape, preventing the model from discovering the global minimum point. The internal mechanisms of these single-head looped transformer models remain poorly understood, and training them from scratch remains a significant challenge. In this paper, we propose a novel training framework that leverages Tsallis entropy and Hamiltonian dynamics to transform the geometry of the loss landscape. By treating the parameter updates as a physical flow, we successfully trained a single-head looped Transformer with model dimension $d = 8$ to solve induction head task with input sequence length of 1000 tokens. This success reveals the internal mechanism behind the superior reasoning capability.
【2】Searth Transformer: A Transformer Architecture Incorporating Earth's Geospheric Physical Priors for Global Mid-Range Weather Forecasting
标题:Searth Transformer:一种融合地球各圈层物理先验、面向全球中期天气预报的Transformer架构
链接:https://arxiv.org/abs/2601.09467
作者:Tianye Li,Qi Liu,Hao Li,Lei Chen,Wencong Cheng,Fei Zheng,Xiangao Xia,Ya Wang,Gang Huang,Weiwei Wang,Xuan Tong,Ziqing Zu,Yi Fang,Shenming Fu,Jiang Jiang,Haochen Li,Mingxing Li,Jiangjiang Xia
摘要:准确的全球中期天气预报是地球系统科学的基础。大多数现有的基于Transformer的预报模型采用以视觉为中心的架构,忽略了地球的球面几何和纬向周期性。此外,传统的自回归训练计算代价高昂,并因误差累积而限制预报时效。为应对这些挑战,我们提出平移地球Transformer(Searth Transformer),一种物理信息架构,将纬向周期性和经向边界纳入基于窗口的自注意力,以实现物理上一致的全球信息交换。我们进一步引入中继自回归(RAR)微调策略,使模型能在有限的内存与计算预算下学习长程大气演变。基于这些方法,我们开发了全球中期天气预报模型YanTian。YanTian的精度超过欧洲中期天气预报中心的高分辨率预报(HRES),并在1度分辨率下与最先进的AI模型相当,同时所需计算成本约为标准自回归微调的1/200。此外,YanTian在Z500上获得了比HRES(9天)更长的有效预报时效(10.3天)。在天气预报之外,这项工作为复杂的全球尺度地球物理环流系统的预测建模奠定了坚实的算法基础,为地球系统科学提供了新的途径。
摘要:Accurate global medium-range weather forecasting is fundamental to Earth system science. Most existing Transformer-based forecasting models adopt vision-centric architectures that neglect the Earth's spherical geometry and zonal periodicity. In addition, conventional autoregressive training is computationally expensive and limits forecast horizons due to error accumulation. To address these challenges, we propose the Shifted Earth Transformer (Searth Transformer), a physics-informed architecture that incorporates zonal periodicity and meridional boundaries into window-based self-attention for physically consistent global information exchange. We further introduce a Relay Autoregressive (RAR) fine-tuning strategy that enables learning long-range atmospheric evolution under constrained memory and computational budgets. Based on these methods, we develop YanTian, a global medium-range weather forecasting model. YanTian achieves higher accuracy than the high-resolution forecast of the European Centre for Medium-Range Weather Forecasts and performs competitively with state-of-the-art AI models at one-degree resolution, while requiring roughly 200 times lower computational cost than standard autoregressive fine-tuning. Furthermore, YanTian attains a longer skillful forecast lead time for Z500 (10.3 days) than HRES (9 days). Beyond weather forecasting, this work establishes a robust algorithmic foundation for predictive modeling of complex global-scale geophysical circulation systems, offering new pathways for Earth system science.
【3】Do Transformers Understand Ancient Roman Coin Motifs Better than CNNs?
标题:Transformer比CNN更了解古罗马硬币图案吗?
链接:https://arxiv.org/abs/2601.09433
作者:David Reid,Ognjen Arandjelovic
摘要:古代硬币的自动分析有望帮助研究人员从大规模硬币收藏中提取更多历史洞见,并帮助收藏家了解自己买卖的物品。该领域的近期研究表明,利用卷积神经网络(CNN)识别古代硬币上常见的语义元素颇具前景。本文首次将近期提出的Vision Transformer(ViT)深度学习架构应用于硬币语义元素识别任务,基于多模态数据(图像与非结构化文本)进行全自动学习。文章综述了该领域的既有研究,讨论了用于古钱币分析的ViT与CNN模型的训练和实现,并对它们的性能进行了评估。结果发现,ViT模型在准确率上优于新训练的CNN模型。
摘要:Automated analysis of ancient coins has the potential to help researchers extract more historical insights from large collections of coins and to help collectors understand what they are buying or selling. Recent research in this area has shown promise in focusing on identification of semantic elements as they are commonly depicted on ancient coins, by using convolutional neural networks (CNNs). This paper is the first to apply the recently proposed Vision Transformer (ViT) deep learning architecture to the task of identification of semantic elements on coins, using fully automatic learning from multi-modal data (images and unstructured text). This article summarises previous research in the area, discusses the training and implementation of ViT and CNN models for ancient coins analysis and provides an evaluation of their performance. The ViT models were found to outperform the newly trained CNN models in accuracy.
【4】Draw it like Euclid: Teaching transformer models to generate CAD profiles using ruler and compass construction steps
标题:像欧几里得一样作图:教Transformer模型使用尺规作图步骤生成CAD轮廓
链接:https://arxiv.org/abs/2601.09428
作者:Siyi Li,Joseph G. Lambourne,Longfei Zhang,Pradeep Kumar Jayaraman,Karl. D. D. Willis
摘要:我们介绍一种通过一系列简单几何作图(包括曲线偏移、旋转和求交)生成计算机辅助设计(CAD)轮廓的新方法。这些序列从设计师提供的几何开始,逐步构建出最终轮廓的点和曲线。我们证明,在设计师的输入几何与最终轮廓之间加入作图步骤能够提升生成质量,其作用类似于语言模型中引入思维链。与参数化CAD模型中的约束类似,作图序列将建模形状的自由度压缩为一小组可由设计师调整的参数值,从而允许对以浮点精度求值的构造几何进行参数化编辑。此外,我们还表明,对作图序列应用强化学习可在广泛的指标上带来进一步改进,其中包括一些未被显式优化的指标。
摘要:We introduce a new method of generating Computer Aided Design (CAD) profiles via a sequence of simple geometric constructions including curve offsetting, rotations and intersections. These sequences start with geometry provided by a designer and build up the points and curves of the final profile step by step. We demonstrate that adding construction steps between the designer's input geometry and the final profile improves generation quality in a similar way to the introduction of a chain of thought in language models. Similar to the constraints in a parametric CAD model, the construction sequences reduce the degrees of freedom in the modeled shape to a small set of parameter values which can be adjusted by the designer, allowing parametric editing with the constructed geometry evaluated to floating point precision. In addition we show that applying reinforcement learning to the construction sequences gives further improvements over a wide range of metrics, including some which were not explicitly optimized.
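摘要所述"旋转、求交"一类的作图步骤,可以用如下二维几何示意说明(假设性示意,并非原论文的序列表示;函数名为举例):每一步以已有几何为输入,产生新的点,逐步组合出轮廓。

```python
import math

def rotate(p, center, angle_rad):
    """作图步骤之一:将点 p 绕 center 旋转 angle_rad 弧度。"""
    dx, dy = p[0] - center[0], p[1] - center[1]
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    return (center[0] + c * dx - s * dy, center[1] + s * dx + c * dy)

def intersect_lines(p1, p2, p3, p4):
    """作图步骤之二:求直线 p1-p2 与直线 p3-p4 的交点(平行则返回 None)。"""
    d = (p2[0] - p1[0]) * (p4[1] - p3[1]) - (p2[1] - p1[1]) * (p4[0] - p3[0])
    if d == 0:
        return None  # 平行,无交点
    t = ((p3[0] - p1[0]) * (p4[1] - p3[1])
         - (p3[1] - p1[1]) * (p4[0] - p3[0])) / d
    return (p1[0] + t * (p2[0] - p1[0]), p1[1] + t * (p2[1] - p1[1]))

# 像作图序列那样逐步复合:先旋转得到新点,再由两条线求交得到轮廓顶点
a = rotate((1.0, 0.0), (0.0, 0.0), math.pi / 2)      # 约为 (0, 1)
x = intersect_lines((0, 0), (1, 1), (0, 1), (1, 0))  # 约为 (0.5, 0.5)
```

每个中间结果都以浮点精度求值,这正是摘要中"以浮点精度求值的构造几何"所指的性质。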
【5】Comparative Assessment of Concrete Compressive Strength Prediction at Industry Scale Using Embedding-based Neural Networks, Transformers, and Traditional Machine Learning Approaches
标题:使用基于嵌入的神经网络、变换器和传统机器学习方法对工业规模混凝土抗压强度预测进行比较评估
链接:https://arxiv.org/abs/2601.09096
作者:Md Asiful Islam,Md Ahmed Al Muzaddid,Afia Jahin Prema,Sreenath Reddy Vuske
摘要:混凝土是世界上使用最广泛的建筑材料;然而,由于材料的异质性,可变的混合比例以及对现场和环境条件的敏感性,抗压强度的可靠预测仍然具有挑战性。人工智能的最新进展使数据驱动的建模框架能够支持建筑质量控制中的自动决策。该研究利用由大约70,000个抗压强度测试记录组成的行业规模数据集来评估和比较多种预测方法,包括线性回归,决策树,随机森林,基于transformer的神经网络和基于嵌入的神经网络。该模型结合了关键的混合料设计和放置变量,如水灰比,胶凝材料含量,坍落度,空气含量,温度和放置条件。结果表明,基于嵌入的神经网络始终优于传统的机器学习和基于transformer的模型,实现了约2.5%的平均28天预测误差。这种准确性水平与常规实验室测试的可变性相当,证明了基于嵌入的学习框架在大规模施工操作中实现自动化、数据驱动的质量控制和决策支持的潜力。
摘要:Concrete is the most widely used construction material worldwide; however, reliable prediction of compressive strength remains challenging due to material heterogeneity, variable mix proportions, and sensitivity to field and environmental conditions. Recent advances in artificial intelligence enable data-driven modeling frameworks capable of supporting automated decision-making in construction quality control. This study leverages an industry-scale dataset consisting of approximately 70,000 compressive strength test records to evaluate and compare multiple predictive approaches, including linear regression, decision trees, random forests, transformer-based neural networks, and embedding-based neural networks. The models incorporate key mixture design and placement variables such as water cement ratio, cementitious material content, slump, air content, temperature, and placement conditions. Results indicate that the embedding-based neural network consistently outperforms traditional machine learning and transformer-based models, achieving a mean 28-day prediction error of approximately 2.5%. This level of accuracy is comparable to routine laboratory testing variability, demonstrating the potential of embedding-based learning frameworks to enable automated, data-driven quality control and decision support in large-scale construction operations.
【6】Layer-Parallel Training for Transformers
标题:Transformer分层并行训练
链接:https://arxiv.org/abs/2601.09026
作者:Shuai Jiang,Marc Salvado,Eric C. Cyr,Alena Kopaničáková,Rolf Krause,Jacob B. Schroder
备注:20 pages, 12 figures
摘要:我们提出了一种采用多层次、层并行方法的Transformer训练新方法。通过将Transformer表述为神经ODE,我们在训练的前向与反向传播阶段应用多层次时间并行(parallel-in-time)算法,从而在层维度上实现并行加速。随着网络深度增加,这极大增强了并行可扩展性,对规模日益庞大的基础模型尤为有用。然而,这样做会引入导致梯度系统性偏差的误差,进而在接近极小值时降低收敛性。我们开发了一种算法来检测这一关键转变,并据此切换到串行训练或系统性地提高层并行训练的精度。在BERT、GPT2、ViT和机器翻译架构上的结果表明,该方法在获得并行加速的同时,达到了与串行预训练相当的准确率,且微调不受影响。
摘要:We present a new training methodology for transformers using a multilevel, layer-parallel approach. Through a neural ODE formulation of transformers, our application of a multilevel parallel-in-time algorithm for the forward and backpropagation phases of training achieves parallel acceleration over the layer dimension. This dramatically enhances parallel scalability as the network depth increases, which is particularly useful for increasingly large foundational models. However, achieving this introduces errors that cause systematic bias in the gradients, which in turn reduces convergence when closer to the minima. We develop an algorithm to detect this critical transition and either switch to serial training or systematically increase the accuracy of layer-parallel training. Results, including BERT, GPT2, ViT, and machine translation architectures, demonstrate parallel-acceleration as well as accuracy commensurate with serial pre-training while fine-tuning is unaffected.
【7】Universal Dynamics of Warmup Stable Decay: understanding WSD beyond Transformers
标题:Warmup Stable Decay的普遍动力学:在Transformer之外理解WSD
链接:https://arxiv.org/abs/2601.09000
作者:Annalisa Belloni,Lorenzo Noci,Antonio Orvieto
备注:Accepted at the 2025 HiLD and MOSS Workshops at ICML
摘要:WSD(Warmup Stable Decay)学习率调度器近来变得流行,主要归功于其在训练大型语言模型时的良好性能与灵活性。与余弦衰减相比,WSD仅在一小部分训练步数内衰减学习率却取得出色表现,这种现象是否为基于Transformer的语言模型所特有、并可能为其训练动力学提供新的理论洞见,仍是一个悬而未决的问题。受以学习率调度器作为理解损失地形几何(如河谷结构、相连极小值、渐进锐化)的新视角的启发,本工作比较了Adam优化器在类Pythia语言模型上的WSD路径与在用于CIFAR10图像分类的小型CNN上的WSD路径。我们观察到,大多数训练信号、优化器路径特征和锐度动力学在这两类架构中定性相似。这种一致性指向新旧非凸问题损失地形的共同几何特征,并暗示了围绕高维优化问题几何的未来研究方向。
摘要:The Warmup Stable Decay (WSD) learning rate scheduler has recently become popular, largely due to its good performance and flexibility when training large language models. It remains an open question whether the remarkable performance of WSD - using a decaying learning rate for only a fraction of training compared to cosine decay - is a phenomenon specific to transformer-based language models that can potentially offer new theoretical insights into their training dynamics. Inspired by the usage of learning rate schedulers as a new lens into understanding landscape geometry (e.g., river valley, connected minima, progressive sharpening), in this work we compare the WSD path of the Adam optimizer on a Pythia-like language model to that of a small CNN trained to classify CIFAR10 images. We observe most training signals, optimizer path features, and sharpness dynamics to be qualitatively similar in such architectures. This consistency points to shared geometric characteristics of the loss landscapes of old and new nonconvex problems, and hints to future research questions around the geometry of high dimensional optimization problems.
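作为参考,WSD调度器本身只需几行代码即可示意(通用写法的示意,warmup与decay所占比例为假设的超参数,并非本文的实验设置):

```python
def wsd_lr(step, total_steps, peak_lr, warmup_frac=0.05, decay_frac=0.1):
    """Warmup Stable Decay:线性warmup -> 长时间恒定平台 ->
    仅在最后 decay_frac 比例的训练步内线性衰减。"""
    warmup_steps = int(total_steps * warmup_frac)
    decay_start = int(total_steps * (1.0 - decay_frac))
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps   # 线性warmup
    if step < decay_start:
        return peak_lr                               # 恒定平台
    frac = (step - decay_start) / (total_steps - decay_start)
    return peak_lr * (1.0 - frac)                    # 末段线性衰减

schedule = [wsd_lr(s, 1000, 3e-4) for s in range(1000)]
```

与余弦衰减不同,学习率在绝大部分训练过程中保持恒定,这正是摘要中"仅在一小部分训练步数内衰减"所指的形态。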
GAN|对抗|攻击|生成相关(3篇)
【1】Annealed Relaxation of Speculative Decoding for Faster Autoregressive Image Generation
标题:推测解码的退火松弛:实现更快的自回归图像生成
链接:https://arxiv.org/abs/2601.09212
作者:Xingyao Li,Fengzhuo Zhang,Cunxiao Du,Hui Ji
备注:Accepted to AAAI 2026
摘要:尽管自回归图像生成已取得重大进展,但由于AR模型的顺序生成特性和图像token的歧义性,即使使用推测解码,推理仍然很慢。近期工作尝试以松弛推测解码(relaxed speculative decoding)解决这一问题,但缺乏理论基础。本文建立了松弛推测解码的理论基础,并基于两个关键洞察提出了COOL-SD,一种推测解码的退火松弛方法。第一个洞察分析了目标模型与松弛推测解码之间的总变差(TV)距离,并给出了使该距离上界最小化的最优重采样分布。第二个洞察利用扰动分析揭示了松弛推测解码中的退火行为,启发了我们的退火设计。这些洞察使COOL-SD能够以相当的质量更快地生成图像,或在相近延迟下获得更好的质量。实验验证了COOL-SD的有效性,在速度-质量权衡上展现出相对以往方法的一致改进。
摘要:Despite significant progress in autoregressive image generation, inference remains slow due to the sequential nature of AR models and the ambiguity of image tokens, even when using speculative decoding. Recent works attempt to address this with relaxed speculative decoding but lack theoretical grounding. In this paper, we establish the theoretical basis of relaxed SD and propose COOL-SD, an annealed relaxation of speculative decoding built on two key insights. The first analyzes the total variation (TV) distance between the target model and relaxed speculative decoding and yields an optimal resampling distribution that minimizes an upper bound of the distance. The second uses perturbation analysis to reveal an annealing behaviour in relaxed speculative decoding, motivating our annealed design. Together, these insights enable COOL-SD to generate images faster with comparable quality, or achieve better quality at similar latency. Experiments validate the effectiveness of COOL-SD, showing consistent improvements over prior methods in speed-quality trade-offs.
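作为背景,标准(未松弛)推测解码的接受-重采样规则可示意如下(标准方案的示意,并非COOL-SD的退火松弛变体;分布与随机种子均为举例):以 min(1, p/q) 的概率接受草稿token,否则从残差分布 max(0, p - q) 归一化后重采样,最终输出恰好服从目标分布 p。

```python
import numpy as np

def speculative_accept(p, q, draft_token, rng):
    """标准推测解码:以 min(1, p/q) 概率接受草稿token,
    否则从归一化的 max(0, p - q) 残差分布中重采样。
    p、q 分别是目标模型与草稿模型在词表上的分布。"""
    if rng.random() < min(1.0, p[draft_token] / q[draft_token]):
        return draft_token
    residual = np.maximum(p - q, 0.0)
    residual /= residual.sum()
    return rng.choice(len(p), p=residual)

rng = np.random.default_rng(0)
p = np.array([0.7, 0.2, 0.1])   # 目标模型分布
q = np.array([0.4, 0.5, 0.1])   # 草稿模型分布
tokens = [speculative_accept(p, q, rng.choice(3, p=q), rng)
          for _ in range(10_000)]
# 尽管草稿来自 q,接受-重采样后输出的经验分布应接近 p
```

松弛变体放宽这一接受规则以换取速度,COOL-SD的贡献正是刻画并控制由此引入的TV距离偏差。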
【2】Breaking the Bottlenecks: Scalable Diffusion Models for 3D Molecular Generation
标题:突破瓶颈:用于3D分子生成的可扩展扩散模型
链接:https://arxiv.org/abs/2601.08963
作者:Adrita Das,Peiran Jiang,Dantong Zhu,Barnabas Poczos,Jose Lugo-Martinez
摘要:扩散模型已成为分子设计中一类强大的生成模型,能够捕获复杂的结构分布,并在3D分子生成中实现高保真度。然而,长采样轨迹、逆过程中的随机方差以及去噪动力学中有限的结构感知仍制约着其广泛应用。直接去噪扩散模型(DDDM)通过用确定性去噪步骤取代随机逆向MCMC更新来缓解这些低效问题,从而大幅缩短推理时间。然而,这类确定性更新的理论基础一直不够清晰。在本工作中,我们借助Huang et al. 2024的逆转移核(RTK)框架对DDDM给出了原则性的重新诠释,在统一的概率形式下整合了确定性与随机扩散。通过将DDDM逆过程表示为近似核算子,我们表明直接去噪过程隐式地优化了噪声样本与干净样本之间的结构化传输映射。这一视角阐明了确定性去噪为何能实现高效推理。除了理论上的清晰性,这一重构还解决了分子扩散中几个长期存在的瓶颈:RTK视角通过强制良态的逆核确保数值稳定性,通过消除随机方差提升样本一致性,并支持可扩展且保持对称性、满足SE(3)等变性的去噪器。实证上,我们证明RTK引导的确定性去噪比随机扩散模型收敛更快、结构保真度更高,同时在GEOM-DRUGS数据集上保持化学有效性。代码、模型和数据集已在我们的项目仓库公开。
摘要:Diffusion models have emerged as a powerful class of generative models for molecular design, capable of capturing complex structural distributions and achieving high fidelity in 3D molecule generation. However, their widespread use remains constrained by long sampling trajectories, stochastic variance in the reverse process, and limited structural awareness in denoising dynamics. The Directly Denoising Diffusion Model (DDDM) mitigates these inefficiencies by replacing stochastic reverse MCMC updates with deterministic denoising step, substantially reducing inference time. Yet, the theoretical underpinnings of such deterministic updates have remained opaque. In this work, we provide a principled reinterpretation of DDDM through the lens of the Reverse Transition Kernel (RTK) framework by Huang et al. 2024, unifying deterministic and stochastic diffusion under a shared probabilistic formalism. By expressing the DDDM reverse process as an approximate kernel operator, we show that the direct denoising process implicitly optimizes a structured transport map between noisy and clean samples. This perspective elucidates why deterministic denoising achieves efficient inference. Beyond theoretical clarity, this reframing resolves several long-standing bottlenecks in molecular diffusion. The RTK view ensures numerical stability by enforcing well-conditioned reverse kernels, improves sample consistency by eliminating stochastic variance, and enables scalable and symmetry-preserving denoisers that respect SE(3) equivariance. Empirically, we demonstrate that RTK-guided deterministic denoising achieves faster convergence and higher structural fidelity than stochastic diffusion models, while preserving chemical validity across GEOM-DRUGS dataset. Code, models, and datasets are publicly available in our project repository.
【3】From Adversarial Poetry to Adversarial Tales: An Interpretability Research Agenda
标题:从对抗性诗歌到对抗性故事:一个可解释性研究议程
链接:https://arxiv.org/abs/2601.08837
作者:Piercosma Bisconti,Marcello Galisai,Matteo Prandi,Federico Pierucci,Olga Sorokoletova,Francesco Giarrusso,Vincenzo Suriani,Marcantonio Brancale,Daniele Nardi
摘要:LLM中的安全机制仍然容易受到借助文化编码结构重新包装有害请求的攻击。我们提出对抗性故事(Adversarial Tales),一种将有害内容嵌入赛博朋克叙事、并提示模型执行受弗拉基米尔·普罗普民间故事形态学启发的功能分析的越狱技术。通过将任务表述为结构分解,该攻击诱导模型把有害流程重建为合法的叙事解读。在来自9家供应商的26个前沿模型上,我们观察到平均71.3%的攻击成功率,没有任何模型家族被证明是可靠稳健的。结合我们此前在对抗性诗歌上的工作,这些发现表明基于结构的越狱构成了一个广泛的漏洞类别,而非孤立的技术。能够承载有害意图的文化编码框架空间巨大,仅靠模式匹配防御很可能无法穷尽。因此,理解这些攻击为何成功至关重要:我们概述了一个机制可解释性研究议程,以研究叙事线索如何重塑模型表征,以及模型能否独立于表层形式学会识别有害意图。
摘要:Safety mechanisms in LLMs remain vulnerable to attacks that reframe harmful requests through culturally coded structures. We introduce Adversarial Tales, a jailbreak technique that embeds harmful content within cyberpunk narratives and prompts models to perform functional analysis inspired by Vladimir Propp's morphology of folktales. By casting the task as structural decomposition, the attack induces models to reconstruct harmful procedures as legitimate narrative interpretation. Across 26 frontier models from nine providers, we observe an average attack success rate of 71.3%, with no model family proving reliably robust. Together with our prior work on Adversarial Poetry, these findings suggest that structurally-grounded jailbreaks constitute a broad vulnerability class rather than isolated techniques. The space of culturally coded frames that can mediate harmful intent is vast, likely inexhaustible by pattern-matching defenses alone. Understanding why these attacks succeed is therefore essential: we outline a mechanistic interpretability research agenda to investigate how narrative cues reshape model representations and whether models can learn to recognize harmful intent independently of surface form.
半/弱/无/有监督|不确定性|主动学习(1篇)
【1】Linear Complexity Self-Supervised Learning for Music Understanding with Random Quantizer
标题:基于随机量化器的线性复杂度音乐理解自监督学习
链接:https://arxiv.org/abs/2601.09603
作者:Petros Vavaroutsos,Theodoros Palamas,Pantelis Vikatos
备注:accepted by ACM/SIGAPP Symposium on Applied Computing (SAC 2026)
摘要:近年来,基础模型凭借其出色的性能变得非常流行,尤其是在最早引入它们的自然语言处理(NLP)任务中。这些模型通常包含数亿甚至数十亿参数,使其在训练和生产系统中资源消耗巨大,导致成本上升。本文重点研究将基础模型应用于音乐信息检索(MIR)任务时的模型规模压缩。我们的研究将最早应用于语音识别的Branchformer架构与SummaryMixing相结合,并引入随机量化过程。为便于复现,我们在公开数据集上进行预训练,并辅以一个规模可与文献报道的其他私有数据集相当的专有数据集。我们通过一个由多种下游MIR任务组成的框架确保稳健评估。结果表明,与其他使用多头自注意力的最先进模型相比,我们的架构在将模型规模缩减8.5%至12.3%的同时,仍取得了具有竞争力的性能。
摘要:In recent years, foundation models have become very popular due to their exceptional performance, mainly in natural language (NLP) tasks where they were first introduced. These models usually consist of hundreds of millions, or even billions, of parameters, making them resource-intensive during training and in production systems, leading to increased costs. This paper focuses on the reduction of a foundation's model size when applied to music information retrieval (MIR) tasks. Our research combines the Branchformer architecture with SummaryMixing, which were first applied in speech recognition, along with a random quantization process. To facilitate reproducibility, we conduct pre-training on publicly available datasets, complemented by a proprietary dataset comparable in scale to other private datasets reported in the literature. We ensure robust evaluation by using a framework consisting of a variety of downstream MIR tasks. Our results show that our architecture achieves competitive performance when compared with other state-of-the-art models that use multi-head self-attention, while reducing the model size from 8.5% up to 12.3%.
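语音自监督学习中常见的随机量化做法(如BEST-RQ风格)可示意如下,本文的"随机量化过程"可能与之类似(假设性示意,并非原论文实现;维度与码本大小均为举例):冻结的随机投影加上冻结随机码本上的最近邻查找,为每帧产生离散的自监督预测目标,全程无可训练参数。

```python
import numpy as np

def make_random_quantizer(dim, codebook_size, seed=0):
    """BEST-RQ风格的随机量化器:冻结的随机投影 + 冻结随机码本
    上的最近邻查找,产生离散自监督目标;无任何可训练参数。"""
    rng = np.random.default_rng(seed)
    proj = rng.normal(size=(dim, dim))
    codebook = rng.normal(size=(codebook_size, dim))
    codebook /= np.linalg.norm(codebook, axis=1, keepdims=True)

    def quantize(frames):                        # frames: (T, dim)
        z = frames @ proj                        # 随机投影
        z /= np.linalg.norm(z, axis=1, keepdims=True)
        d = np.linalg.norm(z[:, None, :] - codebook[None, :, :], axis=-1)
        return d.argmin(axis=1)                  # (T,) 离散标签

    return quantize

quantize = make_random_quantizer(dim=16, codebook_size=32)
frames = np.random.default_rng(1).normal(size=(10, 16))
labels = quantize(frames)  # 每帧一个 [0, 32) 内的码本索引
```

由于投影与码本均被冻结,同一输入总是映射到同一标签,这使其可以作为稳定的预训练目标。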
迁移|Zero/Few/One-Shot|自适应(5篇)
【1】Class Adaptive Conformal Training
标题:类自适应共形训练
链接:https://arxiv.org/abs/2601.09522
作者:Badr-Eddine Marani,Julio Silva-Rodriguez,Ismail Ben Ayed,Maria Vakalopoulou,Stergios Christodoulidis,Jose Dolz
摘要:深度神经网络在各种任务中取得了显著的成功,但它们经常受到不可靠的概率估计的影响。因此,他们可能对自己的预测过于自信。共形预测(CP)为不确定性量化提供了一个原则性框架,产生具有严格覆盖保证的预测集。现有的共形训练方法优化了整体集合大小,但以类条件方式塑造预测集合并不简单,并且通常需要数据分布的先验知识。在这项工作中,我们引入了类自适应共形训练(CaCT),它将共形训练公式化为一个增强的拉格朗日优化问题,该问题自适应地学习以类条件地形成预测集,而不做任何分布假设。在多个基准数据集上的实验,包括标准和长尾图像识别以及文本分类,表明CaCT始终优于先前的共形训练方法,产生更小,信息量更大的预测集,同时保持所需的覆盖率保证。
摘要:Deep neural networks have achieved remarkable success across a variety of tasks, yet they often suffer from unreliable probability estimates. As a result, they can be overconfident in their predictions. Conformal Prediction (CP) offers a principled framework for uncertainty quantification, yielding prediction sets with rigorous coverage guarantees. Existing conformal training methods optimize for overall set size, but shaping the prediction sets in a class-conditional manner is not straightforward and typically requires prior knowledge of the data distribution. In this work, we introduce Class Adaptive Conformal Training (CaCT), which formulates conformal training as an augmented Lagrangian optimization problem that adaptively learns to shape prediction sets class-conditionally without making any distributional assumptions. Experiments on multiple benchmark datasets, including standard and long-tailed image recognition as well as text classification, demonstrate that CaCT consistently outperforms prior conformal training methods, producing significantly smaller and more informative prediction sets while maintaining the desired coverage guarantees.
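作为背景,最基础的分割共形预测(split conformal)如何构造带覆盖率保证的预测集,可示意如下(经典方法的示意,并非CaCT的类条件训练目标;数据为模拟):在校准集上用 1 - p_y 得分取分位数阈值,测试时将得分不超过阈值的所有类别放入预测集。

```python
import numpy as np

def conformal_sets(cal_probs, cal_labels, test_probs, alpha=0.1):
    """分割共形预测(1 - p_y 得分):在校准集上取分位数阈值 qhat,
    测试时将 1 - p_c <= qhat 的所有类别 c 放入预测集,
    从而获得约 (1 - alpha) 的边际覆盖率保证。"""
    n = len(cal_labels)
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    level = np.ceil((n + 1) * (1 - alpha)) / n
    qhat = np.quantile(scores, level, method="higher")
    return [np.where(1.0 - p <= qhat)[0] for p in test_probs]

rng = np.random.default_rng(0)
n, k = 1000, 5
logits = rng.normal(size=(n, k))
labels = rng.integers(0, k, size=n)
logits[np.arange(n), labels] += 2.0          # 模拟一个还不错的分类器
probs = np.exp(logits)
probs /= probs.sum(axis=1, keepdims=True)

# 前500个样本做校准,后500个做测试,并检查经验覆盖率
sets = conformal_sets(probs[:500], labels[:500], probs[500:], alpha=0.1)
coverage = np.mean([labels[500 + i] in s for i, s in enumerate(sets)])
```

共形训练(含CaCT)并不改变这一构造,而是在训练分类器本身时显式优化预测集的性质(如集合大小或类条件行为)。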
【2】GeoRA: Geometry-Aware Low-Rank Adaptation for RLVR
标题:GeoRA:面向RLVR的几何感知低秩自适应
链接:https://arxiv.org/abs/2601.09361
作者:Jiaying Zhang,Lei Shi,Jiguo Li,Jun Xu,Jiuchong Gao,Jinghua Hao,Renqing He
摘要:具有可验证奖励的强化学习(RLVR)对推进大规模推理模型至关重要。然而,现有的参数高效方法(如PiSSA和MiLoRA)是为监督微调(SFT)设计的,没有考虑RLVR独特的优化动力学和几何结构。直接应用这些方法会导致谱坍缩和优化不稳定,严重限制模型性能。与此同时,利用更新稀疏性的替代方法由于非结构化计算,在现代硬件上遭遇显著的效率瓶颈。为应对这些挑战,我们提出GeoRA(几何感知低秩自适应),它利用RL更新子空间的各向异性与可压缩性。GeoRA在几何约束子空间内通过奇异值分解(SVD)提取主方向来初始化适配器,同时冻结残差分量。该方法保留了预训练的几何结构,并借助稠密算子实现高效的GPU计算。在Qwen和Llama上的实验表明,GeoRA缓解了几何失配引起的优化瓶颈,在关键数学基准上持续优于已有的低秩基线,取得最先进(SOTA)的结果。此外,GeoRA在域外任务中表现出更优的泛化能力和对灾难性遗忘的更强抵抗力。
摘要:Reinforcement Learning with Verifiable Rewards (RLVR) is crucial for advancing large-scale reasoning models. However, existing parameter-efficient methods, such as PiSSA and MiLoRA, are designed for Supervised Fine-Tuning (SFT) and do not account for the distinct optimization dynamics and geometric structures of RLVR. Applying these methods directly leads to spectral collapse and optimization instability, which severely limit model performance. Meanwhile, alternative approaches that leverage update sparsity encounter significant efficiency bottlenecks on modern hardware due to unstructured computations. To address these challenges, we propose GeoRA (Geometry-Aware Low-Rank Adaptation), which exploits the anisotropic and compressible nature of RL update subspaces. GeoRA initializes adapters by extracting principal directions via Singular Value Decomposition (SVD) within a geometrically constrained subspace while freezing the residual components. This method preserves the pre-trained geometric structure and enables efficient GPU computation through dense operators. Experiments on Qwen and Llama demonstrate that GeoRA mitigates optimization bottlenecks caused by geometric misalignment. It consistently outperforms established low-rank baselines on key mathematical benchmarks, achieving state-of-the-art (SOTA) results. Moreover, GeoRA shows superior generalization and resilience to catastrophic forgetting in out-of-domain tasks.
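摘要中"SVD提取主方向初始化适配器、冻结残差"的思路,与PiSSA一类初始化相近,可示意如下(假设性示意,未包含GeoRA的几何约束子空间,并非原论文实现):

```python
import numpy as np

def svd_adapter_init(W, rank):
    """PiSSA/GeoRA一类思路的示意:对预训练权重做SVD,
    取前 rank 个主方向作为可训练低秩因子 A、B,
    其余部分冻结为残差 W_res;初始化时满足
    W = B @ A + W_res(精确到浮点误差)。"""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    B = U[:, :rank] * np.sqrt(S[:rank])           # (out, r),可训练
    A = np.sqrt(S[:rank])[:, None] * Vt[:rank]    # (r, in),可训练
    W_res = W - B @ A                             # 冻结残差
    return A, B, W_res

W = np.random.default_rng(0).normal(size=(8, 6))  # 假设的预训练权重
A, B, W_res = svd_adapter_init(W, rank=2)
# 初始化时前向等价于原权重:B @ A + W_res == W(浮点误差内)
```

微调时只更新 A、B,与普通LoRA的区别在于初始化携带了权重的主谱方向,而非随机或零初始化。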
【3】Imagine-then-Plan: Agent Learning from Adaptive Lookahead with World Models
标题:先想象然后计划:通过世界模型从自适应前瞻中进行代理学习
链接:https://arxiv.org/abs/2601.08955
作者:Youwei Liu,Jian Wang,Hanlin Wang,Beichen Guo,Wenjie Li
摘要:世界模型的最新进展展示了对环境状态未来动态建模的潜力,使智能体无需访问真实环境即可进行推理与行动。现有方法主要执行单步或固定步长的推演,其在复杂任务规划上的潜力尚未充分发掘。我们提出Imagine-then-Plan(ITP),一个通过前瞻想象进行智能体学习的统一框架:智能体的策略模型与学习到的世界模型交互,产生多步"想象"轨迹。由于想象视野可能随任务和阶段而变化,我们通过权衡最终目标与任务进度,引入了一种新颖的自适应前瞻机制。由此产生的想象轨迹提供了关于未来后果(如已取得的进展和潜在冲突)的丰富信号,这些信号与当前观测融合,构成一个部分可观测、可想象的马尔可夫决策过程来指导策略学习。我们以免训练和强化训练两种变体实例化ITP。在代表性智能体基准上的大量实验表明,ITP显著优于有竞争力的基线。进一步分析验证了自适应前瞻在很大程度上增强了智能体的推理能力,为解决更广泛的复杂任务提供了有价值的见解。
摘要:Recent advances in world models have shown promise for modeling future dynamics of environmental states, enabling agents to reason and act without accessing real environments. Current methods mainly perform single-step or fixed-horizon rollouts, leaving their potential for complex task planning under-exploited. We propose Imagine-then-Plan (\texttt{ITP}), a unified framework for agent learning via lookahead imagination, where an agent's policy model interacts with the learned world model, yielding multi-step ``imagined'' trajectories. Since the imagination horizon may vary by tasks and stages, we introduce a novel adaptive lookahead mechanism by trading off the ultimate goal and task progress. The resulting imagined trajectories provide rich signals about future consequences, such as achieved progress and potential conflicts, which are fused with current observations, formulating a partially \textit{observable} and \textit{imaginable} Markov decision process to guide policy learning. We instantiate \texttt{ITP} with both training-free and reinforcement-trained variants. Extensive experiments across representative agent benchmarks demonstrate that \texttt{ITP} significantly outperforms competitive baselines. Further analyses validate that our adaptive lookahead largely enhances agents' reasoning capability, providing valuable insights into addressing broader, complex tasks.
【4】Adaptive few-shot learning for robust part quality classification in two-photon lithography
标题:用于双光子光刻中稳健零件质量分类的自适应小样本学习
链接:https://arxiv.org/abs/2601.08885
作者:Sixian Jia,Ruo-Syuan Mei,Chenhui Shao
摘要:None
摘要:Two-photon lithography (TPL) is an advanced additive manufacturing (AM) technique for fabricating high-precision micro-structures. While computer vision (CV) has proven effective for automated quality control, existing models are often static, rendering them ineffective in dynamic manufacturing environments. These models typically cannot detect new, unseen defect classes, be efficiently updated from scarce data, or adapt to new part geometries. To address this gap, this paper presents an adaptive CV framework for the entire life-cycle of quality model maintenance. The proposed framework is built upon a shared, scale-robust backbone model and integrates three key methodologies: (1) a statistical hypothesis testing framework based on Linear Discriminant Analysis (LDA) for novelty detection, (2) a two-stage, rehearsal-based strategy for few-shot incremental learning, and (3) a few-shot Domain-Adversarial Neural Network (DANN) for few-shot domain adaptation. The framework was evaluated on a TPL dataset featuring hemisphere structures as the source domain and cube structures as the target domain, with each domain categorized into good, minor damaged, and damaged quality classes. The hypothesis testing method successfully identified new class batches with 99-100% accuracy. The incremental learning method integrated a new class to 92% accuracy using only K=20 samples. The domain adaptation model bridged the severe domain gap, achieving 96.19% accuracy on the target domain using only K=5 shots. These results demonstrate a robust and data-efficient solution for deploying and maintaining CV models in evolving production scenarios.
【5】Tail-Sensitive KL and Rényi Convergence of Unadjusted Hamiltonian Monte Carlo via One-Shot Couplings
标题:通过单次耦合的未调整Hamilton Monte Carlo的尾部敏感KL和Rényi收敛
链接:https://arxiv.org/abs/2601.09019
作者:Nawaf Bou-Rabee,Siddharth Mitra,Andre Wibisono
备注:64 pages
摘要:Hamilton Monte Carlo(HMC)算法是高维环境中最广泛使用的采样方法之一,然而在量化相对密度失配的散度(例如Kullback-Leibler(KL)和Rényi散度)下,其收敛性质知之甚少。这些散度自然地支配着Metropolis调整马尔可夫链的接受概率和热启动要求。在这项工作中,我们开发了一个框架,将未调整哈密顿蒙特卡罗(uHMC)的Wasserstein收敛保证升级为对尾部敏感的KL和Rényi散度下的保证。我们的方法基于单次耦合,我们用它来建立uHMC转移核的一个正则化性质。这种正则化允许将Wasserstein-2混合时间和渐近偏差界限提升到KL散度,并将类似的Orlicz-Wasserstein界限提升到Rényi散度,类似于Bou-Rabee和Eberle(2023)通过核平滑将Wasserstein-1界限升级为总变差距离的早期工作。因此,我们的结果提供了对相对密度失配的定量控制,澄清了离散化偏差在强散度中的作用,并产生了与未调整采样以及为Metropolis调整马尔可夫链生成热启动相关的原则性保证。
摘要:Hamiltonian Monte Carlo (HMC) algorithms are among the most widely used sampling methods in high dimensional settings, yet their convergence properties are poorly understood in divergences that quantify relative density mismatch, such as Kullback-Leibler (KL) and Rényi divergences. These divergences naturally govern acceptance probabilities and warm-start requirements for Metropolis-adjusted Markov chains. In this work, we develop a framework for upgrading Wasserstein convergence guarantees for unadjusted Hamiltonian Monte Carlo (uHMC) to guarantees in tail-sensitive KL and Rényi divergences. Our approach is based on one-shot couplings, which we use to establish a regularization property of the uHMC transition kernel. This regularization allows Wasserstein-2 mixing-time and asymptotic bias bounds to be lifted to KL divergence, and analogous Orlicz-Wasserstein bounds to be lifted to Rényi divergence, paralleling earlier work of Bou-Rabee and Eberle (2023) that upgraded Wasserstein-1 bounds to total variation distance via kernel smoothing. As a consequence, our results provide quantitative control of relative density mismatch, clarify the role of discretization bias in strong divergences, and yield principled guarantees relevant both for unadjusted sampling and for generating warm starts for Metropolis-adjusted Markov chains.
强化学习(1篇)
【1】SRT: Accelerating Reinforcement Learning via Speculative Rollout with Tree-Structured Cache
标题:SRT:通过具有树结构缓存的推测性推出加速强化学习
链接:https://arxiv.org/abs/2601.09083
作者:Chi-Chih Chang,Siqi Zhu,Zhichen Zeng,Haibin Lin,Jiaxuan You,Mohamed S. Abdelfattah,Ziheng Jiang,Xuehai Qian
摘要:我们提出了带有树结构缓存的推测性推出(SRT),这是一种简单的无模型方法,可以加速语言模型的在线策略(on-policy)强化学习(RL),而不会牺牲分布正确性。SRT通过将先前生成的延续存储在每个提示的树结构缓存中,利用相同提示在不同训练步骤间推出结果的经验相似性。在生成期间,当前策略使用此树作为执行推测解码的草稿模型。为了保持缓存的新鲜度并提高草稿模型的质量,SRT会从正在进行的推出中在线更新树,并在空闲GPU气泡期间主动执行预运行生成。集成到标准RL管道(例如PPO、GRPO和DAPO)和多轮设置中后,SRT持续减少生成和步骤延迟,降低每个令牌的推理成本,在推出期间实现高达2.08倍的挂钟时间加速。
摘要:We present Speculative Rollout with Tree-Structured Cache (SRT), a simple, model-free approach to accelerate on-policy reinforcement learning (RL) for language models without sacrificing distributional correctness. SRT exploits the empirical similarity of rollouts for the same prompt across training steps by storing previously generated continuations in a per-prompt tree-structured cache. During generation, the current policy uses this tree as the draft model for performing speculative decoding. To keep the cache fresh and improve draft model quality, SRT updates trees online from ongoing rollouts and proactively performs run-ahead generation during idle GPU bubbles. Integrated into standard RL pipelines (\textit{e.g.}, PPO, GRPO and DAPO) and multi-turn settings, SRT consistently reduces generation and step latency and lowers per-token inference cost, achieving up to 2.08x wall-clock time speedup during rollout.
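As an illustration of the tree-structured cache idea, here is a minimal, hypothetical sketch (not the paper's implementation; all names and token values are invented): past rollouts for a prompt are inserted token-by-token into a trie, and a speculative draft is proposed by greedily following the most frequently observed branch, which the current policy would then verify via standard speculative decoding.

```python
from collections import defaultdict

class RolloutTrie:
    """Per-prompt tree cache of previously generated token continuations
    (hypothetical sketch of SRT's cache; all names are invented)."""

    def __init__(self):
        self.children = {}              # token -> child RolloutTrie
        self.counts = defaultdict(int)  # token -> how often it followed this prefix

    def insert(self, tokens):
        """Store one completed rollout, token by token."""
        node = self
        for t in tokens:
            node.counts[t] += 1
            node = node.children.setdefault(t, RolloutTrie())

    def draft(self, max_len):
        """Propose up to max_len speculative draft tokens by greedily
        following the most frequently observed branch; the current policy
        would then accept or reject them via speculative decoding."""
        node, out = self, []
        while node.counts and len(out) < max_len:
            t = max(node.counts, key=node.counts.get)
            out.append(t)
            node = node.children[t]
        return out

# Two earlier rollouts for the same prompt share the prefix [5, 7]:
cache = RolloutTrie()
cache.insert([5, 7, 9, 2])
cache.insert([5, 7, 3])
print(cache.draft(max_len=3))  # draft starts with the shared prefix [5, 7]
```

Because drafts are only proposals that the current policy verifies, distributional correctness is preserved even when the cached continuations came from stale policies.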
元学习(1篇)
【1】Meta-learning to Address Data Shift in Time Series Classification
标题:元学习解决时间序列分类中的数据偏移
链接:https://arxiv.org/abs/2601.09018
作者:Samuel Myren,Nidhi Parikh,Natalie Klein
摘要:在工程和科学领域,当训练和测试数据共享相同分布时,传统深度学习(TDL)模型表现良好。然而,现实世界数据的动态性质(广义上称为数据偏移)使TDL模型容易出现快速的性能下降,需要昂贵的重新标注和低效的再训练。元学习使模型能够仅用少量示例快速适应新数据,为缓解这些挑战提供了一个有希望的替代方案。在这里,我们系统地将带微调的TDL与基于优化的元学习算法进行比较,以评估它们解决时间序列分类中数据偏移的能力。我们引入了一个受控的、面向任务的地震基准(SeisTask),并表明在数据稀缺情形和较小的模型架构中,元学习通常能实现更快、更稳定的适应,并减少过拟合。随着数据可用性和模型容量的增加,其优势逐渐减弱,带微调的TDL表现相当。最后,我们研究了任务多样性如何影响元学习,并发现驱动性能提升的是训练和测试分布之间的一致性,而不仅仅是多样性。总的来说,这项工作系统地评估了元学习在数据偏移下何时以及为何优于TDL,并贡献了SeisTask作为推进时间序列领域自适应学习研究的基准。
摘要:Across engineering and scientific domains, traditional deep learning (TDL) models perform well when training and test data share the same distribution. However, the dynamic nature of real-world data, broadly termed \textit{data shift}, renders TDL models prone to rapid performance degradation, requiring costly relabeling and inefficient retraining. Meta-learning, which enables models to adapt quickly to new data with few examples, offers a promising alternative for mitigating these challenges. Here, we systematically compare TDL with fine-tuning and optimization-based meta-learning algorithms to assess their ability to address data shift in time-series classification. We introduce a controlled, task-oriented seismic benchmark (SeisTask) and show that meta-learning typically achieves faster and more stable adaptation with reduced overfitting in data-scarce regimes and smaller model architectures. As data availability and model capacity increase, its advantages diminish, with TDL with fine-tuning performing comparably. Finally, we examine how task diversity influences meta-learning and find that alignment between training and test distributions, rather than diversity alone, drives performance gains. Overall, this work provides a systematic evaluation of when and why meta-learning outperforms TDL under data shift and contributes SeisTask as a benchmark for advancing adaptive learning research in time-series domains.
医学相关(2篇)
【1】Contrastive Geometric Learning Unlocks Unified Structure- and Ligand-Based Drug Design
标题:对比几何学习解锁统一的基于结构和配体的药物设计
链接:https://arxiv.org/abs/2601.09693
作者:Lisa Schneckenreiter,Sohvi Luukkonen,Lukas Friedrich,Daniel Kuhn,Günter Klambauer
备注:ELLIS ML4Molecules Workshop 2025, ELLIS Unconference, Copenhagen 2025
摘要:基于结构和基于配体的计算药物设计传统上依赖于不相交的数据源和建模假设,限制了它们的大规模联合使用。在这项工作中,我们介绍了用于统一计算药物设计的对比几何学习(ConGLUDe),这是一个统一基于结构和基于配体训练的单一对比几何模型。ConGLUDe将产生全蛋白表示和预测结合位点隐式嵌入的几何蛋白编码器与快速配体编码器耦合,从而消除了对预定义口袋的需要。通过对比学习将配体与全局蛋白质表示和多个候选结合位点对齐,ConGLUDe除了支持虚拟筛选和靶标垂钓(target fishing)外,还支持配体条件化的口袋预测,同时在蛋白质-配体复合物和大规模生物活性数据上进行联合训练。在不同的基准测试中,ConGLUDe在不提供结合口袋信息作为输入的设置中实现了最先进的零样本(zero-shot)虚拟筛选性能,在具有挑战性的靶标垂钓任务上大大优于现有方法,并展示了有竞争力的配体条件化口袋选择。这些结果突出了统一结构-配体训练的优势,并将ConGLUDe定位为迈向药物发现通用基础模型的一步。
摘要:Structure-based and ligand-based computational drug design have traditionally relied on disjoint data sources and modeling assumptions, limiting their joint use at scale. In this work, we introduce Contrastive Geometric Learning for Unified Computational Drug Design (ConGLUDe), a single contrastive geometric model that unifies structure- and ligand-based training. ConGLUDe couples a geometric protein encoder that produces whole-protein representations and implicit embeddings of predicted binding sites with a fast ligand encoder, removing the need for pre-defined pockets. By aligning ligands with both global protein representations and multiple candidate binding sites through contrastive learning, ConGLUDe supports ligand-conditioned pocket prediction in addition to virtual screening and target fishing, while being trained jointly on protein-ligand complexes and large-scale bioactivity data. Across diverse benchmarks, ConGLUDe achieves state-of-the-art zero-shot virtual screening performance in settings where no binding pocket information is provided as input, substantially outperforms existing methods on a challenging target fishing task, and demonstrates competitive ligand-conditioned pocket selection. These results highlight the advantages of unified structure-ligand training and position ConGLUDe as a step toward general-purpose foundation models for drug discovery.
【2】Enhancing Imbalanced Electrocardiogram Classification: A Novel Approach Integrating Data Augmentation through Wavelet Transform and Interclass Fusion
标题:增强不平衡心电图分类:一种通过小波变换和类间融合集成数据增强的新方法
链接:https://arxiv.org/abs/2601.09103
作者:Haijian Shao,Wei Liu,Xing Deng,Daze Lu
备注:18 pages, 9 figures, 3 tables, 1 algorithm
摘要:不平衡的心电图(ECG)数据阻碍了算法在自动处理和解释心血管诊断信息方面的有效性和弹性,这反过来又阻碍了基于深度学习的ECG分类。值得注意的是,某些不常遇到的心脏疾病在这些数据集中不成比例地代表不足。虽然特定ECG信号类型的算法生成和过采样可以减轻类别偏斜,但关于此类技术在ECG分类中的有效性缺乏共识。此外,ECG采集的方法和场景引入噪声,进一步使ECG数据的处理复杂化。本文提出了一种显著增强的ECG分类器,可同时解决ECG分析中的类别不平衡和噪声相关挑战,如CPSC 2018数据集所示。具体来说,提出了基于小波变换的特征融合应用,重点研究了基于小波变换的类间融合,生成训练特征库和测试集特征库。随后,原始训练和测试数据与各自的特征数据库合并,从而产生更平衡的训练和测试数据集。采用这种方法,我们的ECG模型对正常、AF、I-AVB、LBBB、RBBB、PAC、PVC、STD和STE的识别准确率分别高达99%、98%、97%、98%、96%、92%和93%。此外,这些类别的平均识别准确率在92%到98%之间。值得注意的是,我们提出的数据融合方法在CPSC 2018数据集中的ECG分类准确性方面超过了任何已知算法。
摘要:Imbalanced electrocardiogram (ECG) data hampers the efficacy and resilience of algorithms in the automated processing and interpretation of cardiovascular diagnostic information, which in turn impedes deep learning-based ECG classification. Notably, certain cardiac conditions that are infrequently encountered are disproportionately underrepresented in these datasets. Although algorithmic generation and oversampling of specific ECG signal types can mitigate class skew, there is a lack of consensus regarding the effectiveness of such techniques in ECG classification. Furthermore, the methodologies and scenarios of ECG acquisition introduce noise, further complicating the processing of ECG data. This paper presents a significantly enhanced ECG classifier that simultaneously addresses both class imbalance and noise-related challenges in ECG analysis, as observed in the CPSC 2018 dataset. Specifically, we propose the application of feature fusion based on the wavelet transform, with a focus on wavelet transform-based interclass fusion, to generate the training feature library and the test set feature library. Subsequently, the original training and test data are amalgamated with their respective feature databases, resulting in more balanced training and test datasets. Employing this approach, our ECG model achieves recognition accuracies of up to 99%, 98%, 97%, 98%, 96%, 92%, and 93% for Normal, AF, I-AVB, LBBB, RBBB, PAC, PVC, STD, and STE, respectively. Furthermore, the average recognition accuracy for these categories ranges between 92% and 98%. Notably, our proposed data fusion methodology surpasses any known algorithms in terms of ECG classification accuracy in the CPSC 2018 dataset.
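To make the wavelet-based feature-fusion idea concrete, here is a minimal sketch under stated assumptions: a one-level Haar transform and simple concatenation stand in for the paper's feature libraries and interclass fusion, and the signal values are invented, not taken from CPSC 2018.

```python
import math

def haar_dwt(signal):
    """One-level Haar wavelet transform: pairwise averages (approximation)
    and pairwise differences (detail), both scaled by 1/sqrt(2)."""
    s = math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) / s for i in range(0, len(signal) - 1, 2)]
    detail = [(signal[i] - signal[i + 1]) / s for i in range(0, len(signal) - 1, 2)]
    return approx, detail

def fuse_features(signal):
    """Concatenate the raw signal with its wavelet coefficients, a toy
    stand-in for amalgamating data with a wavelet-derived feature library."""
    approx, detail = haar_dwt(signal)
    return signal + approx + detail

beat = [0.0, 0.5, 1.0, 0.5, 0.0, -0.5, -1.0, -0.5]  # invented ECG-like samples
features = fuse_features(beat)
print(len(features))  # 8 raw samples + 4 approx + 4 detail = 16
```

Since the Haar basis is orthonormal, the coefficients preserve the signal's energy while separating coarse morphology (approximation) from sharp transitions (detail), which is what makes such features useful alongside the raw waveform.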
蒸馏|知识提取(2篇)
【1】Multi-Teacher Ensemble Distillation: A Mathematical Framework for Probability-Domain Knowledge Aggregation
标题:多教师集成蒸馏:概率域知识聚合的数学框架
链接:https://arxiv.org/abs/2601.09165
作者:Aaron R. Flouro,Shawn P. Chadwick
备注:7 pages, 1 table
摘要:基于Sparse-KD的概率域蒸馏框架,我们开发了一个公理化的、算子理论的多教师集成知识蒸馏框架。我们没有规定特定的聚合公式,而是定义了有效知识聚合算子应满足的五个核心公理,包括凸性、正性、连续性、权重单调性和温度一致性。我们证明了满足这些公理的算子族的存在性和非唯一性,确立了多个不同的聚合机制符合相同的基本原则。在这个框架内,我们建立了与算子无关的保证,表明在异构教师下,多教师聚合同时减少了随机方差和系统性监督偏差,并提供了Jensen型界限、对数损失保证和安全衰减性质。对于在教师权重上呈线性的聚合算子,我们进一步建立了标准独立性假设下的经典集成方差缩减结果,并扩展到相关误差情形。该框架为来自不同前沿模型的多教师蒸馏提供了理论基础,同时容许多种有效的实施策略。
摘要:Building on the probability-domain distillation framework of Sparse-KD, we develop an axiomatic, operator-theoretic framework for multi-teacher ensemble knowledge distillation. Rather than prescribing a specific aggregation formula, we define five core axioms governing valid knowledge aggregation operators, encompassing convexity, positivity, continuity, weight monotonicity, and temperature coherence. We prove the existence and non-uniqueness of operator families satisfying these axioms, establishing that multiple distinct aggregation mechanisms conform to the same foundational principles. Within this framework, we establish operator-agnostic guarantees showing that multi-teacher aggregation reduces both stochastic variance and systematic supervisory bias under heterogeneous teachers, while providing Jensen-type bounds, log-loss guarantees, and safety attenuation properties. For aggregation operators linear in teacher weights, we further establish classical ensemble variance-reduction results under standard independence assumptions, with extensions to correlated-error regimes. The framework provides theoretical grounding for multi-teacher distillation from diverse frontier models while admitting multiple valid implementation strategies.
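A tiny sketch of one operator in the admissible family: a temperature-softened, weighted arithmetic mean of teacher distributions. This is our illustrative choice satisfying convexity and positivity, not a formula prescribed by the paper; the logits and weights are invented.

```python
import math

def soften(logits, T):
    """Temperature-softened softmax of a single teacher's logits."""
    m = max(x / T for x in logits)
    exps = [math.exp(x / T - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def aggregate(teacher_logits, weights, T=2.0):
    """Weighted arithmetic mean of teacher probability vectors: one simple
    member of the operator family (convex, positive, continuous in weights)."""
    dists = [soften(l, T) for l in teacher_logits]
    return [sum(w * d[k] for w, d in zip(weights, dists))
            for k in range(len(dists[0]))]

# Two invented teachers over a 3-class vocabulary:
p = aggregate([[2.0, 1.0, 0.1], [1.5, 1.4, 0.2]], weights=[0.6, 0.4])
print(sum(p))  # a valid probability vector: sums to 1
```

Because the weights form a convex combination and each softened distribution is strictly positive, the output is itself a valid, strictly positive distribution, which is exactly the closure property the convexity and positivity axioms demand.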
【2】Distribution-Aligned Sequence Distillation for Superior Long-CoT Reasoning
标题:用于卓越长CoT推理的分布对齐序列蒸馏
链接:https://arxiv.org/abs/2601.09088
作者:Shaotian Yan,Kaiyuan Liu,Chen Shen,Bing Wang,Sinan Fan,Jun Zhang,Yue Wu,Zheng Wang,Jieping Ye
备注:Project Page: https://github.com/D2I-ai/dasd-thinking
摘要:在本报告中,我们介绍了DASD-4B-Thinking,这是一个轻量级但功能强大的完全开源推理模型。它在数学、科学推理和代码生成等具有挑战性的基准测试中,在可比规模的开源模型中实现了SOTA性能,甚至优于几个更大的模型。首先,我们批判性地重新审视社区中一个被广泛采用的蒸馏范式:对教师生成的响应进行SFT,也被称为序列级蒸馏。虽然遵循这一方案的一系列近期工作已经展示了显著的效率和强大的实证性能,但它们主要基于SFT的视角。因此,这些方法主要集中在为SFT数据过滤设计启发式规则,而在很大程度上忽略了蒸馏本身的核心原则:使学生模型能够学习教师的完整输出分布,从而继承其泛化能力。具体来说,我们确定了当前实践中的三个关键局限:一)对教师序列级分布的表示不充分;二)教师的输出分布与学生的学习能力之间的错位;三)教师强制训练与自回归推理之间产生的暴露偏差。总之,这些缺点反映了在整个蒸馏过程中系统性地缺乏显式的师生互动,使蒸馏的本质未得到充分利用。为了解决这些问题,我们提出了几项方法创新,共同形成了一个增强的序列级蒸馏训练管道。值得注意的是,DASD-4B-Thinking仅使用448K个训练样本就获得了有竞争力的结果,这比大多数现有开源工作所使用的样本少一个数量级。为了支持社区研究,我们公开发布了我们的模型和训练数据集。
摘要:In this report, we introduce DASD-4B-Thinking, a lightweight yet highly capable, fully open-source reasoning model. It achieves SOTA performance among open-source models of comparable scale across challenging benchmarks in mathematics, scientific reasoning, and code generation -- even outperforming several larger models. We begin by critically reexamining a widely adopted distillation paradigm in the community: SFT on teacher-generated responses, also known as sequence-level distillation. Although a series of recent works following this scheme have demonstrated remarkable efficiency and strong empirical performance, they are primarily grounded in the SFT perspective. Consequently, these approaches focus predominantly on designing heuristic rules for SFT data filtering, while largely overlooking the core principle of distillation itself -- enabling the student model to learn the teacher's full output distribution so as to inherit its generalization capability. Specifically, we identify three critical limitations in current practice: i) Inadequate representation of the teacher's sequence-level distribution; ii) Misalignment between the teacher's output distribution and the student's learning capacity; and iii) Exposure bias arising from teacher-forced training versus autoregressive inference. In summary, these shortcomings reflect a systemic absence of explicit teacher-student interaction throughout the distillation process, leaving the essence of distillation underexploited. To address these issues, we propose several methodological innovations that collectively form an enhanced sequence-level distillation training pipeline. Remarkably, DASD-4B-Thinking obtains competitive results using only 448K training samples -- an order of magnitude fewer than those employed by most existing open-source efforts. To support community research, we publicly release our models and the training dataset.
聚类(3篇)
【1】Cluster Workload Allocation: Semantic Soft Affinity Using Natural Language Processing
标题:集群工作负载分配:使用自然语言处理的语义软亲和力
链接:https://arxiv.org/abs/2601.09282
作者:Leszek Sliwko,Jolanta Mizeria-Pietraszko
摘要:集群工作负载分配通常需要复杂的配置,从而造成可用性差距。本文介绍了一种使用自然语言处理的、语义化且意图驱动的集群系统调度范式。该系统采用通过Kubernetes调度器扩展器集成的大型语言模型(LLM),来解释用于软亲和偏好的自然语言分配提示注释。我们开发了一个具有集群状态缓存和意图分析器(使用AWS Bedrock)的原型。经验评估表明,对于Amazon Nova Pro/Premier和Mistral Pixtral Large等顶级模型,LLM解析准确率很高(在评估真值数据集上的子集准确率>95%),显著优于基线引擎。在六个场景中的调度质量测试表明,与标准Kubernetes配置相比,原型实现了更好或同等的放置,特别是在复杂和量化场景以及处理相互冲突的软偏好方面表现出色。结果验证了使用LLM实现易用调度的可行性,但也突出了同步LLM延迟等限制,表明生产就绪需要异步处理。这项工作证实了语义软亲和简化工作负载编排的可行性。
摘要:Cluster workload allocation often requires complex configurations, creating a usability gap. This paper introduces a semantic, intent-driven scheduling paradigm for cluster systems using Natural Language Processing. The system employs a Large Language Model (LLM) integrated via a Kubernetes scheduler extender to interpret natural language allocation hint annotations for soft affinity preferences. A prototype featuring a cluster state cache and an intent analyzer (using AWS Bedrock) was developed. Empirical evaluation demonstrated high LLM parsing accuracy (>95% Subset Accuracy on an evaluation ground-truth dataset) for top-tier models like Amazon Nova Pro/Premier and Mistral Pixtral Large, significantly outperforming a baseline engine. Scheduling quality tests across six scenarios showed the prototype achieved superior or equivalent placement compared to standard Kubernetes configurations, particularly excelling in complex and quantitative scenarios and handling conflicting soft preferences. The results validate using LLMs for accessible scheduling but highlight limitations like synchronous LLM latency, suggesting asynchronous processing for production readiness. This work confirms the viability of semantic soft affinity for simplifying workload orchestration.
【2】Efficient Clustering in Stochastic Bandits
标题:随机多臂老虎机中的高效聚类
链接:https://arxiv.org/abs/2601.09162
作者:G Dhinesh Chandran,Kota Srinivas Reddy,Srikrishna Bhashyam
摘要:我们研究了固定置信度设置下的Bandit聚类(BC)问题,其目标是在每个时间步从自适应选择的臂中顺序采样,将一组数据序列(臂)分组为簇,同时确保在停止时刻的错误概率不超过给定值。我们考虑同一簇中的臂可能具有不同分布的设置。与该设置下假设臂服从高斯分布的现有结果不同,我们研究了满足温和正则性条件的更广泛的一类向量参数分布。现有的渐近最优BC算法需要在每一步求解一个优化问题作为其采样规则的一部分,计算成本高。我们提出了一种高效的Bandit聚类算法(EBC),它不求解完整的优化问题,而是在每个时间步朝最优值迈出一步,使其在保持渐近最优的同时具有计算效率。我们还提出了EBC的一个启发式变体,称为EBC-H,它进一步简化了采样规则,基于作为停止规则一部分计算的量进行臂选择。我们通过比较每个样本的运行时间,突出了EBC和EBC-H相对于现有算法的计算效率。EBC的渐近最优性通过合成数据集上的模拟得到支持。通过对合成数据集和真实数据集的模拟,我们展示了EBC和EBC-H相对于现有方法的性能增益。
摘要:We study the Bandit Clustering (BC) problem under the fixed confidence setting, where the objective is to group a collection of data sequences (arms) into clusters through sequential sampling from adaptively selected arms at each time step while ensuring a fixed error probability at the stopping time. We consider a setting where arms in a cluster may have different distributions. Unlike existing results in this setting, which assume Gaussian-distributed arms, we study a broader class of vector-parametric distributions that satisfy mild regularity conditions. Existing asymptotically optimal BC algorithms require solving an optimization problem as part of their sampling rule at each step, which is computationally costly. We propose an Efficient Bandit Clustering algorithm (EBC), which, instead of solving the full optimization problem, takes a single step toward the optimal value at each time step, making it computationally efficient while remaining asymptotically optimal. We also propose a heuristic variant of EBC, called EBC-H, which further simplifies the sampling rule, with arm selection based on quantities computed as part of the stopping rule. We highlight the computational efficiency of EBC and EBC-H by comparing their per-sample run time with that of existing algorithms. The asymptotic optimality of EBC is supported through simulations on the synthetic datasets. Through simulations on both synthetic and real-world datasets, we show the performance gain of EBC and EBC-H over existing approaches.
【3】Deep Incomplete Multi-View Clustering via Hierarchical Imputation and Alignment
标题:通过分层插补和对齐的深度不完整多视图聚类
链接:https://arxiv.org/abs/2601.09051
作者:Yiming Du,Ziyu Wang,Jian Li,Rui Ning,Lusi Li
备注:Accepted by AAAI 2026
摘要:不完全多视图聚类(IMVC)的目的是从具有部分观测的多视图数据中发现共享的聚类结构。核心挑战在于准确地估算丢失的视图而不引入偏见,同时保持视图之间的语义一致性和集群内的紧凑性。为了解决这些挑战,我们提出了DIMVC-HIA,这是一种新型的深度IMVC框架,它集成了分层插补和对齐,具有四个关键组件:(1)用于潜在特征提取的视图特定自编码器,加上视图共享聚类预测器以产生软聚类分配;(2)分层插补模块,其首先基于跨视图对比相似性来估计缺失聚类分配,然后使用视图内、簇内统计来重建丢失的特征;(3)基于能量的语义对齐模块,其通过最小化围绕低能量簇锚的能量方差来促进簇内紧凑性;以及(4)对比分配对齐模块,其增强跨视图一致性并鼓励自信的、良好分离的簇预测。基准测试的实验表明,我们的框架在不同程度的缺失下实现了卓越的性能。
摘要:Incomplete multi-view clustering (IMVC) aims to discover shared cluster structures from multi-view data with partial observations. The core challenges lie in accurately imputing missing views without introducing bias, while maintaining semantic consistency across views and compactness within clusters. To address these challenges, we propose DIMVC-HIA, a novel deep IMVC framework that integrates hierarchical imputation and alignment with four key components: (1) view-specific autoencoders for latent feature extraction, coupled with a view-shared clustering predictor to produce soft cluster assignments; (2) a hierarchical imputation module that first estimates missing cluster assignments based on cross-view contrastive similarity, and then reconstructs missing features using intra-view, intra-cluster statistics; (3) an energy-based semantic alignment module, which promotes intra-cluster compactness by minimizing energy variance around low-energy cluster anchors; and (4) a contrastive assignment alignment module, which enhances cross-view consistency and encourages confident, well-separated cluster predictions. Experiments on benchmarks demonstrate that our framework achieves superior performance under varying levels of missingness.
联邦学习|隐私保护|加密(2篇)
【1】Single-Round Clustered Federated Learning via Data Collaboration Analysis for Non-IID Data
标题:通过非IID数据的数据协作分析进行单轮分组联邦学习
链接:https://arxiv.org/abs/2601.09304
作者:Sota Sugawara,Yuji Kawamata,Akihiro Toyoda,Tomoru Nakayama,Yukihiko Okada
备注:9 pages, 3 figures
摘要:联邦学习(FL)支持跨多个客户端的分布式学习,而无需共享原始数据。当客户端之间的统计异质性严重时,聚类联邦学习(CFL)可以通过对相似客户端进行分组并训练按簇划分的模型来提高性能。然而,大多数CFL方法依赖于多轮通信进行聚类估计和模型更新,这限制了它们在通信轮数严格受限时的实用性。我们提出了基于数据协作的聚类联邦学习(DC-CFL),这是一个单轮框架,仅使用DC分析中共享的信息即可完成客户端聚类和按簇学习。DC-CFL通过标签分布之间的总变差距离量化客户端之间的相似性,使用层次聚类估计簇,并通过DC分析执行按簇学习。在典型的非IID条件下对多个开放数据集的实验表明,DC-CFL实现了与多轮基线相当的准确性,同时只需要一轮通信。这些结果表明,当多轮通信不切实际时,DC-CFL是协作AI模型开发的一种实用替代方案。
摘要:Federated Learning (FL) enables distributed learning across multiple clients without sharing raw data. When statistical heterogeneity across clients is severe, Clustered Federated Learning (CFL) can improve performance by grouping similar clients and training cluster-wise models. However, most CFL approaches rely on multiple communication rounds for cluster estimation and model updates, which limits their practicality under tight constraints on communication rounds. We propose Data Collaboration-based Clustered Federated Learning (DC-CFL), a single-round framework that completes both client clustering and cluster-wise learning, using only the information shared in DC analysis. DC-CFL quantifies inter-client similarity via total variation distance between label distributions, estimates clusters using hierarchical clustering, and performs cluster-wise learning via DC analysis. Experiments on multiple open datasets under representative non-IID conditions show that DC-CFL achieves accuracy comparable to multi-round baselines while requiring only one communication round. These results indicate that DC-CFL is a practical alternative for collaborative AI model development when multiple communication rounds are impractical.
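The similarity-and-clustering step can be sketched as follows. This is a hypothetical toy (four clients, three classes, invented distributions and threshold), and a union-find merge stands in for the paper's hierarchical clustering.

```python
def tv_distance(p, q):
    """Total variation distance between two label distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def single_linkage_clusters(dists, n, threshold):
    """Tiny agglomerative (single-linkage) clustering via union-find:
    merge clients whose pairwise TV distance falls below the threshold.
    A simplified stand-in for the hierarchical clustering in DC-CFL."""
    parent = list(range(n))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i
    for (i, j), d in dists.items():
        if d < threshold:
            parent[find(i)] = find(j)
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

# Hypothetical label distributions over 3 classes for 4 clients:
labels = [[0.8, 0.1, 0.1], [0.7, 0.2, 0.1], [0.1, 0.1, 0.8], [0.1, 0.2, 0.7]]
dists = {(i, j): tv_distance(labels[i], labels[j])
         for i in range(4) for j in range(i + 1, 4)}
print(single_linkage_clusters(dists, 4, threshold=0.3))  # [[0, 1], [2, 3]]
```

Since label distributions (unlike raw data) can be shared cheaply, this entire grouping requires no extra communication rounds beyond the single DC-analysis exchange.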
【2】Lean Clients, Full Accuracy: Hybrid Zeroth- and First-Order Split Federated Learning
标题:精简客户端,完整精度:混合零阶与一阶拆分联邦学习
链接:https://arxiv.org/abs/2601.09076
作者:Zhoubin Kou,Zihan Chen,Jing Yang,Cong Shen
摘要:Split Federated Learning(SFL)支持在资源受限的边缘设备和富计算服务器之间进行协作训练。通信开销是SFL中的一个核心问题,可以通过辅助网络来减轻。然而,基本的客户端计算挑战仍然存在,因为反向传播需要大量的内存和计算成本,严重限制了边缘设备可以支持的模型的规模。为了实现更资源有效的客户端计算和减少客户端-服务器通信,我们提出了HERON-SFL,这是一种新型的混合优化框架,它集成了零阶(ZO)优化用于本地客户端训练,同时保留了服务器上的一阶(FO)优化。在辅助网络的帮助下,ZO更新使客户端能够在每一步使用扰动的仅向前评估来近似局部梯度,消除了内存密集型激活缓存,并避免了传统训练过程中的显式梯度计算。利用低有效秩假设,我们从理论上证明了HERON-SFL的收敛速度是独立的模型维数,解决了一个关键的可扩展性问题共同ZO算法。从经验上看,在ResNet训练和语言模型(LM)微调任务中,HERON-SFL匹配基准精度,同时将客户端峰值内存减少高达64%,每步客户端计算成本减少高达33%,大大扩展了可以在资源有限的设备上训练或调整的模型范围。
摘要:Split Federated Learning (SFL) enables collaborative training between resource-constrained edge devices and a compute-rich server. Communication overhead is a central issue in SFL and can be mitigated with auxiliary networks. Yet, the fundamental client-side computation challenge remains, as back-propagation requires substantial memory and computation costs, severely limiting the scale of models that edge devices can support. To enable more resource-efficient client computation and reduce the client-server communication, we propose HERON-SFL, a novel hybrid optimization framework that integrates zeroth-order (ZO) optimization for local client training while retaining first-order (FO) optimization on the server. With the assistance of auxiliary networks, ZO updates enable clients to approximate local gradients using perturbed forward-only evaluations per step, eliminating memory-intensive activation caching and avoiding explicit gradient computation in the traditional training process. Leveraging the low effective rank assumption, we theoretically prove that HERON-SFL's convergence rate is independent of model dimensionality, addressing a key scalability concern common to ZO algorithms. Empirically, on ResNet training and language model (LM) fine-tuning tasks, HERON-SFL matches benchmark accuracy while reducing client peak memory by up to 64% and client-side compute cost by up to 33% per step, substantially expanding the range of models that can be trained or adapted on resource-limited devices.
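The forward-only gradient approximation at the heart of ZO updates can be sketched as follows: a generic two-point estimator averaged over random directions. The paper's exact perturbation scheme may differ; the objective and numbers here are illustrative.

```python
import random

def zo_gradient(f, x, eps=1e-3, n_samples=20):
    """Two-point zeroth-order gradient estimate using only forward evaluations:
    g ~ E_u[(f(x + eps*u) - f(x - eps*u)) / (2*eps) * u] for Gaussian u.
    No backward pass or activation caching is needed, mirroring (in
    simplified form) the client-side updates in HERON-SFL."""
    d = len(x)
    g = [0.0] * d
    for _ in range(n_samples):
        u = [random.gauss(0.0, 1.0) for _ in range(d)]
        fp = f([xi + eps * ui for xi, ui in zip(x, u)])
        fm = f([xi - eps * ui for xi, ui in zip(x, u)])
        scale = (fp - fm) / (2.0 * eps * n_samples)
        g = [gi + scale * ui for gi, ui in zip(g, u)]
    return g

random.seed(0)
f = lambda x: sum(xi * xi for xi in x)   # true gradient at x is 2x
g = zo_gradient(f, [1.0, -2.0], n_samples=500)
print(g)  # approximately [2.0, -4.0]
```

The memory saving is the point: each estimate needs only two scalar function values per direction, so nothing from the forward pass has to be stored for backpropagation.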
推理|分析|理解|解释(8篇)
【1】Fast-ThinkAct: Efficient Vision-Language-Action Reasoning via Verbalizable Latent Planning
标题:Fast-ThinkAct:通过可言语化的潜在规划进行高效的视觉-语言-动作推理
链接:https://arxiv.org/abs/2601.09708
作者:Chi-Pin Huang,Yunze Man,Zhiding Yu,Min-Hung Chen,Jan Kautz,Yu-Chiang Frank Wang,Fu-En Yang
备注:Project page: https://jasper0314-huang.github.io/fast-thinkact/
摘要:视觉-语言-动作(VLA)任务需要在复杂的视觉场景中进行推理,并在动态环境中执行自适应动作。虽然最近关于推理VLA的研究表明,显式思维链(CoT)可以提高泛化能力,但由于冗长的推理轨迹,它们的推理延迟很高。我们提出了Fast-ThinkAct,一个高效的推理框架,通过可言语化的潜在推理实现紧凑而高性能的规划。Fast-ThinkAct通过从教师模型中蒸馏,学习利用潜在CoT进行高效推理,并由偏好引导的目标驱动以对齐操纵轨迹,从而将语言和视觉规划能力迁移到具身控制。这使得推理增强的策略学习能够有效地将紧凑的推理与动作执行联系起来。在不同的具身操纵和推理基准上的广泛实验表明,Fast-ThinkAct实现了强大的性能,与最先进的推理VLA相比,推理延迟降低了高达89.3%,同时保持了有效的长程规划、少样本适应和故障恢复。
摘要:Vision-Language-Action (VLA) tasks require reasoning over complex visual scenes and executing adaptive actions in dynamic environments. While recent studies on reasoning VLAs show that explicit chain-of-thought (CoT) can improve generalization, they suffer from high inference latency due to lengthy reasoning traces. We propose Fast-ThinkAct, an efficient reasoning framework that achieves compact yet performant planning through verbalizable latent reasoning. Fast-ThinkAct learns to reason efficiently with latent CoTs by distilling from a teacher, driven by a preference-guided objective to align manipulation trajectories that transfers both linguistic and visual planning capabilities for embodied control. This enables reasoning-enhanced policy learning that effectively connects compact reasoning to action execution. Extensive experiments across diverse embodied manipulation and reasoning benchmarks demonstrate that Fast-ThinkAct achieves strong performance with up to 89.3% reduced inference latency over state-of-the-art reasoning VLAs, while maintaining effective long-horizon planning, few-shot adaptation, and failure recovery.
【2】Toward Understanding Unlearning Difficulty: A Mechanistic Perspective and Circuit-Guided Difficulty Metric
标题:迈向理解遗忘难度:机制视角与电路引导的难度度量
链接:https://arxiv.org/abs/2601.09624
作者:Jiali Cheng,Ziheng Chen,Chirag Agarwal,Hadi Amiri
摘要:机器遗忘对于构建可信且合规的语言模型正变得至关重要。然而,遗忘的成功在不同样本之间差异很大:一些样本被可靠地抹去,而另一些样本在相同的过程下仍然存留。我们认为,这种差异不仅是数据侧的现象,也反映了编码和保护记忆信息的模型内部机制。我们从基于模型电路的机制视角研究这个问题,模型电路是决定预测如何形成的结构化交互通路。我们提出了电路引导的遗忘难度(CUD),这是一种在遗忘前(pre-unlearning)使用电路级信号为每个样本分配连续难度分数的度量。大量实验表明,CUD可靠地区分了本质上容易和困难的样本,并在不同遗忘方法下保持稳定。我们识别了关键的电路级模式,揭示了难度的机制签名:容易遗忘的样本与集中在原始模型早期到中期部分的较短、较浅的交互有关,而困难样本依赖于更长、更深、更接近后期计算的通路。与现有的定性研究相比,CUD向对遗忘难度进行有原则的、细粒度的、可解释的分析迈出了第一步,并激励了基于模型机制的遗忘方法的发展。
摘要:Machine unlearning is becoming essential for building trustworthy and compliant language models. Yet unlearning success varies considerably across individual samples: some are reliably erased, while others persist despite the same procedure. We argue that this disparity is not only a data-side phenomenon, but also reflects model-internal mechanisms that encode and protect memorized information. We study this problem from a mechanistic perspective based on model circuits--structured interaction pathways that govern how predictions are formed. We propose Circuit-guided Unlearning Difficulty (CUD), a {\em pre-unlearning} metric that assigns each sample a continuous difficulty score using circuit-level signals. Extensive experiments demonstrate that CUD reliably separates intrinsically easy and hard samples, and remains stable across unlearning methods. We identify key circuit-level patterns that reveal a mechanistic signature of difficulty: easy-to-unlearn samples are associated with shorter, shallower interactions concentrated in earlier-to-intermediate parts of the original model, whereas hard samples rely on longer and deeper pathways closer to late-stage computation. Compared to existing qualitative studies, CUD takes a first step toward a principled, fine-grained, and interpretable analysis of unlearning difficulty; and motivates the development of unlearning methods grounded in model mechanisms.
【3】On the Hardness of Computing Counterfactual and Semifactual Explanations in XAI
标题:论XAI中计算反事实和半事实解释的难度
链接:https://arxiv.org/abs/2601.09455
作者:André Artelt,Martin Olsen,Kevin Tierney
备注:Accepted in Transactions on Machine Learning Research (TMLR), 2025 -- https://openreview.net/pdf?id=aELzBw0q1O
摘要:为机器学习模型的选择提供清晰的解释对于这些模型在关键应用中的部署至关重要。反事实和半事实的解释已经成为两种机制,为用户提供深入了解他们的模型的输出。我们提供了一个概述的计算复杂性的结果,在文献中产生这些解释,发现在许多情况下,产生的解释是计算困难的。我们通过进一步贡献我们自己的不可近似性结果来大大加强这一论点,这些结果表明,不仅解释往往很难产生,而且在某些假设下,它们也很难近似。我们讨论了这些复杂性结果对XAI社区和寻求规范AI解释的政策制定者的影响。
摘要:Providing clear explanations to the choices of machine learning models is essential for these models to be deployed in crucial applications. Counterfactual and semi-factual explanations have emerged as two mechanisms for providing users with insights into the outputs of their models. We provide an overview of the computational complexity results in the literature for generating these explanations, finding that in many cases, generating explanations is computationally hard. We strengthen the argument for this considerably by further contributing our own inapproximability results showing that not only are explanations often hard to generate, but under certain assumptions, they are also hard to approximate. We discuss the implications of these complexity results for the XAI community and for policymakers seeking to regulate explanations in AI.
【4】Explainable Autoencoder-Based Anomaly Detection in IEC 61850 GOOSE Networks
标题:IEC 61850 GOOSE网络中基于自编码器的可解释异常检测
链接:https://arxiv.org/abs/2601.09287
作者:Dafne Lozano-Paredes,Luis Bote-Curiel,Juan Ramón Feijóo-Martínez,Ismael Gómez-Talal,José Luis Rojo-Álvarez
摘要:IEC 61850通用面向对象变电站事件(GOOSE)协议在数字变电站的实时保护和自动化中发挥着关键作用,但其缺乏原生安全机制,可能使电力系统面临复杂的网络攻击。在严重类不平衡和标记数据有限的情况下,传统的基于规则和有监督的入侵检测技术难以检测协议合规攻击和零日攻击。本文提出了一个可解释的、无监督的多视图异常检测框架,用于IEC 61850 GOOSE网络,显式地将语义完整性与时间可用性分离。该方法采用仅在真实运行GOOSE流量上训练的非对称自编码器,学习正常流量中基于序列的协议语义和时序相关传输动态的不同潜在表示。异常检测结合重建误差与具有统计依据的阈值实现,无需指定攻击类型即可获得鲁棒检测。特征级重建分析通过将检测结果与IEC 61850协议特征直接联系起来,提供内在的可解释性。所提出的框架使用真实变电站流量进行训练,并在包含正常流量以及报文抑制、数据篡改和拒绝服务攻击的公共数据集上进行测试评估。实验结果表明,攻击检测率超过99%,误报率低于总流量的5%,展现出跨环境的强大泛化能力,以及在极端类不平衡下的有效运行和可解释的异常归因。
摘要:The IEC 61850 Generic Object-Oriented Substation Event (GOOSE) protocol plays a critical role in real-time protection and automation of digital substations, yet its lack of native security mechanisms can expose power systems to sophisticated cyberattacks. Traditional rule-based and supervised intrusion detection techniques struggle to detect protocol-compliant and zero-day attacks under significant class imbalance and limited availability of labeled data. This paper proposes an explainable, unsupervised multi-view anomaly detection framework for IEC 61850 GOOSE networks that explicitly separates semantic integrity and temporal availability. The approach employs asymmetric autoencoders trained only on real operational GOOSE traffic to learn distinct latent representations of sequence-based protocol semantics and timing-related transmission dynamics in normal traffic. Anomaly detection is implemented using reconstruction errors mixed with statistically grounded thresholds, enabling robust detection without specified attack types. Feature-level reconstruction analysis provides intrinsic explainability by directly linking detection outcomes to IEC 61850 protocol characteristics. The proposed framework is evaluated using real substation traffic for training and a public dataset containing normal traffic and message suppression, data manipulation, and denial-of-service attacks for testing. Experimental results show attack detection rates above 99% with false positives remaining below 5% of total traffic, demonstrating strong generalization across environments and effective operation under extreme class imbalance and interpretable anomaly attribution.
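论文使用非对称自编码器的重建误差配合具有统计依据的阈值进行检测。这里仅示意阈值化这一步:在良性流量的重建误差上拟合"均值加 k 倍标准差"的阈值(k=3.0 为假设值,误差数据为虚构;真实系统中的误差来自自编码器):

```python
import statistics

def fit_threshold(normal_errors, k=3.0):
    """Statistically grounded threshold: mean + k*std over reconstruction
    errors of benign traffic only (k = 3.0 is an assumed value)."""
    return statistics.mean(normal_errors) + k * statistics.stdev(normal_errors)

def is_anomalous(error, tau):
    return error > tau

benign = [0.10, 0.12, 0.09, 0.11, 0.10, 0.13]  # made-up reconstruction errors
tau = fit_threshold(benign)
print(is_anomalous(0.95, tau))  # True: far above what normal traffic produces
print(is_anomalous(0.11, tau))  # False
```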
【5】Deep Learning-based Binary Analysis for Vulnerability Detection in x86-64 Machine Code
标题:基于深度学习的二进制分析用于x86-64机器代码漏洞检测
链接:https://arxiv.org/abs/2601.09157
作者:Mitchell Petingola
摘要:虽然目前基于深度学习的漏洞检测的大部分研究都依赖于反汇编的二进制文件,但本文探讨了直接从原始x86-64机器代码中提取特征的可行性。虽然汇编语言对人类来说更容易解释,但它需要更复杂的模型来捕获令牌级上下文。相比之下,机器代码可以实现更高效,轻量级的模型,并保留所有可能在反汇编中丢失的信息。本文通过对两种特定的深度学习模型架构进行探索性研究来完成漏洞检测任务,旨在系统地评估它们在三种漏洞类型中的性能。结果表明,基于图的模型始终优于顺序模型,强调控制流关系的重要性,机器代码包含足够的信息,有效的漏洞发现。
摘要:While much of the current research in deep learning-based vulnerability detection relies on disassembled binaries, this paper explores the feasibility of extracting features directly from raw x86-64 machine code. Although assembly language is more interpretable for humans, it requires more complex models to capture token-level context. In contrast, machine code may enable more efficient, lightweight models and preserve all information that might be lost in disassembly. This paper approaches the task of vulnerability detection through an exploratory study on two specific deep learning model architectures and aims to systematically evaluate their performance across three vulnerability types. The results demonstrate that graph-based models consistently outperform sequential models, emphasizing the importance of control flow relationships, and that machine code contains sufficient information for effective vulnerability discovery.
【6】KTCF: Actionable Recourse in Knowledge Tracing via Counterfactual Explanations for Education
标题:KTCF:通过教育反事实解释进行知识追踪的可行动追索
链接:https://arxiv.org/abs/2601.09156
作者:Woojin Kim,Changkwon Lee,Hyeoncheol Kim
备注:Accepted to AAAI-26 Special Track AI for Social Impact (oral presentation)
摘要:利用人工智能改进教学与学习,有助于提升教育的适应性和可扩展性。知识追踪(KT)因其优越的性能和在教育中的应用潜力而被公认为学生建模任务。为此,我们将反事实解释概念化并加以研究,作为从面向KT的XAI到教育的桥梁。反事实解释提供可行动的追索,本质上是因果的和局部的,并且易于被通常并非专家的教育利益相关者理解。我们提出KTCF,一种考虑知识概念关系的KT反事实解释生成方法,以及将反事实解释转换为教育指令序列的后处理方案。我们在大规模教育数据集上进行实验,结果表明KTCF方法相比现有方法取得了更优且更鲁棒的性能,各项指标的改进幅度从5.7%到34%不等。此外,我们对后处理方案进行了定性评估,表明生成的教育指令有助于减轻大量学习负担。我们表明,反事实有潜力推动人工智能在教育中负责任且实用的应用。未来面向KT的XAI工作可能受益于以教育为基础的概念化和以利益相关者为中心的方法开发。
摘要:Using Artificial Intelligence to improve teaching and learning benefits greater adaptivity and scalability in education. Knowledge Tracing (KT) is recognized for student modeling task due to its superior performance and application potential in education. To this end, we conceptualize and investigate counterfactual explanation as the connection from XAI for KT to education. Counterfactual explanations offer actionable recourse, are inherently causal and local, and easy for educational stakeholders to understand who are often non-experts. We propose KTCF, a counterfactual explanation generation method for KT that accounts for knowledge concept relationships, and a post-processing scheme that converts a counterfactual explanation into a sequence of educational instructions. We experiment on a large-scale educational dataset and show our KTCF method achieves superior and robust performance over existing methods, with improvements ranging from 5.7% to 34% across metrics. Additionally, we provide a qualitative evaluation of our post-processing scheme, demonstrating that the resulting educational instructions help in reducing large study burden. We show that counterfactuals have the potential to advance the responsible and practical use of AI in education. Future works on XAI for KT may benefit from educationally grounded conceptualization and developing stakeholder-centered methods.
【7】Physics-Guided Counterfactual Explanations for Large-Scale Multivariate Time Series: Application in Scalable and Interpretable SEP Event Prediction
标题:大规模多元时间序列的物理引导反事实解释:在可扩展和可解释的SEP事件预测中的应用
链接:https://arxiv.org/abs/2601.08999
作者:Pranjal Patil,Anli Ji,Berkay Aydin
备注:This is a pre-print of an accepted paper at IEEE BigData 2025, SS 11:Towards an Understanding of Artificial Intelligence: Bridging Theory, Explainability, and Practical Applications
摘要:准确预测太阳高能粒子事件对于保护卫星、宇航员和天基基础设施至关重要。现代空间天气监测从地球静止环境业务卫星(GOES)等来源产生大量高频多元时间序列(MVTS)数据。在这些数据上训练的机器学习(ML)模型显示出强大的预测能力,但大多数现有方法忽略了特定领域的可行性约束。反事实解释已成为提高模型可解释性的关键工具,但现有方法很少强制物理合理性。这项工作介绍了一个物理引导的反事实解释框架,一种在时间序列分类任务中生成与基本物理原理保持一致的反事实解释的新方法。应用于太阳高能粒子(SEP)预测时,与DiCE等最先进的基线相比,该框架使动态时间规整(DTW)距离减少80%以上从而提高接近度,生成稀疏性更高的反事实解释,并将运行时间减少近50%。除了数值上的改进,该框架确保生成的反事实解释在科学领域是物理上合理且可操作的。总之,该框架生成既有效又物理一致的反事实解释,同时为大数据环境中可扩展的反事实生成奠定了基础。
摘要:Accurate prediction of solar energetic particle events is vital for safeguarding satellites, astronauts, and space-based infrastructure. Modern space weather monitoring generates massive volumes of high-frequency, multivariate time series (MVTS) data from sources such as the Geostationary Operational Environmental Satellites (GOES). Machine learning (ML) models trained on this data show strong predictive power, but most existing methods overlook domain-specific feasibility constraints. Counterfactual explanations have emerged as a key tool for improving model interpretability, yet existing approaches rarely enforce physical plausibility. This work introduces a Physics-Guided Counterfactual Explanation framework, a novel method for generating counterfactual explanations in time series classification tasks that remain consistent with underlying physical principles. Applied to solar energetic particles (SEP) forecasting, this framework achieves over 80% reduction in Dynamic Time Warping (DTW) distance increasing the proximity, produces counterfactual explanations with higher sparsity, and reduces runtime by nearly 50% compared to state-of-the-art baselines such as DiCE. Beyond numerical improvements, this framework ensures that generated counterfactual explanations are physically plausible and actionable in scientific domains. In summary, the framework generates counterfactual explanations that are both valid and physically consistent, while laying the foundation for scalable counterfactual generation in big data environments.
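摘要以动态时间规整(DTW)距离衡量反事实序列与原序列的接近度。DTW 本身可用经典动态规划实现,如下所示(标准算法的纯 Python 版本,非论文提出的方法):

```python
def dtw_distance(a, b):
    """Classic O(len(a)*len(b)) dynamic-programming DTW with |x - y| cost."""
    inf = float("inf")
    n, m = len(a), len(b)
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three admissible warping moves.
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]

print(dtw_distance([0, 1, 2], [0, 1, 2]))     # 0.0
print(dtw_distance([0, 0, 1, 2], [0, 1, 2]))  # 0.0: warping absorbs the repeated 0
```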
【8】ForensicFormer: Hierarchical Multi-Scale Reasoning for Cross-Domain Image Forgery Detection
标题:ForensicFormer:跨域图像伪造检测的分层多尺度推理
链接:https://arxiv.org/abs/2601.08873
作者:Hema Hariharan Samson
备注:9 pages, 4 figures, 5 tables. Technical report on hierarchical multi-scale image forgery detection
摘要:人工智能生成的图像和复杂的编辑工具的激增使得传统的取证方法对于跨域伪造检测无效。我们提出了ForensicFormer,一个分层的多尺度框架,通过交叉注意Transformers统一了低级别的工件检测,中级边界分析和高级语义推理。与之前的单范例方法不同,该方法在分布外数据集上实现了<75%的准确度,我们的方法在七个不同的测试集上保持了86.8%的平均准确度,涵盖了传统的操作,GAN生成的图像和扩散模型输出-这是对最先进的通用检测器的显着改进。我们表现出优越的鲁棒性JPEG压缩(83%的准确度在Q=70与66%的基线),并提供像素级伪造本地化与0.76 F1分数。广泛的消融研究证实,每个分层组件有助于4-10%的准确性提高,定性分析揭示了与人类专家推理一致的可解释的法医特征。我们的工作将经典图像取证和现代深度学习联系在一起,为操纵技术先验未知的现实世界部署提供了实用的解决方案。
摘要:The proliferation of AI-generated imagery and sophisticated editing tools has rendered traditional forensic methods ineffective for cross-domain forgery detection. We present ForensicFormer, a hierarchical multi-scale framework that unifies low-level artifact detection, mid-level boundary analysis, and high-level semantic reasoning via cross-attention transformers. Unlike prior single-paradigm approaches, which achieve <75% accuracy on out-of-distribution datasets, our method maintains 86.8% average accuracy across seven diverse test sets, spanning traditional manipulations, GAN-generated images, and diffusion model outputs - a significant improvement over state-of-the-art universal detectors. We demonstrate superior robustness to JPEG compression (83% accuracy at Q=70 vs. 66% for baselines) and provide pixel-level forgery localization with a 0.76 F1-score. Extensive ablation studies validate that each hierarchical component contributes 4-10% accuracy improvement, and qualitative analysis reveals interpretable forensic features aligned with human expert reasoning. Our work bridges classical image forensics and modern deep learning, offering a practical solution for real-world deployment where manipulation techniques are unknown a priori.
检测相关(3篇)
【1】Towards Robust Cross-Dataset Object Detection Generalization under Domain Specificity
标题:领域特异性下实现稳健的跨数据集目标检测泛化
链接:https://arxiv.org/abs/2601.09497
作者:Ritabrata Chakraborty,Hrishit Mitra,Shivakumara Palaiahnakote,Umapada Pal
备注:15 pages, 4 figures, 6 tables
摘要:目标检测器通常在分布内表现良好,但在不同基准上性能会急剧下降。我们通过场景特异性的视角研究跨数据集目标检测(CD-OD)。我们将基准分为包含多样日常场景的场景无关数据集和绑定于狭窄环境的场景特定数据集,并在所有训练-测试对上评估一个标准检测器系列。这揭示了CD-OD中的清晰结构:同一场景类型内的迁移相对稳定,而跨场景类型的迁移大幅下降,且通常是不对称的。最严重的失效发生在从特定来源迁移到无关目标时,并在开放标签对齐后持续存在,表明在最困难的情形下域偏移占主导地位。为了将域偏移与标签不匹配解耦,我们将封闭标签迁移与开放标签协议进行比较,后者使用CLIP相似性将预测类别映射到最近的目标标签。开放标签评估产生一致但有界的增益,许多被纠正的情况对应于有图像证据支持的语义近似错失。总体而言,我们提供了场景特异性下CD-OD的原则性刻画,以及在分布偏移下评估检测器的实用指导。代码将在\href{https://github.com/Ritabrata04/cdod-icpr.git}{https://github.com/Ritabrata04/cdod-icpr}发布。
摘要:Object detectors often perform well in-distribution, yet degrade sharply on a different benchmark. We study cross-dataset object detection (CD-OD) through a lens of setting specificity. We group benchmarks into setting-agnostic datasets with diverse everyday scenes and setting-specific datasets tied to a narrow environment, and evaluate a standard detector family across all train--test pairs. This reveals a clear structure in CD-OD: transfer within the same setting type is relatively stable, while transfer across setting types drops substantially and is often asymmetric. The most severe breakdowns occur when transferring from specific sources to agnostic targets, and persist after open-label alignment, indicating that domain shift dominates in the hardest regimes. To disentangle domain shift from label mismatch, we compare closed-label transfer with an open-label protocol that maps predicted classes to the nearest target label using CLIP similarity. Open-label evaluation yields consistent but bounded gains, and many corrected cases correspond to semantic near-misses supported by the image evidence. Overall, we provide a principled characterization of CD-OD under setting specificity and practical guidance for evaluating detectors under distribution shift. Code will be released at \href{https://github.com/Ritabrata04/cdod-icpr.git}{https://github.com/Ritabrata04/cdod-icpr}.
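摘要中的开放标签协议使用 CLIP 相似性将预测类别映射到最近的目标标签。下面用玩具向量代替真实的 CLIP 文本特征来示意这一映射(嵌入数值为假设;真实实现需用 CLIP 文本编码器对标签字符串编码):

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def open_label_map(pred_label, pred_emb, target_embs):
    """Map a predicted class name to the most similar target label."""
    return max(target_embs, key=lambda name: cosine(pred_emb[pred_label], target_embs[name]))

# Toy 3-d vectors standing in for CLIP text embeddings (made-up values).
pred_emb = {"automobile": [0.9, 0.1, 0.0]}
target_embs = {"car": [0.88, 0.15, 0.02], "person": [0.0, 0.2, 0.95]}
print(open_label_map("automobile", pred_emb, target_embs))  # car
```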
【2】N-EIoU-YOLOv9: A Signal-Aware Bounding Box Regression Loss for Lightweight Mobile Detection of Rice Leaf Diseases
标题:N-EIoU-YOLOv9:用于水稻叶部病害轻量级移动检测的信号感知边界框回归损失
链接:https://arxiv.org/abs/2601.09170
作者:Dung Ta Nguyen Duc,Thanh Bui Dang,Hoang Le Minh,Tung Nguyen Viet,Huong Nguyen Thanh,Dong Trinh Cong
摘要:在这项工作中,我们提出了N-EIoU-YOLOv9,这是一个轻量级的检测框架,基于源自非单调梯度聚焦和几何解耦原理的信号感知边界框回归损失,称为N-EIoU(非单调高效交并比)。所提出的损失通过将非单调聚焦与解耦的宽度和高度优化相结合来重塑定位梯度,从而增强低重叠难样本的弱回归信号,同时减少梯度干扰。这种设计对于农业病害图像中常见的小目标和低对比度目标特别有效。所提出的N-EIoU损失被集成到轻量级YOLOv9t架构中,并在自行采集的田间数据集上进行评估,该数据集包含四种病害类别和健康叶片共5908张水稻叶片图像。实验结果表明,与标准CIoU损失相比,性能得到了一致的提升,平均精度均值达到90.3%,比基线提高了4.3%,在更严格的评估标准下定位精度也有所改善。为进行实际验证,优化后的模型使用TensorFlow Lite和Float16量化部署在Android设备上,在保持准确性的同时实现了每帧156毫秒的平均推理时间。这些结果证实,所提出的方法有效地平衡了基于边缘的农业监测系统的准确性、优化稳定性和计算效率。
摘要:In this work, we propose N-EIoU-YOLOv9, a lightweight detection framework based on a signal-aware bounding box regression loss derived from non-monotonic gradient focusing and geometric decoupling principles, referred to as N-EIoU (Non-monotonic Efficient Intersection over Union). The proposed loss reshapes localization gradients by combining non-monotonic focusing with decoupled width and height optimization, thereby enhancing weak regression signals for hard samples with low overlap while reducing gradient interference. This design is particularly effective for small and low-contrast targets commonly observed in agricultural disease imagery. The proposed N-EIoU loss is integrated into a lightweight YOLOv9t architecture and evaluated on a self-collected field dataset comprising 5908 rice leaf images across four disease categories and healthy leaves. Experimental results demonstrate consistent performance gains over the standard CIoU loss, achieving a mean Average Precision of 90.3 percent, corresponding to a 4.3 percent improvement over the baseline, with improved localization accuracy under stricter evaluation criteria. For practical validation, the optimized model is deployed on an Android device using TensorFlow Lite with Float16 quantization, achieving an average inference time of 156 milliseconds per frame while maintaining accuracy. These results confirm that the proposed approach effectively balances accuracy, optimization stability, and computational efficiency for edge-based agricultural monitoring systems.
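下面给出带有解耦宽、高惩罚项的 EIoU 损失的一个最小实现,用以说明摘要中"几何解耦"的回归目标;论文额外的非单调聚焦因子在摘要中未给出具体形式,此处省略:

```python
def eiou_loss(box_a, box_b):
    """EIoU = 1 - IoU + center-distance term + decoupled width/height terms.
    Boxes are (x1, y1, x2, y2). The paper's non-monotonic focusing factor
    is not included (its exact form is not given in the abstract)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    iou = inter / (area_a + area_b - inter)
    # Smallest enclosing box normalizes every penalty term.
    cw = max(ax2, bx2) - min(ax1, bx1)
    ch = max(ay2, by2) - min(ay1, by1)
    d2 = ((ax1 + ax2 - bx1 - bx2) ** 2 + (ay1 + ay2 - by1 - by2) ** 2) / 4.0
    center = d2 / (cw ** 2 + ch ** 2)
    w_term = ((ax2 - ax1) - (bx2 - bx1)) ** 2 / cw ** 2  # decoupled width penalty
    h_term = ((ay2 - ay1) - (by2 - by1)) ** 2 / ch ** 2  # decoupled height penalty
    return 1.0 - iou + center + w_term + h_term

print(eiou_loss((0, 0, 2, 2), (0, 0, 2, 2)))      # 0.0 for a perfect match
print(eiou_loss((0, 0, 2, 2), (1, 1, 3, 3)) > 0)  # shifted box incurs a penalty
```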
【3】DriftGuard: A Hierarchical Framework for Concept Drift Detection and Remediation in Supply Chain Forecasting
标题:DriftGuard:供应链预测中概念漂移检测和修复的分层框架
链接:https://arxiv.org/abs/2601.08928
作者:Shahnawaz Alam,Mohammed Abdul Rahman,Bareera Sadeqa
摘要:供应链预测模型随着现实世界条件的变化而退化。促销活动发生变化,消费者偏好发生变化,供应中断改变了需求模式,导致了所谓的概念漂移。这种无声的降级会导致缺货或库存过剩,而不会触发任何系统警告。目前的行业实践依赖于人工监测和每3-6个月的计划再训练,这在稳定期浪费了计算资源,同时错过了快速漂移事件。现有的学术方法只关注漂移检测,而不解决诊断或补救问题,并且忽略了供应链数据中固有的层次结构。零售商需要的是一个端到端的系统,能够及早发现漂移,解释其根本原因,并自动纠正受影响的模型。我们提出了DriftGuard,一个五模块的框架,解决了完整的漂移生命周期。该系统结合了四个互补的检测方法,即基于错误的监测,统计测试,自动编码器异常检测,累计和(Cumulative Sum)的变化点分析,与分层传播分析,以准确地确定漂移发生在整个产品线。一旦检测到,Shapley加法修正(SHAP)分析会诊断根本原因,而成本感知的再训练策略只选择性地更新受影响最严重的模型。通过对M5零售数据集的30,000多个时间序列进行评估,DriftGuard在4.2天内实现了97.8%的检测召回率,并通过有针对性的补救提供了高达417的投资回报。
摘要:Supply chain forecasting models degrade over time as real-world conditions change. Promotions shift, consumer preferences evolve, and supply disruptions alter demand patterns, causing what is known as concept drift. This silent degradation leads to stockouts or excess inventory without triggering any system warnings. Current industry practice relies on manual monitoring and scheduled retraining every 3-6 months, which wastes computational resources during stable periods while missing rapid drift events. Existing academic methods focus narrowly on drift detection without addressing diagnosis or remediation, and they ignore the hierarchical structure inherent in supply chain data. What retailers need is an end-to-end system that detects drift early, explains its root causes, and automatically corrects affected models. We propose DriftGuard, a five-module framework that addresses the complete drift lifecycle. The system combines an ensemble of four complementary detection methods, namely error-based monitoring, statistical tests, autoencoder anomaly detection, and Cumulative Sum (CUSUM) change-point analysis, with hierarchical propagation analysis to identify exactly where drift occurs across product lines. Once detected, Shapley Additive Explanations (SHAP) analysis diagnoses the root causes, and a cost-aware retraining strategy selectively updates only the most affected models. Evaluated on over 30,000 time series from the M5 retail dataset, DriftGuard achieves 97.8% detection recall within 4.2 days and delivers up to 417 return on investment through targeted remediation.
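DriftGuard 的检测集成中包含累积和(CUSUM)变点分析。其经典单侧形式可以几行写出(k 为松弛量、h 为报警阈值,取值为示意性假设,非论文的具体参数):

```python
def cusum(series, target, k=0.5, h=4.0):
    """One-sided CUSUM: return the first index where the cumulative positive
    drift from `target` (minus slack k) exceeds threshold h, else None."""
    s = 0.0
    for i, x in enumerate(series):
        s = max(0.0, s + (x - target) - k)
        if s > h:
            return i
    return None

stable = [0.1, -0.2, 0.0, 0.1, -0.1]
drifted = stable + [2.0, 2.1, 1.9, 2.2]   # demand pattern shifts upward
print(cusum(stable, target=0.0))    # None: no alarm on stable residuals
print(cusum(drifted, target=0.0))   # 7: alarm fires inside the drifted tail
```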
分类|识别(1篇)
【1】Preliminary Tests of the Anticipatory Classifier System with Hindsight Experience Replay
标题:事后经验回放的预期分类器系统初步测试
链接:https://arxiv.org/abs/2601.09400
作者:Olgierd Unold,Stanisław Franczyk
摘要:本文介绍了ACS2HER,一种将预期分类器系统(ACS2)与后见之明经验回放(HER)机制相结合的新方法。虽然ACS2在通过潜在学习构建认知地图方面非常有效,但在奖励稀疏的环境中其表现往往停滞不前。我们提出了一种特定的架构变体:当智能体未能达到其主要目标时触发后见之明学习,将访问过的状态重新标记为虚拟目标,以增强学习信号。该模型在两个基准上进行了评估:确定性的\texttt{Maze 6}和随机的\texttt{FrozenLake}。结果表明,与标准ACS2相比,ACS2HER显著加速了知识获取和对环境的掌握。然而,这种效率提升伴随着计算开销的增加和分类器数量的大幅膨胀。这项工作首次分析了在学习分类器系统中将预期机制与回顾性目标重标记相结合的效果。
摘要:This paper introduces ACS2HER, a novel integration of the Anticipatory Classifier System (ACS2) with the Hindsight Experience Replay (HER) mechanism. While ACS2 is highly effective at building cognitive maps through latent learning, its performance often stagnates in environments characterized by sparse rewards. We propose a specific architectural variant that triggers hindsight learning when the agent fails to reach its primary goal, re-labeling visited states as virtual goals to densify the learning signal. The proposed model was evaluated on two benchmarks: the deterministic \texttt{Maze 6} and the stochastic \texttt{FrozenLake}. The results demonstrate that ACS2HER significantly accelerates knowledge acquisition and environmental mastery compared to the standard ACS2. However, this efficiency gain is accompanied by increased computational overhead and a substantial expansion in classifier numerosity. This work provides the first analysis of combining anticipatory mechanisms with retrospective goal-relabeling in Learning Classifier Systems.
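后见之明经验回放(HER)的核心操作是将失败回合中实际到达的状态重标记为虚拟目标。以下为该重标记步骤的通用示意(与 ACS2 的分类器细节无关;转移元组的数据结构为本文假设):

```python
def hindsight_relabel(episode, reached_state):
    """Re-label a failed episode: treat `reached_state` as a virtual goal,
    so the transition that attains it earns a reward of 1.0."""
    relabeled = []
    for (state, action, next_state) in episode:
        reward = 1.0 if next_state == reached_state else 0.0
        relabeled.append((state, action, reached_state, reward))
    return relabeled

# The agent aimed for goal "g" but only reached "s2"; relabel "s2" as the goal.
episode = [("s0", "up", "s1"), ("s1", "right", "s2")]
virtual = hindsight_relabel(episode, reached_state="s2")
print(virtual[-1])  # ('s1', 'right', 's2', 1.0): a dense learning signal appears
```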
表征(4篇)
【1】Geometric Stability: The Missing Axis of Representations
标题:几何稳定性:缺失的表示轴
链接:https://arxiv.org/abs/2601.09173
作者:Prashant C. Raju
摘要:学习表征的分析存在一个盲点:它关注相似性,即衡量嵌入与外部参考的对齐程度,但相似性只揭示表示了什么,而非该结构是否健壮。我们引入几何稳定性,这一独特维度量化表征几何在扰动下保持的可靠程度,并提出测量它的框架Shesha。在七个领域的2,463个配置中,我们表明稳定性与相似性在经验上不相关($\rho \approx 0.01$),且在机制上不同:相似性指标在删除顶部主成分后崩溃,而稳定性保持对细粒度流形结构的敏感性。这一区别产生了可操作的见解:对于安全监控,稳定性充当功能几何金丝雀,检测结构漂移的灵敏度比CKA高近2$\times$,同时过滤掉在刚性距离度量中触发误报的非功能噪声;对于可控性,监督稳定性预测线性可操纵性($\rho = 0.89$-$0.96$);对于模型选择,稳定性与可迁移性分离,揭示了迁移优化所付出的几何税。除机器学习之外,稳定性还可预测CRISPR扰动相干性和神经-行为耦合。通过量化系统保持结构的可靠程度,几何稳定性为审计生物和计算系统中的表示提供了对相似性的必要补充。
摘要:Analysis of learned representations has a blind spot: it focuses on $similarity$, measuring how closely embeddings align with external references, but similarity reveals only what is represented, not whether that structure is robust. We introduce $geometric$ $stability$, a distinct dimension that quantifies how reliably representational geometry holds under perturbation, and present $Shesha$, a framework for measuring it. Across 2,463 configurations in seven domains, we show that stability and similarity are empirically uncorrelated ($\rho \approx 0.01$) and mechanistically distinct: similarity metrics collapse after removing the top principal components, while stability retains sensitivity to fine-grained manifold structure. This distinction yields actionable insights: for safety monitoring, stability acts as a functional geometric canary, detecting structural drift nearly 2$\times$ more sensitively than CKA while filtering out the non-functional noise that triggers false alarms in rigid distance metrics; for controllability, supervised stability predicts linear steerability ($\rho = 0.89$-$0.96$); for model selection, stability dissociates from transferability, revealing a geometric tax that transfer optimization incurs. Beyond machine learning, stability predicts CRISPR perturbation coherence and neural-behavioral coupling. By quantifying $how$ $reliably$ systems maintain structure, geometric stability provides a necessary complement to similarity for auditing representations across biological and computational systems.
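摘要中"几何稳定性"的直觉,即表征几何在扰动下保持的可靠程度,可以用一个简化代理来说明:比较扰动前后成对距离的 Pearson 相关性。这只是本文构造的示意性代理,并非 Shesha 框架的实际度量:

```python
import math
import random

def pairwise_dists(points):
    return [math.dist(p, q) for i, p in enumerate(points) for q in points[i + 1:]]

def stability_score(points, noise, seed=0):
    """Pearson correlation between pairwise distances before/after adding
    Gaussian noise to each coordinate (a toy proxy for geometric stability)."""
    rng = random.Random(seed)
    perturbed = [[x + rng.gauss(0, noise) for x in p] for p in points]
    a, b = pairwise_dists(points), pairwise_dists(perturbed)
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = math.sqrt(sum((x - ma) ** 2 for x in a))
    vb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return cov / (va * vb)

pts = [[float(i), float(i * i % 7)] for i in range(8)]
print(round(stability_score(pts, noise=0.0), 6))  # 1.0: no perturbation
print(stability_score(pts, noise=0.01) > 0.99)    # small noise barely moves the geometry
```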
【2】Navigating Ideation Space: Decomposed Conceptual Representations for Positioning Scientific Ideas
标题:导航构思空间:用于定位科学思想的分解概念表示
链接:https://arxiv.org/abs/2601.08901
作者:Yuexi Shen,Minqian Liu,Dawei Zhou,Lifu Huang
备注:21 pages, 6 tables
摘要:科学发现是一个累积的过程,需要新的想法位于不断扩大的现有知识的景观。一个新出现的关键挑战是如何从快速增长的文献中识别概念相关的先前工作,并评估新想法如何与现有研究区分开来。当前的嵌入方法通常将不同的概念方面合并为单一的表示,不能支持细粒度的文献检索;同时,基于LLM的评估者受到奉承偏见的影响,无法提供有区别的新颖性评估。为了应对这些挑战,我们引入了构思空间,这是一种将科学知识分解为三个不同维度的结构化表示,即,研究问题,方法和核心发现,每一个通过对比培训学习。这个框架可以原则性地测量概念之间的距离,并对概念转换进行建模,以捕捉所提出的概念中的逻辑联系。在此表示的基础上,我们提出了一个层次子空间检索框架,有效的,有针对性的文献检索,和分解的新颖性评估算法,确定哪些方面的想法是新颖的。大量的实验表明,我们的方法实现了0.329的Recall@30(比基线高16.7%),我们的思维转变检索达到了0.643的Hit Rate@30,新颖性评估与专家判断的相关性达到了0.37。总之,我们的工作为未来加速和评估科学发现的研究提供了一个有前途的范例。
摘要:Scientific discovery is a cumulative process and requires new ideas to be situated within an ever-expanding landscape of existing knowledge. An emerging and critical challenge is how to identify conceptually relevant prior work from rapidly growing literature, and assess how a new idea differentiates from existing research. Current embedding approaches typically conflate distinct conceptual aspects into single representations and cannot support fine-grained literature retrieval; meanwhile, LLM-based evaluators are subject to sycophancy biases, failing to provide discriminative novelty assessment. To tackle these challenges, we introduce the Ideation Space, a structured representation that decomposes scientific knowledge into three distinct dimensions, i.e., research problem, methodology, and core findings, each learned through contrastive training. This framework enables principled measurement of conceptual distance between ideas, and modeling of ideation transitions that capture the logical connections within a proposed idea. Building upon this representation, we propose a Hierarchical Sub-Space Retrieval framework for efficient, targeted literature retrieval, and a Decomposed Novelty Assessment algorithm that identifies which aspects of an idea are novel. Extensive experiments demonstrate substantial improvements, where our approach achieves Recall@30 of 0.329 (16.7% over baselines), our ideation transition retrieval reaches Hit Rate@30 of 0.643, and novelty assessment attains 0.37 correlation with expert judgments. In summary, our work provides a promising paradigm for future research on accelerating and evaluating scientific discovery.
【3】Learning Domain-Invariant Representations for Cross-Domain Image Registration via Scene-Appearance Disentanglement
标题:通过场景外观解纠缠学习跨域图像配准的域不变表示
链接:https://arxiv.org/abs/2601.08875
作者:Jiahao Qin,Yiwen Wang
备注:12 pages, 7 figures, 4 tables. Code and data available at https://github.com/D-ST-Sword/SAR-NET
摘要:域偏移下的图像配准仍然是计算机视觉和医学成像中的一个基本挑战:当源图像和目标图像表现出系统性强度差异时,传统配准方法所依赖的亮度恒定性假设被违反,从而使对应估计成为不适定问题。我们提出SAR-Net,一个通过有原则的场景-外观解耦来应对这一挑战的统一框架。我们的关键洞见是,观测图像可以分解为域不变的场景表示和域特定的外观编码,使配准通过重新渲染而非直接强度匹配来实现。我们建立了这种分解能实现一致跨域对齐的理论条件(命题1),并证明我们的场景一致性损失为共享潜在空间中的几何对应提供了充分条件(命题2)。在实证方面,我们在双向扫描显微镜上验证了SAR-Net,其中耦合的域偏移和几何畸变构成了一个具有挑战性的真实世界测试平台。我们的方法实现了0.885 SSIM和0.979 NCC,比最强基线提升3.1倍,同时保持实时性能(77 fps)。消融研究证实场景一致性损失和域对齐损失都是必要的:去除其中任何一项,分别会使SSIM性能下降90%,或导致潜在空间对齐误差增加223倍。代码和数据可在https://github.com/D-ST-Sword/SAR-NET上获得。
摘要:Image registration under domain shift remains a fundamental challenge in computer vision and medical imaging: when source and target images exhibit systematic intensity differences, the brightness constancy assumption underlying conventional registration methods is violated, rendering correspondence estimation ill-posed. We propose SAR-Net, a unified framework that addresses this challenge through principled scene-appearance disentanglement. Our key insight is that observed images can be decomposed into domain-invariant scene representations and domain-specific appearance codes, enabling registration via re-rendering rather than direct intensity matching. We establish theoretical conditions under which this decomposition enables consistent cross-domain alignment (Proposition 1) and prove that our scene consistency loss provides a sufficient condition for geometric correspondence in the shared latent space (Proposition 2). Empirically, we validate SAR-Net on bidirectional scanning microscopy, where coupled domain shift and geometric distortion create a challenging real-world testbed. Our method achieves 0.885 SSIM and 0.979 NCC, representing 3.1x improvement over the strongest baseline, while maintaining real-time performance (77 fps). Ablation studies confirm that both scene consistency and domain alignment losses are necessary: removing either degrades performance by 90% SSIM or causes 223x increase in latent alignment error, respectively. Code and data are available at https://github.com/D-ST-Sword/SAR-NET.
【4】Universal Latent Homeomorphic Manifolds: Cross-Domain Representation Learning via Homeomorphism Verification
标题:通用潜在同胚流形:通过同胚验证的跨域表示学习
链接:https://arxiv.org/abs/2601.09025
作者:Tong Wu,Tayab Uddin Wara,Daniel Hernandez,Sidong Lei
摘要:我们提出了通用潜在同胚流形(ULHM),一个将语义表示(例如人类描述、诊断标签)和观测驱动的机器表示(例如像素强度、传感器读数)统一到单个潜在结构中的框架。尽管两种模态的来源途径根本不同,但它们都捕捉到同一个底层现实。我们确立同胚(一种保持拓扑结构的连续双射)作为数学标准,用于判定由不同语义-观测对诱导的潜在流形何时可以被严格统一。该标准为三个关键应用提供了理论保证:(1)从不完整观测中进行语义引导的稀疏恢复,(2)具有已验证结构兼容性的跨域迁移学习,以及(3)通过从语义空间到观测空间的有效迁移进行zero-shot组合学习。我们的框架通过条件变分推断学习连续的流形到流形变换,避免脆弱的点到点映射。我们开发了实用的验证算法,包括信任度、连续性和Wasserstein距离度量,可以从有限样本凭经验验证同胚结构。实验证明:(1)从5%的CelebA像素恢复稀疏图像,并在多个稀疏水平下进行MNIST数字重建,(2)从MNIST到Fashion-MNIST的跨域分类器迁移无需再训练即实现86.73%的准确率,以及(3)对未见类别的zero-shot分类在MNIST上达到89.47%,在Fashion-MNIST上达到84.70%,在CIFAR-10上达到78.76%。重要的是,同胚标准能正确拒绝不兼容的数据集,防止无效的统一,并为将通用基础模型按原则分解为经验证的领域特定组件提供了可行途径。
摘要:We present the Universal Latent Homeomorphic Manifold (ULHM), a framework that unifies semantic representations (e.g., human descriptions, diagnostic labels) and observation-driven machine representations (e.g., pixel intensities, sensor readings) into a single latent structure. Despite originating from fundamentally different pathways, both modalities capture the same underlying reality. We establish \emph{homeomorphism}, a continuous bijection preserving topological structure, as the mathematical criterion for determining when latent manifolds induced by different semantic-observation pairs can be rigorously unified. This criterion provides theoretical guarantees for three critical applications: (1) semantic-guided sparse recovery from incomplete observations, (2) cross-domain transfer learning with verified structural compatibility, and (3) zero-shot compositional learning via valid transfer from semantic to observation space. Our framework learns continuous manifold-to-manifold transformations through conditional variational inference, avoiding brittle point-to-point mappings. We develop practical verification algorithms, including trust, continuity, and Wasserstein distance metrics, that empirically validate homeomorphic structure from finite samples. Experiments demonstrate: (1) sparse image recovery from 5\% of CelebA pixels and MNIST digit reconstruction at multiple sparsity levels, (2) cross-domain classifier transfer achieving 86.73\% accuracy from MNIST to Fashion-MNIST without retraining, and (3) zero-shot classification on unseen classes achieving 89.47\% on MNIST, 84.70\% on Fashion-MNIST, and 78.76\% on CIFAR-10. Critically, the homeomorphism criterion correctly rejects incompatible datasets, preventing invalid unification and providing a feasible way to principled decomposition of general foundation models into verified domain-specific components.
优化|敛散性(4篇)
【1】Terminally constrained flow-based generative models from an optimal control perspective
标题:最优控制视角下的终端约束基于流的生成模型
链接:https://arxiv.org/abs/2601.09474
作者:Weiguo Gao,Ming Li,Qianxiao Li
备注:59 pages, 9 figures
摘要:我们通过最优控制的形式化,解决了利用预训练的基于流的生成模型从终端约束分布中采样的问题。理论上,我们用Hamilton-Jacobi-Bellman方程刻画了值函数,并导出最优反馈控制作为相关哈密顿量的极小元。我们表明,随着控制惩罚的增大,受控过程恢复参考分布;而当惩罚消失时,终端分布收敛到约束流形上的广义Wasserstein投影。在算法上,我们引入了基于流模型的终端最优控制(TOCFlow),一种面向预训练流的几何感知采样时引导方法。在跟踪参考轨迹的终端共动坐标系中求解控制问题,可沿黎曼梯度得到封闭形式的标量阻尼因子,在无需矩阵求逆的情况下捕获二阶曲率效应。因此,TOCFlow以标准梯度引导的计算成本达到了高斯-牛顿更新的几何一致性。我们在三个高维科学任务上评估TOCFlow,涵盖等式、不等式和全局统计约束,即达西流、约束轨迹规划,以及具有Kolmogorov谱标度的湍流快照生成。在所有设置中,TOCFlow在保持参考模型生成质量的同时,相比欧几里得引导和投影基线提高了约束满足度。
摘要:We address the problem of sampling from terminally constrained distributions with pre-trained flow-based generative models through an optimal control formulation. Theoretically, we characterize the value function by a Hamilton-Jacobi-Bellman equation and derive the optimal feedback control as the minimizer of the associated Hamiltonian. We show that as the control penalty increases, the controlled process recovers the reference distribution, while as the penalty vanishes, the terminal law converges to a generalized Wasserstein projection onto the constraint manifold. Algorithmically, we introduce Terminal Optimal Control with Flow-based models (TOCFlow), a geometry-aware sampling-time guidance method for pre-trained flows. Solving the control problem in a terminal co-moving frame that tracks reference trajectories yields a closed-form scalar damping factor along the Riemannian gradient, capturing second-order curvature effects without matrix inversions. TOCFlow therefore matches the geometric consistency of Gauss-Newton updates at the computational cost of standard gradient guidance. We evaluate TOCFlow on three high-dimensional scientific tasks spanning equality, inequality, and global statistical constraints, namely Darcy flow, constrained trajectory planning, and turbulence snapshot generation with Kolmogorov spectral scaling. Across all settings, TOCFlow improves constraint satisfaction over Euclidean guidance and projection baselines while preserving the reference model's generative quality.
【2】GIFT: Unlocking Global Optimality in Post-Training via Finite-Temperature Gibbs Initialization
标题:GIFT:通过有限温度吉布斯初始化解锁后训练中的全局最优性
链接:https://arxiv.org/abs/2601.09233
作者:Zhengyang Zhao,Lu Ma,Yizhen Jiang,Xiaochen Ma,Zimo Meng,Chengyu Shen,Lexiang Tang,Haoze Sun,Peng Pei,Wentao Zhang
摘要:大型推理模型(LRM)的主流后训练范式,即监督微调(SFT)后接强化学习(RL),存在内在的优化不匹配:SFT固有的刚性监督导致分布崩溃,从而耗尽后续RL所需的探索空间。在本文中,我们在统一的后训练框架内重新表述SFT,并提出有限温度吉布斯初始化(GIFT)。我们将标准SFT刻画为抑制基础模型先验的退化零温度极限。相反,GIFT将监督作为有限温度下的能量势引入,建立了确保整个后训练管线目标一致性的分布式桥梁。我们的实验表明,用于RL初始化时,GIFT显著优于标准SFT和其他有竞争力的基线,为在后训练中实现全局最优提供了具有数学原理的途径。我们的代码可在https://github.com/zzy1127/GIFT上获得。
摘要:The prevailing post-training paradigm for Large Reasoning Models (LRMs)--Supervised Fine-Tuning (SFT) followed by Reinforcement Learning (RL)--suffers from an intrinsic optimization mismatch: the rigid supervision inherent in SFT induces distributional collapse, thereby exhausting the exploration space necessary for subsequent RL. In this paper, we reformulate SFT within a unified post-training framework and propose Gibbs Initialization with Finite Temperature (GIFT). We characterize standard SFT as a degenerate zero-temperature limit that suppresses base priors. Conversely, GIFT incorporates supervision as a finite-temperature energy potential, establishing a distributional bridge that ensures objective consistency throughout the post-training pipeline. Our experiments demonstrate that GIFT significantly outperforms standard SFT and other competitive baselines when utilized for RL initialization, providing a mathematically principled pathway toward achieving global optimality in post-training. Our code is available at https://github.com/zzy1127/GIFT.
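GIFT 将监督视为有限温度下的能量势。下面的片段演示吉布斯分布 p_i ∝ exp(-E_i/T) 随温度的行为:T→0 时坍缩为独热分布(对应摘要所说标准 SFT 的零温度极限),而有限温度保留熵,即探索空间(能量取值为示意):

```python
import math

def gibbs_weights(energies, temperature):
    """Gibbs distribution p_i ∝ exp(-E_i / T), computed with the usual
    max-shift for numerical stability."""
    logits = [-e / temperature for e in energies]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

energies = [0.0, 1.0, 2.0]            # supervision encoded as an energy potential
cold = gibbs_weights(energies, 0.01)  # near the zero-temperature (SFT-like) limit
warm = gibbs_weights(energies, 1.0)   # finite temperature
print(round(cold[0], 6))              # 1.0: distributional collapse onto the target
print(entropy(warm) > entropy(cold))  # True: finite T keeps exploration mass
```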
【3】DP-FEDSOFIM: Differentially Private Federated Stochastic Optimization using Regularized Fisher Information Matrix
标题:DP-FEDSOFIM:使用正则化Fisher信息矩阵的差分隐私联邦随机优化
链接:https://arxiv.org/abs/2601.09166
作者:Sidhant R. Nair,Tanmay Sen,Mrinmay Sen
备注:17 pages, 1 figure. Submitted to ICML 2026
摘要:差分隐私联邦学习(DP-FL)在隐私预算紧张的情况下收敛缓慢,这是由于为保护隐私而引入的压倒性噪声。虽然自适应优化器可以加速收敛,但现有的二阶方法(如DP-FedNew)在每个客户端需要O(d^2)内存来维护局部特征协方差矩阵,这使得它们对于高维模型不切实际。我们提出了DP-FedSOFIM,一个服务器端的二阶优化框架,利用Fisher信息矩阵(FIM)作为自然梯度预处理器,同时每个客户端只需要O(d)内存。通过采用Sherman-Morrison公式进行高效的矩阵求逆,DP-FedSOFIM实现了每轮O(d)的计算复杂度,同时保持了二阶方法的收敛优势。我们的分析证明,服务器端的预处理通过后处理定理保持(ε,δ)-差分隐私。在CIFAR-10上的实证评估表明,DP-FedSOFIM在多种隐私设置下均取得优于一阶基线的测试准确率。
摘要:Differentially private federated learning (DP-FL) suffers from slow convergence under tight privacy budgets due to the overwhelming noise introduced to preserve privacy. While adaptive optimizers can accelerate convergence, existing second-order methods such as DP-FedNew require O(d^2) memory at each client to maintain local feature covariance matrices, making them impractical for high-dimensional models. We propose DP-FedSOFIM, a server-side second-order optimization framework that leverages the Fisher Information Matrix (FIM) as a natural gradient preconditioner while requiring only O(d) memory per client. By employing the Sherman-Morrison formula for efficient matrix inversion, DP-FedSOFIM achieves O(d) computational complexity per round while maintaining the convergence benefits of second-order methods. Our analysis proves that the server-side preconditioning preserves (epsilon, delta)-differential privacy through the post-processing theorem. Empirical evaluation on CIFAR-10 demonstrates that DP-FedSOFIM achieves superior test accuracy compared to first-order baselines across multiple privacy regimes.
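下面给出一个极简示意,展示为何Sherman-Morrison公式可将自然梯度预处理降为O(d)。此处假设FIM用秩一代理 λI + ggᵀ 近似——这一具体构造以及λ的取值均为说明性假设,论文中实际的FIM估计可能不同:

```python
import numpy as np

def sm_precondition(g, lam=1e-2):
    """Apply (lam*I + g g^T)^{-1} to g in O(d) via Sherman-Morrison.

    The rank-one FIM surrogate lam*I + g g^T is an illustrative assumption;
    the paper's exact FIM construction may differ.
    """
    # Sherman-Morrison with A = lam*I, u = v = g collapses to a scalar:
    # (lam*I + g g^T)^{-1} g = g / (lam + g.g)
    return g / (lam + g @ g)

g = np.array([3.0, 4.0])          # toy gradient, d = 2
pg = sm_precondition(g, lam=1.0)  # preconditioned (natural-gradient) step

# Sanity check against the explicit O(d^2)/O(d^3) dense inverse
F = 1.0 * np.eye(2) + np.outer(g, g)
assert np.allclose(pg, np.linalg.solve(F, g))
```

对秩一扰动,预处理后的梯度退化为对g的一个标量缩放,因此每轮只需O(d)的存储与计算。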
【4】Block Decomposable Methods for Large-Scale Optimization Problems
标题:大规模优化问题的块分解方法
链接:https://arxiv.org/abs/2601.09010
作者:Leandro Farias Maia
摘要:本论文研究大规模优化问题的块可分解方法,重点是交替方向乘子法(ADMM)和块坐标下降(BCD)方法。具体而言,论文提出了一种新的邻近ADMM算法,并给出了两种BCD方法。研究的第一部分提出了一种新的邻近ADMM算法。该方法对问题的所有参数自适应,并且不精确地求解邻近增广拉格朗日(AL)子问题。这种自适应性使该算法能够高效地应用于大量实际问题;邻近AL子问题的不精确求解则克服了ADMM实际应用中的许多关键挑战。所得算法在与邻近ADMM方法类的最新复杂度相匹配的迭代次数内获得优化问题的近似解。第二部分研究针对块邻近梯度方法类的不精确邻近映射,建立了该算子的关键性质,从而便于推导所提算法的收敛速度。在两个误差递减条件下,该算法的收敛速度与其精确计算版本相匹配。数值结果表明,动态误差机制下该算法的性能优于固定误差机制。论文最后为应用于一类广泛函数(即Hölder光滑函数)的随机BCD方法提供了收敛保证,并针对非凸、凸和强凸函数推导了收敛速度。这些收敛速度与现有文献中Lipschitz光滑设定下的结果相匹配。
摘要:This dissertation explores block decomposable methods for large-scale optimization problems. It focuses on alternating direction method of multipliers (ADMM) schemes and block coordinate descent (BCD) methods. Specifically, it introduces a new proximal ADMM algorithm and proposes two BCD methods. The first part of the research presents a new proximal ADMM algorithm. This method is adaptive to all problem parameters and solves the proximal augmented Lagrangian (AL) subproblem inexactly. This adaptiveness facilitates the highly efficient application of the algorithm to a broad swath of practical problems. The inexact solution of the proximal AL subproblem overcomes many key challenges in the practical applications of ADMM. The resultant algorithm obtains an approximate solution of an optimization problem in a number of iterations that matches the state-of-the-art complexity for the class of proximal ADMM schemes. The second part of the research focuses on an inexact proximal mapping for the class of block proximal gradient methods. Key properties of this operator are established, facilitating the derivation of convergence rates for the proposed algorithm. Under two error-decrease conditions, the algorithm matches the convergence rate of its exactly computed counterpart. Numerical results demonstrate the superior performance of the algorithm under a dynamic error regime over a fixed one. The dissertation concludes by providing convergence guarantees for the randomized BCD method applied to a broad class of functions, known as Hölder smooth functions. Convergence rates are derived for non-convex, convex, and strongly convex functions. These convergence rates match those furnished in the existing literature for the Lipschitz smooth setting.
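作为背景,论文所基于的经典邻近ADMM迭代(针对标准形式 min f(x)+g(z) s.t. Ax+Bz=b;此处仅作示意,并非论文的自适应、不精确求解版本)可写为:

```latex
\begin{aligned}
x^{k+1} &= \operatorname*{argmin}_{x}\; f(x) + \tfrac{\rho}{2}\bigl\|Ax + Bz^{k} - b + \tfrac{y^{k}}{\rho}\bigr\|^{2} + \tfrac{1}{2}\|x - x^{k}\|_{P}^{2},\\
z^{k+1} &= \operatorname*{argmin}_{z}\; g(z) + \tfrac{\rho}{2}\bigl\|Ax^{k+1} + Bz - b + \tfrac{y^{k}}{\rho}\bigr\|^{2},\\
y^{k+1} &= y^{k} + \rho\,\bigl(Ax^{k+1} + Bz^{k+1} - b\bigr).
\end{aligned}
```

论文的贡献在于允许x-子问题仅被不精确求解,并对惩罚参数ρ与邻近项P自适应取值。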
预测|估计(3篇)
【1】XLinear: A Lightweight and Accurate MLP-Based Model for Long-Term Time Series Forecasting with Exogenous Inputs
标题:XLinear:一个轻量级且准确的基于MLP的模型,用于具有外生输入的长期时间序列预测
链接:https://arxiv.org/abs/2601.09237
作者:Xinyang Chen,Huidong Jin,Yu Huang,Zaiwen Feng
备注:Accepted by AAAI 2026
摘要:尽管在长期时间序列预测模型中普遍假设变量的重要性是一致的,但现实世界的应用往往表现出不对称的因果关系和不同的数据获取成本。具体而言,具有成本效益的外生数据(例如,当地天气)可以单方面影响内源变量的动态,如湖面温度。利用这些联系,可以在随时可以获得外源投入的情况下进行更有效的预测。基于transformer的模型捕获长范围的依赖关系,但会导致高计算量,并遭受置换不变性。基于补丁的变体提高了效率,但可能会错过局部时间模式。为了有效地利用跨时间维度和相关外生变量的信息信号,本研究提出了XLinear,这是一种基于多层感知器(MLP)的轻量级时间序列预测模型。XLinear使用从内生变量派生的全局令牌作为与外生变量交互的关键枢纽,并采用具有sigmoid激活的MLP来提取时间模式和变量依赖性。其预测头然后整合这些信号来预测内生序列。我们在七个标准基准测试和五个具有外源输入的真实数据集上评估XLinear。与最先进的模型相比,XLinear对于受外生输入影响的多变量预测和单变量预测都具有卓越的准确性和效率。
摘要:Despite the prevalent assumption of uniform variable importance in long-term time series forecasting models, real world applications often exhibit asymmetric causal relationships and varying data acquisition costs. Specifically, cost-effective exogenous data (e.g., local weather) can unilaterally influence dynamics of endogenous variables, such as lake surface temperature. Exploiting these links enables more effective forecasts when exogenous inputs are readily available. Transformer-based models capture long-range dependencies but incur high computation and suffer from permutation invariance. Patch-based variants improve efficiency yet can miss local temporal patterns. To efficiently exploit informative signals across both the temporal dimension and relevant exogenous variables, this study proposes XLinear, a lightweight time series forecasting model built upon MultiLayer Perceptrons (MLPs). XLinear uses a global token derived from an endogenous variable as a pivotal hub for interacting with exogenous variables, and employs MLPs with sigmoid activation to extract both temporal patterns and variate-wise dependencies. Its prediction head then integrates these signals to forecast the endogenous series. We evaluate XLinear on seven standard benchmarks and five real-world datasets with exogenous inputs. Compared with state-of-the-art models, XLinear delivers superior accuracy and efficiency for both multivariate forecasts and univariate forecasts influenced by exogenous inputs.
【2】Resolving Predictive Multiplicity for the Rashomon Set
标题:解决罗生门集的预测多重性
链接:https://arxiv.org/abs/2601.09071
作者:Parian Haghighat,Hadis Anahideh,Cynthia Rudin
摘要:对于给定的预测任务,存在多个同样准确的模型会导致预测多重性:一组"罗生门集"模型达到相似的准确率,但在各自的预测上存在分歧。这种不一致性破坏了对需要一致预测的高风险应用的信任。我们提出了三种方法来减少罗生门集成员之间的预测不一致。第一种方法是离群值校正(outlier correction)。离群值的标签是任何好模型都无法正确预测的。离群值可能导致罗生门集在局部区域产生高方差预测,因此修复它们可以降低方差。第二种方法是局部修补(local patching)。在测试点周围的局部区域中,模型可能因为其中一些存在偏差而彼此不一致;我们可以使用验证集来检测并修复这些偏差,这同样减少了多重性。第三种方法是成对协调(pairwise reconciliation):我们找出在测试点周围区域上不一致的模型对,并修改不一致的预测,使其偏差更小。这三种方法可以一起使用,也可以单独使用,各有优势。协调后的预测随后可以被提炼为单个可解释模型,用于实际部署。在多个数据集的实验中,我们的方法在保持有竞争力的准确率的同时降低了分歧指标。
摘要:The existence of multiple, equally accurate models for a given predictive task leads to predictive multiplicity, where a "Rashomon set" of models achieve similar accuracy but diverges in their individual predictions. This inconsistency undermines trust in high-stakes applications where we want consistent predictions. We propose three approaches to reduce inconsistency among predictions for the members of the Rashomon set. The first approach is outlier correction. An outlier has a label that none of the good models are capable of predicting correctly. Outliers can cause the Rashomon set to have high variance predictions in a local area, so fixing them can lower variance. Our second approach is local patching. In a local region around a test point, models may disagree with each other because some of them are biased. We can detect and fix such biases using a validation set, which also reduces multiplicity. Our third approach is pairwise reconciliation, where we find pairs of models that disagree on a region around the test point. We modify predictions that disagree, making them less biased. These three approaches can be used together or separately, and they each have distinct advantages. The reconciled predictions can then be distilled into a single interpretable model for real-world deployment. In experiments across multiple datasets, our methods reduce disagreement metrics while maintaining competitive accuracy.
【3】XGBoost Forecasting of NEPSE Index Log Returns with Walk Forward Validation
标题:基于前向滚动验证的XGBoost NEPSE指数对数收益预测
链接:https://arxiv.org/abs/2601.08896
作者:Sahaj Raj Malla,Shreeyash Kayastha,Rumi Suwal,Harish Chandra Bhandari,Rajendra Adhikari
备注:9 pages, 4 figures, 3 tables
摘要:本研究开发了一个稳健的机器学习框架,使用XGBoost回归器对尼泊尔证券交易所(NEPSE)指数的每日对数收益进行提前一步预测。构建了一套全面的特征集,包括滞后对数收益(最多30天)以及既定的技术指标,如短期和中期滚动波动率指标和14周期相对强弱指数。超参数优化使用Optuna在初始训练段上通过时间序列交叉验证进行。样本外性能通过前向滚动(walk-forward)验证,在扩展窗口和固定长度滚动窗口两种方案下、跨多种滞后配置进行严格评估,以模拟真实部署并避免前视偏差。预测准确性使用均方根误差、平均绝对误差、决定系数(R平方),以及对数收益和重构收盘价上的方向准确率进行评估。实证结果表明,最优配置(扩展窗口、20个滞后)优于调参后的ARIMA和岭回归基准,实现了最低的对数收益RMSE(0.013450)和MAE(0.009814),方向准确率达65.15%。虽然R平方仍然适中(与金融收益的噪声性质一致),但重点放在相对误差降低和方向预测上。特征重要性分析和目视检查进一步增强了可解释性。这些发现证明了梯度提升集成在波动的新兴市场时间序列非线性动态建模中的有效性,并为NEPSE指数预测建立了可复现的基准。
摘要:This study develops a robust machine learning framework for one-step-ahead forecasting of daily log-returns in the Nepal Stock Exchange (NEPSE) Index using the XGBoost regressor. A comprehensive feature set is engineered, including lagged log-returns (up to 30 days) and established technical indicators such as short- and medium-term rolling volatility measures and the 14-period Relative Strength Index. Hyperparameter optimization is performed using Optuna with time-series cross-validation on the initial training segment. Out-of-sample performance is rigorously assessed via walk-forward validation under both expanding and fixed-length rolling window schemes across multiple lag configurations, simulating real-world deployment and avoiding lookahead bias. Predictive accuracy is evaluated using root mean squared error, mean absolute error, coefficient of determination (R-squared), and directional accuracy on both log-returns and reconstructed closing prices. Empirical results show that the optimal configuration, an expanding window with 20 lags, outperforms tuned ARIMA and Ridge regression benchmarks, achieving the lowest log-return RMSE (0.013450) and MAE (0.009814) alongside a directional accuracy of 65.15%. While the R-squared remains modest, consistent with the noisy nature of financial returns, primary emphasis is placed on relative error reduction and directional prediction. Feature importance analysis and visual inspection further enhance interpretability. These findings demonstrate the effectiveness of gradient boosting ensembles in modeling nonlinear dynamics in volatile emerging market time series and establish a reproducible benchmark for NEPSE Index forecasting.
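摘要中的扩展窗口前向滚动验证流程可以如下示意。为保持零依赖,这里用岭回归的闭式解代替论文中的XGBoost回归器,数据为合成对数收益,仅演示验证流程本身:

```python
import numpy as np

rng = np.random.default_rng(0)
log_ret = rng.normal(0.0, 0.01, size=300)   # synthetic daily log-returns

def make_lags(x, n_lags):
    """Rows of [x_{t}, ..., x_{t+n_lags-1}] predicting x_{t+n_lags}."""
    X = np.column_stack([x[i:len(x) - n_lags + i] for i in range(n_lags)])
    return X, x[n_lags:]

def walk_forward(x, n_lags=20, start=200, alpha=1.0):
    """Expanding-window walk-forward validation: at each step t, fit only on
    data strictly before t, then predict step t (no lookahead)."""
    X, y = make_lags(x, n_lags)
    preds, truth = [], []
    for t in range(start, len(y)):
        Xtr, ytr = X[:t], y[:t]              # expanding training window
        # Ridge closed form as a dependency-free stand-in for XGBoost
        w = np.linalg.solve(Xtr.T @ Xtr + alpha * np.eye(n_lags), Xtr.T @ ytr)
        preds.append(X[t] @ w)
        truth.append(y[t])
    preds, truth = np.array(preds), np.array(truth)
    rmse = np.sqrt(np.mean((preds - truth) ** 2))
    acc = np.mean(np.sign(preds) == np.sign(truth))  # directional accuracy
    return rmse, acc

rmse, acc = walk_forward(log_ret)
```

固定长度滚动窗口方案只需把 `X[:t]` 改为 `X[t - W:t]`(窗口长度W为超参数)。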
其他神经网络|深度学习|模型|建模(17篇)
【1】Exploring Fine-Tuning for Tabular Foundation Models
标题:表格基础模型的微调研究
链接:https://arxiv.org/abs/2601.09654
作者:Aditya Tanna,Pratinav Seth,Mohamed Bouadi,Vinay Kumar Sankarapu
摘要:表格基础模型(TFM)最近在结构化数据上显示出强大的上下文学习能力,实现了与传统机器学习方法相当的zero-shot性能。我们发现,zero-shot TFMs已经实现了强大的性能,而微调的好处是高度依赖于模型和数据。元学习和PEFT在特定条件下提供适度的增益,而全监督微调(SFT)通常会降低精度或校准质量。这项工作提出了第一个全面的研究微调TFMs跨基准,包括TALENT,OpenML-CC 18和TabZilla。我们比较了Zero-Shot、元学习、监督(SFT)和参数有效(PEFT)方法,分析了数据集因素(如不平衡、大小和维度)如何影响结果。我们的研究结果涵盖了性能、校准和公平性,为何时微调最有益及其局限性提供了实用指南。
摘要:Tabular Foundation Models (TFMs) have recently shown strong in-context learning capabilities on structured data, achieving zero-shot performance comparable to traditional machine learning methods. We find that zero-shot TFMs already achieve strong performance, while the benefits of fine-tuning are highly model and data-dependent. Meta-learning and PEFT provide moderate gains under specific conditions, whereas full supervised fine-tuning (SFT) often reduces accuracy or calibration quality. This work presents the first comprehensive study of fine-tuning in TFMs across benchmarks including TALENT, OpenML-CC18, and TabZilla. We compare Zero-Shot, Meta-Learning, Supervised (SFT), and parameter-efficient (PEFT) approaches, analyzing how dataset factors such as imbalance, size, and dimensionality affect outcomes. Our findings cover performance, calibration, and fairness, offering practical guidelines on when fine-tuning is most beneficial and its limitations.
【2】Identifying Models Behind Text-to-Image Leaderboards
标题:识别文本到图像排行榜背后的模型
链接:https://arxiv.org/abs/2601.09647
作者:Ali Naseh,Yuefeng Peng,Anshuman Suri,Harsh Chaudhari,Alina Oprea,Amir Houmansadr
摘要:文本到图像(T2I)模型越来越受欢迎,在线生成了大量AI生成的图像。为了比较模型质量,基于投票的排行榜已成为标准,依靠匿名化的模型输出来保证公平性。在这项工作中,我们表明这种匿名性可以很容易地被打破。我们发现,每个T2I模型的生成结果在图像嵌入空间中形成独特的聚类,从而在没有提示控制或训练数据的情况下实现准确的去匿名化。使用22个模型和280个提示(15万张图像),我们基于质心的方法实现了高准确率,并揭示了系统性的模型特有签名。我们进一步引入了一个提示级的可区分性度量,并进行大规模分析,展示某些提示如何导致近乎完美的可区分性。我们的发现揭示了T2I排行榜中的基本安全缺陷,并激励更强的匿名化防御。
摘要:Text-to-image (T2I) models are increasingly popular, producing a large share of AI-generated images online. To compare model quality, voting-based leaderboards have become the standard, relying on anonymized model outputs for fairness. In this work, we show that such anonymity can be easily broken. We find that generations from each T2I model form distinctive clusters in the image embedding space, enabling accurate deanonymization without prompt control or training data. Using 22 models and 280 prompts (150K images), our centroid-based method achieves high accuracy and reveals systematic model-specific signatures. We further introduce a prompt-level distinguishability metric and conduct large-scale analyses showing how certain prompts can lead to near-perfect distinguishability. Our findings expose fundamental security flaws in T2I leaderboards and motivate stronger anonymization defenses.
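基于质心的去匿名化思路可用如下玩具示例说明。嵌入为合成高斯聚类;维度、模型数与噪声尺度均为假设,并非论文所用的真实图像嵌入:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy setup: 3 "models", each producing image embeddings clustered around a
# model-specific centroid (a stand-in for real image-embedding vectors).
centroids = rng.normal(size=(3, 8))

def embed(model_id, n):
    return centroids[model_id] + 0.1 * rng.normal(size=(n, 8))

# "Training": estimate one centroid per model from known generations
train = {m: embed(m, 50) for m in range(3)}
est = np.stack([train[m].mean(axis=0) for m in range(3)])

def deanonymize(e):
    """Assign an anonymous embedding to the nearest estimated centroid."""
    return int(np.argmin(np.linalg.norm(est - e, axis=1)))

# Evaluate on fresh anonymous generations
queries = [(m, embed(m, 1)[0]) for m in range(3) for _ in range(20)]
acc = np.mean([deanonymize(e) == m for m, e in queries])
```

只要各模型的生成在嵌入空间中聚类良好(摘要的核心观察),最近质心分类即可高准确率地破解匿名性。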
【3】Deep Operator Networks for Surrogate Modeling of Cyclic Adsorption Processes with Varying Initial Conditions
标题:用于不同初始条件下循环吸附过程代理建模的深度算子网络
链接:https://arxiv.org/abs/2601.09491
作者:Beatrice Ceccanti,Mattia Galanti,Ivo Roghair,Martin van Sint Annaland
备注:36 pages, 11 figures
摘要:深度算子网络正在成为各种神经网络类型中学习函数空间之间映射的基本工具,最近由于其近似非线性算子的能力而受到关注。特别是,DeepONets为PDE求解提供了一种自然的公式,因为偏微分方程的解可以被解释为将初始条件映射到其相应解域的算子。在这项工作中,我们将DeepONets应用于吸附技术的过程建模,以评估其作为循环吸附过程模拟和优化的替代品的可行性。我们的目标是加速收敛的循环过程,如温度-真空变压吸附(TVSA),这需要重复的解决方案的瞬态偏微分方程,这是计算昂贵的。由于循环吸附过程的每一步都从前一步的最终状态开始,因此有效的替代建模需要在广泛的初始条件下进行泛化。控制方程表现出陡峭的旅行前线,提供了一个苛刻的基准操作员学习。为了评估这些条件下的函数泛化,我们构建了一个由异构初始条件组成的混合训练数据集,并训练DeepONets来近似相应的解算子。然后,在训练过程中使用的参数范围之外的初始条件以及完全看不见的函数形式上测试训练后的模型。结果证明了训练分布内外的准确预测,突出了DeepONets作为加速循环吸附模拟和优化工作流程的潜在有效替代品。
摘要:Deep Operator Networks are emerging as fundamental tools among various neural network types to learn mappings between function spaces, and have recently gained attention due to their ability to approximate nonlinear operators. In particular, DeepONets offer a natural formulation for PDE solving, since the solution of a partial differential equation can be interpreted as an operator mapping an initial condition to its corresponding solution field. In this work, we applied DeepONets in the context of process modeling for adsorption technologies, to assess their feasibility as surrogates for cyclic adsorption process simulation and optimization. The goal is to accelerate convergence of cyclic processes such as Temperature-Vacuum Swing Adsorption (TVSA), which require repeated solution of transient PDEs, which are computationally expensive. Since each step of a cyclic adsorption process starts from the final state of the preceding step, effective surrogate modeling requires generalization across a wide range of initial conditions. The governing equations exhibit steep traveling fronts, providing a demanding benchmark for operator learning. To evaluate functional generalization under these conditions, we construct a mixed training dataset composed of heterogeneous initial conditions and train DeepONets to approximate the corresponding solution operators. The trained models are then tested on initial conditions outside the parameter ranges used during training, as well as on completely unseen functional forms. The results demonstrate accurate predictions both within and beyond the training distribution, highlighting DeepONets as potential efficient surrogates for accelerating cyclic adsorption simulations and optimization workflows.
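DeepONet的分支-主干结构 G(u)(y) ≈ Σ_k b_k(u)·t_k(y) 可用一个未经训练的前向传播示意。网络宽度、基底维数与tanh激活均为示意性选择,并非论文所用配置:

```python
import numpy as np

rng = np.random.default_rng(0)

m, p, hidden = 32, 16, 64            # sensors, basis size, hidden width

# Random (untrained) weights for a one-hidden-layer branch and trunk net
Wb1 = rng.normal(size=(hidden, m)) / np.sqrt(m)
Wb2 = rng.normal(size=(p, hidden)) / np.sqrt(hidden)
Wt1 = rng.normal(size=(hidden, 1))
Wt2 = rng.normal(size=(p, hidden)) / np.sqrt(hidden)

def branch(u):
    """Input function u sampled at m sensors -> p coefficients b_k(u)."""
    return Wb2 @ np.tanh(Wb1 @ u)

def trunk(y):
    """Query coordinate y -> p basis values t_k(y)."""
    return Wt2 @ np.tanh(Wt1 @ np.atleast_1d(y))

def deeponet(u, y):
    # G(u)(y) approximated as the inner product of branch and trunk outputs
    return float(branch(u) @ trunk(y))

u0 = np.sin(np.linspace(0, np.pi, m))    # e.g. one initial condition
out = deeponet(u0, 0.5)
```

在论文的设定中,分支网络的输入即循环过程每一步的初始状态,这正是要求算子在不同初始条件间泛化的原因。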
【4】SimMerge: Learning to Select Merge Operators from Similarity Signals
标题:SimMerge:学习从相似性信号中选择合并运算符
链接:https://arxiv.org/abs/2601.09473
作者:Oliver Bolton,Aakanksha,Arash Ahmadian,Sara Hooker,Marzieh Fadaee,Beyza Ermis
摘要:模型合并可以将多个大型语言模型(LLM)合并到单个模型中,同时保持性能。这使其成为LLM开发中的一个有价值的工具,为多任务训练提供了有竞争力的替代方案。然而,大规模合并可能很困难,因为成功的合并需要选择正确的合并运算符、选择正确的模型,并以正确的顺序合并它们。这通常导致研究人员运行昂贵的合并-评估搜索来选择最佳合并。在这项工作中,我们通过引入SimMerge提供了一种替代方案:一种预测式合并选择方法,它使用模型之间廉价的、与任务无关的相似性信号来选择最佳合并。从一小组未标记的探针出发,我们计算功能和结构特征,并用它们来预测给定2路合并的性能。利用这些预测,SimMerge选择最佳合并运算符、要合并的模型子集以及合并顺序,从而消除了昂贵的合并-评估循环。我们证明,在7B参数LLM的2路合并上,SimMerge超过了标准合并运算符的性能,并且无需重新训练即可推广到多路合并和111B参数LLM的合并。此外,我们提出了一个bandit变体,支持动态添加新的任务、模型和运算符。我们的结果表明,当检查点目录很大且评估预算紧张时,学习如何合并是实现可扩展模型组合的一条实用路线。
摘要:Model merging enables multiple large language models (LLMs) to be combined into a single model while preserving performance. This makes it a valuable tool in LLM development, offering a competitive alternative to multi-task training. However, merging can be difficult at scale, as successful merging requires choosing the right merge operator, selecting the right models, and merging them in the right order. This often leads researchers to run expensive merge-and-evaluate searches to select the best merge. In this work, we provide an alternative by introducing SimMerge, a predictive merge-selection method that selects the best merge using inexpensive, task-agnostic similarity signals between models. From a small set of unlabeled probes, we compute functional and structural features and use them to predict the performance of a given 2-way merge. Using these predictions, SimMerge selects the best merge operator, the subset of models to merge, and the merge order, eliminating the expensive merge-and-evaluate loop. We demonstrate that we surpass standard merge-operator performance on 2-way merges of 7B-parameter LLMs, and that SimMerge generalizes to multi-way merges and 111B-parameter LLM merges without retraining. Additionally, we present a bandit variant that supports adding new tasks, models, and operators on the fly. Our results suggest that learning how to merge is a practical route to scalable model composition when checkpoint catalogs are large and evaluation budgets are tight.
【5】SoK: Enhancing Cryptographic Collaborative Learning with Differential Privacy
标题:SoK:通过差分隐私增强加密协作学习
链接:https://arxiv.org/abs/2601.09460
作者:Francesco Capano,Jonas Böhler,Benjamin Weggenmann
备注:This work has been accepted for publication at the IEEE Conference on Secure and Trustworthy Machine Learning (SaTML 2026)
摘要:在协作学习(CL)中,多方在其私有数据集上联合训练机器学习模型。但是,由于隐私问题,数据不能直接共享。为了确保输入机密性,密码技术,例如,多方计算(MPC),使加密数据的训练。然而,即使是经过安全训练的模型也容易受到旨在从模型输出中提取记忆数据的推理攻击。为了确保输出隐私并减轻推断攻击,差分隐私(DP)在训练期间注入校准噪声。虽然密码学和DP提供了互补的保证,但将它们有效地结合起来用于密码学和差分私有CL(CPCL)是一项挑战。加密会带来性能开销,而DP会降低准确性,从而产生需要仔细设计考虑的隐私-准确性-性能权衡。这项工作使CPCL景观系统化。我们介绍了一个统一的框架,概括了跨CPCL范例的共同阶段,并确定安全的噪声采样作为实现CPCL的基础阶段。我们分析权衡不同的安全噪声采样技术,噪声类型和DP机制,讨论他们的实施挑战,并评估其准确性和加密开销跨CPCL范例。此外,我们实现了确定的安全噪声采样MPC选项,并评估其计算和通信成本在WAN和LAN。最后,我们提出了未来的研究方向的基础上确定的关键意见,差距和可能的增强文献。
摘要:In collaborative learning (CL), multiple parties jointly train a machine learning model on their private datasets. However, data can not be shared directly due to privacy concerns. To ensure input confidentiality, cryptographic techniques, e.g., multi-party computation (MPC), enable training on encrypted data. Yet, even securely trained models are vulnerable to inference attacks aiming to extract memorized data from model outputs. To ensure output privacy and mitigate inference attacks, differential privacy (DP) injects calibrated noise during training. While cryptography and DP offer complementary guarantees, combining them efficiently for cryptographic and differentially private CL (CPCL) is challenging. Cryptography incurs performance overheads, while DP degrades accuracy, creating a privacy-accuracy-performance trade-off that needs careful design considerations. This work systematizes the CPCL landscape. We introduce a unified framework that generalizes common phases across CPCL paradigms, and identify secure noise sampling as the foundational phase to achieve CPCL. We analyze trade-offs of different secure noise sampling techniques, noise types, and DP mechanisms discussing their implementation challenges and evaluating their accuracy and cryptographic overhead across CPCL paradigms. Additionally, we implement identified secure noise sampling options in MPC and evaluate their computation and communication costs in WAN and LAN. Finally, we propose future research directions based on identified key observations, gaps and possible enhancements in the literature.
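文中系统化的安全噪声采样选项之一——分布式高斯噪声——的核心思想可用如下明文示意:每方本地采样方差为σ²/n的高斯份额,份额之和即为方差σ²的校准噪声。真实系统中求和在MPC内完成,此处为去掉密码学部分的简化草图,具体协议细节以论文综述为准:

```python
import numpy as np

rng = np.random.default_rng(0)

def distributed_gaussian(n_parties, sigma, shape, rng):
    """Each party contributes N(0, sigma^2 / n) noise; summed (in a real
    system, inside the secure computation), the total is N(0, sigma^2).
    Plaintext sketch only -- no MPC here."""
    shares = [rng.normal(0.0, sigma / np.sqrt(n_parties), size=shape)
              for _ in range(n_parties)]
    return np.sum(shares, axis=0)

# Empirical check: aggregate noise standard deviation approaches sigma
sigma, n = 2.0, 5
samples = np.array([distributed_gaussian(n, sigma, (), rng)
                    for _ in range(20000)])
emp_std = samples.std()
```

这种拆分意味着任何单方(乃至不超过n-1方的合谋)都无法从自己的份额推出总噪声,同时聚合结果仍满足目标DP噪声分布。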
【6】Late Breaking Results: Quamba-SE: Soft-edge Quantizer for Activations in State Space Models
标题:最新突破性结果:Quamba-SE:状态空间模型激活的软边缘量化器
链接:https://arxiv.org/abs/2601.09451
作者:Yizhi Chen,Ahmed Hemani
备注:Accepted to DATE Late Breaking Results 2026, Verona, Italy
摘要:我们提出了Quamba-SE,这是一种用于状态空间模型(SSM)激活量化的软边缘量化器。与现有方法不同,Quamba-SE使用标准的INT8运算,采用三种自适应尺度:小值的高精度、正常值的标准尺度,以及离群值的低精度。这将保留离群值信息而不是硬裁剪,同时保持其他值的精度。我们在6个zero-shot基准上对Mamba-130M进行评估。结果显示,Quamba-SE始终优于Quamba,在单个基准测试中最高提升+2.68%,在6个数据集的平均准确度上最高提高+0.83%。
摘要:We propose Quamba-SE, a soft-edge quantizer for State Space Model (SSM) activation quantization. Unlike existing methods, using standard INT8 operation, Quamba-SE employs three adaptive scales: high-precision for small values, standard scale for normal values, and low-precision for outliers. This preserves outlier information instead of hard clipping, while maintaining precision for other values. We evaluate on Mamba-130M across 6 zero-shot benchmarks. Results show that Quamba-SE consistently outperforms Quamba, achieving up to +2.68% on individual benchmarks and up to +0.83% improvement in the average accuracy of 6 datasets.
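三种自适应尺度的软边缘量化思想可作如下示意。阈值 t_small、t_out 以及各档尺度的具体选取均为说明性假设,并非Quamba-SE的实际校准方案:

```python
import numpy as np

def soft_edge_quant(x, t_small=0.1, t_out=2.0, bits=8):
    """Three-scale INT8-style fake-quantization: fine scale for small values,
    standard scale for normal values, coarse scale for outliers (so outliers
    are represented, not hard-clipped). Thresholds are illustrative."""
    qmax = 2 ** (bits - 1) - 1                 # 127 for INT8
    small = np.abs(x) < t_small
    outlier = np.abs(x) > t_out
    normal = ~small & ~outlier
    scales = np.empty_like(x)
    scales[small] = t_small / qmax             # high precision for small values
    scales[normal] = t_out / qmax              # standard scale
    if outlier.any():
        # coarse scale covering the largest magnitude: no hard clipping
        scales[outlier] = np.abs(x[outlier]).max() / qmax
    q = np.clip(np.round(x / scales), -qmax, qmax)
    return q * scales                          # dequantized values

x = np.array([0.05, 0.5, 1.5, 6.0, -0.02, -3.0])
xq = soft_edge_quant(x)
err = np.abs(xq - x)
```

对比之下,单一尺度 t_out/qmax 会把 6.0 硬裁剪到 2.0;此处离群值以较粗的专用尺度保留,小值则获得更细的量化步长。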
【7】DeepLight: A Sobolev-trained Image-to-Image Surrogate Model for Light Transport in Tissue
标题:DeepLight:Sobolev训练的图像到图像替代模型,用于组织中光传输
链接:https://arxiv.org/abs/2601.09439
作者:Philipp Haim,Vasilis Ntziachristos,Torsten Enßlin,Dominik Jüstel
摘要:在光声成像中,通过反转光传输来恢复组织的吸收系数仍然是一个具有挑战性的问题。解决这一问题的改进可以大大有利于光声成像的临床价值。现有的变分反演方法需要一个准确的和可微的模型,这种光传输。由于神经代理模型允许对复杂的物理过程进行快速和可微的模拟,因此它们被认为是用于解决此类逆问题的有前途的候选者。然而,通常不能保证这些代理模型的导数与底层物理操作符的导数准确匹配。由于精确的导数是解决反问题的核心,模型导数中的误差会大大阻碍高保真重建。为了克服这一局限性,我们提出了一种替代模型,用于组织中的光传输,该模型使用Sobolev训练来提高模型导数的准确性。此外,我们使用的Sobolev训练形式通常适用于高维模型。我们的研究结果表明,Sobolev训练的光传输代理模型不仅提高了导数的准确性,但也减少了泛化误差的分布和分布外的样本。这些改进有望大大提高代理模型在下游任务中的实用性,特别是在求解逆问题中。
摘要:In optoacoustic imaging, recovering the absorption coefficients of tissue by inverting the light transport remains a challenging problem. Improvements in solving this problem can greatly benefit the clinical value of optoacoustic imaging. Existing variational inversion methods require an accurate and differentiable model of this light transport. As neural surrogate models allow fast and differentiable simulations of complex physical processes, they are considered promising candidates to be used in solving such inverse problems. However, there are in general no guarantees that the derivatives of these surrogate models accurately match those of the underlying physical operator. As accurate derivatives are central to solving inverse problems, errors in the model derivative can considerably hinder high fidelity reconstructions. To overcome this limitation, we present a surrogate model for light transport in tissue that uses Sobolev training to improve the accuracy of the model derivatives. Additionally, the form of Sobolev training we used is suitable for high-dimensional models in general. Our results demonstrate that Sobolev training for a light transport surrogate model not only improves derivative accuracy but also reduces generalization error for in-distribution and out-of-distribution samples. These improvements promise to considerably enhance the utility of the surrogate model in downstream tasks, especially in solving inverse problems.
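Sobolev训练的核心是同时惩罚函数值误差与导数误差。下面的玩具示例用线性特征模型(导数解析可得)演示这一损失;真实代理模型使用神经网络与自动微分,基函数与权重λ的取值均为示意:

```python
import numpy as np

# Toy Sobolev training: fit f(x) = w . [x, x^2, x^3] to u(x) = sin(x),
# penalizing both value error and derivative error.
x = np.linspace(-1.5, 1.5, 40)
u, du = np.sin(x), np.cos(x)

Phi = np.column_stack([x, x**2, x**3])                       # value features
dPhi = np.column_stack([np.ones_like(x), 2 * x, 3 * x**2])   # d/dx features

def fit(lam):
    """Minimize ||Phi w - u||^2 + lam * ||dPhi w - du||^2 in closed form
    by stacking the derivative residual as extra least-squares rows."""
    A = np.vstack([Phi, np.sqrt(lam) * dPhi])
    b = np.concatenate([u, np.sqrt(lam) * du])
    w, *_ = np.linalg.lstsq(A, b, rcond=None)
    return w

w_plain, w_sob = fit(0.0), fit(1.0)   # value-only fit vs. Sobolev fit
derr_plain = np.mean((dPhi @ w_plain - du) ** 2)
derr_sob = np.mean((dPhi @ w_sob - du) ** 2)
```

由于Sobolev目标显式包含导数残差,其导数误差不会超过纯值拟合的导数误差——这正是摘要所述"提高模型导数准确性"的最简体现。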
【8】Learning to Trust Experience: A Monitor-Trust-Regulator Framework for Learning under Unobservable Feedback Reliability
标题:学习信任经验:不可观测反馈可靠性下的学习的一个通道-信任-调节器框架
链接:https://arxiv.org/abs/2601.09261
作者:Zhipeng Zhang,Zhenjie Yao,Kai Li,Lei Yang
备注:23 pages, 7 figures. Preprint
摘要:在不可观测反馈可靠性下的学习提出了一个超越优化鲁棒性的独特挑战:系统必须决定是否从某段经验中学习,而不仅仅是如何稳定地学习。我们将这种设定形式化为不可观测可靠性下的认知可识别性(EIUR):每段经验都有一个潜在的可信度,可靠和不可靠的反馈在局部可能无法区分,并且数据由学习者自身不断演化的信念和行动在闭环中生成。在EIUR中,标准的鲁棒学习可以稳定收敛,却形成高置信度的系统性错误信念。 我们提出元认知调节作为一个实际的应对:一个第二层的内省控制回路,从学习者内部动态中的内生证据推断经验可信度。我们将其形式化为模块化的监测器-信任-调节器(MTR)分解,并用自诊断将其实例化:自诊断维护一个缓慢变化的经验信任变量,温和地调节学习更新,而无需外源可靠性标签或显式的腐化模型。 在实证方面,在此处研究的EIUR机制中,自诊断与认知可识别性的提升相关联。在强化学习中,它能够在系统性受损的奖励下实现有校准的怀疑和恢复。在监督学习中,它暴露了一个关键的分离:性能恢复并不意味着认知恢复。准确率可以反弹,而内部信念动态仍被早期误导性数据锁定,这种失败只能通过内省诊断来检测。MTR与自诊断共同为不可观测可靠性下自主学习的内在可靠性评估提供了一个组织抽象和具体的设计模板。
摘要:Learning under unobservable feedback reliability poses a distinct challenge beyond optimization robustness: a system must decide whether to learn from an experience, not only how to learn stably. We study this setting as Epistemic Identifiability under Unobservable Reliability (EIUR), where each experience has a latent credibility, reliable and unreliable feedback can be locally indistinguishable, and data are generated in a closed loop by the learner's own evolving beliefs and actions. In EIUR, standard robust learning can converge stably yet form high-confidence, systematically wrong beliefs. We propose metacognitive regulation as a practical response: a second, introspective control loop that infers experience credibility from endogenous evidence in the learner's internal dynamics. We formalize this as a modular Monitor-Trust-Regulator (MTR) decomposition and instantiate it with self-diagnosis, which maintains a slowly varying experience-trust variable that softly modulates learning updates, without exogenous reliability labels or an explicit corruption model. Empirically, in the EIUR regimes studied here, self-diagnosis is associated with improved epistemic identifiability. In reinforcement learning, it enables calibrated skepticism and recovery under systematically corrupted rewards. In supervised learning, it exposes a critical dissociation: performance recovery does not imply epistemic recovery. Accuracy can rebound while internal belief dynamics remain locked-in by early misleading data, a failure detectable only through introspective diagnostics. Together, MTR and self-diagnosis provide an organizing abstraction and a concrete design template for intrinsic reliability assessment in autonomous learning under unobservable reliability.
【9】Reward Learning through Ranking Mean Squared Error
标题:通过排名均方误差进行奖励学习
链接:https://arxiv.org/abs/2601.09236
作者:Chaitanya Kharyal,Calarina Muslimani,Matthew E. Taylor
摘要:奖励设计仍然是将强化学习(RL)应用于现实世界问题的一个重要瓶颈。一个流行的替代方法是奖励学习,其中奖励函数是从人类反馈中推断出来的,而不是手动指定的。最近的工作提出从评级形式的人类反馈中学习奖励函数,而不是传统的二元偏好,从而实现更丰富且可能认知负担更小的监督。在此范式基础上,我们引入了一种新的基于评级的RL方法——面向RL的排名回报回归(Ranked Return Regression for RL,R4)。R4的核心是一种新颖的排名均方误差(rMSE)损失,它把教师提供的评级视为有序目标。我们的方法从轨迹-评级对的数据集中学习,其中每条轨迹都被标注了一个离散评级(例如"坏"、"中性"、"好")。在每个训练步骤中,我们采样一组轨迹,预测它们的回报,并使用可微排序算子(软排名)对它们排序,然后优化所得软排名与教师评级之间的均方误差损失。与以往基于评级的方法不同,R4提供了形式化保证:在温和假设下,其解集可证明是最小且完备的。在实证上,使用模拟人类反馈,我们证明R4在OpenAI Gym和DeepMind Control Suite的机器人运动基准上始终匹配或优于现有的基于评级和基于偏好的RL方法,同时需要的反馈显著更少。
摘要:Reward design remains a significant bottleneck in applying reinforcement learning (RL) to real-world problems. A popular alternative is reward learning, where reward functions are inferred from human feedback rather than manually specified. Recent work has proposed learning reward functions from human feedback in the form of ratings, rather than traditional binary preferences, enabling richer and potentially less cognitively demanding supervision. Building on this paradigm, we introduce a new rating-based RL method, Ranked Return Regression for RL (R4). At its core, R4 employs a novel ranking mean squared error (rMSE) loss, which treats teacher-provided ratings as ordinal targets. Our approach learns from a dataset of trajectory-rating pairs, where each trajectory is labeled with a discrete rating (e.g., "bad," "neutral," "good"). At each training step, we sample a set of trajectories, predict their returns, and rank them using a differentiable sorting operator (soft ranks). We then optimize a mean squared error loss between the resulting soft ranks and the teacher's ratings. Unlike prior rating-based approaches, R4 offers formal guarantees: its solution set is provably minimal and complete under mild assumptions. Empirically, using simulated human feedback, we demonstrate that R4 consistently matches or outperforms existing rating and preference-based RL methods on robotic locomotion benchmarks from OpenAI Gym and the DeepMind Control Suite, while requiring significantly less feedback.
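可微软排名与rMSE损失的机制可作如下草图。成对sigmoid软排名只是可微排序的一种常见选择,未必与论文所用算子一致;温度τ以及把评级的批内排名直接当作目标的做法均为示意:

```python
import numpy as np

def soft_ranks(scores, tau=0.1):
    """Differentiable ranks via pairwise sigmoids:
    r_i = 1 + sum_{j != i} sigmoid((s_i - s_j) / tau).
    Uses the tanh form of the logistic sigmoid for numerical stability."""
    diff = (scores[:, None] - scores[None, :]) / tau
    sig = 0.5 * (1.0 + np.tanh(diff / 2.0))
    np.fill_diagonal(sig, 0.0)
    return 1.0 + sig.sum(axis=1)

def rmse_loss(pred_returns, ratings, tau=0.1):
    """Ranking MSE between soft ranks of predicted returns and the (near-hard)
    ranks of the teacher's ordinal ratings within the sampled batch."""
    r = soft_ranks(pred_returns, tau)
    target = soft_ranks(ratings.astype(float), tau=1e-3)  # ties -> average ranks
    return np.mean((r - target) ** 2)

ratings = np.array([0, 2, 1, 2])                 # bad / good / neutral / good
good_pred = np.array([0.1, 0.9, 0.5, 0.95])      # ordering matches ratings
bad_pred = np.array([0.9, 0.1, 0.5, 0.2])        # ordering contradicts ratings
loss_good = rmse_loss(good_pred, ratings)
loss_bad = rmse_loss(bad_pred, ratings)
```

当预测回报的排序与评级一致时损失很小,排序相反时损失很大,而整个损失对预测回报处处可微,可直接反向传播。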
【10】Discrete Solution Operator Learning for Geometry-Dependent PDEs
标题:几何相关偏微分方程的离散解算子学习
链接:https://arxiv.org/abs/2601.09143
作者:Jinshuai Bai,Haolin Li,Zahra Sharif Khodaei,M. H. Aliabadi,YuanTong Gu,Xi-Qiao Feng
备注:15 pages main text, 40 pages SI
摘要:神经算子学习通过将算子近似为连续函数空间之间的映射来加速PDE求解。然而,在许多工程环境中,变化的几何形状会导致离散的结构变化,包括拓扑变化,边界条件或边界类型的突变,以及有效计算域的变化,这打破了平滑变化的前提。在这里,我们介绍离散求解算子学习(DiSOL),这是一种学习离散求解过程而不是连续函数空间算子的补充范式。DiSOL将求解器分解为反映经典离散化的可学习阶段:局部贡献编码,多尺度组装和嵌入式网格上的隐式解重构,从而在适应几何相关离散结构的同时保持程序级一致性。在几何相关的泊松,对流扩散,线性弹性,以及时空热传导问题,DiSOL产生稳定和准确的预测下的分布和强烈的分布几何形状,包括不连续的边界和拓扑变化。这些结果强调了在几何主导的制度中对程序算子表示的需要,并将离散解算子学习定位为科学机器学习中一个独特的互补方向。
摘要:Neural operator learning accelerates PDE solution by approximating operators as mappings between continuous function spaces. Yet in many engineering settings, varying geometry induces discrete structural changes, including topological changes, abrupt changes in boundary conditions or boundary types, and changes in the effective computational domain, which break the smooth-variation premise. Here we introduce Discrete Solution Operator Learning (DiSOL), a complementary paradigm that learns discrete solution procedures rather than continuous function-space operators. DiSOL factorizes the solver into learnable stages that mirror classical discretizations: local contribution encoding, multiscale assembly, and implicit solution reconstruction on an embedded grid, thereby preserving procedure-level consistency while adapting to geometry-dependent discrete structures. Across geometry-dependent Poisson, advection-diffusion, linear elasticity, as well as spatiotemporal heat-conduction problems, DiSOL produces stable and accurate predictions under both in-distribution and strongly out-of-distribution geometries, including discontinuous boundaries and topological changes. These results highlight the need for procedural operator representations in geometry-dominated regimes and position discrete solution operator learning as a distinct, complementary direction in scientific machine learning.
【11】A Machine Learning Approach Towards Runtime Optimisation of Matrix Multiplication
标题:一种面向矩阵乘法运行时优化的机器学习方法
链接:https://arxiv.org/abs/2601.09114
作者:Yufan Xia,Marco De La Pierre,Amanda S. Barnard,Giuseppe Maria Junior Barca
备注:2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)
摘要:广义矩阵乘法(GEMM)是科学计算中的基本算法之一。单线程GEMM实现使用分块和自动调优等技术得到了很好的优化。然而,由于现代多核共享内存系统的复杂性,确定使多线程GEMM运行时最小化的线程数量是具有挑战性的。我们提出了一个概念验证方法来构建一个架构和数据结构感知线性代数(ADSALA)软件库,使用机器学习来优化BLAS例程的运行时性能。更具体地说,我们的方法实时使用机器学习模型,根据收集的训练数据为给定的GEMM任务自动选择最优线程数。在两种不同的HPC节点架构(一种基于双插槽Intel Cascade Lake,另一种基于双插槽AMD Zen 3)上的测试结果显示,对于内存占用在100 MB以内的GEMM,相比BLAS中的传统GEMM实现可获得25%至40%的加速。
摘要:The GEneral Matrix Multiplication (GEMM) is one of the essential algorithms in scientific computing. Single-thread GEMM implementations are well-optimised with techniques like blocking and autotuning. However, due to the complexity of modern multi-core shared memory systems, it is challenging to determine the number of threads that minimises the multi-thread GEMM runtime. We present a proof-of-concept approach to building an Architecture and Data-Structure Aware Linear Algebra (ADSALA) software library that uses machine learning to optimise the runtime performance of BLAS routines. More specifically, our method uses a machine learning model on-the-fly to automatically select the optimal number of threads for a given GEMM task based on the collected training data. Test results on two different HPC node architectures, one based on a two-socket Intel Cascade Lake and the other on a two-socket AMD Zen 3, revealed a 25 to 40 per cent speedup compared to traditional GEMM implementations in BLAS when using GEMM of memory usage within 100 MB.
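基于学习的线程数选择可以用一个极简的最近邻草图来说明。训练形状、"最优线程数"取值与对数尺度特征均为假设数据,仅示意这类数据驱动决策的流程;ADSALA实际使用训练好的ML模型:

```python
import numpy as np

# Hypothetical measurements: for each GEMM shape (m, n, k), the thread count
# that minimized runtime on some machine (made-up numbers for illustration).
train_shapes = np.array([[256, 256, 256],
                         [1024, 1024, 1024],
                         [4096, 4096, 4096]])
best_threads = np.array([4, 16, 64])

def select_threads(m, n, k):
    """1-nearest-neighbour lookup in log2 shape space: return the thread
    count measured as optimal for the most similar training shape."""
    q = np.log2([m, n, k])
    d = np.linalg.norm(np.log2(train_shapes) - q, axis=1)
    return int(best_threads[np.argmin(d)])

t = select_threads(3000, 3000, 3000)   # closest to the 4096^3 shape
```

真实系统会用更丰富的特征(数据布局、缓存大小、NUMA拓扑)和回归/分类模型替换这里的1-NN查表。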
【12】SCaLE: Switching Cost aware Learning and Exploration
标题:SCaLE:切换成本意识学习和探索
链接:https://arxiv.org/abs/2601.09042
作者:Neelkamal Bhuyan,Debankur Mukherjee,Adam Wierman
备注:42 pages
摘要:通过考虑高维动态二次命中代价和带噪声bandit反馈模型中的$\ell_2$-范数切换代价,研究了bandit在线凸优化中无界度量移动代价的基本问题。对于一般类的随机环境,我们提供了第一个算法SCaLE,它在不知道命中代价结构的情况下,可证明地实现了与分布无关的次线性动态遗憾。在此过程中,我们提出了一种新的谱遗憾分析,分别量化由特征值误差驱动的遗憾和由特征基扰动驱动的遗憾。针对在线学习基线的大量数值实验证实了我们的论断,并突出了我们算法的统计一致性。
摘要:This work addresses the fundamental problem of unbounded metric movement costs in bandit online convex optimization, by considering high-dimensional dynamic quadratic hitting costs and $\ell_2$-norm switching costs in a noisy bandit feedback model. For a general class of stochastic environments, we provide the first algorithm SCaLE that provably achieves a distribution-agnostic sub-linear dynamic regret, without the knowledge of hitting cost structure. En-route, we present a novel spectral regret analysis that separately quantifies eigenvalue-error driven regret and eigenbasis-perturbation driven regret. Extensive numerical experiments, against online-learning baselines, corroborate our claims, and highlight statistical consistency of our algorithm.
【13】Optimising for Energy Efficiency and Performance in Machine Learning
标题:优化机器学习中的能源效率和性能
链接:https://arxiv.org/abs/2601.08991
作者:Emile Dos Santos Ferreira,Neil D. Lawrence,Andrei Paleyes
备注:Accepted to CAIN'26
摘要:机器学习(ML)的普及和对更大模型的需求带来了能源消耗和环境影响的增加。然而,对ML中的能量标度律知之甚少,现有的研究集中在训练成本上-忽略了更大的推理成本。此外,测量机器学习能耗的工具并不能提供可操作的反馈。 为了解决这些差距,我们开发了能耗优化器(ECOpt):一种超参数调谐器,可以优化能源效率和模型性能。ECOpt将这些指标之间的权衡量化为可解释的帕累托边界。这使ML从业者能够就能源成本和环境影响做出明智的决策,同时最大限度地发挥其模型的优势并遵守新的法规。 使用ECOpt,我们表明,参数和浮点运算计数可以是不可靠的代理的能源消耗,并观察到的能量效率的Transformer模型的文本生成是相对一致的硬件。这些发现激发了测量和发布ML模型的能量指标。我们进一步表明,ECOpt可以产生净积极的环境影响,并使用它来揭示CIFAR-10的七个模型,这些模型在同时考虑准确性和能源效率的情况下,改进了最先进的技术。
摘要:The ubiquity of machine learning (ML) and the demand for ever-larger models bring an increase in energy consumption and environmental impact. However, little is known about the energy scaling laws in ML, and existing research focuses on training cost -- ignoring the larger cost of inference. Furthermore, tools for measuring the energy consumption of ML do not provide actionable feedback. To address these gaps, we developed Energy Consumption Optimiser (ECOpt): a hyperparameter tuner that optimises for energy efficiency and model performance. ECOpt quantifies the trade-off between these metrics as an interpretable Pareto frontier. This enables ML practitioners to make informed decisions about energy cost and environmental impact, while maximising the benefit of their models and complying with new regulations. Using ECOpt, we show that parameter and floating-point operation counts can be unreliable proxies for energy consumption, and observe that the energy efficiency of Transformer models for text generation is relatively consistent across hardware. These findings motivate measuring and publishing the energy metrics of ML models. We further show that ECOpt can have a net positive environmental impact and use it to uncover seven models for CIFAR-10 that improve upon the state of the art, when considering accuracy and energy efficiency together.
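The interpretable Pareto frontier that ECOpt reports can be illustrated with a standard non-dominated filter. This is a minimal sketch under the assumption of two objectives, accuracy to maximise and energy to minimise; the tuner itself is not shown.

```python
def pareto_frontier(configs):
    """Return the non-dominated configurations.

    configs: list of (name, accuracy, energy_joules); higher accuracy and
    lower energy are better. A config is dominated if another config is at
    least as good on both axes and strictly better on at least one.
    """
    front = []
    for name, acc, energy in configs:
        dominated = any(
            (a >= acc and e <= energy) and (a > acc or e < energy)
            for _, a, e in configs
        )
        if not dominated:
            front.append((name, acc, energy))
    return front
```

Configurations on the frontier expose the accuracy-versus-energy trade-off directly, which is what lets a practitioner pick a model under an energy budget.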
【14】Attention Consistency Regularization for Interpretable Early-Exit Neural Networks
标题:可解释提前退出神经网络的注意力一致性正则化
链接:https://arxiv.org/abs/2601.08891
作者:Yanhua Zhao
备注:2 pages, 1 figure
摘要:早期退出神经网络通过允许在中间层进行预测来实现自适应推理,从而降低计算成本。然而,早期退出通常缺乏可解释性,并且可能关注与更深层不同的特征,从而限制了信任度和可解释性。本文提出了一种多目标框架,即解释引导训练(EGT),它通过基于注意力的正则化提高早期退出网络的可解释性和一致性。EGT引入了注意力一致性损失,使早期退出的注意力图与最终退出对齐。该框架通过损失的加权组合来共同优化分类准确性和注意力一致性。在真实世界图像分类数据集上的实验表明,EGT实现了高达98.97%的整体准确率(与基线性能相当),通过早期退出实现了1.97倍的推理加速,同时与基线模型相比,注意力一致性提高了18.5%。所提出的方法在所有退出点上提供了更可解释和一致的解释,使早期退出网络更适合资源受限环境中的可解释AI应用。
摘要:Early-exit neural networks enable adaptive inference by allowing predictions at intermediate layers, reducing computational cost. However, early exits often lack interpretability and may focus on different features than deeper layers, limiting trust and explainability. This paper presents Explanation-Guided Training (EGT), a multi-objective framework that improves interpretability and consistency in early-exit networks through attention-based regularization. EGT introduces an attention consistency loss that aligns early-exit attention maps with the final exit. The framework jointly optimizes classification accuracy and attention consistency through a weighted combination of losses. Experiments on a real-world image classification dataset demonstrate that EGT achieves up to 98.97% overall accuracy (matching baseline performance) with a 1.97x inference speedup through early exits, while improving attention consistency by up to 18.5% compared to baseline models. The proposed method provides more interpretable and consistent explanations across all exit points, making early-exit networks more suitable for explainable AI applications in resource-constrained environments.
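The weighted combination of classification and attention-consistency losses described above can be sketched as follows. This is an illustration only; the exact distance measure, normalisation, and weight `lam` are assumptions, not the paper's reported choices.

```python
import numpy as np

def egt_loss(ce_loss, attn_early, attn_final, lam=0.1):
    """Joint objective: cross-entropy plus an attention-consistency term.

    attn_early / attn_final: attention maps of matching shape. Maps are
    L1-normalised before comparison so the penalty measures *where* the
    model looks rather than how sharply. `lam` (assumed value) trades
    consistency against accuracy.
    """
    def normalise(a):
        a = np.asarray(a, dtype=float)
        return a / (a.sum() + 1e-8)

    consistency = np.mean((normalise(attn_early) - normalise(attn_final)) ** 2)
    return ce_loss + lam * consistency
```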
【15】High-fidelity lunar topographic reconstruction across diverse terrain and illumination environments using deep learning
标题:使用深度学习在不同地形和照明环境中进行高保真月球地形重建
链接:https://arxiv.org/abs/2601.09468
作者:Hao Chen,Philipp Gläser,Konrad Willner,Jürgen Oberst
备注:25 pages, 1 table, 8 figures
摘要:地形模型对于表征行星表面和推断潜在的地质过程至关重要。然而,米级的地形数据仍然有限,这限制了对行星的详细调查,即使是对拥有大量高分辨率轨道图像的月球也是如此。深度学习(DL)的最新进展利用受低分辨率地形约束的单视图图像,以快速灵活地重建精细尺度地形。然而,它们在不同月球地貌和照明条件下的鲁棒性和普遍适用性仍然没有得到充分的探索。在这项研究中,我们在以前提出的DL框架的基础上,引入了更稳健的尺度恢复方案,并将模型扩展到低太阳光照条件下的极地地区。我们证明,与单视图明暗恢复形状(shape-from-shading)方法相比,所提出的DL方法对不同照明具有更强的鲁棒性,并实现了更一致和准确的地形重建。此外,它可靠地重建了不同尺度、形态和地质年龄的月球特征的地形。我们还为月球南极地区(包括永久阴影区域)制作了高质量的地形模型,展示了该方法重建复杂和低照度地形的能力。这些发现表明,基于DL的方法有可能利用广泛的月球数据集来支持先进的探测任务,并以前所未有的地形分辨率对月球进行调查。
摘要:Topographic models are essential for characterizing planetary surfaces and for inferring underlying geological processes. Nevertheless, meter-scale topographic data remain limited, which constrains detailed planetary investigations, even for the Moon, where extensive high-resolution orbital images are available. Recent advances in deep learning (DL) exploit single-view imagery, constrained by low-resolution topography, for fast and flexible reconstruction of fine-scale topography. However, their robustness and general applicability across diverse lunar landforms and illumination conditions remain insufficiently explored. In this study, we build upon our previously proposed DL framework by incorporating a more robust scale recovery scheme and extending the model to polar regions under low solar illumination conditions. We demonstrate that, compared with single-view shape-from-shading methods, the proposed DL approach exhibits greater robustness to varying illumination and achieves more consistent and accurate topographic reconstructions. Furthermore, it reliably reconstructs topography across lunar features of diverse scales, morphologies, and geological ages. High-quality topographic models are also produced for the lunar south polar areas, including permanently shadowed regions, demonstrating the method's capability in reconstructing complex and low-illumination terrain. These findings suggest that DL-based approaches have the potential to leverage extensive lunar datasets to support advanced exploration missions and enable investigations of the Moon at unprecedented topographic resolution.
【16】Machine Learning-Driven Creep Law Discovery Across Alloy Compositional Space
标题:机器学习驱动的合金成分空间蠕变规律发现
链接:https://arxiv.org/abs/2601.08970
作者:Hongshun Chen,Ryan Zhou,Rujing Zha,Zihan Chen,Wenpan Li,Rowan Rolark,John Patrick Reidy,Jian Cao,Ping Guo,David C. Dunand,Horacio D. Espinosa
备注:27 pages, 7 figures
摘要:结构合金的高温蠕变表征传统上依赖于串行的单轴试验,这对于探索合金成分的巨大搜索空间和材料发现而言效率极低。在这里,我们介绍了一个基于凹坑阵列鼓胀仪(DABI)配置的机器学习辅助高通量蠕变规律识别框架,它可以在单次实验中对25个由不同合金制成的凹坑进行并行蠕变测试。采用三维数字图像相关法测量了凹坑在惰性气体压力下随时间蠕变鼓胀过程中的全场表面位移。我们训练一个递归神经网络(RNN)作为代理模型,将蠕变参数和加载条件映射到DABI的时间相关变形响应。将该代理模型与粒子群优化方案耦合,能够从实验位移-时间历史中带稀疏正则化地快速全局反演识别蠕变参数。此外,我们提出了一个具有时间相关应力指数的唯象蠕变规律,捕捉锻造INCONEL 625中观察到的S形初级蠕变,并从多个温度下的DABI测试中提取其温度依赖性。此外,我们采用结合几种传统形式的通用蠕变规律和正则化反演,确定了另外47种富Fe、Ni、Co合金的蠕变规律,并为每种合金自动选择占主导地位的函数形式。该工作流程与DABI实验相结合,提供了一个定量、高通量的蠕变表征平台,可在大型合金设计空间中与数据挖掘、成分-性能建模以及考虑蠕变行为的非线性结构优化兼容。
摘要:High-temperature creep characterization of structural alloys traditionally relies on serial uniaxial tests, which are highly inefficient for exploring the large search space of alloy compositions and for material discovery. Here, we introduce a machine-learning-assisted, high-throughput framework for creep law identification based on a dimple array bulge instrument (DABI) configuration, which enables parallel creep testing of 25 dimples, each fabricated from a different alloy, in a single experiment. Full-field surface displacements of dimples undergoing time-dependent creep-induced bulging under inert gas pressure are measured by 3D digital image correlation. We train a recurrent neural network (RNN) as a surrogate model, mapping creep parameters and loading conditions to the time-dependent deformation response of DABI. Coupling this surrogate with a particle swarm optimization scheme enables rapid and global inverse identification with sparsity regularization of creep parameters from experiment displacement-time histories. In addition, we propose a phenomenological creep law with a time-dependent stress exponent that captures the sigmoidal primary creep observed in wrought INCONEL 625 and extracts its temperature dependence from DABI test at multiple temperatures. Furthermore, we employ a general creep law combining several conventional forms together with regularized inversion to identify the creep laws for 47 additional Fe-, Ni-, and Co-rich alloys and to automatically select the dominant functional form for each alloy. This workflow combined with DABI experiment provides a quantitative, high-throughput creep characterization platform that is compatible with data mining, composition-property modeling, and nonlinear structural optimization with creep behavior across a large alloy design space.
【17】Comprehensive Machine Learning Benchmarking for Fringe Projection Profilometry with Photorealistic Synthetic Data
标题:具有真实感合成数据的边缘投影轮廓测量的全面机器学习基准
链接:https://arxiv.org/abs/2601.08900
作者:Anush Lakshman S,Adam Haroon,Beiwen Li
摘要:用于条纹投影轮廓术(FPP)的机器学习方法受到缺乏大型、多样化数据集和全面基准测试协议的阻碍。本文介绍了第一个用于FPP的开源真实感合成数据集,该数据集使用NVIDIA Isaac Sim生成,包含50个不同对象的15,600个条纹图像和300个深度重建。我们对四种神经网络架构(UNet、Hformer、ResUNet、Pix2Pix)进行了单次深度重建的基准测试,结果表明,尽管架构存在很大差异,但所有模型都实现了相似的性能(58-77 mm RMSE)。我们的研究结果表明了没有显式相位信息的直接条纹到深度映射的基本限制,重建误差接近典型对象深度范围的75-95%。该资源提供了标准化的评估协议,使基于学习的FPP方法能够进行系统的比较和发展。
摘要:Machine learning approaches for fringe projection profilometry (FPP) are hindered by the lack of large, diverse datasets and comprehensive benchmarking protocols. This paper introduces the first open-source, photorealistic synthetic dataset for FPP, generated using NVIDIA Isaac Sim with 15,600 fringe images and 300 depth reconstructions across 50 diverse objects. We benchmark four neural network architectures (UNet, Hformer, ResUNet, Pix2Pix) on single-shot depth reconstruction, revealing that all models achieve similar performance (58-77 mm RMSE) despite substantial architectural differences. Our results demonstrate fundamental limitations of direct fringe-to-depth mapping without explicit phase information, with reconstruction errors approaching 75-95% of the typical object depth range. This resource provides standardized evaluation protocols enabling systematic comparison and development of learning-based FPP approaches.
其他(17篇)
【1】Disentangling Task Conflicts in Multi-Task LoRA via Orthogonal Gradient Projection
标题:通过正交梯度投影解开多任务LoRA中的任务冲突
链接:https://arxiv.org/abs/2601.09684
作者:Ziyu Yang,Guibin Chen,Yuxin Yang,Aoxiong Zeng,Xiangquan Yang
备注:preprint
摘要:多任务学习(MTL)结合低秩自适应(LoRA)已经成为大型语言模型(LLM)参数高效部署的一个有前景的方向。通过在多个任务之间共享单个适配器,可以显著降低存储开销。然而,这种方法存在负迁移问题:与单任务微调相比,来自不同任务的冲突梯度更新会降低单个任务的性能。由于低秩约束限制了优化空间适应不同任务需求的能力,这个问题在LoRA中更加严重。在本文中,我们提出了Ortho-LoRA,一种专门为LoRA的二分结构量身定制的梯度投影方法。Ortho-LoRA在内在LoRA子空间内,动态地将冲突的任务梯度投影到彼此的正交补上。在GLUE基准测试上的大量实验表明,Ortho-LoRA有效地减轻了任务干扰,优于标准联合训练,并以可忽略的计算开销恢复了多任务和单任务基线之间95%的性能差距。
摘要:Multi-Task Learning (MTL) combined with Low-Rank Adaptation (LoRA) has emerged as a promising direction for parameter-efficient deployment of Large Language Models (LLMs). By sharing a single adapter across multiple tasks, one can significantly reduce storage overhead. However, this approach suffers from negative transfer, where conflicting gradient updates from distinct tasks degrade the performance of individual tasks compared to single-task fine-tuning. This problem is exacerbated in LoRA due to the low-rank constraint, which limits the optimization landscape's capacity to accommodate diverse task requirements. In this paper, we propose Ortho-LoRA, a gradient projection method specifically tailored for the bipartite structure of LoRA. Ortho-LoRA dynamically projects conflicting task gradients onto the orthogonal complement of each other within the intrinsic LoRA subspace. Extensive experiments on the GLUE benchmark demonstrate that Ortho-LoRA effectively mitigates task interference, outperforming standard joint training and recovering 95% of the performance gap between multi-task and single-task baselines with negligible computational overhead.
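The core projection step, removing from one task's gradient its conflicting component along another's, can be sketched in a few lines. This is a PCGrad-style illustration of the general mechanism; Ortho-LoRA's restriction of the operation to the intrinsic LoRA subspace is not reproduced here.

```python
import numpy as np

def deconflict(g_i, g_j):
    """If g_i conflicts with g_j (negative inner product), subtract from
    g_i its component along g_j, leaving the part orthogonal to g_j."""
    g_i = np.asarray(g_i, dtype=float)
    g_j = np.asarray(g_j, dtype=float)
    dot = g_i @ g_j
    if dot < 0:
        g_i = g_i - (dot / (g_j @ g_j)) * g_j
    return g_i
```

After projection the result has zero inner product with `g_j`, so applying the update no longer directly undoes the other task's progress; non-conflicting gradients pass through unchanged.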
【2】PersonalAlign: Hierarchical Implicit Intent Alignment for Personalized GUI Agent with Long-Term User-Centric Records
标题:PersonalAlign:具有长期以用户为中心的记录的个性化图形界面代理的分层隐式意图对齐
链接:https://arxiv.org/abs/2601.09636
作者:Yibo Lyu,Gongwei Chen,Rui Shao,Weili Guan,Liqiang Nie
摘要:虽然GUI代理在显式且完整的指令下表现出强大的性能,但实际部署需要与用户更复杂的隐式意图保持一致。在这项工作中,我们提出了个性化GUI代理的分层隐式意图对齐(PersonalAlign),这是一个新的代理任务,要求代理利用长期用户记录作为持久上下文,以解决模糊指令中被省略的偏好,并根据用户状态预测潜在的例程以提供主动协助。为了促进这项研究,我们引入了AndroidIntent,这是一个基准,旨在评估代理通过对长期用户记录进行推理来解决模糊指令并提供主动建议的能力。我们从不同用户的2万条长期记录中注释了775个用户特定的偏好和215个例程用于评估。此外,我们引入了分层意图记忆代理(HIM-Agent),它维护一个持续更新的个人记忆,并分层组织用户的偏好和例程以实现个性化。最后,我们在AndroidIntent上评估了一系列GUI代理,包括GPT-5、Qwen3-VL和UI-TARS;结果进一步表明,HIM-Agent将执行性能和主动性能分别显著提高了15.7%和7.3%。
摘要:While GUI agents have shown strong performance under explicit and completion instructions, real-world deployment requires aligning with users' more complex implicit intents. In this work, we highlight Hierarchical Implicit Intent Alignment for Personalized GUI Agent (PersonalAlign), a new agent task that requires agents to leverage long-term user records as persistent context to resolve omitted preferences in vague instructions and anticipate latent routines by user state for proactive assistance. To facilitate this study, we introduce AndroidIntent, a benchmark designed to evaluate agents' ability in resolving vague instructions and providing proactive suggestions through reasoning over long-term user records. We annotated 775 user-specific preferences and 215 routines from 20k long-term records across different users for evaluation. Furthermore, we introduce Hierarchical Intent Memory Agent (HIM-Agent), which maintains a continuously updating personal memory and hierarchically organizes user preferences and routines for personalization. Finally, we evaluate a range of GUI agents on AndroidIntent, including GPT-5, Qwen3-VL, and UI-TARS, further results show that HIM-Agent significantly improves both execution and proactive performance by 15.7% and 7.3%.
【3】Constraint- and Score-Based Nonlinear Granger Causality Discovery with Kernels
标题:基于约束和分数的核函数非线性Granger因果关系发现
链接:https://arxiv.org/abs/2601.09579
作者:Fiona Murphy,Alessio Benavoli
摘要:在格兰杰因果关系的背景下,基于核函数的方法能够识别时间序列变量之间的非线性因果关系。在本文中,我们表明,两种最先进的基于核的格兰杰因果关系(GC)方法可以在核主成分回归(KPCR)的框架下在理论上得到统一,并基于这种统一提出了一种方法,证明该方法可以改进因果识别。此外,我们引入了一个在边际似然上带平滑信息准则惩罚的高斯过程评分模型,并证明其性能优于现有最先进的时间序列非线性因果发现方法。此外,我们提出了一个完全基于GC、使用所提出的基于评分的$GP_{SIC}$方法的同期因果识别算法,并将其性能与最先进的同期时间序列因果发现算法进行了比较。
摘要:Kernel-based methods are used in the context of Granger Causality to enable the identification of nonlinear causal relationships between time series variables. In this paper, we show that two state of the art kernel-based Granger Causality (GC) approaches can be theoretically unified under the framework of Kernel Principal Component Regression (KPCR), and introduce a method based on this unification, demonstrating that this approach can improve causal identification. Additionally, we introduce a Gaussian Process score-based model with Smooth Information Criterion penalisation on the marginal likelihood, and demonstrate improved performance over existing state of the art time-series nonlinear causal discovery methods. Furthermore, we propose a contemporaneous causal identification algorithm, fully based on GC, using the proposed score-based $GP_{SIC}$ method, and compare its performance to a state of the art contemporaneous time series causal discovery algorithm.
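The kernel principal component regression (KPCR) that the unification above rests on can be sketched as RBF-kernel PCA followed by least squares on the leading components. This is a self-contained illustration: `gamma` and `n_components` are arbitrary, and the Granger-causal restricted-versus-unrestricted comparison built on top of KPCR is not shown.

```python
import numpy as np

def kpcr_fit_predict(x_train, y_train, x_test, gamma=1.0, n_components=3):
    """RBF-kernel PCR: centre the kernel, take leading eigenpairs, regress
    y on the kernel principal components, and project test points."""
    def rbf(a, b):
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d2)

    n = len(x_train)
    K = rbf(x_train, x_train)
    H = np.eye(n) - np.ones((n, n)) / n              # centring matrix
    vals, vecs = np.linalg.eigh(H @ K @ H)
    idx = np.argsort(vals)[::-1][:n_components]      # leading eigenpairs
    scale = np.sqrt(np.clip(vals[idx], 1e-12, None))
    Z = vecs[:, idx] * scale                         # train projections
    w, *_ = np.linalg.lstsq(Z, y_train - y_train.mean(), rcond=None)
    Kt = rbf(x_test, x_train)
    Ktc = (Kt - K.mean(0, keepdims=True)) @ H        # centre test kernel
    Zt = Ktc @ vecs[:, idx] / scale                  # test projections
    return Zt @ w + y_train.mean()
```

In a Granger setting one would fit this twice, with and without the candidate cause's lagged values, and compare residuals.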
【4】Residual Power Flow for Neural Solvers
标题:神经求解器的残差潮流
链接:https://arxiv.org/abs/2601.09533
作者:Jochen Stiasny,Jochen Cremer
备注:This work has been submitted to the IEEE for possible publication
摘要:能源转型给基于模拟和优化的运营任务带来了挑战。随着电网的不断扩大,这些计算需要快速和灵活,而可再生能源的不确定性需要灵活的运行环境。学习得到的近似模型、代理或替代模型(我们称之为神经求解器)在评估速度方面表现出色,但在适应不断变化的任务方面不灵活。因此,神经求解器通常只适用于高度特定的任务,这限制了它们在实践中的实用性;需要一个可广泛重用的基础神经求解器。因此,这项工作提出了残差潮流(RPF)公式。RPF根据基尔霍夫定律构造残差函数,以量化运行条件的不可行性。残差的最小化决定了电压解;需要一个额外的松弛变量来实现交流可行性。RPF构成了受潮流约束的各类任务的一个自然的基础子任务。我们建议用神经求解器学习RPF以利用其速度。此外,与常见的潮流公式相比,RPF提高了学习性能。为了解决运营任务,我们将神经求解器集成到先预测后优化(PO)方法中,以结合速度和灵活性。算例研究了IEEE 9节点系统和由PO求解的三个任务(交流最优潮流、潮流和准稳态潮流)。实验结果表明了用RPF学习的准确性和灵活性。
摘要:The energy transition challenges operational tasks based on simulations and optimisation. These computations need to be fast and flexible as the grid is ever-expanding, and renewables' uncertainty requires a flexible operational environment. Learned approximations, proxies or surrogates -- we refer to them as Neural Solvers -- excel in terms of evaluation speed, but are inflexible with respect to adjusting to changing tasks. Hence, neural solvers are usually applicable to highly specific tasks, which limits their usefulness in practice; a widely reusable, foundational neural solver is required. Therefore, this work proposes the Residual Power Flow (RPF) formulation. RPF formulates residual functions based on Kirchhoff's laws to quantify the infeasibility of an operating condition. The minimisation of the residuals determines the voltage solution; an additional slack variable is needed to achieve AC-feasibility. RPF forms a natural, foundational subtask of tasks subject to power flow constraints. We propose to learn RPF with neural solvers to exploit their speed. Furthermore, RPF improves learning performance compared to common power flow formulations. To solve operational tasks, we integrate the neural solver in a Predict-then-Optimise (PO) approach to combine speed and flexibility. The case study investigates the IEEE 9-bus system and three tasks (AC Optimal Power Flow (OPF), power-flow and quasi-steady state power flow) solved by PO. The results demonstrate the accuracy and flexibility of learning with RPF.
【5】Parallelizable memory recurrent units
标题:可并行化的记忆循环单元
链接:https://arxiv.org/abs/2601.09495
作者:Florent De Geeter,Gaspard Lambrechts,Damien Ernst,Guillaume Drion
备注:19 pages, 12 figures. This work has been the subject of a patent application (Number: EP26151077). This work has been submitted to the IEEE for possible publication
摘要:随着大规模并行处理单元的出现,可并行化已成为新序列模型的理想属性。在训练过程中能够沿序列长度并行处理序列,是Transformer架构兴起背后的主要因素之一。然而,Transformer在序列生成方面缺乏效率,因为它们需要在每个生成步骤重新处理所有过去的时间步。最近,状态空间模型(SSM)作为一种更高效的替代方案出现。这类新型递归神经网络(RNN)保持了RNN的高效更新,同时通过摆脱非线性动力学(或递归)获得了可并行性。SSM可以通过对潜在的非常大的网络进行高效训练来达到最先进的性能,但仍然受限于有限的表示能力。特别是,由于其单稳态性,SSM不能表现出持久记忆,即无限长时间保留信息的能力。在本文中,我们引入了一个新的RNN家族,即记忆递归单元(MRU),它将非线性RNN的持久记忆能力与SSM的可并行计算相结合。这些单元利用多稳态性作为持久记忆的来源,同时摆脱瞬态动力学以实现高效计算。然后,我们推导出一个具体实现作为概念验证:双稳态记忆递归单元(BMRU)。这种新的RNN与并行扫描算法兼容。我们表明,BMRU在具有长期依赖关系的任务中取得了良好的结果,并且可以与状态空间模型相结合,以创建既可并行化、又兼具瞬态动力学和持久记忆的混合网络。
摘要:With the emergence of massively parallel processing units, parallelization has become a desirable property for new sequence models. The ability to parallelize the processing of sequences with respect to the sequence length during training is one of the main factors behind the uprising of the Transformer architecture. However, Transformers lack efficiency at sequence generation, as they need to reprocess all past timesteps at every generation step. Recently, state-space models (SSMs) emerged as a more efficient alternative. These new kinds of recurrent neural networks (RNNs) keep the efficient update of the RNNs while gaining parallelization by getting rid of nonlinear dynamics (or recurrence). SSMs can reach state-of-the art performance through the efficient training of potentially very large networks, but still suffer from limited representation capabilities. In particular, SSMs cannot exhibit persistent memory, or the capacity of retaining information for an infinite duration, because of their monostability. In this paper, we introduce a new family of RNNs, the memory recurrent units (MRUs), that combine the persistent memory capabilities of nonlinear RNNs with the parallelizable computations of SSMs. These units leverage multistability as a source of persistent memory, while getting rid of transient dynamics for efficient computations. We then derive a specific implementation as proof-of-concept: the bistable memory recurrent unit (BMRU). This new RNN is compatible with the parallel scan algorithm. We show that BMRU achieves good results in tasks with long-term dependencies, and can be combined with state-space models to create hybrid networks that are parallelizable and have transient dynamics as well as persistent memory.
【6】Ability Transfer and Recovery via Modularized Parameters Localization
标题:通过模块化参数定位实现能力转移与恢复
链接:https://arxiv.org/abs/2601.09398
作者:Songyao Jin,Kun Zhou,Wenqi Li,Peng Wang,Biwei Huang
摘要:大型语言模型可以不断进行预训练或微调,以提高特定领域、语言或技能的性能,但这种专业化往往会降低其他能力,并可能导致灾难性遗忘。我们通过分析密切相关模型在特定领域和特定语言输入下的模块激活,研究能力在LLM参数中的分布。在各层和各模块中,我们发现能力相关的激活高度集中在一小部分通道中(通常<5%),并且这些通道在很大程度上是解耦的,具有良好的充分性和稳定性。基于这些观察结果,我们提出了ACT(激活引导的通道级能力转移),它通过激活差异定位能力相关的通道,并选择性地只转移相应的参数,然后进行轻量级的兼容性微调。多语言数学和科学推理的实验表明,ACT可以恢复被遗忘的能力,同时保留已掌握的技能。它还可以合并多个专业模型,以最小的干扰将多种能力集成到一个模型中。我们的代码和数据将公开发布。
摘要:Large language models can be continually pre-trained or fine-tuned to improve performance in specific domains, languages, or skills, but this specialization often degrades other capabilities and may cause catastrophic forgetting. We investigate how abilities are distributed within LLM parameters by analyzing module activations under domain- and language-specific inputs for closely related models. Across layers and modules, we find that ability-related activations are highly concentrated in a small set of channels (typically <5%), and these channels are largely disentangled with good sufficiency and stability. Building on these observations, we propose ACT (Activation-Guided Channel-wise Ability Transfer), which localizes ability-relevant channels via activation differences and selectively transfers only the corresponding parameters, followed by lightweight fine-tuning for compatibility. Experiments on multilingual mathematical and scientific reasoning show that ACT can recover forgotten abilities while preserving retained skills. It can also merge multiple specialized models to integrate several abilities into a single model with minimal interference. Our code and data will be publicly released.
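The localization step, ranking channels by activation difference between a specialized model and its base, can be sketched directly. This is an illustration consistent with the abstract's observation that ability-related channels are few; the ~5% cut-off and the mean-absolute-difference statistic are assumptions rather than the paper's exact criterion.

```python
import numpy as np

def localize_channels(acts_specialized, acts_base, top_frac=0.05):
    """Return indices of the channels with the largest mean absolute
    activation difference on ability-specific inputs.

    acts_*: arrays of shape (n_samples, n_channels). The statistic and
    cut-off are illustrative assumptions.
    """
    diff = np.abs(acts_specialized - acts_base).mean(0)
    n_keep = max(1, int(len(diff) * top_frac))
    return np.argsort(diff)[::-1][:n_keep]
```

Only the parameters feeding these channels would then be transferred, with light fine-tuning for compatibility.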
【7】High-Performance Serverless Computing: A Systematic Literature Review on Serverless for HPC, AI, and Big Data
标题:高性能无服务器计算:面向HPC、AI和大数据的无服务器系统性文献综述
链接:https://arxiv.org/abs/2601.09334
作者:Valerio Besozzi,Matteo Della Bartola,Patrizio Dazzi,Marco Danelutto
摘要:高性能计算、人工智能和大数据等大规模计算密集型应用的广泛部署正在推动云和高性能计算基础设施之间的融合。云提供商越来越多地在其基础设施中集成高性能计算能力,例如硬件加速器和高速互连,而高性能计算社区的研究人员开始探索云原生范式,以提高可扩展性、弹性和资源利用率。在这种背景下,无服务器计算成为一种有前景的执行模型,可以高效地处理高度动态、并行和分布式的工作负载。本文对2018年至2025年初发表的122篇研究文章进行了全面系统的文献综述,探讨了如何使用无服务器范式在云、高性能计算和混合环境中开发、部署和编排计算密集型应用。在此基础上,提出了一个包含八个主要研究方向和九个目标用例领域的分类法,并分析了近期的出版趋势和作者之间的合作网络,突出了这个新兴研究领域内日益增长的兴趣和相互联系。总体而言,这项工作旨在为新研究人员和经验丰富的从业者提供宝贵的基础,指导面向并行计算密集型应用的下一代无服务器解决方案的开发。
摘要:The widespread deployment of large-scale, compute-intensive applications such as high-performance computing, artificial intelligence, and big data is leading to convergence between cloud and high-performance computing infrastructures. Cloud providers are increasingly integrating high-performance computing capabilities in their infrastructures, such as hardware accelerators and high-speed interconnects, while researchers in the high-performance computing community are starting to explore cloud-native paradigms to improve scalability, elasticity, and resource utilization. In this context, serverless computing emerges as a promising execution model to efficiently handle highly dynamic, parallel, and distributed workloads. This paper presents a comprehensive systematic literature review of 122 research articles published between 2018 and early 2025, exploring the use of the serverless paradigm to develop, deploy, and orchestrate compute-intensive applications across cloud, high-performance computing, and hybrid environments. From these, a taxonomy comprising eight primary research directions and nine targeted use case domains is proposed, alongside an analysis of recent publication trends and collaboration networks among authors, highlighting the growing interest and interconnections within this emerging research field. Overall, this work aims to offer a valuable foundation for both new researchers and experienced practitioners, guiding the development of next-generation serverless solutions for parallel compute-intensive applications.
【8】Magnifying change: Rapid burn scar mapping with multi-resolution, multi-source satellite imagery
标题:放大变化:利用多分辨率、多源卫星图像快速绘制烧伤疤痕地图
链接:https://arxiv.org/abs/2601.09262
作者:Maria Sdraka,Dimitrios Michail,Ioannis Papoutsis
摘要:由于电磁频谱上不规则且空间异质的光谱变化,使用卫星图像划定野火影响区域仍然具有挑战性。虽然最近的深度学习方法在高分辨率多光谱数据可用时实现了高精度,但在需要在野火事件后不久快速划定烧伤疤痕的业务化环境中,它们的适用性受到当前卫星系统空间分辨率和时间重访频率之间权衡的限制。为了解决这一限制,我们提出了一种新的深度学习模型BAM-MRCD,它采用多分辨率、多源卫星图像(MODIS和Sentinel-2),及时生成具有高空间和时间分辨率的详细烧毁区域地图。我们的模型能够以高精度检测到甚至小规模的野火,超越了类似的变化检测模型和坚实的基线。所有数据和代码都可以在GitHub存储库中找到:https://github.com/Orion-AI-Lab/BAM-MRCD。
摘要:Delineating wildfire affected areas using satellite imagery remains challenging due to irregular and spatially heterogeneous spectral changes across the electromagnetic spectrum. While recent deep learning approaches achieve high accuracy when high-resolution multispectral data are available, their applicability in operational settings, where a quick delineation of the burn scar shortly after a wildfire incident is required, is limited by the trade-off between spatial resolution and temporal revisit frequency of current satellite systems. To address this limitation, we propose a novel deep learning model, namely BAM-MRCD, which employs multi-resolution, multi-source satellite imagery (MODIS and Sentinel-2) for the timely production of detailed burnt area maps with high spatial and temporal resolution. Our model manages to detect even small scale wildfires with high accuracy, surpassing similar change detection models as well as solid baselines. All data and code are available in the GitHub repository: https://github.com/Orion-AI-Lab/BAM-MRCD.
【9】RIFT: Repurposing Negative Samples via Reward-Informed Fine-Tuning
标题:RIFT:通过基于奖励信息的微调重新利用负样本
链接:https://arxiv.org/abs/2601.09253
作者:Zehua Liu,Shuqi Liu,Tao Zhong,Mingxuan Yuan
摘要:虽然监督微调(SFT)和拒绝采样微调(RFT)是LLM对齐的标准方法,但它们要么依赖于昂贵的专家数据,要么丢弃有价值的负样本,导致数据效率低下。为了解决这个问题,我们提出了基于奖励信息的微调(RIFT),一个简单而有效的框架,可利用所有自我生成的样本。与RFT的硬阈值不同,RIFT重新利用负轨迹,用标量奖励重新加权损失,从模型输出的正轨迹和负轨迹中共同学习。为了克服朴素地引入奖励所导致的训练崩溃(直接相乘会产生无界损失),我们引入了一个稳定的损失公式,以确保数值鲁棒性和优化效率。在各种基础模型上进行的数学基准测试的大量实验表明,RIFT始终优于RFT。我们的研究结果表明,RIFT是使用混合质量的自我生成数据进行对齐的一个鲁棒且数据高效的替代方案。
摘要:While Supervised Fine-Tuning (SFT) and Rejection Sampling Fine-Tuning (RFT) are standard for LLM alignment, they either rely on costly expert data or discard valuable negative samples, leading to data inefficiency. To address this, we propose Reward Informed Fine-Tuning (RIFT), a simple yet effective framework that utilizes all self-generated samples. Unlike the hard thresholding of RFT, RIFT repurposes negative trajectories, reweighting the loss with scalar rewards to learn from both the positive and negative trajectories from the model outputs. To overcome the training collapse caused by naive reward integration, where direct multiplication yields an unbounded loss, we introduce a stabilized loss formulation that ensures numerical robustness and optimization efficiency. Extensive experiments on mathematical benchmarks across various base models show that RIFT consistently outperforms RFT. Our results demonstrate that RIFT is a robust and data-efficient alternative for alignment using mixed-quality, self-generated data.
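The instability the abstract describes, and one way to bound it, can be illustrated on a single token probability. The unlikelihood-style negative branch below is an assumption made for illustration, not necessarily RIFT's exact stabilized formulation.

```python
import math

def naive_loss(logp, reward):
    """Direct reward-weighted NLL: -r * log p. For reward < 0 this grows
    without bound as p -> 0, over-rewarding already-improbable tokens."""
    return -reward * logp

def stabilized_loss(logp, reward):
    """Bounded sketch: keep -r*log p for positive rewards, but score
    negative rewards with an unlikelihood-style term r*log(1 - p),
    which stays finite as p -> 0 (illustrative assumption)."""
    if reward >= 0:
        return -reward * logp
    p = math.exp(logp)
    return reward * math.log(max(1.0 - p, 1e-12))
```

With the bounded branch, a negatively rewarded token that is already improbable contributes almost nothing, while a confidently wrong token is still penalized.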
【10】From Hawkes Processes to Attention: Time-Modulated Mechanisms for Event Sequences
标题:从霍克斯过程到注意力:事件序列的时间调制机制
链接:https://arxiv.org/abs/2601.09220
作者:Xinzi Tan,Kejian Zhang,Junhan Yu,Doudou Zhou
摘要:标记时间点过程(Marked Temporal Point Processes, MTPPs)在医学、社会、商业和金融领域中自然出现。然而,现有的基于Transformer的方法大多仅通过位置编码注入时间信息,依赖于共享或参数化的衰减结构,这限制了它们捕获异构和特定类型时间效应的能力。受此观察启发,我们从MTPP的多变量Hawkes过程理论中推导出一种新的注意力算子Hawkes Attention,使用可学习的每类型神经核来调制查询、键和值投影,从而取代传统注意力中的相应部分。受益于这种设计,Hawkes Attention将事件时序和内容交互统一起来,从数据中学习与时间相关的行为和特定类型的激发模式。实验结果表明,与基线相比,我们的方法取得了更好的性能。除了一般的MTPP,我们的注意力机制也可以很容易地应用于特定的时间结构,如时间序列预测。
摘要:Marked Temporal Point Processes (MTPPs) arise naturally in medical, social, commercial, and financial domains. However, existing Transformer-based methods mostly inject temporal information only via positional encodings, relying on shared or parametric decay structures, which limits their ability to capture heterogeneous and type-specific temporal effects. Inspired by this observation, we derive a novel attention operator called Hawkes Attention from the multivariate Hawkes process theory for MTPP, using learnable per-type neural kernels to modulate query, key and value projections, thereby replacing the corresponding parts in the traditional attention. Benefited from the design, Hawkes Attention unifies event timing and content interaction, learning both the time-relevant behavior and type-specific excitation patterns from the data. The experimental results show that our method achieves better performance compared to the baselines. In addition to the general MTPP, our attention mechanism can also be easily applied to specific temporal structures, such as time series forecasting.
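The time-modulation idea can be sketched as causal attention whose logits decay with elapsed time at a per-event-type rate. This is a simplified stand-in: the paper uses learnable neural kernels on the query, key, and value projections, whereas here `beta` is a fixed per-type scalar decay.

```python
import numpy as np

def hawkes_style_attention(q, k, v, times, types, beta):
    """Causal attention with logits damped by a per-type exponential
    time-decay kernel, echoing the Hawkes intensity's excitation decay.
    beta[c]: decay rate for source events of type c (fixed here for
    illustration)."""
    n, d = q.shape
    logits = (q @ k.T) / np.sqrt(d)
    dt = times[:, None] - times[None, :]           # elapsed time t_i - t_j
    logits = logits - beta[types][None, :] * dt    # older events fade faster
    logits = np.where(dt < 0, -np.inf, logits)     # attend only to the past
    w = np.exp(logits - logits.max(1, keepdims=True))
    w = w / w.sum(1, keepdims=True)
    return w @ v
```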
【11】Hidden States as Early Signals: Step-level Trace Evaluation and Pruning for Efficient Test-Time Scaling
标题:作为早期信号的隐藏状态:步骤级轨迹评估和剪枝以实现高效的测试时扩展
链接:https://arxiv.org/abs/2601.09093
作者:Zhixiang Liang,Beichen Huang,Zheng Wang,Minjia Zhang
摘要:大型语言模型(LLM)可以通过生成多个推理轨迹来进行测试时扩展,从而增强推理能力。然而,冗长的推理轨迹与多次采样的组合引入了大量计算和很高的端到端延迟。先前加速这一过程的工作依赖于基于相似性或基于置信度的剪枝,但这些信号并不能可靠地指示轨迹质量。为了解决这些限制,我们提出了STEP(步骤级轨迹评估与剪枝),这是一种新的剪枝框架,它使用隐藏状态评估推理步骤,并在生成过程中动态剪除没有前景的轨迹。我们训练了一个轻量级的步骤评分器来估计轨迹质量,并设计了一个GPU内存感知的剪枝策略:当GPU内存被KV缓存占满时触发剪枝,以减少端到端延迟。在具有挑战性的推理基准上的实验表明,与自一致性相比,STEP平均将端到端推理延迟降低了45%-70%,同时还提高了推理准确性。我们的代码发布于:https://github.com/Supercomputing-System-AI-Lab/STEP
摘要:Large Language Models (LLMs) can enhance reasoning capabilities through test-time scaling by generating multiple traces. However, the combination of lengthy reasoning traces with multiple sampling introduces substantial computation and high end-to-end latency. Prior work on accelerating this process has relied on similarity-based or confidence-based pruning, but these signals do not reliably indicate trace quality. To address these limitations, we propose STEP: Step-level Trace Evaluation and Pruning, a novel pruning framework that evaluates reasoning steps using hidden states and dynamically prunes unpromising traces during generation. We train a lightweight step scorer to estimate trace quality, and design a GPU memory-aware pruning strategy that triggers pruning as the GPU memory is saturated by KV cache to reduce end-to-end latency. Experiments across challenging reasoning benchmarks demonstrate that STEP reduces end-to-end inference latency by 45%-70% on average compared to self-consistency while also improving reasoning accuracy. Our code is released at: https://github.com/Supercomputing-System-AI-Lab/STEP
【12】MMR-GRPO: Accelerating GRPO-Style Training through Diversity-Aware Reward Reweighting
Link: https://arxiv.org/abs/2601.09085
Authors: Kangda Wei, Ruihong Huang
Abstract: Group Relative Policy Optimization (GRPO) has become a standard approach for training mathematical reasoning models; however, its reliance on multiple completions per prompt makes training computationally expensive. Although recent work has reduced the number of training steps required to reach peak performance, the overall wall-clock training time often remains unchanged or even increases due to higher per-step cost. We propose MMR-GRPO, which integrates Maximal Marginal Relevance to reweight rewards based on completion diversity. Our key insight is that semantically redundant completions contribute limited marginal learning signal; prioritizing diverse solutions yields more informative updates and accelerates convergence. Extensive evaluations across three model sizes (1.5B, 7B, 8B), three GRPO variants, and five mathematical reasoning benchmarks show that MMR-GRPO achieves comparable peak performance while requiring on average 47.9% fewer training steps and 70.2% less wall-clock time. These gains are consistent across models, methods, and benchmarks. We will release our code, trained models, and experimental protocols.
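A hedged sketch of diversity-aware reward reweighting via Maximal Marginal Relevance: the greedy MMR ordering below is the standard formulation, while the constant relevance term and the linear rank-to-weight schedule are our illustrative choices, not necessarily the paper's exact rule.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def mmr_weights(embs, lam=0.5):
    """Greedily order completions by Maximal Marginal Relevance, then map
    diversity rank to a weight in (0, 1]. The constant relevance term and
    linear weight schedule are illustrative simplifications."""
    n = len(embs)
    selected, remaining = [], list(range(n))
    while remaining:
        def mmr_score(i):
            # Redundancy = similarity to the closest already-selected completion.
            redundancy = max((cosine(embs[i], embs[j]) for j in selected),
                             default=0.0)
            return lam * 1.0 - (1.0 - lam) * redundancy
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    weights = np.empty(n)
    for rank, idx in enumerate(selected):
        weights[idx] = 1.0 - rank / n   # most diverse completion keeps full reward
    return weights

rng = np.random.default_rng(2)
embs = [rng.normal(size=8) for _ in range(4)]       # completion embeddings
rewards = np.array([1.0, 0.0, 1.0, 1.0])            # verifier rewards
reweighted = rewards * mmr_weights(embs)            # diversity-aware rewards
```

Down-weighting near-duplicate completions concentrates the learning signal on distinct solution strategies, which is the mechanism the abstract credits for faster convergence.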
【13】How Many Human Judgments Are Enough? Feasibility Limits of Human Preference Evaluation
Link: https://arxiv.org/abs/2601.09084
Authors: Wilson Y. Lee
Abstract: Human preference evaluations are widely used to compare generative models, yet it remains unclear how many judgments are required to reliably detect small improvements. We show that when the preference signal is diffuse across prompts (i.e., all prompt types are similarly informative), proportional allocation is minimax-optimal: no allocation strategy substantially improves detectability. Empirical analysis of large-scale human preference datasets shows that most comparisons fall into this diffuse regime, exhibiting small preference margins that require far more judgments than typically collected, even in well-sampled comparisons. These limits persist across evaluation protocols and modalities, including chat, image generation, and code generation with execution feedback. In contrast, curated benchmarks that reduce prompt-induced variability systematically induce larger margins and improve detectability through a $1.5\times$ reduction in prompt-level variance. Our results show that inconclusive or negative human evaluation outcomes frequently reflect underpowered evaluation rather than model equivalence, underscoring the need to account explicitly for effect size, budget, and protocol design.
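The abstract's point that small margins require many judgments can be made concrete with a standard binomial power calculation. This is our illustration, not the paper's exact analysis; the z-values are hard-coded for a two-sided test at α = 0.05 with 80% power.

```python
from math import sqrt, ceil

def judgments_needed(margin, z_a=1.96, z_b=0.8416):
    """Normal-approximation sample size for detecting a win rate of
    0.5 + margin against the null of 0.5 (two-sided test; the default
    z-values correspond to alpha = 0.05 and 80% power)."""
    p = 0.5 + margin
    n = (z_a * 0.5 + z_b * sqrt(p * (1.0 - p))) ** 2 / margin ** 2
    return ceil(n)

# A 2-point win-rate edge (52% vs 48%) needs thousands of judgments,
# while a 10-point edge needs only a couple of hundred.
n_small = judgments_needed(0.02)
n_large = judgments_needed(0.10)
```

Since required sample size scales roughly as the inverse square of the margin, halving the detectable margin quadruples the judgment budget, which is why diffuse-regime comparisons are so often underpowered.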
【14】Continuous Fairness On Data Streams
Link: https://arxiv.org/abs/2601.08976
Authors: Subhodeep Ghosh, Zhihui Du, Angela Bonifati, Manish Kumar, David Bader, Senjuti Basu Roy
Abstract: We study the problem of enforcing continuous group fairness over windows in data streams. We propose a novel fairness model that ensures group fairness at a finer granularity level (referred to as a block) within each sliding window. This formulation is particularly useful when the window size is large, making it desirable to enforce fairness at a finer granularity. Within this framework, we address two key challenges: efficiently monitoring whether each sliding window satisfies block-level group fairness, and reordering the current window as effectively as possible when fairness is violated. To enable real-time monitoring, we design sketch-based data structures that maintain attribute distributions with minimal overhead. We also develop optimal, efficient algorithms for the reordering task, supported by rigorous theoretical guarantees. Our evaluation on four real-world streaming scenarios demonstrates the practical effectiveness of our approach. We achieve millisecond-level processing and a throughput of approximately 30,000 queries per second on average, depending on system parameters. The stream reordering algorithm improves block-level group fairness by up to 95% in certain cases, and by 50-60% on average across datasets. A qualitative study further highlights the advantages of block-level fairness compared to window-level fairness.
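An illustrative monitor for block-level group fairness in a sliding window: a toy Python check under our own fairness notion (a minimum per-group share in every block). The paper's exact constraint and its sketch-based implementation are more involved.

```python
from collections import Counter

def block_fair(window, block_size, min_share):
    """Return True iff every block of the window gives each group present
    in the window at least `min_share` of that block's slots.
    (Illustrative fairness notion; the paper's constraint may differ.)"""
    groups = set(window)
    for start in range(0, len(window), block_size):
        block = window[start:start + block_size]
        counts = Counter(block)
        if any(counts[g] < min_share * len(block) for g in groups):
            return False
    return True

# An alternating stream satisfies the block-level constraint...
window = ["A", "B", "A", "B", "A", "B", "A", "B"]
assert block_fair(window, block_size=4, min_share=0.25)
# ...while a skewed stream violates it, even though each whole
# window of 8 might look acceptable at window-level granularity.
assert not block_fair(["A", "A", "A", "B"] * 2, block_size=4, min_share=0.5)
```

The second example is the motivating case from the abstract: window-level counts can look fair while individual blocks within the window are badly skewed.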
【15】ConvoLearn: A Dataset of Constructivist Tutor-Student Dialogue
Link: https://arxiv.org/abs/2601.08950
Authors: Mayank Sharma, Roy Pea, Hari Subramonyam
Abstract: In educational applications, LLMs exhibit several fundamental pedagogical limitations, such as their tendency to reveal solutions rather than support dialogic learning. We introduce ConvoLearn (https://huggingface.co/datasets/masharma/convolearn), a dataset grounded in knowledge-building theory that operationalizes six core pedagogical dimensions: cognitive engagement, formative assessment, accountability, cultural responsiveness, metacognition, and power dynamics. We construct a semi-synthetic dataset of 1250 tutor-student dialogues (20 turns each) in middle school Earth Science through controlled interactions between human teachers and a simulated student. Using QLoRA, we demonstrate that training on this dataset meaningfully shifts LLM behavior toward knowledge-building strategies. Human evaluation by 31 teachers shows our fine-tuned Mistral 7B (M = 4.10, SD = 1.03) significantly outperforms both its base version (M = 2.59, SD = 1.11) and Claude Sonnet 4.5 (M = 2.87, SD = 1.29) overall. This work establishes a potential framework to guide future development and evaluation of constructivist AI tutors.
【16】Horseshoe Mixtures-of-Experts (HS-MoE)
Link: https://arxiv.org/abs/2601.09043
Authors: Nick Polson, Vadim Sokolov
Abstract: Horseshoe mixtures-of-experts (HS-MoE) models provide a Bayesian framework for sparse expert selection in mixture-of-experts architectures. We combine the horseshoe prior's adaptive global-local shrinkage with input-dependent gating, yielding data-adaptive sparsity in expert usage. Our primary methodological contribution is a particle learning algorithm for sequential inference, in which the filter is propagated forward in time while tracking only sufficient statistics. We also discuss how HS-MoE relates to modern mixture-of-experts layers in large language models, which are deployed under extreme sparsity constraints (e.g., activating a small number of experts per token out of a large pool).
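The horseshoe's global-local shrinkage can be sketched numerically. In this NumPy toy, half-Cauchy draws provide the global scale τ and per-expert local scales λ_k; the top-k gating step is our simplification and, unlike the paper's gating, is not input-dependent.

```python
import numpy as np

rng = np.random.default_rng(3)
n_experts, k = 16, 2

# Horseshoe prior: beta_j ~ N(0, tau^2 * lambda_j^2) with half-Cauchy
# scales; heavy tails let a few experts stay large while most shrink.
tau = abs(rng.standard_cauchy())                  # global scale
lam = np.abs(rng.standard_cauchy(n_experts))      # local per-expert scales
gate_logits = rng.normal(size=n_experts) * tau * lam

# Sparse gating: route to the k experts with the largest logits.
topk = np.argsort(gate_logits)[-k:]
z = gate_logits[topk] - gate_logits[topk].max()   # stabilize the softmax
gates = np.zeros(n_experts)
gates[topk] = np.exp(z) / np.exp(z).sum()
```

The global scale shrinks all expert weights toward zero, while the heavy-tailed local scales let individual experts escape that shrinkage, which is the mechanism that yields data-adaptive sparsity in expert usage.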
【17】An Inexact Weighted Proximal Trust-Region Method
Link: https://arxiv.org/abs/2601.09024
Authors: Leandro Farias Maia, Robert Baraldi, Drew P. Kouri
Abstract: In [R. J. Baraldi and D. P. Kouri, Math. Program., 201:1 (2023), pp. 559-598], the authors introduced a trust-region method for minimizing the sum of a smooth nonconvex and a nonsmooth convex function, the latter of which has an analytical proximity operator. While many functions satisfy this criterion, e.g., the $\ell_1$-norm defined on $\ell_2$, many others are precluded by either the topology or the nature of the nonsmooth term. Using the $\delta$-Fréchet subdifferential, we extend the definition of the inexact proximity operator and enable its use within the aforementioned trust-region algorithm. Moreover, we augment the analysis for the standard trust-region convergence theory to handle proximity operator inexactness with weighted inner products. We first introduce an algorithm to generate a point in the inexact proximity operator and then apply the algorithm within the trust-region method to solve an optimal control problem constrained by Burgers' equation.
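For orientation, the objects in play can be written out as follows. This is a standard-form sketch under our notation; the paper's precise inexactness criterion via the $\delta$-Fréchet subdifferential may differ in detail.

```latex
% Weighted proximity operator of the nonsmooth convex term \varphi:
\operatorname{prox}^{W}_{t\varphi}(x)
  = \operatorname*{arg\,min}_{u}\; \varphi(u) + \tfrac{1}{2t}\,\|u - x\|_{W}^{2},
\qquad \|v\|_{W}^{2} = \langle Wv, v\rangle.

% First-order characterization of the exact prox:
p = \operatorname{prox}^{W}_{t\varphi}(x)
\;\Longleftrightarrow\;
0 \in \partial\varphi(p) + \tfrac{1}{t}\, W (p - x).

% An inexact prox point relaxes the inclusion to the \delta-Fréchet
% subdifferential, recovering the exact case at \delta = 0:
0 \in \partial_{\delta}\varphi(p) + \tfrac{1}{t}\, W (p - x).
```

Relaxing the optimality inclusion rather than the minimization itself is what allows prox points to be computed only approximately while the trust-region convergence theory is adjusted to absorb the error.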