
Machine Learning arXiv Daily Digest [3.4]




cs.LG: 193 papers today


Large Models (19 papers)

【1】Understanding and Mitigating Dataset Corruption in LLM Steering
Link: https://arxiv.org/abs/2603.03206

Authors: Cullen Anderson,Narmeen Oozeer,Foad Namjoo,Remy Ogasawara,Amirali Abdullah,Jeff M. Phillips
Abstract: Contrastive steering has been shown to be a simple and effective method to adjust the generative behavior of LLMs at inference time. It uses examples of prompt responses with and without a trait to identify a direction in an intermediate activation layer, and then shifts activations in this 1-dimensional subspace. However, despite its growing use in AI safety applications, the robustness of contrastive steering to noisy or adversarial data corruption is poorly understood. We initiate a study of the robustness of this process with respect to corruption of the dataset of examples used to train the steering direction. Our first observation is that contrastive steering is quite robust to a moderate amount of corruption, but unwanted side effects can be clearly and maliciously manifested when a non-trivial fraction of the training data is altered. Second, we analyze the geometry of various types of corruption, and identify some safeguards. Notably, a key step in learning the steering direction involves high-dimensional mean computation, and we show that replacing this step with a recently developed robust mean estimator often mitigates most of the unwanted effects of malicious corruption.
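
The core computation the abstract describes reduces to a difference of class-conditional means over intermediate activations, with the shift applied at inference time. A minimal Python sketch (function names and the strength parameter alpha are illustrative, not the paper's code); the mean_fn hook marks the mean-computation step where a robust mean estimator, as the authors propose, could be swapped in:

    import numpy as np

    def steering_direction(acts_pos, acts_neg, mean_fn=np.mean):
        # acts_*: (n_examples, hidden_dim) activations from one intermediate
        # layer, collected on responses with vs. without the target trait.
        mu_pos = mean_fn(acts_pos, axis=0)
        mu_neg = mean_fn(acts_neg, axis=0)
        v = mu_pos - mu_neg                 # contrastive direction
        return v / np.linalg.norm(v)

    def steer(hidden, v, alpha=4.0):
        # Inference-time shift along the 1-D steering subspace.
        return hidden + alpha * v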


【2】MoD-DPO: Towards Mitigating Cross-modal Hallucinations in Omni LLMs using Modality Decoupled Preference Optimization
Link: https://arxiv.org/abs/2603.03192

Authors: Ashutosh Chaubey,Jiacheng Pang,Mohammad Soleymani
Comments: CVPR 2026. Project Page: https://mod-dpo.github.io/
Abstract: Omni-modal large language models (omni LLMs) have recently achieved strong performance across audiovisual understanding tasks, yet they remain highly susceptible to cross-modal hallucinations arising from spurious correlations and dominant language priors. In this work, we propose Modality-Decoupled Direct Preference Optimization (MoD-DPO), a simple and effective framework for improving modality grounding in omni LLMs. MoD-DPO introduces modality-aware regularization terms that explicitly enforce invariance to corruptions in irrelevant modalities and sensitivity to perturbations in relevant modalities, thereby reducing unintended cross-modal interactions. To further mitigate over-reliance on textual priors, we incorporate a language-prior debiasing penalty that discourages hallucination-prone text-only responses. Extensive experiments across multiple audiovisual hallucination benchmarks demonstrate that MoD-DPO consistently improves perception accuracy and hallucination resistance, outperforming previous preference optimization baselines under similar training budgets. Our findings underscore the importance of modality-faithful alignment and demonstrate a scalable path toward more reliable and resilient multimodal foundation models.


【3】Beyond One-Size-Fits-All: Adaptive Subgraph Denoising for Zero-Shot Graph Learning with Large Language Models
Link: https://arxiv.org/abs/2603.02938

Authors: Fengzhi Li,Liang Zhang,Yuan Zuo,Ruiqing Zhao,YanSong Liu,Yunfei Ma,Fanyu Meng,Junlan Feng
Abstract: Graph-based tasks in the zero-shot setting remain a significant challenge due to data scarcity and the inability of traditional Graph Neural Networks (GNNs) to generalize to unseen domains or label spaces. While recent advancements have transitioned toward leveraging Large Language Models (LLMs) as predictors to enhance GNNs, these methods often suffer from cross-modal alignment issues. A recent paradigm (i.e., Graph-R1) overcomes the aforementioned architectural dependencies by adopting a purely text-based format and utilizing LLM-based graph reasoning, showing improved zero-shot generalization. However, it employs a task-agnostic, one-size-fits-all subgraph extraction strategy, which inevitably introduces significant structural noise--irrelevant neighbors and edges--that distorts the LLMs' receptive field and leads to suboptimal predictions. To address this limitation, we introduce GraphSSR, a novel framework designed for adaptive subgraph extraction and denoising in zero-shot LLM-based graph reasoning. Specifically, we propose the SSR pipeline, which dynamically tailors subgraph extraction to specific contexts through a "Sample-Select-Reason" process, enabling the model to autonomously filter out task-irrelevant neighbors and overcome the one-size-fits-all issue. To internalize this capability, we develop SSR-SFT, a data synthesis strategy that generates high-quality SSR-style graph reasoning traces for supervised fine-tuning of LLMs. Furthermore, we propose SSR-RL, a two-stage reinforcement learning framework that explicitly regulates sampling and selection operations within the proposed SSR pipeline designed for adaptive subgraph denoising. By incorporating Authenticity-Reinforced and Denoising-Reinforced RL, we guide the model to achieve accurate predictions using parsimonious, denoised subgraphs for reasoning.


【4】Eliciting Numerical Predictive Distributions of LLMs Without Autoregression
Link: https://arxiv.org/abs/2603.02913

Authors: Julianna Piskorz,Katarzyna Kobalczyk,Mihaela van der Schaar
Comments: First two authors contributed equally. Published as a conference paper at ICLR 2026
Abstract: Large Language Models (LLMs) have recently been successfully applied to regression tasks -- such as time series forecasting and tabular prediction -- by leveraging their in-context learning abilities. However, their autoregressive decoding process may be ill-suited to continuous-valued outputs, where obtaining predictive distributions over numerical targets requires repeated sampling, leading to high computational cost and inference time. In this work, we investigate whether distributional properties of LLM predictions can be recovered without explicit autoregressive generation. To this end, we study a set of regression probes trained to predict statistical functionals (e.g., mean, median, quantiles) of the LLM's numerical output distribution directly from its internal representations. Our results suggest that LLM embeddings carry informative signals about summary statistics of their predictive distributions, including the numerical uncertainty. This investigation opens up new questions about how LLMs internally encode uncertainty in numerical tasks, and about the feasibility of lightweight alternatives to sampling-based approaches for uncertainty-aware numerical predictions.
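
One lightweight way to realize such probes is to fit linear models from hidden states to the statistical functionals of interest. A minimal sketch (Z and y are synthetic stand-ins for LLM embeddings and numerical targets; the paper's probe architecture may differ):

    import numpy as np
    from sklearn.linear_model import LinearRegression, QuantileRegressor

    rng = np.random.default_rng(0)
    Z = rng.normal(size=(500, 64))               # stand-in LLM hidden states
    y = Z[:, 0] + 0.1 * rng.normal(size=500)     # stand-in numerical targets

    probe_mean = LinearRegression().fit(Z, y)    # probe for the mean functional
    probe_q90 = QuantileRegressor(quantile=0.9, alpha=0.0).fit(Z, y)  # 0.9-quantile probe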


【5】From Heuristic Selection to Automated Algorithm Design: LLMs Benefit from Strong Priors
Link: https://arxiv.org/abs/2603.02792

Authors: Qi Huang,Furong Ye,Ananta Shahane,Thomas Bäck,Niki van Stein
Abstract: Large Language Models (LLMs) have already been widely adopted for automated algorithm design, demonstrating strong abilities in generating and evolving algorithms across various fields. Existing work has largely focused on examining their effectiveness in solving specific problems, with search strategies primarily guided by adaptive prompt designs. In this paper, through investigating the token-wise attribution of the prompts to LLM-generated algorithmic codes, we show that providing high-quality algorithmic code examples can substantially improve the performance of the LLM-driven optimization. Building upon this insight, we propose leveraging prior benchmark algorithms to guide LLM-driven optimization and demonstrate superior performance on two black-box optimization benchmarks: the pseudo-Boolean optimization suite (pbo) and the black-box optimization suite (bbob). Our findings highlight the value of integrating benchmarking studies to enhance both efficiency and robustness of the LLM-driven black-box optimization methods.


【6】From Solver to Tutor: Evaluating the Pedagogical Intelligence of LLMs with KMP-Bench
Link: https://arxiv.org/abs/2603.02775

Authors: Weikang Shi,Houxing Ren,Junting Pan,Aojun Zhou,Ke Wang,Zimu Lu,Yunqiao Yang,Yuxuan Hu,Linda Wei,Mingjie Zhan,Hongsheng Li
Abstract: Large Language Models (LLMs) show significant potential in AI mathematical tutoring, yet current evaluations often rely on simplistic metrics or narrow pedagogical scenarios, failing to assess comprehensive, multi-turn teaching effectiveness. In this paper, we introduce KMP-Bench, a comprehensive K-8 Mathematical Pedagogical Benchmark designed to assess LLMs from two complementary perspectives. The first module, KMP-Dialogue, evaluates holistic pedagogical capabilities against six core principles (e.g., Challenge, Explanation, Feedback), leveraging a novel multi-turn dialogue dataset constructed by weaving together diverse pedagogical components. The second module, KMP-Skills, provides a granular assessment of foundational tutoring abilities, including multi-turn problem-solving, error detection and correction, and problem generation. Our evaluations on KMP-Bench reveal a key disparity: while leading LLMs excel at tasks with verifiable solutions, they struggle with the nuanced application of pedagogical principles. Additionally, we present KMP-Pile, a large-scale (150K) dialogue dataset. Models fine-tuned on KMP-Pile show substantial improvement on KMP-Bench, underscoring the value of pedagogically-rich training data for developing more effective AI math tutors.


【7】DREAM: Where Visual Understanding Meets Text-to-Image Generation
Link: https://arxiv.org/abs/2603.02667

Authors: Chao Li,Tianhong Li,Sai Vidyaranya Nuthalapati,Hong-You Chen,Satya Narayan Shukla,Yonghuan Yang,Jun Xiao,Xiangjun Fan,Aashu Singh,Dina Katabi,Shlok Kumar Mishra
Abstract: Unifying visual representation learning and text-to-image (T2I) generation within a single model remains a central challenge in multimodal learning. We introduce DREAM, a unified framework that jointly optimizes discriminative and generative objectives, while learning strong visual representations. DREAM is built on two key techniques: During training, Masking Warmup, a progressive masking schedule, begins with minimal masking to establish the contrastive alignment necessary for representation learning, then gradually transitions to full masking for stable generative training. At inference, DREAM employs Semantically Aligned Decoding to align partially masked image candidates with the target text and select the best one for further decoding, improving text-image fidelity (+6.3%) without external rerankers. Trained solely on CC12M, DREAM achieves 72.7% ImageNet linear-probing accuracy (+1.1% over CLIP) and an FID of 4.25 (+6.2% over FLUID), with consistent gains in few-shot classification, semantic segmentation, and depth estimation. These results demonstrate that discriminative and generative objectives can be synergistic, allowing unified multimodal models that excel at both visual understanding and generation.
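
The Masking Warmup schedule can be pictured as a ramp from light to full masking. The abstract does not give the exact shape, so the linear ramp and parameter values below are assumptions for illustration only:

    def mask_ratio(step, total_steps, warmup_frac=0.3, start=0.15):
        # Begin with minimal masking (contrastive-alignment phase), then
        # ramp up to full masking (stable generative-training phase).
        warmup = max(1, int(warmup_frac * total_steps))
        if step >= warmup:
            return 1.0
        return start + (1.0 - start) * step / warmup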


【8】SUN: Shared Use of Next-token Prediction for Efficient Multi-LLM Disaggregated Serving
Link: https://arxiv.org/abs/2603.02599

Authors: Sunghyeon Woo,Ahreum Seo,Jaegwang Lee,Jaeeun Kil,Hanbae Seo,Joonghoon Kim,Baeseong Park,Se Jung Kwon,Dongsoo Lee
Comments: Preprint, 15 pages, 5 figures
Abstract: In multi-model LLM serving, decode execution remains inefficient due to model-specific resource partitioning: since cross-model batching is not possible, memory-bound decoding often suffers from severe GPU underutilization, especially under skewed workloads. We propose Shared Use of Next-token Prediction (SUN), the first approach that enables cross-model sharing of decode execution in disaggregated multi-LLM serving. SUN decomposes a decoder-only Transformer into a prefill module and a decode module, and fine-tunes only the task-specific prefill module, enabling a frozen decode module to be shared across models. This design enables a model-agnostic decode routing policy that balances decode requests across shared workers to maximize utilization. Across diverse tasks and model families, SUN achieves accuracy comparable to full fine-tuning while maintaining system throughput with fewer decode workers. In particular, SUN improves throughput per GPU by up to 2.0x over conventional disaggregation while keeping time-per-output-token (TPOT) within 5%. SUN inherently enables and facilitates low-bit decoding; with Quantized SUN (QSUN), it achieves a 45% speedup with comparable accuracy to SUN while preserving the benefits of shared decoding.


【9】How Controllable Are Large Language Models? A Unified Evaluation across Behavioral Granularities
Link: https://arxiv.org/abs/2603.02578

Authors: Ziwen Xu,Kewei Xu,Haoming Xu,Haiwen Hong,Longtao Huang,Hui Xue,Ningyu Zhang,Yongliang Shen,Guozhou Zheng,Huajun Chen,Shumin Deng
Comments: Work in progress
Abstract: Large Language Models (LLMs) are increasingly deployed in socially sensitive domains, yet their unpredictable behaviors, ranging from misaligned intent to inconsistent personality, pose significant risks. We introduce SteerEval, a hierarchical benchmark for evaluating LLM controllability across three domains: language features, sentiment, and personality. Each domain is structured into three specification levels: L1 (what to express), L2 (how to express), and L3 (how to instantiate), connecting high-level behavioral intent to concrete textual output. Using SteerEval, we systematically evaluate contemporary steering methods, revealing that control often degrades at finer-grained levels. Our benchmark offers a principled and interpretable framework for safe and controllable LLM behavior, serving as a foundation for future research.


【10】CoDAR: Continuous Diffusion Language Models are More Powerful Than You Think
Link: https://arxiv.org/abs/2603.02547

Authors: Junzhe Shen,Jieru Zhao,Ziwei He,Zhouhan Lin
Abstract: We study why continuous diffusion language models (DLMs) have lagged behind discrete diffusion approaches despite their appealing continuous generative dynamics. Under a controlled token-recovery study, we identify token rounding, the final projection from denoised embeddings to tokens, as a primary bottleneck. Building on these insights, we propose CoDAR (Continuous Diffusion with Contextual AutoRegressive Decoder), a two-stage framework that keeps diffusion entirely continuous in an embedding space while learning a strong, context-conditional discretizer: an autoregressive Transformer decoder that cross-attends to the denoised embedding sequence and performs contextualized rounding to tokens. Experiments on LM1B and OpenWebText demonstrate that CoDAR substantially improves generation quality over latent diffusion and becomes competitive with strong discrete DLMs, while exposing a simple decoder-temperature knob to navigate the fluency-diversity trade-off.
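
Structurally, the decoder CoDAR describes is a standard autoregressive Transformer decoder whose cross-attention memory is the denoised embedding sequence. A minimal PyTorch sketch (dimensions and names are illustrative; the causal mask and training loop are omitted for brevity):

    import torch.nn as nn

    class ContextualRounder(nn.Module):
        # AR decoder that cross-attends to denoised diffusion embeddings
        # and emits discrete tokens (context-conditional "rounding").
        def __init__(self, vocab_size, d_model=256, nhead=4, num_layers=2):
            super().__init__()
            self.tok_emb = nn.Embedding(vocab_size, d_model)
            layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
            self.decoder = nn.TransformerDecoder(layer, num_layers)
            self.lm_head = nn.Linear(d_model, vocab_size)

        def forward(self, prev_tokens, denoised_emb):
            tgt = self.tok_emb(prev_tokens)                  # tokens decoded so far
            h = self.decoder(tgt=tgt, memory=denoised_emb)   # cross-attention to diffusion output
            return self.lm_head(h)                           # next-token logits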


【11】MUSE: A Run-Centric Platform for Multimodal Unified Safety Evaluation of Large Language Models
Link: https://arxiv.org/abs/2603.02482

Authors: Zhongxi Wang,Yueqian Lin,Jingyang Zhang,Hai Helen Li,Yiran Chen
Comments: Submitted to ACL 2026 System Demonstration Track
Abstract: Safety evaluation and red-teaming of large language models remain predominantly text-centric, and existing frameworks lack the infrastructure to systematically test whether alignment generalizes to audio, image, and video inputs. We present MUSE (Multimodal Unified Safety Evaluation), an open-source, run-centric platform that integrates automatic cross-modal payload generation, three multi-turn attack algorithms (Crescendo, PAIR, Violent Durian), provider-agnostic model routing, and an LLM judge with a five-level safety taxonomy into a single browser-based system. A dual-metric framework distinguishes hard Attack Success Rate (Compliance only) from soft ASR (including Partial Compliance), capturing partial information leakage that binary metrics miss. To probe whether alignment generalizes across modality boundaries, we introduce Inter-Turn Modality Switching (ITMS), which augments multi-turn attacks with per-turn modality rotation. Experiments across six multimodal LLMs from four providers show that multi-turn strategies can achieve up to 90-100% ASR against models with near-perfect single-turn refusal. ITMS does not uniformly raise final ASR on already-saturated baselines, but accelerates convergence by destabilizing early-turn defenses, and ablation reveals that the direction of modality effects is model-family-specific rather than universal, underscoring the need for provider-aware cross-modal safety testing.
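
ITMS itself is a simple per-turn rotation over payload modalities; the sketch below shows only that selection rule (the rotation order is an assumption, and the real system routes each modality through its own payload generator):

    MODALITIES = ["text", "image", "audio", "video"]

    def itms_modality(turn_idx):
        # Inter-Turn Modality Switching: rotate the attack payload's
        # modality on every turn of a multi-turn evaluation.
        return MODALITIES[turn_idx % len(MODALITIES)]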


【12】VL-KGE: Vision-Language Models Meet Knowledge Graph Embeddings
Link: https://arxiv.org/abs/2603.02435

Authors: Athanasios Efthymiou,Stevan Rudinac,Monika Kackovic,Nachoem Wijnberg,Marcel Worring
Comments: Published in Proceedings of the ACM Web Conference 2026 (WWW '26). This arXiv version includes extended supplementary material
Abstract: Real-world multimodal knowledge graphs (MKGs) are inherently heterogeneous, modeling entities that are associated with diverse modalities. Traditional knowledge graph embedding (KGE) methods excel at learning continuous representations of entities and relations, yet they are typically designed for unimodal settings. Recent approaches extend KGE to multimodal settings but remain constrained, often processing modalities in isolation, resulting in weak cross-modal alignment, and relying on simplistic assumptions such as uniform modality availability across entities. Vision-Language Models (VLMs) offer a powerful way to align diverse modalities within a shared embedding space. We propose Vision-Language Knowledge Graph Embeddings (VL-KGE), a framework that integrates cross-modal alignment from VLMs with structured relational modeling to learn unified multimodal representations of knowledge graphs. Experiments on WN9-IMG and two novel fine art MKGs, WikiArt-MKG-v1 and WikiArt-MKG-v2, demonstrate that VL-KGE consistently improves over traditional unimodal and multimodal KGE methods in link prediction tasks. Our results highlight the value of VLMs for multimodal KGE, enabling more robust and structured reasoning over large-scale heterogeneous knowledge graphs.


【13】Silent Sabotage During Fine-Tuning: Few-Shot Rationale Poisoning of Compact Medical LLMs
Link: https://arxiv.org/abs/2603.02262

Authors: Jingyuan Xie,Wenjie Wang,Ji Wu,Jiandong Gao
Abstract: Supervised fine-tuning (SFT) is essential for the development of medical large language models (LLMs), yet prior poisoning studies have mainly focused on detectable backdoor attacks. We propose a novel poisoning attack targeting the reasoning process of medical LLMs during SFT. Unlike backdoor attacks, our method injects poisoned rationales into few-shot training data, leading to stealthy degradation of model performance on targeted medical topics. Results showed that knowledge overwriting was ineffective, while rationale poisoning caused a significant decline in accuracy on the target subject, as long as no correct samples of the same subject appear in the dataset. A minimum number and ratio of poisoned samples were needed to carry out an effective and stealthy attack, which was more efficient and accurate than catastrophic forgetting. We demonstrate through this study the risk of SFT-stage poisoning, hoping to spur more studies of defense in the sensitive medical domain.


【14】HELIOS: Harmonizing Early Fusion, Late Fusion, and LLM Reasoning for Multi-Granular Table-Text Retrieval
Link: https://arxiv.org/abs/2603.02248

Authors: Sungho Park,Joohyung Yun,Jongwuk Lee,Wook-Shin Han
Comments: 9 pages, 6 figures. Accepted at ACL 2025 main. Project page: https://helios-projectpage.github.io/
Abstract: Table-text retrieval aims to retrieve relevant tables and text to support open-domain question answering. Existing studies use either early or late fusion, but face limitations. Early fusion pre-aligns a table row with its associated passages, forming "stars," which often include irrelevant contexts and miss query-dependent relationships. Late fusion retrieves individual nodes, dynamically aligning them, but it risks missing relevant contexts. Both approaches also struggle with advanced reasoning tasks, such as column-wise aggregation and multi-hop reasoning. To address these issues, we propose HELIOS, which combines the strengths of both approaches. First, the edge-based bipartite subgraph retrieval identifies finer-grained edges between table segments and passages, effectively avoiding the inclusion of irrelevant contexts. Then, the query-relevant node expansion identifies the most promising nodes, dynamically retrieving relevant edges to grow the bipartite subgraph, minimizing the risk of missing important contexts. Lastly, the star-based LLM refinement performs logical inference at the star graph level rather than the bipartite subgraph, supporting advanced reasoning tasks. Experimental results show that HELIOS outperforms state-of-the-art models with a significant improvement of up to 42.6% and 39.9% in recall and nDCG, respectively, on the OTT-QA benchmark.


【15】CUDABench: Benchmarking LLMs for Text-to-CUDA Generation
Link: https://arxiv.org/abs/2603.02236

Authors: Jiace Zhu,Wentao Chen,Qi Fan,Zhixing Ren,Junying Wu,Xing Zhe Chai,Chotiwit Rungrueangwutthinon,Yehan Ma,An Zou
Abstract: Recent studies have demonstrated the potential of Large Language Models (LLMs) in generating GPU Kernels. Current benchmarks focus on the translation of high-level languages into CUDA, overlooking the more general and challenging task of text-to-CUDA generation. Furthermore, given the hardware-specific and performance-critical features of GPU programming, accurately assessing the performance of LLM-generated GPU programs is nontrivial. In this work, we introduce CUDABench, a comprehensive benchmark designed to evaluate the text-to-CUDA capabilities of LLMs. First, we construct CUDABench-Set, which covers a Breadth-Depth-Difficulty evaluation space across diverse application domains, including artificial intelligence, scientific computing, and data analytics. Furthermore, we propose CUDABench-Score and a Generative Verification Pipeline that assess (1) compilation correctness, (2) functional consistency through execution-based verification, and (3) a novel roofline-based metric, Performance-Score. Benchmarking state-of-the-art LLMs reveals insightful findings and challenges of text-to-CUDA, such as a notable mismatch between high compilation success rates and low functional correctness, a lack of domain-specific algorithmic knowledge, and suboptimal utilization of GPU hardware resources. Our benchmark is available at https://github.com/CUDA-Bench/CUDABench.
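
The abstract does not define the Performance-Score precisely; for orientation, the classic roofline model it references bounds attainable throughput by the compute and memory roofs, and one plausible normalization is measured throughput against that bound. All names and the formula below are assumptions, not the paper's metric:

    def attainable_gflops(arith_intensity, peak_gflops, peak_gbs):
        # Roofline bound: performance is capped by the compute roof or the
        # memory roof (arithmetic intensity x bandwidth), whichever binds.
        return min(peak_gflops, arith_intensity * peak_gbs)

    def performance_score(measured_gflops, arith_intensity, peak_gflops, peak_gbs):
        # Hypothetical normalization: a kernel's measured throughput
        # relative to its roofline-attainable throughput.
        return measured_gflops / attainable_gflops(arith_intensity, peak_gflops, peak_gbs)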


【16】Safety Training Persists Through Helpfulness Optimization in LLM Agents
Link: https://arxiv.org/abs/2603.02229

Authors: Benjamin Plaut
Comments: Under submission
Abstract: Safety post-training has been studied extensively in single-step "chat" settings where safety typically refers to refusing harmful requests. We study an "agentic" (i.e., multi-step, tool-use) setting where safety refers to harmful actions directly taken by the LLM. We compare the effects of running direct preference optimization (DPO) on safety or helpfulness alone vs both metrics sequentially. As expected, training on one metric alone results in an extreme point along the safety-helpfulness frontier. However, unlike prior work, we find that safety training persists through subsequent helpfulness training. We also find that all training configurations end up near a linear Pareto frontier with $R^2 = 0.77$. Even post-training on both metrics simultaneously simply results in another point on the frontier rather than finding a "best of both worlds" strategy, despite the presence of such strategies in our DPO dataset. Overall, our findings underscore the need for better understanding of post-training dynamics.


【17】MedFeat: Model-Aware and Explainability-Driven Feature Engineering with LLMs for Clinical Tabular Prediction
Link: https://arxiv.org/abs/2603.02221

Authors: Zizheng Zhang,Yiming Li,Justin Xu,Jinyu Wang,Rui Wang,Lei Song,Jiang Bian,David W Eyre,Jingjing Fu
Abstract: In healthcare tabular predictions, classical models with feature engineering often outperform neural approaches. Recent advances in Large Language Models enable the integration of domain knowledge into feature engineering, offering a promising direction. However, existing approaches typically rely on a broad search over predefined transformations, overlooking downstream model characteristics and feature importance signals. We present MedFeat, a feedback-driven and model-aware feature engineering framework that leverages LLM reasoning with domain knowledge and provides feature explanations based on SHAP values while tracking successful and failed proposals to guide feature discovery. By incorporating model awareness, MedFeat prioritizes informative signals that are difficult for the downstream model to learn directly due to its characteristics. Across a broad range of clinical prediction tasks, MedFeat achieves stable improvements over various baselines and discovers clinically meaningful features that generalize under distribution shift, demonstrating robustness across years and from ICU cohorts to general hospitalized patients, thereby offering insights into real-world deployment. Code required to reproduce our experiments will be released, subject to dataset agreements and institutional policies.
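
The SHAP-based feedback signal described here can be computed with standard libraries; a minimal sketch with synthetic stand-in data (the shap and xgboost choices are illustrative, not necessarily the paper's stack):

    import numpy as np
    import shap
    import xgboost as xgb

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 10))               # stand-in clinical features
    y = (X[:, 0] + X[:, 1] > 0).astype(int)      # stand-in binary outcome

    model = xgb.XGBClassifier(n_estimators=50).fit(X, y)
    sv = shap.TreeExplainer(model).shap_values(X)   # per-sample, per-feature attributions
    importance = np.abs(sv).mean(axis=0)            # importance signal fed back to the LLM loop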


【18】RxnNano: Training Compact LLMs for Chemical Reaction and Retrosynthesis Prediction via Hierarchical Curriculum Learning
Link: https://arxiv.org/abs/2603.02215

Authors: Ran Li,Shimin Di,Haowei LI,Luanshi Bu,Jiachuan Wang,Wangze Ni,Lei Chen
Abstract: Chemical reaction prediction is pivotal for accelerating drug discovery and synthesis planning. Despite advances in data-driven models, current approaches are hindered by an overemphasis on parameter and dataset scaling. Some methods, coupled with evaluation techniques, bypass fundamental challenges in reaction representation and fail to capture deep chemical intuition such as reaction common sense and topological atom-mapping logic. We argue that the core challenge lies in instilling this knowledge into the models. To this end, we propose a unified framework that prioritizes chemical understanding over scale through four key innovations: (1) a Latent Chemical Consistency objective that models reactions as movements on a continuous chemical manifold, ensuring reversible and physically plausible transformations; (2) a Hierarchical Cognitive Curriculum that trains the model through progressive stages, from syntax mastery to semantic reasoning, building robust chemical intuition; (3) Atom-Map Permutation Invariance (AMPI), which forces the model to learn invariant relational topology and balances multi-task learning; and (4) structured plan-based reasoning to improve the performance of the LLMs. Our compact 0.5B-parameter model, RxnNano, significantly outperforms fine-tuned LLMs ten times larger (>7B) and all domain baselines, achieving a 23.5% Top-1 accuracy improvement on rigorous benchmarks without test-time augmentation. Code is available at https://github.com/rlisml/RxnNano.


【19】ParamΔ for Direct Weight Mixing: Post-Train Large Language Model at Zero Cost
Link: https://arxiv.org/abs/2504.21023

Authors: Sheng Cao,Mingrui Wu,Karthik Prasad,Yuandong Tian,Zechun Liu
Comments: Published as a conference paper at ICLR 2025
Abstract: The post-training phase of large language models is essential for enhancing capabilities such as instruction-following, reasoning, and alignment with human preferences. However, it demands extensive high-quality data and poses risks like overfitting, alongside significant computational costs due to repeated post-training and evaluation after each base model update. This paper introduces ParamΔ, a novel method that streamlines post-training by transferring knowledge from an existing post-trained model to a newly updated base model with ZERO additional training. By computing the difference between post-trained model weights ($\Theta_\text{post}$) and base model weights ($\Theta_\text{base}$), and adding this to the updated base model ($\Theta'_\text{base}$), we define the ParamΔ model as $\Theta_{\text{Param}\Delta} = \Theta_\text{post} - \Theta_\text{base} + \Theta'_\text{base}$. This approach surprisingly equips the new base model with post-trained capabilities, achieving performance comparable to direct post-training. We analyzed Llama3, Llama3.1, Qwen, and DeepSeek-distilled models. Results indicate the ParamΔ model effectively replicates traditional post-training. For example, the ParamΔ model obtained from the 70B Llama3-inst, Llama3-base, and Llama3.1-base models attains approximately 95% of the Llama3.1-inst model's performance on average. ParamΔ brings a new perspective on how to fully leverage models in the open-weight community, where checkpoints for base and instruct models are readily available and frequently updated, by providing a cost-free framework to accelerate the iterative cycle of model development.
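
The update is a single element-wise operation over the three weight sets. A minimal sketch over PyTorch-style state dicts (assumes all three checkpoints share the same architecture and parameter names):

    def param_delta(theta_post, theta_base, theta_base_new):
        # ParamΔ: Θ_ParamΔ = Θ_post - Θ_base + Θ'_base, key by key.
        return {k: theta_post[k] - theta_base[k] + theta_base_new[k]
                for k in theta_post}

    # usage (hypothetical model objects):
    # new_model.load_state_dict(param_delta(post.state_dict(),
    #                                       base.state_dict(),
    #                                       new_base.state_dict()))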


Graphs (graph learning | graph neural networks | graph optimization, etc.) (8 papers)

【1】I-CAM-UV: Integrating Causal Graphs over Non-Identical Variable Sets Using Causal Additive Models with Unobserved Variables
Link: https://arxiv.org/abs/2603.03207

Authors: Hirofumi Suzuki,Kentaro Kanamori,Takuya Takagi,Thong Pham,Takashi Nicholas Maeda,Shohei Shimizu
Comments: 16 pages, 22 figures, to appear in the 40th AAAI Conference on Artificial Intelligence (AAAI 2026)
Abstract: Causal discovery from observational data is a fundamental tool in various fields of science. While existing approaches are typically designed for a single dataset, we often need to handle multiple datasets with non-identical variable sets in practice. One straightforward approach is to estimate a causal graph from each dataset and construct a single causal graph by overlapping. However, this approach identifies limited causal relationships because unobserved variables in each dataset can be confounders, and some variable pairs may be unobserved in any dataset. To address this issue, we leverage Causal Additive Models with Unobserved Variables (CAM-UV) that provide causal graphs having information related to unobserved variables. We show that the ground truth causal graph has structural consistency with the information of CAM-UV on each dataset. As a result, we propose an approach named I-CAM-UV to integrate CAM-UV results by enumerating all consistent causal graphs. We also provide an efficient combinatorial search algorithm and demonstrate the usefulness of I-CAM-UV against existing methods.


【2】Multi-Scale Adaptive Neighborhood Awareness Transformer For Graph Fraud Detection
Link: https://arxiv.org/abs/2603.03106

Authors: Jiaqi Lv,Qingfeng Du,Yu Zhang,Yongqi Han,Sheng Li
Abstract: Graph fraud detection (GFD) is crucial for identifying fraudulent behavior within graphs, benefiting various domains such as financial networks and social media. Existing methods based on graph neural networks (GNNs) have succeeded considerably due to their effective expressive capacity for graph-structured data. However, the inherent inductive bias of GNNs, including the homogeneity assumption and the limited global modeling ability, hinder the effectiveness of these models. To address these challenges, we propose Multi-scale Neighborhood Awareness Transformer (MANDATE), which alleviates the inherent inductive bias of GNNs. Specifically, we design a multi-scale positional encoding strategy to encode the positional information of various distances from the central node. By incorporating it with the self-attention mechanism, the global modeling ability can be enhanced significantly. Meanwhile, we design different embedding strategies for homophilic and heterophilic connections. This mitigates the homophily distribution differences between benign and fraudulent nodes. Moreover, an embedding fusion strategy is designed for multi-relation graphs, which alleviates the distribution bias caused by different relationships. Experiments on three fraud detection datasets demonstrate the superiority of MANDATE.


【3】Incremental Graph Construction Enables Robust Spectral Clustering of Texts
Link: https://arxiv.org/abs/2603.03056

Authors: Marko Pranjić,Boshko Koloski,Nada Lavrač,Senja Pollak,Marko Robnik-Šikonja
Comments: MP and BK contributed equally
Abstract: Neighborhood graphs are a critical but often fragile step in spectral clustering of text embeddings. On realistic text datasets, standard $k$-NN graphs can contain many disconnected components at practical sparsity levels (small $k$), making spectral clustering degenerate and sensitive to hyperparameters. We introduce a simple incremental $k$-NN graph construction that preserves connectivity by design: each new node is linked to its $k$ nearest previously inserted nodes, which guarantees a connected graph for any $k$. We provide an inductive proof of connectedness and discuss implications for incremental updates when new documents arrive. We validate the approach on spectral clustering of SentenceTransformer embeddings using Laplacian eigenmaps across six clustering datasets from the Massive Text Embedding Benchmark. Compared to standard $k$-NN graphs, our method outperforms in the low-$k$ regime where disconnected components are prevalent, and matches standard $k$-NN at larger $k$.
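
The construction is a few lines: insert nodes one at a time and wire each to its $k$ nearest predecessors, which keeps the graph connected for any $k \ge 1$. A minimal sketch with NumPy and NetworkX (brute-force distances for clarity; a practical implementation would use an incremental nearest-neighbor index):

    import numpy as np
    import networkx as nx

    def incremental_knn_graph(X, k):
        # X: (n, d) embeddings in insertion order. Each new node i is
        # linked to its k nearest previously inserted nodes, so every
        # node has a path back to node 0 and the graph stays connected.
        G = nx.Graph()
        G.add_node(0)
        for i in range(1, len(X)):
            d = np.linalg.norm(X[:i] - X[i], axis=1)
            for j in np.argsort(d)[:k]:
                G.add_edge(i, int(j), weight=float(d[j]))
        return G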


【4】MASPOB: Bandit-Based Prompt Optimization for Multi-Agent Systems with Graph Neural Networks
Link: https://arxiv.org/abs/2603.02630

Authors: Zhi Hong,Qian Zhang,Jiahang Sun,Zhiwei Shang,Mingze Kong,Xiangyi Wang,Yao Shu,Zhongxiang Dai
Comments: Preprint
Abstract: Large Language Models (LLMs) have achieved great success in many real-world applications, especially the one serving as the cognitive backbone of Multi-Agent Systems (MAS) to orchestrate complex workflows in practice. Since many deployment scenarios preclude MAS workflow modifications and its performance is highly sensitive to the input prompts, prompt optimization emerges as a more natural approach to improve its performance. However, real-world prompt optimization for MAS is impeded by three key challenges: (1) the need of sample efficiency due to prohibitive evaluation costs, (2) topology-induced coupling among prompts, and (3) the combinatorial explosion of the search space. To address these challenges, we introduce MASPOB (Multi-Agent System Prompt Optimization via Bandits), a novel sample-efficient framework based on bandits. By leveraging Upper Confidence Bound (UCB) to quantify uncertainty, the bandit framework balances exploration and exploitation, maximizing gains within a strictly limited budget. To handle topology-induced coupling, MASPOB integrates Graph Neural Networks (GNNs) to capture structural priors, learning topology-aware representations of prompt semantics. Furthermore, it employs coordinate ascent to decompose the optimization into univariate sub-problems, reducing search complexity from exponential to linear. Extensive experiments across diverse benchmarks demonstrate that MASPOB achieves state-of-the-art performance, consistently outperforming existing baselines.
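
At the heart of the bandit loop is the UCB acquisition rule; the sketch below is the generic UCB1-style selection only (MASPOB's GNN-informed priors and coordinate ascent are not shown):

    import numpy as np

    def ucb_select(mean_reward, pull_count, t, c=1.0):
        # mean_reward, pull_count: np.ndarray over candidate prompts.
        # Score = exploitation (empirical mean) + exploration bonus that
        # shrinks as a candidate accumulates evaluations; untried
        # candidates are evaluated first.
        bonus = c * np.sqrt(np.log(t + 1) / np.maximum(pull_count, 1))
        scores = np.where(pull_count == 0, np.inf, mean_reward + bonus)
        return int(np.argmax(scores))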


【5】Can Computational Reducibility Lead to Transferable Models for Graph Combinatorial Optimization?
Link: https://arxiv.org/abs/2603.02462

Authors: Semih Cantürk,Thomas Sabourin,Frederik Wenkel,Michael Perlmutter,Guy Wolf
Abstract: A key challenge in deriving unified neural solvers for combinatorial optimization (CO) is efficient generalization of models from a given set of tasks to new tasks not used during the initial training process. To address it, we first establish a new model, which uses a GCON module as a form of expressive message passing together with energy-based unsupervised loss functions. This model achieves high performance (often comparable with state-of-the-art results) across multiple CO tasks when trained individually on each task. We then leverage knowledge from the computational reducibility literature to propose pretraining and fine-tuning strategies that transfer effectively (a) between MVC, MIS and MaxClique, and (b) in a multi-task learning setting that additionally incorporates MaxCut, MDS and graph coloring. Additionally, in a leave-one-out, multi-task learning setting, we observe that pretraining on all but one task almost always leads to faster convergence on the remaining task when fine-tuning while avoiding negative transfer. Our findings indicate that learning common representations across multiple graph CO problems is viable through the use of expressive message passing coupled with pretraining strategies that are informed by the polynomial reduction literature, thereby taking an important step towards enabling the development of foundational models for neural CO. We provide an open-source implementation of our work at https://github.com/semihcanturk/COPT-MT.


【6】Learning graph topology from metapopulation epidemic encoder-decoder
Link: https://arxiv.org/abs/2603.02349

Authors: Xin Li,Jonathan Cohen,Shai Pilosof,Rami Puzis
Abstract: Metapopulation epidemic models are a valuable tool for studying large-scale outbreaks. With the limited availability of epidemic tracing data, it is challenging to infer the essential constituents of these models, namely, the epidemic parameters and the relevant mobility network between subpopulations. Either one of these constituents can be estimated while assuming the other; however, the problem of their joint inference has not yet been solved. Here, we propose two encoder-decoder deep learning architectures that infer metapopulation mobility graphs from time-series data, with and without the assumption of epidemic model parameters. Evaluation across diverse random and empirical mobility networks shows that the proposed approach outperforms the state-of-the-art topology inference. Further, we show that topology inference improves dramatically with data on additional pathogens. Our study establishes a robust framework for simultaneously inferring epidemic parameters and topology, addressing a persistent gap in modeling disease propagation.


【7】Graph Attention Based Prioritization of Disease Responsible Genes from Multimodal Alzheimer's Network
Link: https://arxiv.org/abs/2603.02273

Authors: Binon Teji,Subhajit Bandyopadhyay,Swarup Roy
Abstract: Prioritizing disease-associated genes is central to understanding the molecular mechanisms of complex disorders such as Alzheimer's disease (AD). Traditional network-based approaches rely on static centrality measures and often fail to capture cross-modal biological heterogeneity. We propose NETRA (Node Evaluation through Transformer-based Representation and Attention), a multimodal graph transformer framework that replaces heuristic centrality metrics with attention-driven relevance scoring. Using AD as a case study, gene regulatory networks are independently constructed from microarray, single-cell RNA-seq, and single-nucleus RNA-seq data. Random-walk sequences derived from these networks are used to train a BERT-based model for learning global gene embeddings, while modality-specific gene expression profiles are compressed using variational autoencoders. These representations are integrated with auxiliary biological networks, including protein-protein interactions, Gene Ontology semantic similarity, and diffusion-based gene similarity, into a unified multimodal graph. A graph transformer assigns NETRA scores that quantify gene relevance in a disease-specific and context-aware manner. Gene set enrichment analysis shows that NETRA achieves a normalized enrichment score of about 3.9 for the Alzheimer's disease pathway, substantially outperforming classical centrality measures and diffusion models. Top-ranked genes enrich multiple neurodegenerative pathways, recover a known late-onset AD susceptibility locus at chr12q13, and reveal conserved cross-disease gene modules. The framework preserves biologically realistic heavy-tailed network topology and is readily extensible to other complex disorders.


【8】Conformal Graph Prediction with Z-Gromov Wasserstein Distances
Link: https://arxiv.org/abs/2603.02460

Authors: Gabriel Melo,Thibaut de Saivre,Anna Calissano,Florence d'Alché-Buc
Abstract: Supervised graph prediction addresses regression problems where the outputs are structured graphs. Although several approaches exist for graph-valued prediction, principled uncertainty quantification remains limited. We propose a conformal prediction framework for graph-valued outputs, providing distribution-free coverage guarantees in structured output spaces. Our method defines nonconformity via the Z-Gromov-Wasserstein distance, instantiated in practice through Fused Gromov-Wasserstein (FGW), enabling permutation-invariant comparison between predicted and candidate graphs. To obtain adaptive prediction sets, we introduce Score Conformalized Quantile Regression (SCQR), an extension of Conformalized Quantile Regression (CQR) that handles complex output spaces such as graph-valued outputs. We evaluate the proposed approach on a synthetic task and a real problem of molecule identification.
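
For orientation, the standard split-conformal step underlying such guarantees computes a calibration quantile of nonconformity scores (here, e.g., FGW distances between predicted and true graphs). The sketch below is this generic baseline, not the paper's SCQR refinement:

    import numpy as np

    def conformal_threshold(cal_scores, alpha=0.1):
        # Split conformal: candidate graphs G with score(G) <= q_hat form
        # the prediction set, giving (1 - alpha) marginal coverage.
        n = len(cal_scores)
        level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
        return np.quantile(cal_scores, level, method="higher")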


Transformers (4 papers)

【1】From Complex Dynamics to DynFormer: Rethinking Transformers for PDEs
Link: https://arxiv.org/abs/2603.03112

Authors: Pengyu Lai,Yixiao Chen,Dewu Yang,Rui Wang,Feng Wang,Hui Xu
Abstract: Partial differential equations (PDEs) are fundamental for modeling complex physical systems, yet classical numerical solvers face prohibitive computational costs in high-dimensional and multi-scale regimes. While Transformer-based neural operators have emerged as powerful data-driven alternatives, they conventionally treat all discretized spatial points as uniform, independent tokens. This monolithic approach ignores the intrinsic scale separation of physical fields, applying computationally prohibitive global attention that redundantly mixes smooth large-scale dynamics with high-frequency fluctuations. Rethinking Transformers through the lens of complex dynamics, we propose DynFormer, a novel dynamics-informed neural operator. Rather than applying a uniform attention mechanism across all scales, DynFormer explicitly assigns specialized network modules to distinct physical scales. It leverages a Spectral Embedding to isolate low-frequency modes, enabling a Kronecker-structured attention mechanism to efficiently capture large-scale global interactions with reduced complexity. Concurrently, we introduce a Local-Global-Mixing transformation. This module utilizes nonlinear multiplicative frequency mixing to implicitly reconstruct the small-scale, fast-varying turbulent cascades that are slaved to the macroscopic state, without incurring the cost of global attention. Integrating these modules into a hybrid evolutionary architecture ensures robust long-term temporal stability. Extensive memory-aligned evaluations across four PDE benchmarks demonstrate that DynFormer achieves up to a 95% reduction in relative error compared to state-of-the-art baselines, while significantly reducing GPU memory consumption. Our results establish that embedding first-principles physical dynamics into Transformer architectures yields a highly scalable, theoretically grounded blueprint for PDE surrogate modeling.


【2】On the Expressive Power of Transformers for Maxout Networks and Continuous Piecewise Linear Functions
Link: https://arxiv.org/abs/2603.03084

Authors: Linyan Gu,Lihua Yang,Feng Zhou
Abstract: Transformer networks have achieved remarkable empirical success across a wide range of applications, yet their theoretical expressive power remains insufficiently understood. In this paper, we study the expressive capabilities of Transformer architectures. We first establish an explicit approximation of maxout networks by Transformer networks while preserving comparable model complexity. As a consequence, Transformers inherit the universal approximation capability of ReLU networks under similar complexity constraints. Building on this connection, we develop a framework to analyze the approximation of continuous piecewise linear functions by Transformers and quantitatively characterize their expressivity via the number of linear regions, which grows exponentially with depth. Our analysis establishes a theoretical bridge between approximation theory for standard feedforward neural networks and Transformer architectures. It also yields structural insights into Transformers: self-attention layers implement max-type operations, while feedforward layers realize token-wise affine transformations.
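
For reference, the maxout unit at the center of this construction computes a maximum over $k$ affine maps (a ReLU is the special case $k = 2$ with one map fixed at zero); this is the max-type operation the paper shows self-attention can emulate:

    $g(x) = \max_{1 \le i \le k} \left( w_i^\top x + b_i \right)$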


【3】Interpretable Motion-Attentive Maps: Spatio-Temporally Localizing Concepts in Video Diffusion Transformers
Link: https://arxiv.org/abs/2603.02919

作者:Youngjun Jun,Seil Kang,Woojung Han,Seong Jae Hwang
备注:CVPR 2026
摘要:视频扩散Transformer(DiT)已能从包含运动的文本描述中合成高保真、高质量的视频。然而,对视频DiT如何将运动词语转化为视频的理解仍然不足。此外,先前关于可解释显著性图的研究主要针对物体,视频DiT中与运动相关的行为在很大程度上仍未被探索。在本文中,我们研究具体的运动特征,以确定对于给定的运动概念,哪个物体在何时运动。首先,为了在空间上定位,我们引入GramCol,它能为任何文本概念(包括运动与非运动概念)自适应地生成逐帧显著性图。其次,我们提出一种运动特征选择算法,得到可解释的运动注意图(IMAP),在空间和时间上定位运动。我们的方法无需任何梯度计算或参数更新即可发现概念显著性图。实验结果表明,该方法在运动定位和zero-shot视频语义分割任务上表现出出色的定位能力,为运动和非运动概念提供了可解释且更清晰的显著性图。
摘要:Video Diffusion Transformers (DiTs) have been synthesizing high-quality video with high fidelity from given text descriptions involving motion. However, understanding how Video DiTs convert motion words into video remains insufficient. Furthermore, while prior studies on interpretable saliency maps primarily target objects, motion-related behavior in Video DiTs remains largely unexplored. In this paper, we investigate concrete motion features that specify when and which object moves for a given motion concept. First, to spatially localize, we introduce GramCol, which adaptively produces per-frame saliency maps for any text concept, including both motion and non-motion. Second, we propose a motion-feature selection algorithm to obtain an Interpretable Motion-Attentive Map (IMAP) that localizes motion spatially and temporally. Our method discovers concept saliency maps without the need for any gradient calculation or parameter update. Experimentally, our method shows outstanding localization capability on the motion localization task and zero-shot video semantic segmentation, providing interpretable and clearer saliency maps for both motion and non-motion concepts.


【4】Length Generalization Bounds for Transformers
标题:Transformer的长度概括界限
链接:https://arxiv.org/abs/2603.02238

作者:Andy Yang,Pascal Bergsträßer,Georg Zetzsche,David Chiang,Anthony W. Lin
摘要:长度泛化是学习算法的一个关键属性,它使算法能够在给定有限训练数据的情况下,对任意长度的输入做出正确预测。为了提供这样的保证,需要能够计算出一个长度泛化界,超过这个界后模型就能保证泛化。本文关注一个开放问题:对于与Transformer密切相关的语言类CRASP,这样的泛化界是否可计算。Chen等人最近给出了部分肯定结果:对只有一层的CRASP成立,在某些限制下对两层也成立。我们对上述开放问题给出完整回答。我们的主要结果是:对于CRASP(两层即已如此),从而对于Transformer,可计算的长度泛化界并不存在。作为补充,我们为CRASP的正片段给出了一个可计算的界,并证明该片段等价于固定精度的Transformer。对于正CRASP和固定精度的Transformer,我们证明长度复杂度是指数级的,并证明了所给界的最优性。
摘要:Length generalization is a key property of a learning algorithm that enables it to make correct predictions on inputs of any length, given finite training data. To provide such a guarantee, one needs to be able to compute a length generalization bound, beyond which the model is guaranteed to generalize. This paper concerns the open problem of the computability of such generalization bounds for CRASP, a class of languages which is closely linked to transformers. A positive partial result was recently shown by Chen et al. for CRASP with only one layer and, under some restrictions, also with two layers. We provide complete answers to the above open problem. Our main result is the non-existence of computable length generalization bounds for CRASP (already with two layers) and hence for transformers. To complement this, we provide a computable bound for the positive fragment of CRASP, which we show equivalent to fixed-precision transformers. For both positive CRASP and fixed-precision transformers, we show that the length complexity is exponential, and prove optimality of the bounds.


GAN|对抗|攻击|生成相关(3篇)

【1】Gravity Falls: A Comparative Analysis of Domain-Generation Algorithm (DGA) Detection Methods for Mobile Device Spearphishing
标题:重力下降:移动终端鱼叉式网络钓鱼领域生成算法(DGA)检测方法的比较分析
链接:https://arxiv.org/abs/2603.03270

作者:Adam Dorian Wong,John D. Hastings
备注:Disclaimer: The views expressed are those of the authors and do not necessarily reflect the official policy or position of the U.S. Department of Defense or the U.S. Government. References to external sites do not constitute endorsement. Cleared for release on 24 FEB 2026 (DOPSR 26-T-0771). Gravity Falls Dataset DOI: 10.5281/zenodo.17624554
摘要:移动设备经常成为电子犯罪威胁行为者的目标,他们通过SMS鱼叉式网络钓鱼(smishing)链接,利用域名生成算法(DGA)轮换敌对基础设施。尽管如此,DGA的研究与评估在很大程度上侧重于恶意软件C2和电子邮件钓鱼数据集,关于检测器如何泛化到企业边界之外、由smishing驱动的域名策略,证据仍然有限。本工作通过在Gravity Falls上评估传统和机器学习DGA检测器来填补这一空白;Gravity Falls是一个新的半合成数据集,来自2022年至2025年间投递的smishing链接。Gravity Falls捕获了单个威胁行为者在四个技术集群中的演变:从短随机字符串转向字典串联以及用于凭据窃取和费用/罚款欺诈的主题组合抢注(combo-squatting)变体。以Top-1M域名作为良性基线,评估了两种字符串分析方法(Shannon熵和Exp0se)和两种基于ML的检测器(LSTM分类器和COSSAS DGAD)。结果表现出强烈的策略依赖性:在随机字符串域名上性能最高,但在字典串联和主题组合抢注上性能下降,多个工具/集群组合的召回率偏低。总体而言,传统启发式方法和最近的ML检测器都难以应对Gravity Falls中观察到的持续演变的DGA策略,这促使研究更具上下文感知的方法,并为未来评估提供了可复现的基准。
摘要:Mobile devices are frequent targets of eCrime threat actors through SMS spearphishing (smishing) links that leverage Domain Generation Algorithms (DGA) to rotate hostile infrastructure. Despite this, DGA research and evaluation largely emphasize malware C2 and email phishing datasets, leaving limited evidence on how well detectors generalize to smishing-driven domain tactics outside enterprise perimeters. This work addresses that gap by evaluating traditional and machine-learning DGA detectors against Gravity Falls, a new semi-synthetic dataset derived from smishing links delivered between 2022 and 2025. Gravity Falls captures a single threat actor's evolution across four technique clusters, shifting from short randomized strings to dictionary concatenation and themed combo-squatting variants used for credential theft and fee/fine fraud. Two string-analysis approaches (Shannon entropy and Exp0se) and two ML-based detectors (an LSTM classifier and COSSAS DGAD) are assessed using Top-1M domains as benign baselines. Results are strongly tactic-dependent: performance is highest on randomized-string domains but drops on dictionary concatenation and themed combo-squatting, with low recall across multiple tool/cluster pairings. Overall, both traditional heuristics and recent ML detectors are ill-suited for consistently evolving DGA tactics observed in Gravity Falls, motivating more context-aware approaches and providing a reproducible benchmark for future evaluation.
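
Of the two string-analysis baselines named above, Shannon entropy is the simplest; a standard character-level version is sketched below. Consistent with the paper's finding, dictionary-concatenation domains built from real words score low and evade this heuristic, while randomized strings score high.

import math
from collections import Counter

def shannon_entropy(label: str) -> float:
    # Character-level entropy in bits per character; randomized DGA
    # labels approach log2(alphabet size), dictionary words stay lower.
    counts = Counter(label)
    n = len(label)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

print(shannon_entropy("securepayportal"))  # dictionary concatenation: lower
print(shannon_entropy("xq3vk9rz2mwd"))     # randomized string: higher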


【2】Generative adversarial imitation learning for robot swarms: Learning from human demonstrations and trained policies
标题:机器人群的生成对抗模仿学习:从人类演示和训练有素的政策中学习
链接:https://arxiv.org/abs/2603.02783

作者:Mattes Kraus,Jonas Kuckling
备注:Accepted for publication at the 2026 IEEE International Conference on Robotics and Automation (ICRA 2026)
摘要:在模仿学习中,机器人应当从期望行为的演示中学习。群体机器人模仿学习的大多数工作都以现有策略的rollout作为演示。在本工作中,我们提供了一个基于生成对抗模仿学习的框架,旨在从人类演示中学习集体行为。我们的框架在六个不同的任务上进行评估,既从手动演示中学习,也从由PPO训练的策略得到的演示中学习。结果表明,模仿学习过程能够学到有定性意义的行为,其表现与所提供的演示相当。此外,我们在真实机器人实验中将学到的策略部署到一群TurtleBot 4机器人上。所展示的行为保留了其视觉上可识别的特征,其性能与模拟中取得的性能相当。
摘要:In imitation learning, robots are supposed to learn from demonstrations of the desired behavior. Most of the work in imitation learning for swarm robotics provides the demonstrations as rollouts of an existing policy. In this work, we provide a framework based on generative adversarial imitation learning that aims to learn collective behaviors from human demonstrations. Our framework is evaluated across six different missions, learning both from manual demonstrations and demonstrations derived from a PPO-trained policy. Results show that the imitation learning process is able to learn qualitatively meaningful behaviors that perform similarly well as the provided demonstrations. Additionally, we deploy the learned policies on a swarm of TurtleBot 4 robots in real-robot experiments. The exhibited behaviors preserved their visually recognizable character and their performance is comparable to the one achieved in simulation.
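
For reference, the core of generative adversarial imitation learning (the standard GAIL objective, not anything swarm-specific from this paper) trains a discriminator to separate demonstration transitions from the policy's rollouts and hands its output to the RL algorithm as a learned reward:

import torch
import torch.nn.functional as F

def gail_discriminator_loss(d_expert: torch.Tensor, d_policy: torch.Tensor) -> torch.Tensor:
    # d_* are discriminator probabilities D(s, a) in (0, 1);
    # expert pairs are labeled 1, policy pairs 0.
    return (F.binary_cross_entropy(d_expert, torch.ones_like(d_expert))
            + F.binary_cross_entropy(d_policy, torch.zeros_like(d_policy)))

def gail_reward(d_policy: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    # Common surrogate reward passed to the policy optimizer (here: PPO).
    return -torch.log(1.0 - d_policy + eps)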


【3】Talking with Verifiers: Automatic Specification Generation for Neural Network Verification
标题:与验证者交谈:神经网络验证的自动规范生成
链接:https://arxiv.org/abs/2603.02235

作者:Yizhak Y. Elboher,Reuven Peleg,Zhouxing Shi,Guy Katz,Jan Křetínský
摘要:神经网络验证工具目前只支持一类狭窄的规范,通常表示为原始输入和输出的低级约束。这种限制大大阻碍了它们在不同应用领域的采用和实用性,在这些应用领域中,正确性要求自然地在更高的语义级别上表达。这一挑战源于深度神经网络的固有性质,它学习的内部表示缺乏与人类可理解特征的显式映射。为了解决这一问题,我们通过在验证管道中引入一个新的组件来弥合这一差距,使现有的验证工具适用于更广泛的领域和规范风格。我们的框架使用户能够用自然语言制定规范,然后自动分析并转换为与最先进的神经网络验证器兼容的正式验证查询。我们在结构化和非结构化数据集上评估了我们的方法,证明它成功地验证了以前无法访问的复杂语义规范。我们的研究结果表明,这种翻译过程保持了对用户意图的高度保真度,同时产生了较低的计算开销,从而大大扩展了正式DNN验证对现实世界的高层次需求的适用性。
摘要:Neural network verification tools currently support only a narrow class of specifications, typically expressed as low-level constraints over raw inputs and outputs. This limitation significantly hinders their adoption and practical applicability across diverse application domains where correctness requirements are naturally expressed at a higher semantic level. This challenge is rooted in the inherent nature of deep neural networks, which learn internal representations that lack an explicit mapping to human-understandable features. To address this, we bridge this gap by introducing a novel component to the verification pipeline, making existing verification tools applicable to a broader range of domains and specification styles. Our framework enables users to formulate specifications in natural language, which are then automatically analyzed and translated into formal verification queries compatible with state-of-the-art neural network verifiers. We evaluate our approach on both structured and unstructured datasets, demonstrating that it successfully verifies complex semantic specifications that were previously inaccessible. Our results show that this translation process maintains high fidelity to user intent while incurring low computational overhead, thereby substantially extending the applicability of formal DNN verification to real-world, high-level requirements.


半/弱/无/有监督|不确定性|主动学习(11篇)

【1】Learning Demographic-Conditioned Mobility Trajectories with Aggregate Supervision
标题:利用聚合监督学习以人口统计为条件的移动轨迹
链接:https://arxiv.org/abs/2603.03275

作者:Jessie Z. Li,Zhiqing Hong,Toru Shirakawa,Serina Chang
摘要:人类移动轨迹在公共卫生和社会科学中被广泛研究,其中不同的人口群体表现出显著不同的移动模式。然而,现有的轨迹生成模型很少捕捉这种异质性,因为大多数轨迹数据集缺乏人口统计标签。为了弥补数据上的这一空白,我们提出了ATLAS,一种弱监督的、以人口统计为条件的轨迹生成方法,仅使用:(i)没有人口统计标签的个人轨迹,(ii)区域级聚合移动特征,以及(iii)来自人口普查数据的区域级人口构成。ATLAS训练一个轨迹生成器并对其进行微调,使模拟的移动在以人口统计为条件的同时与观察到的区域聚合量相匹配。在带有人口统计标签的真实轨迹数据上的实验表明,ATLAS大幅提高了相对于基线的人口统计真实性(JSD $\downarrow$ 12%-69%),并在很大程度上缩小了与强监督训练的差距。我们进一步发展了理论分析,说明ATLAS何时以及为何有效,识别出包括跨区域人口多样性和聚合特征信息量在内的关键因素,并结合实验展示了该理论的实际意义。我们在https://github.com/schang-lab/ATLAS上发布代码。
摘要:Human mobility trajectories are widely studied in public health and social science, where different demographic groups exhibit significantly different mobility patterns. However, existing trajectory generation models rarely capture this heterogeneity because most trajectory datasets lack demographic labels. To address this gap in data, we propose ATLAS, a weakly supervised approach for demographic-conditioned trajectory generation using only (i) individual trajectories without demographic labels, (ii) region-level aggregated mobility features, and (iii) region-level demographic compositions from census data. ATLAS trains a trajectory generator and fine-tunes it so that simulated mobility matches observed regional aggregates while conditioning on demographics. Experiments on real trajectory data with demographic labels show that ATLAS substantially improves demographic realism over baselines (JSD $\downarrow$ 12%--69%) and closes much of the gap to strongly supervised training. We further develop theoretical analyses for when and why ATLAS works, identifying key factors including demographic diversity across regions and the informativeness of the aggregate feature, paired with experiments demonstrating the practical implications of our theory. We release our code at https://github.com/schang-lab/ATLAS.
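
The demographic-realism numbers above are reported as Jensen-Shannon divergence reductions; for concreteness, the standard (base-2) JSD between two discrete distributions is:

import numpy as np

def jsd(p, q, eps=1e-12):
    # Jensen-Shannon divergence in bits; 0 = identical, 1 = disjoint.
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: float((a * np.log2(a / b)).sum())
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

print(jsd([0.7, 0.2, 0.1], [0.5, 0.3, 0.2]))  # small positive value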


【2】Leveraging Label Proportion Prior for Class-Imbalanced Semi-Supervised Learning
标题:利用标签比例先验进行类别不平衡的半监督学习
链接:https://arxiv.org/abs/2603.02957

作者:Kohki Akiba,Shinnosuke Matsuo,Shota Harada,Ryoma Bise
摘要:半监督学习(SSL)经常受到类别不平衡的影响,其中伪标签会放大多数类偏差并抑制少数类性能。我们用一个轻量级框架来解决这个问题;据我们所知,这是第一个将标签比例学习(LLP)中的比例损失(Proportion Loss)作为正则化项引入SSL的框架。比例损失使模型预测与全局类别分布对齐,从而同时减轻多数类和少数类上的偏差。为了进一步稳定训练,我们提出了一个随机化变体,以应对小批量组成的波动。在Long-tailed CIFAR-10基准上的实验表明,将比例损失集成到FixMatch和ReMixMatch中,可以在不同的不平衡程度和标签比例下持续超越基线,并与现有的CISSL方法相比取得有竞争力或更优的结果,在标签稀缺的条件下尤其如此。
摘要:Semi-supervised learning (SSL) often suffers under class imbalance, where pseudo-labeling amplifies majority bias and suppresses minority performance. We address this issue with a lightweight framework that, to our knowledge, is the first to introduce Proportion Loss from learning from label proportions (LLP) into SSL as a regularization term. Proportion Loss aligns model predictions with the global class distribution, mitigating bias across both majority and minority classes. To further stabilize training, we formulate a stochastic variant that accounts for fluctuations in mini-batch composition. Experiments on the Long-tailed CIFAR-10 benchmark show that integrating Proportion Loss into FixMatch and ReMixMatch consistently improves performance over the baselines across imbalance severities and label ratios, and achieves competitive or superior results compared to existing CISSL methods, particularly under scarce-label conditions.
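
A minimal sketch of a proportion-loss regularizer in the spirit described above, assuming the global class prior is known; the exact form used in the paper (e.g., KL vs. cross-entropy, batch-level vs. dataset-level matching) may differ.

import torch

def proportion_loss(probs: torch.Tensor, prior: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    # probs: (B, C) softmax outputs on an unlabeled batch;
    # prior: (C,) known global class proportions.
    # Cross-entropy between the prior and the batch-averaged prediction.
    batch_prop = probs.mean(dim=0)
    return -(prior * torch.log(batch_prop + eps)).sum()

probs = torch.softmax(torch.randn(64, 10), dim=1)
prior = torch.full((10,), 0.1)
print(proportion_loss(probs, prior).item())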


【3】Improving Diffusion Planners by Self-Supervised Action Gating with Energies
标题:通过能量自我监督动作门控改善扩散规划
链接:https://arxiv.org/abs/2603.02650

作者:Yuan Lu,Dongqi Han,Yansen Wang,Dongsheng Li
摘要:扩散规划器是离线强化学习的一种强大方法,但当价值引导的选择偏向那些得分较高、却与环境动态局部不一致的轨迹时,它们可能失效,导致脆弱的执行。我们提出了基于能量的自监督动作门控(SAGE),一种推理时重排序方法,利用潜在一致性信号惩罚动态上不一致的规划。SAGE在离线状态序列上训练联合嵌入预测架构(JEPA)编码器,并为短时程转移训练一个以动作为条件的潜在预测器。在测试时,SAGE为每个采样候选分配一个由其潜在预测误差给出的能量,并将该可行性得分与价值估计相结合来选择动作。SAGE可以集成到任何能够采样轨迹并通过价值评分选择动作的现有扩散规划管道中;它既不需要环境rollout,也不需要重新训练策略。在运动、导航和操作基准测试中,SAGE提高了扩散规划器的性能和鲁棒性。
摘要:Diffusion planners are a strong approach for offline reinforcement learning, but they can fail when value-guided selection favours trajectories that score well yet are locally inconsistent with the environment dynamics, resulting in brittle execution. We propose Self-supervised Action Gating with Energies (SAGE), an inference-time re-ranking method that penalises dynamically inconsistent plans using a latent consistency signal. SAGE trains a Joint-Embedding Predictive Architecture (JEPA) encoder on offline state sequences and an action-conditioned latent predictor for short horizon transitions. At test time, SAGE assigns each sampled candidate an energy given by its latent prediction error and combines this feasibility score with value estimates to select actions. SAGE can integrate into existing diffusion planning pipelines that can sample trajectories and select actions via value scoring; it requires no environment rollouts and no policy re-training. Across locomotion, navigation, and manipulation benchmarks, SAGE improves the performance and robustness of diffusion planners.
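
The inference-time selection rule reduces to a simple re-ranking; the linear value-minus-energy combination and the weight lam below are illustrative assumptions, not the paper's exact scoring rule.

import numpy as np

def select_plan(candidates, value_fn, energy_fn, lam=1.0):
    # Score each sampled plan by its value estimate minus a feasibility
    # penalty (the latent prediction error of the JEPA predictor), then
    # pick the best-scoring candidate.
    scores = [value_fn(c) - lam * energy_fn(c) for c in candidates]
    return candidates[int(np.argmax(scores))]

# toy usage with stand-in scoring functions
plans = ["plan_a", "plan_b", "plan_c"]
values = {"plan_a": 1.0, "plan_b": 1.4, "plan_c": 1.3}
energies = {"plan_a": 0.1, "plan_b": 0.9, "plan_c": 0.2}
print(select_plan(plans, values.get, energies.get))  # -> plan_c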


【4】What Capable Agents Must Know: Selection Theorems for Robust Decision-Making under Uncertainty
标题:有能力的代理必须知道什么:不确定性下稳健决策的选择定理
链接:https://arxiv.org/abs/2603.02491

作者:Aran Nayebi
备注:18 pages
摘要:随着人工智能体的能力越来越强,智能体要在不确定性下胜任地行动,*必需*怎样的内部结构?经典结果表明,最优控制可以利用信念状态或世界模型来*实现*,但并未表明这类表示是必需的。我们证明了定量的"选择定理":在结构化的动作条件预测任务族上取得低*平均遗憾*,会迫使智能体实现一种预测性的、结构化的内部状态。我们的结果涵盖随机策略、部分可观测性以及任务分布下的评估,而不假设最优性、确定性或对显式模型的访问。技术上,我们将预测建模归约为二元"下注"决策,并证明遗憾界限制了次优下注上的概率质量,从而强制形成区分高收益结果所需的预测能力。在完全可观测的情形下,这产生对干预转移核的近似恢复;在部分可观测下,它意味着类信念记忆和预测状态的必要性,解决了先前世界模型恢复工作中的一个悬而未决的问题。
摘要:As artificial agents become increasingly capable, what internal structure is *necessary* for an agent to act competently under uncertainty? Classical results show that optimal control can be *implemented* using belief states or world models, but not that such representations are required. We prove quantitative "selection theorems" showing that low *average-case regret* on structured families of action-conditioned prediction tasks forces an agent to implement a predictive, structured internal state. Our results cover stochastic policies, partial observability, and evaluation under task distributions, without assuming optimality, determinism, or access to an explicit model. Technically, we reduce predictive modeling to binary "betting" decisions and show that regret bounds limit probability mass on suboptimal bets, enforcing the predictive distinctions needed to separate high-margin outcomes. In fully observed settings, this yields approximate recovery of the interventional transition kernel; under partial observability, it implies necessity of belief-like memory and predictive state, addressing an open question in prior world-model recovery work.


【5】Learning to Pay Attention: Unsupervised Modeling of Attentive and Inattentive Respondents in Survey Data
标题:学会集中注意力:调查数据中专注和不专注受访者的无监督建模
链接:https://arxiv.org/abs/2603.02427

作者:Ilias Triantafyllopoulos,Panos Ipeirotis
摘要:行为与社会科学调查的完整性取决于能否检测出给出随机或低投入回答的不专注受访者。注意力检查等传统保障措施往往成本高、滞后且不一致。我们提出了一个统一的、无需标签的不专注检测框架,利用互补的无监督视角来为回答一致性评分:几何重建(自动编码器)和概率依赖建模(Chow-Liu树)。虽然我们引入了"百分位损失"目标来提高自动编码器对异常的鲁棒性,但我们的主要贡献是识别出使无监督质量控制成为可能的结构条件。在九个异构的真实数据集上,我们发现检测效果更多地由问卷结构而非模型复杂度驱动:具有连贯、相互重叠题组的问卷表现出强协方差模式,使得即便是线性模型也能可靠地区分专注与不专注的受访者。这揭示了一种关键的"心理测量-ML对齐":最大化测量信度的设计原则(例如内部一致性)同样最大化了算法可检测性。该框架为调查平台提供了一个可扩展、领域无关的诊断工具,将数据质量与问卷设计直接联系起来,从而在不增加受访者负担的情况下实现审计。
摘要:The integrity of behavioral and social-science surveys depends on detecting inattentive respondents who provide random or low-effort answers. Traditional safeguards, such as attention checks, are often costly, reactive, and inconsistent. We propose a unified, label-free framework for inattentiveness detection that scores response coherence using complementary unsupervised views: geometric reconstruction (Autoencoders) and probabilistic dependency modeling (Chow-Liu trees). While we introduce a "Percentile Loss" objective to improve Autoencoder robustness against anomalies, our primary contribution is identifying the structural conditions that enable unsupervised quality control. Across nine heterogeneous real-world datasets, we find that detection effectiveness is driven less by model complexity than by survey structure: instruments with coherent, overlapping item batteries exhibit strong covariance patterns that allow even linear models to reliably separate attentive from inattentive respondents. This reveals a critical ``Psychometric-ML Alignment'': the same design principles that maximize measurement reliability (e.g., internal consistency) also maximize algorithmic detectability. The framework provides survey platforms with a scalable, domain-agnostic diagnostic tool that links data quality directly to instrument design, enabling auditing without additional respondent burden.
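
The "Percentile Loss" is not specified in the abstract; one plausible reading, sketched below, replaces the mean reconstruction error with a trimmed mean so that a few anomalous (inattentive) rows cannot dominate autoencoder training. This is a hypothetical formulation, not the paper's definition.

import torch

def percentile_loss(x: torch.Tensor, x_hat: torch.Tensor, q: float = 0.9) -> torch.Tensor:
    # Per-respondent squared reconstruction error; average only the
    # errors below the q-th percentile, so the largest (likely anomalous)
    # errors do not dominate the gradient.
    err = ((x - x_hat) ** 2).mean(dim=1)
    cutoff = torch.quantile(err, q)
    return err[err <= cutoff].mean()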


【6】Sequence-Level Unsupervised Training in Speech Recognition: A Theoretical Study
标题:语音识别中的序列级无监督训练:理论研究
链接:https://arxiv.org/abs/2603.02285

作者:Zijian Yang,Jörg Barkoczi,Ralf Schlüter,Hermann Ney
备注:accepted to ICASSP 2026
摘要:无监督语音识别是一项利用非配对数据训练语音识别模型的任务。为了确定无监督语音识别何时以及如何成功,以及分类错误如何与候选训练目标相关,我们开发了一个基于分类错误界限的无监督语音识别理论框架。我们介绍了两个条件下,无监督语音识别是可能的。并对这些条件的必要性进行了讨论。在这些条件下,我们推导出无监督语音识别的分类误差界,并在模拟中验证了这一界。受此限制,我们提出了一个单阶段的序列级交叉熵损失的无监督语音识别。
摘要:Unsupervised speech recognition is a task of training a speech recognition model with unpaired data. To determine when and how unsupervised speech recognition can succeed, and how classification error relates to candidate training objectives, we develop a theoretical framework for unsupervised speech recognition grounded in classification error bounds. We introduce two conditions under which unsupervised speech recognition is possible. The necessity of these conditions are also discussed. Under these conditions, we derive a classification error bound for unsupervised speech recognition and validate this bound in simulations. Motivated by this bound, we propose a single-stage sequence-level cross-entropy loss for unsupervised speech recognition.


【7】Temporal Imbalance of Positive and Negative Supervision in Class-Incremental Learning
标题:类增量学习中正负监督的时间失衡
链接:https://arxiv.org/abs/2603.02280

作者:Jinge Ma,Fengqing Zhu
备注:Accepted to CVPR 2026
摘要:随着深度学习在视觉任务中的广泛采用,类增量学习(CIL)已成为处理动态演变数据分布的重要范式。然而,CIL面临灾难性遗忘这一核心挑战,通常表现为对新类别的预测偏差。现有方法主要将这种偏差归因于任务内的类别不平衡,并专注于在分类器头部进行校正。在本文中,我们强调一个被忽视的因素,即时间失衡,它是这种偏差的关键原因:较早的类别在训练后期会受到更强的负监督,导致精确率和召回率不对称。我们建立了时间监督模型,形式化地定义了时间失衡,并提出了时间调整损失(TAL),它利用时间衰减核构造监督强度向量,动态地对交叉熵损失中的负监督重新加权。理论分析表明,TAL在平衡条件下退化为标准交叉熵,在失衡条件下能有效缓解预测偏差。大量实验表明,TAL显著减少了遗忘,并在多个CIL基准上提升了性能,凸显了时间建模对稳定长期学习的重要性。
摘要:With the widespread adoption of deep learning in visual tasks, Class-Incremental Learning (CIL) has become an important paradigm for handling dynamically evolving data distributions. However, CIL faces the core challenge of catastrophic forgetting, often manifested as a prediction bias toward new classes. Existing methods mainly attribute this bias to intra-task class imbalance and focus on corrections at the classifier head. In this paper, we highlight an overlooked factor -- temporal imbalance -- as a key cause of this bias. Earlier classes receive stronger negative supervision toward the end of training, leading to asymmetric precision and recall. We establish a temporal supervision model, formally define temporal imbalance, and propose Temporal-Adjusted Loss (TAL), which uses a temporal decay kernel to construct a supervision strength vector and dynamically reweight the negative supervision in cross-entropy loss. Theoretical analysis shows that TAL degenerates to standard cross-entropy under balanced conditions and effectively mitigates prediction bias under imbalance. Extensive experiments demonstrate that TAL significantly reduces forgetting and improves performance on multiple CIL benchmarks, underscoring the importance of temporal modeling for stable long-term learning.
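
A minimal sketch of a decay-kernel reweighted cross-entropy in the spirit of TAL; folding the per-class weights into the logits as below is our simplification, not the paper's exact loss. With uniform weights it reduces to standard cross-entropy, matching the degeneration property stated above.

import torch
import torch.nn.functional as F

def temporal_weights(class_task_ids, current_task, decay=0.9):
    # Exponential decay kernel: classes introduced in earlier tasks get
    # weaker negative supervision now.
    age = current_task - torch.as_tensor(class_task_ids, dtype=torch.float32)
    return decay ** age

def temporal_adjusted_ce(logits, target, w):
    # Weighted-softmax CE: exp(z_k) is scaled by w_k, so down-weighted
    # classes contribute less to the normalizer (weaker negative gradient).
    return F.cross_entropy(logits + torch.log(w), target)

logits = torch.randn(8, 5)
target = torch.randint(0, 5, (8,))
w = temporal_weights([0, 0, 1, 1, 2], current_task=2)
print(temporal_adjusted_ce(logits, target, w).item())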


【8】Scaling Reward Modeling without Human Supervision
标题:在没有人类监督的情况下扩展奖励建模
链接:https://arxiv.org/abs/2603.02225

作者:Jingxuan Fan,Yueying Li,Zhenting Qi,Dinghuai Zhang,Kianté Brantley,Sham M. Kakade,Hanlin Zhang
摘要:从反馈中学习是提升前沿模型能力与安全性的关键过程,但其有效性往往受到成本和可扩展性的限制。我们提出一项试点研究,探索通过无监督方法扩展奖励模型。我们将基于奖励的扩展(RBS)以其最简单的形式操作化:对从大规模网络语料中抽取的文档前缀和后缀进行偏好学习。其优势体现在多个方面:尽管不使用任何人工标注,在1100万token的以数学为主的网络数据上训练,在RewardBench v1和v2上带来了稳定的提升,并且这些改进在跨越模型家族和规模的各种初始化骨干中持续迁移。在各个模型上,我们的方法将RewardBench v2的准确率平均最多提高+7.7个点,在域内数学子集上的增益高达+16.1,并且在域外安全和通用子集上也有一致的改进。当应用于best-of-N选择和策略优化时,这些奖励模型大幅提升了下游数学性能,并达到或超过同等规模的强监督奖励模型基线。总体而言,我们证明了在无需昂贵且可能不可靠的人工标注的情况下训练奖励模型的可行性与前景。
摘要:Learning from feedback is an instrumental process for advancing the capabilities and safety of frontier models, yet its effectiveness is often constrained by cost and scalability. We present a pilot study that explores scaling reward models through unsupervised approaches. We operationalize reward-based scaling (RBS), in its simplest form, as preference learning over document prefixes and suffixes drawn from large-scale web corpora. Its advantage is demonstrated in various aspects: despite using no human annotations, training on 11M tokens of math-focused web data yields steady gains on RewardBench v1 and v2, and these improvements consistently transfer across diverse initialization backbones spanning model families and scales. Across models, our method improves RewardBench v2 accuracy by up to +7.7 points on average, with gains of up to +16.1 on in-domain math subsets and consistent improvements on out-of-domain safety and general subsets. When applied to best-of-N selection and policy optimization, these reward models substantially improve downstream math performance and match or exceed strong supervised reward model baselines of similar size. Overall, we demonstrate the feasibility and promise of training reward models without costly and potentially unreliable human annotations.
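
In its simplest form as described, the training signal is a Bradley-Terry preference loss in which a document's true suffix is preferred over a mismatched one; the negative-sampling scheme below (suffix from the next document) is an assumption for illustration.

import torch
import torch.nn.functional as F

def rbs_pairs(docs, split_frac=0.5):
    # Build (prefix, chosen_suffix, rejected_suffix) triples; the rejected
    # suffix is taken from a different document (hypothetical choice).
    triples = []
    for i, d in enumerate(docs):
        k = int(len(d) * split_frac)
        other = docs[(i + 1) % len(docs)]
        triples.append((d[:k], d[k:], other[int(len(other) * split_frac):]))
    return triples

def bradley_terry_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # r_* are scalar reward-model scores for (prefix + suffix).
    return -F.logsigmoid(r_chosen - r_rejected).mean()

print(rbs_pairs(["the proof follows by induction", "we set x equal to two"])[0])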


【9】ALARM: Automated MLLM-Based Anomaly Detection in Complex-EnviRonment Monitoring with Uncertainty Quantification
标题:ALARM:具有不确定性量化的复杂环境监测中基于MLLM的自动异常检测
链接:https://arxiv.org/abs/2512.03101

作者:Congjing Zhang,Feng Lin,Xinyi Zhao,Pei Guo,Wei Li,Lin Chen,Chaoyue Zhao,Shuai Huang
摘要:大语言模型(Large Language Models,LLM)的发展极大地激发了研究兴趣,即开发基于多模态LLM(Multi-Modal LLM,MLLM)的视觉异常检测(Visual Anomaly Detection,VAD)算法,以应用于复杂环境。面临的挑战是,在这些复杂的环境中,异常有时是高度上下文和模糊的,因此,不确定性量化(UQ)是一个基于MLLM的VAD系统成功的关键能力。在本文中,我们介绍了我们的UQ支持的MLLM为基础的VAD框架称为报警。ALARM将UQ与推理链、自反射和MLLM集成等质量保证技术相结合,以实现稳健和准确的性能,并基于严格的概率推理管道和计算过程进行设计。使用真实世界的智能家居基准数据和伤口图像分类数据进行了广泛的实证评估,这表明ALARM的卓越性能及其在不同领域的通用适用性,以进行可靠的决策。
摘要:The advance of Large Language Models (LLMs) has greatly stimulated research interest in developing multi-modal LLM (MLLM)-based visual anomaly detection (VAD) algorithms that can be deployed in complex environments. The challenge is that in these complex environments, the anomalies are sometimes highly contextual and also ambiguous, and thereby, uncertainty quantification (UQ) is a crucial capacity for an MLLM-based VAD system to succeed. In this paper, we introduce our UQ-supported MLLM-based VAD framework called ALARM. ALARM integrates UQ with quality-assurance techniques like reasoning chain, self-reflection, and MLLM ensemble for robust and accurate performance and is designed based on a rigorous probabilistic inference pipeline and computational process. Extensive empirical evaluations are conducted using the real-world smart-home benchmark data and wound image classification data, which shows ALARM's superior performance and its generic applicability across different domains for reliable decision-making.


【10】Scalable Uncertainty Quantification for Black-Box Density-Based Clustering
标题:面向黑箱基于密度聚类的可扩展不确定性量化
链接:https://arxiv.org/abs/2603.03188

作者:Nicola Bariletto,Stephen G. Walker
摘要:我们为聚类中的不确定性量化引入了一个新框架。通过将鞅后验范式与基于密度的聚类相结合,估计密度中的不确定性自然地传播到聚类结构。该方法利用现代神经密度估计器和GPU友好的并行计算,能有效扩展到高维和不规则形状的数据。我们建立了频率学派的一致性保证,并在合成与真实数据上验证了该方法。
摘要:We introduce a novel framework for uncertainty quantification in clustering. By combining the martingale posterior paradigm with density-based clustering, uncertainty in the estimated density is naturally propagated to the clustering structure. The approach scales effectively to high-dimensional and irregularly shaped data by leveraging modern neural density estimators and GPU-friendly parallel computation. We establish frequentist consistency guarantees and validate the methodology on synthetic and real data.


【11】Bias and Fairness in Self-Supervised Acoustic Representations for Cognitive Impairment Detection
标题:用于认知障碍检测的自我监督声学表示中的偏差和公平性
链接:https://arxiv.org/abs/2603.02937

作者:Kashaf Gulzar,Korbinian Riedhammer,Elmar Nöth,Andreas K. Maier,Paula Andrea Pérez-Toro
备注:12 pages, 4 figures, 6 tables, Journal paper
摘要:基于语音的认知障碍(CI)检测为早期诊断提供了一种很有前景的非侵入性方法,但不同人口统计学与临床亚组之间的性能差异仍未得到充分研究,引发了对公平性和泛化性的担忧。本研究基于DementiaBank Pitt语料库,对基于声学的CI与抑郁分类进行了系统的偏差分析。我们将传统声学特征(MFCC、eGeMAPS)与Wav2Vec 2.0(W2V2)的上下文语音嵌入进行比较,并评估了性别、年龄和抑郁状态亚组的分类性能。对于CI检测,较高层的W2V2嵌入优于基线特征(UAR高达80.6%),但表现出性能差异;特别是,女性和较年轻的参与者表现出更低的判别力(AUC分别为0.769和0.746)和显著的特异性差异($Δ_{spec}$分别高达18%和15%),其误分类风险高于对应群体。这些差异反映了表征偏差,即模型性能在人口统计学或临床亚组之间的系统性差异。CI受试者内的抑郁检测整体性能较低,W2V2低层和中层带来轻微改善。CI与抑郁分类之间的跨任务泛化能力有限,表明两项任务依赖不同的表征。这些发现强调,在临床语音应用中需要公平性感知的模型评估和针对亚组的分析,尤其是考虑到真实应用中的人口统计学与临床异质性。
摘要:Speech-based detection of cognitive impairment (CI) offers a promising non-invasive approach for early diagnosis, yet performance disparities across demographic and clinical subgroups remain underexplored, raising concerns around fairness and generalizability. This study presents a systematic bias analysis of acoustic-based CI and depression classification using the DementiaBank Pitt Corpus. We compare traditional acoustic features (MFCCs, eGeMAPS) with contextualized speech embeddings from Wav2Vec 2.0 (W2V2), and evaluate classification performance across gender, age, and depression-status subgroups. For CI detection, higher-layer W2V2 embeddings outperform baseline features (UAR up to 80.6\%), but exhibit performance disparities; specifically, females and younger participants demonstrate lower discriminative power (\(AUC\): 0.769 and 0.746, respectively) and substantial specificity disparities (\(Δ_{spec}\) up to 18\% and 15\%, respectively), leading to a higher risk of misclassifications than their counterparts. These disparities reflect representational biases, defined as systematic differences in model performance across demographic or clinical subgroups. Depression detection within CI subjects yields lower overall performance, with mild improvements from low and mid-level W2V2 layers. Cross-task generalization between CI and depression classification is limited, indicating that each task depends on distinct representations. These findings emphasize the need for fairness-aware model evaluation and subgroup-specific analysis in clinical speech applications, particularly in light of demographic and clinical heterogeneity in real-world applications.


迁移|Zero/Few/One-Shot|自适应(10篇)

【1】Adaptive Methods Are Preferable in High Privacy Settings: An SDE Perspective
标题:自适应方法在高隐私设置下更可取:SDE视角
链接:https://arxiv.org/abs/2603.03226

作者:Enea Monzio Compagnoni,Alessandro Stanghellini,Rustem Islamov,Aurelien Lucchi,Anastasiia Koloskova
备注:Accepted at ICLR 2026 (Poster)
摘要:随着隐私法规的收紧,差分隐私(DP)正成为大规模训练的核心。我们通过随机微分方程的视角重新审视DP噪声在优化中如何与自适应性相互作用,给出了首个基于SDE的隐私优化器分析。聚焦于逐样本裁剪下的DP-SGD和DP-SignSGD,我们在固定超参数下展示了鲜明对比:DP-SGD以$\mathcal{O}(1/\varepsilon^2)$的隐私-效用权衡收敛,速度与$\varepsilon$无关;而DP-SignSGD以随$\varepsilon$线性变化的速度收敛,权衡为$\mathcal{O}(1/\varepsilon)$,在高隐私或大批量噪声机制下占优。相比之下,在最优学习率下,两种方法取得相当的理论渐近性能;然而,DP-SGD的最优学习率与$\varepsilon$呈线性关系,而DP-SignSGD的最优学习率基本与$\varepsilon$无关。这使得自适应方法在实践中更为实用,因为其超参数可以在不同隐私级别之间迁移,几乎无需重新调参。实证结果在训练和测试指标上证实了我们的理论,并在经验上从DP-SignSGD推广到DP-Adam。
摘要:Differential Privacy (DP) is becoming central to large-scale training as privacy regulations tighten. We revisit how DP noise interacts with adaptivity in optimization through the lens of stochastic differential equations, providing the first SDE-based analysis of private optimizers. Focusing on DP-SGD and DP-SignSGD under per-example clipping, we show a sharp contrast under fixed hyperparameters: DP-SGD converges at a Privacy-Utility Trade-Off of $\mathcal{O}(1/\varepsilon^2)$ with speed independent of $\varepsilon$, while DP-SignSGD converges at a speed linear in $\varepsilon$ with an $\mathcal{O}(1/\varepsilon)$ trade-off, dominating in high-privacy or large batch noise regimes. By contrast, under optimal learning rates, both methods achieve comparable theoretical asymptotic performance; however, the optimal learning rate of DP-SGD scales linearly with $\varepsilon$, while that of DP-SignSGD is essentially $\varepsilon$-independent. This makes adaptive methods far more practical, as their hyperparameters transfer across privacy levels with little or no re-tuning. Empirical results confirm our theory across training and test metrics, and empirically extend from DP-SignSGD to DP-Adam.
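
For reference, the two update rules compared above differ only in whether the sign is taken after privatization; a per-example-clipping sketch follows (the noise calibration is simplified and not tied to any particular accountant).

import numpy as np

def privatized_mean(per_example_grads, clip, sigma, rng):
    # Clip each example's gradient to norm <= clip, sum, add Gaussian
    # noise scaled to the clipping bound, then average.
    clipped = [g * min(1.0, clip / (np.linalg.norm(g) + 1e-12))
               for g in per_example_grads]
    noise = sigma * clip * rng.standard_normal(per_example_grads[0].shape)
    return (np.sum(clipped, axis=0) + noise) / len(per_example_grads)

def dp_sgd_update(grads, clip, sigma, lr, rng):
    return -lr * privatized_mean(grads, clip, sigma, rng)

def dp_signsgd_update(grads, clip, sigma, lr, rng):
    # Only the sign of the privatized mean is used, which is what makes
    # the update's magnitude insensitive to the noise scale.
    return -lr * np.sign(privatized_mean(grads, clip, sigma, rng))

rng = np.random.default_rng(0)
grads = [rng.standard_normal(4) for _ in range(32)]
print(dp_sgd_update(grads, clip=1.0, sigma=1.0, lr=0.1, rng=rng))
print(dp_signsgd_update(grads, clip=1.0, sigma=1.0, lr=0.1, rng=rng))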


【2】Stabilized Adaptive Loss and Residual-Based Collocation for Physics-Informed Neural Networks
标题:物理信息神经网络的稳定自适应损失与基于残差的配点方法
链接:https://arxiv.org/abs/2603.03224

作者:Divyavardhan Singh,Shubham Kamble,Dimple Sonone,Kishor Upla
备注:6 pages, 2 Figures, 4 tables
摘要:物理信息神经网络(PINN)被认为是求解偏微分方程的一种融入物理信息的无网格替代方案。然而,在处理以高刚性或激波主导动力学为特征的问题时,传统PINN存在局限,包括训练不平衡以及即使物理残差很小、解仍不准确。在本研究中,我们以低粘度的粘性Burgers方程和Allen-Cahn方程作为测试问题来解决这些局限。针对训练不平衡问题,我们开发了一种新的自适应损失平衡方案,利用平滑后的梯度范数来确保初始条件和边界条件得到满足。此外,针对解不准确的问题,我们开发了一种自适应的基于残差的配点方案,以提高物理残差较大区域内解的精度。所提出的新方法显著提升了解的精度,并始终保持对物理残差的满足。例如,对于Burgers方程,相对L2误差比传统PINN降低约44%;对于Allen-Cahn方程,相对L2误差降低约70%。此外,我们使用一个鲁棒的有限差分求解器,对所提方法的解进行了可信的对比验证。
摘要:Physics-Informed Neural Networks (PINNs) have been recognized as a mesh-free alternative to solve partial differential equations where physics information is incorporated. However, in dealing with problems characterized by high stiffness or shock-dominated dynamics, traditional PINNs have been found to have limitations, including unbalanced training and inaccuracy in solution, even with small physics residuals. In this research, we seek to address these limitations using the viscous Burgers' equation with low viscosity and the Allen-Cahn equation as test problems. In addressing unbalanced training, we have developed a new adaptive loss balancing scheme using smoothed gradient norms to ensure satisfaction of initial and boundary conditions. Further, to address inaccuracy in the solution, we have developed an adaptive residual-based collocation scheme to improve the accuracy of solutions in the regions with high physics residuals. The proposed new approach significantly improves solution accuracy with consistent satisfaction of physics residuals. For instance, in the case of Burgers' equation, the relative L2 error is reduced by about 44 percent compared to traditional PINNs, while for the Allen-Cahn equation, the relative L2 error is reduced by approximately 70 percent. Additionally, we show the trustworthy solution comparison of the proposed method using a robust finite difference solver.
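
A minimal sketch of gradient-norm loss balancing with smoothing, in the spirit of the scheme described above; the exact smoothing rule and weighting in the paper may differ.

import torch

def update_bc_weight(loss_res, loss_bc, params, lam_prev, alpha=0.9):
    # Match the gradient scale of the boundary/initial-condition loss to
    # that of the PDE-residual loss, smoothed with an exponential moving
    # average to avoid oscillating weights.
    g_res = torch.autograd.grad(loss_res, params, retain_graph=True)
    g_bc = torch.autograd.grad(loss_bc, params, retain_graph=True)
    n_res = torch.sqrt(sum((g ** 2).sum() for g in g_res))
    n_bc = torch.sqrt(sum((g ** 2).sum() for g in g_bc))
    lam_hat = (n_res / (n_bc + 1e-12)).detach()
    return alpha * lam_prev + (1 - alpha) * lam_hat

# inside the training loop one would then use:
#   lam = update_bc_weight(loss_res, loss_bc, list(model.parameters()), lam)
#   total_loss = loss_res + lam * loss_bc
#   total_loss.backward()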


【3】Channel-Adaptive Edge AI: Maximizing Inference Throughput by Adapting Computational Complexity to Channel States
标题:信道自适应边缘AI:通过使计算复杂度适应信道状态来最大化推理吞吐量
链接:https://arxiv.org/abs/2603.03146

作者:Jierui Zhang,Jianhao Huang,Kaibin Huang
备注:14 pages, 14 figures
摘要:集成通信与计算(IC$^2$)已成为在第六代(6G)网络中实现高效边缘推理的新范式。然而,由于缺乏一个易于处理的理论框架来刻画端到端(E2E)推理性能,IC$^2$技术的设计受到阻碍。该度量非常复杂,因为它需要同时考虑信道失真以及人工智能(AI)模型的架构和计算复杂度。在本工作中,我们通过建立一个易处理的E2E推理精度分析模型来应对这一挑战,并利用它设计一种信道自适应AI算法,在时延和精度约束下最大化推理吞吐量,即边缘处理速率(EPR)。具体来说,我们考虑一个边缘推理系统,其中服务器部署带有早退机制(从而具有灵活计算复杂度)的骨干模型,对移动设备传输的数据特征进行推理。所提出的精度模型使用von Mises混合(MvM)分布刻画角域中的高维特征分布,由此得到推理精度关于量化位宽和模型遍历深度的闭式表达式,二者分别刻画信道失真和计算复杂度。基于该精度模型,我们在时延与精度联合约束下构建并求解EPR最大化问题,得到一种实现完全IC$^2$集成的信道自适应AI算法。该算法根据信道条件联合调整发送端特征压缩与接收端模型复杂度,以最大化整体效率和推理吞吐量。实验结果表明,与固定复杂度的同类算法相比,该算法具有更优的性能。
摘要:\emph{Integrated communication and computation} (IC$^2$) has emerged as a new paradigm for enabling efficient edge inference in sixth-generation (6G) networks. However, the design of IC$^2$ technologies is hindered by the lack of a tractable theoretical framework for characterizing \emph{end-to-end} (E2E) inference performance. The metric is highly complicated as it needs to account for both channel distortion and artificial intelligence (AI) model architecture and computational complexity. In this work, we address this challenge by developing a tractable analytical model for E2E inference accuracy and leveraging it to design a \emph{channel-adaptive AI} algorithm that maximizes inference throughput, referred to as the edge processing rate (EPR), under latency and accuracy constraints. Specifically, we consider an edge inference system in which a server deploys a backbone model with early exit, which enables flexible computational complexity, to perform inference on data features transmitted by a mobile device. The proposed accuracy model characterizes high-dimensional feature distributions in the angular domain using a Mixture of von Mises (MvM) distribution. This leads to a desired closed-form expression for inference accuracy as a function of quantization bit-width and model traversal depth, which represents channel distortion and computational complexity, respectively. Building upon this accuracy model, we formulate and solve the EPR maximization problem under joint latency and accuracy constraints, leading to a channel-adaptive AI algorithm that achieves full IC$^2$ integration. The proposed algorithm jointly adapts transmit-side feature compression and receive-side model complexity according to channel conditions to maximize overall efficiency and inference throughput. Experimental results demonstrate its superior performance as compared with fixed-complexity counterparts.


【4】On the Structural Limitations of Weight-Based Neural Adaptation and the Role of Reversible Behavioral Learning
标题:基于权重的神经适应的结构局限性和可逆行为学习的作用
链接:https://arxiv.org/abs/2603.02934

作者:Pardhu Sri Rushi Varma Konduru
备注:19 pages, 5 figures. Preprint version
摘要 :神经模型通常通过微调、基于增强的训练和强化学习来改变模型组件之间共享的参数。这些变化在短期优化中是有效的。然而,它们会导致模型基本行为的长期改变。在这项研究中,我们引入了结构不可逆性的概念,作为共享参数模型自适应的一个特点。这一概念指的是特定任务的目标与模型的代表身份交织在一起。我们表明,当参数直接突变,所得到的模型的行为偏离原来的模型。如果没有一个明确的参数快照,这种分歧就不能确定性地逆转。我们引入了可逆的行为学习,其中模型行为在结构上与身份参数分离,并且可以通过显式卸载过程确定性地卸载。我们还引入了可恢复性因子作为行为可恢复性的标准化度量,并提供了基于模型分歧的其他诊断。实验表明,可逆模型自适应实现数值精度内的回滚,而共享参数变异表现出持久的复位后发散。
摘要:Neural models are usually adapted through changes in parameters shared among model components via fine-tuning, alignment-based training, and reinforcement learning. These changes have been found effective in short-term optimization. However, they result in long-term alterations in the model's base behavior. In this study, we introduce the concept of structural irreversibility as a characteristic of shared-parameter model adaptation. This concept refers to the intertwining of task-specific objectives with the representational identity of the model. We show that when parameters are directly mutated, the resulting model behaves divergently from the original model. This divergence cannot be reversed deterministically without an explicit parameter snapshot. We introduce reversible behavioral learning, in which model behaviors are structurally dissociated from identity parameters and can be deterministically unloaded through an explicit unload process. We also introduce the Recoverability Factor as a normalized measure of behavioral recoverability and provide additional diagnostics based on model divergence. Experiments show that reversible model adaptation achieves rollback within numerical precision, whereas shared-parameter mutation exhibits persistent post-reset divergence.


【5】Video TokenCom: Textual Intent-Guided Multi-Rate Video Token Communications with UEP-Based Adaptive Source-Channel Coding
标题:视频TokenCom:采用基于UEP的自适应源通道编码的文本意图引导多速率视频令牌通信
链接:https://arxiv.org/abs/2603.02470

作者:Jingxuan Men,Mahdi Boloursaz Mashhadi,Ning Wang,Yi Ma,Mike Nilsson,Rahim Tafazolli
摘要:令牌通信(TokenCom)是受大型人工智能模型(LAM)和多模态大语言模型(MLLM)近期成功启发的新范式,其中令牌作为通信与计算的统一单元,在未来无线网络中实现高效的、面向语义和目标的信息交换。在本文中,我们提出了一种新颖的Video TokenCom框架,用于文本意图引导的多速率视频通信,并采用基于不等错误保护(UEP)的信源信道编码自适应。该框架将用户意图的文本描述与离散视频令牌化和不等错误保护相结合,以在严格的带宽约束下提升语义保真度。首先,通过预训练的视频令牌器提取离散视频令牌,同时联合使用文本条件的视觉语言建模与光流传播,来识别在空间和时间上与用户意图语义相对应的令牌。其次,我们引入一种语义感知的多速率比特分配策略:与用户意图高度相关的令牌使用完整码本精度编码,而非意图令牌通过降低码本精度的差分编码表示,从而在保持语义质量的同时节省码率。最后,我们开发了一种信源与信道编码自适应方案,使比特分配和信道编码适应变化的资源与链路条件。在多个视频数据集上的实验表明,所提框架在宽信噪比范围内的感知与语义质量上均优于传统和语义通信基线。
摘要:Token Communication (TokenCom) is a new paradigm, motivated by the recent success of Large AI Models (LAMs) and Multimodal Large Language Models (MLLMs), where tokens serve as unified units of communication and computation, enabling efficient semantic- and goal-oriented information exchange in future wireless networks. In this paper, we propose a novel Video TokenCom framework for textual intent-guided multi-rate video communication with Unequal Error Protection (UEP)-based source-channel coding adaptation. The proposed framework integrates user-intended textual descriptions with discrete video tokenization and unequal error protection to enhance semantic fidelity under restrictive bandwidth constraints. First, discrete video tokens are extracted through a pretrained video tokenizer, while text-conditioned vision-language modeling and optical-flow propagation are jointly used to identify tokens that correspond to user-intended semantics across space and time. Next, we introduce a semantic-aware multi-rate bit-allocation strategy, in which tokens highly related to the user intent are encoded using full codebook precision, whereas non-intended tokens are represented through reduced codebook precision differential encoding, enabling rate savings while preserving semantic quality. Finally, a source and channel coding adaptation scheme is developed to adapt bit allocation and channel coding to varying resources and link conditions. Experiments on various video datasets demonstrate that the proposed framework outperforms both conventional and semantic communication baselines, in perceptual and semantic quality on a wide SNR range.


【6】Quantum-Inspired Fine-Tuning for Few-Shot AIGC Detection via Phase-Structured Reparameterization
标题:通过相位结构重参数化实现Few-Shot AIGC检测的量子启发微调
链接:https://arxiv.org/abs/2603.02281

作者:Kaiyang Xing,Han Fang,Zhaoyun Chen,Zhonghui Li,Yang Yang,Weiming Zhang,Guoping Guo
备注:12 pages, 5 figures
摘要:最近的研究表明,量子神经网络(QNN)在Few-Shot机制下具有良好的泛化能力。为了将这一优势扩展到大规模任务,我们提出了Q-LoRA,这是一种量子增强的微调方案,将轻量级QNN集成到低秩自适应(LoRA)适配器中。应用于AI生成内容(AIGC)检测,Q-LoRA在Few-Shot设置下始终优于标准LoRA。我们分析了这种改进的来源,并从QNN中识别出两种可能的结构归纳偏差:(i)相位感知表示,它在正交幅度-相位分量上编码更丰富的信息,以及(ii)范数约束变换,它通过固有的正交性稳定优化。然而,由于量子模拟,Q-LoRA会产生不小的开销。受我们分析的启发,我们进一步介绍了H-LoRA,这是一种完全经典的变体,它在LoRA适配器中应用希尔伯特变换以保持相似的相位结构和约束。在Few-Shot AIGC检测上的实验表明,Q-LoRA和H-LoRA的准确度都超过标准LoRA 5%,H-LoRA在该任务中以显著更低的成本实现了相当的准确度。
摘要:Recent studies show that quantum neural networks (QNNs) generalize well in few-shot regimes. To extend this advantage to large-scale tasks, we propose Q-LoRA, a quantum-enhanced fine-tuning scheme that integrates lightweight QNNs into the low-rank adaptation (LoRA) adapter. Applied to AI-generated content (AIGC) detection, Q-LoRA consistently outperforms standard LoRA under few-shot settings. We analyze the source of this improvement and identify two possible structural inductive biases from QNNs: (i) phase-aware representations, which encode richer information across orthogonal amplitude-phase components, and (ii) norm-constrained transformations, which stabilize optimization via inherent orthogonality. However, Q-LoRA incurs non-trivial overhead due to quantum simulation. Motivated by our analysis, we further introduce H-LoRA, a fully classical variant that applies the Hilbert transform within the LoRA adapter to retain similar phase structure and constraints. Experiments on few-shot AIGC detection show that both Q-LoRA and H-LoRA outperform standard LoRA by over 5% accuracy, with H-LoRA achieving comparable accuracy at significantly lower cost in this task.
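
The Hilbert transform at the heart of H-LoRA has a standard FFT construction (the analytic-signal recipe below); exactly where the phase-shifted copy enters the adapter, e.g., mixed with the down-projected activations via hlora_mix, is our assumption rather than the paper's specification.

import torch

def hilbert_imag(x: torch.Tensor) -> torch.Tensor:
    # Imaginary part of the discrete analytic signal: a 90-degree
    # phase-shifted copy of x along its last dimension.
    n = x.shape[-1]
    X = torch.fft.fft(x, dim=-1)
    h = torch.zeros(n, device=x.device)
    h[0] = 1.0
    if n % 2 == 0:
        h[n // 2] = 1.0
        h[1:n // 2] = 2.0
    else:
        h[1:(n + 1) // 2] = 2.0
    return torch.fft.ifft(X * h, dim=-1).imag

def hlora_mix(x: torch.Tensor, gamma: float = 0.5) -> torch.Tensor:
    # Hypothetical phase-structured feature used inside the LoRA adapter.
    return x + gamma * hilbert_imag(x)

print(hlora_mix(torch.randn(4, 64)).shape)  # torch.Size([4, 64])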


【7】Boosting Meta-Learning for Few-Shot Text Classification via Label-guided Distance Scaling
标题:通过标签引导距离缩放增强元学习用于Few-Shot文本分类
链接:https://arxiv.org/abs/2603.02267

作者:Yunlong Gao,Xinyue Liu,Yingbo Wang,Linlin Zong,Bo Xu
摘要:Few-Shot文本分类的目标是利用有限的带标签文本样本识别未见过的类别。现有方法侧重于通过在训练阶段设计复杂算法来增强元学习器。然而,带标签样本在测试阶段是随机选择的,因此可能无法提供有效的监督信号,从而导致误分类。为了解决这个问题,我们提出了一种标签引导距离缩放(Label-guided Distance Scaling, LDS)策略。我们方法的核心是在训练和测试阶段都利用标签语义作为监督信号。具体来说,在训练阶段,我们设计了一个标签引导损失来注入标签语义信息,拉近样本表示与相应的标签表示。在测试阶段,我们提出了一个标签引导缩放器,利用标签语义对样本表示进行缩放,以提供额外的监督信号。因此,即使带标签样本的表示远离类中心,我们的标签引导缩放器也会将它们拉近类中心,从而减轻误分类。我们结合两种常用的元学习器验证了该方法的有效性。大量实验结果表明,我们的方法显著优于最先进的模型。所有数据集和代码见https://anonymous.4open.science/r/Label-guided-Text-Classification。
摘要:Few-shot text classification aims to recognize unseen classes with limited labeled text samples. Existing approaches focus on boosting meta-learners by developing complex algorithms in the training stage. However, the labeled samples are randomly selected during the testing stage, so they may not provide effective supervision signals, leading to misclassification. To address this issue, we propose a \textbf{L}abel-guided \textbf{D}istance \textbf{S}caling (LDS) strategy. The core of our method is exploiting label semantics as supervision signals in both the training and testing stages. Specifically, in the training stage, we design a label-guided loss to inject label semantic information, pulling closer the sample representations and corresponding label representations. In the testing stage, we propose a Label-guided Scaler which scales sample representations with label semantics to provide additional supervision signals. Thus, even if labeled sample representations are far from class centers, our Label-guided Scaler pulls them closer to their class centers, thereby mitigating the misclassification. We combine two common meta-learners to verify the effectiveness of the method. Extensive experimental results demonstrate that our approach significantly outperforms state-of-the-art models. All datasets and codes are available at https://anonymous.4open.science/r/Label-guided-Text-Classification.
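
A minimal reading of the two mechanisms above: a training loss that pulls each sample embedding toward the text embedding of its class label, and a test-time scaler that shifts embeddings toward their nearest label embedding. The cosine form and the shift strength tau are assumptions for illustration.

import torch
import torch.nn.functional as F

def label_guided_loss(z: torch.Tensor, label_emb: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    # z: (B, d) sample embeddings; label_emb: (C, d) embeddings of the
    # label names; y: (B,) class indices. Pull z toward its label vector.
    return (1.0 - F.cosine_similarity(z, label_emb[y], dim=1)).mean()

def label_guided_scale(z: torch.Tensor, label_emb: torch.Tensor, tau: float = 0.5) -> torch.Tensor:
    # Test-time scaler sketch: move each embedding toward its most
    # similar label embedding before distance-based classification.
    sims = F.normalize(z, dim=1) @ F.normalize(label_emb, dim=1).T
    nearest = label_emb[sims.argmax(dim=1)]
    return z + tau * (nearest - z)

z, label_emb, y = torch.randn(8, 32), torch.randn(5, 32), torch.randint(0, 5, (8,))
print(label_guided_loss(z, label_emb, y).item(), label_guided_scale(z, label_emb).shape)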


【8】Adaptive Personalized Federated Learning via Multi-task Averaging of Kernel Mean Embeddings
标题:通过核均值嵌入的多任务平均的自适应个性化联邦学习
链接:https://arxiv.org/abs/2603.02233

作者:Jean-Baptiste Fermanian,Batiste Le Bars,Aurélien Bellet
摘要:个性化联邦学习(PFL)使一组代理能够在不共享原始数据的情况下协作学习各自的模型。我们提出了一种新的PFL方法,其中每个代理优化所有代理经验风险的加权组合,权重从数据中学习而非先验指定。我们方法的新颖之处在于,将这些协作权重的估计表述为一个多数据源的核均值嵌入估计问题,并利用多任务平均的工具来捕获代理之间的统计关系。这一视角产生了一个完全自适应的过程:无需关于数据异质性的先验知识,并能在全局与局部学习机制之间自动切换。通过将目标重新表述为高维均值估计问题,我们针对一大类分布得到了局部超额风险的有限样本保证,明确量化了协作带来的统计收益。为了应对联邦设置固有的通信限制,我们还提出了一个基于随机傅立叶特征的实用实现,允许以通信成本换取统计效率。数值实验验证了我们的理论结果。
摘要:Personalized Federated Learning (PFL) enables a collection of agents to collaboratively learn individual models without sharing raw data. We propose a new PFL approach in which each agent optimizes a weighted combination of all agents' empirical risks, with the weights learned from data rather than specified a priori. The novelty of our method lies in formulating the estimation of these collaborative weights as a kernel mean embedding estimation problem with multiple data sources, leveraging tools from multi-task averaging to capture statistical relationships between agents. This perspective yields a fully adaptive procedure that requires no prior knowledge of data heterogeneity and can automatically transition between global and local learning regimes. By recasting the objective as a high-dimensional mean estimation problem, we derive finite-sample guarantees on local excess risks for a broad class of distributions, explicitly quantifying the statistical gains of collaboration. To address communication constraints inherent to federated settings, we also propose a practical implementation based on random Fourier features, which allows one to trade communication cost for statistical efficiency. Numerical experiments validate our theoretical results.
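
The communication-efficient implementation rests on random Fourier features; below is a standard RBF-kernel sketch in which each agent would share only the feature mean (its approximate kernel mean embedding). The dimensions and kernel width are illustrative.

import numpy as np

def rff_features(X, W, b):
    # phi(x) = sqrt(2/D) * cos(x @ W + b) approximates an RBF kernel;
    # the column mean of phi(X) is a D-dimensional kernel mean embedding.
    return np.sqrt(2.0 / W.shape[1]) * np.cos(X @ W + b)

rng = np.random.default_rng(0)
d, D, gamma = 5, 256, 1.0                    # input dim, feature dim, kernel width
W = rng.normal(scale=np.sqrt(2 * gamma), size=(d, D))   # spectral samples of exp(-gamma||x-y||^2)
b = rng.uniform(0.0, 2 * np.pi, size=D)

X_agent = rng.standard_normal((100, d))      # one agent's local data
kme = rff_features(X_agent, W, b).mean(axis=0)  # what the agent communicates
print(kme.shape)  # (256,)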


【9】Subspace Geometry Governs Catastrophic Forgetting in Low-Rank Adaptation
标题:子空间几何决定低秩适应中的灾难性遗忘
链接:https://arxiv.org/abs/2603.02224

作者:Brady Steele
备注:15 pages, 5 figures, 6 tables
摘要:低秩自适应(LoRA)已成为适配大型预训练模型的一种参数高效方法,但其在持续学习下的行为仍知之甚少。我们提出了一个几何理论,通过梯度子空间相互作用的视角刻画LoRA中的灾难性遗忘。我们的核心发现是,遗忘由一个简单的几何定律支配:$\mathcal{F} = α(1 - \cos^2θ_{\min}) + β$,其中$θ_{\min}$是任务梯度子空间之间的最小主角。该公式揭示了一种近似的秩不变性:在较大的子空间夹角下,遗忘在很大程度上与适配器的秩无关(在受控合成设置中变异系数约0.8%;在真实基准上CV约10%-19%,表明这一性质依赖于所处的机制区间而非绝对成立)。我们在合成任务($r=0.994$的相关性)、使用ViT-LoRA的Split-CIFAR100以及使用RoBERTa-LoRA的顺序GLUE上验证了该理论。我们的分析调和了文献中看似矛盾的发现:只有当任务子空间相似(夹角小)时,秩才会影响遗忘;而当天然正交性已经很高时,O-LoRA等正交方法带来的收益很小。这些见解为使用参数高效微调的持续学习提供了有原则的指导。
摘要:Low-Rank Adaptation (LoRA) has emerged as a parameter-efficient approach for adapting large pre-trained models, yet its behavior under continual learning remains poorly understood. We present a geometric theory characterizing catastrophic forgetting in LoRA through the lens of gradient subspace interactions. Our central finding is that forgetting is governed by a simple geometric law: $\mathcal{F} = α(1 - \cos^2θ_{\min}) + β$, where $θ_{\min}$ is the minimum principal angle between task gradient subspaces. This formulation reveals an approximate rank-invariance property, at high subspace angles, forgetting becomes largely independent of the adapter rank (coefficient of variation $\approx 0.8\%$ in controlled synthetic settings; CV $\approx 10$-$19\%$ on real benchmarks, suggesting this is regime-dependent rather than absolute). We validate our theory on synthetic tasks ($r=0.994$ correlation), Split-CIFAR100 with ViT-LoRA, and sequential GLUE with RoBERTa-LoRA. Our analysis reconciles seemingly contradictory findings in the literature: we show that rank affects forgetting only when task subspaces are similar (low angle), while orthogonal methods like O-LoRA provide minimal benefit when natural orthogonality is already high. These insights provide principled guidance for continual learning with parameter-efficient fine-tuning.
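
The geometric law is directly computable from task gradients; below is a sketch using SciPy's principal-angle routine, with placeholder constants alpha and beta standing in for the fitted coefficients (which the paper estimates from data).

import numpy as np
from scipy.linalg import subspace_angles

rng = np.random.default_rng(0)
G1 = rng.standard_normal((512, 8))   # columns span task 1's gradient subspace
G2 = rng.standard_normal((512, 8))   # columns span task 2's gradient subspace

theta_min = subspace_angles(G1, G2).min()     # minimum principal angle
alpha, beta = 1.0, 0.05                       # placeholder fitted constants
forgetting = alpha * (1.0 - np.cos(theta_min) ** 2) + beta
print(np.degrees(theta_min), forgetting)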


【10】ATPO: Adaptive Tree Policy Optimization for Multi-Turn Medical Dialogue
标题:ATPO:面向多轮医疗对话的自适应树策略优化
链接:https://arxiv.org/abs/2603.02216

作者:Ruike Cao,Shaojie Bai,Fugen Yao,Liang Dong,Jian Xu,Li Xiao
备注:Accepted to ICLR 2026
摘要:多轮医疗对话中有效的信息寻求对准确诊断至关重要,尤其是在处理不完整信息时。由于用户与智能体交互中固有的不确定性(我们将其形式化为分层马尔可夫决策过程,H-MDP),为这些交互式场景对齐大语言模型(LLM)具有挑战性。在传统强化学习(RL)方法中,组相对策略优化(GRPO)在长时程信用分配上存在困难,而近端策略优化(PPO)在此情境下又受困于不稳定的价值估计;为此我们提出了一种新颖的不确定性感知自适应树策略优化(ATPO)算法。我们的方法将rollout预算自适应地分配给不确定性高的状态,其不确定性由贝尔曼误差与动作价值方差的复合度量来量化。该策略能实现更准确的价值估计,同时促进更高效、更多样的探索。为缓解基于树的RL的高计算成本,我们引入两项关键优化:一个不确定性引导的剪枝机制以最小化rollout数量,以及一个利用KV缓存复用来最大化推理吞吐量的异步搜索架构。在三个公开医疗对话基准上的大量实验表明,我们的算法显著优于多个强基线,最终使Qwen3-8B模型超过了规模大得多的GPT-4o(准确率$+0.92\%$)。
摘要:Effective information seeking in multi-turn medical dialogues is critical for accurate diagnosis, especially when dealing with incomplete information. Aligning Large Language Models (LLMs) for these interactive scenarios is challenging due to the uncertainty inherent in user-agent interactions, which we formulate as a Hierarchical Markov Decision Process (H-MDP). While conventional Reinforcement Learning (RL) methods like Group Relative Policy Optimization (GRPO) struggle with long-horizon credit assignment and Proximal Policy Optimization (PPO) suffers from unstable value estimation in this context, we propose a novel uncertainty-aware Adaptive Tree Policy Optimization (ATPO) algorithm. Our method adaptively allocates the rollout budget to states with high uncertainty, quantified by a composite metric of Bellman error and action-value variance. This strategy enables more accurate value estimation, while fostering more efficient and diverse exploration. To mitigate the high computational cost of tree-based RL, we introduce two key optimizations: an uncertainty-guided pruning mechanism to minimize the number of rollouts, and an asynchronous search architecture that leverages KV cache reuse to maximize inference throughput. Extensive experiments on three public medical dialogue benchmarks demonstrate that our algorithm significantly outperforms several strong baselines, culminating in Qwen3-8B model surpassing the much larger GPT-4o ($+0.92\%$ accuracy).


强化学习(4篇)

【1】Reinforcement Learning with Symbolic Reward Machines
标题:使用符号奖励机的强化学习
链接:https://arxiv.org/abs/2603.03068

作者:Thomas Krug,Daniel Neider
摘要:奖励机(Reward Machines, RM)是强化学习(RL)中的一种成熟机制,用于表示和学习具有非马尔可夫奖励的稀疏、时间扩展的任务。RM依赖于以标签形式存在的高层信息,这些标签由环境伴随观测一起发出。然而,这一概念需要针对每个环境和任务的人工输入:用户必须编写一个合适的标注函数来计算这些标签。这些限制导致其在广泛采用的RL框架中适用性较差。我们提出符号奖励机(SRM)以及学习算法QSRM和LSRM,以克服RM的局限。SRM只使用环境的标准输出,并通过以符号公式表示的守卫(guard)直接处理观测。在我们的评估中,SRM方法优于基线RL方法,并取得与现有RM方法相同的结果。同时,我们的方法遵循广泛使用的环境定义,并向用户提供可解释的任务表示。
摘要:Reward Machines (RMs) are an established mechanism in Reinforcement Learning (RL) to represent and learn sparse, temporally extended tasks with non-Markovian rewards. RMs rely on high-level information in the form of labels that are emitted by the environment alongside the observation. However, this concept requires manual user input for each environment and task. The user has to create a suitable labeling function that computes the labels. These limitations lead to poor applicability in widely adopted RL frameworks. We propose Symbolic Reward Machines (SRMs) together with the learning algorithms QSRM and LSRM to overcome the limitations of RMs. SRMs consume only the standard output of the environment and process the observation directly through guards that are represented by symbolic formulas. In our evaluation, our SRM methods outperform the baseline RL approaches and generate the same results as the existing RM methods. At the same time, our methods adhere to the widely used environment definition and provide interpretable representations of the task to the user.
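
To make the idea concrete, here is a toy symbolic reward machine over raw observations, with guards as plain predicates on the observation vector. This is an illustrative reduction of the concept, not the paper's QSRM/LSRM learners.

# Each state maps to (guard, next_state, reward) transitions; guards are
# symbolic conditions evaluated directly on the observation vector.
srm = {
    0: [(lambda obs: obs[0] > 0.9, 1, 0.0)],   # "reach the right wall"
    1: [(lambda obs: obs[0] < 0.1, 0, 1.0)],   # "then return to the left"
}

def srm_step(state, obs):
    for guard, nxt, reward in srm.get(state, []):
        if guard(obs):
            return nxt, reward
    return state, 0.0   # no guard fired: stay in state, zero reward

state, total = 0, 0.0
for obs in ([0.5], [0.95], [0.4], [0.05]):
    state, r = srm_step(state, obs)
    total += r
print(total)   # 1.0: the temporally extended task was completed once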


【2】CGL: Advancing Continual GUI Learning via Reinforcement Fine-Tuning
标题:CGL:通过强化微调推进连续的图形用户界面学习
链接:https://arxiv.org/abs/2603.02951

作者:Zhenquan Yao,Zitong Huang,Yihan Zeng,Jianhua Han,Hang Xu,Chun-Mei Feng,Jianwei Ma,Wangmeng Zuo
摘要:得益于多模态大语言模型(MLLM)的最新进展,图形用户界面(GUI)智能体取得了显著发展。然而,由于GUI应用频繁更新,在GUI持续学习中适应新任务而不遗忘旧任务仍是一个悬而未决的问题。在本工作中,我们发现:虽然监督微调(SFT)有助于快速适应,但它往往会触发知识覆盖;而强化学习(RL)则表现出固有的韧性,可以保护先前的交互逻辑不被抹除。基于这一洞察,我们提出了一个持续GUI学习(Continual GUI Learning, CGL)框架,通过增强SFT与RL之间的协同,动态平衡适应效率与技能保留。具体来说,我们引入了一个由策略熵引导的SFT比例调整机制,以动态控制SFT与RL训练阶段之间的权重分配。为了解决显式的梯度干扰,我们进一步开发了专门的梯度手术策略:通过将探索性的SFT梯度投影到基于GRPO的锚定梯度上,显式裁剪SFT梯度中与GRPO冲突的分量。在此基础上,我们建立了AndroidControl-CL基准,它将GUI应用划分为不同的任务组,以有效模拟和评估持续GUI学习的性能。实验结果证明了所提CGL框架在持续学习场景中的有效性。基准、代码和模型将公开提供。
摘要:Graphical User Interface (GUI) Agents, benefiting from recent advances in multimodal large language models (MLLM), have achieved significant development. However, due to the frequent updates of GUI applications, adapting to new tasks without forgetting old tasks in GUI continual learning remains an open problem. In this work, we reveal that while Supervised Fine-Tuning (SFT) facilitates fast adaptation, it often triggers knowledge overwriting, whereas Reinforcement Learning (RL) demonstrates an inherent resilience that shields prior interaction logic from erasure. Based on this insight, we propose a \textbf{C}ontinual \textbf{G}UI \textbf{L}earning (CGL) framework that dynamically balances adaptation efficiency and skill retention by enhancing the synergy between SFT and RL. Specifically, we introduce an SFT proportion adjustment mechanism guided by policy entropy to dynamically control the weight allocation between the SFT and RL training phases. To resolve explicit gradient interference, we further develop a specialized gradient surgery strategy. By projecting exploratory SFT gradients onto GRPO-based anchor gradients, our method explicitly clips the components of SFT gradients that conflict with GRPO. On top of that, we establish an AndroidControl-CL benchmark, which divides GUI applications into distinct task groups to effectively simulate and evaluate the performance of continual GUI learning. Experimental results demonstrate the effectiveness of our proposed CGL framework across continual learning scenarios. The benchmark, code, and model will be made publicly available.
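
The gradient-surgery step described above is PCGrad-like; a sketch on flattened gradients follows. The exact clipping rule in the paper may differ.

import torch

def surgery(g_sft: torch.Tensor, g_anchor: torch.Tensor) -> torch.Tensor:
    # If the exploratory SFT gradient conflicts with the GRPO anchor
    # gradient (negative inner product), remove its component along the
    # anchor direction; otherwise leave it unchanged.
    dot = torch.dot(g_sft, g_anchor)
    if dot < 0:
        g_sft = g_sft - (dot / (g_anchor.norm() ** 2 + 1e-12)) * g_anchor
    return g_sft

g_sft = torch.tensor([1.0, -1.0])
g_anchor = torch.tensor([0.0, 1.0])
print(surgery(g_sft, g_anchor))   # tensor([1., 0.]): conflicting part removed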


【3】Contextual Latent World Models for Offline Meta Reinforcement Learning
标题:离线Meta强化学习的上下文潜在世界模型
链接:https://arxiv.org/abs/2603.02935

作者:Mohammadreza Nakheai,Aidan Scannell,Kevin Luck,Joni Pajarinen
摘要:离线元强化学习旨在从固定数据集中学习能跨相关任务泛化的策略。基于上下文的方法从转移历史中推断任务表示,但在没有监督的情况下学习有效的任务表示仍是一项挑战。与此同时,潜在世界模型通过时间一致性展示了强大的自监督表示学习能力。我们引入上下文潜在世界模型:它以推断出的任务表示为条件,并与上下文编码器联合训练。这强制实施了以任务为条件的时间一致性,从而产生能够捕捉任务相关动态、而非仅仅区分任务的任务表示。我们的方法学到了更具表现力的任务表示,并在MuJoCo、Contextual-DeepMind Control和Meta-World基准上显著提升了对未见任务的泛化能力。
摘要:Offline meta-reinforcement learning seeks to learn policies that generalize across related tasks from fixed datasets. Context-based methods infer a task representation from transition histories, but learning effective task representations without supervision remains a challenge. In parallel, latent world models have demonstrated strong self-supervised representation learning through temporal consistency. We introduce contextual latent world models, which condition latent world models on inferred task representations and train them jointly with the context encoder. This enforces task-conditioned temporal consistency, yielding task representations that capture task-dependent dynamics rather than merely discriminating between tasks. Our method learns more expressive task representations and significantly improves generalization to unseen tasks across MuJoCo, Contextual-DeepMind Control, and Meta-World benchmarks.


【4】Heterogeneous Agent Collaborative Reinforcement Learning
标题:异构智能体协作强化学习
链接:https://arxiv.org/abs/2603.02604

作者:Zhixia Zhang,Zixuan Huang,Xin Xia,Deqing Wang,Fuzhen Zhuang,Shuai Ma,Ning Ding,Yaodong Yang,Jianxin Li,Yikun Ban
摘要:我们引入异构智能体协作强化学习(HACRL),一种解决孤立on-policy优化低效问题的新学习范式。HACRL实现了协作优化与独立执行的结合:异构智能体在训练期间共享经过验证的rollout以相互改进,而在推理时独立运行。与基于LLM的多智能体强化学习(MARL)不同,HACRL不需要协同部署;与on-/off-policy蒸馏不同,它实现的是异构智能体之间的双向互学,而非单向的师生迁移。基于这一范式,我们提出HACPO,一种协作RL算法,能够以有原则的方式共享rollout,以最大化样本利用率和跨智能体知识迁移。为缓解能力差异与策略分布偏移,HACPO引入了四种定制机制,并在无偏优势估计和优化正确性方面提供理论保证。在多种异构模型组合和推理基准上的大量实验表明,HACPO持续改进所有参与的智能体,平均优于GSPO 3.3%,而rollout成本仅为一半。
摘要:We introduce Heterogeneous Agent Collaborative Reinforcement Learning (HACRL), a new learning paradigm that addresses the inefficiencies of isolated on-policy optimization. HACRL enables collaborative optimization with independent execution: heterogeneous agents share verified rollouts during training to mutually improve, while operating independently at inference time. Unlike LLM-based multi-agent reinforcement learning (MARL), HACRL does not require coordinated deployment, and unlike on-/off-policy distillation, it enables bidirectional mutual learning among heterogeneous agents rather than one-directional teacher-to-student transfer. Building on this paradigm, we propose HACPO, a collaborative RL algorithm that enables principled rollout sharing to maximize sample utilization and cross-agent knowledge transfer. To mitigate capability discrepancies and policy distribution shifts, HACPO introduces four tailored mechanisms with theoretical guarantees on unbiased advantage estimation and optimization correctness. Extensive experiments across diverse heterogeneous model combinations and reasoning benchmarks show that HACPO consistently improves all participating agents, outperforming GSPO by an average of 3.3\% while using only half the rollout cost.


分层学习(1篇)

【1】ChemFlow: A Hierarchical Neural Network for Multiscale Representation Learning in Chemical Mixtures
标题:ChemFlow:用于化学混合物中多尺度表示学习的分层神经网络
链接:https://arxiv.org/abs/2603.02810

作者:Jinming Fan,Chao Qian,Wilhelm T. S. Huck,William E. Robinson,Shaodong Zhou
摘要:使用图神经网络准确预测分子混合物的物理化学性质仍然是一个重大挑战,因为这需要在嵌入分子内相互作用的同时考虑混合物组成(即浓度和比例)。现有方法难以模拟真实的混合物环境:其中密集耦合的相互作用在从原子、官能团到整个分子的层次间传播,且跨层次的信息交换不断受组成调制。为了弥合孤立分子与真实化学环境之间的差距,我们提出了ChemFlow,一种整合原子、官能团和分子层面特征的新型分层框架,促进这些层面之间的信息流动,以预测复杂化学混合物的行为。ChemFlow采用原子级特征融合模块Chem-embed来生成受混合物状态和原子特征影响的上下文感知原子表示。接着,双向的"基团到分子"与"分子到基团"注意力机制使ChemFlow能够捕获混合物中分子内和分子间的官能团相互作用。通过基于浓度和组成动态调整表示,ChemFlow擅长预测浓度依赖性性质,并在浓度敏感和浓度无关的体系中均显著优于最先进的模型。大量实验证明了ChemFlow在建模复杂化学混合物方面的卓越准确性和效率。
摘要:Accurate prediction of the physicochemical properties of molecular mixtures using graph neural networks remains a significant challenge, as it requires simultaneous embedding of intramolecular interactions while accounting for mixture composition (i.e., concentrations and ratios). Existing approaches are ill-equipped to emulate realistic mixture environments, where densely coupled interactions propagate across hierarchical levels - from atoms and functional groups to entire molecules - and where cross-level information exchange is continuously modulated by composition. To bridge the gap between isolated molecules and realistic chemical environments, we present ChemFlow, a novel hierarchical framework that integrates atomic, functional group, and molecular-level features, facilitating information flow across these levels to predict the behavior of complex chemical mixtures. ChemFlow employs an atomic-level feature fusion module, Chem-embed, to generate context-aware atomic representations influenced by the mixture state and atomic characteristics. Next, bidirectional group-to-molecule and molecule-to-group attention mechanisms enable ChemFlow to capture functional group interactions both within and across molecules in the mixture. By dynamically adjusting representations based on concentration and composition, ChemFlow excels at predicting concentration-dependent properties and significantly outperforms state-of-the-art models in both concentration-sensitive and concentration-independent systems. Extensive experiments demonstrate ChemFlow's superior accuracy and efficiency in modeling complex chemical mixtures.


医学相关(3篇)

【1】An Empirical Analysis of Calibration and Selective Prediction in Multimodal Clinical Condition Classification
标题:多模式临床病情分类中的校准和选择性预测的实证分析
链接:https://arxiv.org/abs/2603.02719

作者:L. Julián Lechuga López,Farah E. Shamout,Tim G. J. Rudner
备注:33 pages, 14 figures, 8 tables
摘要:随着人工智能系统走向临床部署,确保可靠的预测行为是安全关键型决策任务的基础。一种被提议的保障措施是选择性预测,即模型可以将不确定的预测转交人类专家审查。在这项工作中,我们使用多模态ICU数据,对多标签临床病情分类中基于不确定性的选择性预测的可靠性进行了实证评估。在一系列最先进的单模态和多模态模型中,我们发现尽管标准评估指标表现强劲,选择性预测仍会大幅降低性能。这种失败源于严重的类依赖性误校准:模型将高不确定性分配给正确的预测,而将低不确定性分配给错误的预测,对于代表性不足的临床病情尤其如此。我们的结果表明,常用的聚合指标可能掩盖这些效应,限制了它们在此场景下评估选择性预测行为的能力。综上,我们的发现刻画了多模态临床病情分类中选择性预测的一种任务特定的失败模式,并强调需要校准感知的评估,以便为临床AI的安全性和鲁棒性提供有力保证。
摘要:As artificial intelligence systems move toward clinical deployment, ensuring reliable prediction behavior is fundamental for safety-critical decision-making tasks. One proposed safeguard is selective prediction, where models can defer uncertain predictions to human experts for review. In this work, we empirically evaluate the reliability of uncertainty-based selective prediction in multilabel clinical condition classification using multimodal ICU data. Across a range of state-of-the-art unimodal and multimodal models, we find that selective prediction can substantially degrade performance despite strong standard evaluation metrics. This failure is driven by severe class-dependent miscalibration, whereby models assign high uncertainty to correct predictions and low uncertainty to incorrect ones, particularly for underrepresented clinical conditions. Our results show that commonly used aggregate metrics can obscure these effects, limiting their ability to assess selective prediction behavior in this setting. Taken together, our findings characterize a task-specific failure mode of selective prediction in multimodal clinical condition classification and highlight the need for calibration-aware evaluation to provide strong guarantees of safety and robustness in clinical AI.
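
代码示意(编者补充):选择性预测的标准评估方式之一是风险-覆盖率曲线:按置信度从高到低逐步扩大预测覆盖范围,观察误差率的变化。以下为通用实现草图,以最大类别概率作为置信度,这是常见做法而非论文特定方法。

import numpy as np

def selective_risk_coverage(probs, labels):
    """probs: (N, C) 预测概率;labels: (N,) 真实标签。"""
    conf = probs.max(axis=1)
    pred = probs.argmax(axis=1)
    order = np.argsort(-conf)                      # 按置信度降序逐步纳入样本
    errors = (pred[order] != labels[order]).astype(float)
    n = np.arange(1, len(labels) + 1)
    coverage = n / len(labels)
    risk = np.cumsum(errors) / n                   # 各覆盖率下的选择性风险
    return coverage, risk

# 用法:coverage, risk = selective_risk_coverage(probs, labels)
# 若模型校准良好,risk 应随覆盖率降低而单调下降;
# 论文指出类依赖的误校准会破坏这一性质。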


【2】PRISM: Exploring Heterogeneous Pretrained EEG Foundation Model Transfer to Clinical Differential Diagnosis
标题:PRISM:探索异构预训练脑电基础模型向临床鉴别诊断的迁移
链接:https://arxiv.org/abs/2603.02268

作者:Jeet Bandhu Lahiri,Parshva Runwal,Arvasu Kulkarni,Mahir Jain,Aditya Ray Mishra,Siddharth Panwar,Sandeep Singh
备注:14 pages, 1 figure, 5 tables
摘要:EEG基础模型通常在窄源临床档案上进行预训练,并在同一生态系统的基准上进行评估,因此尚不清楚其表示编码的是神经生理学信息还是记录分布伪影。我们介绍了PRISM(Population Representative Invariant Signal Model,群体代表性不变信号模型),一个沿两个轴(预训练群体和下游自适应)进行消融、同时固定架构与预处理的掩码自编码器。我们比较了窄源欧盟/美国语料库(TUH + PhysioNet)与一个增加了来自多种EEG系统的多中心南亚临床记录的地理多样化数据池。我们有三个发现。首先,窄源预训练在分布匹配的基准上产生更强的线性探针,而多样化预训练在微调下产生更具适应性的表示:这一权衡在单一协议评估下不可见。仅在三个源语料库上训练的PRISM在大多数任务上匹配或优于REVE(92个数据集,60,000小时以上),表明有针对性的多样性可以替代不加区分的规模,且数据集数量是模型比较中的一个混杂变量。其次,在一项具有临床挑战性且此前未被测试的任务(通过发作间期EEG区分癫痫与诊断性模仿疾病)上,多样化检查点的平衡准确率超过窄源检查点+12.3个百分点,是所有评估中的最大差距。第三,EEG-Bench与EEG-FM-Bench之间的系统性不一致会使模型在相同数据集上的排名反转多达24个百分点;我们确定了包括数据划分构建、检查点选择、片段长度和归一化在内的六个具体来源,并表明这些因素以非相加的方式复合。
摘要:EEG foundation models are typically pretrained on narrow-source clinical archives and evaluated on benchmarks from the same ecosystem, leaving unclear whether representations encode neural physiology or recording-distribution artifacts. We introduce PRISM (Population Representative Invariant Signal Model), a masked autoencoder ablated along two axes -- pretraining population and downstream adaptation -- with architecture and preprocessing fixed. We compare a narrow-source EU/US corpus (TUH + PhysioNet) against a geographically diverse pool augmented with multi-center South Asian clinical recordings across multiple EEG systems. Three findings emerge. First, narrow-source pretraining yields stronger linear probes on distribution-matched benchmarks, while diverse pretraining produces more adaptable representations under fine-tuning -- a trade-off invisible under single-protocol evaluation. Trained on three source corpora, PRISM matches or outperforms REVE (92 datasets, 60,000+ hours) on the majority of tasks, demonstrating that targeted diversity can substitute for indiscriminate scale and that dataset count is a confounding variable in model comparison. Second, on a clinically challenging and previously untested task -- distinguishing epilepsy from diagnostic mimickers via interictal EEG -- the diverse checkpoint outperforms the narrow-source checkpoint by +12.3 pp balanced accuracy, the largest gap across all evaluations. Third, systematic inconsistencies between EEG-Bench and EEG-FM-Bench reverse model rankings on identical datasets by up to 24 pp; we identify six concrete sources including split construction, checkpoint selection, segment length, and normalization, showing these factors compound non-additively.


【3】Detecting Structural Heart Disease from Electrocardiograms via a Generalized Additive Model of Interpretable Foundation-Model Predictors
标题:通过可解释基础模型预测因子的广义相加模型从心电图中检测结构性心脏病
链接:https://arxiv.org/abs/2603.02616

作者:Ya Zhou,Zhaohong Sun,Tianxiang Hao,Xiangjie Li
摘要:结构性心脏病(SHD)是一种常见疾病,存在大量未确诊病例,而早期发现往往受限于超声心动图(ECHO)的高成本和可及性。最近的研究表明,基于人工智能(AI)的心电图(ECG)分析可以检测SHD,提供了可扩展的替代方案。然而,现有方法都是完全的黑箱模型,限制了可解释性和临床采用。为了解决这些挑战,我们提出了一个可解释且有效的框架,将具有临床意义的ECG基础模型预测因子整合进广义相加模型中,在保持强大预测性能的同时实现透明的风险归因。在包含超过80,000个ECG-ECHO配对的EchoNext基准上,该方法相对最新的最先进深度学习基线在AUROC上相对提升+0.98%、AUPRC上+1.01%、F1分数上+1.41%,且即使仅使用30%的训练数据也能取得略优的性能。亚组分析证实了在异质人群中的稳健表现,而估计得到的逐项函数为传统ECG诊断风险与SHD之间的关系提供了可解释的洞见。这项工作展示了经典统计建模与现代AI之间的互补范式,为可解释、高性能且具临床可操作性的基于ECG的SHD筛查提供了一条途径。
摘要:Structural heart disease (SHD) is a prevalent condition with many undiagnosed cases, and early detection is often limited by the high cost and accessibility constraints of echocardiography (ECHO). Recent studies show that artificial intelligence (AI)-based analysis of electrocardiograms (ECGs) can detect SHD, offering a scalable alternative. However, existing methods are fully black-box models, limiting interpretability and clinical adoption. To address these challenges, we propose an interpretable and effective framework that integrates clinically meaningful ECG foundation-model predictors within a generalized additive model, enabling transparent risk attribution while maintaining strong predictive performance. Using the EchoNext benchmark of over 80,000 ECG-ECHO pairs, the method demonstrates relative improvements of +0.98% in AUROC, +1.01% in AUPRC, and +1.41% in F1 score over the latest state-of-the-art deep-learning baseline, while achieving slightly better performance even with only 30% of the training data. Subgroup analyses confirm robust performance across heterogeneous populations, and the estimated entry-wise functions provide interpretable insights into the relationships between risks of traditional ECG diagnoses and SHD. This work illustrates a complementary paradigm between classical statistical modeling and modern AI, offering a pathway to interpretable, high-performing, and clinically actionable ECG-based SHD screening.
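
代码示意(编者补充):广义相加模型的关键是对每个预测因子学习一个单变量"形函数"并相加,从而支持逐项、透明的风险归因。下面用小型单变量MLP充当形函数给出PyTorch草图;这只是GAM的一种可行参数化,与论文的具体实现未必一致。

import torch
import torch.nn as nn

class AdditiveRiskModel(nn.Module):
    def __init__(self, n_predictors, hidden=16):
        super().__init__()
        # 每个基础模型预测因子对应一个独立的单变量形函数(编者假设的参数化)
        self.shape_fns = nn.ModuleList(
            nn.Sequential(nn.Linear(1, hidden), nn.Tanh(), nn.Linear(hidden, 1))
            for _ in range(n_predictors))
        self.bias = nn.Parameter(torch.zeros(1))

    def forward(self, x):                          # x: (B, n_predictors) 预测因子分数
        contribs = [f(x[:, j:j + 1]) for j, f in enumerate(self.shape_fns)]
        logit = torch.stack(contribs, dim=0).sum(dim=0) + self.bias
        return logit                               # contribs[j] 即第 j 项的可解释风险贡献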


蒸馏|知识提取(3篇)

【1】Post Hoc Extraction of Pareto Fronts for Continuous Control
标题:事后提取帕累托前沿以实现连续控制
链接:https://arxiv.org/abs/2603.02628

作者:Raghav Thakar,Gaurav Dixit,Kagan Tumer
备注:10 pages, 4 figures. Submitted to IJCAI 2026
摘要:现实世界中的智能体常常必须平衡多个目标,例如连续控制中的速度、稳定性和能效。为应对不断变化的条件和偏好,智能体理想上应学习一个代表多种最优权衡的策略帕累托前沿。多策略多目标强化学习(MORL)的最新进展使得能够直接学习帕累托前沿,但需要从训练一开始就完整考虑多目标。在实践中,多目标偏好往往是在策略已经针对单一专门目标训练完成之后才出现的。现有的MORL方法无法利用这些预训练的"专家"来学习帕累托前沿,从而避免重新训练的样本成本。我们提出混合优势帕累托提取(MAPEX),一种离线MORL方法,通过重用预训练的专家策略、critic和回放缓冲区来构建策略前沿。MAPEX将各专家critic的评估组合成一个混合优势信号,并用它来加权行为克隆损失,以训练平衡多个目标的新策略。MAPEX的事后帕累托前沿提取保留了单目标异策略RL的简洁性,避免了将这些算法改造进复杂的MORL框架。我们形式化地描述了MAPEX过程,并在五个多目标MuJoCo环境中对其进行了评估。给定相同的起始策略,MAPEX仅以既定基线 $0.001\%$ 的样本成本产生可比的前沿。
摘要:Agents in the real world must often balance multiple objectives, such as speed, stability, and energy efficiency in continuous control. To account for changing conditions and preferences, an agent must ideally learn a Pareto frontier of policies representing multiple optimal trade-offs. Recent advances in multi-policy multi-objective reinforcement learning (MORL) enable learning a Pareto front directly, but require full multi-objective consideration from the start of training. In practice, multi-objective preferences often arise after a policy has already been trained on a single specialised objective. Existing MORL methods cannot leverage these pre-trained `specialists' to learn Pareto fronts and avoid incurring the sample costs of retraining. We introduce Mixed Advantage Pareto Extraction (MAPEX), an offline MORL method that constructs a frontier of policies by reusing pre-trained specialist policies, critics, and replay buffers. MAPEX combines evaluations from specialist critics into a mixed advantage signal, and weights a behaviour cloning loss with it to train new policies that balance multiple objectives. MAPEX's post hoc Pareto front extraction preserves the simplicity of single-objective off-policy RL, and avoids retrofitting these algorithms into complex MORL frameworks. We formally describe the MAPEX procedure and evaluate MAPEX on five multi-objective MuJoCo environments. Given the same starting policies, MAPEX produces comparable fronts at $0.001\%$ the sample cost of established baselines.
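
代码示意(编者补充):下面按摘要描述勾勒"混合优势信号加权行为克隆"的核心损失。其中critic.advantage、policy.log_prob等接口以及指数优势加权的具体形式均为编者假设,论文中的实际加权方式以原文为准。

import torch

def mapex_bc_loss(policy, batch, critics, pref_weights, beta=1.0):
    obs, act = batch["obs"], batch["act"]
    with torch.no_grad():
        # 每个专家critic给出其目标下的优势估计 A_k(s, a)(假设的接口)
        advantages = [c.advantage(obs, act) for c in critics]
        mixed_adv = sum(w * a for w, a in zip(pref_weights, advantages))
        weight = torch.exp(beta * mixed_adv).clamp(max=100.0)   # 指数加权并截断以保证数值稳定
    log_prob = policy.log_prob(obs, act)            # 假设的策略接口
    return -(weight * log_prob).mean()              # 完全离线:只用回放缓冲区,无需新采样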


【2】A Unified Revisit of Temperature in Classification-Based Knowledge Distillation
标题:基于分类的知识蒸馏中温度的统一重审
链接:https://arxiv.org/abs/2603.02430

作者:Logan Frank,Jim Davis
摘要:知识蒸馏的中心思想是揭示嵌入在教师权重中的关系结构以供学生学习,这通常借助温度参数来实现。尽管温度被广泛使用,人们对如何选择合适的温度值、以及该值如何依赖于优化器、教师预训练/微调等其他训练要素,仍缺乏充分理解。在实践中,温度通常通过网格搜索选取或沿用先前工作中的取值,这可能耗时,或在训练设置不同时导致学生表现欠佳。在这项工作中,我们提出温度与这些训练组件密切相关,并开展了一项系统考察这类相互作用的统一研究。通过分析这些交叉联系,我们识别并给出了对温度选择有显著影响的常见情形,为在工作中采用知识蒸馏的从业者提供了有价值的指导。
摘要:A central idea of knowledge distillation is to expose relational structure embedded in the teacher's weights for the student to learn, which is often facilitated using a temperature parameter. Despite its widespread use, there remains limited understanding on how to select an appropriate temperature value, or how this value depends on other training elements such as optimizer, teacher pretraining/finetuning, etc. In practice, temperature is commonly chosen via grid search or by adopting values from prior work, which can be time-consuming or may lead to suboptimal student performance when training setups differ. In this work, we posit that temperature is closely linked to these training components and present a unified study that systematically examines such interactions. From analyzing these cross-connections, we identify and present common situations that have a pronounced impact on temperature selection, providing valuable guidance for practitioners employing knowledge distillation in their work.
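
代码示意(编者补充):作为背景,下面给出带温度 T 的经典知识蒸馏损失(Hinton等人的标准形式),温度即本文所研究、需要与优化器等训练要素联合考虑的超参数。

import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    # 软目标项:温度软化后的师生分布KL散度
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean") * (T * T)          # 乘以 T^2 保持梯度尺度
    # 硬目标项:标准交叉熵
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard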


【3】From Fewer Samples to Fewer Bits: Reframing Dataset Distillation as Joint Optimization of Precision and Compactness
标题:从更少的样本到更少的位:将数据集蒸馏重新构建为精确度和紧凑度的联合优化
链接:https://arxiv.org/abs/2603.02411

作者:My H. Dinh,Aditya Sant,Akshay Malhotra,Keya Patani,Shahab Hamidi-Rad
备注:Accepted to CVPR 2026 - Findings Workshop
摘要:数据集蒸馏(DD)将大型数据集压缩成能保持训练性能的紧凑合成数据集。然而,目前的方法主要针对减少样本数量,较少考虑数据精度及其对效率的影响。我们提出量化感知数据集蒸馏(QuADD),一个在固定比特预算下联合优化数据集紧凑性和精度的统一框架。QuADD在蒸馏回路中集成了一个可微分量化模块,实现了合成样本与量化参数的端到端协同优化。在率失真视角的指导下,我们实证分析了在样本数量与精度之间的比特分配如何影响学习性能。我们的框架同时支持均匀量化和自适应非均匀量化,后者从数据中学习量化级别,以更好地表示信息密集区域。在图像分类和3GPP波束管理任务上的实验表明,QuADD在每比特精度方面超过了现有的DD和后量化基线,为信息高效的数据集蒸馏确立了新标准。
摘要:Dataset Distillation (DD) compresses large datasets into compact synthetic ones that maintain training performance. However, current methods mainly target sample reduction, with limited consideration of data precision and its impact on efficiency. We propose Quantization-aware Dataset Distillation (QuADD), a unified framework that jointly optimizes dataset compactness and precision under fixed bit budgets. QuADD integrates a differentiable quantization module within the distillation loop, enabling end-to-end co-optimization of synthetic samples and quantization parameters. Guided by the rate-distortion perspective, we empirically analyze how bit allocation between sample count and precision influences learning performance. Our framework supports both uniform and adaptive non-uniform quantization, where the latter learns quantization levels from data to represent information-dense regions better. Experiments on image classification and 3GPP beam management tasks show that QuADD surpasses existing DD and post-quantized baselines in accuracy per bit, establishing a new standard for information-efficient dataset distillation.
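
代码示意(编者补充):"在蒸馏回路内做可微分量化"的一种常见实现是带可学习步长的直通估计器(LSQ风格)。以下草图中的参数化方式为编者假设,仅用于说明合成样本与量化参数如何端到端协同优化。

import torch
import torch.nn as nn

class UniformQuantizer(nn.Module):
    def __init__(self, n_bits=4):
        super().__init__()
        self.levels = 2 ** n_bits - 1
        self.log_step = nn.Parameter(torch.zeros(1))   # 可学习的量化步长(编者假设)

    def forward(self, x):
        step = self.log_step.exp()
        v = x / step
        v_q = v + (torch.round(v) - v).detach()        # 直通取整:前向量化,反向梯度恒等
        v_q = torch.clamp(v_q, -self.levels / 2, self.levels / 2)
        return v_q * step                               # 梯度可同时流向合成样本 x 与步长 step

# 用法:将 quantizer(synthetic_params) 送入蒸馏损失,与合成样本联合反向传播。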


推荐(1篇)

【1】SOLAR: SVD-Optimized Lifelong Attention for Recommendation
标题:SOLAR:SVD优化的终身注意力推荐
链接:https://arxiv.org/abs/2603.02561

作者:Chenghao Zhang,Chao Feng,Yuanhao Pu,Xunyong Yang,Wenhui Yu,Xiang Li,Yongqi Liu,Lantao Hu,Kaiqiao Zhan,Han Li,Kun Gai
备注:18 pages, 4 figures
摘要:注意力机制仍然是Transformer中的核心算子,因为它提供了富有表现力的全局信用分配,但其关于序列长度 $N$ 的 $O(N^2 d)$ 时间和内存成本使长上下文建模代价高昂,常常迫使采用截断或其他启发式方法。线性注意力通过核特征映射重排计算,将复杂度降低到 $O(N d^2)$,但这种重构放弃了softmax机制并改变了注意力分数的分布。在推荐系统中,矩阵的低秩结构并非罕见情况,而是其表示学习中默认的归纳偏置,在用户行为序列建模中表现得尤为明显。利用这一结构,我们提出了SVD-Attention,它在低秩矩阵上理论上无损,并在保留softmax的同时将注意力复杂度从 $O(N^2 d)$ 降低到 $O(Ndr)$。基于SVD-Attention,我们提出SOLAR(SVD-Optimized Lifelong Attention for Recommendation),一个在级联过程中无需任何过滤、支持上万级行为序列和数千个候选物品的序列建模框架。在快手的在线推荐场景中,SOLAR带来了0.68\%的视频播放量(Video Views)增益以及其他业务指标的改进。
摘要:Attention mechanism remains the defining operator in Transformers since it provides expressive global credit assignment, yet its $O(N^2 d)$ time and memory cost in sequence length $N$ makes long-context modeling expensive and often forces truncation or other heuristics. Linear attention reduces complexity to $O(N d^2)$ by reordering computation through kernel feature maps, but this reformulation drops the softmax mechanism and shifts the attention score distribution. In recommender systems, low-rank structure in matrices is not a rare case, but rather the default inductive bias in its representation learning, particularly explicit in the user behavior sequence modeling. Leveraging this structure, we introduce SVD-Attention, which is theoretically lossless on low-rank matrices and preserves softmax while reducing attention complexity from $O(N^2 d)$ to $O(Ndr)$. With SVD-Attention, we propose SOLAR, SVD-Optimized Lifelong Attention for Recommendation, a sequence modeling framework that supports behavior sequences of ten-thousand scale and candidate sets of several thousand items in cascading process without any filtering. In Kuaishou's online recommendation scenario, SOLAR delivers a 0.68\% Video Views gain together with additional business metrics improvements.


聚类(1篇)

【1】The elbow statistic: Multiscale clustering statistical significance
标题:肘部统计量:多尺度聚类的统计显著性
链接:https://arxiv.org/abs/2603.03235

作者:Francisco J. Perez-Reche
备注:30 pages, 3 figures, 5 tables
摘要:选择聚类数仍然是无监督学习中的一个基本挑战。现有准则通常只寻找单一的"最优"划分,常常忽略在多个分辨率下同时存在的、具有统计意义的结构。我们提出ElbowSig,一个将启发式"肘部"方法形式化为严格推断问题的框架。我们的方法以一个由聚类异质性序列导出的归一化离散曲率统计量为核心,并将其与无结构数据的零分布进行比较评估。我们推导了该零统计量在大样本和高维情形下的渐近性质,刻画了其基线行为与随机变异性。作为一个与算法无关的流程,ElbowSig只需要异质性序列,并与多种聚类方法兼容,包括硬聚类、模糊聚类和基于模型的聚类。在合成和真实数据集上的大量实验表明,该方法保持了适当的第一类错误控制,同时具备分辨多尺度组织结构的功效,而这类结构通常会被单分辨率选择准则所掩盖。
摘要:Selecting the number of clusters remains a fundamental challenge in unsupervised learning. Existing criteria typically target a single ``optimal'' partition, often overlooking statistically meaningful structure present at multiple resolutions. We introduce ElbowSig, a framework that formalizes the heuristic ``elbow'' method as a rigorous inferential problem. Our approach centers on a normalized discrete curvature statistic derived from the cluster heterogeneity sequence, which is evaluated against a null distribution of unstructured data. We derive the asymptotic properties of this null statistic in both large-sample and high-dimensional regimes, characterizing its baseline behavior and stochastic variability. As an algorithm-agnostic procedure, ElbowSig requires only the heterogeneity sequence and is compatible with a wide range of clustering methods, including hard, fuzzy, and model-based clustering. Extensive experiments on synthetic and empirical datasets demonstrate that the method maintains appropriate Type-I error control while providing the power to resolve multiscale organizational structures that are typically obscured by single-resolution selection criteria.
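
代码示意(编者补充):"肘部"对应异质性序列 W(k)(如各聚类数 k 下的类内平方和)的离散曲率峰值。下面给出由二阶差分近似的归一化曲率统计量草图;归一化方式为编者选取的一种简单形式,论文中的显著性判定还需将统计量与无结构数据的零分布比较。

import numpy as np

def elbow_curvature(W):
    """W: 随聚类数 k 递增的异质性序列(通常单调递减)。"""
    W = np.asarray(W, dtype=float)
    # 二阶差分近似离散曲率:kappa(k) = W(k-1) - 2*W(k) + W(k+1)
    kappa = W[:-2] - 2 * W[1:-1] + W[2:]
    scale = np.abs(W[:-2] - W[2:]) + 1e-12        # 局部尺度归一化(编者假设)
    return kappa / scale                           # 对应 k = 2, ..., len(W) - 1

# 用法:对每个 k 运行聚类得到 W(k);曲率显著超出零分布分位数的 k
# 即为统计上有意义的(可能多个)分辨率。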


超分辨率|去噪|去模糊|去雾(2篇)

【1】Differentiable Time-Varying IIR Filtering for Real-Time Speech Denoising
标题:用于实时语音去噪的可微分时变IIR滤波
链接:https://arxiv.org/abs/2603.02794

作者:Riccardo Rota,Kiril Ratmanski,Jozef Coldenhoff,Milos Cernak
备注:Submitted to Interspeech 2026
摘要:我们提出TVF(时变滤波),一个拥有100万参数的低延迟语音增强模型。TVF将数字信号处理(DSP)的可解释性与深度学习的适应性相结合,弥合了传统滤波与现代神经语音建模之间的差距。该模型利用轻量级神经网络主干实时预测一个可微分的35频段IIR滤波器级联的系数,使其能够动态适应非平稳噪声。与"黑箱"深度学习方法不同,TVF提供了一条完全可解释的处理链,其中的频谱修改是显式且可调整的。我们使用Valentini-Botinhao数据集证明了这种方法在语音去噪任务上的有效性,并将结果与静态DDSP方法和完全基于深度学习的解决方案进行了比较,表明TVF能有效适应不断变化的噪声条件。
摘要:We present TVF (Time-Varying Filtering), a low-latency speech enhancement model with 1 million parameters. Combining the interpretability of Digital Signal Processing (DSP) with the adaptability of deep learning, TVF bridges the gap between traditional filtering and modern neural speech modeling. The model utilizes a lightweight neural network backbone to predict the coefficients of a differentiable 35-band IIR filter cascade in real time, allowing it to adapt dynamically to non-stationary noise. Unlike ``black-box'' deep learning approaches, TVF offers a completely interpretable processing chain, where spectral modifications are explicit and adjustable. We demonstrate the efficacy of this approach on a speech denoising task using the Valentini-Botinhao dataset and compare the results to a static DDSP approach and a fully deep-learning-based solution, showing that TVF achieves effective adaptation to changing noise conditions.


【2】Manifold Aware Denoising Score Matching (MAD)
标题:流形感知去噪得分匹配(MAD)
链接:https://arxiv.org/abs/2603.02452

作者:Alona Levy-Jurgenson,Alvaro Prat,James Cuin,Yee Whye Teh
摘要:设计用于学习定义在流形上的分布的方法时,一个主要关注点是减轻对隐式学习流形本身的需要,从而让学习可以专注于流形内的数据分布。然而,实现这一点往往会导致计算密集的解决方案。在这项工作中,我们对环境空间中的去噪得分匹配提出一个简单修改,以隐式地考虑流形,从而在保持计算效率的同时减轻学习流形的负担。具体而言,我们提出将得分函数简单分解为一个已知分量 $s^{base}$ 和一个剩余分量 $s-s^{base}$(即学习目标),前者隐含地编码了数据流形所在位置的信息。我们针对若干重要情形(包括旋转矩阵上的分布和离散分布)推导出已知分量 $s^{base}$ 的解析形式,并用它们证明了该方法在这些情形下的实用性。
摘要:A major focus in designing methods for learning distributions defined on manifolds is to alleviate the need to implicitly learn the manifold so that learning can concentrate on the data distribution within the manifold. However, accomplishing this often leads to compute-intensive solutions. In this work, we propose a simple modification to denoising score-matching in the ambient space to implicitly account for the manifold, thereby reducing the burden of learning the manifold while maintaining computational efficiency. Specifically, we propose a simple decomposition of the score function into a known component $s^{base}$ and a remainder component $s-s^{base}$ (the learning target), with the former implicitly including information on where the data manifold resides. We derive known components $s^{base}$ in analytical form for several important cases, including distributions over rotation matrices and discrete distributions, and use them to demonstrate the utility of this approach in those cases.
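
代码示意(编者补充):该方法的核心是把得分参数化为 s = s_base + 残差,并用标准去噪得分匹配训练残差。以下为欧氏环境空间、高斯扰动下的最小草图,s_base 作为给定函数传入(其针对旋转矩阵等具体情形的解析形式见论文)。

import torch

def mad_dsm_loss(residual_net, s_base, x0, sigma=0.1):
    """residual_net: 待学习的残差得分网络;s_base: 已知基分量(可调用);x0: 干净数据批次。"""
    noise = torch.randn_like(x0)
    x = x0 + sigma * noise
    target = -noise / sigma                       # 高斯扰动核下的DSM目标分数
    score = s_base(x) + residual_net(x)           # s = s_base + (s - s_base)
    return ((score - target) ** 2).mean() * sigma ** 2   # sigma^2 加权为常见做法(编者假设)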


自动驾驶|车辆|车道检测等(2篇)

【1】SynthCharge: An Electric Vehicle Routing Instance Generator with Feasibility Screening to Enable Learning-Based Optimization and Benchmarking
标题:SynthCharge:一款电动汽车路由实例生成器,具有可行性筛选,可实现基于学习的优化和基准测试
链接:https://arxiv.org/abs/2603.03230

作者:Mertcan Daysalilar,Fuat Uyguroglu,Gabriel Nicolosi,Adam Meyers
备注:This work has been submitted to the IEEE for possible publication
摘要:带时间窗的电动车辆路径问题(EVRPTW)通过引入电池容量约束和充电站决策扩展了经典的VRPTW。现有的基准数据集通常是静态的,缺乏可验证的可行性,这限制了基于学习的路由模型的可重复评估。我们介绍了SynthCharge,一个参数生成器,它可以在不同的时空配置和可扩展的客户数量中生成多样化的,可行性筛选的EVRPTW实例。虽然SynthCharge目前可以生成多达500个客户的大规模实例,但我们的实验重点是5到100个客户的规模。与静态基准套件不同,SynthCharge将实例几何与自适应能量容量缩放和范围感知充电站放置集成在一起。为了保证结构的有效性,生成器通过快速可行性筛选过程系统地过滤掉无法解决的实例。最终,SynthCharge提供了系统地评估新兴神经路由和数据驱动方法的鲁棒性所需的动态基准测试基础设施。
摘要:The electric vehicle routing problem with time windows (EVRPTW) extends the classical VRPTW by introducing battery capacity constraints and charging station decisions. Existing benchmark datasets are often static and lack verifiable feasibility, which restricts reproducible evaluation of learning-based routing models. We introduce SynthCharge, a parametric generator that produces diverse, feasibility-screened EVRPTW instances across varying spatiotemporal configurations and scalable customer counts. While SynthCharge can currently generate large-scale instances of up to 500 customers, we focus our experiments on sizes ranging from 5 to 100 customers. Unlike static benchmark suites, SynthCharge integrates instance geometry with adaptive energy capacity scaling and range-aware charging station placement. To guarantee structural validity, the generator systematically filters out unsolvable instances through a fast feasibility screening process. Ultimately, SynthCharge provides the dynamic benchmarking infrastructure needed to systematically evaluate the robustness of emerging neural routing and data-driven approaches.


【2】Real-Time Generative Policy via Langevin-Guided Flow Matching for Autonomous Driving
标题:基于Langevin引导流匹配的自动驾驶实时生成策略
链接:https://arxiv.org/abs/2603.02613

作者:Tianze Zhu,Yinuo Wang,Wenjun Zou,Tianyi Zhang,Likun Wang,Letian Tao,Feihong Zhang,Yao Lyu,Shengbo Eben Li
摘要:强化学习(RL)是自动驾驶系统中的一种基础方法,其中生成式策略凭借其建模复杂分布的能力增强探索,展现出可观的潜力。然而,其固有的高推理延迟严重阻碍了在实时决策与控制中的部署。为解决这一问题,我们将流匹配引入在线RL,提出基于流匹配的带熵调节器的扩散演员-评论家(DACER-F),使其能够在单个推理步骤中生成有竞争力的动作。通过利用朗之万动力学和Q函数的梯度,DACER-F动态地将来自经验回放的动作优化至一个在高Q值信息与探索行为之间取得平衡的目标分布,随后训练流策略以高效学习从简单先验分布到该动态目标的映射。在复杂的多车道和交叉口仿真中,DACER-F优于基线方法带熵调节器的扩散演员-评论家(DACER)和分布式软演员-评论家(DSAC),同时保持超低的推理延迟。DACER-F还在标准RL基准DeepMind Control Suite(DMC)上展示了可扩展性,在humanoid-stand任务中取得775.8分,超过了此前的方法。综上,这些结果确立了DACER-F作为一种高性能且计算高效的RL算法的地位。
摘要:Reinforcement learning (RL) is a fundamental methodology in autonomous driving systems, where generative policies exhibit considerable potential by leveraging their ability to model complex distributions to enhance exploration. However, their inherent high inference latency severely impedes their deployment in real-time decision-making and control. To address this issue, we propose diffusion actor-critic with entropy regulator via flow matching (DACER-F) by introducing flow matching into online RL, enabling the generation of competitive actions in a single inference step. By leveraging Langevin dynamics and gradients of the Q-function, DACER-F dynamically optimizes actions from experience replay toward a target distribution that balances high Q-value information with exploratory behavior. The flow policy is then trained to efficiently learn a mapping from a simple prior distribution to this dynamic target. In complex multi-lane and intersection simulations, DACER-F outperforms baselines diffusion actor-critic with entropy regulator (DACER) and distributional soft actor-critic (DSAC), while maintaining an ultra-low inference latency. DACER-F further demonstrates its scalability on standard RL benchmark DeepMind Control Suite (DMC), achieving a score of 775.8 in the humanoid-stand task and surpassing prior methods. Collectively, these results establish DACER-F as a high-performance and computationally efficient RL algorithm.
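
代码示意(编者补充):下面勾勒摘要所述的两步核心机制:先用朗之万动力学沿Q函数梯度微调回放动作得到目标动作,再以(线性插值路径的)流匹配回归训练单步策略。步数、步长、噪声系数与动作范围裁剪均为编者假设,vel_net 的接口亦为示意。

import torch

def langevin_refine(q_fn, obs, act, steps=5, eta=0.05, noise_scale=0.1):
    a = act.clone().requires_grad_(True)
    for _ in range(steps):
        q = q_fn(obs, a).sum()
        grad = torch.autograd.grad(q, a)[0]                    # 朝高Q值方向的梯度
        a = (a + eta * grad + noise_scale * torch.randn_like(a)
             ).detach().requires_grad_(True)                   # 朗之万更新:梯度上升 + 噪声
    return a.detach().clamp(-1.0, 1.0)                         # 假设动作范围为 [-1, 1]

def flow_matching_loss(vel_net, obs, target_act):
    x0 = torch.randn_like(target_act)                          # 简单先验
    t = torch.rand(target_act.shape[0], 1, device=target_act.device)
    x_t = (1 - t) * x0 + t * target_act                        # 线性插值路径
    v_target = target_act - x0                                 # 该路径下的真实速度场
    return ((vel_net(obs, x_t, t) - v_target) ** 2).mean()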


联邦学习|隐私保护|加密(3篇)

【1】Breaking the Prototype Bias Loop: Confidence-Aware Federated Contrastive Learning for Highly Imbalanced Clients
标题:打破原型偏差循环:面向高度不平衡客户端的置信度感知联邦对比学习
链接:https://arxiv.org/abs/2603.03007

作者:Tian-Shuang Wu,Shen-Huan Lyu,Ning Chen,Yi-Xiao He,Bing Tang,Baoliu Ye,Qingfu Zhang
摘要:客户端本地的类别不平衡和客户端之间的数据异构性,往往使基于原型的联邦对比学习陷入原型偏差循环:由不平衡数据导致的有偏局部原型被聚合成有偏的全局原型,这些原型又被反复用作对比锚点,从而在通信轮次间不断累积误差。为了打破这一循环,我们提出置信度感知联邦对比学习(CAFedCL),一个改进原型聚合机制并加强原型引导的对比对齐的新框架。CAFedCL采用置信度感知的聚合机制,利用预测不确定性来降低高方差局部原型的权重。此外,框架还结合了针对少数类的生成式增强和几何一致性正则化,以稳定类别间的结构。在理论层面,我们给出基于期望的分析,表明我们的聚合降低了估计方差,从而约束全局原型漂移并确保收敛。在不同程度的类别不平衡和数据异构性下的大量实验表明,CAFedCL在准确性和客户端公平性上均一致优于有代表性的联邦基线。
摘要:Local class imbalance and data heterogeneity across clients often trap prototype-based federated contrastive learning in a prototype bias loop: biased local prototypes induced by imbalanced data are aggregated into biased global prototypes, which are repeatedly reused as contrastive anchors, accumulating errors across communication rounds. To break this loop, we propose Confidence-Aware Federated Contrastive Learning (CAFedCL), a novel framework that improves the prototype aggregation mechanism and strengthens the contrastive alignment guided by prototypes. CAFedCL employs a confidence-aware aggregation mechanism that leverages predictive uncertainty to downweight high-variance local prototypes. In addition, generative augmentation for minority classes and geometric consistency regularization are integrated to stabilize the structure between classes. From a theoretical perspective, we provide an expectation-based analysis showing that our aggregation reduces estimation variance, thereby bounding global prototype drift and ensuring convergence. Extensive experiments under varying levels of class imbalance and data heterogeneity demonstrate that CAFedCL consistently outperforms representative federated baselines in both accuracy and client fairness.
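
代码示意(编者补充):置信度感知聚合的一种最小实现是按不确定性做逆方差加权。以下草图以局部原型的方差作为不确定性代理,权重形式为编者假设,仅用于说明"高方差原型被降权"这一机制。

import torch

def aggregate_prototypes(local_protos, local_vars, eps=1e-6):
    """local_protos: (K, C, d) K个客户端、C个类别的局部原型;
    local_vars: (K, C) 对应的不确定性(如局部原型方差)。"""
    w = 1.0 / (local_vars + eps)                   # 逆方差:高方差(低置信)原型被降权
    w = w / w.sum(dim=0, keepdim=True)             # 按类别在客户端维度归一化
    return (w.unsqueeze(-1) * local_protos).sum(dim=0)   # (C, d) 全局原型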


【2】EdgeFLow: Serverless Federated Learning via Sequential Model Migration in Edge Networks
标题:EdgeFLow:通过边缘网络中的顺序模型迁移进行无服务器联邦学习
链接:https://arxiv.org/abs/2603.02562

作者:Yuchen Shi,Qijun Hou,Pingyi Fan,Khaled B. Letaief
摘要:联邦学习(FL)已成为物联网(IoT)时代一种变革性的分布式学习范式,重塑了数据处理方法。然而,由于不可避免的客户端-服务器数据交换和长距离传输,FL系统面临显著的通信瓶颈。这项工作提出了EdgeFLow,一个创新的FL框架,通过在边缘基站之间顺序迁移模型来取代传统的云服务器,从而重新设计系统拓扑。通过仅在边缘集群进行模型聚合和传播,EdgeFLow消除了基于云的传输,并大幅降低了全局通信开销。我们在非凸目标和非IID数据分布下为EdgeFLow提供了严格的收敛性分析,扩展了经典的FL收敛理论。各种配置下的实验结果验证了理论分析,表明EdgeFLow在显著降低通信成本的同时取得了相当的精度提升。作为面向通信高效FL的系统性架构创新,EdgeFLow为物联网和边缘网络学习系统的未来发展奠定了基础框架。
摘要:Federated Learning (FL) has emerged as a transformative distributed learning paradigm in the era of Internet of Things (IoT), reconceptualizing data processing methodologies. However, FL systems face significant communication bottlenecks due to inevitable client-server data exchanges and long-distance transmissions. This work presents EdgeFLow, an innovative FL framework that redesigns the system topology by replacing traditional cloud servers with sequential model migration between edge base stations. By conducting model aggregation and propagation exclusively at edge clusters, EdgeFLow eliminates cloud-based transmissions and substantially reduces global communication overhead. We provide rigorous convergence analysis for EdgeFLow under non-convex objectives and non-IID data distributions, extending classical FL convergence theory. Experimental results across various configurations validate the theoretical analysis, demonstrating that EdgeFLow achieves comparable accuracy improvements while significantly reducing communication costs. As a systemic architectural innovation for communication-efficient FL, EdgeFLow establishes a foundational framework for future developments in IoT and edge-network learning systems.


【3】Convex and Non-convex Federated Learning with Stale Stochastic Gradients: Diminishing Step Size is All You Need
标题:具有陈旧随机梯度的凸与非凸联邦学习:递减步长即可
链接:https://arxiv.org/abs/2603.02639

作者:Xinran Zheng,Tara Javidi,Behrouz Touri
摘要:我们提出了延迟梯度模型下分布式随机优化的一个通用框架。在此设定中,$n$ 个本地代理利用各自的数据和计算来协助中央服务器最小化由代理本地成本函数组成的全局目标。每个代理可以传输其本地梯度的随机估计,该估计可能有偏且带有延迟。尽管先前的工作主张在存在延迟时为随机梯度下降(SGD)使用延迟自适应步长,我们证明预先选定的递减步长就已足够,并能达到与自适应方案相当的性能。此外,我们的分析表明,递减步长可以恢复非凸和强凸目标下的最优SGD收敛速率。
摘要:We propose a general framework for distributed stochastic optimization under delayed gradient models. In this setting, $n$ local agents leverage their own data and computation to assist a central server in minimizing a global objective composed of agents' local cost functions. Each agent is allowed to transmit stochastic (potentially biased and delayed) estimates of its local gradient. While a prior work has advocated delay-adaptive step sizes for stochastic gradient descent (SGD) in the presence of delays, we demonstrate that a pre-chosen diminishing step size is sufficient and matches the performance of the adaptive scheme. Moreover, our analysis establishes that diminishing step sizes recover the optimal SGD rates for nonconvex and strongly convex objectives.
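
代码示意(编者补充):论文的核心结论是面对陈旧(延迟)梯度时,预先选定的递减步长(如 eta_t = eta0 / sqrt(t+1))即可,无需延迟自适应。下面用一个固定延迟队列的玩具实验演示该更新规则;延迟与噪声的模拟方式为编者假设。

import numpy as np
from collections import deque

def delayed_sgd(grad_fn, x0, T=1000, delay=5, eta0=0.5, rng=None):
    rng = rng or np.random.default_rng(0)
    x, buf = np.array(x0, dtype=float), deque()
    for t in range(T):
        # 代理在当前点计算随机梯度并入队(加噪模拟随机性)
        buf.append(grad_fn(x) + 0.1 * rng.standard_normal(x.shape))
        if len(buf) > delay:                       # 服务器收到的是 delay 步前的陈旧梯度
            g = buf.popleft()
            x -= eta0 / np.sqrt(t + 1) * g         # 预先选定的递减步长
    return x

# 用法:delayed_sgd(lambda x: 2 * x, x0=np.ones(3))  # 最小化 ||x||^2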


推理|分析|理解|解释(7篇)

【1】Step-Level Sparse Autoencoder for Reasoning Process Interpretation
标题:用于推理过程解释的分步稀疏自动编码器
链接:https://arxiv.org/abs/2603.03031

作者:Xuan Yang,Jiayu Liu,Yuhang Lai,Hao Xu,Zhenya Huang,Ning Miao
摘要:大型语言模型(LLM)通过思维链(CoT)推理获得了强大的复杂推理能力。然而,它们的推理模式仍然过于复杂而难以分析。虽然稀疏自编码器(SAE)已成为可解释性研究的有力工具,但现有方法主要在token级别操作,在捕捉推理方向、语义转换等更关键的步骤级信息时造成粒度不匹配。在这项工作中,我们提出步骤级稀疏自编码器(SSAE),作为一种分析工具,将LLM推理步骤的不同方面解缠为稀疏特征。具体来说,通过在给定上下文的条件下精确控制步骤特征的稀疏性,我们在步骤重建中形成一个信息瓶颈,将增量信息从背景信息中分离出来,并将其解缠到少数稀疏激活的维度上。在多个基础模型和推理任务上的实验证明了所提取特征的有效性。通过线性探针,我们可以轻松预测表层信息(如生成长度和首token分布),以及更复杂的属性(如步骤的正确性和逻辑性)。这些观察表明,LLM在生成过程中应当已经至少部分地"知道"这些属性,这为LLM的自我验证能力提供了基础。代码可在 https://github.com/Miaow-Lab/SSAE 获取。
摘要:Large Language Models (LLMs) have achieved strong complex reasoning capabilities through Chain-of-Thought (CoT) reasoning. However, their reasoning patterns remain too complicated to analyze. While Sparse Autoencoders (SAEs) have emerged as a powerful tool for interpretability, existing approaches predominantly operate at the token level, creating a granularity mismatch when capturing more critical step-level information, such as reasoning direction and semantic transitions. In this work, we propose step-level sparse autoencoder (SSAE), which serves as an analytical tool to disentangle different aspects of LLMs' reasoning steps into sparse features. Specifically, by precisely controlling the sparsity of a step feature conditioned on its context, we form an information bottleneck in step reconstruction, which splits incremental information from background information and disentangles it into several sparsely activated dimensions. Experiments on multiple base models and reasoning tasks show the effectiveness of the extracted features. By linear probing, we can easily predict surface-level information, such as generation length and first token distribution, as well as more complicated properties, such as the correctness and logicality of the step. These observations indicate that LLMs should already at least partly know about these properties during generation, which provides the foundation for the self-verification ability of LLMs. The code is available at https://github.com/Miaow-Lab/SSAE
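
代码示意(编者补充):下面给出一个以上下文为条件、用Top-K激活形成稀疏瓶颈的步骤级SAE最小草图。条件化方式(拼接上下文表示)与Top-K稀疏化均为编者假设的常见做法,未必与论文实现一致。

import torch
import torch.nn as nn

class StepSAE(nn.Module):
    def __init__(self, d_model, n_features=4096, k=32):
        super().__init__()
        self.k = k
        self.enc = nn.Linear(2 * d_model, n_features)   # 输入:[步骤表示; 上下文表示]
        self.dec = nn.Linear(n_features, d_model)

    def forward(self, step_repr, ctx_repr):
        a = torch.relu(self.enc(torch.cat([step_repr, ctx_repr], dim=-1)))
        topk = torch.topk(a, self.k, dim=-1)
        mask = torch.zeros_like(a).scatter_(-1, topk.indices, 1.0)
        a_sparse = a * mask                              # 稀疏瓶颈:增量信息被挤入少数维度
        recon = self.dec(a_sparse)
        loss = ((recon - step_repr) ** 2).mean()         # 以上下文为条件重建步骤表示
        return recon, a_sparse, loss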


【2】SaFeR-ToolKit: Structured Reasoning via Virtual Tool Calling for Multimodal Safety
标题:SaFeR-ToolKit:通过虚拟工具调用进行结构化推理以实现多模态安全
链接:https://arxiv.org/abs/2603.02635

作者:Zixuan Xu,Tiancheng He,Huahui Yi,Kun Wang,Xi Chen,Gongli Xi,Qiankun Li,Kang Li,Yang Liu,Zhigang Zeng
摘要:视觉语言模型仍然容易受到多模态越狱和过度拒绝的影响,因为安全性同时取决于视觉证据和用户意图,而许多对齐流程只监督最终回复。为解决这一问题,我们提出SaFeR-ToolKit,将安全决策形式化为可检查的协议。具体来说,规划器指定一个角色(persona)、一个"感知→推理→决策"的工具集以及一个受约束的转移图,而应答器在给出最终答案之前先输出一条类型化的键值工具轨迹。为了使该协议在实践中被可靠遵循,我们用三阶段课程(SFT → DPO → GRPO)训练单一策略,其中GRPO直接监督工具使用情况,而不止于答案级反馈。我们的贡献有两方面。其一,数据集:首个基于工具的安全推理数据集,包含31,654个样本(SFT 6k、DPO 18.6k、GRPO 6k)以及1k条保留评估数据。其二,实验:在Qwen2.5-VL上,SaFeR-ToolKit在3B(安全性/有用性/推理严谨性从29.39/45.04/4.98提升至84.40/71.13/78.87)和7B(从53.21/52.92/19.26提升至86.34/80.79/85.34)上均显著提升,同时保持通用能力(3B:58.67→59.21;7B:66.39→66.81)。代码见 https://github.com/Duebassx/SaFeR_ToolKit。
摘要:Vision-language models remain susceptible to multimodal jailbreaks and over-refusal because safety hinges on both visual evidence and user intent, while many alignment pipelines supervise only the final response. To address this, we present SaFeR-ToolKit, which formalizes safety decision-making as a checkable protocol. Concretely, a planner specifies a persona, a Perception $\to$ Reasoning $\to$ Decision tool set, and a constrained transition graph, while a responder outputs a typed key-value tool trace before the final answer. To make the protocol reliably followed in practice, we train a single policy with a three-stage curriculum (SFT $\to$ DPO $\to$ GRPO), where GRPO directly supervises tool usage beyond answer-level feedback. Our contributions are two-fold: I. Dataset. The first tool-based safety reasoning dataset, comprising 31,654 examples (SFT 6k, DPO 18.6k, GRPO 6k) plus 1k held-out evaluation. II. Experiments. On Qwen2.5-VL, SaFeR-ToolKit significantly improves Safety/Helpfulness/Reasoning Rigor on 3B (29.39/45.04/4.98 $\to$ 84.40/71.13/78.87) and 7B (53.21/52.92/19.26 $\to$ 86.34/80.79/85.34), while preserving general capabilities (3B: 58.67 $\to$ 59.21; 7B: 66.39 $\to$ 66.81). Codes are available at https://github.com/Duebassx/SaFeR_ToolKit.


【3】Implicit Bias in Deep Linear Discriminant Analysis
标题:深度线性鉴别分析中的隐性偏差
链接:https://arxiv.org/abs/2603.02622

作者:Jiawen Li
摘要:虽然标准损失函数的隐式偏差(或隐式正则化)已有研究,但由判别式度量学习目标诱导的优化几何在很大程度上仍未被探索。据我们所知,本文对深度LDA所诱导的隐式正则化给出了初步的理论分析;深度LDA是一种尺度不变的目标,旨在最小化类内方差并最大化类间距离。通过分析 $L$ 层对角线性网络上损失的梯度流,我们证明了在平衡初始化下,该网络结构将标准的加性梯度更新转化为乘性权重更新,从而表明 $(2/L)$ 拟范数被自动守恒。
摘要:While the Implicit Bias (or Implicit Regularization) of standard loss functions has been studied, the optimization geometry induced by discriminative metric-learning objectives remains largely unexplored. To the best of our knowledge, this paper presents an initial theoretical analysis of the implicit regularization induced by Deep LDA, a scale-invariant objective designed to minimize intraclass variance and maximize interclass distance. By analyzing the gradient flow of the loss on an L-layer diagonal linear network, we prove that under balanced initialization, the network architecture transforms standard additive gradient updates into multiplicative weight updates, which demonstrates an automatic conservation of the (2/L) quasi-norm.


【4】Joint Optimization of Model Partitioning and Resource Allocation for Anti-Jamming Collaborative Inference Systems
标题:抗干扰协同推理系统模型划分和资源分配联合优化
链接:https://arxiv.org/abs/2603.02579

作者:Mengru Wu,Jiawei Li,Jiaqi Wei,Bin Lyu,Kai-Kit Wong,Hyundong Shin
摘要:随着资源受限设备上深度神经网络(DNN)推理的计算需求不断增加,基于DNN划分的设备-边缘协同推理已成为一种有前景的范式。然而,中间特征数据的传输容易受到恶意干扰,这会显著降低整体推理性能。为应对这一威胁,本快报聚焦于存在恶意干扰者时的抗干扰协同推理系统。在该系统中,DNN模型被划分为两个不同部分,分别由无线设备和边缘服务器执行。我们首先通过数据回归分析干扰和DNN划分对推理精度的影响。在此基础上,我们的目标是在推理精度和计算资源约束下,通过联合优化计算资源分配、设备发射功率和DNN划分,最大化系统的时延与精度收益(RDA)。针对这一混合整数非线性规划问题,我们提出了一种高效的基于交替优化的算法,将问题分解为三个子问题,分别通过Karush-Kuhn-Tucker条件、凸优化方法和量子遗传算法求解。大量仿真表明,我们所提出的方案在RDA方面优于各基线。
摘要:With the increasing computational demands of deep neural network (DNN) inference on resource-constrained devices, DNN partitioning-based device-edge collaborative inference has emerged as a promising paradigm. However, the transmission of intermediate feature data is vulnerable to malicious jamming, which significantly degrades the overall inference performance. To counter this threat, this letter focuses on an anti-jamming collaborative inference system in the presence of a malicious jammer. In this system, a DNN model is partitioned into two distinct segments, which are executed by wireless devices and edge servers, respectively. We first analyze the effects of jamming and DNN partitioning on inference accuracy via data regression. Based on this, our objective is to maximize the system's revenue of delay and accuracy (RDA) under inference accuracy and computing resource constraints by jointly optimizing computation resource allocation, devices' transmit power, and DNN partitioning. To address the mixed-integer nonlinear programming problem, we propose an efficient alternating optimization-based algorithm, which decomposes the problem into three subproblems that are solved via Karush-Kuhn-Tucker conditions, convex optimization methods, and a quantum genetic algorithm, respectively. Extensive simulations demonstrate that our proposed scheme outperforms baselines in terms of RDA.


【5】Through the Lens of Contrast: Self-Improving Visual Reasoning in VLMs
标题:通过对比镜头:VLM中的自我改进视觉推理
链接:https://arxiv.org/abs/2603.02556

作者:Zhiyu Pan,Yizheng Wu,Jiashen Hua,Junyi Feng,Shaotian Yan,Bing Deng,Zhiguo Cao,Jieping Ye
备注:19 pages, 9 figures, accepted to ICLR 2026 (oral)
摘要 :推理已经成为大型语言模型的关键能力。在语言任务中,这种能力可以通过自我改进技术来增强,这些技术可以细化推理路径以供后续微调。然而,将这些基于语言的自我改进方法扩展到视觉语言模型(VLM)提出了一个独特的挑战:推理路径中的视觉幻觉无法有效地验证或纠正。我们的解决方案从关于视觉对比度的关键观察开始:当呈现对比VQA对时,即,两个视觉上相似的图像与同义的问题,VLM识别相关的视觉线索更精确。基于这一观察,我们提出了视觉对比自学推理器(VC-STaR),这是一种新型的自我改进框架,它利用视觉对比来减轻模型生成的理论中的幻觉。我们收集了一套不同的VQA数据集,根据多模态相似性进行对比,并使用VC-STaR生成理论依据。因此,我们获得了一个新的视觉推理数据集VisCoR-55 K,然后通过监督微调来提高各种VLM的推理能力。大量的实验表明,VC-STaR不仅优于现有的自我改进方法,而且还优于在SoTA视觉推理数据集上微调的模型,这表明VLM固有的对比能力可以引导自己的视觉推理。项目网址:https://github.com/zhiyupan42/VC-STaR。
摘要:Reasoning has emerged as a key capability of large language models. In linguistic tasks, this capability can be enhanced by self-improving techniques that refine reasoning paths for subsequent finetuning. However, extending these language-based self-improving approaches to vision language models (VLMs) presents a unique challenge: visual hallucinations in reasoning paths cannot be effectively verified or rectified. Our solution starts with a key observation about visual contrast: when presented with a contrastive VQA pair, i.e., two visually similar images with synonymous questions, VLMs identify relevant visual cues more precisely. Motivated by this observation, we propose Visual Contrastive Self-Taught Reasoner (VC-STaR), a novel self-improving framework that leverages visual contrast to mitigate hallucinations in model-generated rationales. We collect a diverse suite of VQA datasets, curate contrastive pairs according to multi-modal similarity, and generate rationales using VC-STaR. Consequently, we obtain a new visual reasoning dataset, VisCoR-55K, which is then used to boost the reasoning capability of various VLMs through supervised finetuning. Extensive experiments show that VC-STaR not only outperforms existing self-improving approaches but also surpasses models finetuned on the SoTA visual reasoning datasets, demonstrating that the inherent contrastive ability of VLMs can bootstrap their own visual reasoning. Project at: https://github.com/zhiyupan42/VC-STaR.


【6】Federated Inference: Toward Privacy-Preserving Collaborative and Incentivized Model Serving
标题:联邦推理:面向隐私保护的协作和激励模型服务
链接:https://arxiv.org/abs/2603.02214

作者:Jungwon Seo,Ferhat Ozgur Catak,Chunming Rong,Jaeyeon Jang
备注:19 pages, 6 figures, 10 tables
摘要:联邦推理(FI)研究独立训练、私有持有的模型如何在不共享数据或模型参数的情况下于推理时进行协作。虽然近期工作已从不同角度探索了安全和分布式推理,但对FI仍缺乏统一的抽象和系统级理解。本文将FI定位为一种与联邦学习互补的独特协作范式,并确定了决定其可行性的两项基本要求:推理时的隐私保护,以及通过协作获得有意义的性能增益。我们将FI形式化为一种受保护的协作计算,分析其核心设计维度,并考察当隐私约束、非IID数据和有限可观测性在推理时被同时施加时所产生的结构性权衡。通过一个具体实例和实证分析,我们突出了隐私保护推理、基于集成(ensemble)的协作以及激励对齐中反复出现的摩擦点。我们的发现表明,FI表现出无法直接从训练时联邦或经典集成方法继承的系统级行为。总体而言,这项工作为FI提供了一个统一视角,并概述了为实现实用、可扩展且保护隐私的协作推理系统所必须解决的开放挑战。
摘要:Federated Inference (FI) studies how independently trained and privately owned models can collaborate at inference time without sharing data or model parameters. While recent work has explored secure and distributed inference from disparate perspectives, a unified abstraction and system-level understanding of FI remain lacking. This paper positions FI as a distinct collaborative paradigm, complementary to federated learning, and identifies two fundamental requirements that govern its feasibility: inference-time privacy preservation and meaningful performance gains through collaboration. We formalize FI as a protected collaborative computation, analyze its core design dimensions, and examine the structural trade-offs that arise when privacy constraints, non-IID data, and limited observability are jointly imposed at inference time. Through a concrete instantiation and empirical analysis, we highlight recurring friction points in privacy-preserving inference, ensemble-based collaboration, and incentive alignment. Our findings suggest that FI exhibits system-level behaviors that cannot be directly inherited from training-time federation or classical ensemble methods. Overall, this work provides a unifying perspective on FI and outlines open challenges that must be addressed to enable practical, scalable, and privacy-preserving collaborative inference systems.


【7】Generalized Bayes for Causal Inference
标题:因果推理的广义Bayes
链接:https://arxiv.org/abs/2603.03035

作者:Emil Javurek,Dennis Frauen,Yuxin Wang,Stefan Feuerriegel
摘要:不确定性量化是因果机器学习许多应用的核心,但对因果效应进行有原则的贝叶斯推断仍然具有挑战性。标准贝叶斯方法通常需要为数据生成过程指定概率模型,包括倾向得分和结果回归等高维滋扰成分。因此,标准后验容易受到强建模选择的影响,包括复杂的先验启发。在本文中,我们提出一个用于因果推断的广义贝叶斯框架。我们的框架避免了显式的似然建模:我们将先验直接施加于因果被估量上,并使用由识别条件驱动的损失函数进行更新,从而得到因果效应的广义后验。这样,我们的框架将现有的基于损失的因果估计量转化为具备完整不确定性量化的估计量。我们的框架十分灵活,适用于广泛的因果被估量(例如ATE、CATE)。此外,它可以叠加在最先进的因果机器学习流程(例如Neyman正交元学习器)之上。对于Neyman正交损失,我们证明广义后验收敛到其oracle对应物,并对第一阶段的滋扰估计误差保持稳健。经过校准后,即使滋扰估计的收敛速度慢于参数速率,我们也能获得有效的频率学派不确定性。实证方面,我们在多个因果推断设定中证明了所提框架能够给出带有校准不确定性的因果效应估计。据我们所知,这是首个为因果机器学习构建广义贝叶斯后验的灵活框架。
摘要:Uncertainty quantification is central to many applications of causal machine learning, yet principled Bayesian inference for causal effects remains challenging. Standard Bayesian approaches typically require specifying a probabilistic model for the data-generating process, including high-dimensional nuisance components such as propensity scores and outcome regressions. Standard posteriors are thus vulnerable to strong modeling choices, including complex prior elicitation. In this paper, we propose a generalized Bayesian framework for causal inference. Our framework avoids explicit likelihood modeling; instead, we place priors directly on the causal estimands and update these using an identification-driven loss function, which yields generalized posteriors for causal effects. As a result, our framework turns existing loss-based causal estimators into estimators with full uncertainty quantification. Our framework is flexible and applicable to a broad range of causal estimands (e.g., ATE, CATE). Further, our framework can be applied on top of state-of-the-art causal machine learning pipelines (e.g., Neyman-orthogonal meta-learners). For Neyman-orthogonal losses, we show that the generalized posteriors converge to their oracle counterparts and remain robust to first-stage nuisance estimation error. With calibration, we thus obtain valid frequentist uncertainty even when nuisance estimators converge at slower-than-parametric rates. Empirically, we demonstrate that our proposed framework offers causal effect estimation with calibrated uncertainty across several causal inference settings. To the best of our knowledge, this is the first flexible framework for constructing generalized Bayesian posteriors for causal machine learning.
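
代码示意(编者补充):广义贝叶斯后验的一般形式为 pi(theta|D) ∝ prior(theta) · exp(-lam · loss(theta; D)),其中损失可取识别驱动的因果损失(如Neyman正交损失)。下面给出随机游走Metropolis采样的通用草图;lam(学习率)与步长为需调节的超参数,采样器的选择为编者假设。

import numpy as np

def generalized_posterior_samples(loss, log_prior, theta0, lam=1.0,
                                  n_samples=5000, step=0.1, rng=None):
    """loss(theta) 与 log_prior(theta) 均返回标量;theta 可为向量。"""
    rng = rng or np.random.default_rng(0)
    theta = np.atleast_1d(np.array(theta0, dtype=float))
    logp = log_prior(theta) - lam * loss(theta)      # 广义后验的对数密度(未归一化)
    out = []
    for _ in range(n_samples):
        prop = theta + step * rng.standard_normal(theta.shape)
        logp_prop = log_prior(prop) - lam * loss(prop)
        if np.log(rng.uniform()) < logp_prop - logp: # Metropolis接受准则
            theta, logp = prop, logp_prop
        out.append(theta.copy())
    return np.array(out)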


检测相关(1篇)

【1】Single Microphone Own Voice Detection based on Simulated Transfer Functions for Hearing Aids
标题:基于模拟传递函数的助听器单麦克风自身语音检测
链接:https://arxiv.org/abs/2603.02724

作者:Mathuranathan Mayuravaani,W. Bastiaan Kleijn,Andrew Lensen,Charlotte Sørensen
摘要:本文提出了一种基于仿真的方法,用于仅使用单个麦克风的助听器自身语音检测(OVD)。虽然OVD可以显著提升用户舒适度和语音可懂度,但现有方案通常依赖多个麦克风或额外传感器,增加了设备的复杂性和成本。为了在不进行昂贵传递函数测量的情况下实现基于机器学习的OVD,我们提出一种基于仿真声学传递函数(ATF)的数据增强策略,使模型接触到广泛的空间传播条件。基于Transformer的分类器首先在解析生成的ATF上训练,然后使用数值仿真的ATF逐步微调,从刚性球体模型过渡到精细的头部与躯干表示。这种分层适应使模型能够在保持泛化能力的同时细化其空间理解。实验结果显示,模型在仿真头部与躯干测试数据上达到95.52%的准确率;在短时长条件下,对一秒话语仍保持90.02%的准确率。在真实助听器录音上,借助轻量的测试时特征补偿,模型在未经微调的情况下达到了80%的准确率。这突显了模型从仿真到真实条件的泛化能力,证明了其实际可行性,并为未来助听器设计指出了一个有前景的方向。
摘要:This paper presents a simulation-based approach to own voice detection (OVD) in hearing aids using a single microphone. While OVD can significantly improve user comfort and speech intelligibility, existing solutions often rely on multiple microphones or additional sensors, increasing device complexity and cost. To enable ML-based OVD without requiring costly transfer-function measurements, we propose a data augmentation strategy based on simulated acoustic transfer functions (ATFs) that expose the model to a wide range of spatial propagation conditions. A transformer-based classifier is first trained on analytically generated ATFs and then progressively fine-tuned using numerically simulated ATFs, transitioning from a rigid-sphere model to a detailed head-and-torso representation. This hierarchical adaptation enabled the model to refine its spatial understanding while maintaining generalization. Experimental results show 95.52% accuracy on simulated head-and-torso test data. Under short-duration conditions, the model maintained 90.02% accuracy with one-second utterances. On real hearing aid recordings, the model achieved 80% accuracy without fine-tuning, aided by lightweight test-time feature compensation. This highlights the model's ability to generalize from simulated to real-world conditions, demonstrating practical viability and pointing toward a promising direction for future hearing aid design.


分类|识别(5篇)

【1】CFG-Ctrl: Control-Based Classifier-Free Diffusion Guidance
标题:CFG-Ctrl:基于控制的无分类器扩散引导
链接:https://arxiv.org/abs/2603.03281

作者:Hanyang Wang,Yiyang Liu,Jiawei Chi,Fangfu Liu,Ran Xue,Yueqi Duan
备注:Accepted by CVPR 2026; Project Page: https://hanyang-21.github.io/CFG-Ctrl
摘要:无分类器引导(CFG)已成为提升基于流的扩散模型中语义对齐的核心方法。在本文中,我们探索了一个名为CFG-Ctrl的统一框架,它将CFG重新解释为施加在一阶连续时间生成流上的控制,把条件与无条件之间的差异用作调整速度场的误差信号。从这个角度看,我们将原始CFG归纳为具有固定增益的比例控制器(P控制),而典型的后续变体则在其基础上发展出扩展的控制律设计。然而,现有方法主要依赖线性控制,天然地导致不稳定、超调和语义保真度下降,在大引导尺度下尤为明显。为解决这一问题,我们提出滑模控制CFG(SMC-CFG),它迫使生成流趋向一个快速收敛的滑动流形。具体来说,我们在语义预测误差上定义一个指数滑模面,并引入一个开关控制项以建立非线性的反馈引导校正。此外,我们给出了Lyapunov稳定性分析,从理论上支持有限时间收敛。在包括Stable Diffusion 3.5、Flux和Qwen-Image在内的文本到图像生成模型上的实验表明,SMC-CFG在语义对齐方面优于标准CFG,并在广泛的引导尺度范围内增强了鲁棒性。项目页面:https://hanyang-21.github.io/CFG-Ctrl
摘要:Classifier-Free Guidance (CFG) has emerged as a central approach for enhancing semantic alignment in flow-based diffusion models. In this paper, we explore a unified framework called CFG-Ctrl, which reinterprets CFG as a control applied to the first-order continuous-time generative flow, using the conditional-unconditional discrepancy as an error signal to adjust the velocity field. From this perspective, we summarize vanilla CFG as a proportional controller (P-control) with fixed gain, and typical follow-up variants develop extended control-law designs derived from it. However, existing methods mainly rely on linear control, inherently leading to instability, overshooting, and degraded semantic fidelity especially on large guidance scales. To address this, we introduce Sliding Mode Control CFG (SMC-CFG), which enforces the generative flow toward a rapidly convergent sliding manifold. Specifically, we define an exponential sliding mode surface over the semantic prediction error and introduce a switching control term to establish nonlinear feedback-guided correction. Moreover, we provide a Lyapunov stability analysis to theoretically support finite-time convergence. Experiments across text-to-image generation models including Stable Diffusion 3.5, Flux, and Qwen-Image demonstrate that SMC-CFG outperforms standard CFG in semantic alignment and enhances robustness across a wide range of guidance scales. Project Page: https://hanyang-21.github.io/CFG-Ctrl
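
代码示意(编者补充):按控制视角,原始CFG相当于固定增益的P控制。下面先给出这一对应关系,再附一个带近似开关项的滑模式变体草图;开关项的具体设计(此处用tanh近似sign以缓解抖振)为编者假设,论文实际使用指数滑模面并配有Lyapunov分析,细节以原文为准。

import torch

def cfg_p_control(v_cond, v_uncond, w=7.5):
    e = v_cond - v_uncond                 # 误差信号:条件与无条件速度场之差
    return v_uncond + w * e               # 原始CFG = 固定增益的P控制

def cfg_smc_step(v_cond, v_uncond, w=7.5, k_switch=0.5, phi=0.1):
    e = v_cond - v_uncond
    switching = k_switch * torch.tanh(e / phi)   # 开关项:tanh(e/phi) 近似 sign(e)
    return v_uncond + w * e + switching          # 在P控制基础上叠加非线性反馈校正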


【2】The Price of Robustness: Stable Classifiers Need Overparameterization
标题:稳健性的代价:稳定的分类器需要过度参数化
链接:https://arxiv.org/abs/2603.02806

作者:Jonas von Berg,Adalbert Fono,Massimiliano Datres,Sohir Maskey,Gitta Kutyniok
备注:29 pages, 9 figures. Accepted at ICLR 2026
摘要:在不连续分类器的设定下,过参数化、稳定性与泛化之间的关系仍未被完全理解。我们通过为有限函数类建立一个泛化界来填补这一空白,该界随类稳定性的提高而反比地改进;类稳定性定义为输入域中到决策边界的期望距离(间隔)。将类稳定性解释为一种可量化的鲁棒性概念,我们作为推论导出了分类的鲁棒性定律,把Bubeck和Sellke的结果从光滑性假设推广到不连续函数。特别地,在 $n$ 个数据点上具有 $p \approx n$ 个参数的任何插值模型必然不稳定,这意味着要实现高稳定性就必须进行大幅度的过参数化。通过分析一个源自值域间隔的更强鲁棒性度量(我们称之为归一化共稳定性,normalized co-stability),我们对参数化的无穷函数类得到了类似结果。实验支持我们的理论:稳定性随模型规模增大而提高,并与测试性能相关,而传统的基于范数的度量在很大程度上不具信息量。
摘要:The relationship between overparameterization, stability, and generalization remains incompletely understood in the setting of discontinuous classifiers. We address this gap by establishing a generalization bound for finite function classes that improves inversely with class stability, defined as the expected distance to the decision boundary in the input domain (margin). Interpreting class stability as a quantifiable notion of robustness, we derive as a corollary a law of robustness for classification that extends the results of Bubeck and Sellke beyond smoothness assumptions to discontinuous functions. In particular, any interpolating model with $p \approx n$ parameters on $n$ data points must be unstable, implying that substantial overparameterization is necessary to achieve high stability. We obtain analogous results for parameterized infinite function classes by analyzing a stronger robustness measure derived from the margin in the codomain, which we refer to as the normalized co-stability. Experiments support our theory: stability increases with model size and correlates with test performance, while traditional norm-based measures remain largely uninformative.


【3】Large-Scale Dataset and Benchmark for Skin Tone Classification in the Wild
标题:野外肤色分类的大规模数据集和基准
链接:https://arxiv.org/abs/2603.02475

作者:Vitor Pereira Matias,Márcus Vinícius Lobo Costa,João Batista Neto,Tiago Novello de Brito
备注:12 pages, 11 figures
摘要:深度学习模型往往会从训练数据中继承偏差。虽然跨性别和种族的公平性已得到充分研究,但由于缺乏细粒度的标注数据集,精细的肤色分析仍然是一个挑战。现有方法通常依赖缺乏视觉代表性的医学六色调Fitzpatrick量表,或使用妨碍可复现性的小型私有数据集,且大多依赖经典计算机视觉流程,仅有少数采用深度学习。它们忽略了训练-测试泄漏和数据集不平衡等问题,并受限于规模小或不可获得的数据集。在这项工作中,我们提出了一个肤色公平性的综合框架。首先,我们介绍了STW,一个大规模开放获取数据集,包含来自3,564个人的42,313张图像,并使用10色调MST量表进行标注。其次,我们对经典计算机视觉(SkinToneCCV)和深度学习方法进行了基准测试,表明经典模型给出接近随机的结果,而深度学习达到了接近标注者的准确率。最后,我们提出了SkinToneNet,一个经过微调的ViT,在域外数据上实现了最先进的泛化能力,从而能够对CelebA和VGGFace2等公共数据集进行可靠的公平性审计。这项工作在肤色分类和公平性评估方面给出了最先进的结果。代码和数据即将发布。
摘要:Deep learning models often inherit biases from their training data. While fairness across gender and ethnicity is well-studied, fine-grained skin tone analysis remains a challenge due to the lack of granular, annotated datasets. Existing methods often rely on the medical 6-tone Fitzpatrick scale, which lacks visual representativeness, or use small, private datasets that prevent reproducibility, or often rely on classic computer vision pipelines, with a few using deep learning. They overlook issues like train-test leakage and dataset imbalance, and are limited by small or unavailable datasets. In this work, we present a comprehensive framework for skin tone fairness. First, we introduce the STW, a large-scale, open-access dataset comprising 42,313 images from 3,564 individuals, labeled using the 10-tone MST scale. Second, we benchmark both Classic Computer Vision (SkinToneCCV) and Deep Learning approaches, demonstrating that classic models provide near-random results, while deep learning reaches nearly annotator accuracy. Finally, we propose SkinToneNet, a fine-tuned ViT that achieves state-of-the-art generalization on out-of-domain data, which enables reliable fairness auditing of public datasets like CelebA and VGGFace2. This work provides state-of-the-art results in skin tone classification and fairness assessment. Code and data available soon


【4】Neural quantum support vector data description for one-class classification
标题:用于单类分类的神经量子支持向量数据描述
链接:https://arxiv.org/abs/2603.02700

作者:Changjae Im,Hyeondo Oh,Daniel K. Park
备注:17 pages, 7 figures
摘要:单类分类(OCC)是机器学习中的一个基本问题,有许多应用,如异常检测和质量控制。随着现代数据集的复杂性和维数的增加,对具有更好表达性和效率的高级OCC技术的需求不断增长。我们介绍了神经量子支持向量数据描述(NQSVDD),这是一个用于OCC的经典-量子混合框架,可以执行端到端的优化分层表示学习。NQSVDD集成了一个经典的神经网络与可训练的量子数据编码和变分量子电路,使模型能够学习为OCC目标量身定制的非线性特征变换。混合架构将输入数据映射到中间高维特征空间,随后将其投影到通过量子测量定义的紧凑潜在空间中。重要的是,特征嵌入和潜在表示都被联合优化,使得正常数据形成一个紧凑的集群,最小体积的超球体提供了一个有效的决策边界。在基准数据集上的实验评估表明,与经典的Deep SVDD和量子基线相比,NQSVDD实现了具有竞争力或优越的AUC性能,同时在现实噪声条件下保持了参数效率和鲁棒性。
摘要:One-class classification (OCC) is a fundamental problem in machine learning with numerous applications, such as anomaly detection and quality control. With the increasing complexity and dimensionality of modern datasets, there is a growing demand for advanced OCC techniques with better expressivity and efficiency. We introduce Neural Quantum Support Vector Data Description (NQSVDD), a classical-quantum hybrid framework for OCC that performs end-to-end optimized hierarchical representation learning. NQSVDD integrates a classical neural network with trainable quantum data encoding and a variational quantum circuit, enabling the model to learn nonlinear feature transformations tailored to the OCC objective. The hybrid architecture maps input data into an intermediate high-dimensional feature space and subsequently projects it into a compact latent space defined through quantum measurements. Importantly, both the feature embedding and the latent representation are jointly optimized such that normal data form a compact cluster, for which a minimum-volume enclosing hypersphere provides an effective decision boundary. Experimental evaluations on benchmark datasets demonstrate that NQSVDD achieves competitive or superior AUC performance compared to classical Deep SVDD and quantum baselines, while maintaining parameter efficiency and robustness under realistic noise conditions.


【5】LMU-Based Sequential Learning and Posterior Ensemble Fusion for Cross-Domain Infant Cry Classification
标题:基于LMU的序列学习与后验集成融合用于跨领域婴儿哭声分类
链接:https://arxiv.org/abs/2603.02245

作者:Niloofar Jazaeri,Hilmi R. Dajani,Marco Janeczek,Martin Bouchard
备注:7 pages
摘要:由于信号短且非平稳、标注有限,以及婴儿个体和数据集之间存在强域偏移,解码婴儿哭声原因对医疗健康监测而言仍具挑战性。我们提出一个紧凑的声学框架,在多分支CNN编码器中融合MFCC、STFT和音高特征,并使用增强的勒让德记忆单元(LMU)建模时间动态。与LSTM相比,LMU主干以显著更少的循环参数提供稳定的序列建模,便于高效部署。为提升跨数据集泛化能力,我们引入带熵门控加权的校准后验集成融合,在保留域特定专长的同时减轻数据集偏差。在Baby2020和Baby Crying上的实验表明,在跨域评估下宏F1得到提升,同时具备防泄漏的数据划分和端侧实时监测的可行性。
摘要:Decoding infant cry causes remains challenging for healthcare monitoring due to short nonstationary signals, limited annotations, and strong domain shifts across infants and datasets. We propose a compact acoustic framework that fuses MFCC, STFT, and pitch features within a multi-branch CNN encoder and models temporal dynamics using an enhanced Legendre Memory Unit (LMU). Compared to LSTMs, the LMU backbone provides stable sequence modeling with substantially fewer recurrent parameters, supporting efficient deployment. To improve cross-dataset generalization, we introduce calibrated posterior ensemble fusion with entropy-gated weighting to preserve domain-specific expertise while mitigating dataset bias. Experiments on Baby2020 and Baby Crying demonstrate improved macro-F1 under cross-domain evaluation, along with leakage-aware splits and real-time feasibility for on-device monitoring.
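
代码示意(编者补充):下面给出"熵门控加权的校准后验集成融合"的一种简单实现:先对各域模型做温度缩放校准(可选),再以归一化熵的补作为权重,使低熵(高置信)模型占更大比重。门控与校准的具体形式为编者假设。

import numpy as np

def entropy_gated_fusion(posteriors, temps=None, eps=1e-12):
    """posteriors: 列表,每项为某个域模型的 (N, C) 后验概率;temps: 可选的各模型校准温度。"""
    fused = np.zeros_like(posteriors[0])
    C = posteriors[0].shape[1]
    for i, p in enumerate(posteriors):
        if temps is not None:                          # 温度缩放校准(可选)
            logits = np.log(p + eps) / temps[i]
            logits -= logits.max(axis=1, keepdims=True)
            p = np.exp(logits)
            p /= p.sum(axis=1, keepdims=True)
        H = -(p * np.log(p + eps)).sum(axis=1, keepdims=True) / np.log(C)
        fused += (1.0 - H) * p                         # 熵门控:归一化熵越低,权重越大
    return fused / fused.sum(axis=1, keepdims=True)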


表征(3篇)

【1】Guiding Sparse Neural Networks with Neurobiological Principles to Elicit Biologically Plausible Representations
标题:用神经生物学原理引导稀疏神经网络以引出生物学上合理的表示
链接:https://arxiv.org/abs/2603.03234

作者:Patrick Inoue,Florian Röhrbein,Andreas Knoblauch
摘要:虽然深度神经网络(DNN)在图像识别等任务中取得了显着的性能,但它们经常在泛化、从少数例子中学习和持续适应(生物神经系统固有的能力)方面遇到困难。这些挑战的出现是由于DNN未能模仿生物网络的有效,自适应学习机制。为了解决这些问题,我们探讨了神经生物学启发的假设在神经网络学习中的整合。这项研究介绍了一种生物启发的学习规则,自然地集成了神经生物学原理,包括稀疏性,对数正态权重分布,并遵守戴尔定律,而不需要明确的执行。通过与这些核心神经生物学原理保持一致,我们的模型增强了对抗性攻击的鲁棒性,并展示了卓越的泛化能力,特别是在Few-Shot学习场景中。值得注意的是,整合这些约束导致生物学上合理的神经表征的出现,强调了将神经生物学假设纳入神经网络设计的有效性。初步结果表明,这种方法可以从特定于特征的编码扩展到特定于任务的编码,可能为复杂任务的神经资源分配提供见解。
摘要:While deep neural networks (DNNs) have achieved remarkable performance in tasks such as image recognition, they often struggle with generalization, learning from few examples, and continuous adaptation - abilities inherent in biological neural systems. These challenges arise due to DNNs' failure to emulate the efficient, adaptive learning mechanisms of biological networks. To address these issues, we explore the integration of neurobiologically inspired assumptions in neural network learning. This study introduces a biologically inspired learning rule that naturally integrates neurobiological principles, including sparsity, lognormal weight distributions, and adherence to Dale's law, without requiring explicit enforcement. By aligning with these core neurobiological principles, our model enhances robustness against adversarial attacks and demonstrates superior generalization, particularly in few-shot learning scenarios. Notably, integrating these constraints leads to the emergence of biologically plausible neural representations, underscoring the efficacy of incorporating neurobiological assumptions into neural network design. Preliminary results suggest that this approach could extend from feature-specific to task-specific encoding, potentially offering insights into neural resource allocation for complex tasks.


【2】Information Routing in Atomistic Foundation Models: How Equivariance Creates Linearly Disentangled Representations
标题:原子基础模型中的信息路由:等变性如何产生线性解纠缠表示
链接:https://arxiv.org/abs/2603.03155

作者:Joshua Steier
摘要:原子基础模型在其中间表示中编码了什么?这些信息又是如何组织的?我们提出组合投影分解(CPD),它利用QR投影从学习到的表示中线性去除组合信号,并探测其几何残差。在QM9分子与Materials Project晶体上,对来自五个架构家族的八个模型的分析揭示出一个解纠缠梯度:张量积等变架构(MACE)生成的表示在去除组合信号后,其几何信息几乎完全线性可达(HOMO-LUMO间隙的$R^2_{\text{geom}} = 0.782$);而手工描述符(ANI-2x)则将同样的信息非线性地纠缠在一起(Ridge下$R^2_{\text{geom}} = -0.792$;MLP下$R^2 = +0.784$)。MACE通过不可约表示通道路由特定目标的信号(偶极走$L = 1$通道,HOMO-LUMO间隙走$L = 0$通道),而在同一探针下,ViSNet的矢量-标量架构中未观察到这一模式。我们还表明,在投影残差上使用梯度提升树探针会系统性地高估结果:它在纯组合目标上仍能恢复$R^2 = 0.68$至$0.95$,因此我们建议以线性探针作为主要指标。线性解纠缠的表示在线性探测下具有更高的样本效率,表明等变架构在原始预测精度之外还具有实际优势。
摘要:What do atomistic foundation models encode in their intermediate representations, and how is that information organized? We introduce Composition Projection Decomposition (CPD), which uses QR projection to linearly remove composition signal from learned representations and probes the geometric residual. Across eight models from five architectural families on QM9 molecules and Materials Project crystals, we find a disentanglement gradient: tensor product equivariant architectures (MACE) produce representations where geometry is almost fully linearly accessible after composition removal ($R^2_{\text{geom}} = 0.782$ for HOMO-LUMO gap), while handcrafted descriptors (ANI-2x) entangle the same information nonlinearly ($R^2_{\text{geom}} = -0.792$ under Ridge; $R^2 = +0.784$ under MLP). MACE routes target-specific signal through irreducible representation channels -- dipole to $L = 1$, HOMO-LUMO gap to $L = 0$ -- a pattern not observed in ViSNet's vector-scalar architecture under the same probe. We show that gradient boosted tree probes on projected residuals are systematically inflated, recovering $R^2 = 0.68$--$0.95$ on a purely compositional target, and recommend linear probes as the primary metric. Linearly disentangled representations are more sample-efficient under linear probing, suggesting a practical advantage for equivariant architectures beyond raw prediction accuracy.
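摘要所述"QR投影去除组合信号"这一步可用下面的示意代码理解(数据与维度均为占位假设,非原论文实现):用QR分解得到组合特征张成空间的正交基,将表示投影到其正交补上,再对残差做线性探针。

```python
# CPD核心步骤的极简示意:用QR投影从表示中线性去除组合信号,
# 再对几何残差做线性探针(占位数据,仅作说明)。
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
H = rng.normal(size=(500, 128))   # 学到的中间表示(占位)
C = rng.normal(size=(500, 10))    # 组合特征,如元素计数(占位)
y_geom = rng.normal(size=500)     # 几何相关目标(占位)

Q, _ = np.linalg.qr(C)            # 组合特征张成空间的正交基
H_res = H - Q @ (Q.T @ H)         # 投影到正交补:线性去除组合信号

probe = Ridge(alpha=1.0).fit(H_res, y_geom)
r2_geom = probe.score(H_res, y_geom)   # 线性探针给出的R^2_geom
```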


【3】Concept Heterogeneity-aware Representation Steering
标题:概念异质性感知的表示转向
链接:https://arxiv.org/abs/2603.02237

作者:Laziz U. Abdullaev,Noelle Y. L. Wong,Ryan T. Z. Lee,Shiqi Jiang,Khoi N. M. Nguyen,Tan M. Nguyen
摘要:表示转向提供了一种轻量级机制,通过在推理时干预内部激活来控制大型语言模型(LLM)的行为。大多数现有方法依赖单一的全局转向方向,通常通过对比数据集上的均值差获得。这种做法隐含地假设目标概念在嵌入空间中是均匀表示的。然而在实践中,LLM的表示可能高度不均匀,呈现出成簇的、依赖上下文的结构,使得全局转向方向变得脆弱。在这项工作中,我们从最优传输(OT)的视角审视表示转向,指出标准的均值差转向隐含地对应于两个协方差相同的单峰高斯分布之间的OT映射,即一个全局平移。为放松这一限制性假设,我们在理论上将源表示和目标表示建模为高斯混合模型,并将转向表述为语义潜在簇之间的离散OT问题。基于所得的传输计划,我们通过重心投影导出一个显式的、依赖输入的转向映射,得到簇级偏移的平滑核加权组合。我们将该方法称为概念异质性感知表示转向(CHaRS)。在多种实验设置中,我们表明CHaRS比全局转向能实现更有效的行为控制。
摘要:Representation steering offers a lightweight mechanism for controlling the behavior of large language models (LLMs) by intervening on internal activations at inference time. Most existing methods rely on a single global steering direction, typically obtained via difference-in-means over contrastive datasets. This approach implicitly assumes that the target concept is homogeneously represented across the embedding space. In practice, however, LLM representations can be highly non-homogeneous, exhibiting clustered, context-dependent structure, which renders global steering directions brittle. In this work, we view representation steering through the lens of optimal transport (OT), noting that standard difference-in-means steering implicitly corresponds to the OT map between two unimodal Gaussian distributions with identical covariance, yielding a global translation. To relax this restrictive assumption, we theoretically model source and target representations as Gaussian mixture models and formulate steering as a discrete OT problem between semantic latent clusters. From the resulting transport plan, we derive an explicit, input-dependent steering map via barycentric projection, producing a smooth, kernel-weighted combination of cluster-level shifts. We term this method Concept Heterogeneity-aware Representation Steering (CHaRS). Through numerous experimental settings, we show that CHaRS yields more effective behavioral control than global steering.
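为对比"单一全局转向方向"与摘要中"依赖输入的簇级转向",下面给出一个示意(为简化,我们用K均值簇的一一配对代替论文中由OT传输计划给出的簇间匹配;数据与常数均为假设):

```python
# 全局均值差转向 vs. 簇级核加权转向的示意
# (简化版:用簇的一一配对近似论文中的OT传输计划,仅作说明)。
import numpy as np
from sklearn.cluster import KMeans

src = np.random.randn(1000, 64)          # 不含目标概念的激活(占位)
tgt = np.random.randn(1000, 64) + 0.5    # 含目标概念的激活(占位)

v_global = tgt.mean(0) - src.mean(0)     # 标准均值差方向(全局平移)

k = 4
km_s = KMeans(n_clusters=k, n_init=10).fit(src)
km_t = KMeans(n_clusters=k, n_init=10).fit(tgt)
shifts = km_t.cluster_centers_ - km_s.cluster_centers_  # 簇级偏移(简化配对)

def steer(h, tau=1.0):
    d2 = ((h - km_s.cluster_centers_) ** 2).sum(axis=1)
    w = np.exp(-d2 / (2 * tau ** 2)); w /= w.sum()       # 对各簇的核权重
    return h + w @ shifts                                # 簇级偏移的平滑加权组合

h_steered = steer(np.random.randn(64))
```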


编码器(1篇)

【1】On Geometry Regularization in Autoencoder Reduced-Order Models with Latent Neural ODE Dynamics
标题:具有潜在神经ODE动力学的自动编码器降阶模型中的几何正则化
链接:https://arxiv.org/abs/2603.03238

作者:Mikhail Osipov
备注:25 pages, 2 figures, 3 tables
摘要:我们研究编码器-解码器降阶模型中所学潜在表示的几何正则化策略。在对流-扩散-反应(ADR)方程的固定实验设置下,我们使用神经ODE对潜在动力学建模,并评估在自动编码器预训练期间施加的四种正则化方法:(a)对解码器雅可比矩阵的近等距正则化,(b)基于随机方向增益的随机解码器增益惩罚,(c)二阶方向曲率惩罚,以及(d)对解码器第一层的Stiefel投影。在多个随机种子上,我们发现(a)-(c)往往产生这样的潜在表示:即便它们改善了解码器的局部平滑度或相关的敏感性代理指标,也会使后续在冻结自动编码器下训练潜在动力学变得更加困难,对长时程滚动预测尤其如此。相比之下,(d)能一致地改善所学潜在动力学的条件数相关诊断,并倾向于带来更好的滚动预测性能。我们讨论了如下假设:在该设定中,潜在空间几何失配对下游的影响超过了解码器平滑度改进所带来的好处。
摘要:We investigate geometric regularization strategies for learned latent representations in encoder--decoder reduced-order models. In a fixed experimental setting for the advection--diffusion--reaction (ADR) equation, we model latent dynamics using a neural ODE and evaluate four regularization approaches applied during autoencoder pre-training: (a) near-isometry regularization of the decoder Jacobian, (b) a stochastic decoder gain penalty based on random directional gains, (c) a second-order directional curvature penalty, and (d) Stiefel projection of the first decoder layer. Across multiple seeds, we find that (a)--(c) often produce latent representations that make subsequent latent-dynamics training with a frozen autoencoder more difficult, especially for long-horizon rollouts, even when they improve local decoder smoothness or related sensitivity proxies. In contrast, (d) consistently improves conditioning-related diagnostics of the learned latent dynamics and tends to yield better rollout performance. We discuss the hypothesis that, in this setting, the downstream impact of latent-geometry mismatch outweighs the benefits of improved decoder smoothness.
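摘要中的正则项(a)(解码器雅可比的近等距正则化)可以用随机方向上的雅可比-向量积来随机估计。下面是一个假设性的PyTorch示意(网络结构与维度均为占位):

```python
# 正则项(a)的示意:用jvp随机估计解码器在随机潜在方向上的增益,
# 并惩罚其偏离1(近等距)。仅为示意,非原论文实现。
import torch
from torch.func import jvp

decoder = torch.nn.Sequential(torch.nn.Linear(8, 64), torch.nn.Tanh(),
                              torch.nn.Linear(64, 256))

def near_isometry_penalty(z):
    v = torch.randn_like(z)
    v = v / v.norm(dim=-1, keepdim=True)   # 随机单位潜在方向
    _, Jv = jvp(decoder, (z,), (v,))       # 雅可比-向量积 J(z) v
    gain = Jv.norm(dim=-1)                 # 解码器在该方向上的增益
    return ((gain - 1.0) ** 2).mean()      # 惩罚增益偏离1(偏离等距)

z = torch.randn(32, 8)
penalty = near_isometry_penalty(z)
```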


优化|敛散性(6篇)

【1】LAGO: A Local-Global Optimization Framework Combining Trust Region Methods and Bayesian Optimization
标题:LAGO:结合信任域方法和Bayesian优化的局部-全局优化框架
链接:https://arxiv.org/abs/2603.02970

作者:Eliott Van Dieren,Tommaso Vanzan,Fabio Nobile
备注:22 pages, 8 figures
摘要:我们介绍LAGO,局部-全局优化算法,通过自适应竞争机制,结合梯度增强贝叶斯优化(BO)与基于梯度的信任域局部细化。在每次迭代中,全局和局部优化策略独立地提出候选点,并且基于预测的改进来选择下一个评估。LAGO在建议级别将全局探索与局部细化分开:BO采集函数在主动信任区域之外进行优化,而局部函数和梯度评估仅在满足基于长度尺度的最小距离标准时才被纳入全局梯度增强高斯过程,从而降低了局部开发期间数值不稳定的风险。这使得在到达有希望的区域时能够进行有效的局部细化,而不会牺牲设计空间的全局搜索。因此,与用于平滑函数的标准非线性局部优化算法相比,该方法实现了对整个设计空间的改进的探索,同时在感兴趣的区域中保持快速局部收敛。
摘要 :We introduce LAGO, a LocAl-Global Optimization algorithm that combines gradient-enhanced Bayesian Optimization (BO) with gradient-based trust region local refinement through an adaptive competition mechanism. At each iteration, global and local optimization strategies independently propose candidate points, and the next evaluation is selected based on predicted improvement. LAGO separates global exploration from local refinement at the proposal level: the BO acquisition function is optimized outside the active trust region, while local function and gradient evaluations are incorporated into the global gradient-enhanced Gaussian process only when they satisfy a lengthscale-based minimum-distance criterion, reducing the risk of numerical instability during the local exploitation. This enables efficient local refinement when reaching promising regions, without sacrificing a global search of the design space. As a result, the method achieves an improved exploration of the full design space compared to standard non-linear local optimization algorithms for smooth functions, while maintaining fast local convergence in regions of interest.


【2】Deep learning-guided evolutionary optimization for protein design
标题:深度学习引导的蛋白质设计进化优化
链接:https://arxiv.org/abs/2603.02753

作者:Erik Hartman,Di Tang,Johan Malmström
备注:Code available at GitHub
摘要:由于序列空间巨大且序列-功能关系复杂,设计具有所需特性的新型蛋白质仍是一项重大挑战。高效探索这一空间以鉴定满足特定设计标准的序列,对推进治疗和生物技术至关重要。在此,我们提出BoGA(贝叶斯优化遗传算法),一个将进化搜索与贝叶斯优化相结合、以高效导航序列空间的框架。通过将遗传算法作为代理建模循环中的随机候选生成器,BoGA依据先前的评估和代理模型预测对候选序列进行优先排序,实现数据高效的优化。我们通过在序列和结构设计任务上的基准测试展示了BoGA的实用性,随后将其应用于设计针对肺炎链球菌溶血素(肺炎链球菌的一个关键毒力因子)的肽结合剂。BoGA加速了高置信度结合剂的发现,展现出面向多种目标进行高效蛋白质设计的潜力。该算法在BoPep套件中实现,并以MIT许可证发布于\href{https://github.com/ErikHartman/bopep}{GitHub}。
摘要:Designing novel proteins with desired characteristics remains a significant challenge due to the large sequence space and the complexity of sequence-function relationships. Efficient exploration of this space to identify sequences that meet specific design criteria is crucial for advancing therapeutics and biotechnology. Here, we present BoGA (Bayesian Optimization Genetic Algorithm), a framework that combines evolutionary search with Bayesian optimization to efficiently navigate the sequence space. By integrating a genetic algorithm as a stochastic proposal generator within a surrogate modeling loop, BoGA prioritizes candidates based on prior evaluations and surrogate model predictions, enabling data-efficient optimization. We demonstrate the utility of BoGA through benchmarking on sequence and structure design tasks, followed by its application in designing peptide binders against pneumolysin, a key virulence factor of \textit{Streptococcus pneumoniae}. BoGA accelerates the discovery of high-confidence binders, demonstrating the potential for efficient protein design across diverse objectives. The algorithm is implemented within the BoPep suite and is available under an MIT license at \href{https://github.com/ErikHartman/bopep}{GitHub}.


【3】Dimension-Independent Convergence of Underdamped Langevin Monte Carlo in KL Divergence
标题:KL散度下欠阻尼Langevin Monte Carlo的维度无关收敛
链接:https://arxiv.org/abs/2603.02429

作者:Shiyuan Zhang,Qiwei Di,Xuheng Li,Quanquan Gu
备注:51 pages, 1 table
摘要:欠阻尼朗之万动力学(Underdamped Langevin dynamics,ULD)是针对吉布斯分布$\pi \propto e^{-V}$的一种广泛使用的采样器,在高维情形下通常经验上有效。然而,现有针对离散化ULD的非渐近收敛保证通常随环境维度$d$呈多项式增长,当$d$很大时这些界形同虚设。目前主要的维度无关结果是Wasserstein-2距离下的随机中点离散化(Liu等人,2023),而KL散度下ULD离散化的维度无关保证此前仍是开放问题。我们填补了这一空白,证明了离散化ULD的首个维度无关KL散度界。我们的分析将KL局部误差框架(Altschuler等人,2025)细化到维度无关的设定,所得的界依赖于$\mathrm{tr}(\mathbf{H})$而非$d$,其中$\mathbf{H}$是$V$的Hessian的上界。因此,在$\mathrm{tr}(\mathbf{H})\ll d$的情形下,相对于过阻尼Langevin方法,我们得到了欠阻尼Langevin Monte Carlo改进的迭代复杂度。
摘要:Underdamped Langevin dynamics (ULD) is a widely-used sampler for Gibbs distributions $\pi \propto e^{-V}$, and is often empirically effective in high dimensions. However, existing non-asymptotic convergence guarantees for discretized ULD typically scale polynomially with the ambient dimension $d$, leading to vacuous bounds when $d$ is large. The main known dimension-free result concerns the randomized midpoint discretization in Wasserstein-2 distance (Liu et al., 2023), while dimension-independent guarantees for ULD discretizations in KL divergence have remained open. We close this gap by proving the first dimension-free KL divergence bounds for discretized ULD. Our analysis refines the KL local error framework (Altschuler et al., 2025) to a dimension-free setting and yields bounds that depend on $\mathrm{tr}(\mathbf{H})$, where $\mathbf{H}$ upper bounds the Hessian of $V$, rather than on $d$. As a consequence, we obtain improved iteration complexity for underdamped Langevin Monte Carlo relative to overdamped Langevin methods in regimes where $\mathrm{tr}(\mathbf{H})\ll d$.
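作为背景,下面给出欠阻尼朗之万动力学最朴素的Euler-Maruyama离散示意(势函数、步长与摩擦系数均为演示性假设;论文分析的离散化比这一朴素格式更精细):

```python
# 欠阻尼Langevin动力学的朴素Euler-Maruyama离散示意:
# dx = v dt,  dv = -(gamma*v + grad V(x)) dt + sqrt(2*gamma) dW。
import numpy as np

def grad_V(x):          # 示例势:标准高斯, V(x) = |x|^2 / 2
    return x

d, gamma, h, n_steps = 10, 2.0, 0.01, 5000
rng = np.random.default_rng(0)
x, v = np.zeros(d), np.zeros(d)
for _ in range(n_steps):
    v = v - h * (gamma * v + grad_V(x)) + np.sqrt(2 * gamma * h) * rng.normal(size=d)
    x = x + h * v      # 迭代足够久后, x 近似服从 pi ∝ exp(-V)
```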


【4】Learning Optimal Search Strategies
标题:学习最佳搜索策略
链接:https://arxiv.org/abs/2603.02356

作者:Stefan Ankirchner,Maximilian Philipp Thiel
摘要:我们以一个停车问题为例探讨如何学习最优搜索策略,其中停车机会按照一个未知的非齐次泊松过程到达。最优策略是由无差异位置刻画的阈值型停止规则。我们提出一种算法,通过估计累积跳跃强度(而非强度函数本身)来学习该阈值。我们证明,该算法在一大类环境上一致地实现对数级的遗憾增长。此外,我们证明了一个对数的极小极大遗憾下界,从而确立了所提方法在增长阶意义上的最优性。
摘要:We explore the question of how to learn an optimal search strategy within the example of a parking problem where parking opportunities arrive according to an unknown inhomogeneous Poisson process. The optimal policy is a threshold-type stopping rule characterized by an indifference position. We propose an algorithm that learns this threshold by estimating the integrated jump intensity rather than the intensity function itself. We show that our algorithm achieves a logarithmic regret growth, uniformly over a broad class of environments. Moreover, we prove a logarithmic minimax regret lower bound, establishing the growth optimality of the proposed approach.


【5】Shape Derivative-Informed Neural Operators with Application to Risk-Averse Shape Optimization
标题:形状导数信息神经算子及其在风险规避形状优化中的应用
链接:https://arxiv.org/abs/2603.03211

作者:Xindi Gong,Dingcheng Luo,Thomas O'Leary-Roseberry,Ruanui Nicholson,Omar Ghattas
摘要:由于在许多不确定性实现和不同的几何形状中重复进行基于采样的风险评估的成本很高,因此经典的基于偏微分方程的方法在不确定性下的形状优化(OUU)是计算密集型的,而标准的神经代理通常无法提供准确和有效的优化灵敏度。我们介绍Shape-DINO,这是一个基于导数的神经运算符框架,用于学习不同几何形状家族的PDE解运算符,特别关注加速PDE约束的形状OUU。Shape-DINO通过到固定参考域的同构映射对几何可变性进行编码,并采用一个基于导数的算子学习目标,该目标联合学习PDE解及其相对于设计变量和不确定参数的Fréchet导数,从而实现大规模OUU的准确状态预测和可靠梯度。我们建立了一个先验误差界连接代理精度优化误差,并证明了通用的逼近结果,多输入减少基神经操作员在适当的C^1 $规范。我们展示了三个代表性的形状OUU问题,包括边界设计的泊松方程和形状设计的稳态Navier-Stokes外部流在二维和三维的效率和可扩展性。在这些示例中,Shape-DINO比在没有衍生信息的情况下训练的运算符代理产生更可靠的优化结果。在我们的示例中,Shape-DINO在状态和梯度评估方面实现了3-8个数量级的加速。计算训练数据生成,与针对单个OUU问题的严格基于PDE的方法相比,Shape-DINO将必要的PDE求解减少了1-2个数量级。此外,Shape-DINO的建造成本可以在许多目标和风险措施中分摊,从而为复杂系统提供大规模的形状OUU。
摘要:Shape optimization under uncertainty (OUU) is computationally intensive for classical PDE-based methods due to the high cost of repeated sampling-based risk evaluation across many uncertainty realizations and varying geometries, while standard neural surrogates often fail to provide accurate and efficient sensitivities for optimization. We introduce Shape-DINO, a derivative-informed neural operator framework for learning PDE solution operators on families of varying geometries, with a particular focus on accelerating PDE-constrained shape OUU. Shape-DINOs encode geometric variability through diffeomorphic mappings to a fixed reference domain and employ a derivative-informed operator learning objective that jointly learns the PDE solution and its Fréchet derivatives with respect to design variables and uncertain parameters, enabling accurate state predictions and reliable gradients for large-scale OUU. We establish a priori error bounds linking surrogate accuracy to optimization error and prove universal approximation results for multi-input reduced basis neural operators in suitable $C^1$ norms. We demonstrate efficiency and scalability on three representative shape OUU problems, including boundary design for a Poisson equation and shape design governed by steady-state Navier-Stokes exterior flows in two and three dimensions. Across these examples, Shape-DINOs produce more reliable optimization results than operator surrogates trained without derivative information. In our examples, Shape-DINOs achieve 3-8 orders-of-magnitude speedups in state and gradient evaluations. Counting training data generation, Shape-DINOs reduce necessary PDE solves by 1-2 orders-of-magnitude compared to a strictly PDE-based approach for a single OUU problem. Moreover, Shape-DINO construction costs can be amortized across many objectives and risk measures, enabling large-scale shape OUU for complex systems.


【6】Fisher-Geometric Diffusion in Stochastic Gradient Descent: Optimal Rates, Oracle Complexity, and Information-Theoretic Limits
标题:随机梯度下降中的Fisher几何扩散:最优速率、预言复杂性和信息论极限
链接:https://arxiv.org/abs/2603.02417

作者:Daniel Zantedeschi,Kumar Muthuraman
摘要:我们发展了随机梯度下降(SGD)的一种Fisher几何理论,其中小批量噪声是一个内生的、由损失函数诱导的矩阵,而非外生的标量方差。在可交换采样下,小批量梯度协方差(在主导阶上)由单样本梯度的投影协方差所确定:对设定正确的似然损失,它等于投影Fisher信息;对一般的M估计损失,则等于投影Godambe(三明治)矩阵。这一识别迫使扩散近似具有Fisher/Godambe结构的波动率(有效温度$\tau = \eta/b$),并导出一个Ornstein-Uhlenbeck线性化,其平稳协方差由一个Fisher-Lyapunov方程以封闭形式给出。基于这一几何结构,我们证明了在总预言机预算$N$下,Fisher/Godambe风险具有阶为$\Theta(1/N)$的相匹配的极小极大上界与下界;下界在鞅预言机条件(可预测二次变差有界)下成立,严格涵盖独立同分布采样与可交换采样。这些结果给出了Fisher对偶范数下$\epsilon$-平稳性的预言机复杂度保证,其依赖于一个内在有效维度和一个Fisher/Godambe条件数,而非环境维度或欧几里得条件数。实验证实了Lyapunov预测,并表明标量温度匹配无法重现方向性的噪声结构。
摘要:We develop a Fisher-geometric theory of stochastic gradient descent (SGD) in which mini-batch noise is an intrinsic, loss-induced matrix -- not an exogenous scalar variance. Under exchangeable sampling, the mini-batch gradient covariance is pinned down (to leading order) by the projected covariance of per-sample gradients: it equals projected Fisher information for well-specified likelihood losses and the projected Godambe (sandwich) matrix for general M-estimation losses. This identification forces a diffusion approximation with Fisher/Godambe-structured volatility (effective temperature tau = eta/b) and yields an Ornstein-Uhlenbeck linearization whose stationary covariance is given in closed form by a Fisher-Lyapunov equation. Building on this geometry, we prove matching minimax upper and lower bounds of order Theta(1/N) for Fisher/Godambe risk under a total oracle budget N; the lower bound holds under a martingale oracle condition (bounded predictable quadratic variation), strictly subsuming i.i.d. and exchangeable sampling. These results imply oracle-complexity guarantees for epsilon-stationarity in the Fisher dual norm that depend on an intrinsic effective dimension and a Fisher/Godambe condition number rather than ambient dimension or Euclidean conditioning. Experiments confirm the Lyapunov predictions and show that scalar temperature matching cannot reproduce directional noise structure.
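按照我们对摘要的解读,其中的Ornstein-Uhlenbeck线性化及其平稳协方差方程可以写成如下形式(记号$H$、$\Sigma$、$S$为本条注记所加,仅作示意):在极小点$\theta^\star$附近,

\[ d\theta_t = -H\,(\theta_t - \theta^\star)\,dt + \sqrt{\tau}\,\Sigma^{1/2}\,dW_t, \qquad \tau = \eta/b, \]

其中$\Sigma$为投影Fisher/Godambe协方差;对应的平稳协方差$S$满足Fisher-Lyapunov方程

\[ H S + S H^{\top} = \tau\,\Sigma. \]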


预测|估计(9篇)

【1】Safe and Robust Domains of Attraction for Discrete-Time Systems: A Set-Based Characterization and Certifiable Neural Network Estimation
标题:离散时间系统的安全稳健吸引域:基于集的特征和可认证的神经网络估计
链接:https://arxiv.org/abs/2603.03082

作者:Mohamed Serry,Maxwell Fitzsimmons,Jun Liu
摘要:分析具有吸引性鲁棒不变集(RIS)的非线性系统需要估计其吸引域(DOA)。尽管研究广泛,但由于理论与计算上的限制,特别是在存在不确定性和状态约束时,准确刻画一般非线性系统的DOA仍然具有挑战性。本文提出一个新框架,用于准确估计离散时间非线性不确定系统的安全(状态受约束)且鲁棒的DOA,所考虑的系统具有连续动态、开的安全集、紧的扰动集,以及一致局部$\ell_p$-稳定的紧RIS。一致$\ell_p$稳定性的概念相当一般,作为特例涵盖一致指数稳定性和多项式稳定性。DOA通过新引入的、定义在紧集度量空间上的值函数来刻画。我们建立了这些值函数的基本数学性质,并推导出相应的Bellman型(Zubov型)泛函方程。在此刻画的基础上,我们开发了一个物理信息神经网络(PINN)框架,将导出的Bellman型方程直接嵌入训练过程来学习相应的值函数。为了从学到的神经近似中获得安全鲁棒DOA的可认证估计,我们进一步引入一个利用现有形式化验证工具的验证流程。通过四个涉及状态约束的非线性不确定系统的数值例子,我们展示了所提方法的有效性和适用性,并与文献中的现有方法进行了性能比较。
摘要:Analyzing nonlinear systems with attracting robust invariant sets (RISs) requires estimating their domains of attraction (DOAs). Despite extensive research, accurately characterizing DOAs for general nonlinear systems remains challenging due to both theoretical and computational limitations, particularly in the presence of uncertainties and state constraints. In this paper, we propose a novel framework for the accurate estimation of safe (state-constrained) and robust DOAs for discrete-time nonlinear uncertain systems with continuous dynamics, open safe sets, compact disturbance sets, and uniformly locally $\ell_p$-stable compact RISs. The notion of uniform $\ell_p$ stability is quite general and encompasses, as special cases, uniform exponential and polynomial stability. The DOAs are characterized via newly introduced value functions defined on metric spaces of compact sets. We establish their fundamental mathematical properties and derive the associated Bellman-type (Zubov-type) functional equations. Building on this characterization, we develop a physics-informed neural network (NN) framework to learn the corresponding value functions by embedding the derived Bellman-type equations directly into the training process. To obtain certifiable estimates of the safe robust DOAs from the learned neural approximations, we further introduce a verification procedure that leverages existing formal verification tools. The effectiveness and applicability of the proposed methodology are demonstrated through four numerical examples involving nonlinear uncertain systems subject to state constraints, and its performance is compared with existing methods from the literature.


【2】Towards Accurate and Interpretable Time-series Forecasting: A Polynomial Learning Approach
标题:实现准确且可解释的时间序列预测:一种多项式学习方法
链接:https://arxiv.org/abs/2603.02906

作者:Bo Liu,Shao-Bo Lin,Changmiao Wang,Xiaotong Liu
摘要:Time series forecasting enables early warning and has driven asset performance management from traditional planned maintenance to predictive maintenance. However, the lack of interpretability in forecasting methods undermines users' trust and complicates debugging for developers. Consequently, interpretable time-series forecasting has attracted increasing research attention. Nevertheless, existing methods suffer from several limitations, including insufficient modeling of temporal dependencies, lack of feature-level interpretability to support early warning, and difficulty in simultaneously achieving accuracy and interpretability. This paper proposes the interpretable polynomial learning (IPL) method, which integrates interpretability into the model structure by explicitly modeling original features and their interactions of arbitrary order through polynomial representations. This design preserves temporal dependencies, provides feature-level interpretability, and offers a flexible trade-off between prediction accuracy and interpretability by adjusting the polynomial degree. We evaluate IPL on simulated and Bitcoin price data, showing that it achieves high prediction accuracy with superior interpretability compared with widely used explainability methods. Experiments on field-collected antenna data further demonstrate that IPL yields simpler and more efficient early warning mechanisms.
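摘要中"通过多项式表示显式建模原始特征及其任意阶交互"的思路,可用下面的假设性示意理解(数据为模拟,PolynomialFeatures来自scikit-learn,并非论文的IPL实现):

```python
# 多项式特征 + 线性模型的可解释性示意(非IPL官方实现):
# 将滞后特征扩展为给定阶数的交互项,拟合后的系数即特征级解释。
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                    # 例如滞后输入 x_{t-1..t-3}
y = 0.5 * X[:, 0] + 0.3 * X[:, 1] * X[:, 2] + 0.05 * rng.normal(size=200)

poly = PolynomialFeatures(degree=2, include_bias=False)
Xp = poly.fit_transform(X)                       # 原始特征及两两交互项
model = Ridge(alpha=1e-2).fit(Xp, y)

for name, coef in zip(poly.get_feature_names_out(), model.coef_):
    print(f"{name}: {coef:+.3f}")                # 阶数可调:精度-可解释性折中
```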


【3】Distributed Dynamic Invariant Causal Prediction in Environmental Time Series
标题:环境时间序列的分布式动态不变因果预测
链接:https://arxiv.org/abs/2603.02902

作者:Ziruo Hao,Tao Yang,Xiaofeng Wu,Bo Hu
摘要:The extraction of invariant causal relationships from time series data with environmental attributes is critical for robust decision-making in domains such as climate science and environmental monitoring. However, existing methods either emphasize dynamic causal analysis without leveraging environmental contexts or focus on static invariant causal inference, leaving a gap in distributed temporal settings. In this paper, we propose Distributed Dynamic Invariant Causal Prediction in Time-series (DisDy-ICPT), a novel framework that learns dynamic causal relationships over time while mitigating spatial confounding variables without requiring data communication. We theoretically prove that DisDy-ICPT recovers stable causal predictors within a bounded number of communication rounds under standard sampling assumptions. Empirical evaluations on synthetic benchmarks and environment-segmented real-world datasets show that DisDy-ICPT achieves superior predictive stability and accuracy compared to baseline methods A and B. Our approach offers promising applications in carbon monitoring and weather forecasting. Future work will extend DisDy-ICPT to online learning scenarios.


【4】Next Embedding Prediction Makes World Models Stronger
标题:下一个嵌入预测让世界模型更强大
链接:https://arxiv.org/abs/2603.02765

作者:George Bredis,Nikita Balagansky,Daniil Gavrilov,Ruslan Rakhimov
摘要:Capturing temporal dependencies is critical for model-based reinforcement learning (MBRL) in partially observable, high-dimensional domains. We introduce NE-Dreamer, a decoder-free MBRL agent that leverages a temporal transformer to predict next-step encoder embeddings from latent state sequences, directly optimizing temporal predictive alignment in representation space. This approach enables NE-Dreamer to learn coherent, predictive state representations without reconstruction losses or auxiliary supervision. On the DeepMind Control Suite, NE-Dreamer matches or exceeds the performance of DreamerV3 and leading decoder-free agents. On a challenging subset of DMLab tasks involving memory and spatial reasoning, NE-Dreamer achieves substantial gains. These results establish next-embedding prediction with temporal transformers as an effective, scalable framework for MBRL in complex, partially observable environments.


【5】Learning-Augmented Moment Estimation on Time-Decay Models
标题:时间衰减模型的学习增广矩估计
链接:https://arxiv.org/abs/2603.02488

作者:Soham Nagawanshi,Shalini Panthangi,Chen Wang,David P. Woodruff,Samson Zhou
摘要:Motivated by the prevalence and success of machine learning, a line of recent work has studied learning-augmented algorithms in the streaming model. These results have shown that for natural and practical oracles implemented with machine learning models, we can obtain streaming algorithms with improved space efficiency that are otherwise provably impossible. On the other hand, our understanding is much more limited when items are weighted unequally, for example, in the sliding-window model, where older data must be expunged from the dataset, e.g., by privacy regulation laws. In this paper, we utilize an oracle for the heavy-hitters of datasets to give learning-augmented algorithms for a number of fundamental problems, such as norm/moment estimation, frequency estimation, cascaded norms, and rectangular moment estimation, in the time-decay setting. We complement our theoretical results with a number of empirical evaluations that demonstrate the practical efficiency of our algorithms on real and synthetic datasets.


【6】High-order Knowledge Based Network Controllability Robustness Prediction: A Hypergraph Neural Network Approach
标题:基于高阶知识的网络可控性鲁棒性预测:超图神经网络方法
链接:https://arxiv.org/abs/2603.02265

作者:Shibing Mo,Jiarui Zhang,Jiayu Xie,Xiangyi Teng,Jing Liu
摘要:In order to evaluate the invulnerability of networks against various types of attacks and provide guidance for potential performance enhancement as well as controllability maintenance, network controllability robustness (NCR) has attracted increasing attention in recent years. Traditionally, controllability robustness is determined by attack simulations, which are computationally time-consuming and only applicable to small-scale networks. Although some machine learning-based methods for predicting network controllability robustness have been proposed, they mainly focus on pairwise interactions in complex networks, and the underlying relationships between high-order structural information and controllability robustness have not been explored. In this paper, a dual hypergraph attention neural network model based on high-order knowledge (NCR-HoK) is proposed to accomplish robustness learning and controllability robustness curve prediction. Through a node feature encoder, hypergraph construction with high-order relations, and a dedicated dual hypergraph attention module, the proposed method can effectively learn three types of network information simultaneously: explicit structural information in the original graph, high-order connection information in local neighborhoods, and hidden features in the embedding space. Notably, we explore for the first time the impact of high-order knowledge on network controllability robustness. Compared with state-of-the-art methods for network robustness learning, the proposed method achieves superior performance on both synthetic and real-world networks with low computational overhead.


【7】Characterizing and Predicting Wildfire Evacuation Behavior: A Dual-Stage ML Approach
标题 :描述和预测野火疏散行为:两阶段ML方法
链接:https://arxiv.org/abs/2603.02223

作者:Sazzad Bin Bashar Polock,Anandi Dutta,Subasish Das
备注:This is the author's preprint version of a paper accepted for presentation at SoutheastConn 2026. The final published version will appear in the official conference proceedings. Conference site: https://ieeesoutheastcon.org/
摘要:Wildfire evacuation behavior is highly variable and influenced by complex interactions among household resources, preparedness, and situational cues. Using a large-scale MTurk survey of residents in California, Colorado, and Oregon, this study integrates unsupervised and supervised machine learning methods to uncover latent behavioral typologies and predict key evacuation outcomes. Multiple Correspondence Analysis, K-Modes clustering, and Latent Class Analysis reveal consistent subgroups differentiated by vehicle access, disaster planning, technological resources, pet ownership, and residential stability. Complementary supervised models show that transportation mode can be predicted with high reliability from household characteristics, whereas evacuation timing remains difficult to classify due to its dependence on dynamic, real-time fire conditions. These findings advance data-driven understanding of wildfire evacuation behavior and demonstrate how machine learning can support targeted preparedness strategies, resource allocation, and equitable emergency planning.


【8】Forecasting as Rendering: A 2D Gaussian Splatting Framework for Time Series Forecasting
标题:预测即渲染:一个用于时间序列预测的2D高斯溅射框架
链接:https://arxiv.org/abs/2603.02220

作者:Yixin Wang,Yifan Hu,Peiyuan Liu,Naiqi Li,Dai Tao,Shu-Tao Xia
摘要:Time series forecasting (TSF) remains a challenging problem due to the intricate entanglement of intraperiod-fluctuations and interperiod-trends. While recent advances have attempted to reshape 1D sequences into 2D period-phase representations, they suffer from two principal limitations. Firstly, treating reshaped tensors as static images results in a topological mismatch, as standard spatial operators sever chronological continuity at grid boundaries. Secondly, relying on uniform fixed-size representations allocates modeling capacity inefficiently and fails to provide the adaptive resolution required for compressible, non-stationary temporal patterns. To address these limitations, we introduce TimeGS, a novel framework that fundamentally shifts the forecasting paradigm from regression to 2D generative rendering. By reconceptualizing the future sequence as a continuous latent surface, TimeGS utilizes the inherent anisotropy of Gaussian kernels to adaptively model complex variations with flexible geometric alignment. To realize this, we introduce a Multi-Basis Gaussian Kernel Generation (MB-GKG) block that synthesizes kernels from a fixed dictionary to stabilize optimization, and a Multi-Period Chronologically Continuous Rasterization (MP-CCR) block that enforces strict temporal continuity across periodic boundaries. Comprehensive experiments on standard benchmark datasets demonstrate that TimeGS attains state-of-the-art performance.


【9】Neural Demand Estimation with Habit Formation and Rationality Constraints
标题:具有习惯形成和理性约束的神经需求估计
链接:https://arxiv.org/abs/2603.02331

作者:Marta Grzeskiewicz
摘要:We develop a flexible neural demand system for continuous budget allocation that estimates budget shares on the simplex by minimizing KL divergence. Shares are produced via a softmax of a state-dependent preference scorer and disciplined with regularity penalties (monotonicity, Slutsky symmetry) to support coherent comparative statics and welfare without imposing a parametric utility form. State dependence enters through a habit stock defined as an exponentially weighted moving average of past consumption. Simulations recover elasticities and welfare accurately and show sizable gains when habit formation is present. In our empirical application using Dominick's analgesics data, adding habit reduces out-of-sample error by c.33%, reshapes substitution patterns, and increases CV losses from a 10% ibuprofen price rise by about 15-16% relative to a static model. The code is available at https://github.com/martagrz/neural_demand_habit .
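摘要的核心机制(softmax给出单纯形上的预算份额、习惯存量为过去消费的指数加权滑动平均、以KL散度为目标)可以用如下假设性示意说明(所有参数与数据均为占位,非作者代码):

```python
# 核心机制示意:EWMA习惯存量 -> 状态依赖打分 -> softmax预算份额
# -> 与观测份额的KL散度(占位数据,仅作说明)。
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

def kl(p, q, eps=1e-12):
    return (p * np.log((p + eps) / (q + eps))).sum()

n_goods, rho = 3, 0.8                        # rho: 习惯持续性(占位)
habit = np.zeros(n_goods)
W = 0.1 * np.random.randn(n_goods, n_goods)  # 作用于习惯状态的打分权重(占位)

history = np.abs(np.random.randn(20, n_goods))
for c in history:
    habit = rho * habit + (1 - rho) * c      # 过去消费的指数加权滑动平均

shares = softmax(W @ habit)                  # 单纯形上的预算份额
loss = kl(np.array([0.5, 0.3, 0.2]), shares) # 对观测份额最小化KL散度
```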


其他神经网络|深度学习|模型|建模(35篇)

【1】Inverse Reconstruction of Shock Time Series from Shock Response Spectrum Curves using Machine Learning
标题:使用机器学习从冲击响应谱曲线逆重建冲击时间序列
链接:https://arxiv.org/abs/2603.03229

作者:Adam Watts,Andrew Jeon,Destry Newton,Ryan Bowering
备注:Extended journal-style manuscript. 27 pages, 13 figures
摘要:The shock response spectrum (SRS) is widely used to characterize the response of single-degree-of-freedom (SDOF) systems to transient accelerations. Because the mapping from acceleration time history to SRS is nonlinear and many-to-one, reconstructing time-domain signals from a target spectrum is inherently ill-posed. Conventional approaches address this problem through iterative optimization, typically representing signals as sums of exponentially decayed sinusoids, but these methods are computationally expensive and constrained by predefined basis functions.   We propose a conditional variational autoencoder (CVAE) that learns a data-driven inverse mapping from SRS to acceleration time series. Once trained, the model generates signals consistent with prescribed target spectra without requiring iterative optimization. Experiments demonstrate improved spectral fidelity relative to classical techniques, strong generalization to unseen spectra, and inference speeds three to six orders of magnitude faster. These results establish deep generative modeling as a scalable and efficient approach for inverse SRS reconstruction.


【2】Coalgebras for categorical deep learning: Representability and universal approximation
标题:范畴深度学习的余代数:可表示性与万能逼近
链接:https://arxiv.org/abs/2603.03227

作者:Dragan Mašulović
摘要:Categorical deep learning (CDL) has recently emerged as a framework that leverages category theory to unify diverse neural architectures. While geometric deep learning (GDL) is grounded in the specific context of invariants of group actions, CDL aims to provide domain-independent abstractions for reasoning about models and their properties. In this paper, we contribute to this program by developing a coalgebraic foundation for equivariant representation in deep learning, as classical notions of group actions and equivariant maps are naturally generalized by the coalgebraic formalism. Our first main result demonstrates that, given an embedding of data sets formalized as a functor from SET to VECT, and given a notion of invariant behavior on data sets modeled by an endofunctor on SET, there is a corresponding endofunctor on VECT that is compatible with the embedding in the sense that this lifted functor recovers the analogous notion of invariant behavior on the embedded data. Building on this foundation, we then establish a universal approximation theorem for equivariant maps in this generalized setting. We show that continuous equivariant functions can be approximated within our coalgebraic framework for a broad class of symmetries. This work thus provides a categorical bridge between the abstract specification of invariant behavior and its concrete realization in neural architectures.


【3】cPNN: Continuous Progressive Neural Networks for Evolving Streaming Time Series
标题:cPNN:用于演变流媒体时间序列的连续渐进神经网络
链接:https://arxiv.org/abs/2603.03040

作者:Federico Giannini,Giacomo Ziffer,Emanuele Della Valle
摘要:Dealing with an unbounded data stream involves overcoming the assumption that data is identically distributed and independent. A data stream can, in fact, exhibit temporal dependencies (i.e., be a time series), and data can change distribution over time (concept drift). The two problems are deeply discussed, and existing solutions address them separately: a joint solution is absent. In addition, learning multiple concepts implies remembering the past (a.k.a. avoiding catastrophic forgetting in Neural Networks' terminology). This work proposes Continuous Progressive Neural Networks (cPNN), a solution that tames concept drifts, handles temporal dependencies, and bypasses catastrophic forgetting. cPNN is a continuous version of Progressive Neural Networks, a methodology for remembering old concepts and transferring past knowledge to fit the new concepts quickly. We base our method on Recurrent Neural Networks and exploit the Stochastic Gradient Descent applied to data streams with temporal dependencies. Results of an ablation study show a quick adaptation of cPNN to new concepts and robustness to drifts.


【4】SEHFS: Structural Entropy-Guided High-Order Correlation Learning for Multi-View Multi-Label Feature Selection
标题:SEHFS:用于多视图多标签特征选择的结构熵引导高阶相关学习
链接:https://arxiv.org/abs/2603.03022

作者:Cheng Peng,Yonghao Li,Wanfu Gao,Jie Wen,Weiping Ding
摘要:In recent years, multi-view multi-label learning (MVML) has attracted extensive attention due to its close alignment to real-world scenarios. Information-theoretic methods have gained prominence for learning nonlinear correlations. However, two key challenges persist: first, features in real-world data commonly exhibit high-order structural correlations, but existing information-theoretic methods struggle to learn such correlations; second, commonly relying on heuristic optimization, information-theoretic methods are prone to converging to local optima. To address these two challenges, we propose a novel method called Structural Entropy Guided High-Order Correlation Learning for Multi-View Multi-Label Feature Selection (SEHFS). The core idea of SEHFS is to convert the feature graph into a structural-entropy-minimizing encoding tree, quantifying the information cost of high-order dependencies and thus learning high-order feature correlations beyond pairwise correlations. Specifically, features exhibiting strong high-order redundancy are grouped into a single cluster within the encoding tree, while inter-cluster feature correlations are minimized, thereby eliminating redundancy both within and across clusters. Furthermore, a new framework based on the fusion of information theory and matrix methods is adopted, which learns a shared semantic matrix and view-specific contribution matrices to reconstruct a global view matrix, thereby enhancing the information-theoretic method and balancing the global and local optimization. The ability of structural entropy to learn high-order correlations is theoretically established, and both experiments on eight datasets from various domains and ablation studies demonstrate that SEHFS achieves superior performance in feature selection.


【5】On the Topology of Neural Network Superlevel Sets
标题:论神经网络超水平集的拓扑
链接:https://arxiv.org/abs/2603.02973

作者:Bahman Gharesifard
摘要:We show that neural networks with activations satisfying a Riccati-type ordinary differential equation condition, an assumption arising in recent universal approximation results in the uniform topology, produce Pfaffian outputs on analytic domains with format controlled only by the architecture. Consequently, superlevel sets, as well as Lie bracket rank drop loci for neural network parameterized vector fields, admit architecture-only bounds on topological complexity, in particular on total Betti numbers, uniformly over all weights.


【6】Integrating Homomorphic Encryption and Synthetic Data in FL for Privacy and Learning Quality
标题:在FL中集成同态加密和合成数据以提高隐私和学习质量
链接:https://arxiv.org/abs/2603.02969

作者:Yenan Wang,Carla Fabiana Chiasserini,Elad Michael Schiller
摘要:Federated learning (FL) enables collaborative training of machine learning models without sharing sensitive client data, making it a cornerstone for privacy-critical applications. However, FL faces the dual challenge of ensuring learning quality and robust privacy protection while keeping resource consumption low, particularly when using computationally expensive techniques such as homomorphic encryption (HE). In this work, we enhance an FL process that preserves privacy using HE by integrating it with synthetic data generation and an interleaving strategy. Specifically, our solution, named Alternating Federated Learning (Alt-FL), consists of alternating between local training with authentic data (authentic rounds) and local training with synthetic data (synthetic rounds) and transferring the encrypted and plaintext model parameters on authentic and synthetic rounds (resp.). Our approach improves learning quality (e.g., model accuracy) through datasets enhanced with synthetic data, preserves client data privacy via HE, and keeps manageable encryption and decryption costs through our interleaving strategy. We evaluate our solution against data leakage attacks, such as the DLG attack, demonstrating robust privacy protection. Also, Alt-FL provides 13.4% higher model accuracy and decreases HE-related costs by up to 48% with respect to Selective HE.
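摘要中真实轮(交换HE密文)与合成轮(交换明文)交替的调度,可用如下示意把握其结构(HE此处用恒等占位函数代替,本地训练也是玩具化更新;并非论文实现):

```python
# Alt-FL交替调度的结构示意(HE以恒等函数占位,训练为玩具更新)。
import numpy as np

def he_encrypt(w): return w        # 占位:真实系统中为同态加密
def he_decrypt(w): return w        # 占位:真实系统中为解密

def local_train(w, data):
    return w - 0.1 * (w - data.mean())   # 玩具"本地训练":向数据均值靠拢

authentic = [np.random.randn(100) + i for i in range(4)]        # 各客户端真实数据
synthetic = [d + 0.1 * np.random.randn(100) for d in authentic] # 合成数据占位

w = 0.0
for r in range(10):
    is_authentic = (r % 2 == 0)                    # 真实轮与合成轮交替
    data_sets = authentic if is_authentic else synthetic
    updates = [local_train(w, d) for d in data_sets]
    if is_authentic:
        updates = [he_encrypt(u) for u in updates] # 真实轮传输加密参数
    w = float(np.mean(updates))                    # HE允许在密文上直接聚合
    if is_authentic:
        w = he_decrypt(w)
```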


【7】Enhancing Physics-Informed Neural Networks with Domain-aware Fourier Features: Towards Improved Performance and Interpretable Results
标题:利用域感知傅里叶特征增强物理信息神经网络:提高性能和可解释结果
链接:https://arxiv.org/abs/2603.02948

作者:Alberto Miño Calero,Luis Salamanca,Konstantinos E. Tatsis
摘要:Physics-Informed Neural Networks (PINNs) incorporate physics into neural networks by embedding partial differential equations (PDEs) into their loss function. Despite their success in learning the underlying physics, PINN models remain difficult to train and interpret. In this work, a novel modeling approach is proposed, which relies on the use of Domain-aware Fourier Features (DaFFs) for the positional encoding of the input space. These features encapsulate all the domain-specific characteristics, such as the geometry and boundary conditions, and unlike Random Fourier Features (RFFs), eliminate the need for explicit boundary condition loss terms and loss balancing schemes, while simplifying the optimization process and reducing the computational cost associated with training. We further develop an LRP-based explainability framework tailored to PINNs, enabling the extraction of relevance attribution scores for the input space. It is demonstrated that PINN-DaFFs achieve orders-of-magnitude lower errors and allow faster convergence compared to vanilla PINNs and RFFs-based PINNs. Furthermore, LRP analysis reveals that the proposed approach leads to more physically consistent feature attributions, while PINN-RFFs and vanilla PINNs display more scattered and less physics-relevant patterns. These results demonstrate that DaFFs not only enhance PINNs' accuracy and efficiency but also improve interpretability, laying the ground for more robust and informative physics-informed learning.
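作为参照,下面给出傅里叶特征位置编码的标准(随机)形式,并附上一种"域感知"频率选择思路的演示(后者为我们的演示性假设,并非论文中DaFFs的具体构造):

```python
# 傅里叶特征位置编码示意:标准RFF用随机高斯频率矩阵B;
# 域感知的做法可改为由几何/边界条件决定的频率(下例为演示性假设)。
import numpy as np

def fourier_features(x, B):
    proj = 2.0 * np.pi * x @ B              # x: (n, d_in), B: (d_in, m)
    return np.concatenate([np.sin(proj), np.cos(proj)], axis=-1)

x = np.random.rand(128, 2)                  # [0,1]^2 中的时空采样点
B_rff = np.random.randn(2, 64)              # RFF:随机高斯频率
feats = fourier_features(x, B_rff)          # (128, 128) 编码后的输入

# 一种"域感知"的演示性选择:取半整数频率使sin分量在[0,1]边界处为零,
# 即 sin(2*pi*x*k/2) = sin(pi*k*x),从而把齐次边界条件内建进编码。
k = np.arange(1, 9)
B_domain = np.stack(np.meshgrid(k, k), axis=-1).reshape(-1, 2).T / 2.0  # (2, 64)
```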


【8】Embedding interpretable $\ell_1$-regression into neural networks for uncovering temporal structure in cell imaging
标题:将可解释的$\ell_1$回归嵌入神经网络以揭示细胞成像中的时间结构
链接:https://arxiv.org/abs/2603.02899

作者:Fabian Kabus,Maren Hackenberg,Julia Hindel,Thibault Cholvin,Antje Kilias,Thomas Brox,Abhinav Valada,Marlene Bartos,Harald Binder
摘要:While artificial neural networks excel in unsupervised learning of non-sparse structure, classical statistical regression techniques offer better interpretability, in particular when sparseness is enforced by $\ell_1$ regularization, enabling identification of which factors drive observed dynamics. We investigate how these two types of approaches can be optimally combined, exemplarily considering two-photon calcium imaging data where sparse autoregressive dynamics are to be extracted. We propose embedding a vector autoregressive (VAR) model as an interpretable regression technique into a convolutional autoencoder, which provides dimension reduction for tractable temporal modeling. A skip connection separately addresses non-sparse static spatial information, selectively channeling sparse structure into the $\ell_1$-regularized VAR. $\ell_1$-estimation of regression parameters is enabled by differentiating through the piecewise linear solution path. This is contrasted with approaches where the autoencoder does not adapt to the VAR model. Having an embedded statistical model also enables a testing approach for comparing temporal sequences from the same observational unit. Additionally, contribution maps visualize which spatial regions drive the learned dynamics.


【9】Learning in Markov Decision Processes with Exogenous Dynamics
标题:具有外生动力学的马尔科夫决策过程学习
链接:https://arxiv.org/abs/2603.02862

作者:Davide Maran,Davide Salaorni,Marcello Restelli
摘要:Reinforcement learning algorithms are typically designed for generic Markov Decision Processes (MDPs), where any state-action pair can lead to an arbitrary transition distribution. In many practical systems, however, only a subset of the state variables is directly influenced by the agent's actions, while the remaining components evolve according to exogenous dynamics and account for most of the stochasticity. In this work, we study a structured class of MDPs characterized by exogenous state components whose transitions are independent of the agent's actions. We show that exploiting this structure yields significantly improved learning guarantees, with only the size of the exogenous state space appearing in the leading terms of the regret bounds. We further establish a matching lower bound, showing that this dependence is information-theoretically optimal. Finally, we empirically validate our approach across classical toy settings and real-world-inspired environments, demonstrating substantial gains in sample efficiency compared to standard reinforcement learning methods.


【10】Learning Memory-Enhanced Improvement Heuristics for Flexible Job Shop Scheduling
标题:为柔性作业车间调度学习记忆增强的改进启发式方法
链接:https://arxiv.org/abs/2603.02846

作者:Jiaqi Wang,Zhiguang Cao,Peng Zhao,Rui Cao,Yubin Xiao,Yuan Jiang,You Zhou
备注:39th Conference on Neural Information Processing Systems (NeurIPS 2025)
摘要 :The rise of smart manufacturing under Industry 4.0 introduces mass customization and dynamic production, demanding more advanced and flexible scheduling techniques. The flexible job-shop scheduling problem (FJSP) has attracted significant attention due to its complex constraints and strong alignment with real-world production scenarios. Current deep reinforcement learning (DRL)-based approaches to FJSP predominantly employ constructive methods. While effective, they often fall short of reaching (near-)optimal solutions. In contrast, improvement-based methods iteratively explore the neighborhood of initial solutions and are more effective in approaching optimality. However, the flexible machine allocation in FJSP poses significant challenges to the application of this framework, including accurate state representation, effective policy learning, and efficient search strategies. To address these challenges, this paper proposes a Memory-enhanced Improvement Search framework with heterogeneous graph representation--MIStar. It employs a novel heterogeneous disjunctive graph that explicitly models the operation sequences on machines to accurately represent scheduling solutions. Moreover, a memoryenhanced heterogeneous graph neural network (MHGNN) is designed for feature extraction, leveraging historical trajectories to enhance the decision-making capability of the policy network. Finally, a parallel greedy search strategy is adopted to explore the solution space, enabling superior solutions with fewer iterations. Extensive experiments on synthetic data and public benchmarks demonstrate that MIStar significantly outperforms both traditional handcrafted improvement heuristics and state-of-the-art DRL-based constructive methods.


【11】Scale-invariant Gaussian derivative residual networks
标题:尺度不变的高斯导数残差网络
链接:https://arxiv.org/abs/2603.02843

作者:Andrzej Perzanowski,Tony Lindeberg
备注:39 pages, 23 figures, 5 tables
摘要:Generalisation across image scales remains a fundamental challenge for deep networks, which often fail to handle images at scales not seen during training (the out-of-distribution problem). In this paper, we present provably scale-invariant Gaussian derivative residual networks (GaussDerResNets), constructed out of scale-covariant Gaussian derivative residual blocks coupled in cascade, aimed at addressing this problem.   By adding residual skip connections to the previous notion of Gaussian derivative layers, deeper networks with substantially increased accuracy can be constructed, while preserving very good scale generalisation properties at the higher level of accuracy. Explicit proofs are provided regarding the underlying scale-covariant and scale-invariant properties in arbitrary dimensions.   To analyse the ability of GaussDerResNets to generalise to new scales, we apply them on the new rescaled version of the STL-10 dataset, where training is done at a single fixed scale and evaluation is performed on multiple copies of the test set, each rescaled to a single distinct spatial scale, with scale factors extending over a range of 4. We also conduct similar systematic experiments on the rescaled versions of Fashion-MNIST and CIFAR-10 datasets.   Experimentally, we demonstrate that the GaussDerResNets have strong scale generalisation and scale selection properties on all the three rescaled datasets. In our ablation studies, we investigate different architectural variants of GaussDerResNets, demonstrating that basing the architecture on depthwise-separable convolutions allows for decreasing both the number of parameters and the amount of computations, with reasonably maintained accuracy and scale generalisation.
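GaussDerResNet所基于的高斯导数响应可用scipy直观示意如下(尺度归一化因子$\sigma^{阶数}$是尺度协变构造中的常见做法;残差耦合与可学习部分此处略去,仅为示意,非论文实现):

```python
# 高斯导数特征组的示意:对图像做高斯平滑并取各阶导数,
# 乘以sigma^阶数做尺度归一化(残差块与训练部分从略)。
import numpy as np
from scipy.ndimage import gaussian_filter

def gaussian_derivative_bank(img, sigma):
    L   = gaussian_filter(img, sigma)                         # 平滑后的图像
    Lx  = sigma * gaussian_filter(img, sigma, order=(0, 1))   # 尺度归一化 d/dx
    Ly  = sigma * gaussian_filter(img, sigma, order=(1, 0))   # 尺度归一化 d/dy
    Lxx = sigma**2 * gaussian_filter(img, sigma, order=(0, 2))
    Lyy = sigma**2 * gaussian_filter(img, sigma, order=(2, 0))
    Lxy = sigma**2 * gaussian_filter(img, sigma, order=(1, 1))
    return np.stack([L, Lx, Ly, Lxx, Lxy, Lyy])

features = gaussian_derivative_bank(np.random.rand(64, 64), sigma=2.0)  # (6,64,64)
```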


【12】Adapting Time Series Foundation Models through Data Mixtures
标题:通过数据混合调整时间序列基础模型
链接:https://arxiv.org/abs/2603.02840

作者:Thomas L. Lee,Edoardo M. Ponti,Amos Storkey
备注:Preprint, 8 pages
摘要:Time series foundation models (TSFMs) have become increasingly popular for zero-shot forecasting. However, for a new time series domain not fully covered by the pretraining set, performance can suffer. Therefore, when a practitioner cares about a new domain and has access to a set of related datasets, the question arises: how best to fine-tune a TSFM to improve zero-shot forecasting? A typical approach to this type of problem is to fine-tune a LoRA module on all datasets or separately on each dataset. Tuning a separate module on each dataset allows for the specialisation of the TSFM to different types of data distribution, by selecting differing combinations of per-dataset modules for different time series contexts. However, we find that, using per-dataset modules might not be optimal, since a time series dataset can contain data from several types of distributions, i.e. sub-domains. This can be due to the distribution shifting or having differing distributions for different dimensions of the time series. Hence, we propose MixFT which re-divides the data using Bayesian mixtures into sets that best represent the sub-domains present in the data, and fine-tunes separately on each of these sets. This re-division of the data ensures that each set is more homogeneous, leading to fine-tuned modules focused on specific sub-domains. Our experiments show that MixFT performs better than per-dataset methods and when fine-tuning a single module on all the data. This suggests that by re-partitioning the data to represent sub-domains we can better specialise TSFMs to improve zero-shot forecasting.


【13】Toward Early Quality Assessment of Text-to-Image Diffusion Models
标题:迈向文本到图像扩散模型的早期质量评估
链接:https://arxiv.org/abs/2603.02829

作者:Huanlei Guo,Hongxin Wei,Bingyi Jing
摘要:Recent text-to-image (T2I) diffusion and flow-matching models can produce highly realistic images from natural language prompts. In practical scenarios, T2I systems are often run in a ``generate--then--select'' mode: many seeds are sampled and only a few images are kept for use. However, this pipeline is highly resource-intensive since each candidate requires tens to hundreds of denoising steps, and evaluation metrics such as CLIPScore and ImageReward are post-hoc. In this work, we address this inefficiency by introducing Probe-Select, a plug-in module that enables efficient evaluation of image quality within the generation process. We observe that certain intermediate denoiser activations, even at early timesteps, encode a stable coarse structure, object layout and spatial arrangement--that strongly correlates with final image fidelity. Probe-Select exploits this property by predicting final quality scores directly from early activations, allowing unpromising seeds to be terminated early. Across diffusion and flow-matching backbones, our experiments show that early evaluation at only 20\% of the trajectory accurately ranks candidate seeds and enables selective continuation. This strategy reduces sampling cost by over 60\% while improving the quality of the retained images, demonstrating that early structural signals can effectively guide selective generation without altering the underlying generative model. Code is available at https://github.com/Guhuary/ProbeSelect.


【14】Lattice-based Deep Neural Networks: Regularity and Tailored Regularization
标题:基于格规则的深度神经网络:正则性与定制正则化
链接:https://arxiv.org/abs/2603.02809

作者:Alexander Keller,Frances Y. Kuo,Dirk Nuyens,Ian H. Sloan
摘要:This survey article is concerned with the application of lattice rules to Deep Neural Networks (DNNs), lattice rules being a family of quasi-Monte Carlo methods. They have demonstrated effectiveness in various contexts for high-dimensional integration and function approximation. They are extremely easy to implement thanks to their very simple formulation -- all that is required is a good integer generating vector of length matching the dimensionality of the problem. In recent years there has been a burst of research activities on the application and theory of DNNs. We review our recent article on using lattice rules as training points for DNNs with a smooth activation function, where we obtained explicit regularity bounds of the DNNs. By imposing restrictions on the network parameters to match the regularity features of the target function, we prove that DNNs with tailored lattice training points can achieve good theoretical generalization error bounds, with implied constants independent of the input dimension. We also demonstrate numerically that DNNs trained with our tailored regularization perform significantly better than with standard $\ell_2$ regularization.
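摘要中"只需一个好的整数生成向量"的rank-1格规则,其训练点的构造非常简单,示意如下(生成向量的取值仅为演示性假设,并非论文给出的"好"向量):

```python
# rank-1格规则点集的构造示意:第i个点为 frac(i*z/n)。
import numpy as np

def lattice_points(n, z):
    i = np.arange(n)[:, None]
    return (i * z[None, :] / n) % 1.0     # n个位于[0,1)^d的点

z = np.array([1, 182667, 469891])         # 3维生成向量(演示性假设)
pts = lattice_points(4096, z)             # 可作为DNN的拟蒙特卡罗训练点
```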


【15】Enhancing User Throughput in Multi-panel mmWave Radio Access Networks for Beam-based MU-MIMO Using a DRL Method
标题:使用DRL方法提升多面板毫米波无线接入网络中基于波束的MU-MIMO用户吞吐量
链接:https://arxiv.org/abs/2603.02745

作者:Ramin Hashemi,Vismika Ranasinghe,Teemu Veijalainen,Petteri Kela,Risto Wichman
备注:Accepted to the IEEE International Conference on Communications (ICC) 2026
摘要:Millimeter-wave (mmWave) communication systems, particularly those leveraging multi-user multiple-input and multiple-output (MU-MIMO) with hybrid beamforming, face challenges in optimizing user throughput and minimizing latency due to the high complexity of dynamic beam selection and management. This paper introduces a deep reinforcement learning (DRL) approach for enhancing user throughput in multi-panel mmWave radio access networks in a practical network setup. Our DRL-based formulation utilizes an adaptive beam management strategy that models the interaction between the communication agent and its environment as a Markov decision process (MDP), optimizing beam selection based on real-time observations. The proposed framework exploits spatial domain (SD) characteristics by incorporating the cross-correlation between the beams in different antenna panels, the measured reference signal received power (RSRP), and the beam usage statistics to dynamically adjust beamforming decisions. As a result, the spectral efficiency is improved and end-to-end latency is reduced. The numerical results demonstrate an increase in throughput of up to 16% and a reduction in latency by factors 3-7x compared to baseline (legacy beam management).


【16】Practical FP4 Training for Large-Scale MoE Models on Hopper GPUs
标题:Hopper GPU上大规模MoE模型的实用FP4训练
链接:https://arxiv.org/abs/2603.02731

作者:Wuyue Zhang,Chongdong Huang,Chunbo You,Cheng Gu,Fengjuan Wang,Mou Sun
摘要:Training large-scale Mixture-of-Experts (MoE) models is bottlenecked by activation memory and expert-parallel communication, yet FP4 training remains impractical on Hopper-class GPUs without native MXFP4 or NVFP4 support. In this work, we present a training recipe that enables MXFP4 efficiency for MoE models on Hopper architectures without native 4-bit computation support. A central challenge is to integrate FP4 into an existing BF16/FP8 hybrid training pipeline without incurring costly precision round-trips (e.g., FP4 $\leftrightarrow$ BF16 $\leftrightarrow$ FP8). We address this challenge by introducing direct FP8-to-FP4 quantization and de-quantization, together with scaling-aware FP4 row-wise to column-wise conversion, enabling FP4 activations and expert-parallel communication with minimal overhead. Core MoE computations are executed in FP8, while activations and expert-parallel communication are compressed using MXFP4, achieving substantial memory and bandwidth savings without degrading convergence. At the 671B parameter scale, our method achieves end-to-end training performance comparable to strong FP8 baselines, while reducing peak activation memory by 14.8\% (11.8 GB) and improving training throughput by 12.5\%, from 1157 to 1302 tokens per GPU per second. These results show that FP4 efficiency can be practically realized for large-scale MoE training through careful software-hardware co-design, even without native FP4 Tensor Core support.


【17】Causal Learning Should Embrace the Wisdom of the Crowd
标题:因果学习应拥抱群体智慧
链接:https://arxiv.org/abs/2603.02678

作者:Ryan Feng Lin,Yuantao Wei,Huiling Liao,Xiaoning Qian,Shuai Huang
摘要:Learning causal structures typically represented by directed acyclic graphs (DAGs) from observational data is notoriously challenging due to the combinatorial explosion of possible graphs and inherent ambiguities in observations. This paper argues that causal learning is now ready for the emergence of a new paradigm supported by rapidly advancing technologies, fulfilling the long-standing vision of leveraging human causal knowledge. This paradigm integrates scalable crowdsourcing platforms for data collection, interactive knowledge elicitation for expert opinion modeling, robust aggregation techniques for expert reconciliation, and large language model (LLM)-based simulation for augmenting AI-driven information acquisition. In this paper, we focus on DAG learning for causal discovery and frame the problem as a distributed decision-making task, recognizing that each participant (human expert or LLM agent) possesses fragmented and imperfect knowledge about different subsets of the variables of interest in the causal graph. By proposing a systematic framework to synthesize these insights, we aim to enable the recovery of a global causal structure unachievable by any individual agent alone. We advocate for a new research frontier and outline a comprehensive framework for new research thrusts that range from eliciting, modeling, aggregating, and optimizing human causal knowledge contributions.


【18】Robust Heterogeneous Analog-Digital Computing for Mixture-of-Experts Models with Theoretical Generalization Guarantees
标题:具有理论泛化保证的混合专家模型的鲁棒异构模拟-数字计算
链接:https://arxiv.org/abs/2603.02633

作者:Mohammed Nowaz Rabbani Chowdhury,Hsinyu Tsai,Geoffrey W. Burr,Kaoutar El Maghraoui,Liu Liu,Meng Wang
摘要:Sparse Mixture-of-Experts (MoE) models enable efficient scalability by activating only a small subset of experts per input, yet their massive parameter counts lead to substantial memory and energy inefficiency during inference. Analog in-memory computing (AIMC) offers a promising solution by eliminating frequent data movement between memory and compute units. However, mitigating hardware nonidealities of AIMC typically requires noise-aware retraining, which is infeasible for large MoE models. In this paper, we propose a retraining-free heterogeneous computation framework in which noise-sensitive experts, which are provably identifiable by their maximum neuron norm, are computed digitally while the majority of the experts are executed on AIMC hardware. We further assign densely activated modules, such as attention layers, to digital computation due to their high noise sensitivity despite comprising a small fraction of parameters. Extensive experiments on large MoE language models, including DeepSeekMoE and OLMoE, across multiple benchmark tasks validate the robustness of our approach in maintaining accuracy under analog nonidealities.


【19】Towards Parameter-Free Temporal Difference Learning
标题:迈向无参数时序差分学习
链接:https://arxiv.org/abs/2603.02577

作者:Yunxiang Li,Mark Schmidt,Reza Babanezhad,Sharan Vaswani
摘要:Temporal difference (TD) learning is a fundamental algorithm for estimating value functions in reinforcement learning. Recent finite-time analyses of TD with linear function approximation quantify its theoretical convergence rate. However, they often require setting the algorithm parameters using problem-dependent quantities that are difficult to estimate in practice -- such as the minimum eigenvalue of the feature covariance ($ω$) or the mixing time of the underlying Markov chain ($τ_{\text{mix}}$). In addition, some analyses rely on nonstandard and impractical modifications, exacerbating the gap between theory and practice. To address these limitations, we use an exponential step-size schedule with the standard TD(0) algorithm. We analyze the resulting method under two sampling regimes: independent and identically distributed (i.i.d.) sampling from the stationary distribution, and the more practical Markovian sampling along a single trajectory. In the i.i.d. setting, the proposed algorithm does not require knowledge of problem-dependent quantities such as $ω$, and attains the optimal bias-variance trade-off for the last iterate. In the Markovian setting, we propose a regularized TD(0) algorithm with an exponential step-size schedule. The resulting algorithm achieves a comparable convergence rate to prior works, without requiring projections, iterate averaging, or knowledge of $τ_{\text{mix}}$ or $ω$.
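
For intuition, here is a toy sketch of TD(0) with an exponentially decaying step size on a small synthetic Markov reward process with linear features; the schedule constants below are illustrative, not the ones from the paper's analysis.

```python
# Toy TD(0) with an exponential step-size schedule (illustrative constants).
import numpy as np

rng = np.random.default_rng(0)
n_states, d, T = 5, 3, 20000
P = rng.dirichlet(np.ones(n_states), size=n_states)   # row-stochastic transitions
r = rng.standard_normal(n_states)                      # state rewards
Phi = rng.standard_normal((n_states, d))               # linear feature map
gamma = 0.9

theta = np.zeros(d)
eta0, alpha = 0.5, 0.9995                              # step size eta_t = eta0 * alpha**t
s = 0
for t in range(T):
    s_next = rng.choice(n_states, p=P[s])              # Markovian sampling, one trajectory
    td_err = r[s] + gamma * Phi[s_next] @ theta - Phi[s] @ theta
    theta += (eta0 * alpha ** t) * td_err * Phi[s]
    s = s_next
print("learned weights:", theta)
```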


【20】Thermodynamic Regulation of Finite-Time Gibbs Training in Energy-Based Models: A Restricted Boltzmann Machine Study
标题:基于能量的模型中有限时间吉布斯训练的热力学调节:受限玻尔兹曼机研究
链接:https://arxiv.org/abs/2603.02525

作者:Görkem Can Süleymanoğlu
备注:35 pages, 12 Tables, 7 figures. Includes theoretical analysis and experimental validation on MNIST
摘要:Restricted Boltzmann Machines (RBMs) are typically trained using finite-length Gibbs chains under a fixed sampling temperature. This practice implicitly assumes that the stochastic regime remains valid as the energy landscape evolves during learning. We argue that this assumption can become structurally fragile under finite-time training dynamics. This fragility arises because, in nonconvex energy-based models, fixed-temperature finite-time training can generate admissible trajectories with effective-field amplification and conductance collapse. As a result, the Gibbs sampler may asymptotically freeze, the negative phase may localize, and, without sufficiently strong regularization, parameters may exhibit deterministic linear drift. To address this instability, we introduce an endogenous thermodynamic regulation framework in which temperature evolves as a dynamical state variable coupled to measurable sampling statistics. Under standard local Lipschitz conditions and a two-time-scale separation regime, we establish global parameter boundedness under strictly positive L2 regularization. We further prove local exponential stability of the thermodynamic subsystem and show that the regulated regime mitigates inverse-temperature blow-up and freezing-induced degeneracy within a forward-invariant neighborhood. Experiments on MNIST demonstrate that the proposed self-regulated RBM substantially improves normalization stability and effective sample size relative to fixed-temperature baselines, while preserving reconstruction performance. Overall, the results reinterpret RBM training as a controlled non-equilibrium dynamical process rather than a static equilibrium approximation.


【21】Spectral Regularization for Diffusion Models
标题:扩散模型的谱正则化
链接:https://arxiv.org/abs/2603.02447

作者:Satish Chandran,Nicolas Roque dos Santos,Yunshu Wu,Greg Ver Steeg,Evangelos Papalexakis
摘要:Diffusion models are typically trained using pointwise reconstruction objectives that are agnostic to the spectral and multi-scale structure of natural signals. We propose a loss-level spectral regularization framework that augments standard diffusion training with differentiable Fourier- and wavelet-domain losses, without modifying the diffusion process, model architecture, or sampling procedure. The proposed regularizers act as soft inductive biases that encourage appropriate frequency balance and coherent multi-scale structure in generated samples. Our approach is compatible with DDPM, DDIM, and EDM formulations and introduces negligible computational overhead. Experiments on image and audio generation demonstrate consistent improvements in sample quality, with the largest gains observed on higher-resolution, unconditional datasets where fine-scale structure is most challenging to model.
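
A minimal PyTorch sketch of the loss-level idea: augment the pointwise diffusion objective with a differentiable Fourier-domain term. The weighting and the log-magnitude comparison below are plausible assumptions for illustration, not the paper's exact regularizer.

```python
# Sketch of a Fourier-domain auxiliary loss added to a pointwise diffusion loss.
import torch
import torch.nn.functional as F

def spectral_regularized_loss(pred, target, lambda_f=0.1):
    pixel_loss = F.mse_loss(pred, target)               # standard pointwise objective
    pred_mag = torch.fft.rfft2(pred).abs()              # differentiable 2D FFT magnitude
    target_mag = torch.fft.rfft2(target).abs()
    freq_loss = F.l1_loss(torch.log1p(pred_mag), torch.log1p(target_mag))
    return pixel_loss + lambda_f * freq_loss

pred = torch.randn(2, 3, 32, 32, requires_grad=True)    # e.g., predicted noise
target = torch.randn(2, 3, 32, 32)
spectral_regularized_loss(pred, target).backward()      # gradients flow through the FFT
```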


【22】Using the SEKF to Transfer NN Models of Dynamical Systems with Limited Data
标题:使用SEKF在有限数据下迁移动态系统的神经网络模型
链接:https://arxiv.org/abs/2603.02439

作者:Joshua E. Hammond,Tyler A. Soderstrom,Brian A. Korgel,Michael Baldea
摘要:Data-driven models of dynamical systems require extensive amounts of training data. For many practical applications, gathering sufficient data is not feasible due to cost or safety concerns. This work uses the Subset Extended Kalman Filter (SEKF) to adapt pre-trained neural network models to new, similar systems with limited data available. Experimental validation across damped spring and continuous stirred-tank reactor systems demonstrates that small parameter perturbations to the initial model capture target system dynamics while requiring as little as 1% of original training data. In addition, finetuning requires less computational cost and reduces generalization error.
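
The following NumPy toy illustrates the core mechanic under our own simplifying assumptions: an extended-Kalman-filter update restricted to a chosen subset of parameters, with a linear model standing in for the network and the target system shifted by +0.5 on the adapted coordinates.

```python
# Toy subset-EKF update on a linear stand-in for a pre-trained network.
import numpy as np

rng = np.random.default_rng(0)
d = 10
subset = np.arange(3)                                  # only the first 3 parameters adapt
theta = rng.standard_normal(d)                         # "pre-trained" parameters
theta_true = theta + 0.5 * (np.arange(d) < 3)          # target system shifts the subset
P = 0.1 * np.eye(len(subset))                          # covariance over the subset only
R = 0.01                                               # observation noise variance

for _ in range(50):                                    # limited data from the target system
    x = rng.standard_normal(d)
    y = theta_true @ x + 0.1 * rng.standard_normal()   # noisy observation
    H = x[subset]                                      # Jacobian w.r.t. the adapted subset
    K = P @ H / (H @ P @ H + R)                        # Kalman gain
    theta[subset] += K * (y - theta @ x)               # innovation-driven subset update
    P -= np.outer(K, H) @ P
print("subset error:", np.abs(theta[:3] - theta_true[:3]).max())
```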


【23】Personalized Multi-Agent Average Reward TD-Learning via Joint Linear Approximation
标题:通过联合线性逼近的个性化多智能体平均奖励TD学习
链接:https://arxiv.org/abs/2603.02426

作者:Leo Wang,Pengkun Yang,Lili Su
摘要:We study personalized multi-agent average reward TD learning, in which a collection of agents interacts with different environments and jointly learns their respective value functions. We focus on the setting where there exists a shared linear representation, and the agents' optimal weights collectively lie in an unknown linear subspace. Inspired by the recent success of personalized federated learning (PFL), we study the convergence of cooperative single-timescale TD learning in which agents iteratively estimate the common subspace and local heads. We show that this decomposition can filter out conflicting signals, effectively mitigating the negative impacts of ``misaligned'' signals and achieving linear speedup. The main technical challenges lie in the heterogeneity, the Markovian sampling, and their intricate interplay in shaping error evolutions. Specifically, not only are the error dynamics of multiple variables closely interconnected, but there is also no direct contraction for the principal angle distance between the optimal subspace and the estimated subspace. We hope our analytical techniques can inspire deeper exploration into leveraging common structures. Experiments are provided to show that the benefits of learning via a shared structure extend to the more general control problem.


【24】The Malignant Tail: Spectral Segregation of Label Noise in Over-Parameterized Networks
标题:恶性尾部:过参数化网络中标签噪声的谱分离
链接:https://arxiv.org/abs/2603.02293

作者:Zice Wang
摘要:While implicit regularization facilitates benign overfitting in low-noise regimes, recent theoretical work predicts a sharp phase transition to harmful overfitting as the noise-to-signal ratio increases. We experimentally isolate the geometric mechanism of this transition: the Malignant Tail, a failure mode where networks functionally segregate signal and noise, reducing coherent semantic features into low-rank subspaces while pushing stochastic label noise into high-frequency orthogonal components, distinct from systematic or corruption-aligned noise. Through a Spectral Linear Probe of training dynamics, we demonstrate that Stochastic Gradient Descent (SGD) fails to suppress this noise, instead implicitly biasing it toward high-frequency orthogonal subspaces, effectively preserving signal-noise separability. We show that this geometric separation is distinct from simple variance reduction in untrained models. In trained networks, SGD actively segregates noise, allowing post-hoc Explicit Spectral Truncation ($d \ll D$) to surgically prune the noise-dominated subspace. This approach recovers the optimal generalization capability latent in the converged model. Unlike unstable temporal early stopping, Geometric Truncation provides a stable post-hoc intervention. Our findings suggest that under label noise, excess spectral capacity is not harmless redundancy but a latent structural liability that allows for noise memorization, necessitating explicit rank constraints to filter stochastic corruptions for robust generalization.


【25】Beyond Binary Preferences: A Principled Framework for Reward Modeling with Ordinal Feedback
标题:超越二元偏好:基于有序反馈的奖励建模原则性框架
链接:https://arxiv.org/abs/2603.02232

作者:Amirhossein Afsharrad,Ruida Zhou,Luca Viano,Sanjay Lall,Mohammad Ghavamzadeh
摘要:Reward modeling is crucial for aligning large language models with human preferences, yet current approaches lack a principled mathematical framework for leveraging ordinal preference data. When human annotators provide graded preferences on a Likert scale (e.g., significantly better, better, slightly better, negligibly better), existing methods typically apply ad-hoc heuristics, such as margin terms or scaling factors, to loss functions derived from binary preference models like Bradley-Terry. These approaches lack an underlying mathematical model for how ordinal preference data is generated. We present a theoretically grounded framework that formulates reward modeling with Likert scale preferences as a discrete ordinal regression problem. We derive two loss functions from this formulation: a negative log-likelihood loss and an all-threshold loss, both of which learn threshold parameters that naturally capture the ordinal structure of preferences. Unlike existing heuristic methods that manually specify fixed margins or scaling weights, our approach learns these parameters directly from data within a coherent probabilistic framework. Experimental results on multiple benchmarks demonstrate that our ordinal regression approach consistently achieves competitive or superior performance compared to existing heuristic methods across diverse evaluation categories including chat, reasoning, and safety tasks. Our work provides the first principled mathematical framework for incorporating Likert scale preferences into reward model training, moving beyond ad-hoc modifications of binary preference models to enable more effective utilization of fine-grained human feedback.
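
As a hedged sketch of the all-threshold idea, the PyTorch snippet below learns ordered thresholds over reward margins; the cumulative-softplus parameterization of the thresholds is one plausible instantiation, not necessarily the paper's.

```python
# Illustrative all-threshold ordinal loss over reward margins with learned thresholds.
import torch
import torch.nn.functional as F

class AllThresholdLoss(torch.nn.Module):
    def __init__(self, n_levels=4):
        super().__init__()
        self.raw = torch.nn.Parameter(torch.zeros(n_levels - 1))  # unconstrained params

    def forward(self, margin, level):
        # margin: r(chosen) - r(rejected); level: ordinal label in {0, ..., n_levels-1}
        thresholds = torch.cumsum(F.softplus(self.raw) + 1e-3, dim=0)  # strictly increasing
        j = torch.arange(thresholds.numel(), device=margin.device)
        # +1 for thresholds the margin should exceed (j < level), -1 otherwise
        sign = 2.0 * (j[None, :] < level[:, None]).float() - 1.0
        return F.softplus(-sign * (margin[:, None] - thresholds[None, :])).sum(1).mean()

loss_fn = AllThresholdLoss()
margin = torch.randn(8)                    # reward margins from a reward model
level = torch.randint(0, 4, (8,))          # Likert-style ordinal labels
print(float(loss_fn(margin, level)))
```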


【26】Physics-Informed Neural Networks with Architectural Physics Embedding for Large-Scale Wave Field Reconstruction
标题:具有架构级物理嵌入的物理信息神经网络用于大规模波场重建
链接:https://arxiv.org/abs/2603.02231

作者:Huiwen Zhang,Feng Ye,Chu Ma
备注:20 pages, 17 figures
摘要:Large-scale wave field reconstruction requires precise solutions but faces challenges with computational efficiency and accuracy. The physics-based numerical methods like Finite Element Method (FEM) provide high accuracy but struggle with large-scale or high-frequency problems due to prohibitive computational costs. Pure data-driven approaches excel in speed but often lack sufficient labeled data for complex scenarios. Physics-informed neural networks (PINNs) integrate physical principles into machine learning models, offering a promising solution by bridging these gaps. However, standard PINNs embed physical principles only in loss functions, leading to slow convergence, optimization instability, and spectral bias, limiting their ability for large-scale wave field reconstruction. This work introduces architecture physics embedded (PE)-PINN, which integrates additional physical guidance directly into the neural network architecture beyond Helmholtz equations and boundary conditions in loss functions. Specifically, a new envelope transformation layer is designed to mitigate spectral bias with kernels parameterized by source properties, material interfaces, and wave physics. Experiments demonstrate that PE-PINN achieves more than 10 times speedup in convergence compared to standard PINNs and several orders of magnitude reduction in memory usage compared to FEM. This breakthrough enables high-fidelity modeling for large-scale 2D/3D electromagnetic wave reconstruction involving reflections, refractions, and diffractions in room-scale domains, readily applicable to wireless communications, sensing, room acoustics, and other fields requiring large-scale wave field analysis.


【27】Neural Paging: Learning Context Management Policies for Turing-Complete Agents
标题:神经分页:为图灵完备智能体学习上下文管理策略
链接:https://arxiv.org/abs/2603.02228

作者:Liang Chen,Qi Liu
摘要:The proof that Large Language Models (LLMs) augmented with external read-write memory constitute a computationally universal system has established the theoretical foundation for general-purpose agents. However, existing implementations face a critical bottleneck: the finite and costly Context Window, which functions not as infinite memory but as a scarce semantic cache. In this work, we introduce Neural Paging, a hierarchical architecture that decouples symbolic reasoning from information resource management. We formulate the Context Paging Problem (CPP) and propose a lightweight, differentiable Page Controller designed to approximate ``Semantic Belady's Optimality'' -- retaining tokens with high future utility under explicit assumptions on access patterns. We provide theoretical analysis showing that, under bounded context window size $K$, Neural Paging reduces the asymptotic complexity of long-horizon reasoning from quadratic $O(N^2)$ to $O(N \cdot K^2)$, and we derive a robustness bound (Theorem 4) that quantifies competitive-ratio degradation under policy-dependent access with bounded sensitivity. We validate these bounds on synthetic paging traces, confirming that the theoretical guarantees hold and identifying significant slack that motivates learned policies.
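
A toy sketch of the Belady-style oracle such a Page Controller would approximate: evict the cached token whose next use lies farthest in the future. Treating token-level next-use distance as the utility signal is our simplifying assumption, not the paper's semantic utility model.

```python
# Toy Belady eviction for a bounded context: evict the entry used farthest in the future.
def belady_evict(context, future_accesses):
    def next_use(tok):
        try:
            return future_accesses.index(tok)
        except ValueError:
            return float('inf')                # never used again: ideal eviction candidate
    return max(context, key=next_use)

context = ["def", "foo", "(", "x", ")"]
future = ["x", ")", "foo"]
print(belady_evict(context, future))           # "def" never reappears, so it is evicted
```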


【28】Efficient Sparse Selective-Update RNNs for Long-Range Sequence Modeling
标题:用于长期序列建模的高效稀疏选择性更新RNN
链接:https://arxiv.org/abs/2603.02226

作者:Bojian Yin,Shurong Wang,Haoyu Tan,Sander Bohte,Federico Corradi,Guoqi Li
摘要:Real-world sequential signals, such as audio or video, contain critical information that is often embedded within long periods of silence or noise. While recurrent neural networks (RNNs) are designed to process such data efficiently, they often suffer from ``memory decay'' due to a rigid update schedule: they typically update their internal state at every time step, even when the input is static. This constant activity forces the model to overwrite its own memory and makes it hard for the learning signal to reach back to distant past events. Here we show that we can overcome this limitation using Selective-Update RNNs (suRNNs), a non-linear architecture that learns to preserve its memory when the input is redundant. By using a neuron-level binary switch that only opens for informative events, suRNNs decouple the recurrent updates from the raw sequence length. This mechanism allows the model to maintain an exact, unchanged memory of the past during low-information intervals, creating a direct path for gradients to flow across time. Our experiments on the Long Range Arena, WikiText, and other synthetic benchmarks show that suRNNs match or exceed the accuracy of much more complex models such as Transformers, while remaining significantly more efficient for long-term storage. By allowing each neuron to learn its own update timescale, our approach resolves the mismatch between how long a sequence is and how much information it actually contains. By providing a principled approach to managing temporal information density, this work establishes a new direction for achieving Transformer-level performance within the highly efficient framework of recurrent modeling.
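
A simplified PyTorch cell captures the mechanism: a per-neuron binary switch (relaxed here with a straight-through sigmoid) decides whether each hidden unit updates or keeps its previous value exactly. Names and gating details below are illustrative assumptions, not the exact suRNN architecture.

```python
# Simplified selective-update recurrent cell with straight-through binary gates.
import torch

class SelectiveUpdateCell(torch.nn.Module):
    def __init__(self, d_in, d_hidden):
        super().__init__()
        self.candidate = torch.nn.Linear(d_in + d_hidden, d_hidden)
        self.gate = torch.nn.Linear(d_in + d_hidden, d_hidden)

    def forward(self, x, h):
        z = torch.cat([x, h], dim=-1)
        h_new = torch.tanh(self.candidate(z))
        p = torch.sigmoid(self.gate(z))
        # hard 0/1 gate in the forward pass, sigmoid gradient in the backward pass
        g = (p > 0.5).float() + p - p.detach()
        # where g == 0 the memory is preserved exactly, letting gradients skip the step
        return g * h_new + (1.0 - g) * h

cell = SelectiveUpdateCell(8, 16)
h = torch.zeros(4, 16)
for _ in range(10):
    h = cell(torch.randn(4, 8), h)
print(h.shape)
```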


【29】A Covering Framework for Offline POMDPs Learning using Belief Space Metric
标题:使用信念空间度量的离线POMDPs学习覆盖框架
链接:https://arxiv.org/abs/2603.03191

作者:Youheng Zhu,Yiping Lu
摘要:In off-policy evaluation (OPE) for partially observable Markov decision processes (POMDPs), an agent must infer hidden states from past observations, which exacerbates both the curse of horizon and the curse of memory in existing OPE methods. This paper introduces a novel covering analysis framework that exploits the intrinsic metric structure of the belief space (distributions over latent states) to relax traditional coverage assumptions. By assuming value-relevant functions are Lipschitz continuous in the belief space, we derive error bounds that mitigate exponential blow-ups in horizon and memory length. Our unified analysis technique applies to a broad class of OPE algorithms, yielding concrete error bounds and coverage requirements expressed in terms of belief space metrics rather than raw history coverage. We illustrate the improved sample efficiency of this framework via case studies: the double-sampling Bellman error minimization algorithm, and the memory-based future-dependent value functions (FDVF). In both cases, our coverage definition based on the belief space metric yields tighter bounds.


【30】From Reachability to Learnability: Geometric Design Principles for Quantum Neural Networks
标题:从可达性到可学习性:量子神经网络的几何设计原则
链接:https://arxiv.org/abs/2603.03071

作者:Vishal S. Ngairangbam,Michael Spannowsky
备注:29 pages, 5 figures, 3 tables
摘要:Classical deep networks are effective because depth enables adaptive geometric deformation of data representations. In quantum neural networks (QNNs), however, depth or state reachability alone does not guarantee this feature-learning capability. We study this question in the pure-state setting by viewing encoded data as an embedded manifold in $\mathbb{C}P^{2^n-1}$ and analysing infinitesimal unitary actions through Lie-algebra directions. We introduce Classical-to-Lie-algebra (CLA) maps and the criterion of almost Complete Local Selectivity (aCLS), which combines directional completeness with data-dependent local selectivity. Within this framework, we show that data-independent trainable unitaries are complete but non-selective, i.e. learnable rigid reorientations, whereas pure data encodings are selective but non-tunable, i.e. fixed deformations. Hence, geometric flexibility requires a non-trivial joint dependence on data and trainable weights. We further show that accessing high-dimensional deformations of many-qubit state manifolds requires parametrised entangling directions; fixed entanglers such as CNOT alone do not provide adaptive geometric control. Numerical examples validate that CLS-satisfying data re-uploading models outperform non-tunable schemes while requiring only a quarter of the gate operations. Thus, the resulting picture reframes QNN design from state reachability to controllable geometry of hidden quantum representations.


【31】Sparse autoencoders reveal organized biological knowledge but minimal regulatory logic in single-cell foundation models: a comparative atlas of Geneformer and scGPT
标题:稀疏自编码器揭示单细胞基础模型中有组织的生物学知识,但仅有极少的调控逻辑:Geneformer与scGPT的比较图谱
链接:https://arxiv.org/abs/2603.02952

作者:Ihor Kendiukhov
摘要:Background: Single-cell foundation models such as Geneformer and scGPT encode rich biological information, but whether this includes causal regulatory logic rather than statistical co-expression remains unclear. Sparse autoencoders (SAEs) can resolve superposition in neural networks by decomposing dense activations into interpretable features, yet they have not been systematically applied to biological foundation models.   Results: We trained TopK SAEs on residual stream activations from all layers of Geneformer V2-316M (18 layers, d=1152) and scGPT whole-human (12 layers, d=512), producing atlases of 82525 and 24527 features, respectively. Both atlases confirm massive superposition, with 99.8 percent of features invisible to SVD. Systematic characterization reveals rich biological organization: 29 to 59 percent of features annotate to Gene Ontology, KEGG, Reactome, STRING, or TRRUST, with U-shaped layer profiles reflecting hierarchical abstraction. Features organize into co-activation modules (141 in Geneformer, 76 in scGPT), exhibit causal specificity (median 2.36x), and form cross-layer information highways (63 to 99.8 percent). When tested against genome-scale CRISPRi perturbation data, only 3 of 48 transcription factors (6.2 percent) show regulatory-target-specific feature responses. A multi-tissue control yields marginal improvement (10.4 percent, 5 of 48 TFs), establishing model representations as the bottleneck.   Conclusions: These models have internalized organized biological knowledge, including pathway membership, protein interactions, functional modules, and hierarchical abstraction, yet they encode minimal causal regulatory logic. We release both feature atlases as interactive web platforms enabling exploration of more than 107000 features across 30 layers of two leading single-cell foundation models.
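
For readers unfamiliar with the setup, a minimal PyTorch TopK sparse autoencoder of the kind described looks roughly as follows; the dimensions and k are illustrative stand-ins for the paper's configurations.

```python
# Minimal TopK sparse autoencoder over residual-stream activations (illustrative sizes).
import torch

class TopKSAE(torch.nn.Module):
    def __init__(self, d_model=512, d_feats=24576, k=32):
        super().__init__()
        self.enc = torch.nn.Linear(d_model, d_feats)
        self.dec = torch.nn.Linear(d_feats, d_model)
        self.k = k

    def forward(self, x):
        a = torch.relu(self.enc(x))
        vals, idx = a.topk(self.k, dim=-1)           # keep only the k strongest features
        sparse = torch.zeros_like(a).scatter(-1, idx, vals)
        return self.dec(sparse), sparse

sae = TopKSAE()
x = torch.randn(8, 512)                              # batch of residual-stream activations
recon, feats = sae(x)
print(recon.shape, int(feats.ne(0).sum(dim=-1).max()))   # at most k active features each
```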


【32】Exact Functional ANOVA Decomposition for Categorical Inputs Models
标题:类别输入模型的精确函数型方差分析分解
链接:https://arxiv.org/abs/2603.02673

作者:Baptiste Ferrere,Nicolas Bousquet,Fabrice Gamboa,Jean-Michel Loubes,Joseph Muré
摘要:Functional ANOVA offers a principled framework for interpretability by decomposing a model's prediction into main effects and higher-order interactions. For independent features, this decomposition is well-defined, strongly linked with SHAP values, and serves as a cornerstone of additive explainability. However, the lack of an explicit closed-form expression for general dependent distributions has forced practitioners to rely on costly sampling-based approximations. We completely resolve this limitation for categorical inputs. By bridging functional analysis with the extension of discrete Fourier analysis, we derive a closed-form decomposition without any assumption. Our formulation is computationally very efficient. It seamlessly recovers the classical independent case and extends to arbitrary dependence structures, including distributions with non-rectangular support. Furthermore, leveraging the intrinsic link between SHAP and ANOVA under independence, our framework yields a natural generalization of SHAP values for the general categorical setting.


【33】Combinatorial Sparse PCA Beyond the Spiked Identity Model
标题:超越尖峰单位阵模型的组合稀疏PCA
链接:https://arxiv.org/abs/2603.02607

作者:Syamantak Kumar,Purnamrita Sarkar,Kevin Tian,Peiyuan Zhang
备注:36 pages, 6 figures
摘要:Sparse PCA is one of the most well-studied problems in high-dimensional statistics. In this problem, we are given samples from a distribution with covariance $Σ$, whose top eigenvector $v \in R^d$ is $s$-sparse. Existing sparse PCA algorithms can be broadly categorized into (1) combinatorial algorithms (e.g., diagonal or elementwise covariance thresholding) and (2) SDP-based algorithms. While combinatorial algorithms are much simpler, they are typically only analyzed under the spiked identity model (where $Σ= I_d + γvv^\top$ for some $γ> 0$), whereas SDP-based algorithms require no additional assumptions on $Σ$.   We demonstrate explicit counterexample covariances $Σ$ against the success of standard combinatorial algorithms for sparse PCA, when moving beyond the spiked identity model. In light of this discrepancy, we give the first combinatorial method for sparse PCA that provably succeeds for general $Σ$ using $s^2 \cdot \mathrm{polylog}(d)$ samples and $d^2 \cdot \mathrm{poly}(s, \log(d))$ time, by providing a global convergence guarantee on a variant of the truncated power method of Yuan and Zhang (2013). We provide a natural generalization of our method to recovering a vector in a sparse leading eigenspace. Finally, we evaluate our method on synthetic and real-world sparse PCA datasets.
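
For reference, the truncated power method of Yuan and Zhang (2013) that this work builds on alternates a power-iteration step with hard-thresholding to the s largest coordinates; here is a NumPy sketch on a toy spiked covariance (the spike is for illustration only, since the paper's point is precisely that their variant handles general covariances).

```python
# Truncated power method: power iteration plus hard-thresholding to s coordinates.
import numpy as np

def truncated_power_method(Sigma, s, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(Sigma.shape[0])
    v /= np.linalg.norm(v)
    for _ in range(iters):
        w = Sigma @ v
        keep = np.argsort(np.abs(w))[-s:]           # indices of the s largest entries
        v = np.zeros_like(w)
        v[keep] = w[keep]
        v /= np.linalg.norm(v)
    return v

d, s = 50, 5
u = np.zeros(d)
u[:s] = 1 / np.sqrt(s)                               # s-sparse leading eigenvector
Sigma = np.eye(d) + 4.0 * np.outer(u, u)             # toy spiked covariance
print("correlation with truth:", abs(truncated_power_method(Sigma, s) @ u))
```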


【34】Optimizing Orbital Parameters of Satellites for a Global Quantum Network
标题:全球量子网络卫星轨道参数优化
链接:https://arxiv.org/abs/2603.02480

作者:Athul Ashok,Owen DePoint,Jackson MacDonald,Albert Williams,Don Towsley
备注:Long (8 page, 5 figure) version of paper appearing at QCNC 2026
摘要:Due to fundamental limitations on terrestrial quantum links, satellites have received considerable attention for their potential as entanglement generation sources in a global quantum internet. In this work, we focus on the problem of designing a constellation of satellites for such a quantum network. We find satellite inclination angles and satellite cluster allocations to achieve maximal entanglement generation rates to fixed sets of globally distributed ground stations. Exploring two black-box optimization frameworks: a Bayesian Optimization (BO) approach and a Genetic Algorithm (GA) approach, we find comparable results, indicating their effectiveness for this optimization task. While GA and BO often perform remarkably similarly, BO tends to converge more efficiently, while the later growth noted in GAs indicates less susceptibility to local maxima. In either case, they offer substantial improvements over naive approaches that maximize coverage with respect to ground station placement.


【35】Large Electron Model: A Universal Ground State Predictor
标题:大电子模型:通用基态预测器
链接:https://arxiv.org/abs/2603.02346

作者:Timothy Zaklama,Max Geier,Liang Fu
备注:8+5 pages, 5+4 figures, 1+1 tables
摘要:We introduce Large Electron Model, a single neural network model that produces variational wavefunctions of interacting electrons over the entire Hamiltonian parameter manifold. Our model employs the Fermi Sets architecture, a universal representation of many-body fermionic wavefunctions, which is further conditioned on Hamiltonian parameter and particle number. On interacting electrons in a two-dimensional harmonic potential, a single trained model accurately predicts the ground state wavefunction while generalizing across unseen coupling strengths and particle-number sectors, producing both accurate real-space charge densities and ground state energies, even up to $50$ particles. Our results establish a foundation model method for material discovery that is grounded in the variational principle, while accurately treating strong electron correlation beyond the capacity of density functional theory.


其他(51篇)

【1】How to Peel with a Knife: Aligning Fine-Grained Manipulation with Human Preference
标题:如何用刀削皮:将精细操作与人类偏好对齐
链接:https://arxiv.org/abs/2603.03280

作者:Toru Lin,Shuying Deng,Zhao-Heng Yin,Pieter Abbeel,Jitendra Malik
备注:Project page can be found at https://toruowo.github.io/peel
摘要:Many essential manipulation tasks - such as food preparation, surgery, and craftsmanship - remain intractable for autonomous robots. These tasks are characterized not only by contact-rich, force-sensitive dynamics, but also by their "implicit" success criteria: unlike pick-and-place, task quality in these domains is continuous and subjective (e.g. how well a potato is peeled), making quantitative evaluation and reward engineering difficult. We present a learning framework for such tasks, using peeling with a knife as a representative example. Our approach follows a two-stage pipeline: first, we learn a robust initial policy via force-aware data collection and imitation learning, enabling generalization across object variations; second, we refine the policy through preference-based finetuning using a learned reward model that combines quantitative task metrics with qualitative human feedback, aligning policy behavior with human notions of task quality. Using only 50-200 peeling trajectories, our system achieves over 90% average success rates on challenging produce including cucumbers, apples, and potatoes, with performance improving by up to 40% through preference-based finetuning. Remarkably, policies trained on a single produce category exhibit strong zero-shot generalization to unseen in-category instances and to out-of-distribution produce from different categories while maintaining over 90% success rates.


【2】LoGeR: Long-Context Geometric Reconstruction with Hybrid Memory
标题:LoGeR:具有混合记忆的长上下文几何重建
链接:https://arxiv.org/abs/2603.03269

作者:Junyi Zhang,Charles Herrmann,Junhwa Hur,Chen Sun,Ming-Hsuan Yang,Forrester Cole,Trevor Darrell,Deqing Sun
备注:Project page: https://LoGeR-project.github.io/
摘要:Feedforward geometric foundation models achieve strong short-window reconstruction, yet scaling them to minutes-long videos is bottlenecked by quadratic attention complexity or limited effective memory in recurrent designs. We present LoGeR (Long-context Geometric Reconstruction), a novel architecture that scales dense 3D reconstruction to extremely long sequences without post-optimization. LoGeR processes video streams in chunks, leveraging strong bidirectional priors for high-fidelity intra-chunk reasoning. To manage the critical challenge of coherence across chunk boundaries, we propose a learning-based hybrid memory module. This dual-component system combines a parametric Test-Time Training (TTT) memory to anchor the global coordinate frame and prevent scale drift, alongside a non-parametric Sliding Window Attention (SWA) mechanism to preserve uncompressed context for high-precision adjacent alignment. Remarkably, this memory architecture enables LoGeR to be trained on sequences of 128 frames, and generalize up to thousands of frames during inference. Evaluated across standard benchmarks and a newly repurposed VBR dataset with sequences of up to 19k frames, LoGeR substantially outperforms prior state-of-the-art feedforward methods--reducing ATE on KITTI by over 74%--and achieves robust, globally consistent reconstruction over unprecedented horizons.


【3】Physics-informed post-processing of stabilized finite element solutions for transient convection-dominated problems
标题:瞬态对流主导问题稳定化有限元解的物理信息后处理
链接:https://arxiv.org/abs/2603.03259

作者:Süleyman Cengizci,Ömür Uğur,Srinivasan Natesan
摘要:The numerical simulation of convection-dominated transient transport phenomena poses significant computational challenges due to sharp gradients and propagating fronts across the spatiotemporal domain. Classical discretization methods often generate spurious oscillations, requiring advanced stabilization techniques. However, even stabilized finite element methods may require additional regularization to accurately resolve localized steep layers. On the other hand, standalone physics-informed neural networks (PINNs) struggle to capture sharp solution structures in convection-dominated regimes and typically require a large number of training epochs. This work presents a hybrid computational framework that extends the PINN-Augmented SUPG with Shock-Capturing (PASSC) methodology from steady to unsteady problems. The approach combines a semi-discrete stabilized finite element method with a PINN-based correction strategy for transient convection-diffusion-reaction equations. Stabilization is achieved using the Streamline-Upwind Petrov-Galerkin (SUPG) formulation augmented with a YZbeta shock-capturing operator. Rather than training over the entire space-time domain, the neural network is applied selectively near the terminal time, enhancing the finite element solution using the last K_s temporal snapshots while enforcing residual constraints from the governing equations and boundary conditions. The network incorporates residual blocks with random Fourier features and employs progressive training with adaptive loss weighting. Numerical experiments on five benchmark problems, including boundary and interior layers, traveling waves, and nonlinear Burgers dynamics, demonstrate significant accuracy improvements at the terminal time compared to standalone stabilized finite element solutions.


【4】Speculative Speculative Decoding
标题:投机投机解码
链接:https://arxiv.org/abs/2603.03251

作者:Tanishq Kumar,Tri Dao,Avner May
摘要:Autoregressive decoding is bottlenecked by its sequential nature. Speculative decoding has become a standard way to accelerate inference by using a fast draft model to predict upcoming tokens from a slower target model, and then verifying them in parallel with a single target model forward pass. However, speculative decoding itself relies on a sequential dependence between speculation and verification. We introduce speculative speculative decoding (SSD) to parallelize these operations. While a verification is ongoing, the draft model predicts likely verification outcomes and prepares speculations pre-emptively for them. If the actual verification outcome is then in the predicted set, a speculation can be returned immediately, eliminating drafting overhead entirely. We identify three key challenges presented by speculative speculative decoding, and suggest principled methods to solve each. The result is Saguaro, an optimized SSD algorithm. Our implementation is up to 2x faster than optimized speculative decoding baselines and up to 5x faster than autoregressive decoding with open source inference engines.


【5】Infinite dimensional generative sensing
标题:无限维生成传感
链接:https://arxiv.org/abs/2603.03196

作者:Paolo Angella,Vito Paolo Pastore,Matteo Santacesaria
摘要:Deep generative models have become a standard for modeling priors for inverse problems, going beyond classical sparsity-based methods. However, existing theoretical guarantees are mostly confined to finite-dimensional vector spaces, creating a gap when the physical signals are modeled as functions in Hilbert spaces. This work presents a rigorous framework for generative compressed sensing in Hilbert spaces. We extend the notion of local coherence in an infinite-dimensional setting, to derive optimal, resolution-independent sampling distributions. Thanks to a generalization of the Restricted Isometry Property, we show that stable recovery holds when the number of measurements is proportional to the prior's intrinsic dimension (up to logarithmic factors), independent of the ambient dimension. Finally, numerical experiments on the Darcy flow equation validate our theoretical findings and demonstrate that in severely undersampled regimes, employing lower-resolution generators acts as an implicit regularizer, improving reconstruction stability.


【6】Less Noise, Same Certificate: Retain Sensitivity for Unlearning
标题:更少的噪声,相同的证书:用于遗忘学习的保留敏感度
链接:https://arxiv.org/abs/2603.03172

作者:Carolin Heinzler,Kasra Malihi,Amartya Sanyal
摘要:Certified machine unlearning aims to provably remove the influence of a deletion set $U$ from a model trained on a dataset $S$, by producing an unlearned output that is statistically indistinguishable from retraining on the retain set $R:=S\setminus U$. Many existing certified unlearning methods adapt techniques from Differential Privacy (DP) and add noise calibrated to global sensitivity, i.e., the worst-case output change over all adjacent datasets. We show that this DP-style calibration is often overly conservative for unlearning, based on a key observation: certified unlearning, by definition, does not require protecting the privacy of the retained data $R$. Motivated by this distinction, we define retain sensitivity as the worst-case output change over deletions $U$ while keeping $R$ fixed. While insufficient for DP, retain sensitivity is exactly sufficient for unlearning, allowing for the same certificates with less noise. We validate these reductions in noise theoretically and empirically across several problems, including the weight of minimum spanning trees, PCA, and ERM. Finally, we refine the analysis of two widely used certified unlearning algorithms through the lens of retain sensitivity, leveraging the regularity induced by $R$ to further reduce noise and improve utility.


【7】Torus embeddings
标题:环面嵌入
链接:https://arxiv.org/abs/2603.03135

作者:Dan Stowell
摘要:Many data representations are vectors of continuous values. In particular, deep learning embeddings are data-driven representations, typically either unconstrained in Euclidean space, or constrained to a hypersphere. These may also be translated into integer representations (quantised) for efficient large-scale use. However, the fundamental (and most efficient) numeric representation in the overwhelming majority of existing computers is integers with overflow -- and vectors of these integers do not correspond to either of these spaces, but instead to the topology of a (hyper)torus. This mismatch can lead to wasted representation capacity. Here we show that common deep learning frameworks can be adapted, quite simply, to create representations with inherent toroidal topology. We investigate two alternative strategies, demonstrating that a normalisation-based strategy leads to training with desirable stability and performance properties, comparable to a standard hyperspherical L2 normalisation. We also demonstrate that a torus embedding maintains desirable quantisation properties. The torus embedding does not outperform hypersphere embeddings in general, but is comparable, and opens the possibility to train deep embeddings which have an extremely simple pathway to efficient `TinyML' embedded implementation.
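
One simple way to realize a normalisation-based torus constraint, sketched below under our own assumptions: pair up the embedding coordinates and L2-normalise each pair, so the representation lives on a product of circles (a hypertorus), in direct analogy to hyperspherical L2 normalisation.

```python
# Torus normalisation sketch: each coordinate pair is projected onto the unit circle.
import torch

def torus_normalize(z, eps=1e-8):
    # z: (..., 2k) raw embedding; each consecutive (x, y) pair becomes a point on S^1
    pairs = z.view(*z.shape[:-1], -1, 2)
    pairs = pairs / (pairs.norm(dim=-1, keepdim=True) + eps)
    return pairs.view(*z.shape)

z = torch.randn(4, 16, requires_grad=True)
t = torus_normalize(z)
print(t.view(4, -1, 2).norm(dim=-1))   # all ones: every pair sits on the unit circle
```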


【8】Joint Training Across Multiple Activation Sparsity Regimes
标题:跨多种激活稀疏状态的联合训练
链接:https://arxiv.org/abs/2603.03131

作者:Haotian Wang
摘要:Generalization in deep neural networks remains only partially understood. Inspired by the stronger generalization tendency of biological systems, we explore the hypothesis that robust internal representations should remain effective across both dense and sparse activation regimes. To test this idea, we introduce a simple training strategy that applies global top-k constraints to hidden activations and repeatedly cycles a single model through multiple activation budgets via progressive compression and periodic reset. Using CIFAR-10 without data augmentation and a WRN-28-4 backbone, we find in single-run experiments that two adaptive keep-ratio control strategies both outperform dense baseline training. These preliminary results suggest that joint training across multiple activation sparsity regimes may provide a simple and effective route to improved generalization.
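
A minimal PyTorch version of a global top-k activation constraint (a hypothetical helper in the spirit of the strategy, not the paper's code) might look like this, cycled over several keep ratios:

```python
# Global top-k activation constraint: keep the k largest-magnitude activations per sample.
import torch

def global_topk(h, keep_ratio):
    flat = h.flatten(1)                                   # (batch, features)
    k = max(1, int(keep_ratio * flat.shape[1]))
    thresh = flat.abs().topk(k, dim=1).values[:, -1:]     # per-sample k-th magnitude
    mask = (flat.abs() >= thresh).float()
    return (flat * mask).view_as(h)

h = torch.randn(2, 64, 8, 8)
for keep in (1.0, 0.5, 0.1):                              # cycle through activation budgets
    print(keep, int(global_topk(h, keep).ne(0).sum()))
```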


【9】Why Adam Can Beat SGD: Second-Moment Normalization Yields Sharper Tails
标题:为什么Adam可以击败SGD:二阶矩归一化产生更尖锐的尾部
链接:https://arxiv.org/abs/2603.03099

作者:Ruinan Jin,Yingbin Liang,Shaofeng Zou
备注:59 pages
摘要:Despite Adam demonstrating faster empirical convergence than SGD in many applications, much of the existing theory yields guarantees essentially comparable to those of SGD, leaving the empirical performance gap insufficiently explained. In this paper, we uncover a key second-moment normalization in Adam and develop a stopping-time/martingale analysis that provably distinguishes Adam from SGD under the classical bounded variance model (a second moment assumption). In particular, we establish the first theoretical separation between the high-probability convergence behaviors of the two methods: Adam achieves a $δ^{-1/2}$ dependence on the confidence parameter $δ$, whereas corresponding high-probability guarantee for SGD necessarily incurs at least a $δ^{-1}$ dependence.


【10】IoUCert: Robustness Verification for Anchor-based Object Detectors
标题:IoUCert:基于锚框的目标检测器的鲁棒性验证
链接:https://arxiv.org/abs/2603.03043

作者:Benedikt Brückner,Alejandro Mercado,Yanghao Zhang,Panagiotis Kouvaros,Alessio Lomuscio
摘要:While formal robustness verification has seen significant success in image classification, scaling these guarantees to object detection remains notoriously difficult due to complex non-linear coordinate transformations and Intersection-over-Union (IoU) metrics. We introduce IoUCert, a novel formal verification framework designed specifically to overcome these bottlenecks in foundational anchor-based object detection architectures. Focusing on the object localisation component in single-object settings, we propose a coordinate transformation that enables our algorithm to circumvent precision-degrading relaxations of non-linear box prediction functions. This allows us to optimise bounds directly with respect to the anchor box offsets which enables a novel Interval Bound Propagation method that derives optimal IoU bounds. We demonstrate that our method enables, for the first time, the robustness verification of realistic, anchor-based models including SSD, YOLOv2, and YOLOv3 variants against various input perturbations.


【11】Why Does RLAIF Work At All?
标题:为什么RLAIF有效?
链接:https://arxiv.org/abs/2603.03000

作者:Robin Young
摘要:Reinforcement Learning from AI Feedback (RLAIF) enables language models to improve by training on their own preference judgments, yet no theoretical account explains why this self-improvement seemingly works for value learning. We propose the latent value hypothesis, that pretraining on internet-scale data encodes human values as directions in representation space, and constitutional prompts elicit these latent values into preference judgments. We formalize this intuition under a linear model where the constitution acts as a projection operator selecting value-relevant directions. Our analysis yields several results. RLAIF improves alignment when the constitution-activated direction correlates with true values better than the model's default generation direction thus explaining the generation-judgment gap; the ceiling on RLAIF quality is determined by how well representations encode values, which scales with model capacity; and adversarial constitutions exist that can activate anti-social value directions encoded from harmful pretraining data. Our account unifies scattered empirical findings including the refusal direction, low-rank safety subspaces, and RLAIF scaling behavior.


【12】Rethinking Time Series Domain Generalization via Structure-Stratified Calibration
标题:通过结构分层校准重新思考时间序列领域泛化
链接:https://arxiv.org/abs/2603.02756

作者:Jinyang Li,Shuhao Mei,Xiaoyu Xiao,Shuhang Li,Ruoxi Yun,Jinbo Sun
摘要:For time series arising from latent dynamical systems, existing cross-domain generalization methods commonly assume that samples are comparably meaningful within a shared representation space. In real-world settings, however, different datasets often originate from structurally heterogeneous families of dynamical systems, leading to fundamentally distinct feature distributions. Under such circumstances, performing global alignment while neglecting structural differences is highly prone to establishing spurious correspondences and inducing negative transfer. From the new perspective of cross-domain structural correspondence failure, we revisit this problem and propose a structurally stratified calibration framework (SSCF). This approach explicitly distinguishes structurally consistent samples and performs amplitude calibration exclusively within structurally compatible sample clusters, thereby effectively alleviating generalization failures caused by structural incompatibility. Notably, the proposed framework achieves substantial performance improvements through a concise and computationally efficient calibration strategy. Evaluations on 19 public datasets (100.3k samples) demonstrate that SSCF significantly outperforms strong baselines under the zero-shot setting. These results confirm that establishing structural consistency prior to alignment constitutes a more reliable and effective pathway for improving cross-domain generalization of time series governed by latent dynamical systems.


【13】The power of small initialization in noisy low-tubal-rank tensor recovery
标题:小初始化在含噪低管秩张量恢复中的威力
链接:https://arxiv.org/abs/2603.02729

作者:Zhiyu Liu,Haobo Geng,Xudong Wang,Yandong Tang,Zhi Han,Yao Wang
摘要:We study the problem of recovering a low-tubal-rank tensor $\mathcal{X}_\star \in \mathbb{R}^{n \times n \times k}$ from noisy linear measurements under the t-product framework. A widely adopted strategy involves factorizing the optimization variable as $\mathcal{U} * \mathcal{U}^\top$, where $\mathcal{U} \in \mathbb{R}^{n \times R \times k}$, followed by applying factorized gradient descent (FGD) to solve the resulting optimization problem. Since the tubal-rank $r$ of the underlying tensor $\mathcal{X}_\star$ is typically unknown, this method often assumes $r < R \le n$, a regime known as over-parameterization. However, when the measurements are corrupted by some dense noise (e.g., Gaussian noise), FGD with the commonly used spectral initialization yields a recovery error that grows linearly with the over-estimated tubal-rank $R$. To address this issue, we show that using a small initialization enables FGD to achieve a nearly minimax optimal recovery error, even when the tubal-rank $R$ is significantly overestimated. Using a four-stage analytic framework, we analyze this phenomenon and establish the sharpest known error bound to date, which is independent of the overestimated tubal-rank $R$. Furthermore, we provide a theoretical guarantee showing that an easy-to-use early stopping strategy can achieve the best known result in practice. All these theoretical findings are validated through a series of simulations and real-data experiments.
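
To convey the phenomenon without the t-product machinery, the NumPy sketch below works in the simpler matrix analogue: factorized gradient descent with a deliberately small initialization recovers a rank-2 PSD matrix from noisy measurements even with the rank overestimated to R = 10. All constants are illustrative.

```python
# Matrix-case analogue: small-initialization factorized gradient descent, R >> r.
import numpy as np

rng = np.random.default_rng(0)
n, r, R, m = 20, 2, 10, 800
Ustar = rng.standard_normal((n, r))
Xstar = Ustar @ Ustar.T                                  # rank-2 ground truth
A = rng.standard_normal((m, n, n)) / np.sqrt(m)          # Gaussian measurement operators
y = np.einsum('mij,ij->m', A, Xstar) + 0.01 * rng.standard_normal(m)

U = 1e-3 * rng.standard_normal((n, R))                   # small initialization is the point
eta = 0.002
for _ in range(2000):
    res = np.einsum('mij,ij->m', A, U @ U.T) - y         # measurement residuals
    G = 2 * np.einsum('m,mij->ij', res, A)
    U -= eta * (G + G.T) @ U                             # factorized gradient step
print("relative error:", np.linalg.norm(U @ U.T - Xstar) / np.linalg.norm(Xstar))
```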


【14】FinTexTS: Financial Text-Paired Time-Series Dataset via Semantic-Based and Multi-Level Pairing
标题:FinTexTS:通过基于语义和多层配对的金融文本配对时间序列数据集
链接:https://arxiv.org/abs/2603.02702

作者:Jaehoon Lee,Suhwan Park,Tae Yoon Lim,Seunghan Lee,Jun Seo,Dongwan Kang,Hwanil Choi,Minjae Kim,Sungdong Yoo,SoonYoung Lee,Yongjae Lee,Wonbin Ahn
备注:14 pages
摘要:The financial domain involves a variety of important time-series problems. Recently, time-series analysis methods that jointly leverage textual and numerical information have gained increasing attention. Accordingly, numerous efforts have been made to construct text-paired time-series datasets in the financial domain. However, financial markets are characterized by complex interdependencies, in which a company's stock price is influenced not only by company-specific events but also by events in other companies and broader macroeconomic factors. Existing approaches that pair text with financial time-series data based on simple keyword matching often fail to capture such complex relationships. To address this limitation, we propose a semantic-based and multi-level pairing framework. Specifically, we extract company-specific context for the target company from SEC filings and apply an embedding-based matching mechanism to retrieve semantically relevant news articles based on this context. Furthermore, we classify news articles into four levels (macro-level, sector-level, related company-level, and target-company level) using large language models (LLMs), enabling multi-level pairing of news articles with the target company. Applying this framework to publicly-available news datasets, we construct FinTexTS, a new large-scale text-paired stock price dataset. Experimental results on FinTexTS demonstrate the effectiveness of our semantic-based and multi-level pairing strategy in stock price forecasting. In addition to publicly-available news underlying FinTexTS, we show that applying our method to proprietary yet carefully curated news sources leads to higher-quality paired data and improved stock price forecasting performance.


【15】Addressing Missing and Noisy Modalities in One Solution: Unified Modality-Quality Framework for Low-quality Multimodal Data
标题:以一个方案同时解决缺失与含噪模态:面向低质量多模态数据的统一模态-质量框架
链接:https://arxiv.org/abs/2603.02695

作者:Sijie Mai,Shiqin Han,Haifeng Hu
摘要:Multimodal data encountered in real-world scenarios are typically of low quality, with noisy modalities and missing modalities being typical forms that severely hinder model performance and robustness. However, prior works often handle noisy and missing modalities separately. In contrast, we jointly address missing and noisy modalities to enhance model robustness in low-quality data scenarios. We regard both noisy and missing modalities as a unified low-quality modality problem, and propose a unified modality-quality (UMQ) framework to enhance low-quality representations for multimodal affective computing. Firstly, we train a quality estimator with explicit supervised signals via a rank-guided training strategy that compares the relative quality of different representations by adding a ranking constraint, avoiding training noise caused by inaccurate absolute quality labels. Then, a quality enhancer for each modality is constructed, which uses the sample-specific information provided by other modalities and the modality-specific information provided by the defined modality baseline representation to enhance the quality of unimodal representations. Finally, we propose a quality-aware mixture-of-experts module with particular routing mechanism to enable multiple modality-quality problems to be addressed more specifically. UMQ consistently outperforms state-of-the-art baselines on multiple datasets under the settings of complete, missing, and noisy modalities.


【16】From Shallow to Deep: Pinning Semantic Intent via Causal GRPO
标题:从浅到深:通过因果GRPO锁定语义意图
链接:https://arxiv.org/abs/2603.02675

作者:Shuyi Zhou,Zeen Song,Wenwen Qiang,Jiyan Sun,Yao Zhou,Yinlong Liu,Wei Ma
摘要:Large Language Models remain vulnerable to adversarial prefix attacks (e.g., ``Sure, here is'') despite robust standard safety. We diagnose this vulnerability as Shallow Safety Alignment, stemming from a pathology we term semantic representation decay: as the model generates compliant prefixes, its internal malicious intent signal fades. To address this, we propose Two-Stage Causal-GRPO (TSC-GRPO), a framework designed to achieve intent pinning. First, grounded in causal identifiability theory, we train a causal intent probe to disentangle invariant intent from stylistic perturbations. Second, we internalize this causal awareness into the policy via Group Relative Policy Optimization. By employing a cumulative causal penalty within ``fork-in-the-road'' training scenarios, we force the model to learn that accumulating harmful tokens monotonically decreases reward, enabling robust late-stage refusals. Experiments show that TSC-GRPO significantly outperforms baselines in defending against jailbreak attacks while preserving general utility.


【17】SorryDB: Can AI Provers Complete Real-World Lean Theorems?
标题:SorryDB:AI证明器能否补全现实世界的Lean定理?
链接:https://arxiv.org/abs/2603.02668

作者:Austin Letson,Leopoldo Sarra,Auguste Poiroux,Oliver Dressler,Paul Lezeau,Dhyan Aranha,Frederick Pu,Aaron Hill,Miguel Corredera Hidalgo,Julian Berman,George Tsoukalas,Lenny Taelman
摘要:We present SorryDB, a dynamically-updating benchmark of open Lean tasks drawn from 78 real world formalization projects on GitHub. Unlike existing static benchmarks, often composed of competition problems, hillclimbing the SorryDB benchmark will yield tools that are aligned to the community needs, more usable by mathematicians, and more capable of understanding complex dependencies. Moreover, by providing a continuously updated stream of tasks, SorryDB mitigates test-set contamination and offers a robust metric for an agent's ability to contribute to novel formal mathematics projects. We evaluate a collection of approaches, including generalist large language models, agentic approaches, and specialized symbolic provers, over a selected snapshot of 1000 tasks from SorryDB. We show that current approaches are complementary: even though an agentic approach based on Gemini Flash is the most performant, it is not strictly better than other off-the-shelf large-language models, specialized provers, or even a curated list of Lean tactics.


【18】HomeAdam: Adam and AdamW Algorithms Sometimes Go Home to Obtain Better Provable Generalization
标题:HomeAdam:Adam和AdamW算法有时会回家以获得更好的可证明泛化
链接:https://arxiv.org/abs/2603.02649

作者:Feihu Huang,Guanyi Zhang,Songcan Chen
备注:39 pages
摘要:Adam and AdamW are a class of default optimizers for training deep learning models in machine learning. These adaptive algorithms converge faster but generalize worse compared to SGD. In fact, their proved generalization error $O(\frac{1}{\sqrt{N}})$ also is larger than $O(\frac{1}{N})$ of SGD, where $N$ denotes training sample size. Recently, although some variants of Adam have been proposed to improve its generalization, their improved generalizations are still unexplored in theory. To fill this gap, in the paper, we restudy generalization of Adam and AdamW via algorithmic stability, and first prove that Adam and AdamW without square-root (i.e., Adam(W)-srf) have a generalization error $O(\frac{\hat{ρ}^{-2T}}{N})$, where $T$ denotes iteration number and $\hat{ρ}>0$ denotes the smallest element of second-order momentum plus a small positive number. To improve generalization, we propose a class of efficient clever Adam (i.e., HomeAdam(W)) algorithms via sometimes returning momentum-based SGD. Moreover, we prove that our HomeAdam(W) have a smaller generalization error $O(\frac{1}{N})$ than $O(\frac{\hat{ρ}^{-2T}}{N})$ of Adam(W)-srf, since $\hat{ρ}$ is generally very small. In particular, it is also smaller than the existing $O(\frac{1}{\sqrt{N}})$ of Adam(W). Meanwhile, we prove our HomeAdam(W) have a faster convergence rate of $O(\frac{1}{T^{1/4}})$ than $O(\frac{\breve{ρ}^{-1}}{T^{1/4}})$ of the Adam(W)-srf, where $\breve{ρ}\leq\hat{ρ}$ also is very small. Extensive numerical experiments demonstrate efficiency of our HomeAdam(W) algorithms.
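
Schematically, the "sometimes go home" idea can be mocked up in PyTorch by occasionally taking a momentum-SGD step in place of the Adam step; the fixed every-5th-step schedule and learning rates below are our own illustrative assumptions, not the paper's schedule.

```python
# Illustrative optimizer switching: mostly Adam, periodically momentum SGD ("home").
import torch

model = torch.nn.Linear(10, 1)
adam = torch.optim.Adam(model.parameters(), lr=1e-3)
sgdm = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)

for t in range(100):
    x, y = torch.randn(32, 10), torch.randn(32, 1)
    loss = torch.nn.functional.mse_loss(model(x), y)
    adam.zero_grad()
    sgdm.zero_grad()
    loss.backward()
    # "go home" on a fixed schedule: a momentum-SGD step replaces the Adam step
    (sgdm if t % 5 == 0 else adam).step()
print(float(loss))
```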


【19】Uni-Skill: Building Self-Evolving Skill Repository for Generalizable Robotic Manipulation
标题:Uni-Skill:为可推广的机器人操纵构建自我进化的技能库
链接:https://arxiv.org/abs/2603.02623

作者:Senwei Xie,Yuntian Zhang,Ruiping Wang,Xilin Chen
备注:Accepted to ICRA2026
摘要:While skill-centric approaches leverage foundation models to enhance generalization in compositional tasks, they often rely on fixed skill libraries, limiting adaptability to new tasks without manual intervention. To address this, we propose Uni-Skill, a Unified Skill-centric framework that supports skill-aware planning and facilitates automatic skill evolution. Unlike prior methods that restrict planning to predefined skills, Uni-Skill requests for new skill implementations when existing ones are insufficient, ensuring adaptable planning with self-augmented skill library. To support automatic implementation of diverse skills requested by the planning module, we construct SkillFolder, a VerbNet-inspired repository derived from large-scale unstructured robotic videos. SkillFolder introduces a hierarchical skill taxonomy that captures diverse skill descriptions at multiple levels of abstraction. By populating this taxonomy with large-scale, automatically annotated demonstrations, Uni-Skill shifts the paradigm of skill acquisition from inefficient manual annotation to efficient offline structural retrieval. Retrieved examples provide semantic supervision over behavior patterns and fine-grained references for spatial trajectories, enabling few-shot skill inference without deployment-time demonstrations. Comprehensive experiments in both simulation and real-world settings verify the state-of-the-art performance of Uni-Skill over existing VLM-based skill-centric approaches, highlighting its advanced reasoning capabilities and strong zero-shot generalization across a wide range of novel tasks.


【20】Same Error, Different Function: The Optimizer as an Implicit Prior in Financial Time Series
标题:相同的误差,不同的函数:优化器作为金融时间序列中的隐式先验
链接:https://arxiv.org/abs/2603.02620

作者:Federico Vittorio Cortesi,Giuseppe Iannone,Giulia Crippa,Tomaso Poggio,Pierfrancesco Beneventano
备注:39 pages, 24 figures
摘要:Neural networks applied to financial time series operate in a regime of underspecification, where model predictors achieve indistinguishable out-of-sample error. Using large-scale volatility forecasting for S&P 500 stocks, we show that different model-training-pipeline pairs with identical test loss learn qualitatively different functions. Across architectures, predictive accuracy remains unchanged, yet optimizer choice reshapes non-linear response profiles and temporal dependence differently. These divergences have material consequences for decisions: volatility-ranked portfolios trace a near-vertical Sharpe-turnover frontier, with nearly $3\times$ turnover dispersion at comparable Sharpe ratios. We conclude that in underspecified settings, optimization acts as a consequential source of inductive bias; model evaluation should therefore extend beyond scalar loss to encompass functional and decision-level implications.


【21】GPUTOK: GPU Accelerated Byte Level BPE Tokenization
标题:GPUTOK:GPU加速的字节级BPE分词
链接:https://arxiv.org/abs/2603.02597

作者:Venu Gopal Kadamba,Kanishkha Jaisankar
摘要:As large language models move toward million-token context windows, CPU tokenizers become a major slowdown because they process text one step at a time while powerful GPUs sit unused. We built a GPU-based byte-level BPE tokenizer that follows GPT-2's merge rules. It includes a basic BlockBPE-style kernel and a faster, optimized version that uses cuCollections static map, CUB reductions, and a pybind11 interface for Python. On WikiText103 sequences up to 131k tokens, the optimized GPU tokenizer produces the same tokens as a CPU version and, for the longest inputs, is about 1.7x faster than tiktoken and about 7.6x faster than the HuggingFace GPT-2 tokenizer. Nsight profiling shows that 70-80% of CUDA API time goes to memory allocation, so adding memory pooling should give the biggest speed boost next. Tests on generation tasks using WikiText103 prompts show that our GPU tokenizer's outputs stay within about one percentage point of tiktoken and HuggingFace GPT-2 on similarity and overlap metrics, meaning it keeps output quality while making long-context inference more practical.
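For reference, the merge loop that the GPU kernels parallelize can be written as a short CPU sketch (simplified: real GPT-2 tokenization first splits text with a regex pre-tokenizer, omitted here; ranks maps a byte pair to its merge priority):

def bpe_encode(text, ranks):
    # Start from raw UTF-8 bytes and repeatedly merge the adjacent pair
    # with the lowest merge rank (i.e., the earliest-learned merge).
    parts = [bytes([b]) for b in text.encode("utf-8")]
    while True:
        candidates = [(ranks[(parts[i], parts[i + 1])], i)
                      for i in range(len(parts) - 1)
                      if (parts[i], parts[i + 1]) in ranks]
        if not candidates:
            break  # no adjacent pair is in the merge table
        _, i = min(candidates)
        parts[i:i + 2] = [parts[i] + parts[i + 1]]
    return parts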


【22】Wasserstein Proximal Policy Gradient
标题:沃瑟斯坦近端策略梯度
链接:https://arxiv.org/abs/2603.02576

作者:Zhaoyu Zhu,Shuhan Zhang,Rui Gao,Shuang Li
摘要:We study policy gradient methods for continuous-action, entropy-regularized reinforcement learning through the lens of Wasserstein geometry. Starting from a Wasserstein proximal update, we derive Wasserstein Proximal Policy Gradient (WPPG) via an operator-splitting scheme that alternates an optimal transport update with a heat step implemented by Gaussian convolution. This formulation avoids evaluating the policy's log density or its gradient, making the method directly applicable to expressive implicit stochastic policies specified as pushforward maps. We establish a global linear convergence rate for WPPG, covering both exact policy evaluation and actor-critic implementations with controlled approximation error. Empirically, WPPG is simple to implement and attains competitive performance on standard continuous-control benchmarks.


【23】FlashEvaluator: Expanding Search Space with Parallel Evaluation
标题:FlashEvaluator:通过并行评估扩展搜索空间
链接:https://arxiv.org/abs/2603.02565

作者:Chao Feng,Yuanhao Pu,Chenghao Zhang,Shanqi Liu,Shuchang Liu,Xiang Li,Yongqi Liu,Lantao Hu,Kaiqiao Zhan,Han Li,Kun Gai
备注:23 pages, 2 figures
摘要:The Generator-Evaluator (G-E) framework, i.e., evaluating K sequences from a generator and selecting the top-ranked one according to evaluator scores, is a foundational paradigm in tasks such as Recommender Systems (RecSys) and Natural Language Processing (NLP). Traditional evaluators process sequences independently, suffering from two major limitations: (1) lack of explicit cross-sequence comparison, leading to suboptimal accuracy; (2) poor parallelization with linear complexity of O(K), resulting in inefficient resource utilization and negative impact on both throughput and latency. To address these challenges, we propose FlashEvaluator, which enables cross-sequence token information sharing and processes all sequences in a single forward pass. This yields sublinear computational complexity that improves the system's efficiency and supports direct inter-sequence comparisons that improve selection accuracy. The paper also provides theoretical proofs and extensive experiments on recommendation and NLP tasks, demonstrating clear advantages over conventional methods. Notably, FlashEvaluator has been deployed in online recommender system of Kuaishou, delivering substantial and sustained revenue gains in practice.


【24】Give me scissors: Collision-Free Dual-Arm Surgical Assistive Robot for Instrument Delivery
标题:给我剪刀:用于器械输送的无碰撞双臂手术辅助机器人
链接:https://arxiv.org/abs/2603.02553

作者:Xuejin Luo,Shiquan Sun,Runshi Zhang,Ruizhi Zhang,Junchen Wang
备注:8 pages, 10 figures. Accepted by IEEE International Conference on Robotics and Automation (ICRA), 2026
摘要:During surgery, scrub nurses are required to frequently deliver surgical instruments to surgeons, which can lead to physical fatigue and decreased focus. Robotic scrub nurses provide a promising solution that can replace repetitive tasks and enhance efficiency. Existing research on robotic scrub nurses relies on predefined paths for instrument delivery, which limits their generalizability and poses safety risks in dynamic environments. To address these challenges, we present a collision-free dual-arm surgical assistive robot capable of performing instrument delivery. A vision-language model is utilized to automatically generate the robot's grasping and delivery trajectories in a zero-shot manner based on surgeons' instructions. A real-time obstacle minimum distance perception method is proposed and integrated into a unified quadratic programming framework. This framework ensures reactive obstacle avoidance and self-collision prevention during the dual-arm robot's autonomous movement in dynamic environments. Extensive experimental validations demonstrate that the proposed robotic system achieves an 83.33% success rate in surgical instrument delivery while maintaining smooth, collision-free movement throughout all trials. The project page and source code are available at https://give-me-scissors.github.io/.


【25】Functional Properties of the Focal-Entropy
标题:焦点熵的函数性质
链接:https://arxiv.org/abs/2603.02533

作者:Jaimin Shah,Martina Cardone,Alex Dytso
备注:Accepted to AISTATS 2026
摘要:The focal-loss has become a widely used alternative to cross-entropy in class-imbalanced classification problems, particularly in computer vision. Despite its empirical success, a systematic information-theoretic study of the focal-loss remains incomplete. In this work, we adopt a distributional viewpoint and study the focal-entropy, a focal-loss analogue of the cross-entropy. Our analysis establishes conditions for finiteness, convexity, and continuity of the focal-entropy, and provides various asymptotic characterizations. We prove the existence and uniqueness of the focal-entropy minimizer, describe its structure, and show that it can depart significantly from the data distribution. In particular, we rigorously show that the focal-loss amplifies mid-range probabilities, suppresses high-probability outcomes, and, under extreme class imbalance, induces an over-suppression regime in which very small probabilities are further diminished. These results, which are also experimentally validated, offer a theoretical foundation for understanding the focal-loss and clarify the trade-offs that it introduces when applied to imbalanced learning tasks.
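The abstract does not restate the definition, but by analogy with how the focal loss modifies cross-entropy, a plausible form of the focal-entropy (an assumption here, not taken from the paper) is $H_\gamma(p,q) = -\sum_i p_i (1-q_i)^\gamma \log q_i$, which recovers the cross-entropy at $\gamma=0$. A small numeric check of the suppression behavior it describes:

import numpy as np

def focal_entropy(p, q, gamma=2.0):
    # The (1 - q)^gamma factor down-weights outcomes the model already
    # assigns high probability; gamma = 0 gives plain cross-entropy.
    p, q = np.asarray(p, float), np.asarray(q, float)
    return -np.sum(p * (1.0 - q) ** gamma * np.log(q))

p = np.array([0.9, 0.09, 0.01])   # imbalanced reference distribution
q = np.array([0.8, 0.15, 0.05])   # model distribution
print(focal_entropy(p, q, gamma=0.0))  # plain cross-entropy
print(focal_entropy(p, q, gamma=2.0))  # high-probability term suppressed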


【26】Bridging Diffusion Guidance and Anderson Acceleration via Hopfield Dynamics
标题:通过霍普菲尔德动力学连接扩散引导和安德森加速
链接:https://arxiv.org/abs/2603.02531

作者:Kwanyoung Kim
备注:24 pages, 11 figures
摘要:Classifier-Free Guidance (CFG) has significantly enhanced the generative quality of diffusion models by extrapolating between conditional and unconditional outputs. However, its high inference cost and limited applicability to distilled or single-step models have shifted research focus toward attention-space extrapolation. While these methods offer computational efficiency, their theoretical underpinnings remain elusive. In this work, we establish a foundational framework for attention-space extrapolation by modeling attention dynamics as fixed-point iterations within Modern Hopfield Networks. We demonstrate that the extrapolation effect in attention space constitutes a special case of Anderson Acceleration applied to these dynamics. Building on this insight and the weak contraction property, we propose Geometry Aware Attention Guidance (GAG). By decomposing attention updates into parallel and orthogonal components relative to the guidance direction, GAG stabilizes the acceleration process and maximizes guidance efficiency. Our plug-and-play method seamlessly integrates with existing frameworks while significantly improving generation quality.
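A minimal numpy sketch of the decomposition GAG describes, with illustrative gains (the paper's tuned values and the exact form of the update are not given in the abstract):

import numpy as np

def gag_step(h, delta, d, w_par=2.0, w_orth=1.0):
    # Split an attention-space update `delta` into components parallel
    # and orthogonal to the guidance direction `d`, then apply each
    # with its own gain to stabilize the accelerated iteration.
    u = d / (np.linalg.norm(d) + 1e-8)
    par = np.dot(delta, u) * u
    orth = delta - par
    return h + w_par * par + w_orth * orth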


【27】ParEVO: Synthesizing Code for Irregular Data: High-Performance Parallelism through Agentic Evolution
标题:ParEVO:为不规则数据合成代码:通过智能体进化实现高性能并行
链接:https://arxiv.org/abs/2603.02510

作者:Liu Yang,Zeyu Nie,Andrew Liu,Felix Zou,Deniz Altinbüken,Amir Yazdanbakhsh,Quanquan C. Liu
摘要:The transition from sequential to parallel computing is essential for modern high-performance applications but is hindered by the steep learning curve of concurrent programming. This challenge is magnified for irregular data structures (such as sparse graphs, unbalanced trees, and non-uniform meshes) where static scheduling fails and data dependencies are unpredictable. Current Large Language Models (LLMs) often fail catastrophically on these tasks, generating code plagued by subtle race conditions, deadlocks, and sub-optimal scaling. We bridge this gap with ParEVO, a framework designed to synthesize high-performance parallel algorithms for irregular data. Our contributions include: (1) The Parlay-Instruct Corpus, a curated dataset of 13,820 tasks synthesized via a "Critic-Refine" pipeline that explicitly filters for empirically performant algorithms that effectively utilize Work-Span parallel primitives; (2) specialized DeepSeek, Qwen, and Gemini models fine-tuned to align probabilistic generation with the rigorous semantics of the ParlayLib library; and (3) an Evolutionary Coding Agent (ECA) that improves the "last mile" of correctness by iteratively repairing code using feedback from compilers, dynamic race detectors, and performance profilers. On the ParEval benchmark, ParEVO achieves an average 106x speedup (with a maximum of 1103x) across the suite, and a robust 13.6x speedup specifically on complex irregular graph problems, outperforming state-of-the-art commercial models. Furthermore, our evolutionary approach matches state-of-the-art expert human baselines, achieving up to a 4.1x speedup on specific highly-irregular kernels. Source code and datasets are available at https://github.com/WildAlg/ParEVO.


【28】Rigidity-Aware Geometric Pretraining for Protein Design and Conformational Ensembles
标题:面向蛋白质设计与构象系综的刚性感知几何预训练
链接:https://arxiv.org/abs/2603.02406

作者:Zhanghan Ni,Yanjing Li,Zeju Qiu,Bernhard Schölkopf,Hongyu Guo,Weiyang Liu,Shengchao Liu
备注:The Fourteenth International Conference on Learning Representations
摘要:Generative models have recently advanced $\textit{de novo}$ protein design by learning the statistical regularities of natural structures. However, current approaches face three key limitations: (1) Existing methods cannot jointly learn protein geometry and design tasks, where pretraining can be a solution; (2) Current pretraining methods mostly rely on local, non-rigid atomic representations for property prediction downstream tasks, limiting global geometric understanding for protein generation tasks; and (3) Existing approaches have yet to effectively model the rich dynamic and conformational information of protein structures. To overcome these issues, we introduce $\textbf{RigidSSL}$ ($\textit{Rigidity-Aware Self-Supervised Learning}$), a geometric pretraining framework that front-loads geometry learning prior to generative finetuning. Phase I (RigidSSL-Perturb) learns geometric priors from 432K structures from the AlphaFold Protein Structure Database with simulated perturbations. Phase II (RigidSSL-MD) refines these representations on 1.3K molecular dynamics trajectories to capture physically realistic transitions. Underpinning both phases is a bi-directional, rigidity-aware flow matching objective that jointly optimizes translational and rotational dynamics to maximize mutual information between conformations. Empirically, RigidSSL variants improve designability by up to 43% while enhancing novelty and diversity in unconditional generation. Furthermore, RigidSSL-Perturb improves the success rate by 5.8% in zero-shot motif scaffolding and RigidSSL-MD captures more biophysically realistic conformational ensembles in G protein-coupled receptor modeling. The code is available at: https://github.com/ZhanghanNi/RigidSSL.git.


【29】COOL-MC: Verifying and Explaining RL Policies for Platelet Inventory Management
标题:COOL-MC:验证并解释血小板库存管理的RL策略
链接:https://arxiv.org/abs/2603.02396

作者:Dennis Gross
摘要 :Platelets expire within five days. Blood banks face uncertain daily demand and must balance ordering decisions between costly wastage from overstocking and life-threatening shortages from understocking. Reinforcement learning (RL) can learn effective ordering policies for this Markov decision process (MDP), but the resulting neural policies remain black boxes, hindering trust and adoption in safety-critical domains. We apply COOL-MC, a tool that combines RL with probabilistic model checking and explainable RL, to verify and explain a trained policy for the MDP on platelet inventory management inspired by Haijema et al. By constructing a policy-induced discrete-time Markov chain (which includes only the reachable states under the trained policy to reduce memory usage), we verify PCTL properties and provide feature-level explanations. Results show that the trained policy achieves a 2.9% stockout probability and a 1.1% inventory-full (potential wastage) probability within a 200-step horizon, primarily attends to the age distribution of inventory rather than other features such as day of week or pending orders. Action reachability analysis reveals that the policy employs a diverse replenishment strategy, with most order quantities reached quickly, while several are never selected. Counterfactual analysis shows that replacing medium-large orders with smaller ones leaves both safety probabilities nearly unchanged, indicating that these orders are placed in well-buffered inventory states. This first formal verification and explanation of an RL platelet inventory management policy demonstrates COOL-MC's value for transparent, auditable decision-making in safety-critical healthcare supply chain domains.


【30】CUCo: An Agentic Framework for Compute and Communication Co-design
标题:CUCo:计算与通信协同设计的智能体框架
链接:https://arxiv.org/abs/2603.02376

作者:Bodun Hu,Yoga Sri Varshan,Saurabh Agarwal,Aditya Akella
摘要:Custom CUDA kernel development is essential for maximizing GPU utilization in large-scale distributed LLM training and inference, yet manually writing kernels that jointly leverage both computation and communication remains a labor-intensive and error-prone process. Prior work on kernel optimization has focused almost exclusively on computation, leaving communication kernels largely untouched even though they constitute a significant share of total execution time. We introduce CUCo, a training-free agent-driven workflow that automatically generates high-performance CUDA kernels that jointly orchestrate computation and communication. By co-optimizing these traditionally disjoint components, CUCo unlocks new optimization opportunities unavailable to existing approaches, outperforming state-of-the-art baselines and reducing end-to-end latency by up to $1.57\times$.


【31】RO-N3WS: Enhancing Generalization in Low-Resource ASR with Diverse Romanian Speech Benchmarks
标题:RO-N3WS:通过多样化的罗马尼亚语音基准增强低资源ASR的泛化能力
链接:https://arxiv.org/abs/2603.02368

作者:Alexandra Diaconu,Mădălina Vînaga,Bogdan Alexe
摘要:We introduce RO-N3WS, a benchmark Romanian speech dataset designed to improve generalization in automatic speech recognition (ASR), particularly in low-resource and out-of-distribution (OOD) conditions. RO-N3WS comprises over 126 hours of transcribed audio collected from broadcast news, literary audiobooks, film dialogue, children's stories, and conversational podcast speech. This diversity enables robust training and fine-tuning across stylistically distinct domains. We evaluate several state-of-the-art ASR systems (Whisper, Wav2Vec 2.0) in both zero-shot and fine-tuned settings, and conduct controlled comparisons using synthetic data generated with expressive TTS models. Our results show that even limited fine-tuning on real speech from RO-N3WS yields substantial WER improvements over zero-shot baselines. We will release all models, scripts, and data splits to support reproducible research in multilingual ASR, domain adaptation, and lightweight deployment.


【32】Diffusion-MPC in Discrete Domains: Feasibility Constraints, Horizon Effects, and Critic Alignment: Case study with Tetris
标题:离散领域中的扩散MPC:可行性约束、规划视野效应与评论家对齐:俄罗斯方块案例研究
链接:https://arxiv.org/abs/2603.02348

作者:Haochuan Kevin Wang
备注:7 pages, 3 figures, 2 tables. Includes regret diagnostics and compute-quality frontier analysis. Code and experiment configurations available in the Diffusion-Tetris repository
摘要:We study diffusion-based model predictive control (Diffusion-MPC) in discrete combinatorial domains using Tetris as a case study. Our planner samples candidate placement sequences with a MaskGIT-style discrete denoiser and selects actions via reranking. We analyze three key factors: (1) feasibility-constrained sampling via logit masking over valid placements, (2) reranking strategies using a heuristic score, a pretrained DQN critic, and a hybrid combination, and (3) compute scaling in candidate count and planning horizon. We find that feasibility masking is necessary in discrete domains, removing invalid action mass (46%) and yielding a 6.8% improvement in score and 5.6% improvement in survival over unconstrained sampling. Naive DQN reranking is systematically misaligned with rollout quality, producing high decision regret (mean 17.6, p90 36.6). Shorter planning horizons outperform longer ones under sparse and delayed rewards, suggesting uncertainty compounding in long imagined rollouts. Overall, compute choices (K, H) determine dominant failure modes: small K limits candidate quality, while larger H amplifies misranking and model mismatch. Our findings highlight structural challenges of diffusion planners in discrete environments and provide practical diagnostics for critic integration.
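Feasibility-constrained sampling via logit masking is simple to sketch: invalid placements are set to -inf before the softmax so they receive exactly zero probability mass.

import torch

def sample_feasible(logits, feasible):
    # `feasible` is a boolean mask of valid placements for the current
    # board; masked entries get -inf logits and thus zero probability.
    masked = logits.masked_fill(~feasible, float("-inf"))
    probs = torch.softmax(masked, dim=-1)
    return torch.multinomial(probs, num_samples=1)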


【33】Preconditioned Score and Flow Matching
标题:预处理评分和流量匹配
链接:https://arxiv.org/abs/2603.02337

作者:Shadab Ahamed,Eshed Gal,Simon Ghyselincks,Md Shahriar Rahim Siddiqui,Moshe Eliasof,Eldad Haber
备注:24 pages, 12 figures, 5 tables
摘要:Flow matching and score-based diffusion train vector fields under intermediate distributions $p_t$, whose geometry can strongly affect their optimization. We show that the covariance $\Sigma_t$ of $p_t$ governs optimization bias: when $\Sigma_t$ is ill-conditioned, gradient-based training rapidly fits high-variance directions while systematically under-optimizing low-variance modes, leading to learning that plateaus at suboptimal weights. We formalize this effect in analytically tractable settings and propose reversible, label-conditional \emph{preconditioning} maps that reshape the geometry of $p_t$ by improving the conditioning of $\Sigma_t$ without altering the underlying generative model. Rather than accelerating early convergence, preconditioning primarily mitigates optimization stagnation by enabling continued progress along previously suppressed directions. Across MNIST latent flow matching and additional high-resolution datasets, we empirically track conditioning diagnostics and distributional metrics and show that preconditioning consistently yields better-trained models by avoiding suboptimal plateaus.
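A minimal sketch of the idea with a plain whitening map as the reversible preconditioner (the paper's maps are label-conditional; this version is an unconditional simplification):

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(10_000, 2)) * np.array([10.0, 0.1])  # ill-conditioned

def whiten(x, eps=1e-6):
    # Reversible linear map: center, then rescale by the inverse
    # covariance square root so the transformed covariance is ~identity.
    mu = x.mean(0)
    evals, evecs = np.linalg.eigh(np.cov(x - mu, rowvar=False))
    w = evecs @ np.diag(1.0 / np.sqrt(evals + eps)) @ evecs.T
    return (x - mu) @ w, mu, np.linalg.inv(w)

z, mu, w_inv = whiten(x)
print(np.linalg.cond(np.cov(x, rowvar=False)))  # ~1e4 before
print(np.linalg.cond(np.cov(z, rowvar=False)))  # ~1 after
x_back = z @ w_inv + mu  # exact inverse, so the generative model is unchanged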


【34】A Comparative Study of UMAP and Other Dimensionality Reduction Methods
标题:UMAP与其他降维方法的比较研究
链接:https://arxiv.org/abs/2603.02275

作者:Guanzhe Zhang,Shanshan Ding,Zhezhen Jin
备注:31 pages, 4 figures
摘要:Uniform Manifold Approximation and Projection (UMAP) is a widely used manifold learning technique for dimensionality reduction. This paper studies UMAP, supervised UMAP, and several competing dimensionality reduction methods, including Principal Component Analysis (PCA), Kernel PCA, Sliced Inverse Regression (SIR), Kernel SIR, and t-distributed Stochastic Neighbor Embedding, through a comprehensive comparative analysis. Although UMAP has attracted substantial attention for preserving local and global structures, its supervised extensions, particularly for regression settings, remain rather underexplored. We provide a systematic evaluation of supervised UMAP for both regression and classification using simulated and real datasets, with performance assessed via predictive accuracy on low-dimensional embeddings. Our results show that supervised UMAP performs well for classification but exhibits limitations in effectively incorporating response information for regression, highlighting an important direction for future development.
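For context, umap-learn exposes supervised UMAP by passing labels to fit (continuous regression targets use target_metric="l2" instead of the default); a sketch of the embed-then-predict comparison the paper performs, on a standard toy dataset:

import umap
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Identical settings, with and without label supervision.
unsup = umap.UMAP(n_components=2, random_state=0).fit(X_tr)
sup = umap.UMAP(n_components=2, random_state=0).fit(X_tr, y=y_tr)

for name, reducer in [("unsupervised", unsup), ("supervised", sup)]:
    z_tr, z_te = reducer.transform(X_tr), reducer.transform(X_te)
    acc = KNeighborsClassifier().fit(z_tr, y_tr).score(z_te, y_te)
    print(name, round(acc, 3))  # predictive accuracy on the embedding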


【35】The Alignment Flywheel: A Governance-Centric Hybrid MAS for Architecture-Agnostic Safety
标题:对齐飞轮:以治理为中心的混合MAS,实现架构无关的安全
链接:https://arxiv.org/abs/2603.02259

作者:Elias Malomgré,Pieter Simoens
摘要:Multi-agent systems provide mature methodologies for role decomposition, coordination, and normative governance, capabilities that remain essential as increasingly powerful autonomous decision components are embedded within agent-based systems. While learned and generative models substantially expand system capability, their safety behavior is often entangled with training, making it opaque, difficult to audit, and costly to update after deployment. This paper formalizes the Alignment Flywheel as a governance-centric hybrid MAS architecture that decouples decision generation from safety governance. A Proposer, representing any autonomous decision component, generates candidate trajectories, while a Safety Oracle returns raw safety signals through a stable interface. An enforcement layer applies explicit risk policy at runtime, and a governance MAS supervises the Oracle through auditing, uncertainty-driven verification, and versioned refinement. The central engineering principle is patch locality: many newly observed safety failures can be mitigated by updating the governed oracle artifact and its release pipeline rather than retracting or retraining the underlying decision component. The architecture is implementation-agnostic with respect to both the Proposer and the Safety Oracle, and specifies the roles, artifacts, protocols, and release semantics needed for runtime gating, audit intake, signed patching, and staged rollout across distributed deployments. The result is a hybrid MAS engineering framework for integrating highly capable but fallible autonomous systems under explicit, version-controlled, and auditable oversight.


【36】Universal Conceptual Structure in Neural Translation: Probing NLLB-200's Multilingual Geometry
标题:神经翻译中的通用概念结构:探究NLLB-200的多语言几何学
链接:https://arxiv.org/abs/2603.02258

作者:Kyle Elliott Mathewson
备注:14 figures; code and interactive toolkit available at https://github.com/kylemathewson/InterpretCognates
摘要:Do neural machine translation models learn language-universal conceptual representations, or do they merely cluster languages by surface similarity? We investigate this question by probing the representation geometry of Meta's NLLB-200, a 200-language encoder-decoder Transformer, through six experiments that bridge NLP interpretability with cognitive science theories of multilingual lexical organization. Using the Swadesh core vocabulary list embedded across 135 languages, we find that the model's embedding distances significantly correlate with phylogenetic distances from the Automated Similarity Judgment Program ($\rho = 0.13$, $p = 0.020$), demonstrating that NLLB-200 has implicitly learned the genealogical structure of human languages. We show that frequently colexified concept pairs from the CLICS database exhibit significantly higher embedding similarity than non-colexified pairs ($U = 42656$, $p = 1.33 \times 10^{-11}$, $d = 0.96$), indicating that the model has internalized universal conceptual associations. Per-language mean-centering of embeddings improves the between-concept to within-concept distance ratio by a factor of 1.19, providing geometric evidence for a language-neutral conceptual store analogous to the anterior temporal lobe hub identified in bilingual neuroimaging. Semantic offset vectors between fundamental concept pairs (e.g., man to woman, big to small) show high cross-lingual consistency (mean cosine = 0.84), suggesting that second-order relational structure is preserved across typologically diverse languages. We release InterpretCognates, an open-source interactive toolkit for exploring these phenomena, alongside a fully reproducible analysis pipeline.
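The per-language mean-centering diagnostic is easy to state in code; a toy version, assuming embeddings arranged as (languages, concepts, dim):

import numpy as np

def between_within_ratio(emb):
    # Mean-center each language's embeddings, then compare the average
    # between-concept distance (within a language) to the average
    # within-concept distance (across languages). Higher favors a
    # language-neutral conceptual store.
    emb = emb - emb.mean(axis=1, keepdims=True)
    n_l, n_c, _ = emb.shape
    within = np.mean([np.linalg.norm(emb[i, c] - emb[j, c])
                      for c in range(n_c)
                      for i in range(n_l) for j in range(i + 1, n_l)])
    between = np.mean([np.linalg.norm(emb[i, a] - emb[i, b])
                       for i in range(n_l)
                       for a in range(n_c) for b in range(a + 1, n_c)])
    return between / within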


【37】Structured vs. Unstructured Pruning: An Exponential Gap
标题:结构化与非结构化剪枝:指数级差距
链接:https://arxiv.org/abs/2603.02234

作者:Davide Ferré,Frédéric Giroire,Emanuele Natale,Frederik Mallmann-Trenn
摘要:The Strong Lottery Ticket Hypothesis (SLTH) posits that large, randomly initialized neural networks contain sparse subnetworks capable of approximating a target function at initialization without training, suggesting that pruning alone is sufficient. Pruning methods are typically classified as unstructured, where individual weights can be removed from the network, and structured, where parameters are removed according to specific patterns, as in neuron pruning. Existing theoretical results supporting the SLTH rely almost exclusively on unstructured pruning, showing that logarithmic overparameterization suffices to approximate simple target networks. In contrast, neuron pruning has received limited theoretical attention. In this work, we consider the problem of approximating a single bias-free ReLU neuron using a randomly initialized bias-free two-layer ReLU network, thereby isolating the intrinsic limitations of neuron pruning. We show that neuron pruning requires a starting network with $\Omega(d/\varepsilon)$ hidden neurons to $\varepsilon$-approximate a target ReLU neuron. In contrast, weight pruning achieves $\varepsilon$-approximation with only $O(d\log(1/\varepsilon))$ neurons, establishing an exponential separation between the two pruning paradigms.


【38】Generalized Discrete Diffusion with Self-Correction
标题:具有自修正的广义离散扩散
链接:https://arxiv.org/abs/2603.02230

作者:Linxuan Wang,Ziyi Wang,Yikun Bai,Wei Deng,Guang Lin,Qifan Song
备注:40 pages, 3 figures, 6 tables
摘要:Self-correction is an effective technique for maintaining parallel sampling in discrete diffusion models with minimal performance degradation. Prior work has explored self-correction at inference time or during post-training; however, such approaches often suffer from limited generalization and may impair reasoning performance. GIDD pioneers pretraining-based self-correction via a multi-step BERT-style uniform-absorbing objective. However, GIDD relies on a continuous interpolation-based pipeline with opaque interactions between uniform transitions and absorbing masks, which complicates hyperparameter tuning and hinders practical performance. In this work, we propose a Self-Correcting Discrete Diffusion (SCDD) model to reformulate pretrained self-correction with explicit state transitions and learn directly in discrete time. Our framework also simplifies the training noise schedule, eliminates a redundant remasking step, and relies exclusively on uniform transitions to learn self-correction. Experiments at the GPT-2 scale demonstrate that our method enables more efficient parallel decoding while preserving generation quality.


【39】Routing Absorption in Sparse Attention: Why Random Gates Are Hard to Beat
标题:稀疏注意力中的路由吸收:为什么随机门很难被击败
链接:https://arxiv.org/abs/2603.02227

作者:Keston Aquino-Michaels
备注:14 pages, 4 figures
摘要:Can a transformer learn which attention entries matter during training? In principle, yes: attention distributions are highly concentrated, and a small gate network can identify the important entries post-hoc with near-perfect accuracy. In practice, barely. When sparse attention is trained end-to-end, the model's Q/K/V projections co-adapt to whatever mask is imposed, absorbing the routing signal until learned gates perform little better than frozen random gates. We call this routing absorption and present four independent lines of evidence for it in a controlled 31M-parameter transformer: (1) differentiable soft gating converges to nearly the same perplexity whether the gate is learned or random (48.73 +/- 0.60 vs. 49.83 +/- 0.04 over 3 seeds); (2) hard top-k gating receives exactly zero gradient through the mask; (3) a gate distilled onto co-adapted Q/K/V achieves high F1 against oracle masks but catastrophic perplexity when deployed (601.6 vs. 48.6 on mask-agnostic Q/K/V); and (4) stochastic mask randomization during training fails to prevent co-adaptation (78.2 ppl deployed dense vs. 37.3 baseline). We connect routing absorption to the same phenomenon in Mixture-of-Experts, where random routing matches learned routing because experts co-adapt to any router, but show that attention exhibits a structurally more severe form: shared Q/K/V parameters enable cross-layer compensation pathways absent in MoE, where experts are self-contained modules. The implication is that end-to-end sparse attention methods employing per-query token-level gating face absorption pressure proportional to the parameter asymmetry between the gate and the model, and that post-hoc approaches, which decouple representation learning from sparsification, sidestep this entirely.
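Point (2), that hard top-k gating passes exactly zero gradient through the mask, is easy to verify in PyTorch (the dummy term below only keeps the gate in the autograd graph so backward() runs):

import torch

scores = torch.randn(8, requires_grad=True)
values = torch.randn(8)

# Hard top-k mask: built from discrete indices, so it is detached
# from `scores` and contributes no gradient path.
hard = torch.zeros(8).scatter(0, scores.topk(3).indices, 1.0)
print(hard.requires_grad)  # False

out = (hard * values).sum() + 0.0 * scores.sum()  # dummy term, see above
out.backward()
print(scores.grad)  # exactly zero everywhere

# A soft gate does receive gradient, but downstream weights can
# co-adapt around it: the "routing absorption" failure mode.
scores2 = scores.detach().requires_grad_(True)
(torch.softmax(scores2, 0) * values).sum().backward()
print(scores2.grad)  # nonzero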


【40】MedCalc-Bench Doesn't Measure What You Think: A Benchmark Audit and the Case for Open-Book Evaluation
标题:MedCalc-Bench衡量的并非你所想:基准审计与开卷评估的理由
链接:https://arxiv.org/abs/2603.02222

作者:Artus Krohn-Grimberghe
摘要:MedCalc-Bench is a widely used benchmark for evaluating LLM performance on clinical calculator tasks, with state-of-the-art direct prompting scores plateauing around 35% on the Verified split (HELM MedHELM leaderboard) and the best published approach (RL with verifiable rewards) reaching 74%. We present three contributions that challenge the benchmark's current framing. First, we conduct a systematic audit of the benchmark's calculator implementations, identifying and fixing over 20 errors ranging from critical formula inaccuracies to runtime bugs in a NeurIPS-published dataset. Second, we show that a simple intervention, providing the model with the calculator specification at inference time ("open-book" prompting), raises accuracy from ~52% to 81-85% on GLM-4.6V and GLM-4.7, surpassing all published results including RL-trained systems, without any fine-tuning. Third, we establish an upper bound of 95-97% using GPT-5.2-Thinking, with residual errors attributable primarily to ground-truth issues and dataset ambiguities. Our findings suggest that MedCalc-Bench predominantly measures formula memorization and arithmetic precision rather than clinical reasoning, and would be better framed as a tool-use evaluation.
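The "open-book" intervention is a prompt-construction change; a minimal sketch (the wording and field names are illustrative, not the paper's exact template):

def open_book_prompt(patient_note, question, calculator_spec):
    # Unlike closed-book prompting, the calculator's formula and rules
    # are supplied verbatim at inference time, so the model applies a
    # specification rather than recalling it from memory.
    return (
        "You are a clinical calculator assistant.\n\n"
        f"Calculator specification:\n{calculator_spec}\n\n"
        f"Patient note:\n{patient_note}\n\n"
        f"Question: {question}\n"
        "Apply the specification step by step, then state the final value."
    )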


【41】NExT-Guard: Training-Free Streaming Safeguard without Token-Level Labels
标题:NExT-Guard:无需令牌级标签的免训练流式安全防护
链接:https://arxiv.org/abs/2603.02219

作者:Junfeng Fang,Nachuan Chen,Houcheng Jiang,Dan Zhang,Fei Shen,Xiang Wang,Xiangnan He,Tat-Seng Chua
摘要:Large language models are increasingly deployed in streaming scenarios, rendering conventional post-hoc safeguards ineffective as they fail to interdict unsafe content in real-time. While streaming safeguards based on token-level supervised training could address this, they necessitate expensive annotations and suffer from severe overfitting. In this work, we challenge the paradigm that streaming safety must rely on token-level supervised training. Instead, it is an inherent capability of well-trained post-hoc safeguards, as they already encode token-level risk signals in hidden representations. Hence, we introduce NExT-Guard, a training-free framework that achieves streaming safeguards by monitoring interpretable latent features from Sparse Autoencoders (SAEs). It uses pretrained SAEs from publicly available base LLMs, enabling flexible, low-cost deployment without token-level supervision. Experimental results show that NExT-Guard outperforms both post-hoc and streaming safeguards based on supervised training, with superior robustness across models, SAE variants, and risk scenarios. These results make NExT-Guard a universal and scalable paradigm for real-time safety, accelerating the practical deployment of streaming safeguards.


【42】Self-Play Only Evolves When Self-Synthetic Pipeline Ensures Learnable Information Gain
标题:只有当自合成管道确保可学习的信息增益时,自博弈才会进化
链接:https://arxiv.org/abs/2603.02218

作者:Wei Liu,Siya Qi,Yali Du,Yulan He
备注:10 pages, 6 figures, 7 formulas
摘要:Large language models (LLMs) make it plausible to build systems that improve through self-evolving loops, but many existing proposals are better understood as self-play and often plateau quickly. A central failure mode is that the loop synthesises more data without increasing learnable information for the next iteration. Through experiments on a self-play coding task, we reveal that sustainable self-evolution requires a self-synthesised data pipeline with learnable information that increases across iterations. We identify triadic roles that self-evolving LLMs play: the Proposer, which generates tasks; the Solver, which attempts solutions; and the Verifier, which provides training signals, and we identify three system designs that jointly target learnable information gain from this triadic roles perspective. Asymmetric co-evolution closes a weak-to-strong-to-weak loop across roles. Capacity growth expands parameter and inference-time budgets to match rising learnable information. Proactive information seeking introduces external context and new task sources that prevent saturation. Together, these modules provide a measurable, system-level path from brittle self-play dynamics to sustained self-evolution.


【43】Is Retraining-Free Enough? The Necessity of Router Calibration for Efficient MoE Compression
标题:无再训练足够了吗?高效MoE压缩需要进行路由器校准
链接:https://arxiv.org/abs/2603.02217

作者:Sieun Hyeon,Jaeyoung Do
摘要:Mixture-of-Experts (MoE) models scale capacity efficiently, but their massive parameter footprint creates a deployment-time memory bottleneck. We organize retraining-free MoE compression into three paradigms (Expert Pruning, Expert Editing, and Expert Merging) and show that persistent post-compression degradation largely stems from a neglected factor: router-expert mismatch when experts are changed but the router is left untouched. We argue that effective retraining-free compression should avoid updating expert parameters while allowing lightweight router calibration. To this end, we propose Router Knowledge Distillation (Router KD), which updates only a tiny fraction of parameters (the router) by distilling the original model's next-token distribution on unlabeled calibration data. Experiments across representative methods in all three paradigms demonstrate consistent performance recovery, with substantially larger gains in fine-grained MoEs (many small experts) than in coarse-grained MoEs due to their more complex routing decision boundaries.
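A minimal sketch of Router KD, assuming a HuggingFace-style causal-LM interface and that router parameters can be identified by name (both are per-model assumptions):

import torch
import torch.nn.functional as F

def router_kd_step(student, teacher, optimizer, input_ids):
    # Freeze everything except router/gate parameters, then match the
    # original (teacher) model's next-token distribution with a KL loss
    # on unlabeled calibration tokens.
    for name, p in student.named_parameters():
        p.requires_grad = ("router" in name) or ("gate" in name)
    with torch.no_grad():
        t_logits = teacher(input_ids).logits
    s_logits = student(input_ids).logits
    loss = F.kl_div(F.log_softmax(s_logits, dim=-1),
                    F.softmax(t_logits, dim=-1),
                    reduction="batchmean")
    loss.backward()
    optimizer.step()           # optimizer built over router params only
    optimizer.zero_grad()
    return loss.item()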


【44】Variance reduction in lattice QCD observables via normalizing flows
标题:通过正规化流减少格点QCD观测量的方差
链接:https://arxiv.org/abs/2603.02984

作者:Ryan Abbott,Denis Boyda,Yang Fu,Daniel C. Hackett,Gurtej Kanwar,Fernando Romero-López,Phiala E. Shanahan,Julian M. Urban
备注:15 pages, 4 figures, 2 tables
摘要:Normalizing flows can be used to construct unbiased, reduced-variance estimators for lattice field theory observables that are defined by a derivative with respect to action parameters. This work implements the approach for observables involving gluonic operator insertions in the SU(3) Yang-Mills theory and two-flavor Quantum Chromodynamics (QCD) in four space-time dimensions. Variance reduction by factors of $10$-$60$ is achieved in glueball correlation functions and in gluonic matrix elements related to hadron structure, with demonstrated computational advantages. The observed variance reduction is found to be approximately independent of the lattice volume, so that volume transfer can be utilized to minimize training costs.


【45】The Vienna 4G/5G Drive-Test Dataset
标题:维也纳4G/5G路测数据集
链接:https://arxiv.org/abs/2603.02638

作者:Wilfried Wiedner,Lukas Eller,Mariam Mussbah,Dominik Rössler,Valerian Maresch,Philipp Svoboda,Markus Rupp
备注:18 pages, 12 figures, 8 tables. Submitted to Scientific Data
摘要:Machine learning for mobile network analysis, planning, and optimization is often limited by the lack of large, comprehensive real-world datasets. This paper introduces the Vienna 4G/5G Drive-Test Dataset, a city-scale open dataset of georeferenced Long Term Evolution (LTE) and 5G New Radio (NR) measurements collected across Vienna, Austria. The dataset combines passive wideband scanner observations with active handset logs, providing complementary network-side and user-side views of deployed radio access networks. The measurements cover diverse urban and suburban settings and are aligned with time and location information to support consistent evaluation. For a representative subset of base stations (BSs), we provide inferred deployment descriptors, including estimated BS locations, sector azimuths, and antenna heights. The release further includes high-resolution building and terrain models, enabling geometry-conditioned learning and calibration of deterministic approaches such as ray tracing. To facilitate practical reuse, the data are organized into scanner, handset, estimated cell information, and city-model components, and the accompanying documentation describes the available fields and intended joins between them. The dataset enables reproducible benchmarking across environment-aware learning, propagation modeling, coverage analysis, and ray-tracing calibration workflows.


【46】Low-Degree Method Fails to Predict Robust Subspace Recovery
标题:低次多项式方法无法预测鲁棒子空间恢复
链接:https://arxiv.org/abs/2603.02594

作者:He Jia,Aravindan Vijayaraghavan
备注:27 pages, 1 figure
摘要:The low-degree polynomial framework has been highly successful in predicting computational versus statistical gaps for high-dimensional problems in average-case analysis and machine learning. This success has led to the low-degree conjecture, which posits that this method captures the power and limitations of efficient algorithms for a wide class of high-dimensional statistical problems. We identify a natural and basic hypothesis testing problem in $\mathbb{R}^n$ which is polynomial time solvable, but for which the low-degree polynomial method fails to predict its computational tractability even up to degree $k=n^{\Omega(1)}$. Moreover, the low-degree moments match exactly up to degree $k=O(\sqrt{\log n/\log\log n})$. Our problem is a special case of the well-studied robust subspace recovery problem. The lower bounds suggest that there is no polynomial time algorithm for this problem. In contrast, we give a simple and robust polynomial time algorithm that solves the problem (and noisy variants of it), leveraging anti-concentration properties of the distribution. Our results suggest that the low-degree method and low-degree moments fail to capture algorithms based on anti-concentration, challenging their universality as a predictor of computational barriers.


【47】Geometric structures and deviations on James' symmetric positive-definite matrix bicone domain
标题:James对称正定矩阵双锥域的几何结构与偏差
链接:https://arxiv.org/abs/2603.02483

作者:Jacek Karwowski,Frank Nielsen
备注:35 pages, 4 figures
摘要:Symmetric positive-definite (SPD) matrix datasets play a central role across numerous scientific disciplines, including signal processing, statistics, finance, computer vision, information theory, and machine learning among others. The set of SPD matrices forms a cone which can be viewed as a global coordinate chart of the underlying SPD manifold. Rich differential-geometric structures may be defined on the SPD cone manifold. Among the most widely used geometric frameworks on this manifold are the affine-invariant Riemannian structure and the dual information-geometric log-determinant barrier structure, each associated with dissimilarity measures (distance and divergence, respectively). In this work, we introduce two new structures, a Finslerian structure and a dual information-geometric structure, both derived from James' bicone reparameterization of the SPD domain. Those structures ensure that geodesics correspond to straight lines in appropriate coordinate systems. The closed bicone domain includes the spectraplex (the set of positive semi-definite diagonal matrices with unit trace) as an affine subspace, and the Hilbert VPM distance is proven to generalize the Hilbert simplex distance which found many applications in machine learning. Finally, we discuss several applications of these Finsler/dual Hessian structures and provide various inequalities between the new and traditional dissimilarities.
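For reference, the Hilbert simplex distance that the Hilbert VPM distance is shown to generalize has a simple closed form on the open probability simplex:

import numpy as np

def hilbert_simplex_distance(p, q):
    # d(p, q) = log(max_i p_i/q_i) - log(min_i p_i/q_i); symmetric and
    # invariant to positive rescaling of p and q.
    r = np.asarray(p, float) / np.asarray(q, float)
    return float(np.log(r.max()) - np.log(r.min()))

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.5, 0.3])
print(hilbert_simplex_distance(p, q))  # equals the value with p, q swapped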


【48】Topological Causal Effects
标题:拓扑因果效应
链接:https://arxiv.org/abs/2603.02289

作者:Kwangho Kim,Hajin Lee
摘要:Estimating causal effects is particularly challenging when outcomes arise in complex, non-Euclidean spaces, where conventional methods often fail to capture meaningful structural variation. We develop a framework for topological causal inference that defines treatment effects through differences in the topological structure of potential outcomes, summarized by power-weighted silhouette functions of persistence diagrams. We develop an efficient, doubly robust estimator in a fully nonparametric model, establish functional weak convergence, and construct a formal test of the null hypothesis of no topological effect. Empirical studies illustrate that the proposed method reliably quantifies topological treatment effects across diverse complex outcome types.
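The power-weighted silhouette has a standard closed form: each diagram point $(b_j, d_j)$ contributes a tent function $\Lambda_j(t)=\max(0,\min(t-b_j,\,d_j-t))$ with weight $(d_j-b_j)^p$. A minimal sketch:

import numpy as np

def power_silhouette(diagram, t, p=1.0):
    # diagram: (n, 2) array of (birth, death) pairs; t: evaluation grid.
    # Larger p concentrates the summary on more persistent features.
    b, d = np.asarray(diagram, float).T
    w = (d - b) ** p
    tents = np.maximum(0.0, np.minimum(t[:, None] - b, d - t[:, None]))
    return tents @ w / w.sum()

dgm = [(0.0, 1.0), (0.2, 0.5), (0.4, 1.2)]
t = np.linspace(0.0, 1.5, 200)
phi = power_silhouette(dgm, t, p=2.0)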


【49】Quantum AS-DeepOnet: Quantum Attentive Stacked DeepONet for Solving 2D Evolution Equations
标题:Quantum AS-DeepOnet:用于求解二维演化方程的量子注意力堆叠DeepONet
链接:https://arxiv.org/abs/2603.02261

作者:Hongquan Wang,Hanshu Chen,Ilia Marchevsky,Zhuojia Fu
摘要:DeepONet enables retraining-free inference across varying initial conditions or source terms at the cost of high computational requirements. This paper proposes a hybrid quantum operator network (Quantum AS-DeepOnet) suitable for solving 2D evolution equations. By combining Parameterized Quantum Circuits and cross-subnet attention methods, we can solve 2D evolution equations using only 60% of the trainable parameters while maintaining accuracy and convergence comparable to the classical DeepONet method.


【50】Whisper-RIR-Mega: A Paired Clean-Reverberant Speech Benchmark for ASR Robustness to Room Acoustics
标题:Whisper-RIR-Mega:用于评估ASR对房间声学鲁棒性的配对干净-混响语音基准
链接:https://arxiv.org/abs/2603.02252

作者:Mandip Goswami
摘要:We introduce Whisper-RIR-Mega, a benchmark dataset of paired clean and reverberant speech for evaluating automatic speech recognition (ASR) robustness to room acoustics. Each sample pairs a clean LibriSpeech utterance with the same utterance convolved with a real room impulse response from the RIR-Mega corpus, with stratified splits by reverberation time (RT60) and direct-to-reverberant ratio (DRR). We evaluate five Whisper models (tiny through large-v3) on 1600 test samples and report word error rate (WER) and character error rate (CER) under clean and reverberant conditions. Reverberation consistently degrades performance across all model sizes; the reverb penalty in WER ranges from 0.12 to 1.07 percentage points depending on the model. We release the dataset, evaluation code, and baseline results to support reproducible research on robust ASR.
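The clean-reverberant pairing comes from convolving each utterance with a room impulse response; a minimal sketch (the peak-level renormalization is one common convention, not necessarily the dataset's):

import numpy as np
from scipy.signal import fftconvolve

def reverberate(clean, rir):
    # Convolve with the RIR, trim back to the clean length, and match
    # the clean signal's peak level so loudness stays comparable.
    wet = fftconvolve(clean, rir, mode="full")[:len(clean)]
    return wet * (np.abs(clean).max() / (np.abs(wet).max() + 1e-12))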


【51】OnDA: On-device Channel Pruning for Efficient Personalized Keyword Spotting
标题:OnDA:面向高效个性化关键词检测的设备端通道剪枝
链接:https://arxiv.org/abs/2603.02247

作者:Matteo Risso,Alessio Burrello,Daniele Jahier Pagliari
备注:Submitted for review at Interspeech2026
摘要:Always-on keyword spotting (KWS) demands on-device adaptation to cope with user- and environment-specific distribution shifts under tight latency and energy budgets. This paper proposes, for the first time, coupling weight adaptation (i.e., on-device training) with architectural adaptation, in the form of online structured channel pruning, for personalized on-device KWS. Starting from a state-of-the-art self-learning personalized KWS pipeline, we compare data-agnostic and data-aware pruning criteria applied on in-field pseudo-labelled user data. On the HeySnips and HeySnapdragon datasets, we achieve up to 9.63x model-size compression with respect to unpruned baselines at iso-task performance, measured as the accuracy at 0.5 false alarms per hour. When deploying our adaptation pipeline on a Jetson Orin Nano embedded GPU, we achieve up to 1.52x/1.57x and 1.64x/1.77x latency and energy-consumption improvements during online training/inference compared to weights-only adaptation.
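A data-agnostic structured criterion of the kind the paper compares can be sketched as L1-norm channel scoring (the paper's actual on-device criteria, including the data-aware ones computed on pseudo-labelled user audio, are not specified in the abstract):

import torch
import torch.nn as nn

def l1_channel_scores(conv):
    # Score each output channel by the L1 norm of its filter; the
    # lowest-scoring channels are the structured-pruning candidates.
    return conv.weight.detach().abs().sum(dim=(1, 2, 3))

conv = nn.Conv2d(16, 32, kernel_size=3)
keep = torch.topk(l1_channel_scores(conv), k=24).indices  # keep 75%
pruned_filters = conv.weight[keep]  # filters a pruned layer would retain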


机器翻译由腾讯交互翻译提供,仅供参考

