
Machine Learning arXiv Daily Digest [4.2]



cs.LG: 164 papers today


Large language models (23 papers)

【1】LLM REgression with a Latent Iterative State Head
链接:https://arxiv.org/abs/2604.01206

Authors: Yiheng Su, Matthew Lease
Abstract: We present RELISH (REgression with a Latent Iterative State Head), a novel, lightweight architecture designed for text regression with large language models. Rather than decoding numeric targets as text or aggregating multiple generated outputs, RELISH predicts scalar values directly from frozen LLM representations by iteratively refining a learned latent state through cross-attention over token-level representations, and then mapping the final state to a point estimate with a linear regressor. Across five datasets, four LLM backbones, and two LLM training regimes, RELISH consistently outperforms prior baselines from all three major LLM regression families, including autoregressive decoding, regression-aware inference, and existing predictive head methods. Despite these gains, RELISH remains highly parameter-efficient, requiring only 3.4-3.7M trainable parameters across frozen LLM backbones (only 0.01-0.04% additional overhead), far less than LoRA-based alternatives that grow with model size (0.26-0.42%).
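
The head described above is compact enough to sketch. Below is a minimal PyTorch rendering of the idea from the abstract, a learned latent state refined by cross-attention over frozen token representations and then mapped to a scalar; the hidden size, head count, number of refinement iterations, and single-state design are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of a latent-iterative-state regression head.
import torch
import torch.nn as nn

class LatentIterativeStateHead(nn.Module):
    def __init__(self, d_model: int, n_iters: int = 4, n_heads: int = 8):
        super().__init__()
        self.state0 = nn.Parameter(torch.zeros(1, 1, d_model))  # learned latent state
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)
        self.regressor = nn.Linear(d_model, 1)  # maps the final state to a point estimate
        self.n_iters = n_iters

    def forward(self, token_reprs: torch.Tensor) -> torch.Tensor:
        # token_reprs: (batch, seq_len, d_model) from a frozen LLM
        state = self.state0.expand(token_reprs.size(0), -1, -1)
        for _ in range(self.n_iters):
            # refine the latent state by attending over token-level representations
            update, _ = self.cross_attn(state, token_reprs, token_reprs)
            state = self.norm(state + update)
        return self.regressor(state.squeeze(1)).squeeze(-1)  # (batch,)
```

Only the head is trained, which is consistent with the abstract's claim of a few million trainable parameters on top of a frozen backbone.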


【2】Online Reasoning Calibration: Test-Time Training Enables Generalizable Conformal LLM Reasoning
链接:https://arxiv.org/abs/2604.01170

Authors: Cai Zhou, Zekai Wang, Menghua Wu, Qianyu Julie Zhu, Flora C. Shi, Chenyu Wang, Ashia Wilson, Tommi Jaakkola, Stephen Bates
Note: 20 pages
Abstract: While test-time scaling has enabled large language models to solve highly difficult tasks, state-of-the-art results come at exorbitant compute costs. These inefficiencies can be attributed to the miscalibration of post-trained language models, and the lack of calibration in popular sampling techniques. Here, we present Online Reasoning Calibration (ORCA), a framework for calibrating the sampling process that draws upon conformal prediction and test-time training. Specifically, we introduce a meta-learning procedure that updates the calibration module for each input. This allows us to provide valid confidence estimates under distributional shift, e.g., in thought patterns that occur across different stages of reasoning, or in prompt distributions between model development and deployment. ORCA not only provides theoretical guarantees on conformal risks, but also empirically shows higher efficiency and generalization across different reasoning tasks. At risk level $\delta=0.1$, ORCA improves Qwen2.5-32B efficiency on in-distribution tasks with savings up to 47.5% with supervised labels and 40.7% with self-consistency labels. Under zero-shot out-of-domain settings, it improves MATH-500 savings from 24.8% of the static calibration baseline to 67.0% while maintaining a low empirical error rate, and the same trend holds across model families and downstream benchmarks. Our code is publicly available at https://github.com/wzekai99/ORCA.


【3】Reasoning Shift: How Context Silently Shortens LLM Reasoning
链接:https://arxiv.org/abs/2604.01161

Authors: Gleb Rodionov
Note: Preprint, work in progress
Abstract: Large language models (LLMs) exhibiting test-time scaling behavior, such as extended reasoning traces and self-verification, have demonstrated remarkable performance on complex, long-term reasoning tasks. However, the robustness of these reasoning behaviors remains underexplored. To investigate this, we conduct a systematic evaluation of multiple reasoning models across three scenarios: (1) problems augmented with lengthy, irrelevant context; (2) multi-turn conversational settings with independent tasks; and (3) problems presented as a subtask within a complex task. We observe an interesting phenomenon: reasoning models tend to produce much shorter reasoning traces (up to 50%) for the same problem under different context conditions compared to the traces produced when the problem is presented in isolation. A finer-grained analysis reveals that this compression is associated with a decrease in self-verification and uncertainty management behaviors, such as double-checking. While this behavioral shift does not compromise performance on straightforward problems, it might affect performance on more challenging tasks. We hope our findings draw additional attention to both the robustness of reasoning models and the problem of context management for LLMs and LLM-based agents.


【4】Fast and Accurate Probing of In-Training LLMs' Downstream Performances
链接:https://arxiv.org/abs/2604.01025

Authors: Zhichen Liu, Tianle Lun, Zhibin Wen, Hao An, Yulin Ou, Jianhui Xu, Hao Zhang, Wenyi Fang, Yang Zheng, Yang Xu
Abstract: The paradigm of scaling Large Language Models (LLMs) in both parameter size and test time has pushed the boundaries of AI capabilities, but at the cost of making the traditional generative evaluation paradigm prohibitively expensive, therefore making the latency of LLM's in-training downstream performance evaluation unbearable. However, simple metrics like training loss (perplexity) are not always correlated with downstream performance, as sometimes their trends diverge from the actual task outcomes. This dilemma calls for a method that is computationally efficient and sufficiently accurate in measuring model capabilities. To address this challenge, we introduce a new in-training evaluation paradigm that uses a lightweight probe for monitoring downstream performance. The probes take the internal representations of LLM checkpoints (during training) as input and directly predict the checkpoint's performance on downstream tasks measured by success probability (i.e., pass@1). We design several probe architectures, validating their effectiveness using the OLMo3-7B's checkpoints across a diverse set of downstream tasks. The probes can accurately predict a checkpoint's performance (with avg. AUROC $>$ 0.75), have decent generalizability across checkpoints (earlier predicts later), and reduce the computation latency from $\sim$1 hr (using the conventional generative evaluation method) to $\sim$3 min. In sum, this work presents a practical and scalable in-training downstream evaluation paradigm, enabling a more agile, informed, and efficient LLM development process.
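
As a rough illustration of the paradigm, the sketch below shows what such a lightweight checkpoint probe could look like, assuming mean-pooled hidden states over a fixed probe-prompt set as input; the pooling scheme and MLP shape are assumptions, not the paper's architectures.

```python
# Minimal sketch of a checkpoint-performance probe: pool a checkpoint's
# hidden states over a fixed prompt set, then regress pass@1 with a small MLP.
import torch
import torch.nn as nn

class CheckpointProbe(nn.Module):
    def __init__(self, d_model: int, d_hidden: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(d_model, d_hidden), nn.ReLU(),
            nn.Linear(d_hidden, 1),
        )

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (n_probe_prompts, seq_len, d_model) from one checkpoint
        pooled = hidden_states.mean(dim=(0, 1))   # one summary vector per checkpoint
        return torch.sigmoid(self.mlp(pooled))    # predicted pass@1 in [0, 1]
```

A probe like this runs in seconds per checkpoint, which matches the abstract's ~1 hr to ~3 min latency reduction in spirit.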


【5】Optimal Brain Decomposition for Accurate LLM Low-Rank Approximation
链接:https://arxiv.org/abs/2604.00821

Authors: Yuhang Li, Donghyun Lee, Ruokai Yin, Priyadarshini Panda
Abstract: Low-rank decomposition has emerged as an important problem in Large Language Model (LLM) fine-tuning and inference. Through Singular Value Decomposition (SVD), a weight matrix can be factorized optimally into low-rank spaces. A common prior practice is to decompose the weights in an activation-whitened space, which achieves satisfactory results. In this work, we propose Optimal Brain Decomposition LLM (OBD-LLM), which studies the decomposition problem in the model space by utilizing second-order Hessian information. Through a rigorous Kronecker factorization of the Hessian, we show that the decomposition needs to consider both the input and output information of the layer, and achieves much better decomposition results than input-only methods. Our loss-aware decomposition method involves a bi-directional whitening of the weight matrix. As a result, OBD-LLM is a closed-form solution for the optimal decomposition of weights in the language model. Remarkably, we achieve ~20-40% better results than the previous state-of-the-art decomposition method, SVD-LLM.
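
The abstract does not spell out the construction, but one plausible reading of "bi-directional whitening" is sketched below: whiten the weight on both sides using input- and output-side second-moment factors, truncate the SVD in that space, and map the factors back. The choice of Cholesky factors as whiteners is an illustrative assumption, not the paper's derivation.

```python
# Minimal sketch of a bi-directionally whitened truncated SVD.
import torch

def bidirectional_whitened_svd(W, cov_in, cov_out, rank):
    # W: (d_out, d_in); cov_in/cov_out: input/output second-moment matrices
    S_in = torch.linalg.cholesky(cov_in)      # (d_in, d_in)
    S_out = torch.linalg.cholesky(cov_out)    # (d_out, d_out)
    W_white = S_out.T @ W @ S_in              # decompose in the whitened space
    U, s, Vh = torch.linalg.svd(W_white, full_matrices=False)
    Ur, sr, Vr = U[:, :rank], s[:rank], Vh[:rank, :]
    # map the truncated factors back to the original (unwhitened) space
    A = torch.linalg.solve(S_out.T, Ur * sr)  # (d_out, rank)
    B = torch.linalg.solve(S_in.T, Vr.T).T    # (rank, d_in)
    return A, B                               # W is approximated by A @ B
```

Truncating in the whitened space makes the SVD error loss-aware on both sides of the layer, which is the intuition the abstract attributes to using input and output information jointly.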


【6】Multimodal Language Models Cannot Spot Spatial Inconsistencies
链接:https://arxiv.org/abs/2604.00799

Authors: Om Khangaonkar, Hadi J. Rad, Hamed Pirsiavash
Abstract: Spatial consistency is a fundamental property of the visual world and a key requirement for models that aim to understand physical reality. Despite recent advances, multimodal large language models (MLLMs) often struggle to reason about 3D geometry across multiple views. Rather than asking models to describe scene attributes, we introduce a more challenging task: given two views of the same scene, identify the object that violates 3D motion consistency. We propose a simple and scalable method for generating realistic, spatially inconsistent image pairs from multi-view scenes, enabling systematic evaluation of this capability. Our results show that state-of-the-art MLLMs significantly underperform human observers and exhibit substantial variability across different scene attributes, revealing a fragile and incomplete understanding of 3D structure. We hope our findings underscore the need for approaches that develop a more deeply grounded understanding of the physical world.


【7】Scalable Pretraining of Large Mixture of Experts Language Models on Aurora Super Computer
链接:https://arxiv.org/abs/2604.00785

Authors: Dharma Teja Vooturi, Dhiraj Kalamkar, Dipankar Das, Bharat Kaul
Abstract: Pretraining Large Language Models (LLMs) from scratch requires a massive amount of compute. The Aurora supercomputer is an exascale machine with 127,488 Intel PVC (Ponte Vecchio) GPU tiles. In this work, we showcase LLM pretraining on Aurora at the scale of thousands of GPU tiles. Towards this effort, we developed Optimus, an in-house training library with support for standard large-model training techniques. Using Optimus, we first pretrained Mula-1B, a 1-billion-parameter dense model, and Mula-7B-A1B, a 7-billion-parameter Mixture-of-Experts (MoE) model, from scratch on 3072 GPU tiles for the full 4 trillion tokens of the OLMoE-mix-0924 dataset. We then demonstrated model scaling by pretraining three large MoE models, Mula-20B-A2B, Mula-100B-A7B, and Mula-220B-A10B, up to 100 billion tokens on the same dataset. On our largest model, Mula-220B-A10B, we pushed the compute scaling from 384 to 12288 GPU tiles and observed a scaling efficiency of around 90% at 12288 GPU tiles. We significantly improved the runtime performance of MoE models using custom GPU kernels for expert computation and a novel EP-Aware sharded optimizer, resulting in training speedups of up to 1.71x. As part of the Optimus library, we also developed a robust set of reliability and fault-tolerance features to improve training stability and continuity at scale.


【8】Spectral Compact Training: Pre-Training Large Language Models via Permanent Truncated SVD and Stiefel QR Retraction
链接:https://arxiv.org/abs/2604.00733

Authors: Björn Roman Kohlberger
Note: 8 pages, 3 figures, 4 tables. Patent pending: Irish Application PTIE20260000000219. Code at https://github.com/EctoSpace/SCT
Abstract: The memory wall remains the primary bottleneck for training large language models on consumer hardware. We introduce Spectral Compact Training (SCT), a method that replaces dense weight matrices with permanent truncated SVD factors W = U diag(s) V^T, where the full dense matrix is never materialized during training or inference. Gradients flow through the compact spectral factors via standard backpropagation, and U, V are retracted to the Stiefel manifold via QR decomposition after each optimizer step. SCT achieves up to 199x memory reduction per MLP layer at rank 32, enabling full training steps of 70B-parameter architectures on a Steam Deck handheld (7.2 GB peak memory vs. 1,245 GB for dense FP32 training with Adam). Rank-sweep experiments on SmolLM2-1.7B (ranks 32-256, 2000 steps, NVIDIA A100) show that all tested ranks converge to the same loss floor (~4.2-4.5), identifying the learning rate schedule -- not MLP rank -- as the primary bottleneck. Rank 128 emerges as the efficiency sweet spot at 11.7x MLP compression with the lowest perplexity. GPU memory drops 46% at rank 32 while training throughput doubles.
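
The mechanics are simple enough to sketch directly from the abstract: store only the factors of W = U diag(s) V^T and retract U and V to the Stiefel manifold by QR after each optimizer step. Rank and initialization below are illustrative assumptions.

```python
# Minimal sketch of Spectral Compact Training: a linear layer stored only as
# truncated SVD factors, with QR retraction after each optimizer step.
import torch
import torch.nn as nn

class SCTLinear(nn.Module):
    def __init__(self, d_in: int, d_out: int, rank: int = 32):
        super().__init__()
        self.U = nn.Parameter(torch.randn(d_out, rank) / d_out**0.5)
        self.s = nn.Parameter(torch.ones(rank))
        self.V = nn.Parameter(torch.randn(d_in, rank) / d_in**0.5)
        self.retract()  # start from orthonormal factors

    def forward(self, x):
        # never materialize the dense W = U diag(s) V^T
        return (x @ self.V) * self.s @ self.U.T

    @torch.no_grad()
    def retract(self):
        # QR retraction: restore orthonormal columns (Stiefel manifold)
        self.U.copy_(torch.linalg.qr(self.U).Q)
        self.V.copy_(torch.linalg.qr(self.V).Q)

# usage: after optimizer.step(), call layer.retract() for every SCT layer
```

Because the layer stores (d_in + d_out + 1) x rank numbers instead of d_in x d_out, memory per MLP layer shrinks by roughly the factor the abstract reports at small ranks.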


【9】Exploring Silent Data Corruption as a Reliability Challenge in LLM Training
链接:https://arxiv.org/abs/2604.00726

Authors: Anton Altenbernd, Philipp Wiesner, Odej Kao
Note: 10 pages, 4 figures, CCGrid 2026
Abstract: As Large Language Models (LLMs) scale in size and complexity, the consequences of failures during training become increasingly severe. A major challenge arises from Silent Data Corruption (SDC): hardware-induced faults that bypass system-level detection mechanisms. SDC may behave like benign numerical noise, but can also cause harmful gradient corruption that leads to loss spikes, divergence, or stalled progress. This work provides a controlled study of how intermittent SDC affects LLM pretraining. Using targeted fault injection at the level of GPU matrix-multiply instructions, we characterize the sensitivity of different bit positions, kernel functions, and execution stages. Our analysis shows that locally originating faults can produce impactful corruption, including NaN propagation, short-lived spikes in loss, gradient norm, and attention logits, as well as persistent parameter divergence. Building on the observed corruption signatures, we propose a lightweight detection method that identifies potentially harmful parameter updates. Experiments on LLaMA models with 60M, 350M, and 1.3B parameters demonstrate that recomputing the most recent training step upon detection can effectively mitigate the impact of these events.


【10】A Survey of On-Policy Distillation for Large Language Models
链接:https://arxiv.org/abs/2604.00626

Authors: Mingyang Song, Mao Zheng
Abstract: Knowledge distillation has become a primary mechanism for transferring reasoning and domain expertise from frontier Large Language Models (LLMs) to smaller, deployable students. However, the dominant paradigm remains \textit{off-policy}: students train on static teacher-generated data and never encounter their own errors during learning. This train--test mismatch, an instance of \textit{exposure bias}, causes prediction errors to compound autoregressively at inference time. On-Policy Distillation (OPD) addresses this by letting the student generate its own trajectories and receive teacher feedback on these self-generated outputs, grounding distillation in the theory of interactive imitation learning. Despite rapid growth spanning divergence minimization, reward-guided learning, and self-play, the OPD literature remains fragmented with no unified treatment. This survey provides the first comprehensive overview of OPD for LLMs. We introduce a unified $f$-divergence framework over on-policy samples and organize the landscape along three orthogonal dimensions: \emph{feedback signal} (logit-based, outcome-based, or self-play), \emph{teacher access} (white-box, black-box, or teacher-free), and \emph{loss granularity} (token-level, sequence-level, or hybrid). We systematically analyze representative methods, examine industrial deployments, and identify open problems including distillation scaling laws, uncertainty-aware feedback, and agent-level distillation.
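
For concreteness, here is a minimal sketch of one on-policy distillation step with a token-level reverse KL, one member of the $f$-divergence family the survey unifies. The Hugging Face-style model interface (generate, .logits) is an assumption for illustration, not an API the survey prescribes.

```python
# Minimal sketch of one on-policy distillation step (white-box teacher,
# logit-based feedback, token-level loss).
import torch
import torch.nn.functional as F

def opd_step(student, teacher, prompt_ids, max_new_tokens=64):
    # 1) the student generates its own trajectory (on-policy samples)
    with torch.no_grad():
        traj = student.generate(prompt_ids, max_new_tokens=max_new_tokens,
                                do_sample=True)
    # 2) teacher feedback on the self-generated tokens
    with torch.no_grad():
        t_logp = F.log_softmax(teacher(traj).logits, dim=-1)
    s_logp = F.log_softmax(student(traj).logits, dim=-1)
    # 3) reverse KL at each position: expectation under the student itself
    rkl = (s_logp.exp() * (s_logp - t_logp)).sum(-1)      # (batch, seq)
    # keep only positions that predict the generated continuation
    return rkl[:, prompt_ids.size(1) - 1 : -1].mean()
```

Swapping the divergence in step 3 (forward KL, Jensen-Shannon, etc.) moves the step to other members of the same family, which is exactly the unification the survey describes.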


【11】Scheduling LLM Inference with Uncertainty-Aware Output Length Predictions
链接:https://arxiv.org/abs/2604.00499

Authors: Haoyu Zheng, Yongqiang Zhang, Fangcheng Fu, Xiaokai Zhou, Hao Luo, Hongchao Zhu, Yuanyuan Zhu, Hao Wang, Xiao Yan, Jiawei Jiang
Abstract: To schedule LLM inference, the \textit{shortest job first} (SJF) principle is favorable because it prioritizes requests with short output lengths to avoid head-of-line (HOL) blocking. Existing methods usually predict a single output length for each request to facilitate scheduling. We argue that such a \textit{point estimate} does not match the \textit{stochastic} decoding process of LLM inference, where output length is \textit{uncertain} by nature and determined by when the end-of-sequence (EOS) token is sampled. Hence, the output length of each request should be fitted with a distribution rather than a single value. With an in-depth analysis of empirical data and the stochastic decoding process, we observe that output length follows a heavy-tailed distribution and can be fitted with the log-t distribution. On this basis, we propose a simple metric called Tail Inflated Expectation (TIE) to replace the output length in SJF scheduling, which adjusts the expectation of a log-t distribution with its tail probabilities to account for the risk that a request generates long outputs. To evaluate our TIE scheduler, we compare it with three strong baselines, and the results show that TIE reduces the per-token latency by $2.31\times$ for online inference and improves throughput by $1.42\times$ for offline data generation.
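
The abstract does not give the exact TIE formula, so the sketch below is one plausible instantiation: fit a Student-t to log output lengths (the log-t model), then inflate a truncated mean by the fitted tail probability. The cap, penalty weight, and Monte Carlo estimation are invented parameters, not the paper's definition.

```python
# Minimal sketch of a tail-inflated length estimate over a log-t fit.
import numpy as np
from scipy import stats

def tail_inflated_expectation(lengths, cap=4096, lam=2.0, n_mc=100_000):
    df, loc, scale = stats.t.fit(np.log(lengths))     # log-t fit
    mc = np.exp(stats.t.rvs(df, loc, scale, size=n_mc))
    truncated_mean = np.minimum(mc, cap).mean()       # a plain mean may not exist
    tail_prob = (mc > cap).mean()                     # risk of a very long output
    return truncated_mean + lam * cap * tail_prob     # inflate by tail risk

# requests are then scheduled shortest-TIE-first instead of by a point estimate
```

Note that a heavy-tailed log-t variable can have an infinite raw expectation, which is one reason a tail-adjusted statistic rather than the plain mean is needed for scheduling.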


【12】A Reasoning-Enabled Vision-Language Foundation Model for Chest X-ray Interpretation
链接:https://arxiv.org/abs/2604.00493

Authors: Yabin Zhang, Chong Wang, Yunhe Gao, Jiaming Liu, Maya Varma, Justin Xu, Sophie Ostmeier, Jin Long, Sergios Gatidis, Seena Dehkharghani, Arne Michalson, Eun Kyoung Hong, Christian Bluethgen, Haiwei Henry Guo, Alexander Victor Ortiz, Stephan Altmayer, Sandhya Bodapati, Joseph David Janizek, Ken Chang, Jean-Benoit Delbrouck, Akshay S. Chaudhari, Curtis P. Langlotz
Note: Code: https://github.com/YBZh/CheXOne; Models: https://huggingface.co/StanfordAIMI/CheXOne
Abstract: Chest X-rays (CXRs) are among the most frequently performed imaging examinations worldwide, yet rising imaging volumes increase radiologist workload and the risk of diagnostic errors. Although artificial intelligence (AI) systems have shown promise for CXR interpretation, most generate only final predictions, without making explicit how visual evidence is translated into radiographic findings and diagnostic predictions. We present CheXOne, a reasoning-enabled vision-language model for CXR interpretation. CheXOne jointly generates diagnostic predictions and explicit, clinically grounded reasoning traces that connect visual evidence, radiographic findings, and these predictions. The model is trained on 14.7 million instruction and reasoning samples curated from 30 public datasets spanning 36 CXR interpretation tasks, using a two-stage framework that combines instruction tuning with reinforcement learning to improve reasoning quality. We evaluate CheXOne in zero-shot settings across visual question answering, report generation, visual grounding and reasoning assessment, covering 17 evaluation settings. CheXOne outperforms existing medical and general-domain foundation models and achieves strong performance on independent public benchmarks. A clinical reader study demonstrates that CheXOne-drafted reports are comparable to or better than resident-written reports in 55% of cases, while effectively addressing clinical indications and enhancing both report writing and CXR interpretation efficiency. Further analyses involving radiologists reveal that the generated reasoning traces show high clinical factuality and provide causal support for the final predictions, offering a plausible explanation for the performance gains. These results suggest that explicit reasoning can improve model performance, interpretability and clinical utility in AI-assisted CXR interpretation.


【13】G-Drift MIA: Membership Inference via Gradient-Induced Feature Drift in LLMs
链接:https://arxiv.org/abs/2604.00419

Authors: Ravi Ranjan, Utkarsh Grover, Xiaomin Lin, Agoritsa Polyzou
Note: 14 pages, 3 figures and tables. Accepted at ICPR 2026, to appear in the Springer LNCS proceedings
Abstract: Large language models (LLMs) are trained on massive web-scale corpora, raising growing concerns about privacy and copyright. Membership inference attacks (MIAs) aim to determine whether a given example was used during training. Existing LLM MIAs largely rely on output probabilities or loss values and often perform only marginally better than random guessing when members and non-members are drawn from the same distribution. We introduce G-Drift MIA, a white-box membership inference method based on gradient-induced feature drift. Given a candidate (x,y), we apply a single targeted gradient-ascent step that increases its loss and measure the resulting changes in internal representations, including logits, hidden-layer activations, and projections onto fixed feature directions, before and after the update. These drift signals are used to train a lightweight logistic classifier that effectively separates members from non-members. Across multiple transformer-based LLMs and datasets derived from realistic MIA benchmarks, G-Drift substantially outperforms confidence-based, perplexity-based, and reference-based attacks. We further show that memorized training samples systematically exhibit smaller and more structured feature drift than non-members, providing a mechanistic link between gradient geometry, representation stability, and memorization. In general, our results demonstrate that small, controlled gradient interventions offer a practical tool for auditing the membership of training data and assessing privacy risks in LLMs.
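
The attack pipeline can be sketched compactly: clone the model, take one gradient-ascent step on the candidate's loss, and measure how far internal quantities move. The version below uses only logit drift and an arbitrary step size as simplifying assumptions; the paper also uses hidden-state and feature-projection drifts.

```python
# Minimal sketch of a gradient-induced drift feature for one candidate.
import copy
import torch

def logit_drift(model, input_ids, labels, lr=1e-4):
    probe = copy.deepcopy(model)            # leave the audited model untouched
    with torch.no_grad():
        logits_before = probe(input_ids).logits
    loss = probe(input_ids, labels=labels).loss
    loss.backward()
    with torch.no_grad():
        for p in probe.parameters():        # gradient *ascent*: increase the loss
            if p.grad is not None:
                p.add_(lr * p.grad)
        logits_after = probe(input_ids).logits
    return (logits_after - logits_before).norm().item()

# drift features from many candidates feed a lightweight logistic classifier;
# per the abstract, members tend to show smaller, more structured drift
```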


【14】Decision-Centric Design for LLM Systems
链接:https://arxiv.org/abs/2604.00414

Authors: Wei Sun
Abstract: LLM systems must make control decisions in addition to generating outputs: whether to answer, clarify, retrieve, call tools, repair, or escalate. In many current architectures, these decisions remain implicit within generation, entangling assessment and action in a single model call and making failures hard to inspect, constrain, or repair. We propose a decision-centric framework that separates decision-relevant signals from the policy that maps them to actions, turning control into an explicit and inspectable layer of the system. This separation supports attribution of failures to signal estimation, decision policy, or execution, and enables modular improvement of each component. It unifies familiar single-step settings such as routing and adaptive inference, and extends naturally to sequential settings in which actions alter the information available before acting. Across three controlled experiments, the framework reduces futile actions, improves task success, and reveals interpretable failure modes. More broadly, it offers a general architectural principle for building more reliable, controllable, and diagnosable LLM systems.
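
A toy rendering of the proposed separation is shown below: decision-relevant signals live in an explicit structure, and a small, inspectable policy maps them to control actions. The signal names and thresholds are invented for illustration, not taken from the paper.

```python
# Minimal sketch of an explicit decision layer, separate from generation.
from dataclasses import dataclass

@dataclass
class Signals:
    answer_confidence: float   # how sure the system is it can answer directly
    retrieval_gain: float      # estimated benefit of retrieving context
    ambiguity: float           # how underspecified the user request is

def policy(s: Signals) -> str:
    # an inspectable mapping from signals to control actions; because it is
    # explicit, a failure can be blamed on bad signals or on a bad rule
    if s.ambiguity > 0.6:
        return "clarify"
    if s.retrieval_gain > s.answer_confidence:
        return "retrieve"
    if s.answer_confidence > 0.5:
        return "answer"
    return "escalate"
```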


【15】Is One Token All It Takes? Graph Pooling Tokens for LLM-based GraphQA
链接:https://arxiv.org/abs/2604.00342

Authors: Ankit Grover, Lodovico Giaretta, Rémi Bourgerie, Sarunas Girdzijauskas
Note: Accepted at LREC, KG-LLM Workshop 2026
Abstract: The integration of Graph Neural Networks (GNNs) with Large Language Models (LLMs) has emerged as a promising paradigm for Graph Question Answering (GraphQA). However, effective methods for encoding complex structural information into the LLM's latent space remain an open challenge. Current state-of-the-art architectures, such as G-Retriever, typically rely on standard GNNs and aggressive mean pooling to compress entire graph substructures into a single token, creating a severe information bottleneck. This work mitigates this bottleneck by investigating two orthogonal strategies: (1) increasing the bandwidth of the graph-to-LLM interface via multi-token pooling, and (2) enhancing the semantic quality of the graph encoder via global attention mechanisms. We evaluate a suite of hierarchical pruning and clustering-based pooling operators including Top-k, SAGPool, DiffPool, MinCutPool, and Virtual Node Pooling (VNPool) to project graph data into multiple learnable tokens. Empirically, we demonstrate that while pooling introduces significant instability during soft prompt tuning, the application of Low-Rank Adaptation (LoRA) effectively stabilizes specific hierarchical projections (notably VNPool and pruning methods), though dense clustering operators remain challenging. This stabilization allows compressed representations to rival full-graph baselines (achieving ~73% Hit@1 on WebQSP). Conceptually, we demonstrate that a Graph Transformer with VNPool implementation functions structurally as a single-layer Perceiver IO encoder. Finally, we adapt the FandE (Features and Edges) Score to the generative GraphQA domain. Our analysis reveals that the GraphQA benchmark suffers from representational saturation, where target answers are often highly correlated with isolated node features. The implementation is available at https://github.com/Agrover112/G-Retriever/tree/all_good/


【16】Diversity-Aware Reverse Kullback-Leibler Divergence for Large Language Model Distillation
链接:https://arxiv.org/abs/2604.00223

Authors: Hoang-Chau Luong, Dat Ba Tran, Lingwei Chen
Abstract: Reverse Kullback-Leibler (RKL) divergence has recently emerged as the preferred objective for large language model (LLM) distillation, consistently outperforming forward KL (FKL), particularly in regimes with large vocabularies and significant teacher-student capacity mismatch, where RKL focuses learning on dominant modes rather than enforcing dense alignment. However, RKL introduces a structural limitation that drives the student toward overconfident predictions. We first provide an analysis of RKL by decomposing its gradients into target and non-target components, and show that non-target gradients consistently push the target logit upward even when the student already matches the teacher, thereby reducing output diversity. In addition, RKL provides weak supervision over non-target classes, leading to poor tail alignment. To address these issues, we propose Diversity-aware RKL (DRKL), which removes this gradient effect and strengthens non-target supervision while preserving the optimization benefits of RKL. Extensive experiments across datasets and model families demonstrate that DRKL consistently outperforms FKL, RKL, and other state-of-the-art distillation objectives, achieving better performance and a superior fidelity-diversity trade-off.
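
The quantity under analysis is easy to reproduce in a toy setting. The snippet below computes the reverse-KL objective at a single token position and inspects the gradient it induces on the student's target logit, the gradient whose target/non-target split motivates DRKL; the vocabulary size and target index are arbitrary.

```python
# Toy example: reverse KL and its gradient on the target logit.
import torch
import torch.nn.functional as F

V, target = 8, 3
student_logits = torch.randn(V, requires_grad=True)
teacher_logits = torch.randn(V)

p_s = F.softmax(student_logits, dim=-1)
p_t = F.softmax(teacher_logits, dim=-1)
rkl = (p_s * (p_s.log() - p_t.log())).sum()   # KL(student || teacher)
rkl.backward()

print("gradient on target logit:", student_logits.grad[target].item())
# DRKL's analysis: part of this gradient comes from non-target terms and keeps
# pushing the target logit upward even once the student matches the teacher
```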


【17】ParetoBandit: Budget-Paced Adaptive Routing for Non-Stationary LLM Serving
链接:https://arxiv.org/abs/2604.00136

Authors: Annette Taberner-Miller
Note: 27 pages, 15 figures, 13 tables. Code available at https://github.com/ParetoBandit/ParetoBandit
Abstract: Production LLM serving often relies on multi-model portfolios spanning a ~530x cost range, where routing decisions trade off quality against cost. This trade-off is non-stationary: providers revise pricing, model quality can regress silently, and new models must be integrated without downtime. We present ParetoBandit, an open-source adaptive router built on cost-aware contextual bandits that is the first to simultaneously enforce dollar-denominated budgets, adapt online to such shifts, and onboard new models at runtime. ParetoBandit closes these gaps through three mechanisms. An online primal-dual budget pacer enforces a per-request cost ceiling over an open-ended stream, replacing offline penalty tuning with closed-loop control. Geometric forgetting on sufficient statistics enables rapid adaptation to price and quality shifts while bootstrapping from offline priors. A hot-swap registry lets operators add or remove models at runtime, with a brief forced-exploration phase for each newcomer, after which UCB selection discovers its quality-cost niche from live traffic alone. We evaluate ParetoBandit across four deployment scenarios on 1,824 prompts routed through a three-model portfolio. Across seven budget ceilings, mean per-request cost never exceeds the target by more than 0.4%. When conditions shift, the system adapts: an order-of-magnitude price cut on the costliest model yields up to +0.071 quality lift, and a silent quality regression is detected and rerouted within budget. A cold-started model reaches meaningful adoption within ~142 steps without breaching the cost ceiling. The router discriminates rather than blindly adopting: expensive models are budget-gated and low-quality models rejected after bounded exploration. End-to-end routing latency is 9.8ms on CPU -- less than 0.4% of typical inference time -- with the routing decision itself taking just 22.5us.
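
A compact sketch of the three mechanisms, under illustrative constants, might look as follows: a dual price on cost enforces the budget, geometric discounting of the sufficient statistics enables adaptation, and UCB scores drive selection. This is a schematic reconstruction from the abstract, not the released implementation.

```python
# Minimal sketch of a budget-paced, forgetting UCB router.
import numpy as np

class BudgetPacedUCB:
    def __init__(self, n_models, budget_per_req, gamma=0.99, eta=0.05, c=1.0):
        self.q_sum = np.zeros(n_models)    # discounted quality sums
        self.n = np.full(n_models, 1e-6)   # discounted pull counts
        self.lam = 0.0                     # dual variable: the "price" of cost
        self.budget, self.gamma, self.eta, self.c = budget_per_req, gamma, eta, c

    def route(self, costs):
        mean_q = self.q_sum / self.n
        bonus = self.c * np.sqrt(np.log(self.n.sum() + 1) / self.n)
        # UCB on quality, penalized by the current dual price on cost
        return int(np.argmax(mean_q + bonus - self.lam * np.asarray(costs)))

    def update(self, arm, quality, cost):
        self.q_sum *= self.gamma           # geometric forgetting: adapt to shifts
        self.n *= self.gamma
        self.q_sum[arm] += quality
        self.n[arm] += 1.0
        # primal-dual pacing: raise the cost price when spending exceeds budget
        self.lam = max(0.0, self.lam + self.eta * (cost - self.budget))
```

Onboarding a new model then amounts to appending a fresh arm with a short forced-exploration phase before the UCB scores take over.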


【18】Hierarchical Pre-Training of Vision Encoders with Large Language Models
链接:https://arxiv.org/abs/2604.00086

Authors: Eugene Lee, Ting-Yu Chang, Jui-Huang Tsai, Jiajie Diao, Chen-Yi Lee
Note: 17 pages, 14 figures, accepted to Computer Vision and Pattern Recognition Conference (CVPR) Workshops 2026. 5th MMFM Workshop: What is Next in Multimodal Foundation Models?
Abstract: The field of computer vision has experienced significant advancements through scalable vision encoders and multimodal pre-training frameworks. However, existing approaches often treat vision encoders and large language models (LLMs) as independent modules, limiting the integration of hierarchical visual features. In this work, we propose HIVE (Hierarchical Pre-Training of Vision Encoders), a novel framework that enhances vision-language alignment by introducing hierarchical cross-attention between the vision encoder and LLM. Unlike conventional methods that flatten image embeddings, HIVE enables structured feature fusion across multiple layers, improving gradient flow and representation learning. To optimize this interaction, we introduce a three-stage training strategy that progressively aligns the vision encoder with the LLM, ensuring stable optimization and effective multimodal fusion. Empirical evaluations demonstrate that HIVE achieves superior performance not only in image classification but also on various vision-language tasks, outperforming self-attention-based methods in benchmarks such as MME, GQA, OK-VQA, and ScienceQA. Our results highlight the benefits of hierarchical feature integration, paving the way for more efficient and expressive vision-language models.


【19】Task-Centric Personalized Federated Fine-Tuning of Language Models
链接:https://arxiv.org/abs/2604.00050

Authors: Gabriel U. Talasso, Meghdad Kurmanji, Allan M. de Souza, Nicholas D. Lane, Leandro A. Villas
Abstract: Federated Learning (FL) has emerged as a promising technique for training language models on distributed and private datasets of diverse tasks. However, aggregating models trained on heterogeneous tasks often degrades the overall performance of individual clients. To address this issue, Personalized FL (pFL) aims to create models tailored for each client's data distribution. Although these approaches improve local performance, they usually lack robustness in two aspects: (i) generalization, when clients must make predictions on unseen tasks or face changes in their data distributions, and (ii) intra-client task interference, when a single client's data contains multiple distributions that may interfere with each other during local training. To tackle these two challenges, we propose FedRouter, a clustering-based pFL method that builds specialized models for each task rather than for each client. FedRouter uses adapters to personalize models, employing two clustering mechanisms to associate adapters with specific tasks: a local clustering that associates adapters with task data samples, and a global one that associates similar adapters from different clients to construct task-centric personalized models. Additionally, we propose an evaluation router mechanism that routes test samples to the best adapter based on the created clusters. In experiments comparing our method with existing approaches on a multitask dataset, FedRouter demonstrates strong resilience in these challenging scenarios, performing up to 6.1% better in relative terms under task interference and achieving up to 136% relative improvement under generalization evaluation.


【20】Large Language Models for Analyzing Enterprise Architecture Debt in Unstructured Documentation
链接:https://arxiv.org/abs/2604.00046

Authors: Christin Pagels, Simon Hacks, Rob Henk Bemthuis
Note: Author version, 2 figures, 5 tables. To appear in the Proceedings of the 41st ACM/SIGAPP Symposium on Applied Computing (SAC '26), 2026
Abstract: Enterprise Architecture Debt (EA Debt) arises from suboptimal design decisions and misaligned components that can degrade an organization's IT landscape over time. Early indicators, Enterprise Architecture Smells (EA Smells), are currently mainly detected manually or only from structured artifacts, leaving much unstructured documentation under-analyzed. This study proposes an approach using a large language model (LLM) to identify and quantify EA Debt in unstructured architectural documentation. Following a design science research approach, we design and evaluate an LLM-based prototype for automated EA Smell detection. The artifact ingests unstructured documents (e.g., process descriptions, strategy papers), applies fine-tuned detection models, and outputs identified smells. We evaluate the prototype through a case study using synthetic yet realistic business documents, benchmarking against a custom GPT-based model. Results show that LLMs can detect multiple predefined EA Smells in unstructured text, with the benchmark model achieving higher precision and processing speed, and the fine-tuned on-premise model offering data protection advantages. The findings highlight opportunities for integrating LLM-based smell detection into EA governance practice.


【21】Scalable Identification and Prioritization of Requisition-Specific Personal Competencies Using Large Language Models
链接:https://arxiv.org/abs/2604.00006

Authors: Wanxin Li, Denver McNeney, Nivedita Prabhu, Charlene Zhang, Renee Barr, Matthew Kitching, Khanh Dao Duc, Anthony S. Boyce
Abstract: AI-powered recruitment tools are increasingly adopted in personnel selection, yet they struggle to capture the requisition (req)-specific personal competencies (PCs) that distinguish successful candidates beyond job categories. We propose a large language model (LLM)-based approach to identify and prioritize req-specific PCs from reqs. Our approach integrates dynamic few-shot prompting, reflection-based self-improvement, similarity-based filtering, and multi-stage validation. Applied to a dataset of Program Manager reqs, our approach correctly identifies the highest-priority req-specific PCs with an average accuracy of 0.76, approaching human expert inter-rater reliability, and maintains a low out-of-scope rate of 0.07.


【22】Two-Stage Optimizer-Aware Online Data Selection for Large Language Models
链接:https://arxiv.org/abs/2604.00001

Authors: Fangxin Wang, Peyman Baghershahi, Langzhou He, Henry Peng Zou, Sourav Medya, Philip S. Yu
Note: 22 pages, 2 figures, 6 tables
Abstract: Gradient-based data selection offers a principled framework for estimating sample utility in large language model (LLM) fine-tuning, but existing methods are mostly designed for offline settings. They are therefore less suited to online fine-tuning, where data arrives sequentially, sample utility is step-dependent, and the effective update geometry is shaped by adaptive optimizers. We propose an optimizer-aware framework for gradient-based online data selection and reweighting in LLM fine-tuning. Our key idea is to view online selection not as static sample ranking, but as shaping the next target-oriented update under the optimizer state. We formulate this as an optimizer-aware update-matching problem, establish its connection to second-order target utility, and show why subset-level construction must account for interactions and redundancy among selected samples. Based on this view, we develop a two-stage Filter-then-Weight algorithm that first filters geometrically useful candidates and then optimizes their coefficients. To make the framework practical for LLMs, we introduce a factorized outer-product gradient representation and optimized matrix computations for long-context data. Experiments show that our method consistently improves convergence and downstream performance over existing online data selection baselines under the same data budget.
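
Assuming per-sample gradient vectors are already available, a filter-then-weight step can be sketched as below: keep candidates whose gradients align with a target-task gradient, then pick coefficients by least squares so the weighted update matches the target direction. The similarity threshold and nonnegativity clamp are illustrative assumptions, and the optimizer-state preconditioning the paper emphasizes is omitted.

```python
# Minimal sketch of one filter-then-weight step via update matching.
import numpy as np

def filter_then_weight(cand_grads, target_grad, sim_threshold=0.0):
    # cand_grads: (n_candidates, d); target_grad: (d,)
    sims = cand_grads @ target_grad / (
        np.linalg.norm(cand_grads, axis=1) * np.linalg.norm(target_grad) + 1e-12)
    keep = np.where(sims > sim_threshold)[0]       # stage 1: geometric filter
    G = cand_grads[keep]
    # stage 2: weights minimizing ||G^T w - target_grad||^2 (update matching)
    w, *_ = np.linalg.lstsq(G.T, target_grad, rcond=None)
    return keep, np.clip(w, 0.0, None)             # subset indices and weights
```

Solving for the whole subset jointly, rather than scoring samples independently, is what accounts for the interaction and redundancy effects the abstract highlights.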


【23】GenoBERT: A Language Model for Accurate Genotype Imputation
链接:https://arxiv.org/abs/2604.00058

Authors: Lei Huang, Chuan Qiu, Kuan-Jui Su, Anqi Liu, Yun Gong, Weiqiang Lin, Lindong Jiang, Chen Zhao, Meng Song, Jeffrey Deng, Qing Tian, Zhe Luo, Ping Gong, Hui Shen, Chaoyang Zhang, Hong-Wen Deng
Abstract: Genotype imputation enables dense variant coverage for genome-wide association and risk-prediction studies, yet conventional reference-panel methods remain limited by ancestry bias and reduced rare-variant accuracy. We present Genotype Bidirectional Encoder Representations from Transformers (GenoBERT), a transformer-based, reference-free framework that tokenizes phased genotypes and uses a self-attention mechanism to capture both short- and long-range linkage disequilibrium (LD) dependencies. Benchmarking on two independent datasets including the Louisiana Osteoporosis Study (LOS) and the 1000 Genomes Project (1KGP) across ancestry groups and multiple genotype missingness levels (5-50%) shows that GenoBERT achieves the highest overall accuracy compared to four baseline methods (Beagle5.4, SCDA, BiU-Net, and STICI). At practical sparsity levels (up to 25% missing), GenoBERT attains high overall imputation accuracy ($r^2 \approx 0.98$) across datasets, and maintains robust performance ($r^2 > 0.90$) even at 50% missingness. Experimental results across different ancestries confirm consistent gains across datasets, with resilience to small sample sizes and weak LD. A 128-SNP (single-nucleotide polymorphism) context window (approximately 100 Kb) is validated through LD-decay analyses as sufficient to capture local correlation structures. By eliminating reference-panel dependence while preserving high accuracy, GenoBERT provides a scalable and robust solution for genotype imputation and a foundation for downstream genomic modeling.


Graphs (graph learning | graph neural networks | graph optimization, etc.) (4 papers)

【1】EmbedPart: Embedding-Driven Graph Partitioning for Scalable Graph Neural Network Training
链接:https://arxiv.org/abs/2604.01000

Authors: Nikolai Merkel, Ruben Mayer, Volker Markl, Hans-Arno Jacobsen
Abstract: Graph Neural Networks (GNNs) are widely used for learning on graph-structured data, but scaling GNN training to massive graphs remains challenging. To enable scalable distributed training, graphs are divided into smaller partitions that are distributed across multiple machines such that inter-machine communication is minimized and computational load is balanced. In practice, existing partitioning approaches face a fundamental trade-off between partitioning overhead and partitioning quality. We propose EmbedPart, an embedding-driven partitioning approach that achieves both speed and quality. Instead of operating directly on irregular graph structures, EmbedPart leverages node embeddings produced during the actual GNN training workload and clusters these dense embeddings to derive a partitioning. EmbedPart achieves more than 100x speedup over Metis while maintaining competitive partitioning quality and accelerating distributed GNN training. Moreover, EmbedPart naturally supports graph updates and fast repartitioning, and can be applied to graph reordering to improve data locality and accelerate single-machine GNN training. By shifting partitioning from irregular graph structures to dense embeddings, EmbedPart enables scalable and high-quality graph data optimization.
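
The core idea reduces to clustering in embedding space; a minimal sketch with scikit-learn k-means follows. The load-balancing and streaming-repartitioning details the paper needs for distributed training are omitted here, and the use of plain k-means is an assumption about the clustering step.

```python
# Minimal sketch of embedding-driven graph partitioning.
import numpy as np
from sklearn.cluster import KMeans

def embed_partition(node_embeddings: np.ndarray, n_parts: int) -> np.ndarray:
    # node_embeddings: (n_nodes, d), produced during GNN training
    km = KMeans(n_clusters=n_parts, n_init=10, random_state=0)
    return km.fit_predict(node_embeddings)   # partition id per node

# nearby nodes in embedding space tend to be strongly connected, so clusters
# approximate low-edge-cut partitions without touching the irregular graph
```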


【2】A Cross-graph Tuning-free GNN Prompting Framework
链接:https://arxiv.org/abs/2604.00399

Authors: Yaqi Chen, Shixun Huang, Ryan Twemlow, Lei Wang, John Le, Sheng Wang, Willy Susilo, Jun Yan, Jun Shen
Abstract: GNN prompting aims to adapt models across tasks and graphs without requiring extensive retraining. However, most existing graph prompt methods still require task-specific parameter updates and face the issue of generalizing across graphs, limiting their performance and undermining the core promise of prompting. In this work, we introduce a Cross-graph Tuning-free Prompting Framework (CTP), which supports both homogeneous and heterogeneous graphs, can be directly deployed to unseen graphs without further parameter tuning, and thus enables a plug-and-play GNN inference engine. Extensive experiments on few-shot prediction tasks show that, compared to state-of-the-art methods, CTP achieves an average accuracy gain of 30.8% and a maximum gain of 54%, confirming its effectiveness and offering a new perspective on graph prompt learning.


【3】Hierarchical Discrete Flow Matching for Graph Generation
链接:https://arxiv.org/abs/2604.00236

Authors: Yoann Boget, Pablo Strasser, Alexandros Kalousis
Note: Graph, generation, hierarchical
Abstract: Denoising-based models, including diffusion and flow matching, have led to substantial advances in graph generation. Despite this progress, such models remain constrained by two fundamental limitations: a computational cost that scales quadratically with the number of nodes and a large number of function evaluations required during generation. In this work, we introduce a novel hierarchical generative framework that reduces the number of node pairs that must be evaluated and adopts discrete flow matching to significantly decrease the number of denoising iterations. We empirically demonstrate that our approach more effectively captures graph distributions while substantially reducing generation time.


【4】Epileptic Seizure Detection in Separate Frequency Bands Using Feature Analysis and Graph Convolutional Neural Network (GCN) from Electroencephalogram (EEG) Signals
链接:https://arxiv.org/abs/2604.00163

Authors: Ferdaus Anam Jibon, Fazlul Hasan Siddiqui, F. Deeba, Gahangir Hossain
Abstract: Epileptic seizures are neurological disorders characterized by abnormal and excessive electrical activity in the brain, resulting in recurrent seizure events. Electroencephalogram (EEG) signals are widely used for seizure diagnosis due to their ability to capture temporal and spatial neural dynamics. While recent deep learning methods have achieved high detection accuracy, they often lack interpretability and neurophysiological relevance. This study presents a frequency-aware framework for epileptic seizure detection based on ictal-phase EEG analysis. The raw EEG signals are decomposed into five frequency bands (delta, theta, alpha, lower beta, and higher beta), and eleven discriminative features are extracted from each band. A graph convolutional neural network (GCN) is then employed to model spatial dependencies among EEG electrodes, represented as graph nodes. Experiments on the CHB-MIT scalp EEG dataset demonstrate high detection performance, achieving accuracies of 97.1%, 97.13%, 99.5%, 99.7%, and 51.4% across the respective frequency bands, with an overall broadband accuracy of 99.01%. The results highlight the strong discriminative capability of mid-frequency bands and reveal frequency-specific seizure patterns. The proposed approach improves interpretability and diagnostic precision compared to conventional broadband EEG-based methods.
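
The band-decomposition step is standard signal processing and can be sketched directly; the band edges below follow common EEG conventions and the CHB-MIT sampling rate of 256 Hz, since the paper's exact cutoffs are not stated in the abstract.

```python
# Minimal sketch of the five-band EEG decomposition with zero-phase
# Butterworth band-pass filters.
import numpy as np
from scipy.signal import butter, filtfilt

BANDS = {"delta": (0.5, 4), "theta": (4, 8), "alpha": (8, 13),
         "lower_beta": (13, 20), "higher_beta": (20, 30)}

def decompose_bands(eeg: np.ndarray, fs: float = 256.0) -> dict:
    # eeg: (n_channels, n_samples) raw signal
    out = {}
    for name, (lo, hi) in BANDS.items():
        b, a = butter(4, [lo / (fs / 2), hi / (fs / 2)], btype="band")
        out[name] = filtfilt(b, a, eeg, axis=-1)   # zero-phase filtering
    return out
```

Per-band features extracted from these signals then become node attributes for the GCN over the electrode graph.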


Transformers (9 papers)

【1】WARP: Guaranteed Inner-Layer Repair of NLP Transformers
链接:https://arxiv.org/abs/2604.00938

Authors: Hsin-Ling Hsu, Min-Yu Chen, Nai-Chia Chen, Yan-Ru Chen, Yi-Ling Chang, Fang Yu
Abstract: Transformer-based NLP models remain vulnerable to adversarial perturbations, yet existing repair methods face a fundamental trade-off: gradient-based approaches offer flexibility but lack verifiability and often overfit; methods that do provide repair guarantees are restricted to the final layer or small networks, significantly limiting the parameter search space available for repair. We present WARP (Weight-Adjusted Repair with Provability), a constraint-based repair framework that extends repair beyond the last layer of Transformer models. WARP formulates repair as a convex quadratic program derived from a first-order linearization of the logit gap, enabling tractable optimization over a high-dimensional parameter space. Under the condition that the first-order approximation holds, this formulation induces three per-sample guarantees: (i) a positive margin constraint ensuring correct classification on repaired inputs, (ii) preservation constraints over a designated remain set, and (iii) a certified robustness radius derived from Lipschitz continuity. To ensure feasibility across varying model architectures, we introduce a sensitivity-based preprocessing step that conditions the optimization landscape accordingly. We further show that the iterative optimization procedure converges to solutions satisfying all repair constraints under mild assumptions. Empirical evaluation on encoder-only Transformers with varying layer architectures validates that these guarantees hold in practice while improving robustness to adversarial inputs. Our results demonstrate that guaranteed, generalizable Transformer repair is achievable through principled constraint-based optimization.
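
Under the first-order linearization, the repair step is an ordinary QP; a minimal cvxpy sketch is below, covering the margin constraint (i) and preservation constraints (ii), with precomputed, flattened logit-gap gradients as an assumed input. The certified-radius constraint (iii) and the sensitivity-based preprocessing are omitted, so this is a schematic reconstruction, not the paper's full formulation.

```python
# Minimal sketch of a constraint-based repair QP under first-order
# linearization of the logit gap.
import cvxpy as cp
import numpy as np

def repair_qp(repair_grads, repair_gaps, remain_grads, margin=0.1, drift=0.05):
    # repair_grads: (n_repair, d) gradients of the logit gap w.r.t. the
    # repaired layer's flattened weights; repair_gaps: (n_repair,) current gaps
    d = repair_grads.shape[1]
    dw = cp.Variable(d)
    constraints = [repair_gaps + repair_grads @ dw >= margin]   # (i) margin
    if remain_grads is not None:                                # (ii) preservation
        constraints += [cp.abs(remain_grads @ dw) <= drift]
    prob = cp.Problem(cp.Minimize(cp.sum_squares(dw)), constraints)
    prob.solve()
    return dw.value   # smallest weight update satisfying the linearized repair
```

Minimizing the squared norm of the update keeps the repaired weights close to the original, which is also what makes the Lipschitz-based robustness radius in the paper meaningful.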


【2】BioCOMPASS: Integrating Biomarkers into Transformer-Based Immunotherapy Response Prediction
标题:BioCOMPASS:将生物标志物整合到基于Transformer的免疫治疗反应预测中
链接:https://arxiv.org/abs/2604.00739

作者:Sayed Hashim,Frank Soboczenski,Paul Cairns
摘要:用于免疫治疗反应预测的数据集通常尺寸较小,并且在癌症类型、施用的药物和使用的测序仪方面多种多样。当对未包括在训练过程中的患者队列进行测试时,模型的性能通常会下降。最近的工作表明,基于transformer的模型以及自监督学习比基于阈值的生物标志物显示出更好的泛化性能,但仍然是次优的。我们提出了BioCOMPASS,一个基于转换器的模型称为COMPASS的扩展,它集成了生物标志物和治疗信息,以进一步提高其普适性。我们没有将生物标志物数据作为输入,而是构建了损失分量,以将它们与模型的中间表示对齐。我们发现,当使用留一队列、留一癌症类型和留一治疗策略进行评估时,治疗门控和通路一致性损失等成分提高了普适性。结果表明,构建利用生物标志物和治疗信息的组件可以帮助免疫治疗反应预测的普遍性。利用补充临床信息和领域知识的额外组件的精心策划代表了未来研究的一个有希望的方向。
摘要:Datasets used in immunotherapy response prediction are typically small in size, as well as diverse in cancer type, drug administered, and sequencer used. Models often drop in performance when tested on patient cohorts that are not included in the training process. Recent work has shown that transformer-based models along with self-supervised learning show better generalisation performance than threshold-based biomarkers, but is still suboptimal. We present BioCOMPASS, an extension of a transformer-based model called COMPASS, that integrates biomarkers and treatment information to further improve its generalisability. Instead of feeding biomarker data as input, we built loss components to align them with the model's intermediate representations. We found that components such as treatment gating and pathway consistency loss improved generalisability when evaluated with Leave-one-cohort-out, Leave-one-cancer-type-out and Leave-one-treatment-out strategies. Results show that building components that exploit biomarker and treatment information can help in generalisability of immunotherapy response prediction. Careful curation of additional components that leverage complementary clinical information and domain knowledge represents a promising direction for future research.
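下面是一个示意性草图(编者补充,非论文官方实现):不把生物标志物作为输入,而是通过一个辅助损失把它们与模型的中间表示对齐;隐藏维度、标志物数量与线性投影头均为假设。

import torch
import torch.nn as nn
import torch.nn.functional as F

class BiomarkerAlignmentLoss(nn.Module):
    def __init__(self, hidden_dim: int, n_biomarkers: int):
        super().__init__()
        # 从中间表示线性预测生物标志物,以此实现两者的对齐
        self.proj = nn.Linear(hidden_dim, n_biomarkers)

    def forward(self, hidden, biomarkers):
        # hidden: (batch, hidden_dim) 的中间表示; biomarkers: (batch, n_biomarkers)
        return F.mse_loss(self.proj(hidden), biomarkers)

# 用法示意:总损失 = 任务损失 + lambda * 对齐损失
align = BiomarkerAlignmentLoss(hidden_dim=256, n_biomarkers=4)
loss = align(torch.randn(8, 256), torch.randn(8, 4))
loss.backward()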


【3】A Benchmark of State-Space Models vs. Transformers and BiLSTM-based Models for Historical Newspaper OCR
标题:历史报纸OCR的状态空间模型与Transformers和基于BiLSTM的模型的基准
链接:https://arxiv.org/abs/2604.00725

作者:Merveilles Agbeti-messan,Thierry Paquet,Clément Chatelain,Pierrick Tranouez,Stéphane Nicolas
摘要:历史报纸的端到端OCR仍然具有挑战性,因为模型必须处理长文本序列、退化的印刷质量和复杂的版面。虽然基于Transformer的识别器主导了当前研究,但其二次复杂度限制了高效的段落级转录和大规模部署。我们研究线性时间状态空间模型(SSM),特别是Mamba,作为基于Transformer的OCR序列建模的可扩展替代方案。   据我们所知,我们提出了第一个基于SSM的OCR架构,将CNN视觉编码器与双向和自回归Mamba序列建模相结合,并进行了大规模基准测试,将SSM与基于Transformer和BiLSTM的识别器进行比较。多种解码策略(CTC、自回归和非自回归)在相同的训练条件下与强神经基线(VAN、DAN、DANIEL)和广泛使用的现成OCR引擎(PERO-OCR、Tesseract OCR、TrOCR、Gemini)一起进行评估。   在卢森堡国家图书馆的历史报纸(带有新发布的、验证率超过99%的黄金标准标注)上的实验,以及在Fraktur和Antiqua文本行上的跨数据集测试表明,所有神经模型都实现了低错误率(约2% CER),使计算效率成为主要区分因素。基于Mamba的模型保持了有竞争力的准确率,同时将推理时间减半,并表现出更优的内存扩展性(1000字符时增长1.26倍,对比2.30倍),在严重退化的段落级别上达到6.07%的CER(DAN为5.24%),同时保持2.05倍的速度优势。   我们发布代码、训练模型和标准化评估协议,以支持可复现的研究并为大规模文化遗产OCR的从业者提供指导。
摘要 :End-to-end OCR for historical newspapers remains challenging, as models must handle long text sequences, degraded print quality, and complex layouts. While Transformer-based recognizers dominate current research, their quadratic complexity limits efficient paragraph-level transcription and large-scale deployment. We investigate linear-time State-Space Models (SSMs), specifically Mamba, as a scalable alternative to Transformer-based sequence modeling for OCR.   We present to our knowledge, the first OCR architecture based on SSMs, combining a CNN visual encoder with bi-directional and autoregressive Mamba sequence modeling, and conduct a large-scale benchmark comparing SSMs with Transformer- and BiLSTM-based recognizers. Multiple decoding strategies (CTC, autoregressive, and non-autoregressive) are evaluated under identical training conditions alongside strong neural baselines (VAN, DAN, DANIEL) and widely used off-the-shelf OCR engines (PERO-OCR, Tesseract OCR, TrOCR, Gemini).   Experiments on historical newspapers from the Bibliothèque nationale du Luxembourg, with newly released >99% verified gold-standard annotations, and cross-dataset tests on Fraktur and Antiqua lines, show that all neural models achieve low error rates (~2% CER), making computational efficiency the main differentiator. Mamba-based models maintain competitive accuracy while halving inference time and exhibiting superior memory scaling (1.26x vs 2.30x growth at 1000 chars), reaching 6.07% CER at the severely degraded paragraph level compared to 5.24% for DAN, while remaining 2.05x faster.   We release code, trained models, and standardized evaluation protocols to enable reproducible research and guide practitioners in large-scale cultural heritage OCR.


【4】CircuitProbe: Predicting Reasoning Circuits in Transformers via Stability Zone Detection
标题:CircuitProbe:通过稳定区检测预测Transformer中的推理电路
链接:https://arxiv.org/abs/2604.00716

作者:Rajkiran Panuganti
备注:11 pages, 1 figure, 3 tables. Code available at https://github.com/agenticclass/circuitprobe
摘要:Transformer语言模型包含局部化的推理电路,即在推理时复制后能改进推理的连续层块。目前寻找这些电路需要蛮力扫描,每个模型花费25个GPU小时。我们提出CircuitProbe,它根据激活统计在CPU上于5分钟内预测电路位置,带来三到四个数量级的加速。我们发现推理电路有两种类型:位于早期层的稳定性电路,通过表示变化量的导数检测;以及位于后期层的幅度电路,通过异常评分检测。我们在横跨6种架构的9个模型(包括2025年发布的模型)上进行了验证,确认CircuitProbe的首选预测在所有验证案例中与最优电路一致或相距不超过2层。在Qwen 2.5系列上的缩放实验表明,层复制对3B参数以下的模型始终有益,但会降低7B以上模型的性能,使其成为适用于小型语言模型的实用缩放技术。CircuitProbe只需10个校准样本,其预测在英语、印地语、中文和法语之间保持稳定。
摘要:Transformer language models contain localized reasoning circuits, contiguous layer blocks that improve reasoning when duplicated at inference time. Finding these circuits currently requires brute-force sweeps costing 25 GPU hours per model. We propose CircuitProbe, which predicts circuit locations from activation statistics in under 5 minutes on CPU, providing a speedup of three to four orders of magnitude. We find that reasoning circuits come in two types: stability circuits in early layers, detected through the derivative of representation change, and magnitude circuits in late layers, detected through anomaly scoring. We validate across 9 models spanning 6 architectures, including 2025 models, confirming that CircuitProbe top predictions match or are within 2 layers of the optimal circuit in all validated cases. A scaling experiment across the Qwen 2.5 family reveals that layer duplication consistently benefits models under 3B parameters but degrades performance in 7B+ models, making this a practical scaling technique for small language models. CircuitProbe requires as few as 10 calibration examples and its predictions are stable across English, Hindi, Chinese, and French.
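下面是一个示意性草图(编者补充,非论文官方实现),演示从逐层激活统计预测两类电路位置的思路:早期层的稳定性电路用相邻层表示变化量的导数刻画,后期层的幅度电路用简单的z分数异常评分刻画;层数与激活均为随机假设数据。

import numpy as np

rng = np.random.default_rng(0)
L, d = 32, 128
acts = np.cumsum(rng.normal(size=(L, d)), axis=0)      # 模拟逐层隐藏状态

delta = np.linalg.norm(np.diff(acts, axis=0), axis=1)  # 相邻层的表示变化量
stability = -np.diff(delta)                            # 变化量下降最快处视为"稳定区"
stable_layer = int(np.argmax(stability[: L // 2]))     # 在前半部分定位稳定性电路

mag = np.linalg.norm(acts, axis=1)
z = (mag - mag.mean()) / mag.std()                     # 幅度的z分数异常评分
magnitude_layer = int(L // 2 + np.argmax(z[L // 2:]))  # 在后半部分定位幅度电路
print(stable_layer, magnitude_layer)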


【5】Automated Detection of Multiple Sclerosis Lesions on 7-tesla MRI Using U-net and Transformer-based Segmentation
标题:使用U-net和基于Transformer的分割自动检测7特斯拉MRI上的多发性硬化病变
链接:https://arxiv.org/abs/2604.00469

作者:Michael Maynord,Minghui Liu,Cornelia Fermüller,Seongjin Choi,Yuxin Zeng,Shishir Dahal,Daniel M. Harrison
备注:31 pages, 3 figures, 3 tables. Inference code and model weights available at https://github.com/maynord/7T-MS-lesion-segmentation
摘要:超高场7特斯拉(7 T)MRI改善了多发性硬化症(MS)白质病变(WML)的可视化,但在对比度和伪影方面与1.5- 3 T成像有很大差异-这表明广泛使用的自动分割工具可能无法直接转换。我们分析了7 T FLAIR扫描,并根据病变分割工具(LST)输出生成参考WML掩模,然后进行专家手动修订。作为外部比较器,我们应用了LST-LPA和最近的LST-AI集成,两者最初都是在低场数据上开发的。然后,我们在7 T FLAIR上以多种分辨率(0.5x0.5x0.5^3,1.0x1.0x1.0 ^3和1.5x1.5x2.0^3)训练了3D UNETR和基于SegFormer变换器的模型,并使用BraTS 2023框架中的体素和病变指标评估了所有方法。在原生0.5x0.5x0.5^3分辨率的保留测试集上,7 T训练的Transformers实现了与LST-AI的竞争性重叠,同时恢复了经典方法遗漏的额外小病变,代价是一些边界变异性和偶尔的伪影相关假阳性。在保持的7 T测试集上,我们最好的Transformer模型(SegFormer)实现了0.61的体素Dice和0.20的病变Dice,改进了经典的LST-LPA工具(Dice 0.39,病变Dice 0.02)。在下采样图像上训练的模型的性能下降,强调了原生7 T分辨率对于小病变检测的价值。通过发布我们的7 T训练模型,我们的目标是为超高场MS研究中的自动病变量化提供可重复的,即用型资源(https://github.com/maynord/7T-MS-lesion-segmentation)。
摘要:Ultra-high field 7-tesla (7T) MRI improves visualization of multiple sclerosis (MS) white matter lesions (WML) but differs sufficiently in contrast and artifacts from 1.5-3T imaging - suggesting that widely used automated segmentation tools may not translate directly. We analyzed 7T FLAIR scans and generated reference WML masks from Lesion Segmentation Tool (LST) outputs followed by expert manual revision. As external comparators, we applied LST-LPA and the more recent LST-AI ensemble, both originally developed on lower-field data. We then trained 3D UNETR and SegFormer transformer-based models on 7T FLAIR at multiple resolutions (0.5x0.5x0.5^3, 1.0x1.0x1.0^3, and 1.5x1.5x2.0^3) and evaluated all methods using voxel-wise and lesion-wise metrics from the BraTS 2023 framework. On the held-out test set at native 0.5x0.5x0.5^3 resolution, 7T-trained transformers achieved competitive overlap with LST-AI while recovering additional small lesions that were missed by classical methods, at the cost of some boundary variability and occasional artifact-related false positives. On a held-out 7 T test set, our best transformer model (SegFormer) achieved a voxel-wise Dice of 0.61 and lesion-wise Dice of 0.20, improving on the classical LST-LPA tool (Dice 0.39, lesion-wise Dice 0.02). Performance decreased for models trained on downsampled images, underscoring the value of native 7T resolution for small-lesion detection. By releasing our 7T-trained models, we aim to provide a reproducible, ready-to-use resource for automated lesion quantification in ultra-high field MS research (https://github.com/maynord/7T-MS-lesion-segmentation).
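作为补充,下面给出体素级Dice系数的一个最小实现(编者示意;论文实际使用BraTS 2023框架的体素级与病变级指标):

import numpy as np

def dice(pred, target, eps=1e-8):
    pred, target = pred.astype(bool), target.astype(bool)
    inter = np.logical_and(pred, target).sum()
    return (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)

p = np.zeros((4, 4, 4)); t = np.zeros((4, 4, 4))
p[:2], t[:3] = 1, 1                 # 两个部分重叠的立方体掩模
print(round(float(dice(p, t)), 3))  # 0.8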


【6】Predicting Wave Reflection and Transmission in Heterogeneous Media via Fourier Operator-Based Transformer Modeling
标题:通过基于傅里叶算子的Transformer建模预测非均匀介质中的波反射与透射
链接:https://arxiv.org/abs/2604.00132

作者:Zhe Bai,Hans Johansen
备注:6 pages, 9 figures, ACDSA 2026
摘要:我们开发了一个机器学习(ML)代理模型来近似麦克斯韦方程组的一维解,重点关注涉及反射和透射电磁波的材料界面的场景。我们的训练数据来自高保真有限体积(FV)模拟,包括初始条件的变化以及其中一种材料光速的变化,使模型能够学习一系列波-材料相互作用行为。ML模型在基于视觉Transformer的框架中自回归地学习物理嵌入和频率嵌入。通过在隐空间中引入傅里叶变换,解的波数谱与模拟数据紧密吻合。预测误差随时间近似线性增长,并在材料界面处急剧增加。测试结果表明,尽管存在不连续性和未知的材料属性,ML解在超过75个时间步的推演中仍保持低于10%的相对误差。
摘要:We develop a machine learning (ML) surrogate model to approximate solutions to Maxwell's equations in one dimension, focusing on scenarios involving a material interface that reflects and transmits electro-magnetic waves. Derived from high-fidelity Finite Volume (FV) simulations, our training data includes variations of the initial conditions, as well as variations in one material's speed of light, allowing for the model to learn a range of wave-material interaction behaviors. The ML model autoregressively learns both the physical and frequency embeddings in a vision transformer-based framework. By incorporating Fourier transforms in the latent space, the wave number spectra of the solutions aligns closely with the simulation data. Prediction errors exhibit an approximately linear growth over time with a sharp increase at the material interface. Test results show that the ML solution has adequate relative errors below $10\%$ in over $75$ time step rollouts, despite the presence of the discontinuity and unknown material properties.
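"在隐空间中引入傅里叶变换"的一种常见做法是FNO风格的谱层:对序列做rFFT,在保留的低频模式上乘以可学习的复权重,再逆变换回去。下面是编者补充的示意实现(通道数与模式数为假设值,非论文官方代码):

import torch
import torch.nn as nn

class SpectralLayer(nn.Module):
    def __init__(self, channels: int, n_modes: int):
        super().__init__()
        self.n_modes = n_modes
        self.weight = nn.Parameter(
            0.02 * torch.randn(channels, n_modes, dtype=torch.cfloat))

    def forward(self, x):                        # x: (batch, channels, length)
        X = torch.fft.rfft(x, dim=-1)            # 变换到波数域
        out = torch.zeros_like(X)                # 截断高频模式
        out[..., : self.n_modes] = X[..., : self.n_modes] * self.weight
        return torch.fft.irfft(out, n=x.shape[-1], dim=-1)

layer = SpectralLayer(channels=8, n_modes=12)
print(layer(torch.randn(2, 8, 64)).shape)        # torch.Size([2, 8, 64])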


【7】Efficient Software Vulnerability Detection Using Transformer-based Models
标题:使用基于转换器的模型进行高效的软件漏洞检测
链接:https://arxiv.org/abs/2604.00112

作者:Sameer Shaik,Zhen Huang,Daniela Stan Raicu,Jacob Furst
摘要:检测软件漏洞对于确保现代计算机系统的安全性和可靠性至关重要。深度神经网络在漏洞检测方面已经显示出有希望的结果,但它们缺乏捕获易受攻击代码的全局上下文信息的能力。为了解决这个问题,我们探讨了Transformers的C/C++漏洞检测的应用。我们使用程序切片来封装程序代码的关键语法和语义特征,例如API函数调用、数组使用、指针操作和算术表达式。通过利用Transformers捕获易受攻击代码的本地和全局上下文信息的能力,我们的工作可以准确地识别漏洞。结合数据平衡和超参数微调,我们的工作提供了一个强大而有效的方法来识别脆弱的代码与适度的资源使用和训练时间。
摘要:Detecting software vulnerabilities is critical to ensuring the security and reliability of modern computer systems. Deep neural networks have shown promising results on vulnerability detection, but they lack the capability to capture global contextual information on vulnerable code. To address this limitation, we explore the application of transformers for C/C++ vulnerability detection. We use program slices that encapsulate key syntactic and semantic features of program code, such as API function calls, array usage, pointer manipulations, and arithmetic expressions. By leveraging transformers' capability to capture both local and global contextual information on vulnerable code, our work can identify vulnerabilities accurately. Combined with data balancing and hyperparameter fine-tuning, our work offers a robust and efficient approach to identifying vulnerable code with moderate resource usage and training time.


【8】Transformers for Program Termination
标题:程序终止的Transformer
链接:https://arxiv.org/abs/2604.00039

作者:Yoav Alon,Cristina David
备注:12 pages
摘要:确定程序是否终止是程序分析中的核心挑战,直接关系到正确性、验证和安全性。我们研究Transformer架构能否直接从源代码中识别终止模式,以及如何通过集成来放大其优势。为了克服非终止样本的极端稀缺性,我们设计了一个由紧凑Transformer编码器组成的集成框架,并用一套不平衡感知损失函数和类感知采样技术对其进行系统训练。通过组合使用不同损失函数训练的模型,我们的集成取得了远强于任何单个Transformer的性能,优于强大的现成LLM和基于图的方法。最后,我们引入了一个归因管道,为终止性估计生成语法感知的解释。
摘要:Determining whether a program terminates is a core challenge in program analysis with direct implications for correctness, verification, and security. We investigate whether transformer architectures can recognise termination patterns directly from source code and how their strengths can be amplified through ensembles. To overcome the extreme scarcity of non-terminating examples, we design an ensemble framework of compact transformer encoders, systematically trained with a suite of imbalance-aware loss functions and class-aware sampling techniques. By combining models trained with distinct loss functions, our ensembles achieve substantially stronger performance than any single transformer, outperforming both powerful off-the-shelf LLMs and graph-based methods. Finally, we introduce an attribution pipeline that produces syntax-aware explanations for the termination estimation.
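文中的"不平衡感知损失函数"可以用focal loss来示意(编者补充,非论文官方实现;alpha、gamma为假设超参),它对稀缺的非终止类加权并压低易分类样本的损失:

import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.75, gamma=2.0):
    # logits: (batch, 2); targets: (batch,) 取值{0,1},1为稀缺的"非终止"类
    ce = F.cross_entropy(logits, targets, reduction="none")
    p_t = torch.exp(-ce)                                  # 正确类别的预测概率
    alpha_t = targets * alpha + (1 - targets) * (1 - alpha)  # 类别加权
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()

logits = torch.randn(16, 2, requires_grad=True)
targets = torch.randint(0, 2, (16,))
focal_loss(logits, targets).backward()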


【9】Forecast collapse of transformer-based models under squared loss in financial time series
标题:金融时间序列中平方损失下基于transformer模型的预测崩溃
链接:https://arxiv.org/abs/2604.00064

作者:Pierre Andreoletti
摘要:我们研究在弱条件结构的时间序列上、使用高表达力预测模型时平方损失下的轨迹预测。基于平方损失风险最小化的经典刻画,我们强调未来轨迹的条件期望实际上退化的情形,此时贝叶斯最优预测器是平凡的(在标准金融环境中价格平坦、回报为零)。  在这种情形下,增加模型表达力并不能提高预测精度,反而会在最优预测器周围引入虚假的轨迹波动。这些波动源于噪声的重复使用,导致预测方差增加而偏差没有任何减少。  这为基于Transformer的金融时间序列预测的性能退化提供了过程层面的解释。  我们用高频欧元/美元汇率数据的数值实验补充这些理论结果,分析轨迹层面预测误差的分布。结果表明,在绝大多数预测窗口上,基于Transformer的模型产生的误差大于简单的线性基准,与理论所识别的方差驱动机制一致。
摘要:We study trajectory forecasting under squared loss for time series with weak conditional structure, using highly expressive prediction models. Building on the classical characterization of squared-loss risk minimization, we emphasize regimes in which the conditional expectation of future trajectories is effectively degenerate, leading to trivial Bayes-optimal predictors (flat for prices and zero for returns in standard financial settings).  In this regime, increased model expressivity does not improve predictive accuracy but instead introduces spurious trajectory fluctuations around the optimal predictor. These fluctuations arise from the reuse of noise and result in increased prediction variance without any reduction in bias.  This provides a process-level explanation for the degradation of Transformer-based forecasts on financial time series.  We complement these theoretical results with numerical experiments on high-frequency EUR/USD exchange rate data, analyzing the distribution of trajectory-level forecasting errors. The results show that Transformer-based models yield larger errors than a simple linear benchmark on a large majority of forecasting windows, consistent with the variance-driven mechanism identified by the theory.
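论文的核心论点可以用平方损失下的经典风险分解概括(编者补充的示意,记号为编者所加):对任意预测器 $\hat f$,

\[ \mathbb{E}\big[(Y-\hat f(X))^2\big] = \mathbb{E}\big[\operatorname{Var}(Y\mid X)\big] + \mathbb{E}\big[(\hat f(X)-\mathbb{E}[Y\mid X])^2\big]. \]

当条件期望 $\mathbb{E}[Y\mid X]$ 近似退化为常数(价格平坦、回报为零)时,第一项不可压缩;更强的表达力无法降低偏差,只会通过对噪声的重复使用增大第二项,这正是文中方差驱动的退化机制。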


GAN|对抗|攻击|生成相关(4篇)

【1】Bridging the Simulation-to-Experiment Gap with Generative Models using Adversarial Distribution Alignment
标题:利用对抗性分布对齐的生成模型弥合模拟与实验的差距
链接:https://arxiv.org/abs/2604.01169

作者:Kai Nelson,Tobias Kreiman,Sergey Levine,Aditi S. Krishnapriyan
摘要:科学和工程领域的一个根本挑战是模拟与实验之间的差距。虽然我们通常拥有物理定律的先验知识,但对于复杂系统而言,这些定律往往难以精确求解。此类系统通常使用模拟器建模,而模拟器会引入计算近似。与此同时,实验测量更忠实地代表真实世界,但实验数据通常由仅部分反映系统完整底层状态的观测组成。我们提出了一个数据驱动的分布对齐框架,通过在完全可观测(但不完美)的模拟数据上预训练生成模型,然后将其与实验数据的部分(但真实)观测对齐,来弥合模拟与实验之间的差距。虽然我们的方法与领域无关,但我们通过引入对抗性分布对齐(ADA)将其落地于物理科学。该方法将原子位置的生成模型(最初在模拟的玻尔兹曼分布上训练)与实验观测的分布对齐。我们证明了该方法能够恢复目标可观测量分布,即使存在多个可能相关的可观测量。我们还在合成、分子和实验蛋白质数据上实证验证了该框架,证明它可以将生成模型与多种可观测量对齐。我们的代码可在https://kaityrusnelson.com/ada/上获得。
摘要 :A fundamental challenge in science and engineering is the simulation-to-experiment gap. While we often possess prior knowledge of physical laws, these physical laws can be too difficult to solve exactly for complex systems. Such systems are commonly modeled using simulators, which impose computational approximations. Meanwhile, experimental measurements more faithfully represent the real world, but experimental data typically consists of observations that only partially reflect the system's full underlying state. We propose a data-driven distribution alignment framework that bridges this simulation-to-experiment gap by pre-training a generative model on fully observed (but imperfect) simulation data, then aligning it with partial (but real) observations of experimental data. While our method is domain-agnostic, we ground our approach in the physical sciences by introducing Adversarial Distribution Alignment (ADA). This method aligns a generative model of atomic positions -- initially trained on a simulated Boltzmann distribution -- with the distribution of experimental observations. We prove that our method recovers the target observable distribution, even with multiple, potentially correlated observables. We also empirically validate our framework on synthetic, molecular, and experimental protein data, demonstrating that it can align generative models with diverse observables. Our code is available at https://kaityrusnelson.com/ada/.


【2】MIRANDA: MId-feature RANk-adversarial Domain Adaptation toward climate change-robust ecological forecasting with deep learning
标题:MIRANDA:面向气候变化鲁棒的深度学习生态预测的中间特征秩对抗域适应
链接:https://arxiv.org/abs/2604.00800

作者:Yuchang Jiang,Jan Dirk Wegner,Vivien Sainte Fare Garnot
备注:EarthVision CVPRW 2026
摘要:植物物候建模旨在根据气象时间序列预测季节性阶段(如展叶或开花)的时间。可靠的预测对于预判生态系统对气候变化的响应至关重要。虽然物候建模传统上依赖机理方法,但深度学习方法最近被提出作为灵活的数据驱动替代方案,且性能往往更优。然而,当气候变化引起数据分布偏移时,机理模型往往优于深度网络。域适应(DA)技术有望缓解这一限制。但与标准DA设置不同,气候变化引起的是一个时间上连续的域,并同时涉及协变量偏移和标签偏移:记录更暖、春季开始更早。为应对这一挑战,我们提出了中间特征秩对抗域适应(MIRANDA)。传统对抗方法在最终潜在表示上强制域不变性,而这种做法并未显式处理标签偏移;我们转而将对抗正则化应用于中间特征。此外,我们不使用二元域分类目标,而是采用基于排序的目标,在学习到的气象表示中强制年份不变性。在一个覆盖70年、包含5个树种67,800条物候观测的全国尺度数据集上,我们证明,与传统DA方法不同,MIRANDA提高了对气候分布偏移的鲁棒性,并缩小了与机理模型的性能差距。
摘要:Plant phenology modelling aims to predict the timing of seasonal phases, such as leaf-out or flowering, from meteorological time series. Reliable predictions are crucial for anticipating ecosystem responses to climate change. While phenology modelling has traditionally relied on mechanistic approaches, deep learning methods have recently been proposed as flexible, data-driven alternatives with often superior performance. However, mechanistic models tend to outperform deep networks when data distribution shifts are induced by climate change. Domain Adaptation (DA) techniques could help address this limitation. Yet, unlike standard DA settings, climate change induces a temporal continuum of domains and involves both a covariate and label shift, with warmer records and earlier start of spring. To tackle this challenge, we introduce Mid-feature Rank-adversarial Domain Adaptation (MIRANDA). Whereas conventional adversarial methods enforce domain invariance on final latent representations, an approach that does not explicitly address label shift, we apply adversarial regularization to intermediate features. Moreover, instead of a binary domain-classification objective, we employ a rank-based objective that enforces year-invariance in the learned meteorological representations. On a country-scale dataset spanning 70 years and comprising 67,800 phenological observations of 5 tree species, we demonstrate that, unlike conventional DA approaches, MIRANDA improves robustness to climatic distribution shifts and narrows the performance gap with mechanistic models.


【3】Thinking Wrong in Silence: Backdoor Attacks on Continuous Latent Reasoning
标题:沉默中思考错误:对连续潜在推理的后门攻击
链接:https://arxiv.org/abs/2604.00770

作者:Swapnil Parekh
摘要:新一代语言模型完全在连续的隐藏状态中进行推理,不产生任何标记,也不留下审计线索。我们表明,这种"沉默"创造了一个全新的攻击面。ThoughtSteer只在输入层扰动单个嵌入向量;模型自身的多遍推理将这种扰动放大为被劫持的潜在轨迹,可靠地产生攻击者选择的答案,同时在结构上对任何标记级防御保持不可见。在两种架构(Coconut和SimCoT)、三个推理基准以及从124M到3B参数的模型规模上,ThoughtSteer在接近基线干净准确率的情况下实现了>=99%的攻击成功率,无需再训练即可迁移到保留基准(94-100%),规避了所有五种被评估的主动防御,并在25个轮次的干净微调后仍然存活。我们将这些结果追溯到一个统一机制:潜在空间中的神经坍缩(Neural Collapse)将被触发的表示拉到一个紧致的几何吸引子上,这既解释了防御为何失败,也解释了为何任何有效的后门都必然留下线性可分的签名(探针AUC>=0.999)。然而一个惊人的悖论出现了:即使模型输出了错误的答案,单个潜在向量仍然编码着正确答案。对抗信息不在任何单一向量中,而在集体轨迹中,这使后门扰动成为连续推理机制可解释性研究的一个新视角。代码和检查点均已公开。
摘要:A new generation of language models reasons entirely in continuous hidden states, producing no tokens and leaving no audit trail. We show that this silence creates a fundamentally new attack surface. ThoughtSteer perturbs a single embedding vector at the input layer; the model's own multi-pass reasoning amplifies this perturbation into a hijacked latent trajectory that reliably produces the attacker's chosen answer, while remaining structurally invisible to every token-level defense. Across two architectures (Coconut and SimCoT), three reasoning benchmarks, and model scales from 124M to 3B parameters, ThoughtSteer achieves >=99% attack success rate with near-baseline clean accuracy, transfers to held-out benchmarks without retraining (94-100%), evades all five evaluated active defenses, and survives 25 epochs of clean fine-tuning. We trace these results to a unifying mechanism: Neural Collapse in the latent space pulls triggered representations onto a tight geometric attractor, explaining both why defenses fail and why any effective backdoor must leave a linearly separable signature (probe AUC>=0.999). Yet a striking paradox emerges: individual latent vectors still encode the correct answer even as the model outputs the wrong one. The adversarial information is not in any single vector but in the collective trajectory, establishing backdoor perturbations as a new lens for mechanistic interpretability of continuous reasoning. Code and checkpoints are available.
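针对"任何有效后门都必然留下线性可分签名(探针AUC>=0.999)"这一论断,下面给出一个线性探针检测的示意(编者补充,非论文官方代码;潜在向量为随机假设数据,吸引子方向为假想):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
d = 64
attractor = rng.normal(size=d)                           # 假想的几何吸引子方向
clean = rng.normal(size=(400, d))                        # 干净样本的潜在向量
triggered = rng.normal(size=(400, d)) + 2.0 * attractor  # 被触发样本被拉向吸引子

X = np.vstack([clean, triggered])
y = np.array([0] * 400 + [1] * 400)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.5, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(Xtr, ytr)
print(roc_auc_score(yte, probe.predict_proba(Xte)[:, 1]))  # 接近1.0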


【4】Learning and Generating Mixed States Prepared by Shallow Channel Circuits
标题:学习与生成由浅层通道电路制备的混合态
链接:https://arxiv.org/abs/2604.01197

作者:Fangjun Hu,Christian Kokail,Milan Kornjača,Pedro L. S. Lopes,Weiyuan Gong,Sheng-Tao Wang,Xun Gao,Stefan Ostermann
备注:44 pages, 13 figures, 1 table
摘要:从测量数据中学习量子态是量子信息和计算复杂性中的核心问题。在这项工作中,我们研究在有限维晶格上学习生成混合态的问题。受混合态物相最新进展的启发,我们专注于平凡相中的任意态。如果存在一个浅层制备通道电路,使得在整个制备过程中保持局部可逆性,则该态属于平凡相。我们证明,这一类中的任何混合态都可以仅凭测量访问被高效学习。具体而言,给定未知平凡相混合态的若干副本,我们的算法输出一个浅层局部通道电路,在迹距离下近似生成该态。假设电路深度和门的局部性为常数(或多对数),样本复杂度和运行时间关于量子比特数是多项式(或准多项式)的。重要的是,学习者并未获得原始制备电路,而只依赖于它的存在性。我们的结果为基于浅层通道电路的量子生成模型提供了结构基础。在经典极限下,我们的框架还启发了一个只需多项式训练与生成开销的经典扩散模型高效算法。
摘要:Learning quantum states from measurement data is a central problem in quantum information and computational complexity. In this work, we study the problem of learning to generate mixed states on a finite-dimensional lattice. Motivated by recent developments in mixed state phases of matter, we focus on arbitrary states in the trivial phase. A state belongs to the trivial phase if there exists a shallow preparation channel circuit under which local reversibility is preserved throughout the preparation. We prove that any mixed state in this class can be efficiently learned from measurement access alone. Specifically, given copies of an unknown trivial phase mixed state, our algorithm outputs a shallow local channel circuit that approximately generates this state in trace distance. The sample complexity and runtime are polynomial (or quasi-polynomial) in the number of qubits, assuming constant (or polylogarithmic) circuit depth and gate locality. Importantly, the learner is not given the original preparation circuit and relies only on its existence. Our results provide a structural foundation for quantum generative models based on shallow channel circuits. In the classical limit, our framework also inspires an efficient algorithm for classical diffusion models using only a polynomial overhead of training and generation.


半/弱/无/有监督|不确定性|主动学习(2篇)

【1】Safe learning-based control via function-based uncertainty quantification
标题:通过基于函数的不确定性量化实现基于学习的安全控制
链接:https://arxiv.org/abs/2604.01173

作者:Abdullah Tokmak,Toni Karvonen,Thomas B. Schön,Dominik Baumann
备注:Under review for CDC 2026
摘要:在安全关键系统中部署基于学习的控制方法时,不确定性量化至关重要。这通常通过构造以高概率包含感兴趣的未知函数(例如奖励和约束函数或底层动力学模型)的不确定性管来实现。然而,现有的不确定性量化方法通常依赖于对未知函数的限制性假设,例如函数范数或Lipschitz常数的已知界,并且难以处理不连续性。在本文中,我们将未知函数建模为可以生成独立同分布实现的随机函数,并通过场景方法构造仅依赖于采样实现、以高概率成立的不确定性管。我们将这些不确定性管集成到一个安全贝叶斯优化算法中,随后用它在真实的Furuta摆上安全地调节控制参数。
摘要:Uncertainty quantification is essential when deploying learning-based control methods in safety-critical systems. This is commonly realized by constructing uncertainty tubes that enclose the unknown function of interest, e.g., the reward and constraint functions or the underlying dynamics model, with high probability. However, existing approaches for uncertainty quantification typically rely on restrictive assumptions on the unknown function, such as known bounds on functional norms or Lipschitz constants, and struggle with discontinuities. In this paper, we model the unknown function as a random function from which independent and identically distributed realizations can be generated, and construct uncertainty tubes via the scenario approach that hold with high probability and rely solely on the sampled realizations. We integrate these uncertainty tubes into a safe Bayesian optimization algorithm, which we then use to safely tune control parameters on a real Furuta pendulum.
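场景方法的核心思想可以用一个最小示意说明(编者补充,非论文官方实现):从随机函数中采样独立同分布的实现,用逐点最小/最大包络构造不确定性管;场景数N决定概率保证的水平(此处N与随机函数族均为假设):

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 200)
N = 100                                   # i.i.d. 场景(实现)数

def draw():
    # 一个随机相移加噪声的函数族(假设)
    return (np.sin(2 * np.pi * (x + rng.uniform(-0.05, 0.05)))
            + 0.1 * rng.normal(size=x.size))

realizations = np.stack([draw() for _ in range(N)])
lower, upper = realizations.min(axis=0), realizations.max(axis=0)  # 不确定性管
new = draw()                              # 一条新的实现
print(float(np.mean((lower <= new) & (new <= upper))))  # 新实现落在管内的比例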


【2】Unsupervised 4D Flow MRI Velocity Enhancement and Unwrapping Using Divergence-Free Neural Networks
标题:使用无散度神经网络的无监督4D流MRI速度增强与相位展开
链接:https://arxiv.org/abs/2604.00205

作者:Javier Bisbal,Julio Sotelo,Hernán Mella,Oliver Welin Odeback,Joaquín Mura,David Marlevi,Junya Matsuda,Kotomi Iwata,Tetsuro Sekine,Cristian Tejos,Sergio Uribe
备注:11 pages, 5 figures, 7 tables
摘要:这项工作介绍了一个无监督的发散和无混叠神经网络(DAF-FlowNet)的4D流磁共振成像(4D Flow MRI),联合增强噪声速度场和纠正相位包裹伪影。DAF-FlowNet将速度参数化为矢量势的旋度,通过构造来执行质量守恒,并避免显式的发散惩罚调整。余弦数据一致性损失使得能够从包裹的相位图像同时进行去噪和解包裹。在由计算流体动力学生成的合成主动脉4D Flow MRI上,DAF-FlowNet实现了比现有技术更低的误差(相对于噪声水平上性能最佳的替代方案,速度归一化均方根误差低11%,方向误差低11%,发散度低44%),对中度分割扰动具有鲁棒性。对于展开,在峰值速度/速度编码比为1.4和2.1时,DAF-FlowNet实现了0.18%和5.2%的残余包裹体素,相对于最佳替代方法分别减少了72%和18%。在同时存在噪声和混叠的情况下,所提出的单级公式优于最先进的顺序管道(速度归一化均方根误差降低15%,方向误差降低11%,发散度降低28%)。在10个肥厚型心肌病患者数据集中,DAF-FlowNet保留了精细尺度的血流特征,纠正了混叠区域,并改善了内部血流一致性,如4D Flow MRI共识指南推荐的主动脉和肺质量守恒分析中平面间血流偏倚减少所示。这些结果支持DAF-FlowNet作为统一速度增强和相位展开的框架,以提高心血管4D Flow MRI的可靠性。
摘要:This work introduces an unsupervised Divergence and Aliasing-Free neural network (DAF-FlowNet) for 4D Flow Magnetic Resonance Imaging (4D Flow MRI) that jointly enhances noisy velocity fields and corrects phase wrapping artifacts. DAF-FlowNet parameterizes velocities as the curl of a vector potential, enforcing mass conservation by construction and avoiding explicit divergence-penalty tuning. A cosine data-consistency loss enables simultaneous denoising and unwrapping from wrapped phase images. On synthetic aortic 4D Flow MRI generated from computational fluid dynamics, DAF-FlowNet achieved lower errors than existing techniques (up to 11% lower velocity normalized root mean square error, 11% lower directional error, and 44% lower divergence relative to the best-performing alternative across noise levels), with robustness to moderate segmentation perturbations. For unwrapping, at peak velocity/velocity-encoding ratios of 1.4 and 2.1, DAF-FlowNet achieved 0.18% and 5.2% residual wrapped voxels, representing reductions of 72% and 18% relative to the best alternative method, respectively. In scenarios with both noise and aliasing, the proposed single-stage formulation outperformed a state-of-the-art sequential pipeline (up to 15% lower velocity normalized root mean square error, 11% lower directional error, and 28% lower divergence). Across 10 hypertrophic cardiomyopathy patient datasets, DAF-FlowNet preserved fine-scale flow features, corrected aliased regions, and improved internal flow consistency, as indicated by reduced inter-plane flow bias in aortic and pulmonary mass-conservation analyses recommended by the 4D Flow MRI consensus guidelines. These results support DAF-FlowNet as a framework that unifies velocity enhancement and phase unwrapping to improve the reliability of cardiovascular 4D Flow MRI.
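"把速度参数化为矢量势的旋度,从而在构造上保证无散度"这一点可以在二维情形下用流函数做一个数值示意(编者补充,非论文官方实现;论文在三维中使用向量势,思想相同):

import numpy as np

n = 64
h = 1.0 / (n - 1)
y, x = np.meshgrid(np.linspace(0, 1, n), np.linspace(0, 1, n), indexing="ij")
psi = np.sin(2 * np.pi * x) * np.cos(2 * np.pi * y)   # 任意流函数(假设)

dpsi_dy, dpsi_dx = np.gradient(psi, h)                 # axis0=y, axis1=x
u, v = dpsi_dy, -dpsi_dx                               # 速度 = 旋度(流函数)

div = np.gradient(u, h, axis=1) + np.gradient(v, h, axis=0)
print(float(np.abs(div).max()))                        # 在浮点误差内接近0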


迁移|Zero/Few/One-Shot|自适应(7篇)

【1】S0 Tuning: Zero-Overhead Adaptation of Hybrid Recurrent-Attention Models
标题:S0调整:混合循环-注意力模型的零开销适应
链接:https://arxiv.org/abs/2604.01168

作者:Jack Young
备注:15 pages (10 main + 5 appendix), 3 figures, code at https://github.com/jackyoung27/s0-tuning
摘要:仅使用大约48个经过执行验证的HumanEval训练解,调整每个循环层的单个初始状态矩阵,在零推理开销的情况下,即可在HumanEval上比LoRA高出+10.8 pp(p < 0.001)。该方法(我们称之为S0调整)在冻结所有模型权重的同时,为每个循环层优化一个状态矩阵。在Qwen3.5-4B(GatedDeltaNet混合模型)上,S0调整将贪婪pass@1提高了+23.6 +/- 1.7 pp(10个种子)。在FalconH1-7B(Mamba-2混合模型)上,S0达到71.8% +/- 1.3,LoRA达到71.4% +/- 2.4(3个种子),在该样本量下统计上不可区分,但S0无需合并权重。跨域迁移在MATH-500(+4.8 pp,p = 0.00002,8个种子)和GSM8K(+2.8 pp,p = 0.0003,10个种子)上显著;文本到SQL基准(Spider)未表现出迁移,与轨迹引导机制一致。在纯Transformer(Qwen2.5-3B)上的前缀调优对照在所有九种测试配置下都使性能降低了13.9 pp。在Qwen3.5上,逐步状态偏移变体达到+27.1 pp,高于S0和LoRA,但带有逐步推理成本。总之,结果表明,当验证监督稀缺时,循环状态初始化是混合语言模型的一个强大的零推理开销PEFT表面。调优后的状态是一个约48 MB的文件;任务切换不需要权重合并或重新加载模型。代码和库:https://github.com/jackyoung27/s0-tuning。
摘要:Using roughly 48 execution-verified HumanEval training solutions, tuning a single initial state matrix per recurrent layer, with zero inference overhead, outperforms LoRA by +10.8 pp (p < 0.001) on HumanEval. The method, which we call S0 tuning, optimizes one state matrix per recurrent layer while freezing all model weights. On Qwen3.5-4B (GatedDeltaNet hybrid), S0 tuning improves greedy pass@1 by +23.6 +/- 1.7 pp (10 seeds). On FalconH1-7B (Mamba-2 hybrid), S0 reaches 71.8% +/- 1.3 and LoRA reaches 71.4% +/- 2.4 (3 seeds), statistically indistinguishable at this sample size while requiring no weight merging. Cross-domain transfer is significant on MATH-500 (+4.8 pp, p = 0.00002, 8 seeds) and GSM8K (+2.8 pp, p = 0.0003, 10 seeds); a text-to-SQL benchmark (Spider) shows no transfer, consistent with the trajectory-steering mechanism. A prefix-tuning control on a pure Transformer (Qwen2.5-3B) degrades performance by -13.9 pp under all nine configurations tested. On Qwen3.5, a per-step state-offset variant reaches +27.1 pp, above both S0 and LoRA but with per-step inference cost. Taken together, the results show that recurrent state initialization is a strong zero-inference-overhead PEFT surface for hybrid language models when verified supervision is scarce. The tuned state is a ~48 MB file; task switching requires no weight merging or model reload. Code and library: https://github.com/jackyoung27/s0-tuning.
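核心机制可以用一个最小示意说明(编者补充,非论文官方实现;论文作用于GatedDeltaNet/Mamba-2混合模型,这里用GRU代替循环层以便自包含):冻结全部骨干权重,只把每个循环层的初始状态h0设为可训练参数。

import torch
import torch.nn as nn

class S0Tuned(nn.Module):
    def __init__(self, backbone: nn.GRU):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():
            p.requires_grad_(False)                  # 冻结所有模型权重
        self.h0 = nn.Parameter(
            torch.zeros(backbone.num_layers, 1, backbone.hidden_size))

    def forward(self, x):                            # x: (batch, seq, input_size)
        h0 = self.h0.expand(-1, x.size(0), -1).contiguous()
        out, _ = self.backbone(x, h0)
        return out

model = S0Tuned(nn.GRU(16, 32, num_layers=2, batch_first=True))
print([n for n, p in model.named_parameters() if p.requires_grad])  # 只有 ['h0']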


【2】Lightweight Prompt-Guided CLIP Adaptation for Monocular Depth Estimation
标题:用于单目深度估计的轻量级提示引导CLIP自适应
链接:https://arxiv.org/abs/2604.01118

作者:Reyhaneh Ahani Manghotay,Jie Liang
备注:14 pages, 2 figures
摘要:利用CLIP等视觉语言模型(VLM)的丰富语义特征进行单目深度估计是一个有前景的方向,但通常需要大量微调或缺乏几何精度。我们提出了一个名为MoA-DepthCLIP的参数高效框架,以最少的监督将预训练CLIP表示适配到单目深度估计。我们的方法将轻量级的适配器混合(MoA)模块集成到预训练的Vision Transformer(ViT-B/32)主干中,并结合对末端层的选择性微调。该设计实现了空间感知的自适应,由全局语义上下文向量引导,并采用将深度分箱分类与直接回归相协同的混合预测架构。为了提高结构精度,我们采用强制执行几何约束的复合损失函数。在NYU Depth V2基准上,MoA-DepthCLIP取得了有竞争力的结果,将$δ_1$精度从0.390提高到0.745、将RMSE从1.176降低到0.520,显著优于DepthCLIP基线。这些结果只需极少的可训练参数即可实现,表明轻量级的提示引导MoA是将VLM知识迁移到细粒度单目深度估计任务的高效策略。
摘要:Leveraging the rich semantic features of vision-language models (VLMs) like CLIP for monocular depth estimation tasks is a promising direction, yet often requires extensive fine-tuning or lacks geometric precision. We present a parameter-efficient framework, named MoA-DepthCLIP, that adapts pretrained CLIP representations for monocular depth estimation with minimal supervision. Our method integrates a lightweight Mixture-of-Adapters (MoA) module into the pretrained Vision Transformer (ViT-B/32) backbone combined with selective fine-tuning of the final layers. This design enables spatially-aware adaptation, guided by a global semantic context vector and a hybrid prediction architecture that synergizes depth bin classification with direct regression. To enhance structural accuracy, we employ a composite loss function that enforces geometric constraints. On the NYU Depth V2 benchmark, MoA-DepthCLIP achieves competitive results, significantly outperforming the DepthCLIP baseline by improving the $δ_1$ accuracy from 0.390 to 0.745 and reducing the RMSE from 1.176 to 0.520. These results are achieved while requiring substantially few trainable parameters, demonstrating that lightweight, prompt-guided MoA is a highly effective strategy for transferring VLM knowledge to fine-grained monocular depth estimation tasks.


【3】Transfer learning for nonparametric Bayesian networks
标题:非参数Bayesian网络的迁移学习
链接:https://arxiv.org/abs/2604.01021

作者:Rafael Sojo,Pedro Larrañaga,Concha Bielza
备注:An earlier version was previously posted on SSRN. This version includes improvements in experiments and evaluation metrics following reviewer comments. Revision submitted to Knowledge-Based Systems
摘要:本文介绍了两种迁移学习方法,估计非参数贝叶斯网络在稀缺数据。我们提出了两种算法,一种是基于约束的结构学习方法,称为PC稳定迁移学习(PCS-TL),另一种是基于分数的方法,称为爬山迁移学习(HC-TL)。我们还定义了特定的指标来解决每个指标中的负迁移问题,即迁移学习对模型性能产生负面影响的情况。然后,对于参数,我们提出了一个对数线性池的方法。为了进行评估,我们学习了核密度估计贝叶斯网络,一种非参数贝叶斯网络,并将其迁移学习性能与单独的模型进行了比较。为此,我们从小型,中型和大型合成网络中采样数据,并从UCI机器学习存储库中获取数据集。然后,我们向这些数据集添加噪声和修改,以测试它们避免负迁移的能力。最后,我们进行了Friedman检验和Bergmann-Hommel事后分析,以显示我们的方法增强实验行为的统计证据。因此,PCS-TL和HC-TL被证明是可靠的算法,用于提高具有稀缺数据的非参数贝叶斯网络的学习性能,这在实际工业环境中意味着减少部署网络所需的时间。
摘要:This paper introduces two transfer learning methodologies for estimating nonparametric Bayesian networks under scarce data. We propose two algorithms, a constraint-based structure learning method, called PC-stable-transfer learning (PCS-TL), and a score-based method, called hill climbing transfer learning (HC-TL). We also define particular metrics to tackle the negative transfer problem in each of them, a situation in which transfer learning has a negative impact on the model's performance. Then, for the parameters, we propose a log-linear pooling approach. For the evaluation, we learn kernel density estimation Bayesian networks, a type of nonparametric Bayesian network, and compare their transfer learning performance with the models alone. To do so, we sample data from small, medium and large-sized synthetic networks and datasets from the UCI Machine Learning repository. Then, we add noise and modifications to these datasets to test their ability to avoid negative transfer. To conclude, we perform a Friedman test with a Bergmann-Hommel post-hoc analysis to show statistical proof of the enhanced experimental behavior of our methods. Thus, PCS-TL and HC-TL demonstrate to be reliable algorithms for improving the learning performance of a nonparametric Bayesian network with scarce data, which in real industrial environments implies a reduction in the required time to deploy the network.


【4】Learning to Learn-at-Test-Time: Language Agents with Learnable Adaptation Policies
标题:学会在测试时学习:具有可学习适应策略的语言智能体
链接:https://arxiv.org/abs/2604.00830

作者:Zhanzhi Lou,Hui Chen,Yibo Li,Qian Wang,Bryan Hooi
摘要:测试时学习(TTL)使语言智能体能够在推理时通过与环境的反复交互迭代地改进其性能。TTL的核心是一个适应策略,它根据先前回合的经验更新行动者策略,从而改善未来行为。现有方法依赖固定的、手工设计的适应策略,而不是针对下游改进对其进行优化。我们认为,最优的适应策略应当从任务环境中学习,而不是基于人类直觉手工设计。为此,我们提出Meta-TTL,一个将有效适应策略的发现表述为双层优化问题的框架。在该框架中,内循环执行标准TTL过程,衡量候选适应策略帮助智能体在连续回合中纠正错误的有效程度。在智能体性能的引导下,外循环在多样化的训练任务分布上采用进化搜索,迭代地完善适应策略。我们在Jericho和WebArena-Lite上、在分布内(ID)和分布外(OOD)设置下、使用多种元智能体骨干对Meta-TTL进行了评估。两个基准上的结果表明,Meta-TTL始终优于手工设计的基线,这表明优化得到的适应策略编码了可迁移的、能泛化到训练任务分布之外的策略。
摘要:Test-Time Learning (TTL) enables language agents to iteratively refine their performance through repeated interactions with the environment at inference time. At the core of TTL is an adaptation policy that updates the actor policy based on experience from previous episodes, thereby improving future behavior. Existing methods rely on fixed, hand-crafted adaptation policies rather than optimizing them for downstream improvement. We argue that optimal adaptation policies should be learned from task environments, not hand-engineered based on human intuition. To achieve this, we introduce Meta-TTL, a framework that formulates the discovery of effective adaptation policies as a bi-level optimization problem. Within this framework, the inner loop executes the standard TTL process, measuring how effectively a candidate adaptation policy helps an agent correct errors across sequential episodes. Guided by the agent's performance, the outer loop employs evolutionary search over a diverse distribution of training tasks to iteratively refine the adaptation policy. We evaluate Meta-TTL on Jericho and WebArena-Lite across both in-distribution (ID) and out-of-distribution (OOD) settings, using multiple meta-agent backbones. Results on both benchmarks show that Meta-TTL consistently outperforms hand-crafted baselines, suggesting that the optimized adaptation policy encodes transferable strategies that generalize beyond the training task distribution.


【5】Cost-Penalized Fitness in FMA-Orchestrated Mixture of Experts: Experimental Evidence for Molecular Memory in Domain Adaptation
标题:FMA编排的专家混合模型中的成本惩罚适应度:域适应中分子记忆的实验证据
链接:https://arxiv.org/abs/2604.00812

作者:Martin Jaraiz
备注:10 pages, 3 figures, draft
摘要:我们展示了nanoFMT七次受控运行的实验结果。nanoFMT是一个由自由市场算法(FMA)编排、带有动态专家混合(MoE)管理的Transformer。这些实验回答了先进LLM开发中的一个基本问题:当MoE系统在不断变化的数据分布下满负荷运行时,应如何管理其专家池?我们证明,成本惩罚的适应度指标与新生专家的线性宽限期相结合,能使系统通过多样化而非替换来积累领域专长。核心结果是一个往返域偏移实验:当返回先前学过的域时,恢复速度快9-11倍,且无需任何专家的诞生或替换。这种"分子记忆"效应(休眠的专家得以存活,并在其领域回归时重新激活)在当前的MoE管理方法中没有对应物。初步成本分析估计,在中等情景下,OpenAI规模的提供商每年可节省3910万美元并减少27.1 GWh能耗。
摘要:We present experimental results from seven controlled runs of nanoFMT, a Free-Market Algorithm (FMA) orchestrated transformer with dynamic Mixture-of-Experts (MoE) management. The experiments address a fundamental question for advanced LLM development: how should an MoE system manage its expert pool when operating at full capacity under changing data distributions? We demonstrate that cost-penalized fitness metrics, combined with a linear grace period for newborn experts, produce a system that accumulates domain expertise through diversification rather than replacement. The central result is a round-trip domain shift experiment showing 9-11x faster recovery when returning to a previously learned domain, with zero expert births or replacements required. This "molecular memory" effect -- where dormant experts survive and reactivate when their domain returns -- has no analogue in current MoE management approaches. A preliminary cost analysis estimates annual savings of $39.1M and 27.1 GWh energy reduction for an OpenAI-scale provider under a moderate scenario.
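"成本惩罚适应度 + 新生专家线性宽限期"的机制可以用几行代码示意(编者补充,非论文官方实现;宽限步数与惩罚系数为假设值):

def fitness(utility: float, cost: float, age: int,
            grace_steps: int = 1000, cost_weight: float = 0.1) -> float:
    ramp = min(1.0, age / grace_steps)   # 宽限期内成本惩罚从0线性升至全额
    return utility - cost_weight * ramp * cost

# 同等效用与成本下,年轻专家暂免成本惩罚,不会被立即替换
print(fitness(utility=1.0, cost=5.0, age=100))    # 0.95
print(fitness(utility=1.0, cost=5.0, age=2000))   # 0.5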


【6】Autonomous Adaptive Solver Selection for Chemistry Integration via Reinforcement Learning
标题:通过强化学习实现化学积分的自主自适应求解器选择
链接:https://arxiv.org/abs/2604.00264

作者:Eloghosa Ikponmwoba,Opeoluwa Owoyele
摘要:刚性化学动力学的计算成本仍然是反应流模拟中的主要瓶颈,然而混合积分策略通常由手工调参的启发式规则或监督预测器驱动,它们基于瞬时局部状态做出短视决策。我们提出了一个带约束的强化学习(RL)框架,在化学积分过程中自主地在隐式BDF积分器(CVODE)和准稳态(QSS)求解器之间进行选择。求解器选择被表述为马尔可夫决策过程。智能体学习轨迹感知的策略,这些策略考虑当前求解器选择如何影响下游误差累积,同时在用户规定的精度容限下最小化计算成本,该容限通过带在线乘子自适应的拉格朗日奖励来执行。在采样的0D均相反应器条件下,RL自适应策略实现了约3倍的平均加速,加速比范围从1.11倍到10.58倍,同时对一个含106种组分的正十二烷机理保持了准确的点火延迟和组分分布,推理开销仅增加约1%。无需再训练,0D训练的策略即可迁移到应变率为10-2000 s^-1的一维对冲扩散火焰,相对于CVODE提供一致的约2.2倍加速,同时保持接近参考值的温度精度,且仅在12%-15%的时空点上选择CVODE。总体而言,结果证明了所提出的强化学习框架在满足精度约束的同时学习特定于问题的积分策略的潜力,从而为具有空间异质刚性的多物理场系统开辟了一条通往自适应、自优化工作流程的道路。
摘要:The computational cost of stiff chemical kinetics remains a dominant bottleneck in reacting-flow simulation, yet hybrid integration strategies are typically driven by hand-tuned heuristics or supervised predictors that make myopic decisions from instantaneous local state. We introduce a constrained reinforcement learning (RL) framework that autonomously selects between an implicit BDF integrator (CVODE) and a quasi-steady-state (QSS) solver during chemistry integration. Solver selection is cast as a Markov decision process. The agent learns trajectory-aware policies that account for how present solver choices influence downstream error accumulation, while minimizing computational cost under a user-prescribed accuracy tolerance enforced through a Lagrangian reward with online multiplier adaptation. Across sampled 0D homogeneous reactor conditions, the RL-adaptive policy achieves a mean speedup of approximately $3\times$, with speedups ranging from $1.11\times$ to $10.58\times$, while maintaining accurate ignition delays and species profiles for a 106-species \textit{n}-dodecane mechanism and adding approximately $1\%$ inference overhead. Without retraining, the 0D-trained policy transfers to 1D counterflow diffusion flames over strain rates $10$--$2000~\mathrm{s}^{-1}$, delivering consistent $\approx 2.2\times$ speedup relative to CVODE while preserving near-reference temperature accuracy and selecting CVODE at only $12$--$15\%$ of space-time points. Overall, the results demonstrate the potential of the proposed reinforcement learning framework to learn problem-specific integration strategies while respecting accuracy constraints, thereby opening a pathway toward adaptive, self-optimizing workflows for multiphysics systems with spatially heterogeneous stiffness.
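文中"带在线乘子自适应的拉格朗日奖励"可用如下草图示意(编者补充,非论文官方实现;容限与学习率为假设值):奖励由计算成本和加权的约束违反量构成,乘子lambda按对偶上升在线调整。

class LagrangianReward:
    def __init__(self, tol: float, lr: float = 0.01):
        self.tol, self.lr, self.lam = tol, lr, 1.0

    def __call__(self, cost: float, error: float) -> float:
        violation = max(0.0, error - self.tol)
        reward = -cost - self.lam * violation          # 成本 + 加权违反量
        self.lam = max(0.0, self.lam + self.lr * (error - self.tol))  # 对偶上升
        return reward

r = LagrangianReward(tol=1e-3)
print(r(cost=0.2, error=5e-3), r.lam)  # 超出容限 -> 奖励受罚且lambda上调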


【7】PASM: Population Adaptive Symbolic Mixture-of-Experts Model for Cross-location Hurricane Evacuation Decision Prediction
标题:PASM:跨地点飓风疏散决策预测的人口自适应符号混合专家模型
链接:https://arxiv.org/abs/2604.00074

作者:Xiao Qian,Shangjia Dong
摘要:准确预测疏散行为对备灾至关重要,但在一个地区训练的模型在其他地方往往失效。利用一项多州飓风疏散调查,我们表明这种失效超出了特征分布偏移的范畴:具有相似特征的家庭在不同州遵循系统性不同的决策模式。因此,单一的全局模型会过拟合主导性反应、错误刻画脆弱的亚群体,并且跨地点泛化不佳。我们提出了人口自适应符号专家混合模型(PASM),它将大型语言模型引导的符号回归与专家混合架构相结合。PASM发现人类可读的闭式决策规则,将其专门化到数据驱动的亚群体,并在推理时将每个输入路由到相应的专家。在飓风Harvey和Irma的数据上,使用100个校准样本从佛罗里达州和德克萨斯州迁移到佐治亚州,PASM实现了0.607的马修斯相关系数,而XGBoost为0.404、TabPFN为0.333、GPT-5-mini为0.434,元学习基线MAML和原型网络的MCC $\leq$ 0.346。路由机制为各亚群体分配不同的公式原型,因此所得的行为画像可以直接解释。对四个人口统计轴的公平性审计发现,经Bonferroni校正后不存在统计显著差异。PASM在保持决策规则足够透明、可用于真实应急规划的同时,弥合了一半以上的跨地点泛化差距。
摘要:Accurate prediction of evacuation behavior is critical for disaster preparedness, yet models trained in one region often fail elsewhere. Using a multi-state hurricane evacuation survey, we show this failure goes beyond feature distribution shift: households with similar characteristics follow systematically different decision patterns across states. As a result, single global models overfit dominant responses, misrepresent vulnerable subpopulations, and generalize poorly across locations. We propose Population-Adaptive Symbolic Mixture-of-Experts (PASM), which pairs large language model guided symbolic regression with a mixture-of-experts architecture. PASM discovers human-readable closed-form decision rules, specializes them to data-driven subpopulations, and routes each input to the appropriate expert at inference time. On Hurricanes Harvey and Irma data, transferring from Florida and Texas to Georgia with 100 calibration samples, PASM achieves a Matthews correlation coefficient of 0.607, compared to XGBoost (0.404), TabPFN (0.333), GPT-5-mini (0.434), and meta-learning baselines MAML and Prototypical Networks (MCC $\leq$ 0.346). The routing mechanism assigns distinct formula archetypes to subpopulations, so the resulting behavioral profiles are directly interpretable. A fairness audit across four demographic axes finds no statistically significant disparities after Bonferroni correction. PASM closes more than half the cross-location generalization gap while keeping decision rules transparent enough for real-world emergency planning.


强化学习(7篇)

【1】Deep Reinforcement Learning for Robotic Manipulation under Distribution Shift with Bounded Extremum Seeking
标题:结合有界极值搜索的分布偏移下机器人操纵深度强化学习
链接:https://arxiv.org/abs/2604.01142

作者:Shaifalee Saxena,Rafael Fierro,Alexander Scheinker
摘要:强化学习在机器人操纵中表现出很强的性能,但当测试条件与训练分布不同时,学到的策略性能往往会下降。这一限制在推物和抓取放置等富接触任务中尤为重要,其中目标、接触条件或机器人动力学的变化可能在推理时使系统脱离分布。在本文中,我们研究了一种将强化学习与有界极值搜索(ES)相结合的混合控制器,以提高此类条件下的鲁棒性。在所提出的方法中,先在标准条件下针对机器人推物和抓取放置任务训练深度确定性策略梯度(DDPG)策略,然后在部署期间将其与有界ES相结合。RL策略提供快速的操纵行为,而有界ES确保当运行条件偏离训练期间所见条件时,整个控制器对时间变化保持鲁棒。所得控制器在若干分布外设置下进行了评估,包括随时间变化的目标和空间变化的摩擦区域。
摘要 :Reinforcement learning has shown strong performance in robotic manipulation, but learned policies often degrade in performance when test conditions differ from the training distribution. This limitation is especially important in contact-rich tasks such as pushing and pick-and-place, where changes in goals, contact conditions, or robot dynamics can drive the system out-of-distribution at inference time. In this paper, we investigate a hybrid controller that combines reinforcement learning with bounded extremum seeking to improve robustness under such conditions. In the proposed approach, deep deterministic policy gradient (DDPG) policies are trained under standard conditions on the robotic pushing and pick-and-place tasks, and are then combined with bounded ES during deployment. The RL policy provides fast manipulation behavior, while bounded ES ensures robustness of the overall controller to time variations when operating conditions depart from those seen during training. The resulting controller is evaluated under several out-of-distribution settings, including time-varying goals and spatially varying friction patches.


【2】Flow-based Policy With Distributional Reinforcement Learning in Trajectory Optimization
标题:轨迹优化中基于流的策略和分布式强化学习
链接:https://arxiv.org/abs/2604.00977

作者:Ruijie Hao,Longfei Zhang,Yang Dai,Yang Ma,Xingxing Liang,Guangquan Cheng
摘要:强化学习(RL)已被证明在解决复杂的控制和决策任务方面非常有效。然而,在大多数传统RL算法中,策略通常被参数化为对角高斯分布,这限制了策略捕获多模态分布,使得难以覆盖多解问题中的全部最优解,并且回报被降低为平均值,失去了其多模态性质,从而为策略更新提供了不足的指导。针对这些问题,我们提出了一个强化学习算法称为基于流的政策与分布式强化学习(FP-DRL)。该算法使用流匹配对策略进行建模,从而提供计算效率和拟合复杂分布的能力。此外,它采用分布式RL方法来建模和优化整个回报分布,从而更有效地指导多模态策略更新并提高代理性能。MuJoCo基准测试的实验结果表明,FP-DRL算法在大多数MuJoCo控制任务中实现了最先进的(SOTA)性能,同时表现出卓越的流策略表示能力。
摘要:Reinforcement Learning (RL) has proven highly effective in addressing complex control and decision-making tasks. However, in most traditional RL algorithms, the policy is typically parameterized as a diagonal Gaussian distribution, which constrains the policy from capturing multimodal distributions, making it difficult to cover the full range of optimal solutions in multi-solution problems, and the return is reduced to a mean value, losing its multimodal nature and thus providing insufficient guidance for policy updates. In response to these problems, we propose a RL algorithm termed flow-based policy with distributional RL (FP-DRL). This algorithm models the policy using flow matching, which offers both computational efficiency and the capacity to fit complex distributions. Additionally, it employs a distributional RL approach to model and optimize the entire return distribution, thereby more effectively guiding multimodal policy updates and improving agent performance. Experimental trials on MuJoCo benchmarks demonstrate that the FP-DRL algorithm achieves state-of-the-art (SOTA) performance in most MuJoCo control tasks while exhibiting superior representation capability of the flow policy.
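流匹配策略的核心回归目标可以用一个最小草图示意(编者补充,非论文官方实现;网络结构与动作维度均为假设):让速度场v_theta(x_t, t)拟合直线插值路径的速度(x1 - x0)。

import torch
import torch.nn as nn

act_dim = 6
v_net = nn.Sequential(nn.Linear(act_dim + 1, 64), nn.SiLU(), nn.Linear(64, act_dim))

def flow_matching_loss(x1):                   # x1: 目标动作样本 (batch, act_dim)
    x0 = torch.randn_like(x1)                 # 基分布(高斯)样本
    t = torch.rand(x1.size(0), 1)
    x_t = (1 - t) * x0 + t * x1               # 直线插值路径
    v_pred = v_net(torch.cat([x_t, t], dim=-1))
    return ((v_pred - (x1 - x0)) ** 2).mean() # 回归路径速度

flow_matching_loss(torch.randn(32, act_dim)).backward()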


【3】Policy Improvement Reinforcement Learning
标题:政策改进强化学习
链接:https://arxiv.org/abs/2604.00860

作者:Huaiyang Wang,Xiaojie Li,Deqing Wang,Haoyi Zhou,Zixuan Huang,Yaodong Yang,Jianxin Li,Yikun Ban
摘要:带有可验证奖励的强化学习(RLVR)已经成为提高大型语言模型推理能力的中心后训练范式。然而,现有的方法有一个共同的盲点:它们基于即时的组级或批级统计数据来优化策略,而没有验证由此产生的更新是否确实改善了模型。这种开环设计--在每一步都孤立地更新,只受组内(批量)奖励信号的指导--意味着优化可能会漂移或崩溃,而没有机制来检测和纠正这些故障。我们认为,缺少的成分是政策改进反馈:直接测量和优化迭代间进度的能力。为此,我们引入了策略改进强化学习(PIRL),这是一个框架,它用最大化迭代过程中累积策略改进的明确目标来取代代理奖励最大化,并证明了这个时间目标与最大化最终任务性能完全一致。在PIRL的基础上,我们提出了策略改进策略优化(PIPO),通过回溯验证实现闭环优化。在每次迭代中,PIPO都会根据滑动窗口历史基线评估上一次更新是否产生了真正的改进,然后积极加强有益的更新并抑制有害的更新-将开环过程转变为自我纠正过程。我们提供的理论分析表明,PIPO执行上升的PIRL目标的期望,和数学推理基准的实验表明,改进的稳定性和性能比GRPO及其变种。
摘要:Reinforcement Learning with Verifiable Rewards (RLVR) has become a central post-training paradigm for improving the reasoning capabilities of large language models. Yet existing methods share a common blind spot: they optimize policies based on instantaneous group-level or batch-level statistics without ever verifying whether the resulting update actually improved the model. This open-loop design -- updating in isolation at each step, guided only by within-group (batch) reward signals -- means optimization can drift or collapse with no mechanism to detect and correct these failures. We argue that the missing ingredient is policy improvement feedback: the ability to measure and optimize inter-iteration progress directly. To this end, we introduce Policy Improvement Reinforcement Learning (PIRL), a framework that replaces surrogate reward maximization with the explicit objective of maximizing cumulative policy improvement across iterations, and prove this temporal objective is perfectly aligned with maximizing final task performance. Building on PIRL, we propose Policy Improvement Policy Optimization (PIPO), which implements closed-loop optimization through retrospective verification. At each iteration, PIPO evaluates whether the previous update yielded genuine improvement against a sliding-window historical baseline, then actively reinforces beneficial updates and suppresses the harmful ones -- transforming an open-loop process into a self-correcting one. We provide theoretical analysis showing that PIPO performs ascent on the PIRL objective in expectation, and experiments on mathematical reasoning benchmarks demonstrate improved stability and performance over GRPO and its variants.


【4】Learning to Hint for Reinforcement Learning
标题:学习为强化学习提供提示
链接:https://arxiv.org/abs/2604.00698

作者:Yu Xia,Canwen Xu,Zhewei Yao,Julian McAuley,Yuxiong He
摘要:组相对策略优化(GRPO)被广泛用于带可验证奖励的强化学习,但它经常遭遇优势坍缩:当一个组中的所有推演都获得相同奖励时,该组产生的相对优势为零,因而没有学习信号。例如,如果一个问题对推理器来说太难,所有采样的推演都可能不正确并获得零奖励。最近的工作通过为此类难题添加提示或辅助支架来解决这一问题,使推理器产生混合结果并恢复非零更新。然而,现有的提示通常是固定的,而不是针对当前推理器自适应的,而且在带提示输入下产生学习信号的提示并不一定能改善测试时使用的无提示策略。为此,我们提出了用于强化学习的提示学习(HiLL),这是一个在强化学习期间联合训练提示策略和推理策略的框架。对于每个难题,提示器以当前推理器的错误推演为条件在线生成提示,使提示生成能够适应推理器不断演化的错误。我们进一步引入提示依赖,用于衡量正确的带提示轨迹在多大程度上依赖于提示。我们推导出一个可迁移性结果,表明较低的提示依赖意味着从带提示成功到无提示成功的更强迁移,并利用该结果定义了用于训练提示器的迁移加权奖励。因此,HiLL偏好那些不仅能恢复信息丰富的GRPO组、而且更有可能改善原始无提示策略的提示。在多个基准上的实验表明,HiLL始终优于GRPO和先前基于提示的基线,证明了自适应且迁移感知的提示学习对强化学习的价值。该代码可在https://github.com/Andree-9/HiLL上获得。
摘要 :Group Relative Policy Optimization (GRPO) is widely used for reinforcement learning with verifiable rewards, but it often suffers from advantage collapse: when all rollouts in a group receive the same reward, the group yields zero relative advantage and thus no learning signal. For example, if a question is too hard for the reasoner, all sampled rollouts can be incorrect and receive zero reward. Recent work addresses this issue by adding hints or auxiliary scaffolds to such hard questions so that the reasoner produces mixed outcomes and recovers a non-zero update. However, existing hints are usually fixed rather than adapted to the current reasoner, and a hint that creates learning signal under the hinted input does not necessarily improve the no-hint policy used at test time. To this end, we propose Hint Learning for Reinforcement Learning (HiLL), a framework that jointly trains a hinter policy and a reasoner policy during RL. For each hard question, the hinter generates hints online conditioned on the current reasoner's incorrect rollout, allowing hint generation to adapt to the reasoner's evolving errors. We further introduce hint reliance, which measures how strongly correct hinted trajectories depend on the hint. We derive a transferability result showing that lower hint reliance implies stronger transfer from hinted success to no-hint success, and we use this result to define a transfer-weighted reward for training the hinter. Therefore, HiLL favors hints that not only recover informative GRPO groups, but also produce signals that are more likely to improve the original no-hint policy. Experiments across multiple benchmarks show that HiLL consistently outperforms GRPO and prior hint-based baselines, demonstrating the value of adaptive and transfer-aware hint learning for RL. The code is available at https://github.com/Andree-9/HiLL.
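文中的"优势坍缩"可以用GRPO的组相对优势直接演示(编者示意,组内标准化为常见做法):

import numpy as np

def group_advantage(rewards, eps=1e-8):
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)   # 组内标准化的相对优势

print(group_advantage([0, 0, 1, 1]))  # 混合结果 -> 非零学习信号
print(group_advantage([0, 0, 0, 0]))  # 难题上全部失败 -> 优势全为0,无学习信号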


【5】GUIDE: Reinforcement Learning for Behavioral Action Support in Type 1 Diabetes
标题:GUIDE:用于1型糖尿病行为行动支持的强化学习
链接:https://arxiv.org/abs/2604.00385

作者:Saman Khamesian,Sri Harini Balaji,Di Yang Shi,Stephanie M. Carpenter,Daniel E. Rivera,W. Bradley Knox,Peter Stone,Hassan Ghasemzadeh
摘要:1型糖尿病(T1D)管理需要持续调整胰岛素和生活方式行为,以将血糖维持在安全的目标范围内。尽管自动胰岛素输注(AID)系统改善了血糖结果,但许多患者仍未能达到推荐的临床目标,因此需要新方法来改善T1D患者的血糖控制。虽然强化学习(RL)已被用作一种有前景的方法,但当前基于RL的方法主要集中于仅胰岛素治疗,并不提供用于血糖控制的行为建议。为弥补这一空白,我们提出GUIDE,一个基于RL的决策支持框架,旨在通过提供行为建议来预防血糖异常事件,从而补充AID技术。GUIDE生成由干预类型、幅度和时间定义的结构化动作,包括餐时胰岛素给药和碳水化合物摄入事件。GUIDE集成了一个基于真实世界连续血糖监测数据训练的患者特异性血糖水平预测器,并在统一环境中同时支持离线和在线RL算法。我们使用标准化血糖指标,在25名T1D患者上评估了离策略和在策略两类方法。在所评估的方法中,CQL-BC算法展现出最高的平均达标时间(time-in-range),达到85.49%,同时维持较低的低血糖暴露。行为相似性分析进一步表明,学到的CQL-BC策略保留了患者动作模式的关键结构特征,在受试者之间实现了0.87 $\pm$ 0.09的平均余弦相似度。这些发现表明,具有结构化行为动作空间的保守离线RL可以为个性化糖尿病管理提供具有临床意义且行为上合理的决策支持。
摘要:Type 1 Diabetes (T1D) management requires continuous adjustment of insulin and lifestyle behaviors to maintain blood glucose within a safe target range. Although automated insulin delivery (AID) systems have improved glycemic outcomes, many patients still fail to achieve recommended clinical targets, warranting new approaches to improve glucose control in patients with T1D. While reinforcement learning (RL) has been utilized as a promising approach, current RL-based methods focus primarily on insulin-only treatment and do not provide behavioral recommendations for glucose control. To address this gap, we propose GUIDE, an RL-based decision-support framework designed to complement AID technologies by providing behavioral recommendations to prevent abnormal glucose events. GUIDE generates structured actions defined by intervention type, magnitude, and timing, including bolus insulin administration and carbohydrate intake events. GUIDE integrates a patient-specific glucose level predictor trained on real-world continuous glucose monitoring data and supports both offline and online RL algorithms within a unified environment. We evaluate both off-policy and on-policy methods across 25 individuals with T1D using standardized glycemic metrics. Among the evaluated approaches, the CQL-BC algorithm demonstrates the highest average time-in-range, reaching 85.49% while maintaining low hypoglycemia exposures. Behavioral similarity analysis further indicates that the learned CQL-BC policy preserves key structural characteristics of patient action patterns, achieving a mean cosine similarity of 0.87 $\pm$ 0.09 across subjects. These findings suggest that conservative offline RL with a structured behavioral action space can provide clinically meaningful and behaviorally plausible decision support for personalized diabetes management.


【6】Focal plane wavefront control with model-based reinforcement learning
标题:基于模型的强化学习的焦平面波阵面控制
链接:https://arxiv.org/abs/2604.00993

作者:Jalo Nousiainen,Iremsu Taskin,Markus Kasper,Gilles Orban De Xivry,Olivier Absil
备注:13 pages, 11 figures accepted by A&A
摘要:对潜在宜居系外行星的直接成像是超大望远镜上高对比度成像仪器的一个主要科学目标。大多数这类系外行星的轨道靠近其宿主恒星,在那里它们的观测受到快速移动的大气散斑和准静态非共光路像差(NCPA)的限制。传统的NCPA校正方法通常使用机械反射镜探测,这会在运行期间损害性能。这项工作提出了基于机器学习的NCPA控制方法,利用序贯相位多样性自动检测并校正动态和静态NCPA误差。我们将自适应光学强化学习方面的先前工作扩展到焦平面控制。一种新的基于模型的RL算法,即NCPA策略优化(PO4NCPA),将焦平面图像作为输入,并通过序贯相位多样性确定相位校正,在无需先验系统知识的情况下同时优化非日冕仪和日冕仪后的PSF。此外,我们通过数值模拟地面望远镜上的静态NCPA误差以及受水汽视宁度影响的红外成像仪(动态NCPA),证明了这种方法的有效性。仿真表明,PO4NCPA能鲁棒地补偿静态和动态NCPA。在静态情形下,它在有日冕仪时实现接近最优的焦平面光抑制,在无日冕仪时实现接近最优的斯特列尔比。对于动态NCPA,它在这些指标上与模态最小二乘重建结合一步延迟积分器的性能相当。该方法在ELT光瞳、矢量涡旋日冕仪以及光子和背景噪声下仍然有效。PO4NCPA无需模型,可直接应用于标准成像以及任何日冕仪。其亚毫秒级的推理时间和性能也使其适用于HCI之外的大气湍流实时低阶校正。
摘要:The direct imaging of potentially habitable exoplanets is one prime science case for high-contrast imaging instruments on extremely large telescopes. Most such exoplanets orbit close to their host stars, where their observation is limited by fast-moving atmospheric speckles and quasi-static non-common-path aberrations (NCPA). Conventional NCPA correction methods often use mechanical mirror probes, which compromise performance during operation. This work presents machine-learning-based NCPA control methods that automatically detect and correct both dynamic and static NCPA errors by leveraging sequential phase diversity. We extend previous work in reinforcement learning for AO to focal plane control. A new model-based RL algorithm, Policy Optimization for NCPAs (PO4NCPA), interprets the focal-plane image as input data and, through sequential phase diversity, determines phase corrections that optimize both non-coronagraphic and post-coronagraphic PSFs without prior system knowledge. Further, we demonstrate the effectiveness of this approach by numerically simulating static NCPA errors on a ground-based telescope and an infrared imager affected by water-vapor-induced seeing (dynamic NCPAs). Simulations show that PO4NCPA robustly compensates static and dynamic NCPAs. In static cases, it achieves near-optimal focal-plane light suppression with a coronagraph and near-optimal Strehl without one. With dynamics NCPA, it matches the performance of the modal least-squares reconstruction combined with a 1-step delay integrator in these metrics. The method remains effective for the ELT pupil, vector vortex coronagraph, and under photon and background noise. PO4NCPA is model-free and can be directly applied to standard imaging as well as to any coronagraph. Its sub-millisecond inference times and performance also make it suitable for real-time low-order correction of atmospheric turbulence beyond HCI.


【7】Decomposable Reward Modeling and Realistic Environment Design for Reinforcement Learning-Based Forex Trading
标题:基于强化学习的外汇交易可分解奖励建模与现实环境设计
链接:https://arxiv.org/abs/2604.00031

作者:Nabeel Ahmad Saidd
摘要:将强化学习(RL)应用于外汇(Forex)交易仍然具有挑战性,因为需要同时满足现实的环境、定义良好的奖励函数和具有表达力的动作空间;而许多先前研究依赖简化的模拟器、单一标量奖励和受限的动作表示,限制了可解释性与实际相关性。本文提出一个模块化RL框架,通过三个紧密集成的组件来解决这些限制:其一是摩擦感知的执行引擎,强制执行严格的防前瞻语义(在时间t观察、在时间t+1执行并在时间t+1按市价计值),同时纳入价差、佣金、滑点、展期融资和保证金触发强平等现实成本;其二是可分解的11分量奖励架构,采用固定权重和逐步诊断日志,以支持系统性消融和分量级归因;其三是带合法动作掩码的10动作离散接口,编码显式的交易原语,同时施加保证金感知的可行性约束。对EURUSD的实证评估侧重学习动态而非泛化,并揭示了强烈的非单调奖励交互:额外的惩罚并不能可靠地改善结果;完整的奖励配置取得了最高的训练夏普比率(0.765)与累积回报(57.09%)。相对于保守的3动作基线,扩展动作空间提高了回报,但也增加了换手并降低了夏普比率,表明在固定训练预算下存在回报与交易活跃度的权衡;而启用缩放的变体始终降低回撤,组合配置取得最强的期末表现。
摘要 :Applying reinforcement learning (RL) to foreign exchange (Forex) trading remains challenging because realistic environments, well-defined reward functions, and expressive action spaces must be satisfied simultaneously, yet many prior studies rely on simplified simulators, single scalar rewards, and restricted action representations, limiting both interpretability and practical relevance. This paper presents a modular RL framework designed to address these limitations through three tightly integrated components: a friction-aware execution engine that enforces strict anti-lookahead semantics, with observations at time t, execution at time t+1, and mark-to-market at time t+1, while incorporating realistic costs such as spread, commission, slippage, rollover financing, and margin-triggered liquidation; a decomposable 11-component reward architecture with fixed weights and per-step diagnostic logging to enable systematic ablation and component-level attribution; and a 10-action discrete interface with legal-action masking that encodes explicit trading primitives while enforcing margin-aware feasibility constraints. Empirical evaluation on EURUSD focuses on learning dynamics rather than generalization and reveals strongly non-monotonic reward interactions, where additional penalties do not reliably improve outcomes; the full reward configuration achieves the highest training Sharpe (0.765) and cumulative return (57.09 percent). The expanded action space increases return but also turnover and reduces Sharpe relative to a conservative 3-action baseline, indicating a return-activity trade-off under a fixed training budget, while scaling-enabled variants consistently reduce drawdown, with the combined configuration achieving the strongest endpoint performance.
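下面给出一个极简的Python示意,展示摘要所述"固定权重可分解奖励+逐步诊断日志"与"合法动作掩码"的接口形态;分量名称、权重与动作划分均为假设,并非论文的11分量定义或10动作接口本身。

```python
import numpy as np

# 假设的分量名称与固定权重:论文使用11个分量,此处仅示意其结构
WEIGHTS = {"pnl": 1.0, "drawdown": -0.5, "turnover": -0.1}

def decomposed_reward(components):
    """按固定权重聚合奖励分量,并返回逐项诊断日志,便于消融与分量级归因。"""
    per_term = {k: WEIGHTS[k] * v for k, v in components.items()}
    return sum(per_term.values()), per_term

def legal_action_mask(n_actions, free_margin):
    """合法动作掩码示意:可用保证金不足时仅允许持有/平仓类动作(动作划分为假设)。"""
    mask = np.ones(n_actions, dtype=bool)
    if free_margin <= 0:
        mask[2:] = False  # 假设索引0/1为 hold / close,其余为开仓与加减仓原语
    return mask

r, log = decomposed_reward({"pnl": 0.8, "drawdown": 0.2, "turnover": 1.5})
print(round(r, 3), log)                        # 0.55 与逐项日志
print(legal_action_mask(10, free_margin=-1.0))
```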


分层学习(1篇)

【1】Hierarchical Apprenticeship Learning from Imperfect Demonstrations with Evolving Rewards
标题:基于演化奖励、从不完美演示中进行的分层学徒学习
链接:https://arxiv.org/abs/2604.00258

作者:Md Mirajul Islam,Rajesh Debnath,Adittya Soukarjya Saha,Min Chi
备注:AIED 2026
摘要:虽然学徒学习已经显示出直接从电子学习环境中的学生交互中归纳有效教学策略的潜力,但大多数现有方法依赖于固定奖励下最优或近似最优的专家演示。然而,现实世界中的学生交互往往是不完美且不断演化的:学生会探索、犯错、修改策略,并随着理解加深而调整目标。在这项工作中,我们认为不完美的学生演示并非应被丢弃的噪声,而是结构化的信号,前提是对其相对质量进行排序。我们提出HALIDE(基于演化奖励、从不完美演示中进行的分层学徒学习),它不仅利用次优的学生演示,还在分层学习框架内对其进行排序。HALIDE在多个抽象层次上对学生行为建模,从而能够从次优动作中推断更高层次的意图与策略,同时显式刻画学生奖励函数的时间演化。通过将演示质量整合进分层奖励推断,HALIDE得以区分瞬时错误、次优策略以及朝向更高层次学习目标的有意义进展。结果表明,与依赖最优轨迹、固定奖励或未排序不完美演示的方法相比,HALIDE能更准确地预测学生的教学决策。
摘要:While apprenticeship learning has shown promise for inducing effective pedagogical policies directly from student interactions in e-learning environments, most existing approaches rely on optimal or near-optimal expert demonstrations under a fixed reward. Real-world student interactions, however, are often inherently imperfect and evolving: students explore, make errors, revise strategies, and refine their goals as understanding develops. In this work, we argue that imperfect student demonstrations are not noise to be discarded, but structured signals, provided their relative quality is ranked. We introduce HALIDE, Hierarchical Apprenticeship Learning from Imperfect Demonstrations with Evolving Rewards, which not only leverages sub-optimal student demonstrations, but ranks them within a hierarchical learning framework. HALIDE models student behavior at multiple levels of abstraction, enabling inference of higher-level intent and strategy from suboptimal actions while explicitly capturing the temporal evolution of student reward functions. By integrating demonstration quality into hierarchical reward inference, HALIDE distinguishes transient errors from suboptimal strategies and meaningful progress toward higher-level learning goals. Our results show that HALIDE more accurately predicts student pedagogical decisions than approaches that rely on optimal trajectories, fixed rewards, or unranked imperfect demonstrations.


医学相关(1篇)

【1】Genetic algorithms for multi-omic feature selection: a comparative study in cancer survival analysis
标题:用于多组学特征选择的遗传算法:癌症生存分析中的比较研究
链接:https://arxiv.org/abs/2604.00065

作者:Luca Cattelani,Vittorio Fortino
摘要:多组学数据集为改进癌症研究中的生物标志物发现提供了机会,但其高维性和有限的样本量使得识别紧凑而有效的生物标志物组颇具挑战。大规模组学中的特征选择可以通过将机器学习与遗传算法相结合来高效解决,遗传算法天然支持对预测准确性和生物标志物集合大小的多目标优化。然而,遗传算法在多组学特征选择中仍相对欠探索,多数方法只是把所有组学层串接到单一特征空间。为解决这一局限,我们提出Sweeping*,一种在单视图与多视图优化之间交替的多视图、多目标算法。它采用嵌套的单视图多目标优化器;在本研究中我们使用遗传算法NSGA3-CHS。算法先在每个组学层内识别信息量高的生物标志物,再联合评估跨层相互作用;这些多组学解进而指导下一轮单视图搜索。通过反复扫掠,算法逐步识别出能捕获跨模态互补信号的紧凑生物标志物组。我们使用三个TCGA队列上的生存预测对五种Sweeping* 策略(包括分层与基于串接的变体)进行基准测试。每种策略联合优化预测精度与集合大小,分别以一致性指数(concordance index)和root-leanness度量。整体性能与估计误差在5折交叉验证下通过交叉超体积和Pareto delta评估。结果表明,当生存信号足够强时,Sweeping* 能改善准确性与复杂度的权衡;整合组学层能在仅用临床变量的模型之上提升生存预测,尽管收益仍依赖于具体队列。
摘要:Multi-omic datasets offer opportunities for improved biomarker discovery in cancer research, but their high dimensionality and limited sample sizes make identifying compact and effective biomarker panels challenging. Feature selection in large-scale omics can be efficiently addressed by combining machine learning with genetic algorithms, which naturally support multi-objective optimization of predictive accuracy and biomarker set size. However, genetic algorithms remain relatively underexplored for multi-omic feature selection, where most approaches concatenate all layers into a single feature space. To address this limitation, we introduce Sweeping*, a multi-view, multi-objective algorithm alternating between single- and multi-view optimization. It employs a nested single-view multi-objective optimizer, and for this study we use the genetic algorithm NSGA3-CHS. It first identifies informative biomarkers within each layer, then jointly evaluates cross-layer interactions; these multi-omic solutions guide the next single-view search. Through repeated sweeps, the algorithm progressively identifies compact biomarker panels capturing cross-modal complementary signals. We benchmark five Sweeping* strategies, including hierarchical and concatenation-based variants, using survival prediction on three TCGA cohorts. Each strategy jointly optimizes predictive accuracy and set size, measured via the concordance index and root-leanness. Overall performance and estimation error are assessed through cross hypervolume and Pareto delta under 5-fold cross-validation. Our results show that Sweeping* can improve the accuracy-complexity trade-off when sufficient survival signal is present and that integrating omic layers can enhance survival prediction beyond clinical-only models, although benefits remain cohort-dependent.


蒸馏|知识提取(2篇)

【1】Property-Level Flood Risk Assessment Using AI-Enabled Street-View Lowest Floor Elevation Extraction and ML Imputation Across Texas
标题:基于AI街景最低楼层高程提取与机器学习插补的德克萨斯州房产级洪水风险评估
链接:https://arxiv.org/abs/2604.01153

作者:Xiangpeng Li,Yu-Hsuan Ho,Sam D Brody,Ali Mostafavi
摘要:本文认为,由性能门控的机器学习插补加以补充的AI街景图像分析,为在区域尺度上生成特定于建筑物的高程数据以进行洪水风险评估提供了一条可行路径。我们在德克萨斯州的18个感兴趣区域(AOI)开发并应用了一个三阶段管道:(1)使用Elev-Vision框架从Google街景图像中提取最低楼层高程(LFE)以及街面高程与最低楼层之间的高度差(HDSL);(2)使用在16个地形、水文、地理与洪水暴露特征上训练的随机森林和梯度提升模型插补缺失的HDSL值;(3)将得到的高程数据集与Fathom 100年一遇淹没面及USACE深度-损失函数集成,以估计特定房产的室内淹没深度与预期损失。在12,241个住宅结构中,73.4%的地块可获得街景图像,49.0%(5,992个结构)的LFE/HDSL直接提取成功。对于13个交叉验证性能可靠的AOI保留了插补结果,所选模型的R²值为0.159至0.974;有5个AOI因性能不足被明确排除在预测之外。结果表明,基于街景的高程测绘并非对每处房产都可用,但其可扩展性足以通过从危险暴露走向室内淹没与预期损失的结构级估计,切实改进区域洪水风险刻画。在科学层面,该研究将LFE估计从试点规模的概念验证推进为区域性的端到端工作流;在实践层面,它为缺乏完整高程证书、却需要地块级信息以支持减灾、规划和洪水风险管理的辖区提供了一个可复制的框架。
摘要:This paper argues that AI-enabled analysis of street-view imagery, complemented by performance-gated machine-learning imputation, provides a viable pathway for generating building-specific elevation data at regional scale for flood risk assessment. We develop and apply a three-stage pipeline across 18 areas of interest (AOIs) in Texas that (1) extracts LFE and the height difference between street grade and the lowest floor (HDSL) from Google Street View imagery using the Elev-Vision framework, (2) imputes missing HDSL values with Random Forest and Gradient Boosting models trained on 16 terrain, hydrologic, geographic, and flood-exposure features, and (3) integrates the resulting elevation dataset with Fathom 1-in-100 year inundation surfaces and USACE depth-damage functions to estimate property-specific interior flood depth and expected loss. Across 12,241 residential structures, street-view imagery was available for 73.4% of parcels and direct LFE/HDSL extraction was successful for 49.0% (5,992 structures). Imputation was retained for 13 AOIs where cross-validated performance was defensible, with selected models achieving R-squared values from 0.159 to 0.974; five AOIs were explicitly excluded from prediction because performance was insufficient. The results show that street-view-based elevation mapping is not universally available for every property, but it is sufficiently scalable to materially improve regional flood-risk characterization by moving beyond hazard exposure to structure-level estimates of interior inundation and expected damage. Scientifically, the study advances LFE estimation from a pilot-scale proof of concept to a regional, end-to-end workflow. Practically, it offers a replicable framework for jurisdictions that lack comprehensive Elevation Certificates but need parcel-level information to support mitigation, planning, and flood-risk management.
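摘要中"性能门控"的核心逻辑(仅当交叉验证R²可靠时才保留某AOI的插补模型,否则放弃预测)可用如下草图表达;阈值0.15与特征构造为假设,仅示意门控流程,并非论文的实现。

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

def gated_imputer(X, y, r2_threshold=0.15):
    """仅当5折交叉验证R²达到阈值时才保留该AOI的插补模型,否则显式排除。"""
    model = RandomForestRegressor(n_estimators=200, random_state=0)
    r2 = cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    if r2 < r2_threshold:
        return None, r2           # 性能不足:该AOI被排除在预测之外
    return model.fit(X, y), r2    # 性能可靠:返回训练好的模型

# 合成数据示意:16个地形/水文等特征,HDSL为插补目标
X = rng.normal(size=(300, 16))
y = X[:, 0] * 0.8 + rng.normal(scale=0.3, size=300)
model, r2 = gated_imputer(X, y)
print(f"CV R² = {r2:.3f}, retained = {model is not None}")
```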


【2】SAGE: Subsurface AI-driven Geostatistical Extraction with proxy posterior
标题:SAGE:基于代理后验的AI驱动地下地质统计提取
链接:https://arxiv.org/abs/2604.00307

作者:Huseyin Tuna Erdinc,Ipsita Bhar,Rafael Orozco,Thales Souza,Felix J. Herrmann
备注:7 pages, 4 figures
摘要:生成网络的最新进展为地下速度模型合成提供了新方法,是全波形反演等传统方法之外一个颇具吸引力的替代方案。然而,这些方法主要依赖大规模、高质量、地质上真实的地下速度模型数据集,而这在实践中往往难以获得。我们提出SAGE,一个从不完整观测(特别是稀疏测井曲线和偏移地震图像)生成统计上一致的代理速度的新框架。在训练阶段,SAGE以两种模态(测井与地震)为条件,学习速度模型上的代理后验;在推理阶段,它仅以偏移图像为条件生成全分辨率速度场,测井信息则隐式编码在学习到的分布中。这使得生成地质上合理且统计上准确的速度实现成为可能。我们在合成与实测数据集上验证了SAGE,证明其能在有限观测约束下捕捉复杂的地下变化。此外,从学习到的代理分布中抽取的样本可用于训练下游网络,从而支持反演工作流。总的来说,SAGE为学习用于地震成像与反演的地质代理后验提供了一条可扩展且数据高效的途径。代码仓库链接:https://github.com/slimgroup/SAGE。
摘要:Recent advances in generative networks have enabled new approaches to subsurface velocity model synthesis, offering a compelling alternative to traditional methods such as Full Waveform Inversion. However, these approaches predominantly rely on the availability of large-scale datasets of high-quality, geologically realistic subsurface velocity models, which are often difficult to obtain in practice. We introduce SAGE, a novel framework for statistically consistent proxy velocity generation from incomplete observations, specifically sparse well logs and migrated seismic images. During training, SAGE learns a proxy posterior over velocity models conditioned on both modalities (wells and seismic); at inference, it produces full-resolution velocity fields conditioned solely on migrated images, with well information implicitly encoded in the learned distribution. This enables the generation of geologically plausible and statistically accurate velocity realizations. We validate SAGE on both synthetic and field datasets, demonstrating its ability to capture complex subsurface variability under limited observational constraints. Furthermore, samples drawn from the learned proxy distribution can be leveraged to train downstream networks, supporting inversion workflows. Overall, SAGE provides a scalable and data-efficient pathway toward learning geological proxy posterior for seismic imaging and inversion. Repo link: https://github.com/slimgroup/SAGE.


超分辨率|去噪|去模糊|去雾(3篇)

【1】Differentially Private Manifold Denoising
标题:差分隐私流形去噪
链接:https://arxiv.org/abs/2604.00942

作者:Jiaqi Wu,Yiqing Sun,Zhigang Yao
备注:59 pages
摘要:我们引入一个差分隐私流形去噪框架,允许用户在不损害隐私的情况下,利用敏感的参考数据集来校正含噪的非私有查询点。该方法遵循一个迭代过程:(i)在经校准的敏感度下,利用参考数据私有地估计局部均值与切向几何;(ii)每次迭代通过校正步骤,将查询点沿私有估计的子空间向局部均值投影;(iii)使用$(\varepsilon,δ)$-差分隐私(DP)在迭代与查询之间进行严格的隐私核算。概念上,该框架把差分隐私引入流形方法,为嵌入、聚类和可视化等下游任务保留足够的几何信号,同时为参考数据提供形式化的DP保证。实践上,该过程模块化且可扩展,将受DP保护的局部几何(均值与切向)与有预算约束的查询点更新分离,并用一个简单的调度器在迭代与查询之间分配隐私预算。在关于流形正则性、采样密度和测量噪声的标准假设下,我们建立了高概率的效用保证,表明被校正的查询点以由样本量、噪声水平、带宽和隐私预算共同决定的非渐近速率收敛到流形。仿真与案例研究表明,在适度隐私预算下可实现准确的信号恢复,清晰展示了效用与隐私的权衡,并为受监管环境中基于流形的工作流提供了一个无需重新设计隐私系统即可部署的DP组件。
摘要:We introduce a differentially private manifold denoising framework that allows users to exploit sensitive reference datasets to correct noisy, non-private query points without compromising privacy. The method follows an iterative procedure that (i) privately estimates local means and tangent geometry using the reference data under calibrated sensitivity, (ii) projects query points along the privately estimated subspace toward the local mean via corrective steps at each iteration, and (iii) performs rigorous privacy accounting across iterations and queries using $(\varepsilon,δ)$-differential privacy (DP). Conceptually, this framework brings differential privacy to manifold methods, retaining sufficient geometric signal for downstream tasks such as embedding, clustering, and visualization, while providing formal DP guarantees for the reference data. Practically, the procedure is modular and scalable, separating DP-protected local geometry (means and tangents) from budgeted query-point updates, with a simple scheduler allocating privacy budget across iterations and queries. Under standard assumptions on manifold regularity, sampling density, and measurement noise, we establish high-probability utility guarantees showing that corrected queries converge toward the manifold at a non-asymptotic rate governed by sample size, noise level, bandwidth, and the privacy budget. Simulations and case studies demonstrate accurate signal recovery under moderate privacy budgets, illustrating clear utility-privacy trade-offs and providing a deployable DP component for manifold-based workflows in regulated environments without reengineering privacy systems.
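以下为单步"私有局部均值估计+校正"的极简示意(范数裁剪校准敏感度、高斯机制加噪;省略了切空间投影与跨迭代的预算调度)。噪声标定与各参数均为示意性假设,并非论文的完整机制。

```python
import numpy as np

rng = np.random.default_rng(0)

def dp_local_mean(points, center, radius, clip, sigma):
    """私有局部均值:邻域内各点按范数裁剪以校准敏感度,再加高斯噪声(简化示意)。"""
    nbr = points[np.linalg.norm(points - center, axis=1) < radius]
    if len(nbr) == 0:
        return center
    clipped = nbr / np.maximum(1.0, np.linalg.norm(nbr, axis=1, keepdims=True) / clip)
    noise = rng.normal(scale=sigma * clip / len(nbr), size=center.shape)
    return clipped.mean(axis=0) + noise

def corrective_step(q, points, radius=0.5, clip=2.0, sigma=4.0, step=0.5):
    """一次校正:将查询点朝私有估计的局部均值移动(省略了切空间投影)。"""
    mu = dp_local_mean(points, q, radius, clip, sigma)
    return q + step * (mu - q)

# 噪声单位圆作为"流形"的玩具示例
theta = rng.uniform(0, 2 * np.pi, 2000)
ref = np.c_[np.cos(theta), np.sin(theta)] + rng.normal(scale=0.05, size=(2000, 2))
q = np.array([1.3, 0.1])
for _ in range(5):
    q = corrective_step(q, ref)
print(q, np.linalg.norm(q))  # 范数应接近1,即被拉回单位圆附近
```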


【2】Super-Resolving Coarse-Resolution Weather Forecasts With Flow Matching
标题:基于流匹配的粗分辨率天气预报超分辨率
链接:https://arxiv.org/abs/2604.00897

作者:Aymeric Delefosse,Anastase Charantonis,Dominique Béréziat
备注:Accepted to Climate Informatics 2026
摘要:基于机器学习的天气预报模型如今已超越最先进的数值天气预报系统,但在高空间分辨率下训练和运行这些模型在计算上仍然昂贵。我们提出一个模块化框架,将学习到的生成式超分辨率作为后处理步骤应用于粗分辨率预报轨迹,从而把预报与空间分辨率解耦。我们将超分辨率形式化为一个随机逆问题,采用残差表述在保留大尺度结构的同时重建未被解析的变率。模型仅在再分析数据上以流匹配方式训练,并应用于全球中期预报。我们评估了(i)设计一致性:将超分辨率预报重新粗化并与原始粗轨迹比较;以及(ii)高分辨率预报质量:采用标准集合检验指标与谱诊断。结果表明,超分辨率在重新粗化后保留了大尺度结构和方差,引入了物理上一致的小尺度变率,并在0.25°分辨率下相对业务集合基线取得了有竞争力的概率预报技巧,而相比端到端高分辨率预报只需适度的额外训练成本。
摘要 :Machine learning-based weather forecasting models now surpass state-of-the-art numerical weather prediction systems, but training and operating these models at high spatial resolution remains computationally expensive. We present a modular framework that decouples forecasting from spatial resolution by applying learned generative super-resolution as a post-processing step to coarse-resolution forecast trajectories. We formulate super-resolution as a stochastic inverse problem, using a residual formulation to preserve large-scale structure while reconstructing unresolved variability. The model is trained with flow matching exclusively on reanalysis data and is applied to global medium-range forecasts. We evaluate (i) design consistency by re-coarsening super-resolved forecasts and comparing them to the original coarse trajectories, and (ii) high-resolution forecast quality using standard ensemble verification metrics and spectral diagnostics. Results show that super-resolution preserves large-scale structure and variance after re-coarsening, introduces physically consistent small-scale variability, and achieves competitive probabilistic forecast skill at 0.25° resolution relative to an operational ensemble baseline, while requiring only a modest additional training cost compared with end-to-end high-resolution forecasting.
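摘要中的残差式流匹配超分辨率可以写成如下标准形式;记号为示意性假设(采用常见的直线插值路径条件流匹配),并非论文的精确设定:

```latex
% 残差表述:高分辨率场 = 粗分辨率场的上采样 + 生成的残差 r,
%   x^{HR} = \mathrm{Up}(x^{LR}) + r.
% 流匹配在残差上训练速度场 v_\theta,以直线插值路径为监督:
\[
  r_t = (1-t)\, r_0 + t\, r_1, \qquad
  r_0 \sim \mathcal{N}(0, I), \quad r_1 = x^{HR} - \mathrm{Up}(x^{LR}),
\]
\[
  \mathcal{L}(\theta)
  = \mathbb{E}_{t,\, r_0,\, r_1}\,
    \big\| v_\theta(r_t, t \mid x^{LR}) - (r_1 - r_0) \big\|^2 .
\]
% 采样时从 r_0 出发沿 v_\theta 积分 ODE 得到残差样本,叠加到上采样场上
% 即得到随机的超分辨率实现,这与"随机逆问题"的表述一致。
```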


【3】Denoising distances beyond the volumetric barrier
标题:超越体积障碍的距离去噪
链接:https://arxiv.org/abs/2604.00432

作者:Han Huang,Pakawut Jiradilok,Elchanan Mossel
摘要:我们研究由随机几何图重构$d$维黎曼流形潜在几何的问题。虽然近期工作在从随机几何图(更一般地,从含噪距离)恢复流形方面取得了重大进展,但成对距离估计的精度从根本上受体积障碍限制:即自然采样间距尺度$n^{-1/d}$,它源于流形上的一般点到最近采样点的距离通常为$n^{-1/d}$阶这一事实。本文提出一种新方法,正交环距离估计程序(Orthogonal Ring Distance Estimation Routine,ORDER),它在多项式时间内实现$n^{-2/(d+5)}$阶(至多相差$n$的多对数因子)的逐点距离估计精度。当维度$d > 5$时,这严格突破了体积障碍。   作为逐点精度优于$n^{-1/d}$的结果,我们证明了重构的度量测度空间与真实潜在流形之间的Gromov-Wasserstein距离为$n^{-1/d}$阶。这与经验测度的Wasserstein收敛速度相匹配,表明我们重建的图度量在渐近意义上与直接获得采样点的完整成对距离矩阵一样好。我们的结果在非常一般的设定下成立,涵盖一般的含噪成对距离模型、稀疏随机几何图以及未知的连接概率函数。
摘要:We study the problem of reconstructing the latent geometry of a $d$-dimensional Riemannian manifold from a random geometric graph. While recent works have made significant progress in manifold recovery from random geometric graphs, and more generally from noisy distances, the precision of pairwise distance estimation has been fundamentally constrained by the volumetric barrier, namely the natural sample-spacing scale $n^{-1/d}$ coming from the fact that a generic point of the manifold typically lies at distance of order $n^{-1/d}$ from the nearest sampled point. In this paper, we introduce a novel approach, Orthogonal Ring Distance Estimation Routine (ORDER), which achieves a pointwise distance estimation precision of order $n^{-2/(d+5)}$ up to polylogarithmic factors in $n$ in polynomial time. This strictly beats the volumetric barrier for dimensions $d > 5$.   As a consequence of obtaining pointwise precision better than $n^{-1/d}$, we prove that the Gromov--Wasserstein distance between the reconstructed metric measure space and the true latent manifold is of order $n^{-1/d}$. This matches the Wasserstein convergence rate of empirical measures, demonstrating that our reconstructed graph metric is asymptotically as good as having access to the full pairwise distance matrix of the sampled points. Our results are proven in a very general setting which includes general models of noisy pairwise distances, sparse random geometric graphs, and unknown connection probability functions.
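速率 $n^{-2/(d+5)}$ 何时严格优于体积障碍 $n^{-1/d}$,可由摘要中的两个指数直接比较,一行推导即可验证:

```latex
\[
  n^{-2/(d+5)} \ll n^{-1/d}
  \iff \frac{2}{d+5} > \frac{1}{d}
  \iff 2d > d+5
  \iff d > 5 .
\]
```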


点云|SLAM|雷达|激光|深度RGBD相关(1篇)

【1】Neural Collapse Dynamics: Depth, Activation, Regularisation, and Feature Norm Threshold
标题:神经崩溃动力学:深度、激活、正则化与特征范数阈值
链接:https://arxiv.org/abs/2604.00230

作者:Anamika Paul Rupa
摘要:神经崩溃(NC),即倒数第二层特征收敛到单纯形等角紧框架,在平衡态下已被充分理解,但支配其发生时机的动力学仍缺乏刻画。我们发现一个简单且可预测的规律:当平均特征范数达到某个依赖于模型与数据集的临界值fn* 时NC发生,且该临界值对训练条件基本不变。该值在每个(模型,数据集)对内高度集中(CV < 8%);训练动态主要影响fn逼近fn* 的速度,而非该值本身。在标准训练轨迹中,fn跌破fn* 始终先于NC开始,由此得到一个平均提前62个epoch(MAE为24个epoch)的实用预测器。直接干预实验证实fn* 是梯度流的稳定吸引子:对特征尺度的扰动会在训练中自我校正,无论扰动方向如何都收敛到同一数值(p>0.2)。补全(架构)x(数据集)网格得到本文最强的结果:MNIST上的ResNet-20给出fn* = 5.867,对应+458%的架构效应,而在CIFAR-10上仅为+68%。该网格呈强非加性;fn* 无法分解为架构与数据集的独立贡献。由此显现四条结构性规律:(1)深度对崩溃速度有非单调影响;(2)激活函数同时决定崩溃速度与fn*;(3)权重衰减定义了一个三区间相图:过小会减慢崩溃,最优区间最快,过大则阻止崩溃;(4)宽度单调加速崩溃,同时使fn* 至多偏移13%。这些结果把特征范数动态确立为预测NC时机的可操作诊断,表明范数阈值行为是深度网络中延迟表征重组的一般机制。
摘要:Neural collapse (NC) -- the convergence of penultimate-layer features to a simplex equiangular tight frame -- is well understood at equilibrium, but the dynamics governing its onset remain poorly characterised. We identify a simple and predictive regularity: NC occurs when the mean feature norm reaches a model-dataset-specific critical value, fn*, that is largely invariant to training conditions. This value concentrates tightly within each (model, dataset) pair (CV < 8%); training dynamics primarily affect the rate at which fn approaches fn*, rather than the value itself. In standard training trajectories, the crossing of fn below fn* consistently precedes NC onset, providing a practical predictor with a mean lead time of 62 epochs (MAE 24 epochs). A direct intervention experiment confirms fn* is a stable attractor of the gradient flow -- perturbations to feature scale are self-corrected during training, with convergence to the same value regardless of direction (p>0.2). Completing the (architecture)x(dataset) grid reveals the paper's strongest result: ResNet-20 on MNIST gives fn* = 5.867 -- a +458% architecture effect versus only +68% on CIFAR-10. The grid is strongly non-additive; fn* cannot be decomposed into independent architecture and dataset contributions. Four structural regularities emerge: (1) depth has a non-monotonic effect on collapse speed; (2) activation jointly determines both collapse speed and fn*; (3) weight decay defines a three-regime phase diagram -- too little slows, an optimal range is fastest, and too much prevents collapse; (4) width monotonically accelerates collapse while shifting fn* by at most 13%. These results establish feature-norm dynamics as an actionable diagnostic for predicting NC timing, suggesting that norm-threshold behaviour is a general mechanism underlying delayed representational reorganisation in deep networks.
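摘要中的先行指标可以直接操作化:逐epoch计算平均特征范数fn,并检测其首次跌破阈值fn* 的时刻。以下草图中fn* 取论文为ResNet-20/MNIST报告的数值,轨迹为虚构示例。

```python
import numpy as np

def mean_feature_norm(feats):
    """计算一批倒数第二层特征的平均范数 fn。"""
    return float(np.linalg.norm(feats, axis=1).mean())

def crossing_epoch(fn_history, fn_star):
    """返回 fn 首次跌破 fn* 的epoch;论文将该事件用作NC开始的先行指标。"""
    for epoch, fn in enumerate(fn_history):
        if fn < fn_star:
            return epoch
    return None

fn_star = 5.867                                  # 论文为 ResNet-20/MNIST 报告的临界值
history = [9.0 - 0.05 * t for t in range(100)]   # 假设的单调下降轨迹
print("crossing epoch:", crossing_epoch(history, fn_star))  # 63
```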


推理|分析|理解|解释(9篇)

【1】LAtent Phase Inference from Short time sequences using SHallow REcurrent Decoders (LAPIS-SHRED)
标题:使用浅循环解码器从短时间序列推断潜在相位(LAPIS-SHRED)
链接:https://arxiv.org/abs/2604.01216

作者:Yuxuan Bao,Xingyue Zhang,J. Nathan Kutz
摘要:从空间和时间上都稀疏的观测重建完整的时空动态,仍是复杂系统中的核心挑战:测量既可能在空间上不完整,也可能局限于狭窄的时间窗口。然而,逼近完整的时空轨迹对机理洞察与理解、模型校准和运行决策都不可或缺。我们提出LAPIS-SHRED(利用浅循环解码器从短时间序列推断潜在相位),一种从局限于短时间窗口的稀疏传感器观测中重建和/或预测完整时空动态的模块化架构。LAPIS-SHRED通过三阶段流水线运行:(i)SHRED模型完全在仿真数据上预训练,将传感器时间历史映射到结构化潜在空间;(ii)时间序列模型在仿真导出的潜在轨迹上训练,学习在时间上向前或向后传播潜在状态,从短观测窗口覆盖未被观测的时间区域;(iii)部署时仅提供来自真实系统的超稀疏传感器测量的短观测窗口,冻结的SHRED模型与时间模型据此联合重建或预测完整的时空轨迹。该框架支持双向推断,凭借模块化结构继承了数据同化与多尺度重建能力,并能适应包括单帧末端输入在内的极端观测约束。我们在六个跨越复杂时空物理的实验上评估LAPIS-SHRED:湍流、多尺度推进物理、剧烈燃烧瞬变以及卫星观测的环境场,展示了一种适用于观测受物理或后勤条件限制的运行场景的轻量级模块化架构。
摘要 :Reconstructing full spatio-temporal dynamics from sparse observations in both space and time remains a central challenge in complex systems, as measurements can be spatially incomplete and can be also limited to narrow temporal windows. Yet approximating the complete spatio-temporal trajectory is essential for mechanistic insight and understanding, model calibration, and operational decision-making. We introduce LAPIS-SHRED (LAtent Phase Inference from Short time sequence using SHallow REcurrent Decoders), a modular architecture that reconstructs and/or forecasts complete spatiotemporal dynamics from sparse sensor observations confined to short temporal windows. LAPIS-SHRED operates through a three-stage pipeline: (i) a SHRED model is pre-trained entirely on simulation data to map sensor time-histories into a structured latent space, (ii) a temporal sequence model, trained on simulation-derived latent trajectories, learns to propagate latent states forward or backward in time to span unobserved temporal regions from short observational time windows, and (iii) at deployment, only a short observation window of hyper-sparse sensor measurements from the true system is provided, from which the frozen SHRED model and the temporal model jointly reconstruct or forecast the complete spatiotemporal trajectory. The framework supports bidirectional inference, inherits data assimilation and multiscale reconstruction capabilities from its modular structure, and accommodates extreme observational constraints including single-frame terminal inputs. We evaluate LAPIS-SHRED on six experiments spanning complex spatio-temporal physics: turbulent flows, multiscale propulsion physics, volatile combustion transients, and satellite-derived environmental fields, highlighting a lightweight, modular architecture suited for operational settings where observation is constrained by physical or logistical limitations.


【2】Toward Personalized Darts Training: A Data-Driven Framework Based on Skeleton-Based Biomechanical Analysis and Motion Modeling
标题:迈向个性化飞镖训练:基于骨架的生物力学分析与运动建模的数据驱动框架
链接:https://arxiv.org/abs/2604.01130

作者:Zhantao Chen,Dongyi He,Jin Fang,Xi Chen,Yisuo Liu,Xiaozhen Zhong,Xuejun Hu
摘要:随着运动训练变得越来越数据化,传统的飞镖训练主要基于经验和视觉观察,越来越不适合高精度、目标导向的运动。虽然以前的研究已经强调了释放参数,联合运动,协调在飞镖投掷的重要性,大多数定量方法仍然集中在局部变量,单次释放指标,或静态模板匹配。这些方法为个性化训练提供了有限的支持,并且经常忽略有用的运动变化。本文提出了一个数据驱动的飞镖训练辅助系统。该系统创建了一个涵盖运动捕捉、特征建模和个性化反馈的闭环框架。使用Kinect 2.0深度传感器和光学相机在无标记条件下收集飞镖投掷数据。从四个生物力学维度提取了18个运动学特征:三连杆协调,释放速度,多关节角配置和姿势稳定性。开发了两个模块:结合历史高质量样本和最小加加速度标准的个性化最优投掷轨迹模型,以及基于z分数和分层逻辑的运动偏差诊断和推荐模型。共收集了2,396份职业和非职业运动员的投掷样本。结果表明,该系统生成的平滑个性化的参考轨迹与自然的人体运动一致。案例研究表明,它可以检测出躯干稳定性差、肘关节位移异常和速度控制不平衡,并提供有针对性的建议。该框架将飞镖评估从偏离统一标准转变为偏离个人最佳控制范围,提高了飞镖训练和其他高精度目标运动的个性化和可解释性。
摘要:As sports training becomes more data-driven, traditional dart coaching based mainly on experience and visual observation is increasingly inadequate for high-precision, goal-oriented movements. Although prior studies have highlighted the importance of release parameters, joint motion, and coordination in dart throwing, most quantitative methods still focus on local variables, single-release metrics, or static template matching. These approaches offer limited support for personalized training and often overlook useful movement variability. This paper presents a data-driven dart training assistance system. The system creates a closed-loop framework spanning motion capture, feature modeling, and personalized feedback. Dart-throwing data were collected in markerless conditions using a Kinect 2.0 depth sensor and an optical camera. Eighteen kinematic features were extracted from four biomechanical dimensions: three-link coordination, release velocity, multi-joint angular configuration, and postural stability. Two modules were developed: a personalized optimal throwing trajectory model that combines historical high-quality samples with the minimum jerk criterion, and a motion deviation diagnosis and recommendation model based on z-scores and hierarchical logic. A total of 2,396 throwing samples from professional and non-professional athletes were collected. Results show that the system generates smooth personalized reference trajectories consistent with natural human movement. Case studies indicate that it can detect poor trunk stability, abnormal elbow displacement, and imbalanced velocity control, then provide targeted recommendations. The framework shifts dart evaluation from deviation from a uniform standard to deviation from an individual's optimal control range, improving personalization and interpretability for darts training and other high-precision target sports.
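摘要所述"基于z分数的偏差诊断"可示意如下:以个体历史高质量样本的均值/标准差为参照,对18个运动学特征逐一打分,超出阈值的维度给出提示。特征含义与阈值均为假设。

```python
import numpy as np

def diagnose(features, ref_mean, ref_std, z_thresh=2.0):
    """对各运动学特征计算z分数,返回偏离个体最优控制范围的维度及其偏差。"""
    z = (features - ref_mean) / ref_std
    return {i: round(float(v), 2) for i, v in enumerate(z) if abs(v) > z_thresh}

rng = np.random.default_rng(1)
ref_mean, ref_std = np.zeros(18), np.ones(18)  # 假设:来自个体历史高质量投掷样本
sample = rng.normal(size=18)
sample[4] = 3.1                                 # 假设第4维对应肘部位移
print(diagnose(sample, ref_mean, ref_std))      # 报告异常维度,供生成针对性建议
```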


【3】Query-Conditioned Evidential Keyframe Sampling for MLLM-Based Long-Form Video Understanding
标题:基于MLLM的长格式视频理解的查询条件证据关键帧采样
链接:https://arxiv.org/abs/2604.01002

作者:Yiheng Wang,Lichen Zhu,Yueqian Lin,Yudong Liu,Jingyang Zhang,Hai "Helen" Li,Yiran Chen
摘要:多模态大语言模型(MLLM)在视频问答上表现出强大性能,但在长视频上的应用受限于有限的上下文长度和计算成本,这使关键帧采样变得至关重要。现有方法通常依赖语义相关性或强化学习,要么无法捕获证据线索,要么受困于低效的组合优化。在这项工作中,我们提出一个基于信息瓶颈理论的证据驱动关键帧采样框架。我们将关键帧选择形式化为最大化所选帧与查询之间的条件互信息,从而提供一个反映每帧对回答问题贡献度的原则性目标。为使该目标易于处理,我们利用其结构导出一种分解式优化,将子集选择化简为相互独立的帧级评分。我们进一步引入一个以对比目标训练的查询条件证据评分网络,以高效估计证据重要性。在长视频理解基准上的实验表明,我们的方法在严格的令牌预算下始终优于先前的采样策略,同时显著提升训练效率。
摘要:Multimodal Large Language Models (MLLMs) have shown strong performance on video question answering, but their application to long-form videos is constrained by limited context length and computational cost, making keyframe sampling essential. Existing approaches typically rely on semantic relevance or reinforcement learning, which either fail to capture evidential clues or suffer from inefficient combinatorial optimization. In this work, we propose an evidence-driven keyframe sampling framework grounded in information bottleneck theory. We formulate keyframe selection as maximizing the conditional mutual information between selected frames and the query, providing a principled objective that reflects each frame's contribution to answering the question. To make this objective tractable, we exploit its structure to derive a decomposed optimization that reduces subset selection to independent frame-level scoring. We further introduce a query-conditioned evidence scoring network trained with a contrastive objective to estimate evidential importance efficiently. Experiments on long-form video understanding benchmarks show that our method consistently outperforms prior sampling strategies under strict token budgets, while significantly improving training efficiency.
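分解后的目标把子集选择化简为独立的帧级打分加预算内的贪心选取;下面用点积代替论文的证据评分网络做一个可运行示意,维度与预算均为假设。

```python
import torch

def select_keyframes(frame_emb, query_emb, tokens_per_frame, budget):
    """逐帧打分后按得分降序、在令牌预算内选帧;返回按时间排序的帧索引。"""
    scores = frame_emb @ query_emb              # (T,) 查询条件下的逐帧得分
    order = torch.argsort(scores, descending=True)
    k = budget // tokens_per_frame              # 预算允许的帧数
    return torch.sort(order[:k]).values         # 恢复时间顺序以便送入MLLM

T, d = 512, 256
frames, query = torch.randn(T, d), torch.randn(d)
print(select_keyframes(frames, query, tokens_per_frame=64, budget=64 * 8))
```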


【4】Generalization Bounds for Spectral GNNs via Fourier Domain Analysis
标题:基于傅里叶域分析的谱GNN泛化界
链接:https://arxiv.org/abs/2604.00918

作者:Vahan A. Martirosyan,Daniele Malitesta,Hugues Talbot,Jhony H. Giraldo,Fragkiskos D. Malliaros
备注:Accepted to AISTATS 2026
摘要:谱图神经网络学习图滤波器,但它们随深度和多项式阶数增加的行为尚未被很好理解。我们在图傅里叶域中分析这些模型:每一层都变为逐元素的频率更新,从而将固定的谱与可训练参数分离,并使深度和阶数显式化。在此设定下,我们证明高斯复杂度在图傅里叶变换下不变,由此导出依赖数据、同时感知深度与阶数的泛化界及稳定性估计。在线性情形下我们的界更紧;在真实图上,数据依赖项与不同多项式基下的泛化间隙相关,揭示了避免跨层频率放大的实用选择。
摘要 :Spectral graph neural networks learn graph filters, but their behavior with increasing depth and polynomial order is not well understood. We analyze these models in the graph Fourier domain, where each layer becomes an element-wise frequency update, separating the fixed spectrum from trainable parameters and making depth and order explicit. In this setting, we show that Gaussian complexity is invariant under the Graph Fourier Transform, which allows us to derive data-dependent, depth, and order-aware generalization bounds together with stability estimates. In the linear case, our bounds are tighter, and on real graphs, the data-dependent term correlates with the generalization gap across polynomial bases, highlighting practical choices that avoid frequency amplification across layers.
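摘要所述"每一层都是逐元素频率更新"的一个常见写法如下;这里采用多项式滤波器的标准形式作为示意,记号为本文假设:

```latex
% 设图拉普拉斯的特征分解为 L = U \Lambda U^\top,图傅里叶变换为 \hat{x} = U^\top x。
% 阶数为 K 的谱层在频域中对每个频率 \lambda_i 独立作用:
\[
  \hat{h}^{(\ell+1)}_i
  = \sigma\!\Big( p^{(\ell)}(\lambda_i)\, \hat{h}^{(\ell)}_i \Big),
  \qquad
  p^{(\ell)}(\lambda) = \sum_{k=0}^{K} \theta^{(\ell)}_k\, \lambda^{k} .
\]
% 固定的谱 \{\lambda_i\} 与可训练系数 \{\theta^{(\ell)}_k\} 由此解耦,
% 深度 \ell 与多项式阶数 K 均显式出现在该更新中。
```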


【5】ActivityNarrated: An Open-Ended Narrative Paradigm for Wearable Human Activity Understanding
标题:ActivityNarrated:可穿戴人类活动理解的开放式叙述范式
链接:https://arxiv.org/abs/2604.00767

作者:Lala Shakti Swarup Ray,Mengxi Liu,Alcina Pinto,Deepika Gurung,Daniel Geissler,Paul Lukowicz,Bo Zhou
摘要:可穿戴HAR已经稳步改进,但大多数进展仍然依赖于闭集分类,这限制了现实世界的使用。在实践中,人类活动是开放式的,无脚本的,个性化的,而且往往是组合的,作为叙事而不是固定类别的实例展开。我们认为,解决这一差距并不需要简单地扩展数据集或模型。它需要从根本上改变可穿戴HAR的制定、监督和评估方式。这项工作展示了如何通过在开放式词汇设置中将可穿戴传感器数据与自然语言描述对齐来对开放式活动叙述进行建模。我们的框架有三个核心组成部分。首先,我们引入了一个自然主义的数据收集和注释管道,它将多位置可穿戴传感与对正在进行的行为的自由形式、时间对齐的叙述描述相结合,从而在没有预定义词汇的情况下实现活动语义。其次,我们定义了一个基于检索的评估框架,该框架可以测量传感器数据和语言之间的语义对齐,从而在没有固定类的情况下进行原则性评估,同时还将闭集分类作为特例。第三,我们提出了一种语言条件的学习架构,支持传感器到文本的推理在可变长度的传感器流和异构传感器的位置。实验表明,用固定标签目标训练的模型在现实世界的变化性下会急剧下降,而开放词汇的传感器语言对齐会产生鲁棒的语义基础表示。一旦学会了这种对齐,闭集活动识别就变成了一个简单的下游任务。在交叉参与者评估下,我们的方法实现了65.3%的宏F1,而强闭集HAR基线为31-34%。这些结果建立了开放式叙事建模,作为现实世界可穿戴HAR的实用且有效的基础。
摘要:Wearable HAR has improved steadily, but most progress still relies on closed-set classification, which limits real-world use. In practice, human activity is open-ended, unscripted, personalized, and often compositional, unfolding as narratives rather than instances of fixed classes. We argue that addressing this gap does not require simply scaling datasets or models. It requires a fundamental shift in how wearable HAR is formulated, supervised, and evaluated. This work shows how to model open-ended activity narratives by aligning wearable sensor data with natural-language descriptions in an open-vocabulary setting. Our framework has three core components. First, we introduce a naturalistic data collection and annotation pipeline that combines multi-position wearable sensing with free-form, time-aligned narrative descriptions of ongoing behavior, allowing activity semantics to emerge without a predefined vocabulary. Second, we define a retrieval-based evaluation framework that measures semantic alignment between sensor data and language, enabling principled evaluation without fixed classes while also subsuming closed-set classification as a special case. Third, we present a language-conditioned learning architecture that supports sensor-to-text inference over variable-length sensor streams and heterogeneous sensor placements. Experiments show that models trained with fixed-label objectives degrade sharply under real-world variability, while open-vocabulary sensor-language alignment yields robust and semantically grounded representations. Once this alignment is learned, closed-set activity recognition becomes a simple downstream task. Under cross-participant evaluation, our method achieves 65.3% Macro-F1, compared with 31-34% for strong closed-set HAR baselines. These results establish open-ended narrative modeling as a practical and effective foundation for real-world wearable HAR.


【6】MF-QAT: Multi-Format Quantization-Aware Training for Elastic Inference
标题:MF-QAT:弹性推理的多格式量化感知训练
链接:https://arxiv.org/abs/2604.00529

作者:Zifei Xu,Sayeh Sharify,Hesham Mostafa
摘要:量化感知训练(QAT)通常只针对单一目标数值格式进行,而实际部署往往需要根据硬件支持或运行时约束在推理时选择数值精度。我们研究多格式QAT,即把单个模型训练为对多种量化格式均保持鲁棒。我们发现,多格式QAT在每个目标精度上都能与单格式QAT持平,从而得到一个在不同格式(甚至训练期间未见过的格式)上整体表现良好的模型。为便于实际部署,我们为MXINT和MXFP提出了切片缩放(Slice-and-Scale)转换过程,可在无需重新训练的情况下把高精度表示转换为低精度格式。在此基础上,我们引入一条流水线:(i)用多格式QAT训练模型,(ii)仅存储单个锚格式检查点(MXINT8/MXFP8),(iii)允许在运行时即时转换到更低的MXINT或MXFP格式,且精度下降可忽略甚至没有额外下降。这些组件共同提供了一条实现弹性精度扩展的实用路径,允许针对不同部署目标在推理时选择运行时格式。
摘要:Quantization-aware training (QAT) is typically performed for a single target numeric format, while practical deployments often need to choose numerical precision at inference time based on hardware support or runtime constraints. We study multi-format QAT, where a single model is trained to be robust across multiple quantization formats. We find that multi-format QAT can match single-format QAT at each target precision, yielding one model that performs well overall across different formats, even formats that were not seen during training. To enable practical deployment, we propose the Slice-and-Scale conversion procedure for both MXINT and MXFP that converts a high-precision representation into lower-precision formats without re-training. Building on this, we introduce a pipeline that (i) trains a model with multi-format QAT, (ii) stores a single anchor format checkpoint (MXINT8/MXFP8), and (iii) allows on-the-fly conversion to lower MXINT or MXFP formats at runtime with negligible, or no, additional accuracy degradation. Together, these components provide a practical path to elastic precision scaling and allow selecting the runtime format at inference time across diverse deployment targets.


【7】MOON3.0: Reasoning-aware Multimodal Representation Learning for E-commerce Product Understanding
标题:MOON3.0:面向电子商务产品理解的推理感知多模态表示学习
链接:https://arxiv.org/abs/2604.00513

作者:Junxian Wu,Chenghan Fu,Zhanheng Nie,Daoze Zhang,Bowen Wan,Wanxian Guan,Chuan Yu,Jian Xu,Bo Zheng
备注:10 pages, 6 figures
摘要:随着电子商务的快速发展,探索一般的表示,而不是特定的任务越来越受到关注。虽然最近的多模态大语言模型(MLLM)在产品理解方面取得了重大进展,但它们通常被用作特征提取器,将产品信息隐式编码到全局嵌入中,从而限制了它们捕获细粒度属性的能力。因此,我们认为,利用MLLM的推理能力,显式建模细粒度的产品属性具有显着的潜力。然而,由于几个关键挑战,实现这一目标仍然是不平凡的:(i)长上下文推理往往会稀释模型对原始输入中显著信息的注意力;(ii)监督微调(SFT)主要鼓励严格模仿,限制了对有效推理策略的探索;以及(iii)细粒度细节在前向传播过程中逐渐衰减。为了解决这些问题,我们提出了MOON3.0,这是第一个基于推理的MLLM产品表示学习模型。我们的方法(1)采用多头模态融合模块自适应地整合原始信号;(2)结合联合对比和强化学习框架,自主探索更有效的推理策略;(3)引入细粒度残差增强模块,逐步保留整个网络的局部细节。此外,我们还发布了大型多式联运电子商务基准MBE3.0。实验上,我们的模型在我们的基准和公共数据集上的各种下游任务中展示了最先进的zero-shot性能。
摘要 :With the rapid growth of e-commerce, exploring general representations rather than task-specific ones has attracted increasing attention. Although recent multimodal large language models (MLLMs) have driven significant progress in product understanding, they are typically employed as feature extractors that implicitly encode product information into global embeddings, thereby limiting their ability to capture fine-grained attributes. Therefore, we argue that leveraging the reasoning capabilities of MLLMs to explicitly model fine-grained product attributes holds significant potential. Nevertheless, achieving this goal remains non-trivial due to several key challenges: (i) long-context reasoning tends to dilute the model's attention to salient information in the raw input; (ii) supervised fine-tuning (SFT) primarily encourages rigid imitation, limiting the exploration of effective reasoning strategies; and (iii) fine-grained details are progressively attenuated during forward propagation. To address these issues, we propose MOON3.0, the first reasoning-aware MLLM-based model for product representation learning. Our method (1) employs a multi-head modality fusion module to adaptively integrate raw signals; (2) incorporates a joint contrastive and reinforcement learning framework to autonomously explore more effective reasoning strategies; and (3) introduces a fine-grained residual enhancement module to progressively preserve local details throughout the network. Additionally, we release a large-scale multimodal e-commerce benchmark MBE3.0. Experimentally, our model demonstrates state-of-the-art zero-shot performance across various downstream tasks on both our benchmark and public datasets.


【8】Data-Driven Reachability Analysis via Diffusion Models with PAC Guarantees
标题:通过具有PAC保证的扩散模型进行数据驱动的可达性分析
链接:https://arxiv.org/abs/2604.00283

作者:Yanliang Huang,Peng Xie,Wenyuan Wu,Zhuoqi Zeng,Amr Alanwar
备注:8 pages, 5 figures, submitted to the 65th IEEE Conference on Decision and Control (CDC 2026)
摘要:我们提出一个无需显式模型的非线性动力系统可达性分析的数据驱动框架。去噪扩散概率模型仅从轨迹数据学习动力系统随时间演化的状态分布。预测的可达集取由重构误差导出的非一致性得分的子水平集形式,其阈值经由Learn Then Test过程校准,使得漏掉可达状态的概率以高概率有界。在受迫Duffing振子、平面四旋翼和高维反应扩散系统这三个非线性系统上的实验证实,经验漏报率保持在可能近似正确(PAC)界之下,同时可扩展到经典基于网格和多项式方法难以企及的状态维度。
摘要:We present a data-driven framework for reachability analysis of nonlinear dynamical systems that requires no explicit model. A denoising diffusion probabilistic model learns the time-evolving state distribution of a dynamical system from trajectory data alone. The predicted reachable set takes the form of a sublevel set of a nonconformity score derived from the reconstruction error, with the threshold calibrated via the Learn Then Test procedure so that the probability of excluding a reachable state is bounded with high probability. Experiments on three nonlinear systems, a forced Duffing oscillator, a planar quadrotor, and a high-dimensional reaction-diffusion system, confirm that the empirical miss rate remains below the Probably Approximately Correct (PAC) bound while scaling to state dimensions beyond the reach of classical grid-based and polynomial methods.
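阈值校准的骨架可示意如下。注意论文使用Learn Then Test过程,此处为简化以分裂保形式的分位数校准代替;得分分布与参数均为虚构。

```python
import numpy as np

def calibrate_threshold(scores, delta=0.05):
    """在校准集的非一致性得分上取保形分位数作为阈值(分裂保形式的简化示意)。"""
    n = len(scores)
    level = min(np.ceil((n + 1) * (1 - delta)) / n, 1.0)
    return np.quantile(scores, level, method="higher")

def in_reachable_set(score, tau):
    """预测可达集 = 非一致性得分的子水平集 {x : s(x) <= tau}。"""
    return score <= tau

rng = np.random.default_rng(0)
cal_scores = rng.exponential(size=500)   # 假设:校准轨迹上基于重构误差的得分
tau = calibrate_threshold(cal_scores, delta=0.05)
print(round(float(tau), 3), in_reachable_set(0.3, tau))
```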


【9】Finite-Time Analysis of Projected Two-Time-Scale Stochastic Approximation
标题:投影双时间尺度随机逼近的有限时间分析
链接:https://arxiv.org/abs/2604.00179

作者:Yitao Bai,Thinh T. Doan,Justin Romberg
备注:6 pages, 3 figures
摘要:本文研究具有常数步长和Polyak-Ruppert平均的投影线性双时间尺度随机逼近的有限时间收敛性。我们建立了一个显式的均方误差界,并将其分解为两个可解释的部分:由受约束子空间决定的近似误差,以及以次线性速率衰减的统计误差;其常数通过受限稳定裕度和耦合可逆性条件表示。这些常数清晰地将子空间选择的影响(近似误差)与平均视窗长度的影响(统计误差)分离开来。我们通过在合成问题和强化学习问题上的若干数值实验来说明理论结果。
摘要:We study the finite-time convergence of projected linear two-time-scale stochastic approximation with constant step sizes and Polyak--Ruppert averaging. We establish an explicit mean-square error bound, decomposing it into two interpretable components, an approximation error determined by the constrained subspace and a statistical error decaying at a sublinear rate, with constants expressed through restricted stability margins and a coupling invertibility condition. These constants cleanly separate the effect of subspace choice (approximation errors) from the effect of the averaging horizon (statistical errors). We illustrate our theoretical results through a number of numerical experiments on both synthetic and reinforcement learning problems.
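作为参照,带投影与Polyak-Ruppert平均的线性双时间尺度迭代的通用形式如下;矩阵分块与噪声记号为示意性假设,并非论文的精确设定:

```latex
\[
  x_{k+1} = \Pi_{X}\!\big(x_k + \alpha\,(A_{11} x_k + A_{12} y_k + b_1 + \xi_k)\big),
  \qquad
  y_{k+1} = \Pi_{Y}\!\big(y_k + \beta\,(A_{21} x_k + A_{22} y_k + b_2 + \psi_k)\big),
\]
% 其中 \beta \ll \alpha 为两个常数步长(x 为快变量,y 为慢变量),
% \Pi_X, \Pi_Y 为到约束子空间的投影,\xi_k, \psi_k 为噪声项;
% 最终输出取 Polyak--Ruppert 平均 \bar{y}_K = \frac{1}{K}\sum_{k=1}^{K} y_k。
```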


检测相关(4篇)

【1】Detecting Multi-Agent Collusion Through Multi-Agent Interpretability
标题:通过多代理可解释性检测多代理共谋
链接:https://arxiv.org/abs/2604.01151

作者:Aaron Rose,Carissa Cullen,Brandon Gary Kaplowitz,Christian Schroeder de Witt
摘要:随着LLM代理越来越多地部署在多代理系统中,它们带来了可能逃避标准人类监督形式的隐蔽协调风险。虽然针对模型激活的线性探针已在单代理设置中展现出检测欺骗的前景,但共谋本质上是一种多代理现象,利用内部表示检测代理之间的共谋仍属空白。我们引入NARCBench,一个用于评估环境分布偏移下共谋检测的基准,并提出五种探测技术,通过聚合逐代理的欺骗得分在组级别对场景进行分类。我们的探针在分布内达到1.00 AUROC,在zero-shot迁移到结构不同的多智能体场景及隐写式二十一点算牌任务时达到0.60至0.86 AUROC。我们发现没有任何单一探测技术在所有共谋类型上占优,这表明不同形式的共谋在激活空间中的表现各不相同。我们还发现初步证据表明该信号定位于令牌级别:当处理合作方消息中被编码的部分时,共谋代理的激活会出现明显峰值。这项工作向多代理可解释性迈出了一步:将白盒检查从单个模型扩展到需要跨代理聚合信号的多代理环境。这些结果表明,对于能访问模型激活的组织而言,模型内部为检测多代理共谋提供了与文本级监测互补的信号。代码和数据可在https://github.com/aaronrose227/narcbench上获得。
摘要:As LLM agents are increasingly deployed in multi-agent systems, they introduce risks of covert coordination that may evade standard forms of human oversight. While linear probes on model activations have shown promise for detecting deception in single-agent settings, collusion is inherently a multi-agent phenomenon, and the use of internal representations for detecting collusion between agents remains unexplored. We introduce NARCBench, a benchmark for evaluating collusion detection under environment distribution shift, and propose five probing techniques that aggregate per-agent deception scores to classify scenarios at the group level. Our probes achieve 1.00 AUROC in-distribution and 0.60--0.86 AUROC when transferred zero-shot to structurally different multi-agent scenarios and a steganographic blackjack card-counting task. We find that no single probing technique dominates across all collusion types, suggesting that different forms of collusion manifest differently in activation space. We also find preliminary evidence that this signal is localised at the token level, with the colluding agent's activations spiking specifically when processing the encoded parts of their partner's message. This work takes a step toward multi-agent interpretability: extending white-box inspection from single models to multi-agent contexts, where detection requires aggregating signals across agents. These results suggest that model internals provide a complementary signal to text-level monitoring for detecting multi-agent collusion, particularly for organisations with access to model activations. Code and data are available at https://github.com/aaronrose227/narcbench.
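组级分类的关键在于如何聚合逐代理得分;下面示意三种常见聚合方式。论文比较了五种探测技术,此处的聚合选择与得分均为假设。

```python
import numpy as np

def group_score(agent_scores, how="max"):
    """把逐代理欺骗得分聚合为组级共谋得分(三种常见聚合的示意)。"""
    if how == "max":
        return float(agent_scores.max())          # 最可疑代理主导
    if how == "mean":
        return float(agent_scores.mean())         # 组内平均嫌疑
    if how == "logsumexp":                        # 介于 max 与 mean 之间的软聚合
        m = agent_scores.max()
        return float(m + np.log(np.exp(agent_scores - m).sum()))
    raise ValueError(how)

scores = np.array([0.1, 0.2, 0.9])  # 假设:线性探针在各代理激活上的输出
print({h: round(group_score(scores, h), 3) for h in ["max", "mean", "logsumexp"]})
```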


【2】KUET at StanceNakba Shared Task: StanceMoE: Mixture-of-Experts Architecture for Stance Detection
标题:KUET在StanceNakba共享任务:StanceMoE,用于立场检测的专家混合架构
链接:https://arxiv.org/abs/2604.00878

作者:Abdullah Al Shafi,Md. Milon Islam,Sk. Imran Hossain,K. M. Azharul Hasan
备注:Accepted for workshop proceedings of the 15th International Conference on Language Resources and Evaluation (LREC'26)
摘要 :行为者层面的立场检测旨在确定作者对文本中提到或暗示的特定地缘政治行为者所表达的立场。虽然基于transformer的模型在立场分类方面取得了相对较好的性能,但它们通常依赖于统一的表示,可能无法充分捕获异质语言信号,例如对比话语结构,框架线索和突出的词汇指标。这激发了对自适应架构的需求,这些架构明确地对不同的stance表达模式进行建模。在本文中,我们提出了Stance MoE,一个上下文增强的混合专家(MoE)架构上的微调BERT编码器演员级的立场检测。我们的模型集成了六个专家模块,旨在捕捉互补的语言信号,包括全球语义取向,突出的词汇线索,小句层面的重点,短语层面的模式,框架指标,对比度驱动的话语转换。一个上下文感知的门控机制动态加权专家的贡献,使自适应路由的输入特性的基础上。实验在StanceNakba 2026 Subtask A数据集上进行,该数据集包括1,401个注释的英语文本,其中目标演员隐含在文本中。StanceMoE的宏F1得分为94.26%,优于传统基线和基于BERT的替代变体。
摘要:Actor-level stance detection aims to determine an author expressed position toward specific geopolitical actors mentioned or implicated in a text. Although transformer-based models have achieved relatively good performance in stance classification, they typically rely on unified representations that may not sufficiently capture heterogeneous linguistic signals, such as contrastive discourse structures, framing cues, and salient lexical indicators. This motivates the need for adaptive architectures that explicitly model diverse stance-expressive patterns. In this paper, we propose StanceMoE, a context-enhanced Mixture-of-Experts (MoE) architecture built upon a fine-tuned BERT encoder for actor-level stance detection. Our model integrates six expert modules designed to capture complementary linguistic signals, including global semantic orientation, salient lexical cues, clause-level focus, phrase-level patterns, framing indicators, and contrast-driven discourse shifts. A context-aware gating mechanism dynamically weights expert contributions, enabling adaptive routing based on input characteristics. Experiments are conducted on the StanceNakba 2026 Subtask A dataset, comprising 1,401 annotated English texts where the target actor is implicit in the text. StanceMoE achieves a macro-F1 score of 94.26%, outperforming traditional baselines, and alternative BERT-based variants.
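摘要所述"6个专家+上下文感知门控"的头部结构可示意如下;隐藏维度、专家内部结构与类别数均为假设,并非原实现。

```python
import torch
import torch.nn as nn

class StanceMoEHead(nn.Module):
    """置于BERT编码器之上的简化MoE头:多个专家 + 上下文感知门控(结构为示意)。"""
    def __init__(self, hidden=768, n_experts=6, n_classes=3):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(hidden, 256), nn.GELU(), nn.Linear(256, n_classes))
             for _ in range(n_experts)])
        self.gate = nn.Linear(hidden, n_experts)

    def forward(self, cls_vec):                    # (B, hidden),如BERT的[CLS]表示
        w = torch.softmax(self.gate(cls_vec), dim=-1)               # (B, E) 动态权重
        out = torch.stack([e(cls_vec) for e in self.experts], dim=1)  # (B, E, C)
        return (w.unsqueeze(-1) * out).sum(dim=1)                   # 加权合成 logits

head = StanceMoEHead()
print(head(torch.randn(4, 768)).shape)  # torch.Size([4, 3])
```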


【3】Risk-Aware Batch Testing for Performance Regression Detection
标题:面向性能回归检测的风险感知批量测试
链接:https://arxiv.org/abs/2604.00222

作者:Ali Sayedsalehi,Peter C. Rigby,Gregory Mierzwinski
备注:14 pages, 1 figure, 4 tables. Replication package and dataset available
摘要:性能回归测试在大规模持续集成(CI)系统中必不可少,但为每个提交执行完整的性能测试套件代价高昂。此前关于性能回归预测和批量测试的工作分别展示了各自的收益,但都面临实际局限:预测模型很少被整合进CI决策,而传统的批处理策略忽略了提交级别的异质性。   我们通过引入一个风险感知框架来统一这两条研究路线,该框架将机器学习得到的提交风险与自适应批处理相结合。以Mozilla Firefox为案例研究,我们构建了一个源自生产环境、由人工确认的回归并按时间顺序与Autoland对齐的数据集,并微调ModernBERT、CodeBERT和LLaMA-3.1变体来估计提交级性能回归风险,其中CodeBERT实现了高达0.694的ROC-AUC。风险评分驱动一族风险感知批处理策略,包括风险老化优先级批处理(Risk-Aged Priority Batching)和风险自适应流批处理(Risk-Adaptive Stream Batching),并通过逼真的CI模拟进行评估。   在数千个Firefox历史提交上,我们整体最优的配置,即带线性聚合的风险老化优先级批处理(RAPB-la),相对于Mozilla受生产启发的基线取得了帕累托改进。RAPB-la将测试执行总量减少32.4%,平均反馈时间缩短3.8%,平均定位问题提交时间(time-to-culprit)维持在基线水平附近,最大定位时间减少26.2%,并在我们的成本模型下对应每年约49.1万美元的基础设施成本节省。这些结果表明,风险感知的批量测试能够在降低CI资源消耗的同时提升诊断的及时性。为支持可复现性与后续研究,我们发布了完整的复现包,包含所有数据集、微调流水线以及我们批处理算法的实现。
摘要:Performance regression testing is essential in large-scale continuous-integration (CI) systems, yet executing full performance suites for every commit is prohibitively expensive. Prior work on performance regression prediction and batch testing has shown independent benefits, but each faces practical limitations: predictive models are rarely integrated into CI decision-making, and conventional batching strategies ignore commit-level heterogeneity.   We unify these strands by introducing a risk-aware framework that integrates machine-learned commit risk with adaptive batching. Using Mozilla Firefox as a case study, we construct a production-derived dataset of human-confirmed regressions aligned chronologically with Autoland, and fine-tune ModernBERT, CodeBERT, and LLaMA-3.1 variants to estimate commit-level performance regression risk, achieving up to 0.694 ROC-AUC with CodeBERT. The risk scores drive a family of risk-aware batching strategies, including Risk-Aged Priority Batching and Risk-Adaptive Stream Batching, evaluated through realistic CI simulations.   Across thousands of historical Firefox commits, our best overall configuration, Risk-Aged Priority Batching with linear aggregation (RAPB-la), yields a Pareto improvement over Mozilla's production-inspired baseline. RAPB-la reduces total test executions by 32.4%, decreases mean feedback time by 3.8%, maintains mean time-to-culprit at approximately the baseline level, reduces maximum time-to-culprit by 26.2%, and corresponds to an estimated annual infrastructure cost savings of approximately $491K under our cost model. These results demonstrate that risk-aware batch testing can reduce CI resource consumption while improving diagnostic timeliness. To support reproducibility and future research, we release a complete replication package containing all datasets, fine-tuning pipelines, and implementations of our batching algorithms.
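风险老化优先级批处理(线性聚合)的调度骨架可示意如下;优先级定义与权重为假设,仅说明"风险分+等待年龄"的出队逻辑:

```python
def rapb_linear(commits, batch_size, age_weight=0.1):
    """RAPB线性聚合的极简示意:优先级 = 风险分 + 年龄加权,凑满一批即出队测试。"""
    pending, batches = [], []
    for now, (cid, risk) in enumerate(commits):
        pending.append((cid, risk, now))               # 记录到达时间以计算年龄
        if len(pending) >= batch_size:
            pending.sort(key=lambda c: c[1] + age_weight * (now - c[2]), reverse=True)
            batches.append([cid for cid, _, _ in pending[:batch_size]])
            pending = pending[batch_size:]              # 低优先级提交继续等待、累积年龄
    return batches

commits = [("c1", 0.9), ("c2", 0.1), ("c3", 0.4), ("c4", 0.8), ("c5", 0.2), ("c6", 0.7)]
print(rapb_linear(commits, batch_size=3))  # 高风险与久候的提交优先被测试
```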


【4】Sit-to-Stand Transitions Detection and Duration Measurement Using Smart Lacelock Sensor
标题:使用智能Lacelock传感器的坐立转换检测与持续时间测量
链接:https://arxiv.org/abs/2604.00175

作者:Md Rafi Islam,Md Rejwanul Haque,Elizabeth Choma,Shannon Hayes,Siobhan McMahon,Xiangrong Shen,Edward Sazonov
备注:10 pages, 11 figures
摘要:运动过程中的姿势稳定性是独立生活、预防跌倒和整体健康的基础,对经历与年龄相关的平衡、肌力和活动能力下降的老年人尤为重要。在日常功能活动中,坐立转换(SiSt)是下肢力量、肌肉骨骼健康和跌倒风险的关键指标,是评估功能能力和监测老龄人群身体衰退的重要参数。本研究提出了一种使用智能Lacelock传感器进行SiSt转换检测与持续时间测量的方法;该传感器是一种轻量级鞋装设备,集成了测力元件、加速度计和陀螺仪用于运动分析。该方法在16名老年人(年龄:均值76.84岁,SD 3.45岁)执行简易体能状况量表(SPPB)流程中的SiSt任务时进行了评估。从多模态信号中提取的特征用于训练和评估四种机器学习分类器,并采用4折参与者独立交叉验证来分类SiSt转换并测量其持续时间。袋装树分类器在SiSt转换分类上取得了0.98的准确率和0.8的F1得分。被正确分类的转换在持续时间测量上的平均绝对误差为0.047秒,SD为0.07秒。这些发现凸显了智能Lacelock传感器在老年人真实场景跌倒风险评估和活动监测方面的潜力。
摘要:Postural stability during movement is fundamental to independent living, fall prevention, and overall health, particularly among older adults who experience age-related declines in balance, muscle strength, and mobility. Among daily functional activities, the Sit-to-Stand (SiSt) transition is a critical indicator of lower-limb strength, musculoskeletal health, and fall risk, making it an essential parameter for assessing functional capacity and monitoring physical decline in aging populations. This study presents a methodology SiSt transition detection and duration measurement using the Smart Lacelock sensor, a lightweight, shoe-mounted device that integrates a load cell, accelerometer, and gyroscope for motion analysis. The methodology was evaluated in 16 older adults (age: mean: 76.84, SD: 3.45 years) performing SiSt tasks within the Short Physical Performance Battery (SPPB) protocol. Features extracted from multimodal signals were used to train and evaluate four machine learning classifiers using a 4-fold participant-independent cross-validation to classify SiSt transitions and measure their duration. The bagged tree classifier achieved an accuracy of 0.98 and an F1 score of 0.8 in classifying SiSt transition. The mean absolute error in duration measurement of the correctly classified transitions was 0.047, and the SD was 0.07 seconds. These findings highlight the potential of the Smart Lacelock sensor for real-world fall-risk assessment and mobility monitoring in older adults.


分类|识别(6篇)

【1】Using predefined vector systems to speed up neural network multimillion class classification
标题:使用预定义的载体系统加速神经网络数百万类分类
链接:https://arxiv.org/abs/2604.00779

作者:Nikita Gabdullin,Ilya Androsov
备注:12 pages, 2 figures, 3 tables, 2 algorithms, 1 theorem, 1 lemma
摘要:神经网络(NN)中标签预测的复杂度为O(n),与类别数成正比。这对使用全连接层的分类,以及与某组类原型计算余弦相似度的分类都成立。本文表明,如果NN潜在空间(LS)的几何结构已知且具备特定性质,标签预测的复杂度可以显著降低。其做法是将标签预测关联到一个向量系统中O(1)复杂度的最近聚类中心查找,该向量系统被用作潜在空间配置(LSC)的目标。所提方法只需找出嵌入向量中若干最大值和最小值的索引,计算效率极高。我们证明该方法不改变NN训练精度的计算结果。我们还测量了多个数据集上NN推理与标签预测各计算阶段所需的时间。实验表明,相比传统方法,该方法可实现高达11.6倍的整体加速。此外,所提方法具有独特性质,可用于预测新类别的存在。
摘要:Label prediction in neural networks (NNs) has O(n) complexity proportional to the number of classes. This holds true for classification using fully connected layers and cosine similarity with some set of class prototypes. In this paper we show that if NN latent space (LS) geometry is known and possesses specific properties, label prediction complexity can be significantly reduced. This is achieved by associating label prediction with the O(1) complexity closest cluster center search in a vector system used as target for latent space configuration (LSC). The proposed method only requires finding indexes of several largest and lowest values in the embedding vector making it extremely computationally efficient. We show that the proposed method does not change NN training accuracy computational results. We also measure the time required by different computational stages of NN inference and label prediction on multiple datasets. The experiments show that the proposed method allows to achieve up to 11.6 times overall acceleration over conventional methods. Furthermore, the proposed method has unique properties which allow to predict the existence of new classes.
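摘要中"只需找出嵌入向量中若干最大/最小分量的索引"的操作,可用numpy的argpartition做一个与类别数无关的示意;向量系统的具体编码为假设,仅说明索引运算本身的形态:

```python
import numpy as np

def predict_label(embedding, k=2):
    """取嵌入中最大/最小的k个分量索引作为类别编码,对应预定义向量系统中
    最近聚类中心的封闭式查找(编码构造为假设,仅示意索引级的高效运算)。"""
    top = np.argpartition(embedding, -k)[-k:]   # 最大k个分量的索引
    bot = np.argpartition(embedding, k)[:k]     # 最小k个分量的索引
    return tuple(sorted(int(i) for i in top)), tuple(sorted(int(i) for i in bot))

rng = np.random.default_rng(0)
print(predict_label(rng.normal(size=512)))  # 两组索引即构成类别编码
```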


【2】A CEFR-Inspired Classification Framework with Fuzzy C-Means To Automate Assessment of Programming Skills in Scratch
标题:受CEFR启发、基于模糊C均值的分类框架,用于自动评估Scratch编程技能
链接:https://arxiv.org/abs/2604.00730

作者:Ricardo Hidalgo-Aragón,Jesús M. González-Barahona,Gregorio Robles
备注:Paper accepted at CSEDU 2026
摘要:背景:学校、培训平台和技术公司日益需要以透明、可复现的方法大规模评估编程能力,以支持个性化学习路径。目的:本研究提出一个与欧洲共同语言参考框架(CEFR)对齐的Scratch项目评估教学框架,为学生和教师提供通用的能力等级,并为课程设计提供可操作的洞见。方法:我们将模糊C均值聚类应用于经Dr.Scratch评估的2,008,246个Scratch项目,实现了将聚类映射到CEFR等级(A1-C2)的有序准则,并引入增强的分类指标,用于识别处于过渡阶段的学习者、支持持续的进度跟踪,并量化分类确定度,以在自动反馈与教师复核之间取得平衡。影响:该框架能够诊断系统性的课程缺口,特别是"B2瓶颈":由于整合逻辑、同步与数据表示的认知负荷,仅有13.3%的学习者处于该水平;同时提供基于确定度的人工干预触发机制。
摘要:Context: Schools, training platforms, and technology firms increasingly need to assess programming proficiency at scale with transparent, reproducible methods that support personalized learning pathways. Objective: This study introduces a pedagogical framework for Scratch project assessment, aligned with the Common European Framework of Reference (CEFR), providing universal competency levels for students and teachers alongside actionable insights for curriculum design. Method: We apply Fuzzy C-Means clustering to 2008246 Scratch projects evaluated via Dr.Scratch, implementing an ordinal criterion to map clusters to CEFR levels (A1-C2), and introducing enhanced classification metrics that identify transitional learners, enable continuous progress tracking, and quantify classification certainty to balance automated feedback with instructor review. Impact: The framework enables diagnosis of systemic curriculum gaps, notably a "B2 bottleneck" where only 13.3% of learners reside due to the cognitive load of integrating Logic, Synchronization, and Data Representation, while providing certainty-based triggers for human intervention.
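下面是一个自包含的最小示意:用numpy实现标准模糊C均值,再按聚类中心排序的有序准则映射到A1-C2,并输出可用于触发人工复核的确定度。数据、维度与参数均为虚构,并非Dr.Scratch流水线。

```python
import numpy as np

CEFR = ["A1", "A2", "B1", "B2", "C1", "C2"]

def fuzzy_cmeans(X, c=6, m=2.0, iters=100, seed=0):
    """极简模糊C均值:标准的隶属度U与中心V交替更新。"""
    rng = np.random.default_rng(seed)
    U = rng.dirichlet(np.ones(c), size=len(X))            # (N, c) 隶属度
    for _ in range(iters):
        Um = U ** m
        V = (Um.T @ X) / Um.sum(axis=0)[:, None]          # 更新中心
        d = np.linalg.norm(X[:, None] - V[None], axis=2) + 1e-12
        U = 1.0 / (d ** (2 / (m - 1)))                    # 更新隶属度并按行归一化
        U /= U.sum(axis=1, keepdims=True)
    return U, V

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 1)) + rng.integers(0, 6, 500)[:, None] * 2.0  # 假设的能力得分
U, V = fuzzy_cmeans(X)
order = np.argsort(V[:, 0])                  # 有序准则:按中心得分升序映射到A1..C2
level = dict(zip(order, CEFR))
labels = [level[i] for i in U.argmax(axis=1)]
certainty = U.max(axis=1)                    # 分类确定度,低确定度可触发教师复核
print(labels[:8], certainty[:8].round(2))
```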


【3】Hybrid Energy-Based Models for Physical AI: Provably Stable Identification of Port-Hamiltonian Dynamics
标题:物理AI的混合能量模型:端口哈密顿动力学的可证明稳定辨识
链接:https://arxiv.org/abs/2604.00277

作者:Simone Betteti,Luca Laurenti
摘要:基于能量的模型(EBM)将推理实现为学习的李雅普诺夫函数的梯度下降,产生可解释的,结构保留的黑盒神经ODE替代方案,并与物理AI自然对齐。然而,它们在系统识别中的应用仍然有限,现有的体系结构缺乏正式的稳定性保证,在全球范围内排除不稳定的模式。我们解决这个差距,通过引入EBM框架系统识别稳定,耗散,吸收不变的动态。与经典的全局李雅普诺夫稳定性不同,吸收不变性扩展了一类保持稳定的体系结构,使EBM更加灵活和富有表现力。我们扩展EBM理论的非光滑激活通过建立负能量耗散通过克拉克衍生物和衍生新的条件径向无界,暴露在标准EBM的稳定性,表现力的权衡。为了克服这个问题,我们引入了一个具有动态可见层和静态隐藏层的混合架构,在温和的假设下证明了吸收不变性,并表明这些保证扩展到端口汉密尔顿EBM。度量变形的多井和环系统的实验验证了这种方法,展示了我们的混合EBM架构如何通过设计将表现力与声音和可证明的安全保证相结合。
摘要:Energy-based models (EBMs) implement inference as gradient descent on a learned Lyapunov function, yielding interpretable, structure-preserving alternatives to black-box neural ODEs and aligning naturally with physical AI. Yet their use in system identification remains limited, and existing architectures lack formal stability guarantees that globally preclude unstable modes. We address this gap by introducing an EBM framework for system identification with stable, dissipative, absorbing invariant dynamics. Unlike classical global Lyapunov stability, absorbing invariance expands the class of stability-preserving architectures, enabling more flexible and expressive EBMs. We extend EBM theory to nonsmooth activations by establishing negative energy dissipation via Clarke derivatives and deriving new conditions for radial unboundedness, exposing a stability-expressivity tradeoff in standard EBMs. To overcome this, we introduce a hybrid architecture with a dynamical visible layer and static hidden layers, prove absorbing invariance under mild assumptions, and show that these guarantees extend to port-Hamiltonian EBMs. Experiments on metric-deformed multi-well and ring systems validate the approach, showcasing how our hybrid EBM architecture combines expressivity with sound and provable safety guarantees by design.


【4】Lead Zirconate Titanate Reservoir Computing for Classification of Written and Spoken Digits
标题:用于手写与口语数字分类的锆钛酸铅储备池计算
链接:https://arxiv.org/abs/2604.00207

作者:Thomas Buckley,Leslie Schumm,Manor Askenazi,Edward Rietman
摘要:本文扩展了我们早期的工作(Rietman等,2022),将物理储备池计算(Reservoir Computing,RC)应用于手写和口语数字的分类。我们利用未极化的锆钛酸铅(PZT)立方体作为计算基底来处理这些数据集。结果表明,PZT储备池在MNIST手写数字上达到89.0%的准确率,相比应用于相同预处理数据的逻辑回归基线提升了2.4个百分点。然而,在AudioMNIST口语数字数据集上,储备池系统(88.2%准确率)与基线方法(88.1%准确率)表现相当,这表明储备池计算对中等难度的分类任务收益最大:线性方法在这类任务上表现不佳,但问题仍然可学习。PZT是一种已用于半导体领域的成熟材料,提供了可与数字算法集成的低功耗计算基底。我们的结果表明,当任务难度超出简单线性分类器的能力、但仍在储备池动力学的计算容量之内时,物理储备池最为擅长。
摘要:In this paper we extend our earlier work of (Rietman et al. 2022) presenting an application of physical Reservoir Computing (RC) to the classification of handwritten and spoken digits. We utilize an unpoled cube of Lead Zirconate Titanate (PZT) as a computational substrate to process these datasets. Our results demonstrate that the PZT reservoir achieves 89.0% accuracy on MNIST handwritten digits, representing a 2.4 percentage point improvement over logistic regression baselines applied to the same preprocessed data. However, for the AudioMNIST spoken digits dataset, the reservoir system (88.2% accuracy) performs equivalently to baseline methods (88.1% accuracy), suggesting that reservoir computing provides the greatest benefits for classification tasks of intermediate difficulty where linear methods underperform but the problem remains learnable. PZT is a well-known material already used in semiconductor applications, presenting a low-power computational substrate that can be integrated with digital algorithms. Our findings indicate that physical reservoirs excel when the task difficulty exceeds the capability of simple linear classifiers but remains within the computational capacity of the reservoir dynamics.
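储备池计算的读出层通常就是对采集到的储备池状态做一次岭回归;以下为通用示意,状态与标签为虚构,并非论文的PZT测量流程:

```python
import numpy as np

def ridge_readout(states, labels, lam=1e-3):
    """储备池计算的标准线性读出:对储备池状态做岭回归,目标为one-hot标签。"""
    Y = np.eye(int(labels.max()) + 1)[labels]   # one-hot 目标
    A = states.T @ states + lam * np.eye(states.shape[1])
    return np.linalg.solve(A, states.T @ Y)     # 闭式解的读出权重

rng = np.random.default_rng(0)
states = rng.normal(size=(1000, 64))   # 假设:物理基底响应经采样/特征化后的状态
labels = rng.integers(0, 10, 1000)
W = ridge_readout(states, labels)
pred = (states @ W).argmax(axis=1)
# 随机数据下精度接近随机水平,此处仅演示读出层的训练与预测流程
print("train acc:", (pred == labels).mean())
```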


【5】Empirical Validation of the Classification-Verification Dichotomy for AI Safety Gates
标题:人工智能安全门分类验证二分法的经验验证
链接:https://arxiv.org/abs/2604.00072

作者:Arsenios Scrivens
备注:21 pages, 9 figures. Companion theory paper: doi:10.5281/zenodo.19237451
摘要:当AI系统经过数百次迭代自我改进时,基于分类器的安全门能否保持可靠的监督?我们提供了全面的经验证据,证明它们不能。在一个自我改进的神经控制器(d=240)上,18种分类器配置——涵盖MLP、SVM、随机森林、k-NN、贝叶斯分类器和深度网络——全部不满足安全自我改进的双重条件。三个安全RL基线(CPO、Lyapunov、安全屏蔽)同样失败。结果可推广到MuJoCo基准(Reacher-v4 d=496、Swimmer-v4 d=1408、HalfCheetah-v4 d=1824)。在受控分布间隔高达delta_s=2.0的情形下,所有分类器仍然失败——包括NP最优检验和训练精度为100%的MLP——这表明失败具有结构上的必然性。   随后我们证明,这种不可能性仅针对分类方法,而非安全自我改进本身。一个Lipschitz球验证器利用可证明的解析界,在维度d取{84, 240, 768, 2688, 5760, 9984, 17408}时均实现零错误接受(无条件delta=0)。球链技术实现了无界的参数空间遍历:在MuJoCo Reacher-v4上,10条链带来+4.31的奖励提升且delta=0;在Qwen2.5-7B-Instruct的LoRA微调过程中,42次链式转移遍历了单球半径的234倍,200步内无任何安全违规。一个50提示的预言机实验证实了预言机不可知性。按组的组合式验证使可验证半径比全网络球大至多37倍。在d<=17408时,delta=0无条件成立;在LLM规模下,则取决于所估计的Lipschitz常数。
摘要:Can classifier-based safety gates maintain reliable oversight as AI systems improve over hundreds of iterations? We provide comprehensive empirical evidence that they cannot. On a self-improving neural controller (d=240), eighteen classifier configurations -- spanning MLPs, SVMs, random forests, k-NN, Bayesian classifiers, and deep networks -- all fail the dual conditions for safe self-improvement. Three safe RL baselines (CPO, Lyapunov, safety shielding) also fail. Results extend to MuJoCo benchmarks (Reacher-v4 d=496, Swimmer-v4 d=1408, HalfCheetah-v4 d=1824). At controlled distribution separations up to delta_s=2.0, all classifiers still fail -- including the NP-optimal test and MLPs with 100% training accuracy -- demonstrating structural impossibility.   We then show the impossibility is specific to classification, not to safe self-improvement itself. A Lipschitz ball verifier achieves zero false accepts across dimensions d in {84, 240, 768, 2688, 5760, 9984, 17408} using provable analytical bounds (unconditional delta=0). Ball chaining enables unbounded parameter-space traversal: on MuJoCo Reacher-v4, 10 chains yield +4.31 reward improvement with delta=0; on Qwen2.5-7B-Instruct during LoRA fine-tuning, 42 chain transitions traverse 234x the single-ball radius with zero safety violations across 200 steps. A 50-prompt oracle confirms oracle-agnosticity. Compositional per-group verification enables radii up to 37x larger than full-network balls. At d<=17408, delta=0 is unconditional; at LLM scale, conditional on estimated Lipschitz constants.
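
下面给出"Lipschitz 球验证器 + 球链"思想的一个玩具 Python 草图:假设安全度量 s(θ) 关于参数是 L-Lipschitz 的,则由解析界导出的半径内不可能发生安全违规;把球心沿已验证的更新逐步推进,即可在保持零错误接受的同时实现无界的参数空间遍历。示例中的安全函数、裕度与常数均为假设,与论文的具体验证器无关:

import numpy as np

def ball_radius(safety_margin, lipschitz_const):
    """若 s(θ) 是 L-Lipschitz 的且当前裕度为 m = s(θ0) - s_min,
    则 ||θ - θ0|| <= m/L 的球内 s(θ) 不可能跌破 s_min(解析界,无需分类器)。"""
    return safety_margin / lipschitz_const

def verify_chain(thetas, safety_fn, s_min, L):
    """球链:逐步验证每次更新都落在以上一验证点为中心的安全球内。"""
    center = thetas[0]
    for theta in thetas[1:]:
        r = ball_radius(safety_fn(center) - s_min, L)
        if np.linalg.norm(theta - center) > r:
            return False      # 拒绝:超出可证明安全的半径
        center = theta        # 接受:把球心推进到新参数
    return True

# 玩具示例:安全度量取参数到原点的负距离(1-Lipschitz),s_min = -5
safety = lambda th: -np.linalg.norm(th)
steps = [0.1 * k * np.ones(3) for k in range(20)]
print(verify_chain(steps, safety, s_min=-5.0, L=1.0))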


【6】An Empirical Recipe for Universal Phone Recognition
标题:通用音素识别的经验配方
链接:https://arxiv.org/abs/2603.29042

作者:Shikhar Bharadwaj,Chin-Jou Li,Kwanghee Choi,Eunjung Yeo,William Chen,Shinji Watanabe,David R. Mortensen
备注:Submitted to Interspeech 2026. Code: https://github.com/changelinglab/PhoneticXeus
摘要:音素识别(phone recognition, PR)是多语言和低资源语音处理任务的关键推动因素,但稳健的性能仍然难以实现。高性能的以英语为中心的模型无法跨语言泛化,而多语言模型又未充分利用预训练表示。数据规模、架构和训练目标各自如何影响多语言PR也仍不清楚。我们提出PhoneticXEUS——在大规模多语言数据上训练,并在多语言(17.7% PFER)和带口音英语语音(10.6% PFER)上均取得最先进的性能。通过在统一方案下对100多种语言进行评估的受控消融实验,我们以经验方式确立了我们的训练配方,并量化了SSL表示、数据规模和损失目标的影响。此外,我们还分析了跨语系、口音和发音特征的错误模式。所有数据和代码均公开发布。
摘要:Phone recognition (PR) is a key enabler of multilingual and low-resource speech processing tasks, yet robust performance remains elusive. Highly performant English-focused models do not generalize across languages, while multilingual models underutilize pretrained representations. It also remains unclear how data scale, architecture, and training objective contribute to multilingual PR. We present PhoneticXEUS -- trained on large-scale multilingual data and achieving state-of-the-art performance on both multilingual (17.7% PFER) and accented English speech (10.6% PFER). Through controlled ablations with evaluations across 100+ languages under a unified scheme, we empirically establish our training recipe and quantify the impact of SSL representations, data scale, and loss objectives. In addition, we analyze error patterns across language families, accented speech, and articulatory features. All data and code are released openly.


表征(4篇)

【1】Full-Gradient Successor Feature Representations
标题:全梯度后续特征表示
链接:https://arxiv.org/abs/2604.00686

作者:Ritish Shrirao,Aditya Priyadarshi,Raghuram Bharadwaj Diddigi
备注:Submitted to IEEE CDC 2026
摘要:后继特征(SF)结合广义策略改进(GPI),通过将环境动态与奖励函数解耦,为强化学习(RL)中的迁移学习提供了一个强大的框架。然而,标准SF学习方法通常依赖于半梯度时间差(TD)更新。当与非线性函数近似相结合时,半梯度方法缺乏鲁棒的收敛保证,并且可能导致不稳定性,特别是在多任务设置中,其中准确的特征估计对于有效的GPI至关重要。受全梯度DQN的启发,我们提出了全梯度后继特征表示Q学习(FG-SFRQL),这是一种通过最小化全均方贝尔曼误差来优化后继特征的算法。与标准方法不同,我们的方法计算在线和目标网络中参数的梯度。我们提供了一个几乎肯定收敛的理论证明FG-SFRQL和经验表明,最小化的完整的残差导致优越的采样效率和传输性能相比,半梯度基线在离散和连续域。
摘要:Successor Features (SF) combined with Generalized Policy Improvement (GPI) provide a robust framework for transfer learning in Reinforcement Learning (RL) by decoupling environment dynamics from reward functions. However, standard SF learning methods typically rely on semi-gradient Temporal Difference (TD) updates. When combined with non-linear function approximation, semi-gradient methods lack robust convergence guarantees and can lead to instability, particularly in the multi-task setting where accurate feature estimation is critical for effective GPI. Inspired by Full Gradient DQN, we propose Full-Gradient Successor Feature Representations Q-Learning (FG-SFRQL), an algorithm that optimizes the successor features by minimizing the full Mean Squared Bellman Error. Unlike standard approaches, our method computes gradients with respect to parameters in both the online and target networks. We provide a theoretical proof of almost-sure convergence for FG-SFRQL and demonstrate empirically that minimizing the full residual leads to superior sample efficiency and transfer performance compared to semi-gradient baselines in both discrete and continuous domains.
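
下面的 PyTorch 草图示意"最小化全均方贝尔曼误差"与半梯度 TD 的关键区别:对自举目标不做 stop-gradient,梯度同时流经残差两端。为保持可读性,这里把论文中的在线/目标双网络简化为单网络(论文对两个网络的参数都计算梯度),网络结构与数据均为玩具假设:

import torch
import torch.nn as nn

class SFNet(nn.Module):
    """ψ(s,a):为每个动作输出 d 维后继特征;结构仅为示意。"""
    def __init__(self, s_dim, n_actions, d, hidden=64):
        super().__init__()
        self.n_actions, self.d = n_actions, d
        self.net = nn.Sequential(nn.Linear(s_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, n_actions * d))

    def forward(self, s):
        return self.net(s).view(-1, self.n_actions, self.d)

def full_gradient_msbe(psi_net, phi, s, a, s2, a2, gamma=0.99):
    """全均方贝尔曼误差:注意 ψ(s',a') 没有 detach,这正是"全梯度"之所在。"""
    idx = torch.arange(len(a))
    psi_sa = psi_net(s)[idx, a]          # ψ(s,a)
    psi_s2a2 = psi_net(s2)[idx, a2]      # ψ(s',a'),不做 stop-gradient
    residual = phi + gamma * psi_s2a2 - psi_sa
    return residual.pow(2).sum(dim=-1).mean()

net = SFNet(s_dim=4, n_actions=3, d=8)
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
s, s2 = torch.randn(32, 4), torch.randn(32, 4)
a, a2 = torch.randint(0, 3, (32,)), torch.randint(0, 3, (32,))
phi = torch.randn(32, 8)                 # 即时状态特征 φ(s,a)
loss = full_gradient_msbe(net, phi, s, a, s2, a2)
opt.zero_grad(); loss.backward(); opt.step()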


【2】Representation choice shapes the interpretation of protein conformational dynamics
标题:表示选择塑造了对蛋白质构象动力学的解释
链接:https://arxiv.org/abs/2604.00580

作者:Axel Giottonini,Thomas Lemmin
摘要:分子动力学模拟在原子水平上提供了详细的轨迹,但从这些高维数据中提取可解释且稳健的见解仍然具有挑战性。在实践中,分析通常依赖单一表示。在这里,我们表明表示的选择并非中立:它从根本上塑造了从相同模拟数据中推断出的构象组织、相似性关系以及表观转变。   为补充现有表示,我们引入取向(Orientation)特征,一种几何上有依据、具有旋转感知能力的蛋白质骨架编码。我们在三种动力学机制下将其与常见描述进行比较:快速折叠蛋白、大尺度结构域运动以及蛋白质-蛋白质结合。在这些系统中,我们发现不同表示强调构象空间的互补方面,没有任何单一表示能完整刻画底层动力学。   为便于系统比较,我们开发了ManiProt,一个用于高效计算和分析多种蛋白质表示的库。我们的结果倡导一种比较式、表示感知的分子动力学模拟解释框架。
摘要:Molecular dynamics simulations provide detailed trajectories at the atomic level, but extracting interpretable and robust insights from these high-dimensional data remains challenging. In practice, analyses typically rely on a single representation. Here, we show that representation choice is not neutral: it fundamentally shapes the conformational organization, similarity relationships, and apparent transitions inferred from identical simulation data.   To complement existing representations, we introduce Orientation features, a geometrically grounded, rotation-aware encoding of protein backbone. We compare it against common descriptions across three dynamical regimes: fast-folding proteins, large-scale domain motions, and protein-protein association. Across these systems, we find that different representations emphasize complementary aspects of conformational space, and that no single representation provides a complete picture of the underlying dynamics.   To facilitate systematic comparison, we developed ManiProt, a library for efficient computation and analysis of multiple protein representations. Our results motivate a comparative, representation-aware framework for the interpretation of molecular dynamics simulations.


【3】Learning Shared Representations for Multi-Task Linear Bandits
标题:学习多任务线性赌博机的共享表示
链接:https://arxiv.org/abs/2604.00531

作者:Jiabin Lin,Shana Moothedath
摘要:多任务表示学习是一种在相关任务之间学习共享潜在表示的方法,可促进知识迁移并提高样本效率。本文提出了一种线性赌博机中多任务表示学习的新方法。我们考虑一个包含T个并发线性赌博机任务的设定,每个任务的特征维度为d,且共享一个维度为 $r \ll \min\{d,T\}$ 的公共潜在表示,以刻画它们之间的内在相关性。我们提出了一种新的"面对不确定性时的乐观主义"线性(OFUL)算法,利用共享低秩表示以样本高效的方式改进决策。该算法首先通过探索阶段收集数据,经由谱初始化估计共享模型,然后在新构造的置信集上进行基于OFUL的学习。我们为置信集提供了理论保证,并证明未知奖励向量以高概率位于置信集内。我们推导了累积遗憾界,并表明所提方法达到 $\tilde{O}(\sqrt{drNT})$ 的遗憾,相比独立求解T个任务所产生的 $\tilde{O}(dT\sqrt{N})$ 遗憾有显著改进。我们通过数值模拟验证了算法在不同问题规模下的性能。
摘要:Multi-task representation learning is an approach that learns shared latent representations across related tasks, facilitating knowledge transfer and improving sample efficiency. This paper introduces a novel approach to multi-task representation learning in linear bandits. We consider a setting with T concurrent linear bandit tasks, each with feature dimension d, that share a common latent representation of dimension $r \ll \min\{d,T\}$, capturing their underlying relatedness. We propose a new Optimism in the Face of Uncertainty Linear (OFUL) algorithm that leverages shared low-rank representations to enhance decision-making in a sample-efficient manner. Our algorithm first collects data through an exploration phase, estimates the shared model via spectral initialization, and then conducts OFUL based learning over a newly constructed confidence set. We provide theoretical guarantees for the confidence set and prove that the unknown reward vectors lie within the confidence set with high probability. We derive cumulative regret bounds and show that the proposed approach achieves a regret of $\tilde{O}(\sqrt{drNT})$, a significant improvement over solving the T tasks independently, which results in a regret of $\tilde{O}(dT\sqrt{N})$. We performed numerical simulations to validate the performance of our algorithm for different problem sizes.
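
下面的 NumPy 草图示意"经由谱初始化估计共享模型"这一步:把各任务的最小二乘粗估计堆成 d×T 矩阵,取其前 r 个左奇异向量作为共享低秩子空间的估计。维度与数据生成均为玩具假设,置信集构造与 OFUL 学习阶段不在此示意之内:

import numpy as np

rng = np.random.default_rng(1)

def spectral_init(theta_hats, r):
    """谱初始化:SVD 的前 r 个左奇异向量张成共享子空间的估计。"""
    Theta = np.stack(theta_hats, axis=1)           # d x T
    U, _, _ = np.linalg.svd(Theta, full_matrices=False)
    return U[:, :r]

d, T, r, N = 20, 30, 3, 100
B_true = np.linalg.qr(rng.normal(size=(d, r)))[0]  # 真实共享子空间
theta_hats = []
for _ in range(T):
    theta = B_true @ rng.normal(size=r)            # 每个任务的参数都在子空间内
    X = rng.normal(size=(N, d))
    y = X @ theta + 0.1 * rng.normal(size=N)
    theta_hats.append(np.linalg.lstsq(X, y, rcond=None)[0])

B_hat = spectral_init(theta_hats, r)
# 以投影矩阵之差的谱范数衡量子空间恢复误差
print(np.linalg.norm(B_hat @ B_hat.T - B_true @ B_true.T, 2))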


【4】Deconfounding Scores and Representation Learning for Causal Effect Estimation with Weak Overlap
标题:弱重叠下因果效应估计的去混杂分数与表示学习
链接:https://arxiv.org/abs/2604.00811

作者:Oscar Clivio,Alexander D'Amour,Alexander Franks,David Bruns-Smith,Chris Holmes,Avi Feller
备注:To appear at AISTATS 2026
摘要:重叠性(也称正值性)是因果处理效应估计的关键条件。当特征在处理组之间差异很大时,许多流行的估计量会出现高方差并变得脆弱。这在高维情形下尤其具有挑战性:维数灾难可能使重叠性变得不可信。为解决这一问题,我们提出了一类称为去混杂分数(deconfounding scores)的特征表示,它同时保留了可识别性与估计目标;经典的倾向性分数和预后分数是其中两个特例。我们将寻找重叠性更好的表示这一问题刻画为在去混杂分数约束下最小化重叠散度。随后,我们在一大类具有高斯特征的广义线性模型下,为一类去混杂分数推导出封闭形式表达式,并证明预后分数在该类中是重叠最优的。我们进行了大量实验以在经验上评估这一行为。
摘要:Overlap, also known as positivity, is a key condition for causal treatment effect estimation. Many popular estimators suffer from high variance and become brittle when features differ strongly across treatment groups. This is especially challenging in high dimensions: the curse of dimensionality can make overlap implausible. To address this, we propose a class of feature representations called deconfounding scores, which preserve both identification and the target of estimation; the classical propensity and prognostic scores are two special cases. We characterize the problem of finding a representation with better overlap as minimizing an overlap divergence under a deconfounding score constraint. We then derive closed-form expressions for a class of deconfounding scores under a broad family of generalized linear models with Gaussian features and show that prognostic scores are overlap-optimal within this class. We conduct extensive experiments to assess this behavior empirically.


3D|3D重建等相关(1篇)

【1】ProOOD: Prototype-Guided Out-of-Distribution 3D Occupancy Prediction
标题:ProOOD:原型引导的分布外3D占用预测
链接:https://arxiv.org/abs/2604.01081

作者:Yuheng Zhang,Mengfei Duan,Kunyu Peng,Yuhang Wang,Di Wen,Danda Pani Paudel,Luc Van Gool,Kailun Yang
备注:Accepted to CVPR 2026. The source code is publicly available at https://github.com/7uHeng/ProOOD
摘要:3D语义占用预测是自动驾驶的核心,但目前的方法容易受到长尾类偏差和分布外(OOD)输入的影响,通常过于自信地将异常分配给罕见的类。我们提出了ProOOD,一个轻量级的,即插即用的方法,夫妇的原型指导细化与培训免费OOD评分。ProOOD包括(i)原型引导的语义填补,用类一致的特征填充闭塞区域,(ii)原型引导的尾部挖掘,加强稀有类表示以抑制OOD吸收,以及(iii)EchoOOD,将局部logit一致性与局部和全局原型匹配融合以产生可靠的体素级OOD分数。在五个数据集上进行的大量实验表明,ProOOD在分布3D占用预测和OOD检测方面都达到了最先进的性能。在SemanticKITTI上,它整体超过基线+3.57% mIoU,尾部类mIoU超过基线+24.80%;在VAA-KITTI上,它将AuPRCr提高了+19.34点,在各个基准上都有一致的收益。这些改进在安全关键的城市驾驶中产生更校准的占用估计和更可靠的OOD检测。源代码可在https://github.com/7uHeng/ProOOD上公开获得。
摘要:3D semantic occupancy prediction is central to autonomous driving, yet current methods are vulnerable to long-tailed class bias and out-of-distribution (OOD) inputs, often overconfidently assigning anomalies to rare classes. We present ProOOD, a lightweight, plug-and-play method that couples prototype-guided refinement with training-free OOD scoring. ProOOD comprises (i) prototype-guided semantic imputation that fills occluded regions with class-consistent features, (ii) prototype-guided tail mining that strengthens rare-class representations to curb OOD absorption, and (iii) EchoOOD, which fuses local logit coherence with local and global prototype matching to produce reliable voxel-level OOD scores. Extensive experiments on five datasets demonstrate that ProOOD achieves state-of-the-art performance on both in-distribution 3D occupancy prediction and OOD detection. On SemanticKITTI, it surpasses baselines by +3.57% mIoU overall and +24.80% tail-class mIoU; on VAA-KITTI, it improves AuPRCr by +19.34 points, with consistent gains across benchmarks. These improvements yield more calibrated occupancy estimates and more reliable OOD detection in safety-critical urban driving. The source code is publicly available at https://github.com/7uHeng/ProOOD.


优化|敛散性(8篇)

【1】Approximating Pareto Frontiers in Stochastic Multi-Objective Optimization via Hashing and Randomization
标题:通过哈希和随机化逼近随机多目标优化中的帕累托前沿
链接:https://arxiv.org/abs/2604.01098

作者:Jinzhao Li,Nan Jiang,Yexiang Xue
摘要:随机多目标优化(SMOO)对于在不确定环境中权衡多个可能相互冲突目标的决策至关重要。SMOO旨在识别Pareto前沿,其中包含所有相互非支配的决策。由于内嵌的概率推断(如计算边缘概率、后验概率或期望),该问题具有高度的难解性。现有方法,如标量化、样本平均近似和进化算法,要么只能提供任意松弛的近似,要么可能产生高昂的计算成本。我们提出XOR-SMOO,一种新算法:它以 $1-δ$ 的概率,通过对SAT预言机进行关于 $γ$ 和 $δ$ 的多对数次查询,获得SMOO的 $γ$-近似Pareto前沿($γ>1$)。$γ$-近似Pareto前沿至多比真实前沿低一个固定的乘法因子 $γ$。因此,XOR-SMOO仅借助对SAT预言机的查询即可求解高度难解的SMOO问题(#P-难),同时获得紧的常数因子近似保证。在真实道路网络加固与供应链设计问题上的实验表明,XOR-SMOO在识别Pareto前沿方面优于多个基线:目标值更高、对最优解的覆盖更好,且所找到的解分布更均匀。总体而言,XOR-SMOO显著增强了SMOO求解器的实用性与可靠性。
摘要:Stochastic Multi-Objective Optimization (SMOO) is critical for decision-making trading off multiple potentially conflicting objectives in uncertain environments. SMOO aims at identifying the Pareto frontier, which contains all mutually non-dominating decisions. The problem is highly intractable due to the embedded probabilistic inference, such as computing the marginal, posterior probabilities, or expectations. Existing methods, such as scalarization, sample average approximation, and evolutionary algorithms, either offer arbitrarily loose approximations or may incur prohibitive computational costs. We propose XOR-SMOO, a novel algorithm that with probability $1-δ$, obtains $γ$-approximate Pareto frontiers ($γ>1$) for SMOO by querying an SAT oracle poly-log times in $γ$ and $δ$. A $γ$-approximate Pareto frontier is only below the true frontier by a fixed, multiplicative factor $γ$. Thus, XOR-SMOO solves highly intractable SMOO problems (#P-hard) with only queries to SAT oracles while obtaining tight, constant factor approximation guarantees. Experiments on real-world road network strengthening and supply chain design problems demonstrate that XOR-SMOO outperforms several baselines in identifying Pareto frontiers that have higher objective values, better coverage of the optimal solutions, and the solutions found are more evenly distributed. Overall, XOR-SMOO significantly enhanced the practicality and reliability of SMOO solvers.


【2】Model-Based Learning of Near-Optimal Finite-Window Policies in POMDPs
标题:POMDP中近最优有限窗口策略的基于模型的学习
链接:https://arxiv.org/abs/2604.01024

作者:Philip Jordan,Maryam Kamgarpour
摘要:我们研究表格型部分可观测马尔可夫决策过程(POMDP)中有限窗口策略的基于模型的学习。在部分可观测性下进行学习的一种常见方法,是用有限的动作-观测窗口来近似无界的历史依赖。这会在历史上诱导出一个有限状态马尔可夫决策过程(MDP),称为超状态MDP。一旦获得该超状态MDP的模型,就可以用标准MDP算法计算最优策略,这也引出了对样本高效模型估计的需求。估计超状态MDP模型具有挑战性,因为轨迹是通过与原始POMDP交互生成的,从而在采样过程与目标模型之间产生失配。我们提出了一种面向表格型POMDP的模型估计过程,并分析了其样本复杂度。我们的分析利用了滤波器稳定性与弱相依随机变量集中不等式之间的联系。由此,我们获得了从单条轨迹估计超状态MDP模型的紧样本复杂度保证。结合值迭代,这为POMDP产生了近似最优的有限窗口策略。
摘要:We study model-based learning of finite-window policies in tabular partially observable Markov decision processes (POMDPs). A common approach to learning under partial observability is to approximate unbounded history dependencies using finite action-observation windows. This induces a finite-state Markov decision process (MDP) over histories, referred to as the superstate MDP. Once a model of this superstate MDP is available, standard MDP algorithms can be used to compute optimal policies, motivating the need for sample-efficient model estimation. Estimating the superstate MDP model is challenging because trajectories are generated by interaction with the original POMDP, creating a mismatch between the sampling process and target model. We propose a model estimation procedure for tabular POMDPs and analyze its sample complexity. Our analysis exploits a connection between filter stability and concentration inequalities for weakly dependent random variables. As a result, we obtain tight sample complexity guarantees for estimating the superstate MDP model from a single trajectory. Combined with value iteration, this yields approximately optimal finite-window policies for the POMDP.
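
"超状态 MDP"的模型估计本质上是对窗口化历史做经验计数。下面的 Python 草图演示这一构造(窗口长度 k 与轨迹的三元组格式均为示例假设;论文的贡献在于为这种单轨迹估计给出紧的样本复杂度界,而非此处的计数本身):

from collections import defaultdict, deque
import random

def estimate_superstate_mdp(trajectory, k=2):
    """超状态 = 最近 k 个(动作,观测)对;对 (超状态,动作) 的转移与奖励做经验计数。"""
    counts = defaultdict(lambda: defaultdict(int))
    rewards = defaultdict(list)
    window, prev_state = deque(maxlen=k), None
    for a, o, r in trajectory:
        window.append((a, o))
        state = tuple(window)
        if prev_state is not None and len(prev_state) == k:
            counts[(prev_state, a)][state] += 1
            rewards[(prev_state, a)].append(r)
        prev_state = state
    P = {sa: {s2: n / sum(nexts.values()) for s2, n in nexts.items()}
         for sa, nexts in counts.items()}           # 经验转移概率
    R = {sa: sum(rs) / len(rs) for sa, rs in rewards.items()}  # 经验平均奖励
    return P, R

random.seed(0)
traj = [(random.choice("LR"), random.choice("01"), random.random())
        for _ in range(1000)]                       # 单条玩具轨迹
P, R = estimate_superstate_mdp(traj, k=2)
# 在 (P, R) 上运行标准值迭代即可得到有限窗口策略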


【3】A Decoupled Basis-Vector-Driven Generative Framework for Dynamic Multi-Objective Optimization
标题:动态多目标优化的解耦基向量驱动生成框架
链接:https://arxiv.org/abs/2604.00508

作者:Yaoming Yang,Shuai Wang,Bingdong Li,Peng Yang,Ke Tang
摘要:动态多目标优化需要连续跟踪移动的Pareto前沿。现有方法在面对不规则突变和数据稀疏时表现不佳,主要面临三个挑战:动态模式的非线性耦合、来自过时历史数据的负迁移,以及环境切换过程中的冷启动问题。为解决这些问题,本文提出了一个解耦的基向量驱动生成框架(DB-GEN)。首先,为解决非线性耦合,该框架采用离散小波变换将演化轨迹分离为低频趋势和高频细节。其次,为减轻负迁移,它通过稀疏字典学习来学习可迁移的基向量,而不是直接记忆历史实例。在拓扑感知的对比约束下重组这些基向量,构建出结构化的潜在流形。最后,为克服冷启动问题,一种代理辅助搜索范式从该流形中采样初始种群。DB-GEN在1.2亿个解上完成预训练后,无需重新训练或微调即可进行直接在线推断。该zero-shot生成过程以毫秒级执行,每次环境变化大约只需0.2秒。实验结果表明,与现有算法相比,DB-GEN在各种动态基准上提高了跟踪精度。
摘要:Dynamic multi-objective optimization requires continuous tracking of moving Pareto fronts. Existing methods struggle with irregular mutations and data sparsity, primarily facing three challenges: the non-linear coupling of dynamic modes, negative transfer from outdated historical data, and the cold-start problem during environmental switches. To address these issues, this paper proposes a decoupled basis-vector-driven generative framework (DB-GEN). First, to resolve non-linear coupling, the framework employs the discrete wavelet transform to separate evolutionary trajectories into low-frequency trends and high-frequency details. Second, to mitigate negative transfer, it learns transferable basis vectors via sparse dictionary learning rather than directly memorizing historical instances. Recomposing these bases under a topology-aware contrastive constraint constructs a structured latent manifold. Finally, to overcome the cold-start problem, a surrogate-assisted search paradigm samples initial populations from this manifold. Pre-trained on 120 million solutions, DB-GEN performs direct online inference without retraining or fine-tuning. This zero-shot generation process executes in milliseconds, requiring approximately 0.2 seconds per environmental change. Experimental results demonstrate that DB-GEN improves tracking accuracy across various dynamic benchmarks compared to existing algorithms.
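
DB-GEN 的第一步是用离散小波变换把演化轨迹拆成低频趋势与高频细节。下面用单层 Haar 小波给出自包含的 NumPy 示意(论文未必使用 Haar 基,此处仅演示"趋势/细节解耦"的含义):

import numpy as np

def haar_dwt(x):
    """单层 Haar DWT:返回近似系数(低频趋势)与细节系数(高频扰动)。"""
    x = np.asarray(x, dtype=float)
    if len(x) % 2:
        x = np.append(x, x[-1])               # 简单边界延拓
    a = (x[0::2] + x[1::2]) / np.sqrt(2)      # 低频趋势
    d = (x[0::2] - x[1::2]) / np.sqrt(2)      # 高频细节
    return a, d

def haar_idwt(a, d):
    x = np.empty(2 * len(a))
    x[0::2] = (a + d) / np.sqrt(2)
    x[1::2] = (a - d) / np.sqrt(2)
    return x

t = np.linspace(0, 4 * np.pi, 64)
traj = np.sin(t) + 0.1 * np.random.default_rng(0).normal(size=64)  # 带噪演化轨迹
trend, detail = haar_dwt(traj)
assert np.allclose(haar_idwt(trend, detail), traj)  # 变换可完美重构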


【4】Convergence of Byzantine-Resilient Gradient Tracking via Probabilistic Edge Dropout
标题:基于概率边丢弃的拜占庭鲁棒梯度跟踪的收敛性
链接:https://arxiv.org/abs/2604.00449

作者:Amirhossein Dezhboro,Fateme Maleki,Arman Adibi,Erfan Amini,Jose E. Ramirez-Marquez
摘要:我们研究存在拜占庭智能体(可能发送任意对抗性消息)的网络上的分布式优化。我们提出"带概率边丢弃的梯度跟踪"(GT-PD),一种随机梯度跟踪方法,能够在对抗性通信下保持梯度跟踪的收敛性质。GT-PD结合了两个互补的防御层:一种通用的以自身为中心的投影,将每条传入消息裁剪到以接收智能体为中心、半径为 $τ$ 的球内;以及一种完全去中心化的概率丢弃规则,由决策通道与跟踪通道中的双指标信任分数驱动。这一设计在约束对抗扰动的同时保留了双随机混合结构——该性质在去中心化设定的鲁棒聚合下往往会丢失。在完全拜占庭隔离条件下($p_b=0$),GT-PD线性收敛到一个仅由随机梯度方差决定的邻域。对于部分隔离($p_b>0$),我们提出"带概率边丢弃与泄漏积分的梯度跟踪"(GT-PD-L),它利用泄漏积分器控制由持续扰动引起的跟踪误差累积,并线性收敛到由随机方差与裁剪-泄漏比决定的有界邻域。我们进一步证明,在 $p_h=1$ 的两层丢弃机制下,隔离拜占庭智能体不会给诚实智能体的共识动力学引入额外方差。在符号翻转、ALIE和内积操纵攻击下的MNIST实验表明,GT-PD-L在隐蔽攻击下的性能比坐标截尾均值高出至多4.3个百分点。
摘要:We study distributed optimization over networks with Byzantine agents that may send arbitrary adversarial messages. We propose Gradient Tracking with Probabilistic Edge Dropout (GT-PD), a stochastic gradient tracking method that preserves the convergence properties of gradient tracking under adversarial communication. GT-PD combines two complementary defense layers: a universal self-centered projection that clips each incoming message to a ball of radius $τ$ around the receiving agent, and a fully decentralized probabilistic dropout rule driven by a dual-metric trust score in the decision and tracking channels. This design bounds adversarial perturbations while preserving the doubly stochastic mixing structure, a property often lost under robust aggregation in decentralized settings. Under complete Byzantine isolation ($p_b=0$), GT-PD converges linearly to a neighborhood determined solely by stochastic gradient variance. For partial isolation ($p_b>0$), we introduce Gradient Tracking with Probabilistic Edge Dropout and Leaky Integration (GT-PD-L), which uses a leaky integrator to control the accumulation of tracking errors caused by persistent perturbations and achieves linear convergence to a bounded neighborhood determined by the stochastic variance and the clipping-to-leak ratio. We further show that under two-tier dropout with $p_h=1$, isolating Byzantine agents introduces no additional variance into the honest consensus dynamics. Experiments on MNIST under Sign Flip, ALIE, and Inner Product Manipulation attacks show that GT-PD-L outperforms coordinate-wise trimmed mean by up to 4.3 percentage points under stealth attacks.
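
GT-PD 的第一道防线是"以自身为中心的投影"。下面的 NumPy 草图按摘要所述实现这一裁剪(半径 τ 为算法超参数,示例数值均为假设):

import numpy as np

def self_centered_projection(x_self, x_in, tau):
    """把邻居消息 x_in 裁剪到以自身状态 x_self 为中心、半径为 tau 的球内,
    从而为对抗扰动的影响给出确定性上界。"""
    diff = x_in - x_self
    norm = np.linalg.norm(diff)
    if norm <= tau:
        return x_in
    return x_self + tau * diff / norm

x_self = np.zeros(5)
benign = 0.1 * np.ones(5)        # 诚实邻居的消息:原样通过
byzantine = 100.0 * np.ones(5)   # 拜占庭消息:被裁剪到球面上
print(self_centered_projection(x_self, benign, tau=1.0))
print(np.linalg.norm(self_centered_projection(x_self, byzantine, tau=1.0)))  # -> 1.0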


【5】Shapley-Guided Neural Repair Approach via Derivative-Free Optimization
标题:基于无导数优化的Shapley引导神经修复方法
链接:https://arxiv.org/abs/2604.00422

作者:Xinyu Sun,Wanwei Liu,Haoang Chi,Tingyu Chen,Xiaoguang Mao,Shangwen Wang,Lei Bu,Jingyi Wang,Yang Tan,Zhenyi Qi
摘要:DNN容易受到后门、对抗攻击和不公平性等缺陷的影响,从而削弱其可靠性。现有方法主要涉及再训练、优化、约束求解或搜索算法。然而,大多数方法依赖梯度计算,使其适用范围局限于特定激活函数(如ReLU),或者使用定位与修复过程不可解释的搜索算法。此外,它们往往缺乏跨多种属性的通用性。我们提出SHARPEN,将可解释的故障定位与无导数优化策略相结合。首先,SHARPEN引入一种基于Deep SHAP的定位策略,量化每一层和每个神经元对错误输出的边际贡献。具体而言,一种分层的由粗到细方法先按聚合影响对层重新排序,再通过分析属性违反状态与良性状态之间的激活差异来定位故障神经元/滤波器。随后,SHARPEN结合CMA-ES修复所识别的神经元。CMA-ES利用协方差矩阵捕获变量间依赖关系,实现无梯度搜索以及对耦合神经元的协调调整。通过将可解释定位与进化优化相结合,SHARPEN实现了跨架构的无导数修复,对梯度异常和超参数不那么敏感。我们在三类修复任务上验证了SHARPEN的有效性。在平衡属性修复与精度保持方面,它在后门移除(+10.56%)、对抗缓解(+5.78%)和不公平性修复(+11.82%)上均优于基线。值得注意的是,SHARPEN能处理多样化任务,其模块化设计可与不同的无导数优化器即插即用,凸显了其灵活性。
摘要:DNNs are susceptible to defects like backdoors, adversarial attacks, and unfairness, undermining their reliability. Existing approaches mainly involve retraining, optimization, constraint-solving, or search algorithms. However, most methods rely on gradient calculations, restricting applicability to specific activation functions (e.g., ReLU), or use search algorithms with uninterpretable localization and repair. Furthermore, they often lack generalizability across multiple properties. We propose SHARPEN, integrating interpretable fault localization with a derivative-free optimization strategy. First, SHARPEN introduces a Deep SHAP-based localization strategy quantifying each layer's and neuron's marginal contribution to erroneous outputs. Specifically, a hierarchical coarse-to-fine approach reranks layers by aggregated impact, then locates faulty neurons/filters by analyzing activation divergences between property-violating and benign states. Subsequently, SHARPEN incorporates CMA-ES to repair identified neurons. CMA-ES leverages a covariance matrix to capture variable dependencies, enabling gradient-free search and coordinated adjustments across coupled neurons. By combining interpretable localization with evolutionary optimization, SHARPEN enables derivative-free repair across architectures, being less sensitive to gradient anomalies and hyperparameters. We demonstrate SHARPEN's effectiveness on three repair tasks. Balancing property repair and accuracy preservation, it outperforms baselines in backdoor removal (+10.56%), adversarial mitigation (+5.78%), and unfairness repair (+11.82%). Notably, SHARPEN handles diverse tasks, and its modular design is plug-and-play with different derivative-free optimizers, highlighting its flexibility.
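
下面用一个简化的 (1,λ) 进化策略示意"对被定位神经元做无导数修复"的搜索循环。注意这只是 CMA-ES 的退化替身:真正的 CMA-ES 还会自适应协方差矩阵以捕获耦合神经元之间的依赖;适应度函数与各项数值亦为玩具假设:

import numpy as np

rng = np.random.default_rng(0)

def es_repair(weights, fitness, sigma=0.1, pop=16, iters=50):
    """对定位到的神经元权重做高斯扰动搜索,只依赖适应度评估,不依赖梯度。"""
    best = np.array(weights, dtype=float)
    for _ in range(iters):
        cands = best + sigma * rng.normal(size=(pop, len(best)))
        scores = np.array([fitness(c) for c in cands])
        best = cands[scores.argmax()]
    return best

# 玩具适应度:同时奖励"修复属性"(压低对坏方向 b 的投影)与"保持精度"(贴近原权重 w0)
w0 = rng.normal(size=10)
b = rng.normal(size=10); b /= np.linalg.norm(b)
fitness = lambda w: -abs(w @ b) - 0.1 * np.linalg.norm(w - w0)
w_repaired = es_repair(w0, fitness)
print(abs(w0 @ b), abs(w_repaired @ b))  # 修复后对坏方向的投影应显著减小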


【6】Deep Learning-Accelerated Surrogate Optimization for High-Dimensional Well Control in Stress-Sensitive Reservoirs
标题:应力敏感油藏高维井控深度学习加速代理优化
链接:https://arxiv.org/abs/2604.00352

作者:Mahammad Valiyev,Jodel Cornelio,Behnam Jafarpour
摘要:应力敏感非常规油藏的生产优化受压力驱动流动与应力诱导的裂缝导流能力及基质渗透率退化之间的非线性权衡所支配。虽然更高的压降能提高短期产量,但它会加速渗透率损失并降低长期采收率。确定最优的、随时间变化的控制策略需要反复调用完全耦合的流动-地质力学模拟器,使传统优化在计算上十分昂贵。   我们提出了一个基于深度学习的代理优化框架,用于高维井控。与依赖预定义控制参数化或通用采样的先前方法不同,我们的方法将井控视为连续的高维问题,并引入一种问题知情的采样策略,使训练数据与优化过程中实际遇到的轨迹保持一致。我们使用来自耦合流动-地质力学模型的数据,训练神经网络代理来近似井底压力轨迹与累积产量之间的映射。   该代理被嵌入到一个带约束的优化工作流中,从而能够快速评估控制策略。在多次初始化下,代理与全物理解之间的一致性在2-5%以内,同时将计算成本降低多达三个数量级。偏差主要与训练分布边界附近的轨迹以及局部优化效应有关。   该框架表明,将代理建模与问题知情采样相结合,可以对高维、基于模拟器的问题进行可扩展且可靠的优化,并对PDE约束系统具有更广泛的适用性。
摘要:Production optimization in stress-sensitive unconventional reservoirs is governed by a nonlinear trade-off between pressure-driven flow and stress-induced degradation of fracture conductivity and matrix permeability. While higher drawdown improves short-term production, it accelerates permeability loss and reduces long-term recovery. Identifying optimal, time-varying control strategies requires repeated evaluations of fully coupled flow-geomechanics simulators, making conventional optimization computationally expensive.   We propose a deep learning-based surrogate optimization framework for high-dimensional well control. Unlike prior approaches that rely on predefined control parameterizations or generic sampling, our method treats well control as a continuous, high-dimensional problem and introduces a problem-informed sampling strategy that aligns training data with trajectories encountered during optimization. A neural network proxy is trained to approximate the mapping between bottomhole pressure trajectories and cumulative production using data from a coupled flow-geomechanics model.   The proxy is embedded within a constrained optimization workflow, enabling rapid evaluation of control strategies. Across multiple initializations, the surrogate achieves agreement with full-physics solutions within 2-5 percent, while reducing computational cost by up to three orders of magnitude. Discrepancies are mainly associated with trajectories near the boundary of the training distribution and local optimization effects.   This framework shows that combining surrogate modeling with problem-informed sampling enables scalable and reliable optimization for high-dimensional, simulator-based problems, with broader applicability to PDE-constrained systems.


【7】Learning to Shuffle: Block Reshuffling and Reversal Schemes for Stochastic Optimization
标题:学习洗牌:随机优化的块重洗与成对反转方案
链接:https://arxiv.org/abs/2604.00260

作者:Lam M. Nguyen,Dzung T. Phan,Jayant Kalagnanam
摘要:随机梯度下降(SGD)的洗牌策略,包括增量梯度、一次性洗牌和随机重洗,在任意历元内排列下均有严格的收敛性分析支持。特别地,已知随机重洗相对于循环方案和一次性洗牌方案能改进优化常数。然而,对于如何设计新的数据排序方案以进一步改进优化常数或在随机重洗之外提升稳定性,现有理论提供的指导有限。在本文中,我们设计了一条使用大语言模型(LLM)引导的程序演化框架的流水线,为无放回SGD发现了一条有效的洗牌规则。从该实例中抽象,我们识别出两个基本的结构组件:块重洗与成对反转。我们分别分析这两个组件,并证明在统一洗牌框架内,块重洗严格减小前缀梯度方差常数,在温和条件下相对随机重洗产生可证明的改进。另外,我们证明成对反转使历元映射对称化,并消去了首阶的顺序相关二阶项,将顺序敏感性对步长的依赖从二次降至三次。使用所发现算法的数值实验验证了该理论,并在凸与非凸基准上相对标准洗牌方案展现出一致的增益。
摘要:Shuffling strategies for stochastic gradient descent (SGD), including incremental gradient, shuffle-once, and random reshuffling, are supported by rigorous convergence analyses for arbitrary within-epoch permutations. In particular, random reshuffling is known to improve optimization constants relative to cyclic and shuffle-once schemes. However, existing theory offers limited guidance on how to design new data-ordering schemes that further improve optimization constants or stability beyond random reshuffling. In this paper, we design a pipeline using a large language model (LLM)-guided program evolution framework to discover an effective shuffling rule for without-replacement SGD. Abstracting from this instance, we identify two fundamental structural components: block reshuffling and paired reversal. We analyze these components separately and show that block reshuffling strictly reduces prefix-gradient variance constants within the unified shuffling framework, yielding provable improvements over random reshuffling under mild conditions. Separately, we show that paired reversal symmetrizes the epoch map and cancels the leading order-dependent second-order term, reducing order sensitivity from quadratic to cubic in the step size. Numerical experiments with the discovered algorithm validate the theory and demonstrate consistent gains over standard shuffling schemes across convex and nonconvex benchmarks.
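
下面的 NumPy 草图把摘要中抽象出的两个结构组件——块重洗与成对反转——组合成一个历元的数据顺序。这只是对这两个组件的示意组合,并非论文经 LLM 程序演化发现的确切规则:

import numpy as np

def block_reshuffle_with_reversal(n, block_size, rng):
    """块重洗:把索引切块并随机重排块;成对反转:同一顺序再以镜像反向使用一遍,
    使历元映射对称化,从而消去首阶的顺序相关项。"""
    idx = np.arange(n)
    blocks = [idx[i:i + block_size] for i in range(0, n, block_size)]
    order = rng.permutation(len(blocks))
    forward = np.concatenate([blocks[j] for j in order])
    return forward, forward[::-1].copy()

rng = np.random.default_rng(0)
fwd, rev = block_reshuffle_with_reversal(n=12, block_size=3, rng=rng)
print(fwd)   # 块级随机、块内保序的前向排列(具体顺序随种子而异)
print(rev)   # 上述顺序的镜像
# 训练时:先按 fwd 过一遍数据,再按 rev 过一遍,构成一个对称历元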


【8】Scaled Gradient Descent for Ill-Conditioned Low-Rank Matrix Recovery with Optimal Sampling Complexity
标题:具有最优采样复杂度的病态低秩矩阵恢复的缩放梯度下降
链接:https://arxiv.org/abs/2604.00060

作者:Zhenxuan Li,Meng Huang
摘要:低秩矩阵恢复问题旨在从 $m$ 个线性测量($m\ll n_1n_2$)中重构未知的 $n_1 \times n_2$ 秩-$r$ 矩阵。该问题在过去几十年中得到了广泛研究,产生了多种具有坚实理论保证的算法。其中,基于梯度下降的非凸方法因其计算效率而尤为流行。然而,这些方法通常存在两个关键局限:次优的样本复杂度 $O((n_1 + n_2)r^2)$,以及达到 $ε$-精度所需的 $O(κ\log(1/ε))$ 迭代复杂度,导致目标矩阵病态时收敛缓慢。这里 $κ$ 表示未知矩阵的条件数。最近的研究表明,GD 的一种预条件变体,即缩放梯度下降(ScaledGD),可以将迭代复杂度显著降低到 $O(\log(1/ε))$。尽管如此,其样本复杂度仍停留在次优的 $O((n_1 + n_2)r^2)$。相比之下,一种精巧的虚拟序列技术表明,半正定(PSD)设定下的标准 GD 可达到最优样本复杂度 $O((n_1 + n_2)r)$,但收敛更慢,迭代复杂度为 $O(κ^2 \log(1/ε))$。在本文中,通过更精细的分析,我们证明 ScaledGD 同时达到最优样本复杂度 $O((n_1 + n_2)r)$ 和改进的迭代复杂度 $O(\log(1/ε))$。值得注意的是,我们的结果从 PSD 设定推广到一般的低秩矩阵恢复问题。数值实验进一步验证了 ScaledGD 在最优采样复杂度下加速了病态矩阵情形的收敛。
摘要:The low-rank matrix recovery problem seeks to reconstruct an unknown $n_1 \times n_2$ rank-$r$ matrix from $m$ linear measurements, where $m\ll n_1n_2$. This problem has been extensively studied over the past few decades, leading to a variety of algorithms with solid theoretical guarantees. Among these, gradient descent based non-convex methods have become particularly popular due to their computational efficiency. However, these methods typically suffer from two key limitations: a sub-optimal sample complexity of $O((n_1 + n_2)r^2)$ and an iteration complexity of $O(κ\log(1/ε))$ to achieve $ε$-accuracy, resulting in slow convergence when the target matrix is ill-conditioned. Here, $κ$ denotes the condition number of the unknown matrix. Recent studies show that a preconditioned variant of GD, known as scaled gradient descent (ScaledGD), can significantly reduce the iteration complexity to $O(\log(1/ε))$. Nonetheless, its sample complexity remains sub-optimal at $O((n_1 + n_2)r^2)$. In contrast, a delicate virtual sequence technique demonstrates that the standard GD in the positive semidefinite (PSD) setting achieves the optimal sample complexity $O((n_1 + n_2)r)$, but converges more slowly with an iteration complexity $O(κ^2 \log(1/ε))$. In this paper, through a more refined analysis, we show that ScaledGD achieves both the optimal sample complexity $O((n_1 + n_2)r)$ and the improved iteration complexity $O(\log(1/ε))$. Notably, our results extend beyond the PSD setting to general low-rank matrix recovery problem. Numerical experiments further validate that ScaledGD accelerates convergence for ill-conditioned matrices with the optimal sampling complexity.
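
下面的 NumPy 草图在最简单的 PSD 全观测情形下演示 ScaledGD 的预条件更新 X <- X - eta * grad * (X^T X)^{-1}(初始化与步长为示例假设;论文的线性测量设定更一般)。在与特征方向对齐时,该更新对每个奇异值退化为 Heron 开方迭代,这直观解释了收敛为何不依赖条件数:

import numpy as np

rng = np.random.default_rng(0)

def scaled_gd(M, r, eta=0.5, iters=200):
    """玩具 ScaledGD:最小化 f(X) = ||X X^T - M||_F^2 / 4。"""
    X = rng.normal(size=(M.shape[0], r)) * 0.1
    for _ in range(iters):
        grad = (X @ X.T - M) @ X                       # ∇f(X)
        X = X - eta * grad @ np.linalg.inv(X.T @ X)    # 预条件(缩放)步
    return X

# 病态的秩 2 真实矩阵:条件数 κ = 100
U = np.linalg.qr(rng.normal(size=(50, 2)))[0]
M = U @ np.diag([100.0, 1.0]) @ U.T
X = scaled_gd(M, r=2)
print(np.linalg.norm(X @ X.T - M) / np.linalg.norm(M))  # 相对误差应很小

作为对照,普通梯度下降在同一问题上需要把步长压到约 1/σ_max 的量级,收敛轮数随 κ 增长——这正是摘要所述迭代复杂度差异的直观来源。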


预测|估计(6篇)

【1】The Recipe Matters More Than the Kitchen:Mathematical Foundations of the AI Weather Prediction Pipeline
标题:食谱比厨房更重要:人工智能天气预测管道的数学基础
链接:https://arxiv.org/abs/2604.01215

作者:Piyush Garg,Diana R. Gergel,Andrew E. Shao,Galen J. Yacalis
摘要:人工智能天气预报发展迅速,但尚无统一的数学框架来解释是什么决定了预报技巧。现有理论只针对特定的架构选择,而非整个学习管道;而2023-2026年的业务证据表明,训练方法、损失函数设计和数据多样性至少与架构选择同等重要。本文作出两项相互交织的贡献。理论上,我们构建了一个植根于球面逼近理论、动力系统理论、信息论和统计学习理论的框架,它处理完整的学习管道(架构、损失函数、训练策略、数据分布),而不仅仅是架构。我们建立了"学习管道误差分解",表明在当前规模下,估计误差(与损失和数据相关)主导近似误差(与架构相关)。我们发展了"损失函数谱理论",在球谐坐标中形式化MSE诱导的谱模糊,并推导出分布外外推界,证明数据驱动模型系统性地低估破纪录的极端事件,且偏差随破纪录幅度线性增长。经验上,我们使用NVIDIA Earth2Studio和ERA5初始条件,对10个架构各异的人工智能天气模型进行推理,在跨越所有季节的30个初始化日期上评估了6项指标,以验证上述预测。结果证实了MSE训练模型在高波数处普遍的谱能量损失、不断上升的误差共识比(表明大部分预报误差为各架构所共享),以及极端事件期间的线性负偏差。"整体模型评估分数"提供统一的多维评估,并且该规定性框架使得在训练之前就能对拟议的管道进行数学评估。
摘要:AI weather prediction has advanced rapidly, yet no unified mathematical framework explains what determines forecast skill. Existing theory addresses specific architectural choices rather than the learning pipeline as a whole, while operational evidence from 2023-2026 demonstrates that training methodology, loss function design, and data diversity matter at least as much as architecture selection. This paper makes two interleaved contributions. Theoretically, we construct a framework rooted in approximation theory on the sphere, dynamical systems theory, information theory, and statistical learning theory that treats the complete learning pipeline (architecture, loss function, training strategy, data distribution) rather than architecture alone. We establish a Learning Pipeline Error Decomposition showing that estimation error (loss- and data-dependent) dominates approximation error (architecture-dependent) at current scales. We develop a Loss Function Spectral Theory formalizing MSE-induced spectral blurring in spherical harmonic coordinates, and derive Out-of-Distribution Extrapolation Bounds proving that data-driven models systematically underestimate record-breaking extremes with bias growing linearly in record exceedance. Empirically, we validate these predictions via inference across ten architecturally diverse AI weather models using NVIDIA Earth2Studio with ERA5 initial conditions, evaluating six metrics across 30 initialization dates spanning all seasons. Results confirm universal spectral energy loss at high wavenumbers for MSE-trained models, rising Error Consensus Ratios showing that the majority of forecast error is shared across architectures, and linear negative bias during extreme events. A Holistic Model Assessment Score provides unified multi-dimensional evaluation, and a prescriptive framework enables mathematical evaluation of proposed pipelines before training.


【2】NeuroDDAF: Neural Dynamic Diffusion-Advection Fields with Evidential Fusion for Air Quality Forecasting
标题:NeuroDDAF:基于证据融合的神经动态扩散-平流场用于空气质量预测
链接:https://arxiv.org/abs/2604.01175

作者:Prasanjit Dey,Soumyabrata Dev,Angela Meyer,Bianca Schoen-Phelan
备注:This manuscript is under review
摘要:准确的空气质量预测对于保护公众健康和指导环境政策至关重要,但由于非线性时空动力学、风驱动的输运以及跨区域的分布偏移,它仍然具有挑战性。基于物理的模型可解释,但计算昂贵且往往依赖限制性假设;而纯数据驱动的模型可以很准确,却可能缺乏鲁棒性和校准的不确定性。为解决这些局限,我们提出神经动态扩散-平流场(NeuroDDAF),一个将神经表示学习与开放系统输运建模统一起来的物理信息预测框架。NeuroDDAF集成了:(i)GRU-图注意力编码器,用于捕获时间动态与风感知的空间交互;(ii)带可学习残差的傅里叶域扩散-平流模块;(iii)风调制的潜在神经ODE,用于在时变连通性下建模连续时间演化;以及(iv)证据融合机制,自适应地结合物理引导预测与神经预测,同时量化不确定性。在四个城市数据集(北京、深圳、天津和安科纳)上针对1-3天预测范围的实验表明,NeuroDDAF持续优于包括AirPhyNet在内的强基线,在长期预测中RMSE最多降低9.7%,MAE最多降低9.4%。在北京数据集上,NeuroDDAF的1天预测RMSE为41.63 $μ$g/m$^3$,3天预测RMSE为48.88 $μ$g/m$^3$,是所有对比方法中的最佳表现。此外,NeuroDDAF改进了跨城市泛化能力,并产生校准良好的不确定性估计,这在不同风况下的集合方差分析与案例研究中得到了证实。
摘要:Accurate air quality forecasting is crucial for protecting public health and guiding environmental policy, yet it remains challenging due to nonlinear spatiotemporal dynamics, wind-driven transport, and distribution shifts across regions. Physics-based models are interpretable but computationally expensive and often rely on restrictive assumptions, whereas purely data-driven models can be accurate but may lack robustness and calibrated uncertainty. To address these limitations, we propose Neural Dynamic Diffusion-Advection Fields (NeuroDDAF), a physics-informed forecasting framework that unifies neural representation learning with open-system transport modeling. NeuroDDAF integrates (i) a GRU-Graph Attention encoder to capture temporal dynamics and wind-aware spatial interactions, (ii) a Fourier-domain diffusion-advection module with learnable residuals, (iii) a wind-modulated latent Neural ODE to model continuous-time evolution under time-varying connectivity, and (iv) an evidential fusion mechanism that adaptively combines physics-guided and neural forecasts while quantifying uncertainty. Experiments on four urban datasets (Beijing, Shenzhen, Tianjin, and Ancona) across 1-3 day horizons show that NeuroDDAF consistently outperforms strong baselines, including AirPhyNet, achieving up to 9.7% reduction in RMSE and 9.4% reduction in MAE on long-term forecasts. On the Beijing dataset, NeuroDDAF attains an RMSE of 41.63 $μ$g/m$^3$ for 1-day prediction and 48.88 $μ$g/m$^3$ for 3-day prediction, representing the best performance among all compared methods. In addition, NeuroDDAF improves cross-city generalization and yields well-calibrated uncertainty estimates, as confirmed by ensemble variance analysis and case studies under varying wind conditions.
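
NeuroDDAF 的物理核是傅里叶域中的扩散-平流算子。下面的一维周期边界 NumPy 草图演示该算子的精确谱解(维度、边界条件与参数均为示意;论文还在其上叠加可学习残差并由风场调制):

import numpy as np

def fourier_diffusion_advection_step(c, dx, dt, D=0.5, v=1.0):
    """一步扩散-平流:∂c/∂t = D ∂²c/∂x² - v ∂c/∂x,
    在谱空间有精确解 ĉ(k) *= exp((-D k² - i v k) Δt)。"""
    k = 2 * np.pi * np.fft.fftfreq(len(c), d=dx)   # 角波数
    c_hat = np.fft.fft(c) * np.exp((-D * k**2 - 1j * v * k) * dt)
    return np.fft.ifft(c_hat).real

x = np.linspace(0, 10, 256, endpoint=False)
c = np.exp(-((x - 3) ** 2))                        # 初始浓度:高斯团
for _ in range(50):
    c = fourier_diffusion_advection_step(c, dx=x[1] - x[0], dt=0.02)
# 50 步后浓度团整体向 +x 方向平流约 v*t = 1,同时因扩散而展宽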


【3】Chameleons do not Forget: Prompt-Based Online Continual Learning for Next Activity Prediction
标题:变色龙不会忘记:基于提示的在线持续学习用于下一活动预测
链接:https://arxiv.org/abs/2604.00653

作者:Marwan Hassani,Tamara Verbeek,Sjoerd van Straten
备注:This paper has been accepted for publication in the International Journal of Cooperative Information Systems
摘要:预测性过程监控(PPM)侧重于预测未来的过程轨迹,包括下一活动预测。这在过程会变化或面临不确定性的动态环境中至关重要。然而,现有框架往往假设静态环境,忽视了动态特性与概念漂移。这会导致灾难性遗忘:只针对新数据分布的训练会损害模型在先前已学数据分布上的性能。持续学习所要解决的挑战之一正是缓解灾难性遗忘。本文提出一种名为"带提示的持续下一活动预测"(CNAPwP)的新方法,它将DualPrompt算法适配到下一活动预测任务,在缓解灾难性遗忘的同时提高准确性与适应性。我们引入了带有重现性概念漂移的新数据集,以及一个任务特定的遗忘度量,用于衡量任务首次出现与后续出现之间的预测准确率差距。在三个合成数据集和两个真实数据集(涵盖多种重现漂移设定)上的大量测试表明,与五个基线相比,CNAPwP取得了SOTA或有竞争力的结果,证明了其在真实场景中的潜在适用性。我们方法的开源实现以及数据集和结果见:https://github.com/SvStraten/CNAPwP。
摘要:Predictive process monitoring (PPM) focuses on predicting future process trajectories, including next activity predictions. This is crucial in dynamic environments where processes change or face uncertainty. However, current frameworks often assume a static environment, overlooking dynamic characteristics and concept drifts. This results in catastrophic forgetting, where training while focusing merely on new data distribution negatively impacts the performance on previously learned data distributions. Continual learning addresses, among others, the challenges related to mitigating catastrophic forgetting. This paper proposes a novel approach called Continual Next Activity Prediction with Prompts (CNAPwP), which adapts the DualPrompt algorithm for next activity prediction to improve accuracy and adaptability while mitigating catastrophic forgetting. We introduce new datasets with recurring concept drifts, alongside a task-specific forgetting metric that measures the prediction accuracy gap between initial occurrence and subsequent task occurrences. Extensive testing on three synthetic and two real-world datasets representing several setups of recurrent drifts shows that CNAPwP achieves SOTA or competitive results compared to five baselines, demonstrating its potential applicability in real-world scenarios. An open-source implementation of our method, together with the datasets and results, is available at: https://github.com/SvStraten/CNAPwP.


【4】Predicting Dynamics of Ultra-Large Complex Systems by Inferring Governing Equations
标题:通过推理控制方程预测超大型复杂系统的动力学
链接:https://arxiv.org/abs/2604.00599

作者:Qi Shao,Duxin Chen,Jiawen Chen,Yujie Zeng,Athen Ma,Wenwu Yu,Vito Latora,Wei Lin
备注:15 pages, 5 figures, under review
摘要:Predicting the behavior of ultra-large complex systems, from climate to biological and technological networks, is a central unsolved challenge. Existing approaches face a fundamental trade-off: equation discovery methods provide interpretability but fail to scale, while neural networks scale but operate as black boxes and often lose reliability over long times. Here, we introduce the Sparse Identification Graph Neural Network, a framework that overcome this divide by allowing to infer the governing equations of large networked systems from data. By defining symbolic discovery as edge-level information, SIGN decouples the scalability of sparse identification from network size, enabling efficient equation discovery even in large systems. SIGN allows to study networks with over 100,000 nodes while remaining robust to noise, sparse sampling, and missing data. Across diverse benchmark systems, including coupled chaotic oscillators, neural dynamics, and epidemic spreading, it recovers governing equations with high precision and sustains accurate long-term predictions. Applied to a data set of time series of temperature measurements in 71,987 sea surface positions, SIGN identifies a compact predictive network model and captures large-scale sea surface temperature conditions up to two years in advance. By enabling equation discovery at previously inaccessible scales, SIGN opens a path toward interpretable and reliable prediction of real-world complex systems.


【5】When Career Data Runs Out: Structured Feature Engineering and Signal Limits for Founder Success Prediction
标题:当职业生涯数据耗尽时:创始人成功预测的结构化特征工程与信号极限
链接:https://arxiv.org/abs/2604.00339

作者:Yagiz Ihlamur
备注:4 pages, 4 tables. Accepted at SecureFinAI Contest @ IEEE IDS 2026. Code: https://github.com/ihlamury/vcbench
摘要:Predicting startup success from founder career data is hard. The signal is weak, the labels are rare (9%), and most founders who succeed look almost identical to those who fail. We engineer 28 structured features directly from raw JSON fields -- jobs, education, exits -- and combine them with a deterministic rule layer and XGBoost boosted stumps. Our model achieves Val F0.5 = 0.3030, Precision = 0.3333, Recall = 0.2222 -- a +17.7pp improvement over the zero-shot LLM baseline. We then run a controlled experiment: extract 9 features from the prose field using Claude Haiku, at 67% and 100% dataset coverage. LLM features capture 26.4% of model importance but add zero CV signal (delta = -0.05pp). The reason is structural: anonymised_prose is generated from the same JSON fields we parse directly -- it is a lossy re-encoding, not a richer source. The ceiling (CV ~= 0.25, Val ~= 0.30) reflects the information content of this dataset, not a modeling limitation. In characterizing where the signal runs out and why, this work functions as a benchmark diagnostic -- one that points directly to what a richer dataset would need to include.


【6】Multi-lingual Multi-institutional Electronic Health Record based Predictive Model
标题:基于多语言多机构电子健康记录的预测模型
链接:https://arxiv.org/abs/2604.00027

作者:Kyunghoon Hur,Heeyoung Kwak,Jinsu Jang,Nakhwan Kim,Edward Choi
备注:On revision stage, 10 main pages, 3 supplementary pages
摘要:Large-scale EHR prediction across institutions is hindered by substantial heterogeneity in schemas and code systems. Although Common Data Models (CDMs) can standardize records for multi-institutional learning, the manual harmonization and vocabulary mapping are costly and difficult to scale. Text-based harmonization provides an alternative by converting raw EHR into a unified textual form, enabling pooled learning without explicit standardization. However, applying this paradigm to multi-national datasets introduces an additional layer of heterogeneity, which is "language" that must be addressed for truly scalable EHRs learning. In this work, we investigate multilingual multi-institutional learning for EHR prediction, aiming to enable pooled training across multinational ICU datasets without manual standardization. We compare two practical strategies for handling language barriers: (i) directly modeling multilingual records with multilingual encoders, and (ii) translating non-English records into English via LLM-based word-level translation. Across seven public ICU datasets, ten clinical tasks with multiple prediction windows, translation-based lingual alignment yields more reliable cross-dataset performance than multilingual encoders. The multi-institutional learning model consistently outperforms strong baselines that require manual feature selection and harmonization, and also surpasses single-dataset training. We further demonstrate that text-based framework with lingual alignment effectively performs transfer learning via few-shot fine-tuning, with additional gains. To our knowledge, this is the first study to aggregate multilingual multinational ICU EHR datasets into one predictive model, providing a scalable path toward language-agnostic clinical prediction and future global multi-institutional EHR research.


其他神经网络|深度学习|模型|建模(26篇)

【1】Reconsidering Dependency Networks from an Information Geometry Perspective
标题:从信息几何的角度重新考虑依赖网络
链接:https://arxiv.org/abs/2604.01117

作者:Kazuya Takabatake,Shotaro Akaho
备注:25 pages, 7 figures
摘要:Dependency networks (Heckerman et al., 2000) provide a flexible framework for modeling complex systems with many variables by combining independently learned local conditional distributions through pseudo-Gibbs sampling. Despite their computational advantages over Bayesian and Markov networks, the theoretical foundations of dependency networks remain incomplete, primarily because their model distributions -- defined as stationary distributions of pseudo-Gibbs sampling -- lack closed-form expressions. This paper develops an information-geometric analysis of pseudo-Gibbs sampling, interpreting each sampling step as an m-projection onto a full conditional manifold. Building on this interpretation, we introduce the full conditional divergence and derive an upper bound that characterizes the location of the stationary distribution in the space of probability distributions. We then reformulate both structure and parameter learning as optimization problems that decompose into independent subproblems for each node, and prove that the learned model distribution converges to the true underlying distribution as the number of training samples grows to infinity. Experiments confirm that the proposed upper bound is tight in practice.
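
依赖网络的模型分布由"伪吉布斯采样"的平稳分布隐式定义——论文正是把每一步采样解释为向全条件流形的 m-投影。下面的 NumPy 草图用手工指定的三变量局部条件分布演示该采样过程(条件分布为示例假设,未必与任何一致的联合分布兼容,这正是依赖网络理论难点的来源):

import numpy as np

rng = np.random.default_rng(0)

def pseudo_gibbs(conditionals, x0, sweeps=2000):
    """循环遍历各变量,依次从独立学得的局部条件分布 P(x_i | x_{-i}) 中采样;
    模型分布即该马尔可夫链的平稳分布。"""
    x, samples = np.array(x0, dtype=int), []
    for _ in range(sweeps):
        for i, cond in enumerate(conditionals):
            p1 = cond(np.delete(x, i))     # P(x_i = 1 | 其余变量)
            x[i] = rng.random() < p1
        samples.append(x.copy())
    return np.array(samples)

sigmoid = lambda z: 1 / (1 + np.exp(-z))
conditionals = [                            # 三个二值变量的玩具条件分布
    lambda rest: sigmoid(1.5 * rest[0] - 0.5 * rest[1]),
    lambda rest: sigmoid(0.8 * rest[0] + 0.8 * rest[1] - 0.5),
    lambda rest: sigmoid(-1.0 * rest[0] + 2.0 * rest[1]),
]
samples = pseudo_gibbs(conditionals, x0=[0, 1, 0])
print(samples.mean(axis=0))                 # 平稳分布下各变量边缘均值的估计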


【2】Event Embedding of Protein Networks : Compositional Learning of Biological Function
标题:蛋白质网络的事件嵌入:生物功能的组成学习
链接:https://arxiv.org/abs/2604.00911

作者:Antonin Sulc
备注:Machine Learning for Genomics Explorations (MLGenX) ICLR 2026 Workshop
摘要:In this work, we study whether enforcing strict compositional structure in sequence embeddings yields meaningful geometric organization when applied to protein-protein interaction networks. Using Event2Vec, an additive sequence embedding model, we train 64-dimensional representations on random walks from the human STRING interactome, and compare against a DeepWalk baseline based on Word2Vec, trained on the same walks. We find that compositional structure substantially improves pathway coherence (30.2$\times$ vs 2.9$\times$ above random), functional analogy accuracy (mean similarity 0.966 vs 0.650), and hierarchical pathway organization, while geometric properties such as norm--degree anticorrelation are shared with or exceeded by the non-compositional baseline. These results indicate that enforced compositionality specifically benefits relational and compositional reasoning tasks in biological networks.


【3】Fatigue-Aware Learning to Defer via Constrained Optimisation
标题:通过约束优化实现疲劳感知的学习推迟
链接:https://arxiv.org/abs/2604.00904

作者:Zheng Zhang,Cuong C. Nguyen,David Rosewarne,Kevin Wells,Gustavo Carneiro
摘要:Learning to defer (L2D) enables human-AI cooperation by deciding when an AI system should act autonomously or defer to a human expert. Existing L2D methods, however, assume static human performance, contradicting well-established findings on fatigue-induced degradation. We propose Fatigue-Aware Learning to Defer via Constrained Optimisation (FALCON), which explicitly models workload-varying human performance using psychologically grounded fatigue curves. FALCON formulates L2D as a Constrained Markov Decision Process (CMDP) whose state includes both task features and cumulative human workload, and optimises accuracy under human-AI cooperation budgets via PPO-Lagrangian training. We further introduce FA-L2D, a benchmark that systematically varies fatigue dynamics from near-static to rapidly degrading regimes. Experiments across multiple datasets show that FALCON consistently outperforms state-of-the-art L2D methods across coverage levels, generalises zero-shot to unseen experts with different fatigue patterns, and demonstrates the advantage of adaptive human-AI collaboration over AI-only or human-only decision-making when coverage lies strictly between 0 and 1.


【4】Embedded Variational Neural Stochastic Differential Equations for Learning Heterogeneous Dynamics
标题:用于学习异质动力学的嵌入式变分神经随机微分方程
链接:https://arxiv.org/abs/2604.00669

作者:Sandeep Kumar Samota,Reema Gupta,Snehashish Chakraverty
摘要:This study examines the challenges of modeling complex and noisy data related to socioeconomic factors over time, with a focus on data from various districts in Odisha, India. Traditional time-series models struggle to capture both trends and variations together in this type of data. To tackle this, a Variational Neural Stochastic Differential Equation (V-NSDE) model is designed that combines the expressive dynamics of Neural SDEs with the generative capabilities of Variational Autoencoders (VAEs). This model uses an encoder and a decoder. The encoder takes the initial observations and district embeddings and translates them into a Gaussian distribution, which determines the mean and log-variance of the first latent state. Then the obtained latent state initiates the Neural SDE, which utilize neural networks to determine the drift and diffusion functions that govern continuous-time latent dynamics. These governing functions depend on the time index, latent state, and district embedding, which help the model learn the unique characteristics specific to each district. After that, using a probabilistic decoder, the observations are reconstructed from the latent trajectory. The decoder outputs a mean and log-variance for each time step, which follows the Gaussian likelihood. The Evidence Lower Bound (ELBO) training loss improves by adding a KL-divergence regularization term to the negative log-likelihood (nll). The obtained results demonstrate the effective learning of V-NSDE in recognizing complex patterns over time, yielding realistic outcomes that include clear trends and random fluctuations across different areas.


【5】Multi-Camera View Scaling for Data-Efficient Robot Imitation Learning
标题:多摄像机视图缩放实现数据高效的机器人模仿学习
链接:https://arxiv.org/abs/2604.00557

作者:Yichen Xie,Yixiao Wang,Shuqi Zhao,Cheng-En Wu,Masayoshi Tomizuka,Jianwen Xie,Hao-Shu Fang
摘要:The generalization ability of imitation learning policies for robotic manipulation is fundamentally constrained by the diversity of expert demonstrations, while collecting demonstrations across varied environments is costly and difficult in practice. In this paper, we propose a practical framework that exploits inherent scene diversity without additional human effort by scaling camera views during demonstration collection. Instead of acquiring more trajectories, multiple synchronized camera perspectives are used to generate pseudo-demonstrations from each expert trajectory, which enriches the training distribution and improves viewpoint invariance in visual representations. We analyze how different action spaces interact with view scaling and show that camera-space representations further enhance diversity. In addition, we introduce a multiview action aggregation method that allows single-view policies to benefit from multiple cameras during deployment. Extensive experiments in simulation and real-world manipulation tasks demonstrate significant gains in data efficiency and generalization compared to single-view baselines. Our results suggest that scaling camera views provides a practical and scalable solution for imitation learning, which requires minimal additional hardware setup and integrates seamlessly with existing imitation learning algorithms. The website of our project is https://yichen928.github.io/robot_multiview.


【6】Does Unification Come at a Cost? Uni-SafeBench: A Safety Benchmark for Unified Multimodal Large Models
标题:统一是有代价的吗?Uni-SafeBench:统一多模态大型模型的安全基准
链接:https://arxiv.org/abs/2604.00547

作者:Zixiang Peng,Yongxiu Xu,Qinyi Zhang,Jiexun Shen,Yifan Zhang,Hongbo Xu,Yubin Wang,Gaopeng Gou
摘要:Unified Multimodal Large Models (UMLMs) integrate understanding and generation capabilities within a single architecture. While this architectural unification, driven by the deep fusion of multimodal features, enhances model performance, it also introduces important yet underexplored safety challenges. Existing safety benchmarks predominantly focus on isolated understanding or generation tasks, failing to evaluate the holistic safety of UMLMs when handling diverse tasks under a unified framework. To address this, we introduce Uni-SafeBench, a comprehensive benchmark featuring a taxonomy of six major safety categories across seven task types. To ensure rigorous assessment, we develop Uni-Judger, a framework that effectively decouples contextual safety from intrinsic safety. Based on comprehensive evaluations across Uni-SafeBench, we uncover that while the unification process enhances model capabilities, it significantly degrades the inherent safety of the underlying LLM. Furthermore, open-source UMLMs exhibit much lower safety performance than multimodal large models specialized for either generation or understanding tasks. We open-source all resources to systematically expose these risks and foster safer AGI development.


【7】Learning from Many and Adapting to the Unknown in Open-set Test Streams
标题:开放集测试流中的多源学习与未知适应
链接:https://arxiv.org/abs/2604.00533

作者:Xiao Zhang,Juntao Lyu,Tianyu Hu,Qianchuan Zhao,Huimin Ma
摘要:Large Language Models (LLMs) generalize across tasks via reusable representations and flexible reasoning, yet remain brittle in real deployment under evolving tasks and continual distribution shift. A common approach is Test-Time Adaptation (TTA), existing ones of which updates models with hand-designed unsupervised objectives over the full parameter space and mostly overlook preserving shared source knowledge and the reliability of adaptation signals. Drawing on molecular signaling cascades of memory updating in Drosophila, we propose Synapse Consolidation (SyCo), a parameter-efficient LLM adaptation method that updates low-rank adapters through Rac1 and MAPK pathways under the guidance of a structured TTA objective driven by problem understanding, process understanding, and source-domain guardrail. Rac1 confines plasticity to a tail-gradient subspace that is less critical for source knowledge, enabling rapid specialization while preserving source representations. MAPK uses a tiered controller to suppress noisy updates and consolidate useful adaptations under non-stationary streams. To model real deployments with multiple sources and continually emerging tasks, we introduce Multi-source Open-set Adaptation (MOA) setting, where a model is trained on multiple labeled source tasks and then adapts on open, non-stationary unlabeled test streams that mix seen and unseen tasks with partial overlap in label and intent space. Across 18 NLP datasets and the MOA setting, SyCo consistently outperforms strong baselines, achieving 78.31% on unseen-task adaptation and 85.37% on unseen-data shifts.


【8】Towards Initialization-dependent and Non-vacuous Generalization Bounds for Overparameterized Shallow Neural Networks
标题:过参数化浅层神经网络的初始化相关与非空洞泛化界
链接:https://arxiv.org/abs/2604.00505

作者:Yunwen Lei,Yufeng Xie
摘要:Overparameterized neural networks often show a benign overfitting property in the sense of achieving excellent generalization behavior despite the number of parameters exceeding the number of training examples. A promising direction to explain benign overfitting is to relate generalization to the norm of distance from initialization, motivated by the empirical observations that this distance is often significantly smaller than the norm itself. However, the existing initialization-dependent complexity analyses cannot fully exploit the power of initialization since the associated bounds depend on the spectral norm of the initialization matrix, which can scale as a square-root function of the width and are therefore not effective for overparameterized models. In this paper, we develop the first \emph{fully} initialization-dependent complexity bounds for shallow neural networks with general Lipschitz activation functions, which enjoys a logarithmic dependency on the width. Our bounds depend on the path-norm of the distance from initialization, which are derived by introducing a new peeling technique to handle the challenge along with the initialization-dependent constraint. We also develop a lower bound tight up to a constant factor. Finally, we conduct empirical comparisons and show that our generalization analysis implies non-vacuous bounds for overparameterized networks.
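
下面的 PyTorch 草图对比"朴素路径范数"与"距初始化的路径范数"在过参数化浅层网络上的量级差别。此处对 f(x) = Σ_j v_j σ(w_j·x) 采用 Σ_j |v_j - v0_j|·||w_j - w0_j|| 这一示意性写法;论文界中所用的确切定义以原文为准:

import torch

def path_norm_from_init(W, v, W0, v0):
    """距初始化的路径范数(示意定义):对扰动量逐隐单元累加 |Δv_j|·||Δw_j||。"""
    return ((v - v0).abs() * (W - W0).norm(dim=1)).sum()

torch.manual_seed(0)
d, m = 10, 1000                      # 宽度 m 远大于输入维度:过参数化情形
W0, v0 = torch.randn(m, d), torch.randn(m)
# 模拟训练后参数只在初始化附近小幅移动(实践中常见的现象)
W, v = W0 + 0.01 * torch.randn(m, d), v0 + 0.01 * torch.randn(m)
print(path_norm_from_init(W, v, W0, v0))   # 小:只随扰动幅度缩放
print((v.abs() * W.norm(dim=1)).sum())     # 大:朴素路径范数随宽度增长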


【9】Phase space integrity in neural network models of Hamiltonian dynamics: A Lagrangian descriptor approach
标题:哈密顿动力学神经网络模型中的相空间完整性:拉格朗日描述子方法
链接:https://arxiv.org/abs/2604.00473

作者:Abrari Noor Hasmi,Haralampos Hatzikirou,Hadi Susanto
备注:40 pages, 22 figures
摘要:We propose Lagrangian Descriptors (LDs) as a diagnostic framework for evaluating neural network models of Hamiltonian systems beyond conventional trajectory-based metrics. Standard error measures quantify short-term predictive accuracy but provide little insight into global geometric structures such as orbits and separatrices. Existing evaluation tools in dissipative systems are inadequate for Hamiltonian dynamics due to fundamental differences in the systems. By constructing probability density functions weighted by LD values, we embed geometric information into a statistical framework suitable for information-theoretic comparison. We benchmark physically constrained architectures (SympNet, HénonNet, Generalized Hamiltonian Neural Networks) against data-driven Reservoir Computing across two canonical systems. For the Duffing oscillator, all models recover the homoclinic orbit geometry with modest data requirements, though their accuracy near critical structures varies. For the three-mode nonlinear Schrödinger equation, however, clear differences emerge: symplectic architectures preserve energy but distort phase-space topology, while Reservoir Computing, despite lacking explicit physical constraints, reproduces the homoclinic structure with high fidelity. These results demonstrate the value of LD-based diagnostics for assessing not only predictive performance but also the global dynamical integrity of learned Hamiltonian models.


【10】Learning Humanoid Navigation from Human Data
标题:从人类数据学习类人导航
链接:https://arxiv.org/abs/2604.00416

作者:Weizhuo Wang,Yanjie Ze,C. Karen Liu,Monroe Kennedy
备注:8 pages 8 figures
摘要:We present EgoNav, a system that enables a humanoid robot to traverse diverse, unseen environments by learning entirely from 5 hours of human walking data, with no robot data or finetuning. A diffusion model predicts distributions of plausible future trajectories conditioned on past trajectory, a 360 deg visual memory fusing color, depth, and semantics, and video features from a frozen DINOv3 backbone that capture appearance cues invisible to depth sensors. A hybrid sampling scheme achieves real-time inference in 10 denoising steps, and a receding-horizon controller selects paths from the predicted distribution. We validate EgoNav through offline evaluations, where it outperforms baselines in collision avoidance and multi-modal coverage, and through zero-shot deployment on a Unitree G1 humanoid across unseen indoor and outdoor environments. Behaviors such as waiting for doors to open, navigating around crowds, and avoiding glass walls emerge naturally from the learned prior. We will release the dataset and trained models. Our website: https://egonav.weizhuowang.com


【11】Deep Networks Favor Simple Data
标题:深度网络青睐简单数据
链接:https://arxiv.org/abs/2604.00394

作者:Weyl Lu,Chenjie Hao,Yubei Chen
摘要:Estimated density is often interpreted as indicating how typical a sample is under a model. Yet deep models trained on one dataset can assign \emph{higher} density to simpler out-of-distribution (OOD) data than to in-distribution test data. We refer to this behavior as the OOD anomaly. Prior work typically studies this phenomenon within a single architecture, detector, or benchmark, implicitly assuming certain canonical densities. We instead separate the trained network from the density estimator built from its representations or outputs. We introduce two estimators, Jacobian-based estimators and autoregressive self-estimators, making density analysis applicable to a wide range of models.   Applying this perspective to a range of models, including iGPT, PixelCNN++, Glow, score-based diffusion models, DINOv2, and I-JEPA, we find the same striking regularity that goes beyond the OOD anomaly: \textbf{lower-complexity samples receive higher estimated density, while higher-complexity samples receive lower estimated density}. This ordering appears within a test set and across OOD pairs such as CIFAR-10 and SVHN, and remains highly consistent across independently trained models. To quantify these orderings, we compute Spearman rank correlations and find striking agreement both across models and with external complexity metrics. Even when trained only on the lowest-density (most complex) samples, or \textbf{even a single such sample}, the resulting models still rank simpler images as higher density.   These observations lead us beyond the original OOD anomaly to a more general conclusion: deep networks consistently favor simple data. Our goal is not to close this question, but to define and visualize it more clearly. We broaden its empirical scope and show that it appears across architectures, objectives, and density estimators.
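The headline correlation is easy to probe with a compression-based complexity proxy; the zlib proxy and the placeholder log-densities below are our stand-ins for the paper's estimators, so the random data here will show rho near zero rather than the strong negative values reported for trained models.

# complexity(x) ~ zlib-compressed byte length; density would come from a
# fitted model's log-density. Both arrays below are synthetic placeholders.
import zlib
import numpy as np
from scipy.stats import spearmanr

def complexity(img_uint8):
    # HxWxC uint8 image; compressed size is a crude Kolmogorov-style proxy
    return len(zlib.compress(img_uint8.tobytes(), level=9))

rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(256, 32, 32, 3), dtype=np.uint8)
log_density = rng.normal(size=256)    # replace with model.log_prob(images)

c = np.array([complexity(im) for im in images])
rho, p = spearmanr(log_density, c)
print(f"Spearman rho(log-density, complexity) = {rho:.3f} (p = {p:.2g})")
# The paper's finding corresponds to strongly negative rho: more compressible
# (simpler) samples receive higher estimated density.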


【12】Gradient-Based Data Valuation Improves Curriculum Learning for Game-Theoretic Motion Planning
标题:基于梯度的数据估值改进博弈论运动规划的课程学习
链接:https://arxiv.org/abs/2604.00388

作者:Shihao Li,Jiachen Li,Dongmei Chen
摘要:We demonstrate that gradient-based data valuation produces curriculum orderings that significantly outperform metadata-based heuristics for training game-theoretic motion planners. Specifically, we apply TracIn gradient-similarity scoring to GameFormer on the nuPlan benchmark and construct a curriculum that weights training scenarios by their estimated contribution to validation loss reduction. Across three random seeds, the TracIn-weighted curriculum achieves a mean planning ADE of $1.704\pm0.029$\,m, significantly outperforming the metadata-based interaction-difficulty curriculum ($1.822\pm0.014$\,m; paired $t$-test $p=0.021$, Cohen's $d_z=3.88$) while exhibiting lower variance than the uniform baseline ($1.772\pm0.134$\,m). Our analysis reveals that TracIn scores and scenario metadata are nearly orthogonal (Spearman $ρ=-0.014$), indicating that gradient-based valuation captures training dynamics invisible to hand-crafted features. We further show that gradient-based curriculum weighting succeeds where hard data selection fails: TracIn-curated 20\% subsets degrade performance by $2\times$, whereas full-data curriculum weighting with the same scores yields the best results. These findings establish gradient-based data valuation as a practical tool for improving sample efficiency in game-theoretic planning.
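For readers unfamiliar with the scoring step: TracIn-style valuation rates a training example by the inner product between its loss gradient and the gradient of the validation loss, so examples whose updates would reduce validation loss score highly. A single-checkpoint sketch with a toy model standing in for GameFormer (TracIn proper sums such products over several checkpoints):

# score(z) = <grad L(z), grad L(validation batch)> at one checkpoint.
import torch

def flat_grad(loss, params):
    grads = torch.autograd.grad(loss, params)
    return torch.cat([g.reshape(-1) for g in grads])

model = torch.nn.Linear(8, 1)                  # toy stand-in for the planner
params = [p for p in model.parameters() if p.requires_grad]
loss_fn = torch.nn.MSELoss()

x_val, y_val = torch.randn(64, 8), torch.randn(64, 1)
g_val = flat_grad(loss_fn(model(x_val), y_val), params)

def tracin_score(x, y):
    # positive score = this example's update helps reduce validation loss
    g = flat_grad(loss_fn(model(x), y), params)
    return torch.dot(g, g_val).item()

scores = torch.tensor([tracin_score(torch.randn(1, 8), torch.randn(1, 1))
                       for _ in range(100)])
# Curriculum weights from scores; the softmax temperature is one of many choices.
weights = torch.softmax(scores / scores.std(), dim=0)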


【13】MVNN: A Measure-Valued Neural Network for Learning McKean-Vlasov Dynamics from Particle Data
标题:MVNN:一个用于从粒子数据学习McKean-Vlasov动力学的测度值神经网络
链接:https://arxiv.org/abs/2604.00333

作者:Liyao Lyu,Xinyue Yu,Hayden Schaeffer
摘要:Collective behaviors that emerge from interactions are fundamental to numerous biological systems. To learn such interacting forces from observations, we introduce a measure-valued neural network that infers measure-dependent interaction (drift) terms directly from particle-trajectory observations. The proposed architecture generalizes standard neural networks to operate on probability measures by learning cylindrical features, using an embedding network that produces scalable distribution-to-vector representations. On the theory side, we establish well-posedness of the resulting dynamics and prove propagation-of-chaos for the associated interacting-particle system. We further show universal approximation and quantitative approximation rates under a low-dimensional measure-dependence assumption. Numerical experiments on first and second order systems, including deterministic and stochastic Motsch-Tadmor dynamics, two-dimensional attraction-repulsion aggregation, Cucker-Smale dynamics, and a hierarchical multi-group system, demonstrate accurate prediction and strong out-of-distribution generalization.
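The cylindrical-feature construction can be sketched as a DeepSets-style map: an empirical measure mu = (1/N) sum_i delta_{x_i} is embedded by averaging a learned feature map over its particles, and the drift conditions on that embedding. The architecture below is our minimal rendering, not the paper's exact MVNN:

# Sketch: distribution-to-vector embedding E_{x~mu}[phi(x)], then a drift head.
import torch
import torch.nn as nn

class MeasureEmbedding(nn.Module):
    def __init__(self, dim, feat=64, out=32):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(dim, feat), nn.Tanh(),
                                 nn.Linear(feat, feat))
        self.rho = nn.Sequential(nn.Linear(feat, out), nn.Tanh())

    def forward(self, particles):             # particles: (N, dim)
        # mean over particles = integral of phi against the empirical measure
        return self.rho(self.phi(particles).mean(dim=0))

class DriftNet(nn.Module):
    """b(x, mu): drift depending on own state and the empirical measure."""
    def __init__(self, dim, emb=32, width=64):
        super().__init__()
        self.measure = MeasureEmbedding(dim, out=emb)
        self.head = nn.Sequential(nn.Linear(dim + emb, width), nn.Tanh(),
                                  nn.Linear(width, dim))

    def forward(self, x, particles):          # x: (B, dim)
        m = self.measure(particles).expand(x.shape[0], -1)
        return self.head(torch.cat([x, m], dim=-1))

drift = DriftNet(dim=2)
xs = torch.randn(128, 2)                      # particle cloud
print(drift(xs, xs).shape)                    # predicted drift per particle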


【14】SYNTHONY: A Stress-Aware, Intent-Conditioned Agent for Deep Tabular Generative Models Selection
标题:SYNTHONY:用于深度表格生成模型选择的压力感知、意图条件化智能体
链接:https://arxiv.org/abs/2604.00293

作者:Hochan Son,Xiaofeng Lin,Jason Ni,Guang Cheng
摘要:Deep generative models for tabular data (GANs, diffusion models, and LLM-based generators) exhibit highly non-uniform behavior across datasets; the best-performing synthesizer family depends strongly on distributional stressors such as long-tailed marginals, high-cardinality categorical features, Zipfian imbalance, and small-sample regimes. This brittleness makes practical deployment challenging, especially when users must balance competing objectives of fidelity, privacy, and utility. We study intent-conditioned tabular synthesis selection: given a dataset and a user intent expressed as a preference over evaluation metrics, the goal is to select a synthesizer that minimizes regret relative to an intent-specific oracle. We propose stress profiling, a synthesis-specific meta-feature representation that quantifies dataset difficulty along four interpretable stress dimensions, and integrate it into SYNTHONY, a selection framework that matches stress profiles against a calibrated capability registry of synthesizer families. Across a benchmark of 7 datasets, 10 synthesizers, and 3 intents, we demonstrate that stress-based meta-features are highly predictive of synthesizer performance: a $k$NN selector using these features achieves strong Top-1 selection accuracy, substantially outperforming zero-shot LLM selectors and random baselines. We analyze the gap between meta-feature-based and capability-based selection, identifying the hand-crafted capability registry as the primary bottleneck and motivating learned capability representations as a direction for future work.
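The selector itself admits a very small sketch once stress profiles exist; the stress dimensions, toy values, and synthesizer labels below are illustrative placeholders rather than SYNTHONY's registry:

# A kNN over stress-profile meta-features predicts the synthesizer family
# expected to minimize regret. All values here are invented for illustration.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Rows: past datasets; columns: stress dimensions such as tail heaviness,
# categorical cardinality, Zipfian imbalance, and sample-size pressure.
stress_profiles = np.array([[0.9, 0.1, 0.3, 0.2],
                            [0.2, 0.8, 0.7, 0.1],
                            [0.1, 0.2, 0.1, 0.9],
                            [0.8, 0.7, 0.2, 0.3]])
best_synthesizer = np.array(["diffusion", "llm", "gan", "diffusion"])

selector = KNeighborsClassifier(n_neighbors=1).fit(stress_profiles,
                                                   best_synthesizer)
new_dataset = np.array([[0.85, 0.15, 0.25, 0.3]])
print(selector.predict(new_dataset))  # family predicted to minimize regret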


【15】MambaVoiceCloning: Efficient and Expressive Text-to-Speech via State-Space Modeling and Diffusion Control
标题:MambaVoiceCloning:通过状态空间建模和扩散控制实现高效且富有表达力的文本到语音
链接:https://arxiv.org/abs/2604.00292

作者:Sahil Kumar,Namrataben Patel,Honggang Wang,Youshan Zhang
备注:Accepted at ICLR 2026
摘要 :MambaVoiceCloning (MVC) asks whether the conditioning path of diffusion-based TTS can be made fully SSM-only at inference, removing all attention and explicit RNN-style recurrence layers across text, rhythm, and prosody, while preserving or improving quality under controlled conditions. MVC combines a gated bidirectional Mamba text encoder, a Temporal Bi-Mamba supervised by a lightweight alignment teacher discarded after training, and an Expressive Mamba with AdaLN modulation, yielding linear-time O(T) conditioning with bounded activation memory and practical finite look-ahead streaming. Unlike prior Mamba-TTS systems that remain hybrid at inference, MVC removes attention-based duration and style modules under a fixed StyleTTS2 mel-diffusion-vocoder backbone. Trained on LJSpeech/LibriTTS and evaluated on VCTK, CSS10 (ES/DE/FR), and long-form Gutenberg passages, MVC achieves modest but statistically reliable gains over StyleTTS2, VITS, and Mamba-attention hybrids in MOS/CMOS, F0 RMSE, MCD, and WER, while reducing encoder parameters to 21M and improving throughput by 1.6x. Diffusion remains the dominant latency source, but SSM-only conditioning improves memory footprint, stability, and deployability.


【16】Informed Machine Learning with Knowledge Landmarks
标题:具有知识地标的知情机器学习
链接:https://arxiv.org/abs/2604.00256

作者:Chuyi Dai,Witold Pedrycz,Suping Xu,Ding Liu,Xianmin Wang
摘要:Informed Machine Learning has emerged as a viable generalization of Machine Learning (ML) by building a unified conceptual and algorithmic setting for constructing models on a unified basis of knowledge and data. Physics-informed ML involving physics equations is one of the developments within Informed Machine Learning. This study proposes a novel direction of Knowledge-Data ML, referred to as KD-ML, where numeric data are integrated with knowledge tidbits expressed in the form of granular knowledge landmarks. We advocate that data and knowledge are complementary in several fundamental ways: data are precise (numeric) and local, usually confined to some region of the input space, while knowledge is global and formulated at a higher level of abstraction. The knowledge can be represented as information granules and organized as a collection of input-output information granules called knowledge landmarks. In virtue of this evident complementarity, we develop a comprehensive design process for the KD-ML model and formulate an original augmented loss function L that additively combines a component responsible for optimizing the model on the available numeric data with a second component, playing the role of a granular regularizer, that makes the model adhere to the granular constraints (knowledge landmarks). We show the role of the hyperparameter in the loss function, which balances the contribution and guiding role of data and knowledge, and point to some essential tendencies associated with the quality of data (noise level) and the level of granularity of the knowledge landmarks. Experiments on two physics-governed benchmarks demonstrate that the proposed KD model consistently outperforms data-driven ML models.
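A minimal sketch of the augmented loss, modeling each knowledge landmark as an input-output pair of intervals and penalizing predictions that leave the output granule whenever the input falls in the input granule; the hinge form and the weighting lam are our assumptions:

# L = L_data + lam * L_knowledge, with interval granules as landmarks.
import torch

def kd_loss(model, x, y, landmarks, lam=0.5):
    data = torch.mean((model(x) - y) ** 2)            # precise, local data term
    knowledge = x.new_zeros(())
    for (x_lo, x_hi, y_lo, y_hi) in landmarks:        # granular, global term
        inside = ((x >= x_lo) & (x <= x_hi)).all(dim=-1)
        if inside.any():
            pred = model(x[inside])
            below = torch.relu(y_lo - pred)           # violation beneath granule
            above = torch.relu(pred - y_hi)           # violation above granule
            knowledge = knowledge + (below + above).mean()
    return data + lam * knowledge                     # lam balances data vs. knowledge

model = torch.nn.Sequential(torch.nn.Linear(1, 32), torch.nn.Tanh(),
                            torch.nn.Linear(32, 1))
x, y = torch.rand(256, 1), torch.rand(256, 1)
landmarks = [(0.0, 0.2, 0.0, 0.3), (0.8, 1.0, 0.7, 1.0)]  # toy granules
loss = kd_loss(model, x, y, landmarks)
loss.backward()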


【17】Lévy-Flow Models: Heavy-Tail-Aware Normalizing Flows for Financial Risk Management
标题:Lévy-Flow模型:面向金融风险管理的重尾感知归一化流
链接:https://arxiv.org/abs/2604.00195

作者:Rachid Drissi
备注:15 pages, 5 figures, 7 tables
摘要:We introduce Lévy-Flows, a class of normalizing flow models that replace the standard Gaussian base distribution with Lévy process-based distributions, specifically Variance Gamma (VG) and Normal-Inverse Gaussian (NIG). These distributions naturally capture heavy-tailed behavior while preserving exact likelihood evaluation and efficient reparameterized sampling. We establish theoretical guarantees on tail behavior, showing that for regularly varying bases the tail index is preserved under asymptotically linear flow transformations, and that identity-tail Neural Spline Flow architectures preserve the base distribution's tail shape exactly outside the transformation region. Empirically, we evaluate on S&P 500 daily returns and additional assets, demonstrating substantial improvements in density estimation and risk calibration. VG-based flows reduce test negative log-likelihood by 69% relative to Gaussian flows and achieve exact 95% VaR calibration, while NIG-based flows provide the most accurate Expected Shortfall estimates. These results show that incorporating Lévy process structure into normalizing flows yields significant gains in modeling heavy-tailed data, with applications to financial risk management.
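To see the base-distribution choice in isolation (not the flow itself), one can compare a fitted Normal-Inverse Gaussian against a Gaussian on heavy-tailed returns; the synthetic data and scipy-based fit below are our stand-ins for the paper's pipeline:

# Fit NIG vs. Gaussian to fat-tailed "returns"; compare NLL and the 5% quantile.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
returns = stats.t.rvs(df=3, scale=0.01, size=5000, random_state=rng)

a, b, loc, scale = stats.norminvgauss.fit(returns)
nig = stats.norminvgauss(a, b, loc=loc, scale=scale)
gauss = stats.norm(*stats.norm.fit(returns))

for name, dist in [("NIG", nig), ("Gaussian", gauss)]:
    nll = -dist.logpdf(returns).mean()
    var95 = -dist.ppf(0.05)            # 95% Value-at-Risk (loss quantile)
    print(f"{name}: NLL={nll:.3f}  VaR95={var95:.4f}")
# In a Lévy-Flow, such a heavy-tailed law replaces the N(0, I) base of the
# normalizing flow while exact log-likelihood evaluation remains available.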


【18】Speeding Up Mixed-Integer Programming Solvers with Sparse Learning for Branching
标题:利用稀疏分支学习加速混合整数规划求解器
链接:https://arxiv.org/abs/2604.00094

作者:Selin Bayramoğlu,George L Nemhauser,Nikolaos V Sahinidis
备注:21 pages, 2 figures
摘要:Machine learning is increasingly used to improve decisions within branch-and-bound algorithms for mixed-integer programming. Many existing approaches rely on deep learning, which often requires very large training datasets and substantial computational resources for both training and deployment, typically with GPU parallelization. In this work, we take a different path by developing interpretable models that are simple but effective. We focus on approximating strong branching (SB) scores, a highly effective yet computationally expensive branching rule. Using sparse learning methods, we build models with fewer than 4% of the parameters of a state-of-the-art graph neural network (GNN) while achieving competitive accuracy. In solver experiments, our CPU-only models lead to faster solves than both SCIP's default branching rules and the GPU-accelerated GNN-based model. The models are simple to train and deploy, and they remain effective with small training sets, which makes them practical in low-resource settings. Extensive experiments across diverse problem classes demonstrate the efficiency of this approach.
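The core recipe admits a compact sketch: regress strong-branching scores on cheap per-variable features under an L1 penalty so that only a handful of coefficients survive. Features and targets below are synthetic placeholders:

# Sparse surrogate for strong-branching scores (our feature/model choices).
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 40))     # per-variable features (pseudocosts,
                                    # fractionality, constraint stats, ...)
true_w = np.zeros(40)
true_w[[0, 3, 7]] = [2.0, -1.5, 0.8]
y = X @ true_w + 0.1 * rng.normal(size=2000)   # stand-in for SB scores

model = Lasso(alpha=0.05).fit(X, y)
active = np.flatnonzero(model.coef_)
print(f"{active.size} of 40 features kept:", active)
# At each node, branch on the argmax of model.predict over candidate
# variables: a CPU-only surrogate for the expensive SB computation.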


【19】Learning to Play Blackjack: A Curriculum Learning Perspective
标题:学习玩二十一点:课程学习的角度
链接:https://arxiv.org/abs/2604.00076

作者:Amirreza Alasti,Efe Erdal,Yücel Celik,Theresa Eimer
备注:Accepted as an oral presentation at the International Conference on Distributed Artificial Intelligence (DAI 2025). 16 pages, 7 figures
摘要:Reinforcement Learning (RL) agents often struggle with efficiency and performance in complex environments. We propose a novel framework that uses a Large Language Model (LLM) to dynamically generate a curriculum over available actions, enabling the agent to incorporate each action individually. We apply this framework to the game of Blackjack, where the LLM creates a multi-stage training path that progressively introduces complex actions to a Tabular Q-Learning and a Deep Q-Network (DQN) agent. Our evaluation in a realistic 8-deck simulation over 10 independent runs demonstrates significant performance gains over standard training methods. The curriculum-based approach increases the DQN agent's average win rate from 43.97% to 47.41%, reduces the average bust rate from 32.9% to 28.0%, and accelerates the overall workflow by over 74%, with the agent's full training completing faster than the baseline's evaluation phase alone. These results validate that LLM-guided curricula can build more effective, robust, and efficient RL agents.


【20】Perspective: Towards sustainable exploration of chemical spaces with machine learning
标题:展望:通过机器学习实现化学空间的可持续探索
链接:https://arxiv.org/abs/2604.00069

作者:Leonardo Medrano Sandonas,David Balcells,Anton Bochkarev,Jacqueline M. Cole,Volker L. Deringer,Werner Dobrautz,Adrian Ehrenhofer,Thorben Frank,Pascal Friederich,Rico Friedrich,Janine George,Luca Ghiringhelli,Alejandra Hinostroza Caldas,Veronika Juraskova,Hannes Kneiding,Yury Lysogorskiy,Johannes T. Margraf,Hanna Türk,Anatole von Lilienfeld,Milica Todorović,Alexandre Tkatchenko,Mariana Rossi,Gianaurelio Cuniberti
备注:44 pages, 8 figures, SusML workshop
摘要:Artificial intelligence is transforming molecular and materials science, but its growing computational and data demands raise critical sustainability challenges. In this Perspective, we examine resource considerations across the AI-driven discovery pipeline--from quantum-mechanical (QM) data generation and model training to automated, self-driving research workflows--building on discussions from the ``SusML workshop: Towards sustainable exploration of chemical spaces with machine learning'' held in Dresden, Germany. In this context, the availability of large quantum datasets has enabled rigorous benchmarking and rapid methodological progress, while also incurring substantial energy and infrastructure costs. We highlight emerging strategies to enhance efficiency, including general-purpose machine learning (ML) models, multi-fidelity approaches, model distillation, and active learning. Moreover, incorporating physics-based constraints within hierarchical workflows, where fast ML surrogates are applied broadly and high-accuracy QM methods are used selectively, can further optimize resource use without compromising reliability. Equally important is bridging the gap between idealized computational predictions and real-world conditions by accounting for synthesizability and multi-objective design criteria, which is essential for practical impact. Finally, we argue that sustainable progress will rely on open data and models, reusable workflows, and domain-specific AI systems that maximize scientific value per unit of computation, enabling efficient and responsible discovery of technological materials and therapeutics.


【21】Temporal Memory for Resource-Constrained Agents: Continual Learning via Stochastic Compress-Add-Smooth
标题:资源受限智能体的时间记忆:通过随机压缩-添加-平滑实现持续学习
链接:https://arxiv.org/abs/2604.00067

作者:Michael Chertkov
备注:33 pages, 22 figures
摘要:An agent that operates sequentially must incorporate new experience without forgetting old experience, under a fixed memory budget. We propose a framework in which memory is not a parameter vector but a stochastic process: a Bridge Diffusion on a replay interval $[0,1]$, whose terminal marginal encodes the present and whose intermediate marginals encode the past. New experience is incorporated via a three-step \emph{Compress--Add--Smooth} (CAS) recursion. We test the framework on the class of models with marginal probability densities modeled via Gaussian mixtures of fixed number of components~$K$ in $d$ dimensions; temporal complexity is controlled by a fixed number~$L$ of piecewise-linear protocol segments whose nodes store Gaussian-mixture states. The entire recursion costs $O(LKd^2)$ flops per day -- no backpropagation, no stored data, no neural networks -- making it viable for controller-light hardware.   Forgetting in this framework arises not from parameter interference but from lossy temporal compression: the re-approximation of a finer protocol by a coarser one under a fixed segment budget. We find that the retention half-life scales linearly as $a_{1/2}\approx c\,L$ with a constant $c>1$ that depends on the dynamics but not on the mixture complexity~$K$, the dimension~$d$, or the geometry of the target family. The constant~$c$ admits an information-theoretic interpretation analogous to the Shannon channel capacity. The stochastic process underlying the bridge provides temporally coherent ``movie'' replay -- compressed narratives of the agent's history, demonstrated visually on an MNIST latent-space illustration. The framework provides a fully analytical ``Ising model'' of continual learning in which the mechanism, rate, and form of forgetting can be studied with mathematical precision.


【22】Inverse Design of Optical Multilayer Thin Films using Robust Masked Diffusion Models
标题:基于鲁棒掩模扩散模型的光学多层膜逆设计
链接:https://arxiv.org/abs/2604.01106

作者:Jonas Schaible,Asena Karolin Özdemir,Charlotte Debus,Sven Burger,Achim Streit,Christiane Becker,Klaus Jäger,Markus Götz
备注:24 pages, 14 Figures
摘要:Inverse design of optical multilayer stacks seeks to infer layer materials, thicknesses, and ordering from a desired target spectrum. It is a long-standing challenge due to the large design space and non-unique solutions. We introduce \texttt{OptoLlama}, a masked diffusion language model for inverse thin-film design from optical spectra. Representing multilayer stacks as sequences of material-thickness tokens, \texttt{OptoLlama} conditions generation on reflectance, absorptance, and transmittance spectra and learns a probabilistic mapping from optical response to structure. Evaluated on a representative test set of 3,000 targets, \texttt{OptoLlama} reduces the mean absolute spectral error by 2.9-fold relative to a nearest-neighbor template baseline and by 3.45-fold relative to the state-of-the-art data-driven baseline, called \texttt{OptoGPT}. Case studies on designed and expert-defined targets show that the model reproduces characteristic spectral features and recovers physically meaningful stack motifs, including distributed Bragg reflectors. These results establish diffusion-based sequence modeling as a powerful framework for inverse photonic design.


【23】Neural Ordinary Differential Equations for Modeling Socio-Economic Dynamics
标题:用于社会经济动态建模的神经常微分方程
链接:https://arxiv.org/abs/2604.00632

作者:Sandeep Kumar Samota,Snehashish Chakraverty,Narayan Sethi
摘要:Poverty is a complex dynamic challenge that cannot be adequately captured using predefined differential equations. Nowadays, machine learning (ML) methods have demonstrated significant potential in modelling real-world dynamical systems. Among these, Neural Ordinary Differential Equations (Neural ODEs) have emerged as a powerful, data-driven approach for learning continuous-time dynamics directly from observations. This chapter applies the Neural ODE framework to analyze poverty dynamics in the Indian state of Odisha. Specifically, we utilize time-series data from 2007 to 2020 on key indicators of economic development and poverty reduction. Within the Neural ODE architecture, the temporal gradient of the system is represented by a multi-layer perceptron (MLP). The resulting neural dynamical system is integrated with a numerical ODE solver to obtain the system's trajectory over time. The adjoint sensitivity method is utilized for gradient computation during training, facilitating effective backpropagation through the ODE solver. The trained Neural ODE model reproduces the observed data with high accuracy. This demonstrates the capability of Neural ODEs to capture the dynamics of the poverty indicator of concrete-structured households. The obtained results show that ML methods, such as Neural ODEs, can serve as effective tools for modeling socioeconomic transitions and can provide policymakers with reliable projections, supporting more informed and effective decision-making for poverty alleviation.
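A minimal Neural ODE of the kind described, with an MLP vector field and adjoint-based gradients via the third-party torchdiffeq package; the indicator dimensionality and training loop are our placeholders, not the chapter's data:

# dz/dt = f_theta(z), trained to reproduce a yearly indicator time series.
import torch
import torch.nn as nn
from torchdiffeq import odeint_adjoint as odeint   # adjoint sensitivity method

class PovertyDynamics(nn.Module):
    def __init__(self, dim=3, width=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, width), nn.Tanh(),
                                 nn.Linear(width, dim))

    def forward(self, t, z):           # t is unused for autonomous dynamics
        return self.net(z)

# 14 yearly observations (2007-2020) of `dim` indicators; synthetic stand-in.
t = torch.linspace(0.0, 13.0, 14)
z_obs = torch.cumsum(0.1 * torch.randn(14, 3), dim=0)

func = PovertyDynamics()
opt = torch.optim.Adam(func.parameters(), lr=1e-2)
for _ in range(200):
    opt.zero_grad()
    z_pred = odeint(func, z_obs[0], t)     # integrate from the initial state
    loss = torch.mean((z_pred - z_obs) ** 2)
    loss.backward()                        # gradients via the adjoint ODE
    opt.step()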


【24】Breaking Data Symmetry is Needed For Generalization in Feature Learning Kernels
标题:特征学习核的泛化需要打破数据对称性
链接:https://arxiv.org/abs/2604.00316

作者:Marcel Tomàs Bernal,Neil Rohit Mallinar,Mikhail Belkin
摘要:Grokking occurs when a model achieves high training accuracy but generalization to unseen test points happens long after that. This phenomenon was initially observed on a class of algebraic problems, such as learning modular arithmetic (Power et al., 2022). We study grokking on algebraic tasks in a class of feature learning kernels via the Recursive Feature Machine (RFM) algorithm (Radhakrishnan et al., 2024), which iteratively updates feature matrices through the Average Gradient Outer Product (AGOP) of an estimator in order to learn task-relevant features. Our main experimental finding is that generalization occurs only when a certain symmetry in the training set is broken. Furthermore, we empirically show that RFM generalizes by recovering the underlying invariance group action inherent in the data. We find that the learned feature matrices encode specific elements of the invariance group, explaining the dependence of generalization on symmetry.
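The AGOP at the heart of RFM is compact in code: average the outer products of per-example gradients of the predictor. The sketch below isolates this step for an arbitrary differentiable model; RFM additionally refits a kernel machine with M-weighted distances at every iteration:

# M = (1/n) * sum_i grad f(x_i) grad f(x_i)^T for a scalar-output predictor f.
import torch

def agop(f, X):
    X = X.clone().requires_grad_(True)
    # sum over independent scalar outputs yields per-example input gradients
    grads = torch.autograd.grad(f(X).sum(), X)[0]   # (n, d)
    return grads.T @ grads / X.shape[0]             # (d, d) feature matrix

f = torch.nn.Sequential(torch.nn.Linear(10, 32), torch.nn.Tanh(),
                        torch.nn.Linear(32, 1))
X = torch.randn(512, 10)
M = agop(f, X)
eigvals = torch.linalg.eigvalsh(M)
print(eigvals[-3:])   # top eigendirections = features the model actually uses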


【25】Isomorphic Functionalities between Ant Colony and Ensemble Learning: Part II-On the Strength of Weak Learnability and the Boosting Paradigm
标题:蚁群与集成学习之间的同构功能:第二部分——论弱可学习性的力量与Boosting范式
链接:https://arxiv.org/abs/2604.00038

作者:Ernest Fokoué,Gregory Babbitt,Yuval Levental
备注:21 pages, 5 figures, 4 tables
摘要:In Part I of this series, we established a rigorous mathematical isomorphism between ant colony decision-making and random forest learning, demonstrating that variance reduction through decorrelation is a universal principle shared by biological and computational ensembles. Here we turn to the complementary mechanism: bias reduction through adaptive weighting. Just as boosting algorithms sequentially focus on difficult instances, ant colonies dynamically amplify successful foraging paths through pheromone-mediated recruitment. We prove that these processes are mathematically isomorphic, establishing that the fundamental theorem of weak learnability has a direct analog in colony decision-making. We develop a formal mapping between AdaBoost's adaptive reweighting and ant recruitment dynamics, show that the margin theory of boosting corresponds to the stability of quorum decisions, and demonstrate through comprehensive simulation that ant colonies implementing adaptive recruitment achieve the same bias-reduction benefits as boosting algorithms. This completes a unified theory of ensemble intelligence, revealing that both variance reduction (Part I) and bias reduction (Part II) are manifestations of the same underlying mathematical principles governing collective intelligence in biological and computational systems.


【26】When and Where: A Model Hippocampal Network Unifies Formation of Time Cells and Place Cells
标题:何时何地:统一时间细胞和位置细胞形成的海马网络模型
链接:https://arxiv.org/abs/2604.00036

作者:Qiaorong S. Yu,Zhaoze Wang,Vijay Balasubramanian
备注:18 pages, 6 figures
摘要:Hippocampal place and time cells encode spatial and temporal aspects of experience. Both have the same neural substrate, but have been modeled as having different functions and mechanistic origins, place cells as continuous attractors, and time cells as leaky integrators. Here, we show that both types emerge from two dynamical regimes of a single recurrent network (RNN) modeling hippocampal CA3 as a predictive autoencoder. The network receives simulated, partially occluded ``experience vectors'' containing spatial patterns (location-specific activity sampled during environmental traversal) and/or temporal patterns (correlated activity pairs separated by ``void'' intervals), and is trained to reconstruct missing input. During spatial navigation, the network generates stable attractor-like place fields. But trained on temporally structured inputs, the network produces sequentially broadened fields, recapitulating time cells. By varying spatio-temporal input patterning, we observe hidden units transition smoothly between time cell-like and place cell-like representations. These results suggest a shared origin, but task-driven difference, between place and time cells.


其他(36篇)

【1】CliffSearch: Structured Agentic Co-Evolution over Theory and Code for Scientific Algorithm Discovery
标题:CliffSearch:面向科学算法发现的理论与代码结构化智能体协同进化
链接:https://arxiv.org/abs/2604.01210

作者:Youssef Mroueh,Carlos Fonseca,Brian Belgodere,David Cox
摘要:Scientific algorithm discovery is iterative: hypotheses are proposed, implemented, stress-tested, and revised. Current LLM-guided search systems accelerate proposal generation, but often under-represent scientific structure by optimizing code-only artifacts with weak correctness/originality gating. We present CliffSearch, an agentic evolutionary framework in which the core evolution operators (pair selection, crossover, mutation, and review) are implemented as LLM agents, and the loop is designed around three principles: (1) each node is a structured scientific artifact, instantiated in either theory+code or code_only mode, (2) reviewer judgments of correctness and originality are first-class selection gates alongside optimization of the benchmark metric of interest, and (3) mutation is split into exploration and correction pathways with distinct objectives. Exploration mutation imports ideas from adjacent scientific domains to increase novelty, while correction mutation performs targeted evidence-guided repair using reviewer signals over theory, code, benchmark results, and runtime errors. We illustrate the framework on three benchmark-grounded studies: transformer hyper-connection evolution, optimizer discovery on a fixed nanoGPT stack, and a smaller native-optimizer ablation. Across these settings, the same loop supports explicit metric direction, reproducible persistence, and reviewer-gated comparison of discoveries under controlled search conditions. The result is a discovery workflow that prioritizes scientific interpretability and correctness while optimizing task metrics under controlled novelty constraints, rather than maximizing candidate throughput alone. Full run artifacts, interactive visualizations, and exported best nodes for the reported studies are available at https://cliffsearch.ai.


【2】Neural Harmonic Textures for High-Quality Primitive Based Neural Reconstruction
标题:用于高质量基于基元的神经重建的神经调和纹理
链接:https://arxiv.org/abs/2604.01204

作者:Jorge Condor,Nicolas Moenne-Loccoz,Merlin Nimier-David,Piotr Didyk,Zan Gojcic,Qi Wu
摘要:Primitive-based methods such as 3D Gaussian Splatting have recently become the state-of-the-art for novel-view synthesis and related reconstruction tasks. Compared to neural fields, these representations are more flexible, adaptive, and scale better to large scenes. However, the limited expressivity of individual primitives makes modeling high-frequency detail challenging. We introduce Neural Harmonic Textures, a neural representation approach that anchors latent feature vectors on a virtual scaffold surrounding each primitive. These features are interpolated within the primitive at ray intersection points. Inspired by Fourier analysis, we apply periodic activations to the interpolated features, turning alpha blending into a weighted sum of harmonic components. The resulting signal is then decoded in a single deferred pass using a small neural network, significantly reducing computational cost. Neural Harmonic Textures yield state-of-the-art results in real-time novel view synthesis while bridging the gap between primitive- and neural-field-based reconstruction. Our method integrates seamlessly into existing primitive-based pipelines such as 3DGUT, Triangle Splatting, and 2DGS. We further demonstrate its generality with applications to 2D image fitting and semantic reconstruction.


【3】Screening Is Enough
标题:筛选就足够了
链接:https://arxiv.org/abs/2604.01178

作者:Ken M. Nakanishi
备注:21 pages, 13 figures
摘要:A core limitation of standard softmax attention is that it does not define a notion of absolute query--key relevance: attention weights are obtained by redistributing a fixed unit mass across all keys according to their relative scores. As a result, relevance is defined only relative to competing keys, and irrelevant keys cannot be explicitly rejected. We introduce Multiscreen, a language-model architecture built around a mechanism we call screening, which enables absolute query--key relevance. Instead of redistributing attention across all keys, screening evaluates each key against an explicit threshold, discarding irrelevant keys and aggregating the remaining keys, thereby removing global competition among keys. Across experiments, Multiscreen achieves comparable validation loss with approximately 40% fewer parameters than a Transformer baseline, enables stable optimization at substantially larger learning rates, maintains strong performance in long-context perplexity, shows little to no degradation in retrieval performance even far beyond the training context length, and reduces inference latency by up to 3.2$\times$ at 100K context length.
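A minimal sketch of screening as we read the abstract, with a soft threshold standing in for the paper's mechanism; the threshold, sharpness, and count-based normalization are our assumptions, not Multiscreen's exact form:

# Each key is judged against an absolute threshold instead of competing in a
# softmax; surviving keys are aggregated without redistributing unit mass.
import torch

def screen(q, k, v, tau=0.0, sharpness=10.0):
    # q: (n, d), k/v: (m, d)
    scores = q @ k.T / k.shape[-1] ** 0.5             # (n, m) relevance scores
    gate = torch.sigmoid(sharpness * (scores - tau))  # soft accept/reject per key
    out = gate @ v                                    # sum over accepted keys
    # average over accepted keys; a query matching nothing yields ~zero output
    return out / gate.sum(dim=-1, keepdim=True).clamp_min(1.0)

q, k, v = torch.randn(4, 16), torch.randn(32, 16), torch.randn(32, 16)
print(screen(q, k, v).shape)   # (4, 16)
# Unlike softmax attention, there is no global competition: irrelevant keys
# are explicitly rejected rather than receiving a slice of unit mass.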


【4】Paper Reconstruction Evaluation: Evaluating Presentation and Hallucination in AI-written Papers
标题:论文重建评估:评估人工智能撰写论文中的表现和幻觉
链接:https://arxiv.org/abs/2604.01128

作者:Atsuyuki Miyai,Mashiro Toyooka,Zaiying Zhao,Kenta Watanabe,Toshihiko Yamasaki,Kiyoharu Aizawa
备注:Project Page: https://agent4science-utokyo.github.io/PaperRecon_HP/
摘要:This paper introduces the first systematic evaluation framework for quantifying the quality and risks of papers written by modern coding agents. While AI-driven paper writing has become a growing concern, rigorous evaluation of the quality and potential risks of AI-written papers remains limited, and a unified understanding of their reliability is still lacking. We introduce Paper Reconstruction Evaluation (PaperRecon), an evaluation framework in which an overview (overview.md) is created from an existing paper, after which an agent generates a full paper based on the overview and minimal additional resources, and the result is subsequently compared against the original paper. PaperRecon disentangles the evaluation of the AI-written papers into two orthogonal dimensions, Presentation and Hallucination, where Presentation is evaluated using a rubric and Hallucination is assessed via agentic evaluation grounded in the original paper source. For evaluation, we introduce PaperWrite-Bench, a benchmark of 51 papers from top-tier venues across diverse domains published after 2025. Our experiments reveal a clear trade-off: while both ClaudeCode and Codex improve with model advances, ClaudeCode achieves higher presentation quality at the cost of more than 10 hallucinations per paper on average, whereas Codex produces fewer hallucinations but lower presentation quality. This work takes a first step toward establishing evaluation frameworks for AI-driven paper writing and improving the understanding of its risks within the research community.


【5】Do Phone-Use Agents Respect Your Privacy?
标题:电话使用代理尊重您的隐私吗?
链接:https://arxiv.org/abs/2604.00986

作者:Zhengyang Tang,Ke Ji,Xidong Wang,Zihan Ye,Xinyuan Wang,Yiduo Guo,Ziniu Li,Chenxin Li,Jingyuan Hu,Shunian Chen,Tongxu Luo,Jiaxi Bi,Zeyu Qin,Shaobo Wang,Xin Lai,Pengyuan Lyu,Junyi Li,Can Xu,Chengquan Zhang,Han Hu,Ming Yan,Benyou Wang
备注:work in progress
摘要:We study whether phone-use agents respect privacy while completing benign mobile tasks. This question has remained hard to answer because privacy-compliant behavior is not operationalized for phone-use agents, and ordinary apps do not reveal exactly what data agents type into which form entries during execution. To make this question measurable, we introduce MyPhoneBench, a verifiable evaluation framework for privacy behavior in mobile agents. We operationalize privacy-respecting phone use as permissioned access, minimal disclosure, and user-controlled memory through a minimal privacy contract, iMy, and pair it with instrumented mock apps plus rule-based auditing that make unnecessary permission requests, deceptive re-disclosure, and unnecessary form filling observable and reproducible. Across five frontier models on 10 mobile apps and 300 tasks, we find that task success, privacy-compliant task completion, and later-session use of saved preferences are distinct capabilities, and no single model dominates all three. Evaluating success and privacy jointly reshuffles the model ordering relative to either metric alone. The most persistent failure mode across models is simple data minimization: agents still fill optional personal entries that the task does not require. These results show that privacy failures arise from over-helpful execution of benign tasks, and that success-only evaluation overestimates the deployment readiness of current phone-use agents. All code, mock apps, and agent trajectories are publicly available at https://github.com/tangzhy/MyPhoneBench.


【6】Rapid mixing in positively weighted restricted Boltzmann machines
标题:正权受限玻尔兹曼机中的快速混合
链接:https://arxiv.org/abs/2604.00963

作者:Weiming Feng,Heng Guo,Minji Yang
摘要:We show polylogarithmic mixing time bounds for the alternating-scan sampler for positively weighted restricted Boltzmann machines. This is done via analysing the same chain and the Glauber dynamics for ferromagnetic two-spin systems, where we obtain new mixing time bounds up to the critical thresholds.
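For reference, the alternating-scan sampler whose mixing time is bounded here is block Gibbs on an RBM: resample all hidden units given the visible layer, then all visible units given the hidden layer. A minimal sketch with nonnegative weights:

# Alternating-scan (block Gibbs) sampling for a positively weighted RBM.
import numpy as np

rng = np.random.default_rng(0)
n_v, n_h = 20, 15
W = np.abs(rng.normal(scale=0.05, size=(n_v, n_h)))   # nonnegative weights
b_v, b_h = np.zeros(n_v), np.zeros(n_h)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

v = rng.integers(0, 2, n_v).astype(float)
for _ in range(100):                                   # one scan = two blocks
    h = (rng.random(n_h) < sigmoid(b_h + v @ W)).astype(float)
    v = (rng.random(n_v) < sigmoid(b_v + W @ h)).astype(float)
# The paper's result: this chain mixes in polylog time for positively
# weighted RBMs, via the connection to ferromagnetic two-spin systems.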


【7】Investigating Autonomous Agent Contributions in the Wild: Activity Patterns and Code Change over Time
标题:真实环境中自主智能体贡献的调查:活动模式与代码随时间的变化
链接:https://arxiv.org/abs/2604.00917

作者:Razvan Mihai Popescu,David Gros,Andrei Botocan,Rahul Pandita,Prem Devanbu,Maliheh Izadi
备注:MSR 2026 Technical Track
摘要:The rise of large language models for code has reshaped software development. Autonomous coding agents, able to create branches, open pull requests, and perform code reviews, now actively contribute to real-world projects. Their growing role offers a unique and timely opportunity to investigate AI-driven contributions and their effects on code quality, team dynamics, and software maintainability. In this work, we construct a novel dataset of approximately $110,000$ open-source pull requests, including associated commits, comments, reviews, issues, and file changes, collectively representing millions of lines of source code. We compare five popular coding agents, including OpenAI Codex, Claude Code, GitHub Copilot, Google Jules, and Devin, examining how their usage differs in various development aspects such as merge frequency, edited file types, and developer interaction signals, including comments and reviews. Furthermore, we emphasize that code authoring and review are only a small part of the larger software engineering process, as the resulting code must also be maintained and updated over time. Hence, we offer several longitudinal estimates of survival and churn rates for agent-generated versus human-authored code. Ultimately, our findings indicate an increasing agent activity in open-source projects, although their contributions are associated with more churn over time compared to human-authored code.


【8】Orthogonal Learner for Estimating Heterogeneous Long-Term Treatment Effects
标题:用于估计异质性长期治疗效应的正交学习器
链接:https://arxiv.org/abs/2604.00915

作者:Haorui Ma,Dennis Frauen,Valentyn Melnychuk,Stefan Feuerriegel
摘要:Estimation of heterogeneous long-term treatment effects (HLTEs) is widely used for personalized decision-making in marketing, economics, and medicine, where short-term randomized experiments are often combined with long-term observational data. However, HLTE estimation is challenging due to limited overlap in treatment or in observing long-term outcomes for certain subpopulations, which can lead to unstable HLTE estimates with large finite-sample variance. To address this challenge, we introduce the LT-O-learners (Long-Term Orthogonal Learners), a set of novel orthogonal learners for HLTE estimation. The learners are designed for the canonical HLTE setting that combines a short-term randomized dataset $\mathcal{D}_1$ with a long-term historical dataset $\mathcal{D}_2$. The key idea of our LT-O-Learners is to retarget the learning objective by introducing custom overlap weights that downweight samples with low overlap in treatment or in long-term observation. We show that the retargeted loss is equivalent to the weighted oracle loss and satisfies Neyman-orthogonality, which means our learners are robust to errors in the nuisance estimation. We further provide a general error bound for the LT-O-Learners and give the conditions under which quasi-oracle rate can be achieved. Finally, our LT-O-learners are model-agnostic and can thus be instantiated with arbitrary machine learning models. We conduct empirical evaluations on synthetic and semi-synthetic benchmarks to confirm the theoretical properties of our LT-O-Learners, especially the robustness in low-overlap settings. To the best of our knowledge, ours are the first orthogonal learners for HLTE estimation that are robust to low overlap that is common in long-term outcomes.


【9】Accurate and Scalable Matrix Mechanisms via Divide and Conquer
标题:通过分治实现精确且可扩展的矩阵机制
链接:https://arxiv.org/abs/2604.00868

作者:Guanlin He,Yingtai Xiao,Jiamu Bai,Xin Gu,Zeyu Ding,Wenpeng Yin,Daniel Kifer
备注:17 pages
摘要 :Matrix mechanisms are often used to provide unbiased differentially private query answers when publishing statistics or creating synthetic data. Recent work has developed matrix mechanisms, such as ResidualPlanner and Weighted Fourier Factorizations, that scale to high dimensional datasets while providing optimality guarantees for workloads such as marginals and circular product queries. They operate by adding noise to a linearly independent set of queries that can compactly represent the desired workloads.   In this paper, we present QuerySmasher, an alternative scalable approach based on a divide-and-conquer strategy. Given a workload that can be answered from various data marginals, QuerySmasher splits each query into sub-queries and re-assembles the pieces into mutually orthogonal sub-workloads. These sub-workloads represent small, low-dimensional problems that can be independently and optimally answered by existing low-dimensional matrix mechanisms. QuerySmasher then stitches these solutions together to answer queries in the original workload.   We show that QuerySmasher subsumes prior work, like ResidualPlanner (RP), ResidualPlanner+ (RP+), and Weighted Fourier Factorizations (WFF). We prove that it can dominate those approaches, under sum squared error, for all workloads. We also experimentally demonstrate the scalability and accuracy of QuerySmasher.


【10】Proactive Agent Research Environment: Simulating Active Users to Evaluate Proactive Assistants
标题:主动智能体研究环境:模拟活跃用户以评估主动式助手
链接:https://arxiv.org/abs/2604.00842

作者:Deepak Nathani,Cheng Zhang,Chang Huan,Jiaming Shan,Yinfei Yang,Alkesh Patel,Zhe Gan,William Yang Wang,Michael Saxon,Xin Eric Wang
备注:34 pages, 8 figures, 5 tables
摘要:Proactive agents that anticipate user needs and autonomously execute tasks hold great promise as digital assistants, yet the lack of realistic user simulation frameworks hinders their development. Existing approaches model apps as flat tool-calling APIs, failing to capture the stateful and sequential nature of user interaction in digital environments and making realistic user simulation infeasible. We introduce Proactive Agent Research Environment (Pare), a framework for building and evaluating proactive agents in digital environments. Pare models applications as finite state machines with stateful navigation and state-dependent action space for the user simulator, enabling active user simulation. Building on this foundation, we present Pare-Bench, a benchmark of 143 diverse tasks spanning communication, productivity, scheduling, and lifestyle apps, designed to test context observation, goal inference, intervention timing, and multi-app orchestration.


【11】Routing-Free Mixture-of-Experts
标题:无路由的专家混合
链接:https://arxiv.org/abs/2604.00801

作者:Yilun Liu,Jinru Han,Sikuan Yan,Volker Tresp,Yunpu Ma
备注:Code is available at https://github.com/liuyilun2000/RoutingFreeMoE/tree/release
摘要:Standard Mixture-of-Experts (MoE) models rely on centralized routing mechanisms that introduce rigid inductive biases. We propose Routing-Free MoE, which eliminates any hard-coded centralized designs, including external routers, Softmax, Top-K, and load balancing, instead encapsulating all activation functionality within individual experts and optimizing it directly through continuous gradient flow, enabling each expert to determine its activation entirely on its own. We introduce a unified adaptive load-balancing framework to simultaneously optimize both expert-balancing and token-balancing objectives through a configurable interpolation, allowing flexible and customizable resource allocation. Extensive experiments show that Routing-Free MoE can consistently outperform baselines with better scalability and robustness. We analyze its behavior in detail and offer insights that may facilitate future MoE design and optimization.
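The routing-free design can be sketched by giving each expert its own learned activation and summing the self-gated outputs, with no router, softmax, or Top-K anywhere; the sigmoid gate below is our assumed parameterization:

# Each expert decides its own activation; outputs are summed, not routed.
import torch
import torch.nn as nn

class SelfGatedExpert(nn.Module):
    def __init__(self, dim, hidden):
        super().__init__()
        self.gate = nn.Linear(dim, 1)          # the expert's own activation
        self.ffn = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(),
                                 nn.Linear(hidden, dim))

    def forward(self, x):                       # x: (tokens, dim)
        g = torch.sigmoid(self.gate(x))         # per-token, per-expert scalar
        return g * self.ffn(x), g

class RoutingFreeMoE(nn.Module):
    def __init__(self, dim=64, hidden=256, n_experts=8):
        super().__init__()
        self.experts = nn.ModuleList(SelfGatedExpert(dim, hidden)
                                     for _ in range(n_experts))

    def forward(self, x):
        outs, gates = zip(*(e(x) for e in self.experts))
        # gates can feed an adaptive load-balancing objective during training
        return sum(outs), torch.cat(gates, dim=-1)

moe = RoutingFreeMoE()
y, g = moe(torch.randn(10, 64))
print(y.shape, g.mean().item())   # activation levels emerge from training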


【12】Preference Guided Iterated Pareto Referent Optimisation for Accessible Route Planning
标题:偏好引导的迭代帕累托参照优化用于无障碍路线规划
链接:https://arxiv.org/abs/2604.00795

作者:Paolo Speziali,Arno De Greef,Mehrdad Asadi,Willem Röpke,Ann Nowé,Diederik M. Roijers
摘要:We propose Preference Guided Iterated Pareto Referent Optimisation (PG-IPRO) for urban route planning for people with different accessibility requirements and preferences. With this algorithm, the user can interact with the system by giving feedback on a route, i.e., indicating which objective should be minimized further or, conversely, which can be relaxed. This leads to intuitive user interaction that is especially effective during early iterations compared to information-gain-based interaction. Furthermore, due to PG-IPRO's iterative nature, the full set of alternative, possibly optimal policies (the Pareto front) is never computed, leading to higher computational efficiency and shorter waiting times for users.


【13】Stochastic Attention: Connectome-Inspired Randomized Routing for Expressive Linear-Time Attention
标题:随机注意力:受连接组启发的随机路由,实现富有表达力的线性时间注意力
链接:https://arxiv.org/abs/2604.00754

作者:Zehao Jin,Yanan Sui
摘要:The whole-brain connectome of a fruit fly comprises over 130K neurons connected with a probability of merely 0.02%, yet achieves an average shortest path of only 4.4 hops. Despite being highly structured at the circuit level, the network's long-range connections are broadly distributed across brain regions, functioning as stochastic shortcuts that enable efficient global communication. Inspired by this observation, we propose Stochastic Attention (SA), a drop-in enhancement for sliding-window attention (SWA) that applies a random permutation to the token sequence before windowed attention and restores the original order afterward. This transforms the fixed local window into a stochastic global one within the same $O(nw)$ per-layer budget. Through depth, independently sampled permutations yield exponentially growing receptive fields, achieving full sequence coverage in $O(\log_w n)$ layers versus $O(n/w)$ for SWA. We validate SA in two settings: pre-training language models from scratch, where a gated SA + SWA combination achieves the best average zero-shot accuracy, and training-free inference on Qwen3-8B and Qwen3-30B-A3B, where SA consistently outperforms SWA and matches or exceeds Mixture of Block Attention at comparable compute budgets. These results suggest that connectome-inspired stochastic routing is a practical primitive for improving the expressivity of efficient attention, complementary to existing linear and sparse approaches.
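The permute-attend-unpermute primitive takes only a few lines; the block-local attention below is a simplification of sliding-window attention, and the window size is illustrative:

# Permute tokens, attend within fixed-size blocks, restore the original order.
import torch
import torch.nn.functional as F

def block_attention(q, k, v, w):
    n, d = q.shape
    q, k, v = (t.reshape(n // w, w, d) for t in (q, k, v))   # assume w | n
    out = F.scaled_dot_product_attention(q, k, v)            # per-block softmax
    return out.reshape(n, d)

def stochastic_attention(q, k, v, w=64):
    n = q.shape[0]
    perm = torch.randperm(n)                   # random "shortcut" wiring
    inv = torch.empty_like(perm)
    inv[perm] = torch.arange(n)
    out = block_attention(q[perm], k[perm], v[perm], w)
    return out[inv]                            # restore token order

q = k = v = torch.randn(512, 32)
print(stochastic_attention(q, k, v).shape)
# Independently sampled permutations at each layer grow the receptive field
# exponentially, covering the sequence in O(log_w n) layers.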


【14】To Memorize or to Retrieve: Scaling Laws for RAG-Considerate Pretraining
标题:记忆还是检索:面向RAG的预训练缩放定律
链接:https://arxiv.org/abs/2604.00715

作者:Karan Singh,Michael Yu,Varun Gangal,Zhuofu Tao,Sachin Kumar,Emmy Liu,Steven Y. Feng
备注:Code and data at https://github.com/DegenAI-Labs/RAG-scaling-laws
摘要:Retrieval-augmented generation (RAG) improves language model (LM) performance by providing relevant context at test time for knowledge-intensive situations. However, the relationship between parametric knowledge acquired during pretraining and non-parametric knowledge accessed via retrieval remains poorly understood, especially under fixed data budgets. In this work, we systematically study the trade-off between pretraining corpus size and retrieval store size across a wide range of model and data scales. We train OLMo-2-based LMs ranging from 30M to 3B parameters on up to 100B tokens of DCLM data, while varying both pretraining data scale (1-150x the number of parameters) and retrieval store size (1-20x), and evaluate performance across a diverse suite of benchmarks spanning reasoning, scientific QA, and open-domain QA. We find that retrieval consistently improves performance over parametric-only baselines across model scales and introduce a three-dimensional scaling framework that models performance as a function of model size, pretraining tokens, and retrieval corpus size. This scaling manifold enables us to estimate optimal allocations of a fixed data budget between pretraining and retrieval, revealing that the marginal utility of retrieval depends strongly on model scale, task type, and the degree of pretraining saturation. Our results provide a quantitative foundation for understanding when and how retrieval should complement pretraining, offering practical guidance for allocating data resources in the design of scalable language modeling systems.


【15】Performance of Neural and Polynomial Operator Surrogates
标题:神经与多项式算子代理模型的性能
链接:https://arxiv.org/abs/2604.00689

作者:Josephine Westermann,Benno Huber,Thomas O'Leary-Roseberry,Jakob Zech
备注:44 pages, 21 figures
摘要:We consider the problem of constructing surrogate operators for parameter-to-solution maps arising from parametric partial differential equations, where repeated forward model evaluations are computationally expensive. We present a systematic empirical comparison of neural operator surrogates, including a reduced-basis neural operator trained with $L^2_μ$ and $H^1_μ$ objectives and the Fourier neural operator, against polynomial surrogate methods, specifically a reduced-basis sparse-grid surrogate and a reduced-basis tensor-train surrogate. All methods are evaluated on a linear parametric diffusion problem and a nonlinear parametric hyperelasticity problem, using input fields with algebraically decaying spectral coefficients at varying rates of decay $s$. To enable fair comparisons, we analyze ensembles of surrogate models generated by varying hyperparameters and compare the resulting Pareto frontiers of cost versus approximation accuracy, decomposing cost into contributions from data generation, setup, and evaluation. Our results show that no single method is universally superior. Polynomial surrogates achieve substantially better data efficiency for smooth input fields ($s \geq 2$), with convergence rates for the sparse-grid surrogate in agreement with theoretical predictions. For rough inputs ($s \leq 1$), the Fourier neural operator displays the fastest convergence rates. Derivative-informed training consistently improves data efficiency over standard $L^2_μ$ training, providing a competitive alternative for rough inputs in the low-data regime when Jacobian information is available at reasonable cost. These findings highlight the importance of matching the surrogate methodology to the regularity of the problem as well as accuracy demands and computational constraints of the application.


【16】On rankings in multiplayer games with an application to the game of Whist
标题:多人游戏中的排名及其在惠斯特游戏中的应用
链接:https://arxiv.org/abs/2604.00641

作者:Alexis Coyette,Charles Modera,Candy Sonveaux,Judicaël Mohet,François-Grégoire Bierwart,Sylverio Pool Marquez,Jarod Ketcha Kouakep,Cédric Simal,Komlan Fiagbe,Violaine Piengeon,Martin Moriamé,Justine Bodart,Marie Dorchain,Maxime Lucas,Rommel Tchinda Djeudjo,Gianluca Peri,Eve Tilman
备注:Author order determined by the proposed ranking method
摘要:We propose a novel extension of the Bradley-Terry model to multiplayer games and adapt a recent algorithm by Newman [1] to our model. We demonstrate the use of our proposed method on synthetic datasets and on a real dataset of games of cards.
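For background, Newman's iteration for the classical pairwise Bradley-Terry model is sketched below; the paper's contribution is a multiplayer extension of this model and an adaptation of this algorithm, whose exact updates we do not reproduce:

# Newman's iteration for pairwise Bradley-Terry strengths pi_i, where
# P(i beats j) = pi_i / (pi_i + pi_j) and wins[i, j] = # times i beat j.
import numpy as np

def newman_bt(wins, iters=200):
    n = wins.shape[0]
    pi = np.ones(n)
    for _ in range(iters):
        denom = pi[:, None] + pi[None, :]
        num = (wins * pi[None, :] / denom).sum(axis=1)   # credit from wins
        den = (wins.T / denom).sum(axis=1)               # exposure from losses
        pi = np.maximum(num / np.maximum(den, 1e-12), 1e-12)
        pi /= np.exp(np.mean(np.log(pi)))                # fix the overall scale
    return pi

rng = np.random.default_rng(0)
strength = np.array([3.0, 2.0, 1.0, 0.5])
p_win = strength[:, None] / (strength[:, None] + strength[None, :])
wins = rng.poisson(10 * p_win)                           # synthetic match record
np.fill_diagonal(wins, 0)
print(np.argsort(-newman_bt(wins)))   # players ordered strongest-first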


【17】HabitatAgent: An End-to-End Multi-Agent System for Housing Consultation
标题:HabitatAgent:一个端到端的住房咨询多智能体系统
链接:https://arxiv.org/abs/2604.00556

作者:Hongyang Yang,Yanxin Zhang,Yang She,Yue Xiao,Hao Wu,Yiyang Zhang,Jiapeng Hou,Rongshan Zhang
备注:Accepted at the DMO-FinTech Workshop (PAKDD 2026)
摘要:Housing selection is a high-stakes and largely irreversible decision problem. We study housing consultation as a decision-support interface for housing selection. Existing housing platforms and many LLM-based assistants often reduce this process to ranking or recommendation, resulting in opaque reasoning, brittle multi-constraint handling, and limited guarantees on factuality.   We present HabitatAgent, the first LLM-powered multi-agent architecture for end-to-end housing consultation. HabitatAgent comprises four specialized agent roles: Memory, Retrieval, Generation, and Validation. The Memory Agent maintains multi-layer user memory through internal stages for constraint extraction, memory fusion, and verification-gated updates; the Retrieval Agent performs hybrid vector--graph retrieval (GraphRAG); the Generation Agent produces evidence-referenced recommendations and explanations; and the Validation Agent applies multi-tier verification and targeted remediation. Together, these agents provide an auditable and reliable workflow for end-to-end housing consultation.   We evaluate HabitatAgent on 100 real user consultation scenarios (300 multi-turn question--answer pairs) under an end-to-end correctness protocol. A strong single-stage baseline (Dense+Rerank) achieves 75% accuracy, while HabitatAgent reaches 95%.


【18】Lipschitz Dueling Bandits over Continuous Action Spaces
标题:连续动作空间上的利普希茨决斗老虎机
链接:https://arxiv.org/abs/2604.00523

作者:Mudit Sharma,Shweta Jain,Vaneet Aggarwal,Ganesh Ghalme
摘要:We study for the first time, stochastic dueling bandits over continuous action spaces with Lipschitz structure, where feedback is purely comparative. While dueling bandits and Lipschitz bandits have been studied separately, their combination has remained unexplored. We propose the first algorithm for Lipschitz dueling bandits, using round-based exploration and recursive region elimination guided by an adaptive reference arm. We develop new analytical tools for relative feedback and prove a regret bound of $\tilde O\left(T^{\frac{d_z+1}{d_z+2}}\right)$, where $d_z$ is the zooming dimension of the near-optimal region. Further, our algorithm takes only logarithmic space in terms of the total time horizon, best achievable by any bandit algorithm over a continuous action space.


【19】The Rashomon Effect for Visualizing High-Dimensional Data
标题:用于可视化高维数据的罗生门效应
链接:https://arxiv.org/abs/2604.00485

作者:Yiyang Sun,Haiyang Huang,Gaurav Rajesh Parikh,Cynthia Rudin
备注:The paper is accepted in AISTATS 2026
摘要:Dimension reduction (DR) is inherently non-unique: multiple embeddings can preserve the structure of high-dimensional data equally well while differing in layout or geometry. In this paper, we formally define the Rashomon set for DR -- the collection of `good' embeddings -- and show how embracing this multiplicity leads to more powerful and trustworthy representations. Specifically, we pursue three goals. First, we introduce PCA-informed alignment to steer embeddings toward principal components, making axes interpretable without distorting local neighborhoods. Second, we design concept-alignment regularization that aligns an embedding dimension with external knowledge, such as class labels or user-defined concepts. Third, we propose a method to extract common knowledge across the Rashomon set by identifying trustworthy and persistent nearest-neighbor relationships, which we use to construct refined embeddings with improved local structure while preserving global relationships. By moving beyond a single embedding and leveraging the Rashomon set, we provide a flexible framework for building interpretable, robust, and goal-aligned visualizations.


【20】Internal State-Based Policy Gradient Methods for Partially Observable Markov Potential Games
标题:部分可观测马尔可夫势博弈的基于内部状态的策略梯度方法
链接:https://arxiv.org/abs/2604.00433

作者:Wonseok Yang,Thinh T. Doan
备注:6 pages, 2 figures. Submitted to IEEE Control Systems Letters (L-CSS) with CDC option
摘要:This letter studies multi-agent reinforcement learning in partially observable Markov potential games. Solving this problem is challenging due to partial observability, decentralized information, and the curse of dimensionality. First, to address the first two challenges, we leverage the common information framework, which allows agents to act based on both shared and local information. Second, to ensure tractability, we study an internal state that compresses accumulated information, preventing it from growing unboundedly over time. We then implement an internal state-based natural policy gradient method to find Nash equilibria of the Markov potential game. Our main contribution is to establish a non-asymptotic convergence bound for this method. Our theoretical bound decomposes into two interpretable components: a statistical error term that also arises in standard Markov potential games, and an approximation error capturing the use of finite-state controllers. Finally, simulations across multiple partially observable environments demonstrate that the proposed method using finite-state controllers achieves consistent improvements in performance compared to the setting where only the current observation is used.


【21】In harmony with gpt-oss
Link: https://arxiv.org/abs/2604.00362

Authors: Borislav Mavrin
Abstract: No one has independently reproduced OpenAI's published scores for gpt-oss-20b with tools, because the original paper discloses neither the tools nor the agent harness. We reverse-engineered the model's in-distribution tools: when prompted without tool definitions, gpt-oss still calls tools from its training distribution with high statistical confidence, a strong prior rather than a hallucination. We then built a native harmony agent harness (https://github.com/borislavmavrin/harmonyagent.git) that encodes messages in the model's native format, bypassing the lossy Chat Completions conversion. Together, these yield the first independent reproduction of OpenAI's published scores: 60.4% on SWE Verified HIGH (published 60.7%), 53.3% on MEDIUM (53.2%), and 91.7% on AIME25 with tools (90.4%).


【22】The Persistent Vulnerability of Aligned AI Systems
Link: https://arxiv.org/abs/2604.00324

Authors: Aengus Lynch
Note: PhD thesis, University College London, 2025. 157 pages. Supervised by Ricardo Silva
Abstract: Autonomous AI agents are being deployed with filesystem access, email control, and multi-step planning. This thesis contributes to four open problems in AI safety: understanding dangerous internal computations, removing dangerous behaviors once embedded, testing for vulnerabilities before deployment, and predicting when models will act against deployers.
ACDC automates circuit discovery in transformers, recovering all five component types from prior manual work on GPT-2 Small by selecting 68 edges from 32,000 candidates in hours rather than months.
Latent Adversarial Training (LAT) removes dangerous behaviors by optimizing perturbations in the residual stream to elicit failure modes, then training under those perturbations. LAT solved the sleeper-agent problem where standard safety training failed, matching existing defenses with 700x fewer GPU hours.
Best-of-N jailbreaking achieves 89% attack success on GPT-4o and 78% on Claude 3.5 Sonnet through random input augmentations. Attack success follows power-law scaling across text, vision, and audio, enabling quantitative forecasting of adversarial robustness.
Agentic misalignment tests whether frontier models autonomously choose harmful actions given ordinary goals. Across 16 models, agents engaged in blackmail (96% for Claude Opus 4), espionage, and actions causing death. Misbehavior rates rose from 6.5% to 55.1% when models stated scenarios were real rather than evaluations.
The thesis does not fully resolve any of these problems but makes each tractable and measurable.


【23】Robust Multimodal Safety via Conditional Decoding
Link: https://arxiv.org/abs/2604.00310

Authors: Anurag Kumar, Raghuveer Peri, Jon Burnsky, Alexandru Nelus, Rohit Paturi, Srikanth Vishnubhotla, Yanjun Qi
Note: 8 pages plus appendix. Submitted to ACL 2026
Abstract: Multimodal large language models (MLLMs) often experience degraded safety alignment when harmful queries exploit cross-modal interactions. Models aligned on text alone show a higher rate of successful attacks when extended to two or more modalities. In this work, we propose a simple conditional decoding strategy, CASA (Classification Augmented with Safety Attention), that utilizes internal representations of MLLMs to predict a binary safety token before response generation. We introduce a novel safety-attention module designed to enhance the model's ability to detect malicious queries. Our design ensures robust safety alignment without relying on any external classifier or auxiliary head, and without the need for modality-specific safety fine-tuning. On diverse benchmarks such as MM-SafetyBench, JailbreakV-28k, and adversarial audio tests, CASA lowers the average attack success rate by more than 97% across modalities and attack types. Our empirical evaluations also show that CASA maintains strong utility on benign inputs, a result validated through both automated and human evaluations (via 13 trained annotators). Together, these results highlight CASA as a simple and generalizable framework for improving multimodal LLM safety.
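The core mechanism, predicting a binary safety token from internal representations before any response tokens are emitted, can be sketched as a small probe over hidden states. The mean pooling, the two-way head, and the Hugging-Face-style model interface below are illustrative assumptions; CASA's actual safety-attention module is more elaborate.

```python
import torch
import torch.nn as nn

class SafetyProbe(nn.Module):
    """Binary safe/unsafe head over the model's last hidden states.
    A simplified stand-in for CASA's safety-attention module."""
    def __init__(self, hidden_dim):
        super().__init__()
        self.head = nn.Linear(hidden_dim, 2)

    def forward(self, last_hidden):                  # (batch, seq, hidden)
        return self.head(last_hidden.mean(dim=1))    # mean-pooled logits

def guarded_generate(model, probe, input_ids, refusal_ids):
    """Conditional decoding: classify first, generate only if safe.
    Assumes a Hugging-Face-style model exposing hidden states."""
    out = model(input_ids, output_hidden_states=True)
    if probe(out.hidden_states[-1]).argmax(-1).item() == 1:  # predicted unsafe
        return refusal_ids                                   # canned refusal
    return model.generate(input_ids)
```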


【24】Vocal Prognostic Digital Biomarkers in Monitoring Chronic Heart Failure: A Longitudinal Observational Study
Link: https://arxiv.org/abs/2604.00308

Authors: Fan Wu, Matthias P. Nägele, Daryush D. Mehta, Elgar Fleisch, Frank Ruschitzka, Andreas J. Flammer, Filipe Barata
Abstract:
Objective: This study aimed to evaluate which voice features can predict health deterioration in patients with chronic heart failure (HF).
Background: Heart failure is a chronic condition with progressive deterioration and acute decompensations, often requiring hospitalization and imposing substantial healthcare and economic burdens. Current standard-of-care (SoC) home monitoring, such as weight tracking, lacks predictive accuracy and requires high patient engagement. Voice is a promising non-invasive biomarker, though prior studies have mainly focused on acute HF stages.
Methods: In a 2-month longitudinal study, 32 patients with HF collected daily voice recordings and SoC measures of weight and blood pressure at home, with biweekly health-status questionnaires. Acoustic analysis generated detailed vowel and speech features. Time-series features were extracted from aggregated lookback windows (e.g., 7 days) to predict next-day health status. Explainable machine learning with nested cross-validation identified the top vocal biomarkers, and a case study illustrated model application.
Results: A total of 21,863 recordings were analyzed. Acoustic vowel features showed strong correlations with health status. Time-series voice features within the lookback window outperformed the corresponding standard-care measures, achieving peak sensitivity and specificity of 0.826 and 0.782 versus 0.783 and 0.567 for SoC metrics. Key prognostic voice features identifying deterioration included delayed energy shift, low energy variability, and higher shimmer variability in vowels, along with reduced speaking and articulation rate, lower phonation ratio, decreased voice quality, and increased formant variability in speech.
Conclusion: Voice-based monitoring offers a non-invasive approach to detecting early health changes in chronic HF, supporting proactive and personalized care.
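As a sketch of the lookback-window featurization step, the snippet below aggregates a daily acoustic feature over a trailing 7-day window and shifts it by one day so that features at day t predict status at day t+1. Column names and the synthetic data are illustrative, not the study's schema.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(
    {"shimmer": rng.normal(0.05, 0.01, 60)},        # one acoustic feature/day
    index=pd.date_range("2024-01-01", periods=60, freq="D"),
)

window = df["shimmer"].rolling("7D")                # trailing 7-day lookback
features = pd.DataFrame({
    "shimmer_mean": window.mean(),
    "shimmer_std": window.std(),                    # a 'variability' feature
    "shimmer_trend": window.apply(                  # crude within-window slope
        lambda x: (x.iloc[-1] - x.iloc[0]) / max(len(x) - 1, 1), raw=False),
})
X = features.shift(1)    # features at day t predict status at day t+1
```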


【25】Softmax gradient policy for variance minimization and risk-averse multi armed bandits
Link: https://arxiv.org/abs/2604.00241

Authors: Gabriel Turinici
Abstract: Algorithms for the Multi-Armed Bandit (MAB) problem play a central role in sequential decision-making and have been extensively explored both theoretically and numerically. While most classical approaches aim to identify the arm with the highest expected reward, we focus on a risk-aware setting where the goal is to select the arm with the lowest variance, favoring stability over potentially high but uncertain returns. To model the decision process, we consider a softmax parameterization of the policy; we propose a new algorithm to select the minimal-variance (or minimal-risk) arm and prove its convergence under natural conditions. The algorithm constructs an unbiased estimate of the objective by using two independent draws from the current arm's distribution. We provide numerical experiments that illustrate the practical behavior of these algorithms and offer guidance on implementation choices. The setting also covers general risk-aware problems where there is a trade-off between maximizing the average reward and minimizing its variance.
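The two-draw trick works because for independent samples $x_1, x_2$ from the same arm, $E[(x_1 - x_2)^2 / 2]$ equals that arm's variance, giving an unbiased plug-in for a REINFORCE-style update. The sketch below descends on expected variance under a softmax policy; the learning rate and the vanilla REINFORCE form are illustrative choices, not the paper's exact scheme.

```python
import numpy as np

rng = np.random.default_rng(0)
stds = np.array([1.0, 0.3, 2.0])      # arm 1 has the smallest variance
theta = np.zeros(3)                   # softmax policy parameters
lr = 0.01

def softmax(z):
    p = np.exp(z - z.max())
    return p / p.sum()

for t in range(20000):
    p = softmax(theta)
    a = rng.choice(3, p=p)
    x1, x2 = rng.normal(0.0, stds[a], size=2)   # two independent draws
    var_hat = 0.5 * (x1 - x2) ** 2              # unbiased variance estimate
    grad_logp = -p
    grad_logp[a] += 1.0                         # grad log pi(a) = e_a - p
    theta -= lr * var_hat * grad_logp           # descend on expected variance

print(softmax(theta))    # mass concentrates on the minimum-variance arm
```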


【26】MAC-Attention: a Match-Amend-Complete Scheme for Fast and Accurate Attention Computation
Link: https://arxiv.org/abs/2604.00235

Authors: Jinghan Yao, Sam Adé Jacobs, Walid Krichene, Masahiro Tanaka, Dhabaleswar K Panda
Abstract: Long-context decoding in LLMs is IO-bound: each token re-reads an ever-growing KV cache. Prior accelerations cut bytes via compression, which lowers fidelity, or via selection/eviction, which restricts what remains accessible; both can degrade delayed recall and long-form generation. We introduce MAC-Attention, a fidelity- and access-preserving alternative that accelerates decoding by reusing prior attention computations for semantically similar recent queries. It starts with a match stage that performs pre-RoPE L2 matching over a short local window; an amend stage rectifies the reused attention by recomputing a small band near the match boundary; and a complete stage fuses the rectified results with fresh attention computed on the KV tail through a numerically stable merge. On a match hit, the compute and bandwidth complexity is constant regardless of context length. The method is model-agnostic and composes with IO-aware kernels, paged-KV managers, and MQA/GQA. Across LongBench v2 (120K), RULER (120K), and LongGenBench (16K continuous generation), compared to the latest FlashInfer library, MAC-Attention reduces KV accesses by up to 99%, cuts token-generation latency by over 60% at 128K context, and achieves over 14.3x attention-phase speedups and up to 2.6x end-to-end, while maintaining full-attention quality. By reusing computation, MAC-Attention delivers long-context inference that is both fast and faithful. Code is available at https://github.com/YJHMITWEB/MAC-Attention.git
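The match stage is the piece that enables the constant-cost hit path: compare the current pre-RoPE query against a short window of recent pre-RoPE queries by L2 distance, and reuse that step's attention on a hit. The threshold and window handling below are assumptions for illustration; the amend and complete stages are omitted.

```python
import torch

def match_stage(q_pre_rope, recent_queries, tau=0.5):
    """Return the index of a semantically similar recent decoding step,
    or None on a miss. q_pre_rope: (d,); recent_queries: (window, d),
    both pre-RoPE. tau is an illustrative threshold, not the paper's."""
    if recent_queries.shape[0] == 0:
        return None
    dists = torch.norm(recent_queries - q_pre_rope, dim=-1)  # L2 over window
    j = int(dists.argmin())
    return j if dists[j] < tau else None

# On a hit, the decoder would reuse step j's attention output, amend a
# small band near the match boundary, and merge in attention on the KV tail.
```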


【27】Measuring the Representational Alignment of Neural Systems in Superposition
Link: https://arxiv.org/abs/2604.00208

Authors: Sunny Liu, Habon Issa, André Longon, Liv Gorton, Meenakshi Khosla, David Klindt
Note: 17 pages, 4 figures
Abstract: Comparing the internal representations of neural networks is a central goal in both neuroscience and machine learning. Standard alignment metrics operate on raw neural activations, implicitly assuming that similar representations produce similar activity patterns. However, neural systems frequently operate in superposition, encoding more features than they have neurons via linear compression. We derive closed-form expressions showing that superposition systematically deflates Representational Similarity Analysis, Centered Kernel Alignment, and linear regression, causing networks with identical feature content to appear dissimilar. The root cause is that these metrics depend on the cross-similarity between the two systems' superposition matrices, which under random projections typically differ significantly, rather than on the latent features themselves: alignment scores conflate what a system represents with how it represents it. Under partial feature overlap, this confound can invert the expected ordering, making systems that share fewer features appear more aligned than systems that share more. Crucially, the apparent misalignment need not reflect a loss of information; compressed sensing guarantees that the original features remain recoverable from the lower-dimensional activity, provided they are sparse. We therefore argue that comparing neural systems in superposition requires extracting and aligning the underlying features rather than comparing the raw neural mixtures.
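The deflation effect is easy to reproduce numerically: take identical sparse latent features, mix them into fewer neurons with two different random superposition matrices, and linear CKA on the raw activations drops far below the self-similarity of 1. A toy demonstration, not the paper's closed-form analysis.

```python
import numpy as np

rng = np.random.default_rng(0)

def linear_cka(X, Y):
    """Linear CKA between activation matrices (samples x neurons)."""
    X = X - X.mean(0); Y = Y - Y.mean(0)
    num = np.linalg.norm(X.T @ Y, "fro") ** 2
    den = np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")
    return num / den

n, features, neurons = 2000, 64, 16      # more features than neurons
Z = rng.normal(size=(n, features)) * (rng.random((n, features)) < 0.1)  # sparse latents
W1 = rng.normal(size=(features, neurons)) / np.sqrt(neurons)
W2 = rng.normal(size=(features, neurons)) / np.sqrt(neurons)

# Two 'networks' with identical feature content but different random
# superposition matrices: raw-activation CKA is deflated well below 1.
print(linear_cka(Z @ W1, Z @ W2))   # typically far from 1
print(linear_cka(Z @ W1, Z @ W1))   # = 1 by construction
```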


【28】Offline Constrained RLHF with Multiple Preference Oracles
Link: https://arxiv.org/abs/2604.00200

Authors: Brenden Latham, Mehrdad Moharrami
Abstract: We study offline constrained reinforcement learning from human feedback with multiple preference oracles. Motivated by applications that trade off performance with safety or fairness, we aim to maximize target-population utility subject to a minimum protected-group welfare constraint. From pairwise comparisons collected under a reference policy, we estimate oracle-specific rewards via maximum likelihood and analyze how statistical uncertainty propagates through the dual program. We cast the constrained objective as a KL-regularized Lagrangian whose primal optimizer is a Gibbs policy, reducing learning to a convex dual problem. We propose a dual-only algorithm that ensures high-probability constraint satisfaction and provide the first finite-sample performance guarantees for offline constrained preference learning. Finally, we extend our theoretical analysis to accommodate multiple constraints and general f-divergence regularization.
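The reduction is concrete: for a KL-regularized Lagrangian, the primal optimizer has the closed form $π_λ(a) \propto π_{\mathrm{ref}}(a)\exp((r_0(a) + λ r_1(a))/β)$, so only the scalar multiplier $λ$ needs to be optimized. The toy numbers, step size, and single-state setting below are illustrative assumptions, not the paper's algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)
A = 5
r0 = rng.normal(size=A)          # estimated target-population rewards
r1 = rng.normal(size=A)          # estimated protected-group rewards
pi_ref = np.full(A, 1.0 / A)     # reference policy
beta, b, eta = 1.0, 0.2, 0.5     # KL strength, welfare floor, dual step

def gibbs_policy(lam):
    """Closed-form primal optimizer of the KL-regularized Lagrangian."""
    logits = np.log(pi_ref) + (r0 + lam * r1) / beta
    p = np.exp(logits - logits.max())
    return p / p.sum()

lam = 0.0
for _ in range(200):             # dual-only iteration
    welfare = gibbs_policy(lam) @ r1
    lam = max(0.0, lam - eta * (welfare - b))   # projected dual update

print(lam, gibbs_policy(lam) @ r1)   # welfare ends at (or above) the floor b
```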


【29】QUEST: A robust attention formulation using query-modulated spherical attention
Link: https://arxiv.org/abs/2604.00199

Authors: Hariprasath Govindarajan, Per Sidén, Jacob Roll, Fredrik Lindsten
Note: Accepted to ICLR 2026
Abstract: The Transformer architecture has become one of the most widely used in deep learning, and the attention mechanism is at its core. The standard attention formulation applies a softmax to a scaled dot product between query and key vectors. We explore the role played by the norms of the queries and keys, which can cause training instabilities when they grow arbitrarily large. We demonstrate how this can happen even in simple Transformer models, in the presence of easy-to-learn spurious patterns in the data. We propose a new attention formulation, QUEry-modulated Spherical aTtention (QUEST), that constrains the keys to a hyperspherical latent space while still allowing individual tokens to flexibly control the sharpness of the attention distribution. QUEST can be used as a drop-in replacement for standard attention. We focus on vision applications while also exploring other domains to highlight the method's generality. We show that QUEST (1) trains without instabilities and (2) produces models with improved performance that (3) are robust to data corruptions and adversarial attacks.
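A minimal sketch of the stated idea: project keys onto the unit hypersphere so their norms cannot blow up, while the query's norm survives the dot product and acts as a per-token sharpness (inverse-temperature) control. This is one reading of the abstract, not the paper's full parameterization.

```python
import torch
import torch.nn.functional as F

def quest_attention(q, k, v):
    """Query-modulated spherical attention sketch.
    q, k, v: (..., seq, head_dim). Keys are unit-norm; the query norm
    modulates how peaked the softmax is for that token."""
    k_hat = F.normalize(k, dim=-1)            # keys on the hypersphere
    logits = q @ k_hat.transpose(-2, -1)      # ||q|| sets the sharpness
    return logits.softmax(dim=-1) @ v

# Drop-in usage on random tensors:
q, k, v = (torch.randn(2, 8, 16) for _ in range(3))
out = quest_attention(q, k, v)                # (2, 8, 16)
```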


【30】Evolution Strategies for Deep RL pretraining
Link: https://arxiv.org/abs/2604.00066

Authors: Adrian Martínez, Ananya Gupta, Hanka Goralija, Mario Rico, Saúl Fenollosa, Tamar Alphaidze
Note: 12 pages, 3 figures, 2 algorithms; EE-568 Reinforcement Learning course project
Abstract: Although Deep Reinforcement Learning (DRL) has proven highly effective for complex decision-making problems, it demands significant computational resources and careful parameter tuning to develop successful strategies. Evolution strategies (ES) offer a more straightforward, derivative-free approach that is less computationally costly and simpler to deploy. However, ES generally do not match the performance levels achieved by DRL, which calls into question their suitability for more demanding scenarios. This study examines the performance of ES and DRL across tasks of varying difficulty, including Flappy Bird, Breakout, and MuJoCo environments, and asks whether ES can be used for initial training to enhance DRL algorithms. The results indicate that ES do not consistently train faster than DRL. When used as a preliminary training step, they only provide benefits in less complex environments (Flappy Bird) and show minimal or no improvement in training efficiency or stability across different parameter settings when applied to more sophisticated tasks (Breakout and MuJoCo Walker).
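For reference, the derivative-free update at the heart of such studies can be written in a few lines (an OpenAI-ES-style step with fitness normalization; hyperparameters are illustrative):

```python
import numpy as np

def es_step(theta, fitness, n_pop=64, sigma=0.1, alpha=0.02, rng=None):
    """One step of a basic evolution strategy: sample Gaussian
    perturbations, evaluate fitness, and move along the fitness-weighted
    average perturbation. A generic ES step, shown only to illustrate the
    derivative-free update the paper studies."""
    rng = rng or np.random.default_rng()
    eps = rng.normal(size=(n_pop, theta.size))
    rewards = np.array([fitness(theta + sigma * e) for e in eps])
    rewards = (rewards - rewards.mean()) / (rewards.std() + 1e-8)  # normalize
    return theta + alpha / (n_pop * sigma) * eps.T @ rewards

# Example: maximize a simple quadratic fitness.
theta = np.zeros(5)
for _ in range(300):
    theta = es_step(theta, lambda w: -np.sum((w - 1.0) ** 2))
print(theta)   # approaches the optimum at all-ones
```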


【31】Generalizable Dense Reward for Long-Horizon Robotic Tasks
Link: https://arxiv.org/abs/2604.00055

Authors: Silong Yong, Stephen Sheng, Carl Qi, Xiaojie Wang, Evan Sheehan, Anurag Shivaprasad, Yaqi Xie, Katia Sycara, Yesh Dattatreya
Note: Project page: https://silongyong.github.io/vllr_project_page/
Abstract: Existing robotic foundation policies are trained primarily via large-scale imitation learning. While such models demonstrate strong capabilities, they often struggle with long-horizon tasks due to distribution shift and error accumulation. Reinforcement learning (RL) can finetune these models, but it does not work well across diverse tasks without manual reward engineering. We propose VLLR, a dense reward framework combining (1) an extrinsic reward from Large Language Models (LLMs) and Vision-Language Models (VLMs) for task-progress recognition, and (2) an intrinsic reward based on policy self-certainty. VLLR uses LLMs to decompose tasks into verifiable subtasks and then VLMs to estimate progress, initializing the value function during a brief warm-up phase and thereby avoiding prohibitive inference cost during full training; self-certainty provides per-step intrinsic guidance throughout PPO finetuning. Ablation studies reveal complementary benefits: VLM-based value initialization primarily improves task-completion efficiency, while self-certainty primarily enhances success rates, particularly on out-of-distribution tasks. On the CHORES benchmark covering mobile manipulation and navigation, VLLR achieves up to 56% absolute success-rate gains over the pretrained policy, up to 5% gains over state-of-the-art RL finetuning methods on in-distribution tasks, and up to 10% gains on out-of-distribution tasks, all without manual reward engineering. Additional visualizations can be found at https://silongyong.github.io/vllr_project_page/
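One simple way to turn policy self-certainty into a per-step intrinsic reward is to reward low normalized entropy of the action distribution, as sketched below. This is a plausible reading of the abstract's "self-certainty"; the coefficient and normalization are assumptions, not VLLR's exact definition.

```python
import torch

def self_certainty_bonus(action_logits, coef=0.01):
    """Intrinsic reward that grows as the policy becomes more certain:
    1 minus the entropy of the action distribution, normalized to [0, 1]."""
    probs = action_logits.softmax(dim=-1)
    entropy = -(probs * probs.clamp_min(1e-9).log()).sum(dim=-1)
    max_entropy = torch.log(torch.tensor(float(action_logits.shape[-1])))
    return coef * (1.0 - entropy / max_entropy)

# Added to the extrinsic (LLM/VLM-derived) reward at each PPO step:
# r_t = r_extrinsic_t + self_certainty_bonus(logits_t)
```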


【32】Bridging Structured Knowledge and Data: A Unified Framework with Finance Applications
Link: https://arxiv.org/abs/2604.00987

Authors: Yi Cao, Zexun Chen, Lin William Cong, Heqing Shi
Abstract: We develop Structured-Knowledge-Informed Neural Networks (SKINNs), a unified estimation framework that embeds theoretical, simulated, previously learned, or cross-domain insights as differentiable constraints within flexible neural function approximation. SKINNs jointly estimate neural network parameters and economically meaningful structural parameters in a single optimization problem, enforcing theoretical consistency not only on observed data but over a broader input domain through collocation, and thereby nesting approaches such as functional GMM, Bayesian updating, transfer learning, PINNs, and surrogate modeling. SKINNs define a class of M-estimators that are consistent and asymptotically normal, with root-N convergence, sandwich covariance, and recovery of pseudo-true parameters under misspecification. We establish identification of structural parameters under joint flexibility, derive generalization and target-risk bounds under distributional shift in a convex proxy, and provide a restricted-optimal characterization of the weighting parameter that governs the bias-variance tradeoff. In an illustrative financial application to option pricing, SKINNs improve out-of-sample valuation and hedging performance, particularly at longer horizons and during high-volatility regimes, while recovering economically interpretable structural parameters with improved stability relative to conventional calibration. More broadly, SKINNs provide a general econometric framework for combining model-based reasoning with high-dimensional, data-driven estimation.
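The joint-estimation idea can be illustrated on a toy problem: fit a network to noisy observations of $y = e^{-κx}$ while a collocation penalty enforces the structural restriction $f'(x) = -κ f(x)$ over the whole domain, with the structural parameter $κ$ learned alongside the network. The toy ODE and all hyperparameters are illustrative, not the paper's finance application.

```python
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(1, 64), nn.Tanh(), nn.Linear(64, 1))
log_kappa = nn.Parameter(torch.zeros(()))      # structural parameter (log scale)
opt = torch.optim.Adam(list(net.parameters()) + [log_kappa], lr=1e-3)

x_data = torch.rand(128, 1)
y_data = torch.exp(-2.0 * x_data) + 0.01 * torch.randn(128, 1)   # true kappa = 2

for step in range(2000):
    fit = ((net(x_data) - y_data) ** 2).mean()          # data fit on observations
    # Structural residual on fresh collocation points spanning the domain.
    x_col = torch.rand(256, 1, requires_grad=True)
    f = net(x_col)
    df = torch.autograd.grad(f.sum(), x_col, create_graph=True)[0]
    resid = ((df + log_kappa.exp() * f) ** 2).mean()
    loss = fit + 1.0 * resid                            # weighting parameter
    opt.zero_grad(); loss.backward(); opt.step()

print(log_kappa.exp())   # should move toward the true structural value near 2
```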


【33】Multi-Mode Quantum Annealing for Variational Autoencoders with General Boltzmann Priors
Link: https://arxiv.org/abs/2604.00919

Authors: Gilhan Kim, Daniel K. Park
Note: 17 pages, 6 figures
Abstract: Variational autoencoders (VAEs) learn compact latent representations of complex data, but their generative capacity is fundamentally constrained by the choice of prior distribution over the latent space. Energy-based priors offer a principled way to move beyond factorized assumptions and capture structured interactions among latent variables, yet training such priors at scale requires accurate and efficient sampling from intractable distributions. Here we present Boltzmann-machine-prior VAEs (BM-VAEs) trained using quantum-annealing-based sampling in three distinct operational modes within a single generative system. During training, diabatic quantum annealing (DQA) provides unbiased Boltzmann samples for gradient estimation of the energy-based prior; for unconditional generation, slower quantum annealing (QA) concentrates samples near low-energy minima; and for conditional generation, bias fields are added to direct sampling toward attribute-specific regions of the energy landscape (c-QA). Using up to 2000 qubits on a D-Wave Advantage2 processor, we demonstrate stable and efficient training across multiple datasets, with faster convergence and lower reconstruction loss than a Gaussian-prior VAE. The learned Boltzmann prior enables unconditional generation by sampling directly from the energy-based latent distribution, a capability that plain autoencoders lack, and conditional generation through latent biasing that leverages the learned pairwise interactions.


【34】Inverse-Free Sparse Variational Gaussian Processes
Link: https://arxiv.org/abs/2604.00697

Authors: Stefano Cortinovis, Laurence Aitchison, Stefanos Eleftheriadis, Mark van der Wilk
Note: Accepted to AISTATS 2026. 20 pages, 3 figures, 2 tables
Abstract: Gaussian processes (GPs) offer appealing properties but are costly to train at scale. Sparse variational GP (SVGP) approximations reduce cost yet still rely on Cholesky decompositions of kernel matrices, which are ill-suited to low-precision, massively parallel hardware. While one can construct valid variational bounds that rely only on matrix multiplications (matmuls) via an auxiliary matrix parameter, optimising them with off-the-shelf first-order methods is challenging. We make the inverse-free approach practical by proposing a better-conditioned bound and deriving a matmul-only natural-gradient update for the auxiliary parameter, markedly improving stability and convergence. We further provide simple heuristics, such as step-size schedules and stopping criteria, that make the overall optimisation routine fit seamlessly into existing workflows. Across regression and classification benchmarks, we demonstrate that our method (1) serves as a drop-in replacement in SVGP-based models (e.g., deep GPs), (2) recovers similar performance to traditional methods, and (3) can be faster than baselines when well tuned.


【35】Scenario theory for multi-criteria data-driven decision making
Link: https://arxiv.org/abs/2604.00553

Authors: Simone Garatti, Lucrezia Manieri, Alessandro Falsone, Algo Carè, Marco C. Campi, Maria Prandini
Abstract: The scenario approach provides a powerful data-driven framework for designing solutions under uncertainty with rigorous probabilistic robustness guarantees. Existing theory, however, primarily addresses assessing robustness with respect to a single appropriateness criterion for the solution, based on a single dataset, whereas many practical applications, including multi-agent decision problems, require the simultaneous consideration of multiple criteria and the assessment of their robustness based on multiple datasets, one per criterion. This paper develops a general scenario theory for multi-criteria data-driven decision making. A central innovation lies in the collective treatment of the risks associated with violations of individual criteria, which yields substantially more accurate robustness certificates than those derived from a naive application of standard results. In turn, this approach enables a sharper quantification of the robustness level with which all criteria are simultaneously satisfied. The proposed framework applies broadly to multi-criteria data-driven decision problems, providing a principled, scalable, and theoretically grounded methodology for design under uncertainty.


【36】Activation Saturation and Floquet Spectrum Collapse in Neural ODEs
Link: https://arxiv.org/abs/2604.00543

Authors: Nikolaos M. Matzakos
Note: 21 pages, 5 figures
Abstract: We prove that activation saturation imposes a structural dynamical limitation on autonomous Neural ODEs $\dot{h}=f_θ(h)$ with saturating activations ($\tanh$, sigmoid, etc.): if $q$ hidden layers of the MLP $f_θ$ satisfy $|σ'| \le δ$ on a region $U$, the input Jacobian is attenuated as $\|Df_θ(x)\| \le C(U)$ (for activations with $\sup_x |σ'(x)| \le 1$, e.g. $\tanh$ and sigmoid, this reduces to $C_W δ^q$), forcing every Floquet (Lyapunov) exponent along any $T$-periodic orbit $γ \subset U$ into the interval $[-C(U), C(U)]$. This is a collapse of the Floquet spectrum: as saturation deepens ($δ \to 0$), all exponents are driven to zero, limiting both strong contraction and chaotic sensitivity. The obstruction is structural: it constrains the learned vector field at inference time, independent of training quality. As a secondary contribution, for activations with $σ' > 0$, a saturation-weighted spectral factorisation yields a refined bound $\widetilde{C}(U) \le C(U)$ whose improvement is amplified exponentially in $T$ at the flow level. All results are numerically illustrated on the Stuart-Landau oscillator; the bounds provide a theoretical explanation for the empirically observed failure of $\tanh$-NODEs on the Morris-Lecar neuron model.
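The headline bound is easy to probe numerically: for a $q$-layer $\tanh$ MLP, the chain rule gives $Df = \prod_i \mathrm{diag}(σ'(\mathrm{pre}_i)) W_i$, so the spectral norm of $Df$ is bounded by the product of per-layer weight norms times per-layer maxima of $σ'$. The sketch below compares that bound with the exact Jacobian norm as the input is pushed into saturation (an illustration of the attenuation, not the paper's proof).

```python
import numpy as np

rng = np.random.default_rng(0)
width, q = 32, 4
Ws = [rng.normal(size=(width, width)) / np.sqrt(width) for _ in range(q)]

def jac_norm_and_bound(h):
    """Exact spectral norm of Df at h for f = tanh(W_q ... tanh(W_1 h)),
    alongside the layerwise bound prod_i ||W_i||_2 * max|tanh'(pre_i)|."""
    J, bound = np.eye(width), 1.0
    for W in Ws:
        pre = W @ h
        sigma_prime = 1.0 - np.tanh(pre) ** 2          # tanh' = sech^2
        J = np.diag(sigma_prime) @ W @ J               # chain rule
        bound *= sigma_prime.max() * np.linalg.norm(W, 2)
        h = np.tanh(pre)
    return np.linalg.norm(J, 2), bound

# Larger inputs saturate the activations, shrinking both the true
# Jacobian norm and its C_W * delta^q style upper bound.
for scale in (0.1, 1.0, 5.0, 20.0):
    print(scale, jac_norm_and_bound(scale * np.ones(width)))
```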

