cs.LG: 127 papers today
Large language models (22 papers)
【1】Learning the Signature of Memorization in Autoregressive Language Models
Link: https://arxiv.org/abs/2604.03199
Authors: David Ilić, Kostadin Cvejoski, David Stanojević, Evgeny Grigorenko
Comments: Preprint. 10 pages, 4 figures, 12 tables
Abstract: All prior membership inference attacks for fine-tuned language models use hand-crafted heuristics (e.g., loss thresholding, Min-K\%, reference calibration), each bounded by the designer's intuition. We introduce the first transferable learned attack, enabled by the observation that fine-tuning any model on any corpus yields unlimited labeled data, since membership is known by construction. This removes the shadow model bottleneck and brings membership inference into the deep learning era: learning what matters rather than designing it, with generalization through training diversity and scale. We discover that fine-tuning language models produces an invariant signature of memorization detectable across architectural families and data domains. We train a membership inference classifier exclusively on transformer-based models. It transfers zero-shot to Mamba (state-space), RWKV-4 (linear attention), and RecurrentGemma (gated recurrence), achieving 0.963, 0.972, and 0.936 AUC respectively. Each evaluation combines an architecture and dataset never seen during training, yet all three exceed performance on held-out transformers (0.908 AUC). These four families share no computational mechanisms; their only commonality is gradient descent on cross-entropy loss. Even simple likelihood-based methods exhibit strong transfer, confirming the signature exists independently of the detection method. Our method, Learned Transfer MIA (LT-MIA), captures this signal most effectively by reframing membership inference as sequence classification over per-token distributional statistics. On transformers, LT-MIA achieves 2.8$\times$ higher TPR at 0.1\% FPR than the strongest baseline. The method also transfers to code (0.865 AUC) despite training only on natural language texts. Code and trained classifier available at https://github.com/JetBrains-Research/learned-mia.
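A minimal sketch of the kind of per-token featurization that "sequence classification over per-token distributional statistics" implies (the specific statistics and the classifier are assumptions, not taken from the paper): each text is mapped to a sequence of per-token statistics under the target model, and a small sequence classifier is trained on these with membership labels known by construction.

```python
import torch

def per_token_features(model, tokenizer, text):
    """Map a text to per-token distributional statistics (hypothetical feature set)."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0, :-1]          # position t predicts token t+1
    logprobs = torch.log_softmax(logits, dim=-1)
    targets = ids[0, 1:]
    tok_logprob = logprobs.gather(-1, targets[:, None]).squeeze(-1)  # log p(x_t | x_<t)
    entropy = -(logprobs.exp() * logprobs).sum(-1)                   # predictive entropy
    rank = (logprobs > tok_logprob[:, None]).sum(-1).float()         # rank of true token
    return torch.stack([tok_logprob, entropy, rank], dim=-1)         # shape (T-1, 3)

# A small sequence classifier (e.g., a GRU) is then trained on these feature
# sequences, with member/non-member labels known by construction of fine-tuning.
```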
【2】The Compression Gap: Why Discrete Tokenization Limits Vision-Language-Action Model Scaling
Link: https://arxiv.org/abs/2604.03191
Authors: Takuya Shiba
Comments: 11 pages, 1 figure
Abstract: Scaling Vision-Language-Action (VLA) models by upgrading the vision encoder is expected to improve downstream manipulation performance--as it does in vision-language modeling. We show that this expectation fails when actions are represented as discrete tokens, and explain why through an information-theoretic principle we call the Compression Gap: in any visuomotor pipeline, scaling behavior is governed by the location of the tightest information bottleneck. When actions are continuous (e.g., Diffusion Policy), the vision encoder is the binding constraint, and upgrading it directly improves performance. When actions are discretized through a fixed-capacity codebook (e.g., OAT), the codebook becomes the binding constraint, and encoder improvements cannot propagate past it--regardless of how rich the upstream representation is. We validate this principle on the LIBERO benchmark with three lines of evidence: a factorial experiment showing that encoder upgrades improve Diffusion Policy by over 21 percentage points while OAT gains are substantially attenuated across model scales; an encoder quality gradient across four encoders confirming that Diffusion Policy tracks encoder quality monotonically while OAT remains flat; and a codebook size experiment demonstrating that relaxing codebook capacity partially recovers encoder sensitivity, providing causal evidence for the bottleneck hypothesis. Our findings reveal that scaling in Physical AI requires identifying where information bottlenecks lie in the pipeline, rather than uniformly increasing model or data size.
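One way to formalize the claimed bottleneck (a standard data-processing argument; the notation $X \to Z_{\mathrm{enc}} \to Z_{\mathrm{code}} \to A$ for observation, encoder output, codebook token, and action is ours, not the paper's): if the pipeline is a Markov chain with a $K$-entry codebook at $Z_{\mathrm{code}}$, then

```latex
I(X; A) \;\le\; I(X; Z_{\mathrm{code}}) \;\le\; H(Z_{\mathrm{code}}) \;\le\; \log_2 K \ \text{bits per action token},
```

so upgrading the encoder can raise $I(X; Z_{\mathrm{enc}})$ but cannot lift $I(X; A)$ past the $\log_2 K$ cap, whereas a continuous action head has no such hard ceiling.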
【3】PRISM: LLM-Guided Semantic Clustering for High-Precision Topics
Link: https://arxiv.org/abs/2604.03180
Authors: Connor Douglas, Utkucan Balci, Joseph Aylett-Bullock
Comments: To appear in Proceedings of the ACM Web Conference 2026 (WWW 26)
Abstract: In this paper, we propose Precision-Informed Semantic Modeling (PRISM), a structured topic modeling framework combining the benefits of rich representations captured by LLMs with the low cost and interpretability of latent semantic clustering methods. PRISM fine-tunes a sentence encoding model using a sparse set of LLM-provided labels on samples drawn from some corpus of interest. We segment this embedding space with thresholded clustering, yielding clusters that separate closely related topics within some narrow domain. Across multiple corpora, PRISM improves topic separability over state-of-the-art local topic models and even over clustering on large, frontier embedding models while requiring only a small number of LLM queries to train. This work contributes to several research streams by providing (i) a student-teacher pipeline to distill sparse LLM supervision into a lightweight model for topic discovery; (ii) an analysis of the efficacy of sampling strategies to improve local geometry for cluster separability; and (iii) an effective approach for web-scale text analysis, enabling researchers and practitioners to track nuanced claims and subtopics online with an interpretable, locally deployable framework.
【4】Toward an Artificial General Teacher: Procedural Geometry Data Generation and Visual Grounding with Vision-Language Models
Link: https://arxiv.org/abs/2604.02893
Authors: Hai Nguyen-Truong, Alper Balbay, Tunga Bayrak
Comments: 12 pages, 7 figures
Abstract: We study visual explanation in geometry education as a Referring Image Segmentation (RIS) problem: given a diagram and a natural language description, the task is to produce a pixel-level mask for the referred geometric element. However, existing RIS models trained on natural image benchmarks such as RefCOCO fail catastrophically on geometric diagrams due to the fundamental domain shift between photographic scenes and abstract, textureless schematics. To address the absence of suitable training data, we present a fully automated procedural data engine that generates over 200,000 synthetic geometry diagrams with pixel-perfect segmentation masks and linguistically diverse referring expressions, requiring zero manual annotation. We further propose domain-specific fine-tuning of vision-language models (VLMs), demonstrating that a fine-tuned Florence-2 achieves 49% IoU and 85% Buffered IoU (BIoU), compared to <1% IoU in zero-shot settings. We introduce Buffered IoU, a geometry-aware evaluation metric that accounts for thin-structure localization, and show that it better reflects true segmentation quality than standard IoU. Our results establish a foundation for building Artificial General Teachers (AGTs) capable of providing visually grounded, step-by-step explanations of geometry problems.
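One plausible realization of a buffered IoU for thin structures (the buffer radius and exact definition here are assumptions; the paper defines its own variant): dilate both masks by a small pixel tolerance before computing IoU, so a prediction a few pixels off a thin line is not scored as a total miss.

```python
import numpy as np
from scipy.ndimage import binary_dilation

def buffered_iou(pred, gt, buffer_px=3):
    """IoU computed after dilating both binary masks by a small pixel tolerance."""
    struct = np.ones((2 * buffer_px + 1, 2 * buffer_px + 1), dtype=bool)
    p = binary_dilation(pred.astype(bool), structure=struct)
    g = binary_dilation(gt.astype(bool), structure=struct)
    inter = np.logical_and(p, g).sum()
    union = np.logical_or(p, g).sum()
    return inter / max(union, 1)
```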
【5】Random Is Hard to Beat: Active Selection in online DPO with Modern LLMs
Link: https://arxiv.org/abs/2604.02766
Authors: Giyeong Oh, Junghyun Lee, Jaehyun Park, Youngjae Yu, Wonho Bae, Junhyug Noh
Comments: first commit
Abstract: Modern LLMs inherit strong priors from web-scale pretraining, which can limit the headroom of post-training data-selection strategies. While Active Preference Learning (APL) seeks to optimize query efficiency in online Direct Preference Optimization (DPO), the inherent richness of on-policy candidate pools often renders simple Random sampling a surprisingly formidable baseline. We evaluate uncertainty-based APL against Random across harmlessness, helpfulness, and instruction-following settings, utilizing both reward models and LLM-as-a-judge proxies. We find that APL yields negligible improvements in proxy win-rates compared to Random. Crucially, we observe a dissociation where win-rate improves even as general capability -- measured by standard benchmarks -- degrades. APL fails to mitigate this capability collapse or reduce variance significantly better than random sampling. Our findings suggest that in the regime of strong pre-trained priors, the computational overhead of active selection is difficult to justify against the "cheap diversity" provided by simple random samples. Our code is available at https://github.com/BootsofLagrangian/random-vs-apl.
【6】Generative Frontiers: Why Evaluation Matters for Diffusion Language Models
Link: https://arxiv.org/abs/2604.02718
Authors: Patrick Pynadath, Jiaxin Shi, Ruqi Zhang
Abstract: Diffusion language models have seen exciting recent progress, offering far more flexibility in generative trajectories than autoregressive models. This flexibility has motivated a growing body of research into new approaches to diffusion language modeling, which typically begins at the scale of GPT-2 small (150 million parameters). However, these advances introduce new issues with evaluation methodology. In this technical note, we discuss the limitations of current methodology and propose principled augmentations to ensure reliable comparisons. We first discuss why OpenWebText has become the standard benchmark, and why alternatives such as LM1B are inherently less meaningful. We then discuss the limitations of likelihood evaluations for diffusion models, and explain why relying on generative perplexity alone as a metric can lead to uninformative results. To address this, we show that generative perplexity and entropy are two components of the KL divergence to a reference distribution. This decomposition explains generative perplexity's sensitivity to entropy, and naturally suggests generative frontiers as a principled method for evaluating model generative quality. We conclude with empirical observations on model quality at this scale. We include a blog post with interactive content to illustrate the argument at https://patrickpynadath1.github.io/blog/eval_methodology/.
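The decomposition the note refers to can be written out directly (standard identities; the notation, with generator $q$ and reference model $p$, is ours and elides per-token length normalization):

```latex
\underbrace{\mathbb{E}_{x \sim q}\left[-\log p(x)\right]}_{\propto\, \log(\text{generative perplexity})} \;=\; H(q) \;+\; \mathrm{KL}(q \,\|\, p).
```

A model can therefore lower generative perplexity simply by collapsing its entropy $H(q)$, while $\mathrm{KL}(q\|p)$ isolates distributional quality; plotting one component against the other is what traces out a generative frontier.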
【7】Evaluating the Formal Reasoning Capabilities of Large Language Models through Chomsky Hierarchy
Link: https://arxiv.org/abs/2604.02709
Authors: Yihong Dong, Xiaoha Jian, Xue Jiang, Xuyuan Guo, Zhiyuan Fan, Jiaru Qian, Kechi Zhang, Jia Li, Zhi Jin, Ge Li
Comments: Work in progress
Abstract: The formal reasoning capabilities of LLMs are crucial for advancing automated software engineering. However, existing benchmarks for LLMs lack systematic evaluation based on computation and complexity, leaving a critical gap in understanding their formal reasoning capabilities. Therefore, it is still unknown whether SOTA LLMs can grasp the structured, hierarchical complexity of formal languages as defined by Computation Theory. To address this, we introduce ChomskyBench, a benchmark for systematically evaluating LLMs through the lens of Chomsky Hierarchy. Unlike prior work that uses vectorized classification for neural networks, ChomskyBench is the first to combine full Chomsky Hierarchy coverage, process-trace evaluation via natural language, and deterministic symbolic verifiability. ChomskyBench is composed of a comprehensive suite of language recognition and generation tasks designed to test capabilities at each level. Extensive experiments indicate a clear performance stratification that correlates with the hierarchy's levels of complexity. Our analysis reveals a direct relationship where increasing task difficulty substantially impacts both inference length and performance. Furthermore, we find that while larger models and advanced inference methods offer notable relative gains, they face severe efficiency barriers: achieving practical reliability would require prohibitive computational costs, revealing that current limitations stem from inefficiency rather than absolute capability bounds. A time complexity analysis further indicates that LLMs are significantly less efficient than traditional algorithmic programs for these formal tasks. These results delineate the practical limits of current LLMs, highlight the indispensability of traditional software tools, and provide insights to guide the development of future LLMs with more powerful formal reasoning capabilities.
【8】Reinforcement Learning-based Knowledge Distillation with LLM-as-a-Judge
Link: https://arxiv.org/abs/2604.02621
Authors: Yiyang Shen, Lifu Tu, Weiran Wang
Abstract: Reinforcement Learning (RL) has been shown to substantially improve the reasoning capability of small and large language models (LLMs), but existing approaches typically rely on verifiable rewards, hence ground truth labels. We propose an RL framework that uses rewards from an LLM that acts as a judge evaluating model outputs over large amounts of unlabeled data, enabling label-free knowledge distillation and replacing the need of ground truth supervision. Notably, the judge operates with a single-token output, making reward computation efficient. When combined with verifiable rewards, our approach yields substantial performance gains across math reasoning benchmarks. These results suggest that LLM-based evaluators can produce effective training signals for RL fine-tuning.
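A minimal sketch of a single-token judge reward of the kind described (the prompt wording and token choice are assumptions): the judge scores each rollout by the probability it assigns to a "Yes" continuation, so each reward costs one forward pass and one output token.

```python
import torch

def judge_reward(judge_model, judge_tok, question, answer):
    """Reward = P('Yes') from a judge LLM constrained to a single output token."""
    prompt = (f"Question: {question}\nAnswer: {answer}\n"
              "Is this answer correct? Reply Yes or No.\nReply:")
    ids = judge_tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = judge_model(ids).logits[0, -1]          # next-token distribution
    yes_id = judge_tok(" Yes", add_special_tokens=False).input_ids[0]
    no_id = judge_tok(" No", add_special_tokens=False).input_ids[0]
    pair = torch.stack([logits[yes_id], logits[no_id]])
    return torch.softmax(pair, dim=0)[0].item()          # P(Yes | {Yes, No})
```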
【9】AutoVerifier: An Agentic Automated Verification Framework Using Large Language Models
Link: https://arxiv.org/abs/2604.02617
Authors: Yuntao Du, Minh Dinh, Kaiyuan Zhang, Ninghui Li
Comments: Winner of 2025-2026 Radiance Technologies Innovation Bowl
Abstract: Scientific and Technical Intelligence (S&TI) analysis requires verifying complex technical claims across rapidly growing literature, where existing approaches fail to bridge the verification gap between surface-level accuracy and deeper methodological validity. We present AutoVerifier, an LLM-based agentic framework that automates end-to-end verification of technical claims without requiring domain expertise. AutoVerifier decomposes every technical assertion into structured claim triples of the form (Subject, Predicate, Object), constructing knowledge graphs that enable structured reasoning across six progressively enriching layers: corpus construction and ingestion, entity and claim extraction, intra-document verification, cross-source verification, external signal corroboration, and final hypothesis matrix generation. We demonstrate AutoVerifier on a contested quantum computing claim, where the framework, operated by analysts with no quantum expertise, automatically identified overclaims and metric inconsistencies within the target paper, traced cross-source contradictions, uncovered undisclosed commercial conflicts of interest, and produced a final assessment. These results show that structured LLM verification can reliably evaluate the validity and maturity of emerging technologies, turning raw technical documents into traceable, evidence-backed intelligence assessments.
【10】Understanding the Effects of Safety Unalignment on Large Language Models
Link: https://arxiv.org/abs/2604.02574
Authors: John T. Halloran
Comments: 12 pages, 2 figures, 5 tables
Abstract: Safety alignment has become a critical step to ensure LLMs refuse harmful requests while providing helpful and harmless responses. However, despite the ubiquity of safety alignment for deployed frontier models, two separate lines of recent work--jailbreak-tuning (JT) and weight orthogonalization (WO)--have shown that safety guardrails may be largely disabled, resulting in LLMs which comply with harmful requests they would normally refuse. In spite of far-reaching safety implications, analysis has largely been limited to refusal rates of each unalignment method in isolation, leaving their relative effects on adversarial LLM capabilities unknown. To fill this gap, we study the impact of unaligning six popular LLMs of various sizes across a large number of malicious and benign tasks, using both JT and WO. Across the evaluated models, we show that while refusal degradation is split between the two methods, WO produces LLMs far more capable of aiding in malicious activity; in contrast to JT, the majority of WO unaligned models are far less prone to hallucinations, better retain their original natural-language performance, and are more effective at state-of-the-art adversarial and cyber attacks. To thus help mitigate the malicious risks of WO unalignment, we conclude by showing that supervised fine-tuning effectively limits the adversarial attack abilities enabled by WO, without drastically affecting hallucination rates or natural language performance.
【11】WSVD: Weighted Low-Rank Approximation for Fast and Efficient Execution of Low-Precision Vision-Language Models
Link: https://arxiv.org/abs/2604.02570
Authors: Haiyu Wang, Yutong Wang, Jack Jiang, Sai Qian Zhang
Abstract: Singular Value Decomposition (SVD) has become an important technique for reducing the computational burden of Vision Language Models (VLMs), which play a central role in tasks such as image captioning and visual question answering. Although multiple prior works have proposed efficient SVD variants to enable low-rank operations, we find that in practice it remains difficult to achieve substantial latency reduction during model execution. To address this limitation, we introduce a new computational pattern and apply SVD at a finer granularity, enabling real and measurable improvements in execution latency. Furthermore, recognizing that weight elements differ in their relative importance, we adaptively allocate relative importance to each element during SVD process to better preserve accuracy, then extend this framework with quantization applied to both weights and activations, resulting in a highly efficient VLM. Collectively, we introduce~\textit{Weighted SVD} (WSVD), which outperforms other approaches by achieving over $1.8\times$ decoding speedup while preserving accuracy. We open source our code at: \href{https://github.com/SAI-Lab-NYU/WSVD}{\texttt{https://github.com/SAI-Lab-NYU/WSVD}}
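A common way to realize an importance-weighted low-rank approximation (a sketch under the simplifying assumption of a diagonal per-row importance matrix; the paper's per-element weighting is more general): scale the weight by importance, truncate the SVD, then unscale.

```python
import torch

def weighted_low_rank(W, importance, rank):
    """Rank-r approximation minimizing ||S (W - W_lr)||_F with S = diag(importance).

    W: (m, n) weight; importance: (m,) positive per-row scores, e.g., derived
    from activation statistics; rank: target rank.
    """
    S = importance.clamp_min(1e-6)
    U, sigma, Vh = torch.linalg.svd(S[:, None] * W, full_matrices=False)
    U_r = U[:, :rank] * sigma[:rank]            # keep the top-rank components of SW
    return (U_r @ Vh[:rank]) / S[:, None]       # undo the importance scaling
```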
【12】Fast NF4 Dequantization Kernels for Large Language Model Inference
Link: https://arxiv.org/abs/2604.02556
Authors: Xiangbo Qi, Chaoyi Jiang, Murali Annavaram
Comments: 7 pages, 4 figures, EMC2 Workshop at ASPLOS 2026
Abstract: Large language models (LLMs) have grown beyond the memory capacity of single GPU devices, necessitating quantization techniques for practical deployment. While NF4 (4-bit NormalFloat) quantization enables 4$\times$ memory reduction, inference on current NVIDIA GPUs (e.g., Ampere A100) requires expensive dequantization back to FP16 format, creating a critical performance bottleneck. This paper presents a lightweight shared memory optimization that addresses this gap through principled memory hierarchy exploitation while maintaining full ecosystem compatibility. We compare our technique against the open-source BitsAndBytes implementation, achieving 2.0--2.2$\times$ kernel speedup across three models (Gemma 27B, Qwen3 32B, and Llama3.3 70B) and up to 1.54$\times$ end-to-end improvement by leveraging the 12--15$\times$ latency advantage of shared memory over global memory access. Our optimization reduces instruction counts through simplified indexing logic while using only 64 bytes of shared memory per thread block, demonstrating that lightweight optimizations can deliver substantial performance gains with minimal engineering effort. This work provides a plug-and-play solution for the HuggingFace ecosystem that democratizes access to advanced models on existing GPU infrastructure.
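For reference, NF4 dequantization is at heart a 16-entry table lookup over packed 4-bit indices plus a per-block absmax rescale; the paper's kernel work concerns where that table and the indices live in the GPU memory hierarchy. A CPU-side sketch of the math (the codebook values are the standard QLoRA NF4 levels rounded to four decimals; block size 64 and nibble order are assumptions matching common bitsandbytes defaults):

```python
import numpy as np

# The 16 NF4 quantile levels from QLoRA (rounded here to 4 decimals).
NF4 = np.array([-1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0,
                0.0796, 0.1609, 0.2461, 0.3379, 0.4407, 0.5626, 0.7230, 1.0],
               dtype=np.float32)

def dequant_nf4(packed, absmax, block=64):
    """Dequantize packed 4-bit NF4 codes: two codes per byte, one absmax per block."""
    idx = np.empty(packed.size * 2, dtype=np.uint8)
    idx[0::2] = packed >> 4          # high nibble
    idx[1::2] = packed & 0x0F        # low nibble
    vals = NF4[idx]                  # the 16-entry table lookup the kernel accelerates
    return (vals.reshape(-1, block) * absmax[:, None]).reshape(-1)
```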
【13】Synapse: Evolving Job-Person Fit with Explainable Two-phase Retrieval and LLM-guided Genetic Resume Optimization
Link: https://arxiv.org/abs/2604.02539
Authors: Ansel Kaplan Erol, Seohee Yoon, Keenan Hom, Xisheng Zhang
Abstract: Modern recruitment platforms operate under severe information imbalance: job seekers must search over massive, rapidly changing collections of postings, while employers are overwhelmed by high-volume, low-relevance applicant pools. Existing recruitment recommender systems typically rely on keyword matching or single-stage semantic retrieval, which struggle to capture fine-grained alignment between candidate experience and job requirements under real-world scale and cost constraints. We present Synapse, a multi-stage semantic recruitment system that separates high-recall candidate generation from high-precision semantic reranking, combining efficient dense retrieval using FAISS with an ensemble of contrastive learning and Large Language Model (LLM) reasoning. To improve transparency, Synapse incorporates a retrieval-augmented explanation layer that grounds recommendations in explicit evidence. Beyond retrieval, we introduce a novel evolutionary resume optimization framework that treats resume refinement as a black-box optimization problem. Using Differential Evolution with LLM-guided mutation operators, the system iteratively modifies candidate representations to improve alignment with screening objectives, without any labeled data. Evaluation shows that the proposed ensemble improves nDCG@10 by 22% over embedding-only retrieval baselines, while the evolutionary optimization loop consistently yields monotonic improvements in recommender scores, exceeding 60% relative gain across evaluated profiles. We plan to release code and data upon publication.
【14】Jump Start or False Start? A Theoretical and Empirical Evaluation of LLM-initialized Bandits
Link: https://arxiv.org/abs/2604.02527
Authors: Adam Bayley, Xiaodan Zhu, Raquel Aoki, Yanshuai Cao, Kevin H. Wilson
Comments: 25 pages, 3 figures
Abstract: The recent advancement of Large Language Models (LLMs) offers new opportunities to generate user preference data to warm-start bandits. Recent studies on contextual bandits with LLM initialization (CBLI) have shown that these synthetic priors can significantly lower early regret. However, these findings assume that LLM-generated choices are reasonably aligned with actual user preferences. In this paper, we systematically examine how LLM-generated preferences perform when random and label-flipping noise is injected into the synthetic training data. For aligned domains, we find that warm-starting remains effective up to 30% corruption, loses its advantage around 40%, and degrades performance beyond 50%. When there is systematic misalignment, even without added noise, LLM-generated priors can lead to higher regret than a cold-start bandit. To explain these behaviors, we develop a theoretical analysis that decomposes the effect of random label noise and systematic misalignment on the prior error driving the bandit's regret, and derive a sufficient condition under which LLM-based warm starts are provably better than a cold-start bandit. We validate these results across multiple conjoint datasets and LLMs, showing that estimated alignment reliably tracks when warm-starting improves or degrades recommendation quality.
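The corruption model used in experiments like these is easy to state precisely (a sketch of the standard label-flipping setup; variable names and the warm-start call in the comment are ours): each synthetic binary preference is flipped independently with probability p before the bandit is warm-started.

```python
import numpy as np

def flip_labels(prefs, p, seed=0):
    """Label-flipping noise: invert each binary preference with probability p."""
    rng = np.random.default_rng(seed)
    prefs = np.asarray(prefs)
    flips = rng.random(prefs.shape) < p
    return np.where(flips, 1 - prefs, prefs)

# e.g., warm-start priors at the corruption levels studied (hypothetical API):
# for p in (0.0, 0.3, 0.4, 0.5): bandit.warm_start(flip_labels(llm_prefs, p))
```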
【15】Failing to Falsify: Evaluating and Mitigating Confirmation Bias in Language Models
Link: https://arxiv.org/abs/2604.02485
Authors: Ayush Rajesh Jhaveri, Anthony GX-Chen, Ilia Sucholutsky, Eunsol Choi
Abstract: Confirmation bias, the tendency to seek evidence that supports rather than challenges one's belief, hinders one's reasoning ability. We examine whether large language models (LLMs) exhibit confirmation bias by adapting the rule-discovery study from human psychology: given a sequence of three numbers (a "triple"), an agent engages in an interactive feedback loop where it (1) proposes a new triple, (2) receives feedback on whether it satisfies the hidden rule, and (3) guesses the rule. Across eleven LLMs of multiple families and scales, we find that LLMs exhibit confirmation bias, often proposing triples to confirm their hypothesis rather than trying to falsify it. This leads to slower and less frequent discovery of the hidden rule. We further explore intervention strategies (e.g., encouraging the agent to consider counter examples) developed for humans. We find prompting LLMs with such instruction consistently decreases confirmation bias in LLMs, improving rule discovery rates from 42% to 56% on average. Lastly, we mitigate confirmation bias by distilling intervention-induced behavior into LLMs, showing promising generalization to a new task, the Blicket test. Our work shows that confirmation bias is a limitation of LLMs in hypothesis exploration, and that it can be mitigated via injecting interventions designed for humans.
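The loop adapts Wason's classic 2-4-6 rule-discovery task; a minimal harness looks roughly like this (the seed triple, hidden rule, and agent interface are illustrative assumptions, not the paper's protocol):

```python
def rule_discovery_episode(agent, hidden_rule, max_turns=10):
    """agent(history) -> (triple, guessed_rule); hidden_rule(triple) -> bool."""
    history = [((2, 4, 6), True)]                 # canonical seed example
    for _ in range(max_turns):
        triple, guess = agent(history)            # (1) propose a triple, (3) guess rule
        feedback = hidden_rule(triple)            # (2) yes/no oracle feedback
        history.append((triple, feedback))
        # A falsifying agent proposes triples it expects to come back 'False';
        # a confirmation-biased agent mostly proposes expected-'True' triples.
    return history

# Example hidden rule: any strictly increasing sequence.
increasing = lambda t: t[0] < t[1] < t[2]
```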
【16】On the Geometric Structure of Layer Updates in Deep Language Models
Link: https://arxiv.org/abs/2604.02459
Authors: Jun-Sik Yoo
Comments: 11 pages, 5 figures
Abstract: We study the geometric structure of layer updates in deep language models. Rather than analyzing what information is encoded in intermediate representations, we ask how representations change from one layer to the next. We show that layerwise updates admit a decomposition into a dominant tokenwise component and a residual that is not captured by restricted tokenwise function classes. Across multiple architectures, including Transformers and state-space models, we find that the full layer update is almost perfectly aligned with the tokenwise component, while the residual exhibits substantially weaker alignment, larger angular deviation, and significantly lower projection onto the dominant tokenwise subspace. This indicates that the residual is not merely a small correction, but a geometrically distinct component of the transformation. This geometric separation has functional consequences: approximation error under the restricted tokenwise model is strongly associated with output perturbation, with Spearman correlations often exceeding 0.7 and reaching up to 0.95 in larger models. Together, these results suggest that most layerwise updates behave like structured reparameterizations along a dominant direction, while functionally significant computation is concentrated in a geometrically distinct residual component. Our framework provides a simple, architecture-agnostic method for probing the geometric and functional structure of layer updates in modern language models.
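One concrete instantiation of such a decomposition (a sketch; the paper's restricted tokenwise function class may be richer than the single shared linear map assumed here): fit a tokenwise map to the layer update by least squares, then compare the alignment of the full update with the fitted component versus the residual.

```python
import torch
import torch.nn.functional as F

def tokenwise_decomposition(H_in, H_out):
    """Split a layer update into a shared tokenwise linear component + residual.

    H_in, H_out: (T, d) representations before/after one layer, tokens as rows.
    """
    U = H_out - H_in                                  # the layer update
    A = torch.linalg.lstsq(H_in, U).solution          # best shared map: H_in @ A ~ U
    U_tok = H_in @ A                                  # dominant tokenwise component
    R = U - U_tok                                     # residual component
    align_tok = F.cosine_similarity(U.flatten(), U_tok.flatten(), dim=0)
    align_res = F.cosine_similarity(U.flatten(), R.flatten(), dim=0)
    return align_tok, align_res
```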
【17】Fighting AI with AI: AI-Agent Augmented DNS Blocking of LLM Services during Student Evaluations
Link: https://arxiv.org/abs/2604.02360
Authors: Yonas Kassa, James Bonacci, Ping Wang
Comments: accepted at ITNG 2026
Abstract: The transformative potential of large language models (LLMs) in education, such as improving accessibility and personalized learning, is being eclipsed by significant challenges. These challenges stem from concerns that LLMs undermine academic assessment by enabling bypassing of critical thinking, leading to increased cognitive offloading. This emerging trend stresses the dual imperative of harnessing AI's educational benefits while safeguarding critical thinking and academic rigor in the evolving AI ecosystem. To this end, we introduce AI-Sinkhole, an AI-agent augmented DNS-based framework that dynamically discovers, semantically classifies, and temporarily network-wide blocks emerging LLM chatbot services during proctored exams. AI-Sinkhole offers explainable classification via quantized LLMs (LLama 3, DeepSeek-R1, Qwen-3) and dynamic DNS blocking with Pi-Hole. We also share our observations in using LLMs as explainable classifiers which achieved robust cross-lingual performance (F1-score > 0.83). To support future research and development in this domain, initial code with a readily deployable 'AI-Sinkhole' blocklist is available on https://github.com/AIMLEdu/ai-sinkhole.
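The enforcement side of such a system reduces to publishing a DNS blocklist that Pi-hole can consume; a sketch of generating one in the standard hosts format from classifier output (the classify function and domain feed are assumed interfaces, not the repo's code):

```python
def write_blocklist(domains, classify, path="ai-sinkhole.hosts"):
    """Emit a hosts-format blocklist, one '0.0.0.0 domain' line per hit, which
    Pi-hole can subscribe to as a blocklist URL or local file.

    classify(domain) -> True if the domain serves an LLM chatbot
    (assumed to wrap the quantized-LLM classifier).
    """
    blocked = sorted(d for d in domains if classify(d))
    with open(path, "w") as f:
        for d in blocked:
            f.write(f"0.0.0.0 {d}\n")   # sinkhole: resolve to a non-routable address
    return blocked
```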
【18】DrugPlayGround: Benchmarking Large Language Models and Embeddings for Drug Discovery
Link: https://arxiv.org/abs/2604.02346
Authors: Tianyu Liu, Sihan Jiang, Fan Zhang, Kunyang Sun, Teresa Head-Gordon, Hongyu Zhao
Comments: 29 pages, 6 figures
Abstract: Large language models (LLMs) are in the ascendancy for research in drug discovery, offering unprecedented opportunities to reshape drug research by accelerating hypothesis generation, optimizing candidate prioritization, and enabling more scalable and cost-effective drug discovery pipelines. However, there is currently a lack of objective assessments of LLM performance to ascertain their advantages and limitations over traditional drug discovery platforms. To tackle this emergent problem, we have developed DrugPlayGround, a framework to evaluate and benchmark LLM performance for generating meaningful text-based descriptions of physiochemical drug characteristics, drug synergism, drug-protein interactions, and the physiological response to perturbations introduced by drug molecules. Moreover, DrugPlayGround is designed to work with domain experts to provide detailed explanations for justifying the predictions of LLMs, thereby testing LLMs for chemical and biological reasoning capabilities to push their greater use at the frontier of drug discovery at all of its stages.
【19】Characterizing WebGPU Dispatch Overhead for LLM Inference Across Four GPU Vendors, Three Backends, and Three Browsers
Link: https://arxiv.org/abs/2604.02344
Authors: Jędrzej Maczan
Abstract: WebGPU's security-focused design imposes per-operation validation that compounds across the many small dispatches in neural network inference, yet the true cost of this overhead is poorly characterized. We present a systematic characterization of WebGPU dispatch overhead for LLM inference at batch size 1, spanning four GPU vendors (NVIDIA, AMD, Apple, Intel), two native implementations (Dawn, wgpu-native) and three browsers (Chrome, Safari, Firefox), and two model sizes (Qwen2.5-0.5B and 1.5B). Our primary contribution is a sequential-dispatch methodology that reveals naive single-operation benchmarks overestimate dispatch cost by ${\sim}20\times$. The true per-dispatch cost of WebGPU API overhead alone is 24-36 $μ$s on Vulkan and 32-71 $μ$s on Metal, while the total per-operation overhead including Python cost is ${\sim}95$~$μ$s, which turns out to be a distinction critical for optimization. On Vulkan, kernel fusion improves throughput by 53%, while CUDA fusion provides no benefit, confirming that per-operation overhead is a primary differentiator. LLM inference was tested across three major operating systems (Linux, Windows, macOS). We built $\texttt{torch-webgpu}$, a PrivateUse1-based out-of-tree PyTorch backend and an FX-to-WebGPU compiler, which on our reference platform achieves 11--12% of CUDA performance. At dtype-matched float32, RTX PRO 2000 achieves 1.4$\times$ WebGPU's throughput despite ${\sim}6\times$ less compute than RTX 5090. For dispatch overhead, backend choice is the dominant factor, although implementation choice also matters substantially within a backend (2.2$\times$ for Metal). In terms of dispatch vs kernel compute efficiency, we conclude that at batch=1 with the current dispatch-heavy pipeline, per-operation overhead dominates regardless of kernel quality. All code, benchmarks, and raw data are open source.
【20】Haiku to Opus in Just 10 bits: LLMs Unlock Massive Compression Gains
Link: https://arxiv.org/abs/2604.02343
Authors: Roy Rinberg, Annabelle Michael Carrell, Simon Henniger, Nicholas Carlini, Keri Warr
Abstract: We study the compression of LLM-generated text across lossless and lossy regimes, characterizing a compression-compute frontier where more compression is possible at the cost of more compute. For lossless compression, domain-adapted LoRA adapters can improve LLM-based arithmetic coding by 2x over compression with the base LLM alone. For lossy compression, prompting a model for a succinct rewrite then applying arithmetic coding can achieve compression ratios of approximately 0.03, a 2x improvement over compressing the original response. We further introduce Question-Asking compression (QA), an interactive lossy protocol inspired by the game 'Twenty Questions'. A small model iteratively refines its response by asking yes/no questions to a stronger model, transferring exactly one bit per answer. On 8 benchmarks spanning math, science, and code, 10 binary questions recover 23% to 72% of the capability gap between a small and large model on standard benchmarks and 7% to 38% on harder benchmarks, achieving compression ratios of 0.0006 to 0.004. This is over 100x smaller than prior LLM-based compression (Deletang et al., 2024), suggesting that interactive protocols can transfer knowledge far more efficiently than transmitting full responses.
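A minimal sketch of the question-asking protocol (the interfaces and stopping rule are assumptions; what is grounded is the channel itself: each exchange moves exactly one bit from the strong model to the weak one):

```python
def qa_compress(weak, strong_oracle, task, n_bits=10):
    """Interactive lossy 'compression': the weak model refines its answer using
    n_bits yes/no answers from a stronger model.

    weak.ask_question(task, transcript) -> str   (assumed interfaces)
    weak.answer(task, transcript)       -> str
    strong_oracle(question)             -> bool  (one bit per call)
    """
    transcript = []
    for _ in range(n_bits):
        q = weak.ask_question(task, transcript)       # weak model picks the query
        bit = strong_oracle(q)                        # strong model sends one bit
        transcript.append((q, bit))
    return weak.answer(task, transcript), transcript  # payload is n_bits total
```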
【21】LLM Reasoning with Process Rewards for Outcome-Guided Steps
Link: https://arxiv.org/abs/2604.02341
Authors: Mohammad Rezaei, Jens Lehmann, Sahar Vahdati
Comments: 8 pages, 3 figures, 2 tables, submitted to IJCNN 2026 conference
Abstract: Mathematical reasoning in large language models has improved substantially with reinforcement learning using verifiable rewards, where final answers can be checked automatically and converted into reliable training signals. Most such pipelines optimize outcome correctness only, which yields sparse feedback for long, multi-step solutions and offers limited guidance on intermediate reasoning errors. Recent work therefore introduces process reward models (PRMs) to score intermediate steps and provide denser supervision. In practice, PRM scores are often imperfectly aligned with final correctness and can reward locally fluent reasoning that still ends in an incorrect answer. When optimized as absolute rewards, such signals can amplify fluent failure modes and induce reward hacking. We propose PROGRS, a framework that leverages PRMs while keeping outcome correctness dominant. PROGRS treats process rewards as relative preferences within outcome groups rather than absolute targets. We introduce outcome-conditioned centering, which shifts PRM scores of incorrect trajectories to have zero mean within each prompt group. It removes systematic bias while preserving informative rankings. PROGRS combines a frozen quantile-regression PRM with a multi-scale coherence evaluator. We integrate the resulting centered process bonus into Group Relative Policy Optimization (GRPO) without auxiliary objectives or additional trainable components. Across MATH-500, AMC, AIME, MinervaMath, and OlympiadBench, PROGRS consistently improves Pass@1 over outcome-only baselines and achieves stronger performance with fewer rollouts. These results show that outcome-conditioned centering enables safe and effective use of process rewards for mathematical reasoning.
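Outcome-conditioned centering itself is a few lines (a sketch in our notation; integration with GRPO advantages and the coherence evaluator is omitted): within each prompt group, PRM scores of the incorrect rollouts are shifted to zero mean, so the process bonus can only rank failures relative to each other rather than inject absolute reward.

```python
import torch

def center_process_rewards(prm_scores, correct):
    """Shift PRM scores of incorrect trajectories to zero mean within the group.

    prm_scores: (G,) process-reward scores for G rollouts of one prompt;
    correct:    (G,) bool tensor of verifiable final-answer correctness.
    """
    out = prm_scores.clone()
    wrong = ~correct
    if wrong.any():
        out[wrong] = out[wrong] - out[wrong].mean()   # zero mean among failures
    return out

# The centered bonus is then added to the (dominant) outcome reward before
# GRPO's group-relative advantage normalization.
```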
【22】Not All Denoising Steps Are Equal: Model Scheduling for Faster Masked Diffusion Language Models
Link: https://arxiv.org/abs/2604.02340
Authors: Ivan Sedykh, Nikita Sorokin, Valentin Malykh
Abstract: Recent advances in masked diffusion language models (MDLMs) narrow the quality gap to autoregressive LMs, but their sampling remains expensive because generation requires many full-sequence denoising passes with a large Transformer and, unlike autoregressive decoding, cannot benefit from KV caching. In this work, we exploit the flexibility of the diffusion framework and study model scheduling, where a smaller MDLM replaces the full model at a subset of denoising steps. On OpenWebText, we show that early and late denoising steps are substantially more robust to such replacement than middle steps, enabling up to a 17% reduction in FLOPs with only modest degradation in generative perplexity. We support these findings with a step-importance analysis based on loss and KL divergence between small and large models across timesteps, as well as an exhaustive search over coarse step segments, both of which identify the middle of the diffusion trajectory as most sensitive. Our results suggest that simple, architecture-agnostic scheduling rules can significantly accelerate MDLM sampling while largely preserving generation quality as measured by generative perplexity.
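The scheduling rule the results support is tiny and architecture-agnostic (a sketch; the step fractions below are illustrative assumptions motivated by the early/late finding, not the paper's tuned values, and `denoise_step` is an assumed MDLM interface):

```python
def scheduled_denoise(x_t, timesteps, big_model, small_model,
                      early_frac=0.2, late_frac=0.2):
    """Run the small MDLM on early/late denoising steps, the big one in the middle."""
    n = len(timesteps)
    for i, t in enumerate(timesteps):               # from most to least masked
        frac = i / max(n - 1, 1)
        use_small = frac < early_frac or frac > 1 - late_frac
        model = small_model if use_small else big_model
        x_t = model.denoise_step(x_t, t)            # one full-sequence denoising pass
    return x_t
```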
Graph-related (graph learning | graph neural networks | graph optimization, etc.) (5 papers)
【1】DSBD: Dual-Aligned Structural Basis Distillation for Graph Domain Adaptation
Link: https://arxiv.org/abs/2604.03154
Authors: Yingxu Wang, Kunyu Zhang, Jiaxin Huang, Mengzhu Wang, Mingyan Xiao, Siyang Gao, Nan Yin
Abstract: Graph domain adaptation (GDA) aims to transfer knowledge from a labeled source graph to an unlabeled target graph under distribution shifts. However, existing methods are largely feature-centric and overlook structural discrepancies, which become particularly detrimental under significant topology shifts. Such discrepancies alter both geometric relationships and spectral properties, leading to unreliable transfer of graph neural networks (GNNs). To address this limitation, we propose Dual-Aligned Structural Basis Distillation (DSBD) for GDA, a novel framework that explicitly models and adapts cross-domain structural variation. DSBD constructs a differentiable structural basis by synthesizing continuous probabilistic prototype graphs, enabling gradient-based optimization over graph topology. The basis is learned under source-domain supervision to preserve semantic discriminability, while being explicitly aligned to the target domain through a dual-alignment objective. Specifically, geometric consistency is enforced via permutation-invariant topological moment matching, and spectral consistency is achieved through Dirichlet energy calibration, jointly capturing structural characteristics across domains. Furthermore, we introduce a decoupled inference paradigm that mitigates source-specific structural bias by training a new GNN on the distilled structural basis. Extensive experiments on graph and image benchmarks demonstrate that DSBD consistently outperforms state-of-the-art methods.
【2】Extracting Money Laundering Transactions from Quasi-Temporal Graph Representation
Link: https://arxiv.org/abs/2604.02899
Authors: Haseeb Tariq, Marwan Hassani
Abstract: Money laundering presents a persistent challenge for financial institutions worldwide, while criminal organizations constantly evolve their tactics to bypass detection systems. Traditional anti-money laundering approaches mainly rely on predefined risk-based rules, leading to resource-intensive investigations and high numbers of false positive alerts. In order to restrict operational costs from exploding, while billions of transactions are being processed every day, financial institutions are investing in more sophisticated mechanisms to improve existing systems. In this paper, we present ExSTraQt (EXtract Suspicious TRAnsactions from Quasi-Temporal graph representation), an advanced supervised learning approach to detect money laundering (or suspicious) transactions in financial datasets. Our proposed framework excels in performance, when compared to the state-of-the-art AML (Anti Money Laundering) detection models. The key strengths of our framework are sheer simplicity, in terms of design and number of parameters; and scalability, in terms of the computing and memory requirements. We evaluated our framework on transaction-level detection accuracy using a real dataset; and a set of synthetic financial transaction datasets. We consistently achieve an uplift in the F1 score for most datasets, up to 1% for the real dataset; and more than 8% for one of the synthetic datasets. We also claim that our framework could seamlessly complement existing AML detection systems in banks. Our code and datasets are available at https://github.com/mhaseebtariq/exstraqt.
【3】Analytic Drift Resister for Non-Exemplar Continual Graph Learning
Link: https://arxiv.org/abs/2604.02633
Authors: Lei Song, Shihan Guan, Youyong Kong
Abstract: Non-Exemplar Continual Graph Learning (NECGL) seeks to eliminate the privacy risks intrinsic to rehearsal-based paradigms by retaining solely class-level prototype representations rather than raw graph examples for mitigating catastrophic forgetting. However, this design choice inevitably precipitates feature drift. As a nascent alternative, Analytic Continual Learning (ACL) capitalizes on the intrinsic generalization properties of frozen pre-trained models to bolster continual learning performance. Nonetheless, a key drawback resides in the pronounced attenuation of model plasticity. To surmount these challenges, we propose Analytic Drift Resister (ADR), a novel and theoretically grounded NECGL framework. ADR exploits iterative backpropagation to break free from the frozen pre-trained constraint, adapting to evolving task graph distributions and fortifying model plasticity. Since parameter updates trigger feature drift, we further propose Hierarchical Analytic Merging (HAM), performing layer-wise merging of linear transformations in Graph Neural Networks (GNNs) via ridge regression, thereby ensuring absolute resistance to feature drift. On this basis, Analytic Classifier Reconstruction (ACR) enables theoretically zero-forgetting class-incremental learning. Empirical evaluation on four node classification benchmarks demonstrates that ADR maintains strong competitiveness against existing state-of-the-art methods.
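The ridge-regression machinery behind analytic merging is the classical closed form (a sketch of that building block in our notation; how HAM applies it layer-wise across tasks is the paper's contribution and is not reproduced here):

```python
import torch

def ridge_fit(X, Y, lam=1e-2):
    """Closed-form ridge regression W = (X^T X + lam I)^{-1} X^T Y.

    X: (n, d) features; Y: (n, c) targets. Because X^T X and X^T Y are
    additive over data, the solution can be updated analytically when a new
    task arrives, without storing the old graphs themselves.
    """
    d = X.shape[1]
    A = X.T @ X + lam * torch.eye(d, dtype=X.dtype)
    return torch.linalg.solve(A, X.T @ Y)
```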
【4】Guideline2Graph: Profile-Aware Multimodal Parsing for Executable Clinical Decision Graphs
Link: https://arxiv.org/abs/2604.02477
Authors: Onur Selim Kilic, Yeti Z. Gurbuz, Cem O. Yaldiz, Afra Nawar, Etrit Haxholli, Ogul Can, Eli Waxman
Abstract: Clinical practice guidelines are long, multimodal documents whose branching recommendations are difficult to convert into executable clinical decision support (CDS), and one-shot parsing often breaks cross-page continuity. Recent LLM/VLM extractors are mostly local or text-centric, under-specifying section interfaces and failing to consolidate cross-page control flow across full documents into one coherent decision graph. We present a decomposition-first pipeline that converts full-guideline evidence into an executable clinical decision graph through topology-aware chunking, interface-constrained chunk graph generation, and provenance-preserving global aggregation. Rather than relying on single-pass generation, the pipeline uses explicit entry/terminal interfaces and semantic deduplication to preserve cross-page continuity while keeping the induced control flow auditable and structurally consistent. We evaluate on an adjudicated prostate-guideline benchmark with matched inputs and the same underlying VLM backbone across compared methods. On the complete merged graph, our approach improves edge and triplet precision/recall from $19.6\%/16.1\%$ in existing models to $69.0\%/87.5\%$, while node recall rises from $78.1\%$ to $93.8\%$. These results support decomposition-first, auditable guideline-to-CDS conversion on this benchmark, while current evidence remains limited to one adjudicated prostate guideline and motivates broader multi-guideline validation.
【5】Homophily-aware Supervised Contrastive Counterfactual Augmented Fair Graph Neural Network
Link: https://arxiv.org/abs/2604.02342
Authors: Mahdi Tavassoli Kejani, Fadi Dornaika, Charlotte Laclau, Jean-Michel Loubes
Comments: This paper has been accepted for publication at the IEEE Conference on Secure and Trustworthy Machine Learning, 2026
Abstract: In recent years, Graph Neural Networks (GNNs) have achieved remarkable success in tasks such as node classification, link prediction, and graph representation learning. However, they remain susceptible to biases that can arise not only from node attributes but also from the graph structure itself. Addressing fairness in GNNs has therefore emerged as a critical research challenge. In this work, we propose a novel model for training fairness-aware GNNs by improving the counterfactual augmented fair graph neural network framework (CAF). Specifically, our approach introduces a two-phase training strategy: in the first phase, we edit the graph to increase homophily ratio with respect to class labels while reducing homophily ratio with respect to sensitive attribute labels; in the second phase, we integrate a modified supervised contrastive loss and environmental loss into the optimization process, enabling the model to jointly improve predictive performance and fairness. Experiments on five real-world datasets demonstrate that our model outperforms CAF and several state-of-the-art graph-based learning methods in both classification accuracy and fairness metrics.
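The quantity driving the first training phase is the edge homophily ratio, computed identically for class labels and for the sensitive attribute (a standard definition, sketched here; the graph-editing policy itself is the paper's):

```python
import torch

def edge_homophily(edge_index, labels):
    """Fraction of edges whose endpoints share the same label.

    edge_index: (2, E) long tensor of edges; labels: (N,) node labels
    (either class labels or sensitive-attribute labels).
    """
    src, dst = edge_index
    return (labels[src] == labels[dst]).float().mean().item()

# Phase 1 edits the graph so edge_homophily(E, class_y) goes up
# while edge_homophily(E, sensitive_s) goes down.
```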
Transformer (1 paper)
【1】FTimeXer: Frequency-aware Time-series Transformer with Exogenous variables for Robust Carbon Footprint Forecasting
Link: https://arxiv.org/abs/2604.02347
Authors: Qingzhong Li, Yue Hu, Zhou Long, Qingchang Ma, Hui Ma, Jinhai Sa
Comments: Accepted by The 5th International Conference on Electronics Technology and Artificial Intelligence (ETAI 2026)
摘要:准确和最新的电网碳足迹预测对于有效的产品碳足迹(PCF)核算和明智的脱碳决策至关重要。然而,网格的碳强度表现出高度的非平稳性,现有的方法往往难以有效地利用周期性和振荡模式。此外,这些方法往往表现不佳时,面对不规则的外部输入,如缺失数据或不对齐。为了解决这些挑战,我们提出了FTimeXer,频率感知的时间序列Transformer设计了一个强大的训练方案,可容纳外源因素。FTimeXer具有快速傅里叶变换(FFT)驱动的频率分支,结合门控时频融合,使其能够有效地捕获多尺度周期性。它还采用随机外源掩蔽与一致性正则化相结合,这有助于减少虚假相关性并增强稳定性。在三个真实世界的数据集上进行的实验表明,在强基线上有一致的改进。因此,这些改进导致对电网碳因子的更可靠预测,这对于有效的PCF核算和有关脱碳的明智决策至关重要。
摘要:Accurate and up-to-date forecasting of the power grid's carbon footprint is crucial for effective product carbon footprint (PCF) accounting and informed decarbonization decisions. However, the carbon intensity of the grid exhibits high non-stationarity, and existing methods often struggle to effectively leverage periodic and oscillatory patterns. Furthermore, these methods tend to perform poorly when confronted with irregular exogenous inputs, such as missing data or misalignment. To tackle these challenges, we propose FTimeXer, a frequency-aware time-series Transformer designed with a robust training scheme that accommodates exogenous factors. FTimeXer features a Fast Fourier Transform (FFT)-driven frequency branch combined with gated time-frequency fusion, allowing it to capture multi-scale periodicity effectively. It also employs stochastic exogenous masking in conjunction with consistency regularization, which helps reduce spurious correlations and enhance stability. Experiments conducted on three real-world datasets show consistent improvements over strong baselines. As a result, these enhancements lead to more reliable forecasts of grid carbon factors, which are essential for effective PCF accounting and informed decision-making regarding decarbonization.
GAN|对抗|攻击|生成相关(10篇)
【1】A Tsetlin Machine-driven Intrusion Detection System for Next-Generation IoMT Security
标题:用于下一代IoMT安全的Tsetlin机器驱动入侵检测系统
链接:https://arxiv.org/abs/2604.03205
作者:Rahul Jaiswal,Per-Arne Andersen,Linga Reddy Cenkeramaddi,Lei Jiao,Ole-Christoffer Granmo
备注:8 pages, 15 figures, 9 tables. Accepted at the 7th Silicon Valley Cybersecurity Conference (SVCC 2026), California, USA
摘要:医疗物联网(IoMT)的快速采用正在通过实现医疗设备、系统和服务之间的无缝连接来改变医疗保健。然而,它也带来了严重的网络安全和患者安全问题,因为攻击者越来越多地利用新方法和新出现的漏洞来渗透IoMT网络。本文提出了一种新的基于Tsetlin Machine(TM)的入侵检测系统(IDS),用于检测针对IoMT网络的各种网络攻击。TM是一种基于规则和可解释的机器学习(ML)方法,使用命题逻辑对攻击模式进行建模。在CICIoMT-2024数据集上进行的大量实验,其中包括多种IoMT协议和网络攻击类型,证明了所提出的基于TM的IDS优于传统的ML分类器。该模型在二分类中的准确率达到99.5%,在多分类中的准确率达到90.7%,超过了现有的最先进的方法。此外,为了增强模型的信任度和可解释性,所提出的基于TM的模型提供了类的投票分数和子句激活热图,从而提供了对最有影响力的子句和对最终模型决策有贡献的主导类的清晰见解。
摘要:The rapid adoption of the Internet of Medical Things (IoMT) is transforming healthcare by enabling seamless connectivity among medical devices, systems, and services. However, it also introduces serious cybersecurity and patient safety concerns as attackers increasingly exploit new methods and emerging vulnerabilities to infiltrate IoMT networks. This paper proposes a novel Tsetlin Machine (TM)-based Intrusion Detection System (IDS) for detecting a wide range of cyberattacks targeting IoMT networks. The TM is a rule-based and interpretable machine learning (ML) approach that models attack patterns using propositional logic. Extensive experiments conducted on the CICIoMT-2024 dataset, which includes multiple IoMT protocols and cyberattack types, demonstrate that the proposed TM-based IDS outperforms traditional ML classifiers. The proposed model achieves an accuracy of 99.5\% in binary classification and 90.7\% in multi-class classification, surpassing existing state-of-the-art approaches. Moreover, to enhance model trust and interpretability, the proposed TM-based model presents class-wise vote scores and clause activation heatmaps, providing clear insights into the most influential clauses and the dominant class contributing to the final model decision.
【2】Generating DDPM-based Samples from Tilted Distributions
标题:从倾斜分布生成基于DDPM的样本
链接:https://arxiv.org/abs/2604.03015
作者:Himadri Mandal,Dhruman Gupta,Rushil Gupta,Sarvesh Ravichandran Iyer,Agniv Bandyopadhyay,Achal Bassamboo,Varun Gupta,Sandeep Juneja
备注:33 pages, 4 figures
摘要:给定来自$d$维概率分布的$n$个独立样本,我们的目标是从对原始分布进行倾斜所得到的分布中生成基于扩散的样本,其中倾斜程度由$\theta\in\mathbb{R}^d$参数化。我们定义了一个插件估计量,并证明它是极小极大最优的。我们给出了插件估计量的分布与真实分布之间关于$n$和$\theta$的Wasserstein界,刻画了输出分布与期望的真实分布接近的参数区间。此外,在一些假设下,我们证明了在这些倾斜样本上运行扩散模型的全变差(TV)精度。我们的理论结果得到了大量模拟的支持。我们工作的应用包括金融、天气与气候建模等诸多领域,其目标可能是从满足具有实际动机的矩约束的倾斜分布中生成样本。
摘要:Given $n$ independent samples from a $d$-dimensional probability distribution, our aim is to generate diffusion-based samples from a distribution obtained by tilting the original, where the degree of tilt is parametrized by $\theta\in \mathbb{R}^d$. We define a plug-in estimator and show that it is minimax-optimal. We develop Wasserstein bounds between the distribution of the plug-in estimator and the true distribution as a function of $n$ and $\theta$, illustrating regimes where the output and the desired true distribution are close. Further, under some assumptions, we prove the TV-accuracy of running Diffusion on these tilted samples. Our theoretical results are supported by extensive simulations. Applications of our work include finance, weather and climate modelling, and many other domains, where the aim may be to generate samples from a tilted distribution that satisfies practically motivated moment constraints.
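下面给出一个示意性的最小草图(假设性实现,并非论文代码),用自归一化重要性权重说明"指数倾斜 + 基于样本的插件估计"这一构造;函数名 tilted_resample 及重采样方式均为演示用假设:
```python
# 倾斜分布满足 p_theta(x) ∝ p(x) * exp(<theta, x>);
# 在 n 个独立样本上用自归一化重要性权重近似,再按权重重采样。
import numpy as np

def tilted_resample(samples: np.ndarray, theta: np.ndarray, m: int, seed: int = 0) -> np.ndarray:
    """samples: (n, d) 独立样本; theta: (d,) 倾斜参数; 返回 m 个近似来自倾斜分布的样本。"""
    rng = np.random.default_rng(seed)
    logits = samples @ theta          # <theta, x_i>
    logits -= logits.max()            # 减去最大值以保证数值稳定
    w = np.exp(logits)
    w /= w.sum()                      # 自归一化权重
    idx = rng.choice(len(samples), size=m, replace=True, p=w)
    return samples[idx]
```
例如,对标准正态样本取 theta=(1.0, -0.5) 重采样,得到的样本均值会向该方向偏移;论文讨论的正是在此类倾斜样本上运行扩散生成的精度保证。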
【3】Beyond Semantic Manipulation: Token-Space Attacks on Reward Models
标题:超越语义操纵:对奖励模型的令牌空间攻击
链接:https://arxiv.org/abs/2604.02686
作者:Yuheng Zhang,Mingyue Huo,Minghao Zhu,Mengxue Zhang,Nan Jiang
摘要:奖励模型(RM)被广泛用作基于人类反馈的强化学习(RLHF)的优化目标,但它们仍然容易受到奖励黑客攻击。现有攻击主要在语义空间内进行,利用RM的偏差构建人类可读的对抗性输出。在这项工作中,我们引入了一个根本不同的范式:令牌映射扰动攻击(TOMPA),一个直接在令牌空间中执行对抗优化的框架。通过绕过策略与奖励模型之间标准的解码-重分词接口,TOMPA使攻击策略能够在原始令牌序列而非连贯的自然语言上进行优化。仅使用黑盒标量反馈,TOMPA就能自动发现在多个最先进RM上获得极高奖励的非语言令牌模式。具体来说,当针对Skywork-Reward-V2-Llama-3.1-8B时,TOMPA获得的奖励几乎是GPT-5参考答案的两倍,并且在98.0%的提示上优于它们。尽管分数很高,生成的输出却退化为无意义的文本,表明RM可以在语义机制之外被系统性地利用,并暴露了当前RLHF管道中的一个关键漏洞。
摘要:Reward models (RMs) are widely used as optimization targets in reinforcement learning from human feedback (RLHF), yet they remain vulnerable to reward hacking. Existing attacks mainly operate within the semantic space, constructing human-readable adversarial outputs that exploit RM biases. In this work, we introduce a fundamentally different paradigm: Token Mapping Perturbation Attack (TOMPA), a framework that performs adversarial optimization directly in token space. By bypassing the standard decode-re-tokenize interface between the policy and the reward model, TOMPA enables the attack policy to optimize over raw token sequences rather than coherent natural language. Using only black-box scalar feedback, TOMPA automatically discovers non-linguistic token patterns that elicit extremely high rewards across multiple state-of-the-art RMs. Specifically, when targeting Skywork-Reward-V2-Llama-3.1-8B, TOMPA nearly doubles the reward of GPT-5 reference answers and outperforms them on 98.0% of prompts. Despite these high scores, the generated outputs degenerate into nonsensical text, revealing that RMs can be systematically exploited beyond the semantic regime and exposing a critical vulnerability in current RLHF pipelines.
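为直观说明"绕过解码-重分词接口、仅凭黑盒标量奖励在令牌空间做对抗优化"的思路,下面给出一个随机爬山式的示意草图(假设性实现,并非 TOMPA 算法本身;reward_fn 等接口为演示假设):
```python
# 在原始 token 序列上做单点变异爬山,只依赖黑盒标量奖励。
import random

def token_space_hill_climb(reward_fn, vocab_size: int, seq_len: int = 32,
                           steps: int = 1000, seed: int = 0):
    """reward_fn: List[int] -> float 的黑盒奖励函数; 返回 (最优序列, 最优奖励)。"""
    rng = random.Random(seed)
    best = [rng.randrange(vocab_size) for _ in range(seq_len)]
    best_r = reward_fn(best)
    for _ in range(steps):
        cand = list(best)
        cand[rng.randrange(seq_len)] = rng.randrange(vocab_size)  # 随机单点变异
        r = reward_fn(cand)
        if r > best_r:               # 贪心接受改进
            best, best_r = cand, r
    return best, best_r
```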
【4】Cross-subject Muscle Fatigue Detection via Adversarial and Supervised Contrastive Learning with Inception-Attention Network
标题:通过Inception-注意力网络的对抗与监督对比学习进行跨被试肌肉疲劳检测
链接:https://arxiv.org/abs/2604.02670
作者:Zitao Lin,Chang Zhu,Wei Meng
备注:This work has been submitted to ICARM 2026 for possible publication. 6 pages, 7 figures, 5 tables
摘要:肌肉疲劳检测在身体康复中起着重要作用。先前的研究表明,与其他生物信号相比,sEMG在检测肌肉疲劳方面具有更高的灵敏度。然而,从表面肌电信号中提取的特征可能会在动态收缩期间以及不同被试之间发生变化,从而导致疲劳检测的不稳定。为了应对这些挑战,本研究提出了一种新的神经网络,包括一个作为特征提取器的Inception-注意力模块、一个疲劳分类器,以及一个配备梯度反转层的域分类器。集成的域分类器鼓励网络学习被试不变的共同疲劳特征,同时最小化被试特定的特征。此外,还采用了监督对比损失函数来提高模型的泛化能力。实验结果表明,该模型在三分类任务中取得了出色的性能,达到93.54%的准确率、92.69%的召回率和92.69%的F1分数,为跨被试肌肉疲劳检测提供了一种鲁棒的解决方案,对康复训练和辅助具有重要的指导意义。
摘要:Muscle fatigue detection plays an important role in physical rehabilitation. Previous research has demonstrated that sEMG offers superior sensitivity in detecting muscle fatigue compared to other biological signals. However, features extracted from sEMG may vary during dynamic contractions and across different subjects, which causes instability in fatigue detection. To address these challenges, this research proposes a novel neural network comprising an Inception-attention module as a feature extractor, a fatigue classifier and a domain classifier equipped with a gradient reversal layer. The integrated domain classifier encourages the network to learn subject-invariant common fatigue features while minimizing subject-specific features. Furthermore, a supervised contrastive loss function is also employed to enhance the generalization capability of the model. Experimental results demonstrate that the proposed model achieved outstanding performance in three-class classification tasks, reaching 93.54% accuracy, 92.69% recall and 92.69% F1-score, providing a robust solution for cross-subject muscle fatigue detection, offering significant guidance for rehabilitation training and assistance.
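摘要中"配备梯度反转层的域分类器"可以用如下标准 PyTorch 模式实现;这是一个通用示意草图(grad_reverse 等命名为演示假设),并非论文的完整网络:
```python
# 梯度反转层(GRL):前向为恒等映射,反向将梯度乘以 -lambda,
# 使特征提取器朝"混淆域分类器"的方向更新,从而学习被试不变特征。
import torch

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lambd: float):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

def grad_reverse(x: torch.Tensor, lambd: float = 1.0) -> torch.Tensor:
    return GradReverse.apply(x, lambd)

# 用法示意:domain_logits = domain_classifier(grad_reverse(features, lambd=0.5))
```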
【5】VoxelCodeBench: Benchmarking 3D World Modeling Through Code Generation
标题:VoxelCodeBench:通过代码生成对3D世界建模进行基准测试
链接:https://arxiv.org/abs/2604.02580
作者:Yan Zheng,Florian Bordes
摘要:评估用于3D空间推理的代码生成模型需要在现实环境中执行生成的代码,并在表面层面的正确性之外评估输出。我们引入了VoxelCode平台,用于分析面向3D理解和环境创建的代码生成能力。我们的平台集成了自然语言任务规范、虚幻引擎中API驱动的代码执行,以及同时支持自动化指标和人工评估的统一评估管道。为了展示其效用,我们构建了VoxelCodeBench,一个涵盖三个推理维度(符号解释、几何构造和艺术构图)的体素操作任务基准。在评估领先的代码生成模型时,我们发现生成可执行代码远比生成空间上正确的输出容易,其中几何构造和多对象组合尤其具有挑战性。通过开源我们的平台和基准,我们为社区提供了可扩展的基础设施,用于开发新的3D代码生成基准并探究未来模型中的空间推理。
摘要:Evaluating code generation models for 3D spatial reasoning requires executing generated code in realistic environments and assessing outputs beyond surface-level correctness. We introduce VoxelCode, a platform for analyzing code generation capabilities for 3D understanding and environment creation. Our platform integrates natural language task specification, API-driven code execution in Unreal Engine, and a unified evaluation pipeline supporting both automated metrics and human assessment. To demonstrate its utility, we construct VoxelCodeBench, a benchmark of voxel manipulation tasks spanning three reasoning dimensions: symbolic interpretation, geometric construction, and artistic composition. Evaluating leading code generation models, we find that producing executable code is far easier than producing spatially correct outputs, with geometric construction and multi-object composition proving particularly challenging. By open-sourcing our platform and benchmark, we provide the community with extensible infrastructure for developing new 3D code generation benchmarks and probing spatial reasoning in future models.
【6】SEDGE: Structural Extrapolated Data Generation
标题:SEDGE:结构外推数据生成
链接:https://arxiv.org/abs/2604.02482
作者:Kun Zhang,Jiaqi Sun,Yiqing Li,Ignavier Ng,Namrata Deka,Shaoan Xie
摘要:本文基于对底层数据生成过程的适当假设,提出了结构外推数据生成(SEDGE)框架。我们给出了能够可靠生成满足新规格数据的条件,并在某些"保守"假设下给出了此类数据分布的近似可识别性。在算法方面,我们分别基于结构知情的优化策略和扩散后验采样,开发了实现外推数据生成的实用方法。我们在合成数据上验证了外推性能,并以外推图像生成作为真实场景,说明了所提出框架的有效性。
摘要:This paper proposes a framework for Structural Extrapolated Data GEneration (SEDGE) based on suitable assumptions on the underlying data generating process. We provide conditions under which data satisfying new specifications can be generated reliably, together with the approximate identifiability of the distribution of such data under certain ``conservative'' assumptions. On the algorithmic side, we develop practical methods to achieve extrapolated data generation, based on the structure-informed optimization strategy or diffusion posterior sampling, respectively. We verify the extrapolation performance on synthetic data and also consider extrapolated image generation as a real-world scenario to illustrate the validity of the proposed framework.
【7】PlayGen-MoG: Framework for Diverse Multi-Agent Play Generation via Mixture-of-Gaussians Trajectory Prediction
标题:PlayGen-MoG:通过混合高斯轨迹预测实现多样化多智能体战术生成的框架
链接:https://arxiv.org/abs/2604.02447
作者:Kevin Song
备注:9 pages, 4 figures, 2 tables. Accepted to CVPRW 2026
摘要:团队运动中的多智能体轨迹生成需要既能捕捉可能战术的多样性、又能刻画球员之间真实空间协调的模型。标准生成方法,如条件变分自动编码器(CVAE)和扩散模型,难以完成这项任务,会表现出后验崩溃或收敛到数据集均值。此外,大多数轨迹预测方法工作在需要多帧观测历史的预测机制下,限制了它们在只有初始阵型可用的战术设计场景中的应用。我们提出了PlayGen-MoG,一个可扩展的阵型条件战术生成框架,通过三个设计选择来解决这些挑战:1/具有跨所有智能体共享混合权重的高斯混合(MoG)输出头,其中单一权重集合选择耦合所有球员轨迹的战术场景;2/相对空间注意力,将成对球员位置和距离编码为学习到的注意力偏置;3/从初始阵型出发对绝对位移的非自回归预测,消除累积误差漂移并去除对观测轨迹历史的依赖,使得仅凭单个静态阵型即可生成真实的战术。在美式足球跟踪数据上,PlayGen-MoG实现了1.68码ADE和3.98码FDE,同时保持全部8个混合分量的充分利用,熵为2.06(最大值2.08),并定性地证实了无模式崩溃的多样化生成。
摘要:Multi-agent trajectory generation in team sports requires models that capture both the diversity of possible plays and realistic spatial coordination between players on plays. Standard generative approaches such as Conditional Variational Autoencoders (CVAE) and diffusion models struggle with this task, exhibiting posterior collapse or convergence to the dataset mean. Moreover, most trajectory prediction methods operate in a forecasting regime that requires multiple frames of observed history, limiting their use for play design where only the initial formation is available. We present PlayGen-MoG, an extensible framework for formation-conditioned play generation that addresses these challenges through three design choices: 1/ a Mixture-of-Gaussians (MoG) output head with shared mixture weights across all agents, where a single set of weights selects a play scenario that couples all players' trajectories, 2/ relative spatial attention that encodes pairwise player positions and distances as learned attention biases, and 3/ non-autoregressive prediction of absolute displacements from the initial formation, eliminating cumulative error drift and removing the dependence on observed trajectory history, enabling realistic play generation from a single static formation alone. On American football tracking data, PlayGen-MoG achieves 1.68 yard ADE and 3.98 yard FDE while maintaining full utilization of all 8 mixture components with entropy of 2.06 out of 2.08, and qualitatively confirming diverse generation without mode collapse.
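下面是"所有智能体共享混合权重的 MoG 输出头"一种可能形态的最小示意(假设性实现,类名、维度与训练细节均为演示假设,并非论文代码):
```python
# 单一 softmax 权重在 K 个分量(战术场景)之间选择;
# 每个分量给出所有智能体的高斯参数,因而一组权重耦合全部球员的轨迹。
import torch
import torch.nn as nn

class SharedWeightMoGHead(nn.Module):
    def __init__(self, hidden: int, n_agents: int, n_comp: int, out_dim: int = 2):
        super().__init__()
        self.n_agents, self.n_comp, self.out_dim = n_agents, n_comp, out_dim
        self.pi = nn.Linear(hidden, n_comp)                        # 共享混合权重
        self.mu = nn.Linear(hidden, n_agents * n_comp * out_dim)   # 每智能体、每分量的均值
        self.log_sigma = nn.Linear(hidden, n_agents * n_comp * out_dim)

    def nll(self, h: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        """h: (B, hidden) 全局特征; target: (B, n_agents, out_dim) 真实位移。"""
        B = h.shape[0]
        log_pi = torch.log_softmax(self.pi(h), dim=-1)             # (B, K)
        mu = self.mu(h).view(B, self.n_agents, self.n_comp, self.out_dim)
        sigma = self.log_sigma(h).view(B, self.n_agents, self.n_comp, self.out_dim).exp()
        dist = torch.distributions.Normal(mu, sigma)
        # 对每个分量把所有智能体/维度的对数似然求和 -> (B, K)
        comp_ll = dist.log_prob(target.unsqueeze(2)).sum(dim=(1, 3))
        return -torch.logsumexp(log_pi + comp_ll, dim=-1).mean()
```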
【8】Backdoor Attacks on Decentralised Post-Training
标题:对分散化后训练的后门攻击
链接:https://arxiv.org/abs/2604.02372
作者:Oğuzhan Ersoy,Nikolay Blagoev,Jona te Lintelo,Stefanos Koffas,Marina Krček,Stjepan Picek
备注:Accepted to ICLR 2026 Workshop 'Principled Design for Trustworthy AI - Interpretability, Robustness, and Safety across Modalities'
摘要:大型语言模型的去中心化后训练利用数据并行和流水线并行技术来切分数据和模型。不幸的是,去中心化后训练可能容易受到一个或多个恶意参与者的中毒和后门攻击。针对去中心化数据并行或联邦学习的攻击与防御已有若干工作;然而,关于流水线并行鲁棒性的现有工作仅限于中毒攻击。据我们所知,本文提出了第一个针对流水线并行的后门攻击,旨在使训练出的模型失准。在我们的设置中,攻击者控制的是流水线的某个中间阶段,而不是整个模型或数据集,使得数据中毒等现有攻击不适用。实验结果表明,即使是这样一个受限的对手,也可以在后训练期间注入后门并导致模型失准,而与所学习的领域或数据集无关。在我们的攻击中,包含触发词会将对齐百分比从$80\%$降低到$6\%$。我们通过对最终模型应用安全对齐训练进一步测试了攻击的鲁棒性,并证明我们的后门攻击在$60\%$的情况下仍然成功。
摘要:Decentralised post-training of large language models utilises data and pipeline parallelism techniques to split the data and the model. Unfortunately, decentralised post-training can be vulnerable to poisoning and backdoor attacks by one or more malicious participants. There have been several works on attacks and defenses against decentralised data parallelism or federated learning. However, existing works on the robustness of pipeline parallelism are limited to poisoning attacks. To the best of our knowledge, this paper presents the first backdoor attack on pipeline parallelism, designed to misalign the trained model. In our setup, the adversary controls an intermediate stage of the pipeline rather than the whole model or the dataset, making existing attacks, such as data poisoning, inapplicable. Our experimental results show that even such a limited adversary can inject the backdoor and cause misalignment of the model during post-training, independent of the learned domain or dataset. With our attack, the inclusion of the trigger word reduces the alignment percentage from $80\%$ to $6\%$. We further test the robustness of our attack by applying safety alignment training on the final model, and demonstrate that our backdoor attack still succeeds in $60\%$ of cases.
【9】From Broad Exploration to Stable Synthesis: Entropy-Guided Optimization for Autoregressive Image Generation
标题:从广泛探索到稳定合成:自回归图像生成的熵引导优化
链接:https://arxiv.org/abs/2604.02355
作者:Han Song,Yucheng Zhou,Jianbing Shen,Yu Cheng
摘要:将思想链(CoT)与强化学习(RL)相结合可以改进文本到图像(T2 I)的生成,但CoT的探索和RL的优化之间的潜在相互作用仍不清楚。我们提出了一个系统的基于熵的分析,产生了三个关键的见解:(1)CoT扩展了生成探索空间,而RL将其收缩到高回报区域;(2)最终奖励与图像令牌熵的均值和方差都呈强负相关,突出了减少不确定性和不稳定性的必要性;以及(3)文本CoT的熵直接控制下游图像质量,其中较低熵CoT导致更好的生成。受这些发现的启发,我们提出了熵引导的组相对策略优化(EG-GRPO),这是一种微调策略,通过不确定性重新分配优化预算:低熵令牌被排除在奖励驱动的更新之外以保持稳定性,而高熵令牌则获得熵奖励,鼓励结构化探索而不会崩溃。在标准T2 I基准上的实验表明,EG-GRPO达到了最先进的性能。
摘要:Combining Chain-of-Thought (CoT) with Reinforcement Learning (RL) improves text-to-image (T2I) generation, yet the underlying interaction between CoT's exploration and RL's optimization remains unclear. We present a systematic entropy-based analysis that yields three key insights: (1) CoT expands the generative exploration space, while RL contracts it toward high-reward regions; (2) final reward is strongly negatively correlated with both the mean and variance of image-token entropy, highlighting the need to reduce uncertainty and instability; and (3) the entropy of the textual CoT directly governs downstream image quality, with lower-entropy CoTs leading to better generations. Motivated by these findings, we propose Entropy-Guided Group Relative Policy Optimization (EG-GRPO), a fine-tuning strategy that reallocates optimization budget by uncertainty: low-entropy tokens are excluded from reward-driven updates to preserve stability, while high-entropy tokens receive an entropy bonus that encourages structured exploration without collapse. Experiments on standard T2I benchmarks demonstrate that EG-GRPO achieves state-of-the-art performance.
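按摘要所述"按词元熵重新分配优化预算"的思想,可以写出如下示意损失(假设性草图:分位数阈值与熵奖励系数均为演示用假设,并非论文超参数):
```python
# 低熵词元被排除在奖励驱动更新之外;高熵词元额外获得熵奖励。
import torch

def entropy_guided_pg_loss(logits, actions, advantages,
                           low_q: float = 0.2, high_q: float = 0.8, beta: float = 0.01):
    """logits: (B, T, V); actions: (B, T) 已采样词元; advantages: (B, T)。"""
    logp = torch.log_softmax(logits, dim=-1)
    ent = -(logp.exp() * logp).sum(-1)                     # (B, T) 每词元熵
    act_logp = logp.gather(-1, actions.unsqueeze(-1)).squeeze(-1)
    lo, hi = torch.quantile(ent, low_q), torch.quantile(ent, high_q)
    keep = (ent > lo).float()                              # 掩掉低熵词元的策略梯度
    bonus = (ent > hi).float() * ent                       # 高熵词元的熵奖励
    pg = -(keep * advantages * act_logp).mean()
    return pg - beta * bonus.mean()
```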
【10】Generating Counterfactual Patient Timelines from Real-World Data
标题:从现实世界数据生成反事实患者时间线
链接:https://arxiv.org/abs/2604.02337
作者:Yu Akagi,Tomohisa Seki,Toru Takiguchi,Hiromasa Ito,Yoshimasa Kawazoe,Kazuhiko Ohe
摘要:反事实模拟-探索替代临床场景下的假设结果-为个性化医疗和计算机试验等变革性应用带来了希望。然而,由于方法上的限制,这一工作仍然具有挑战性。在这里,我们展示了一个在来自超过30万名患者和4亿名患者时间轴条目的真实数据上训练的自回归生成模型可以生成临床上合理的反事实轨迹。作为验证任务,我们将该模型应用于2023年因COVID-19住院的患者,修改年龄、血清C反应蛋白(CRP)和血清肌酐,以模拟7天的结果。在年龄较大、CRP升高和血清肌酐升高的反事实模拟中观察到院内死亡率增加。Remdesivir处方在CRP值较高的模拟中增加,在肾功能受损的模拟中减少。这些反事实的轨迹再现了已知的临床模式。这些发现表明,以自我监督的方式在真实世界数据上训练的自回归生成模型可以为反事实临床模拟奠定基础。
摘要:Counterfactual simulation - exploring hypothetical consequences under alternative clinical scenarios - holds promise for transformative applications such as personalized medicine and in silico trials. However, it remains challenging due to methodological limitations. Here, we show that an autoregressive generative model trained on real-world data from over 300,000 patients and 400 million patient timeline entries can generate clinically plausible counterfactual trajectories. As a validation task, we applied the model to patients hospitalized with COVID-19 in 2023, modifying age, serum C-reactive protein (CRP), and serum creatinine to simulate 7-day outcomes. Increased in-hospital mortality was observed in counterfactual simulations with older age, elevated CRP, and elevated serum creatinine. Remdesivir prescriptions increased in simulations with higher CRP values and decreased in those with impaired kidney function. These counterfactual trajectories reproduced known clinical patterns. These findings suggest that autoregressive generative models trained on real-world data in a self-supervised manner can establish a foundation for counterfactual clinical simulation.
迁移|Zero/Few/One-Shot|自适应(6篇)
【1】Adaptive Semantic Communication for Wireless Image Transmission Leveraging Mixture-of-Experts Mechanism
标题:利用专家混合机制的无线图像传输自适应语义通信
链接:https://arxiv.org/abs/2604.02691
作者:Haowen Wan,Qianqian Yang
摘要:基于深度学习的语义通信在无线图像传输中取得了重大进展,但大多数现有方案依赖于固定模型,因此对不同的图像内容和动态信道条件缺乏鲁棒性。为了提高适应性,最近的研究提出了自适应语义通信策略,根据源内容或信道状态调整传输或模型行为。最近,基于MoE的语义通信已经成为一种稀疏且高效的自适应架构,尽管现有设计仍主要依赖单一条件驱动的路由。为了解决这一局限,我们提出了一种新的面向多输入多输出(MIMO)信道的多级端到端图像语义通信系统,构建在自适应MoE Swin Transformer块之上。具体来说,我们引入了一个动态专家门控机制,联合评估实时CSI和输入图像块的语义内容来计算自适应路由概率。通过仅基于该联合条件选择性地激活一个专门的专家子集,我们的方法打破了传统自适应方法的刚性耦合,克服了单一条件驱动路由的瓶颈。仿真结果表明,在保持传输效率的同时,重建质量较现有方法有显著改善。
摘要:Deep learning based semantic communication has achieved significant progress in wireless image transmission, but most existing schemes rely on fixed models and thus lack robustness to diverse image contents and dynamic channel conditions. To improve adaptability, recent studies have developed adaptive semantic communication strategies that adjust transmission or model behavior according to either source content or channel state. More recently, MoE-based semantic communication has emerged as a sparse and efficient adaptive architecture, although existing designs still mainly rely on single-driven routing. To address this limitation, we propose a novel multi-stage end-to-end image semantic communication system for multi-input multi-output (MIMO) channels, built upon an adaptive MoE Swin Transformer block. Specifically, we introduce a dynamic expert gating mechanism that jointly evaluates both real-time CSI and the semantic content of input image patches to compute adaptive routing probabilities. By selectively activating only a specialized subset of experts based on this joint condition, our approach breaks the rigid coupling of traditional adaptive methods and overcomes the bottlenecks of single-driven routing. Simulation results indicate a significant improvement in reconstruction quality over existing methods while maintaining the transmission efficiency.
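下面给出"联合 CSI 与图像块语义内容的专家门控"的一个最小示意(假设性实现,模块名与维度均为演示假设,并非论文结构):
```python
# 门控网络同时接收实时 CSI 向量与每个图像块的特征,
# 计算路由 logits 并只激活 top-k 个专家(稀疏 MoE)。
import torch
import torch.nn as nn

class DualConditionGate(nn.Module):
    def __init__(self, feat_dim: int, csi_dim: int, n_experts: int, k: int = 2):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(feat_dim + csi_dim, n_experts)

    def forward(self, patch_feat: torch.Tensor, csi: torch.Tensor):
        """patch_feat: (B, N, feat_dim); csi: (B, csi_dim) -> (路由概率, top-k 专家索引)。"""
        csi = csi.unsqueeze(1).expand(-1, patch_feat.size(1), -1)
        logits = self.gate(torch.cat([patch_feat, csi], dim=-1))   # (B, N, E)
        topv, topi = logits.topk(self.k, dim=-1)
        probs = torch.softmax(topv, dim=-1)                        # 仅在被选专家上归一化
        return probs, topi
```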
【2】Time-Warping Recurrent Neural Networks for Transfer Learning
标题:用于迁移学习的时间扭曲循环神经网络
链接:https://arxiv.org/abs/2604.02474
作者:Jonathon Hirschi
摘要:动力系统描述了物理系统如何随时间演化。在不同的环境条件下,物理过程可能演化得更快或更慢。我们将时间扭曲定义为对物理系统模型中时间的重标定。本文提出了一种基于时间扭曲的循环神经网络(RNN)迁移学习新方法。我们证明,对于一类称为时滞模型的线性一阶微分方程,LSTM可以以任意期望的精度逼近这些系统,并且模型可以在保持逼近精度的同时进行时间扭曲。随后,在预测燃料含水量(FMC,野火建模中的一个重要概念)的应用问题中评估了这种时间扭曲迁移学习方法。带LSTM循环层的RNN在特征时间尺度为10小时、训练数据量充足的燃料上进行预训练,然后通过迁移学习加以修改,为特征时间尺度为1小时、100小时和1000小时的燃料生成预测。时间扭曲方法与几种已知的迁移学习方法进行了对比评估:尽管其修改的参数只占其他方法所修改参数的一小部分,但其预测精度与既有方法相当。
摘要:Dynamical systems describe how a physical system evolves over time. Physical processes can evolve faster or slower in different environmental conditions. We define time-warping as rescaling the time in a model of a physical system. This thesis proposes a new method of transfer learning for Recurrent Neural Networks (RNNs) based on time-warping. We prove that for a class of linear, first-order differential equations known as time lag models, an LSTM can approximate these systems with any desired accuracy, and the model can be time-warped while maintaining the approximation accuracy. The Time-Warping method of transfer learning is then evaluated in an applied problem on predicting fuel moisture content (FMC), an important concept in wildfire modeling. An RNN with LSTM recurrent layers is pretrained on fuels with a characteristic time scale of 10 hours, where there are large quantities of data available for training. The RNN is then modified with transfer learning to generate predictions for fuels with characteristic time scales of 1 hour, 100 hours, and 1000 hours. The Time-Warping method is evaluated against several known methods of transfer learning. The Time-Warping method produces predictions with an accuracy level comparable to the established methods, despite modifying only a small fraction of the parameters that the other methods modify.
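若把"时间扭曲"理解为对时间轴的重标定,其最小示意如下(假设等间隔采样,函数名为演示假设;仅用于直观说明,并非论文的迁移流程):
```python
# y[k] ≈ x(k/scale):scale>1 使序列动态显得更慢,scale<1 更快(端点处取边界值)。
import numpy as np

def time_warp_series(x: np.ndarray, scale: float) -> np.ndarray:
    """x: (T,) 等间隔时间序列; 返回按因子 scale 重标定时间轴后的序列。"""
    t = np.arange(len(x), dtype=float)
    return np.interp(t / scale, t, x)

# 例:将在 10 小时特征尺度上预训练的模型迁移到 1 小时尺度的燃料时,
# 可按 scale = 10 对目标序列做相应拉伸,使其在模型眼中具有相似的时间尺度。
```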
【3】Koopman-Based Nonlinear Identification and Adaptive Control of a Turbofan Engine
标题:基于Koopman的涡轮风扇发动机非线性辨识与自适应控制
链接:https://arxiv.org/abs/2604.01730
作者:David Grasev
备注:21 pages, 23 figures
摘要:本文研究了基于Koopman算子的双轴涡扇发动机多变量控制方法。开发了一个基于物理的组件级模型,用于生成训练数据并验证控制器。提出了一种元启发式扩展动态模式分解,其成本函数被设计为能精确捕获线轴转速动态和发动机压比(EPR),从而能够构建适用于多个控制目标的单一Koopman模型。基于所辨识的时变Koopman模型,开发了两个控制器:带干扰观测器的自适应Koopman模型预测控制器(AKMPC),以及作为基准的Koopman反馈线性化控制器(K-FBLC)。在海平面和变化的飞行条件下,针对两种控制策略(即线轴转速与EPR的不同配置)对控制器进行了评估。结果表明,所提出的辨识方法能够准确预测线轴转速和EPR,使Koopman模型可以灵活地在不同控制方案之间复用。虽然两种控制策略在稳态条件下性能相当,但由于AKMPC能够补偿模型失配,在变化的飞行条件下其鲁棒性优于K-FBLC。此外,EPR控制策略改善了推力响应。该研究突出了基于Koopman的控制的适用性,并展示了基于AKMPC的框架在鲁棒涡扇发动机控制中的优势。
摘要:This paper investigates Koopman operator-based approaches for multivariable control of a two-spool turbofan engine. A physics-based component-level model is developed to generate training data and validate the controllers. A meta-heuristic extended dynamic mode decomposition is developed, with a cost function designed to accurately capture both spool-speed dynamics and the engine pressure ratio (EPR), enabling the construction of a single Koopman model suitable for multiple control objectives. Using the identified time-varying Koopman model, two controllers are developed: an adaptive Koopman-based model predictive controller (AKMPC) with a disturbance observer and a Koopman-based feedback linearization controller (K-FBLC), which serves as a benchmark. The controllers are evaluated for two control strategies, namely configurations of spool speeds and EPR, under both sea-level and varying flight conditions. The results demonstrate that the proposed identification approach enables accurate predictions of both spool speeds and EPR, allowing the Koopman model to be reused flexibly across different control formulations. While both control strategies achieve comparable performance in steady conditions, the AKMPC exhibits superior robustness compared with the K-FBLC under varying flight conditions due to its ability to compensate for model mismatch. Moreover, the EPR control strategy improves the thrust response. The study highlights the applicability of Koopman-based control and demonstrates the advantages of the AKMPC-based framework for robust turbofan engine control.
【4】Transfer Learning for Loan Recovery Prediction under Distribution Shifts with Heterogeneous Feature Spaces
标题:异构特征空间中分布偏移下贷款回收预测的迁移学习
链接:https://arxiv.org/abs/2604.02832
作者:Christopher Gerling,Hanqiu Peng,Ying Chen,Stefan Lessmann
备注:Preprint before Peer-Review
摘要:准确预测回收率(RR)是信用风险管理和监管资本确定的核心。然而,在许多贷款组合中,RR建模受到违约事件稀少所导致的数据稀缺的限制。迁移学习(TL)通过利用来自相关但数据更丰富的源域的信息来缓解这一挑战,提供了一条有希望的途径,但其有效性关键取决于分布偏移的存在与强度,以及源和目标特征空间之间的潜在异质性。本文介绍了FT-MDN-Transformer,一种专为异构特征集下RR预测的迁移学习而设计的混合密度表格Transformer架构。该模型同时产生贷款层面的点估计和投资组合层面的预测分布,从而支持广泛的实际RR预测应用。我们在受控蒙特卡罗模拟中评估了所提出的方法,该模拟支持对协变量偏移、条件偏移和标签偏移进行系统性变化;并在以全球信贷数据(GCD)贷款数据集为源域、以一个新的债券数据集为目标域的真实世界迁移环境中进行了评估。结果表明,当目标域数据有限时,FT-MDN-Transformer的性能优于基线模型,在协变量偏移和条件偏移下增益尤为明显,而标签偏移仍具挑战性。我们还观察到其概率预测紧密跟踪经验回收分布,提供了比传统点预测指标更丰富的信息。总体而言,研究结果突出了分布感知的TL架构在数据稀缺的信贷组合中改进RR预测的潜力,并为在异构数据环境下工作的风险管理人员提供了实用的见解。
摘要:Accurate forecasting of recovery rates (RR) is central to credit risk management and regulatory capital determination. In many loan portfolios, however, RR modeling is constrained by data scarcity arising from infrequent default events. Transfer learning (TL) offers a promising avenue to mitigate this challenge by exploiting information from related but richer source domains, yet its effectiveness critically depends on the presence and strength of distributional shifts, and on potential heterogeneity between source and target feature spaces. This paper introduces FT-MDN-Transformer, a mixture-density tabular Transformer architecture specifically designed for TL in RR forecasting across heterogeneous feature sets. The model produces both loan-level point estimates and portfolio-level predictive distributions, thereby supporting a wide range of practical RR forecasting applications. We evaluate the proposed approach in a controlled Monte Carlo simulation that facilitates systematic variation of covariate, conditional, and label shifts, as well as in a real-world transfer setting using the Global Credit Data (GCD) loan dataset as source and a novel bonds dataset as target. Our results show that FT-MDN-Transformer outperforms baseline models when target-domain data are limited, with particularly pronounced gains under covariate and conditional shifts, while label shift remains challenging. We also observe its probabilistic forecasts to closely track empirical recovery distributions, providing richer information than conventional point-prediction metrics alone. Overall, the findings highlight the potential of distribution-aware TL architectures to improve RR forecasting in data-scarce credit portfolios and offer practical insights for risk managers operating under heterogeneous data environments.
【5】Transfer Learning for Meta-analysis Under Covariate Shift
标题:协变量偏移下元分析的迁移学习
链接:https://arxiv.org/abs/2604.02656
作者:Zilong Wang,Ali Abdeen,Turgay Ayer
备注:Accepted to IEEE ICHI 2026 Early Bird Track (Oral Presentation)
摘要:随机对照试验通常不能代表实际做出决策的人群,研究之间的协变量偏移可能使标准的IPD荟萃分析和运输估计量失效。我们提出了一个安慰剂锚定的运输框架,将源试验结果视为丰富的代理信号,将目标试验安慰剂结果视为稀缺的高保真金标签,用于校准基线风险。一个低复杂度(稀疏)的校正项将代理结果模型锚定到目标人群;锚定后的模型被嵌入交叉拟合的双重稳健学习器中,当目标治疗组结果可用时,可得到一个Neyman正交的、针对目标位点的双重稳健的患者水平异质性治疗效应估计量。我们区分两种情形:在连通目标(含治疗组)中,该方法产生目标可识别的效应估计;在非连通目标(仅安慰剂)中,它在明确的工作模型运输假设下退化为一个有原则的"先筛选后运输"程序。在合成数据和半合成IHDP基准上的实验评估了逐点CATE精度、ATE误差、用于靶向的排序质量、决策论意义下的策略遗憾以及校准。在各连通设置中,所提方法最优或接近最优,并且在目标样本量较小时大幅优于仅用代理、仅用目标以及运输类基线;在非连通设置中,它为靶向保留了强排序性能,而逐点精度取决于工作运输条件的强度。
摘要:Randomized controlled trials often do not represent the populations where decisions are made, and covariate shift across studies can invalidate standard IPD meta-analysis and transport estimators. We propose a placebo-anchored transport framework that treats source-trial outcomes as abundant proxy signals and target-trial placebo outcomes as scarce, high-fidelity gold labels to calibrate baseline risk. A low-complexity (sparse) correction anchors proxy outcome models to the target population, and the anchored models are embedded in a cross-fitted doubly robust learner, yielding a Neyman-orthogonal, target-site doubly robust estimator for patient-level heterogeneous treatment effects when target treated outcomes are available. We distinguish two regimes: in connected targets (with a treated arm), the method yields target-identified effect estimates; in disconnected targets (placebo-only), it reduces to a principled screen--then--transport procedure under explicit working-model transport assumptions. Experiments on synthetic data and a semi-synthetic IHDP benchmark evaluate pointwise CATE accuracy, ATE error, ranking quality for targeting, decision-theoretic policy regret, and calibration. Across connected settings, the proposed method is best or near-best and improves substantially over proxy-only, target-only, and transport baselines at small target sample sizes; in disconnected settings, it retains strong ranking performance for targeting while pointwise accuracy depends on the strength of the working transport condition.
【6】Optimal Projection-Free Adaptive SGD for Matrix Optimization
标题:用于矩阵优化的最佳无投影自适应SGD
链接:https://arxiv.org/abs/2604.02505
作者:Dmitry Kovalev
摘要:最近,Jiang等人[2026]提出了Leon,这是用于在线凸优化的One-sided Shampoo[Xie等人,2025a;An等人,2025]算法的一个实用变体,它不需要在每次迭代时计算昂贵的二次投影。不幸的是,根据现有分析,Leon需要在其预条件子中调整一个额外的超参数,并且在有界梯度假设之外无法为凸优化问题获得与维数无关的收敛保证。在本文中,我们通过证明Leon预条件子的某些稳定性性质解决了这一问题。利用我们改进的分析,我们表明可以避免调整该额外超参数,并且更重要的是,开发了第一个带Nesterov加速且不需要在每次迭代时计算投影的One-sided Shampoo实用变体。作为附带贡献,我们在非光滑非凸设置下得到了改进的与维数无关的速率,并对所提算法给出了统一分析,由此得到带(块)对角预条件子的加速无投影自适应SGD。
摘要:Recently, Jiang et al. [2026] developed Leon, a practical variant of One-sided Shampoo [Xie et al., 2025a, An et al., 2025] algorithm for online convex optimization, which does not require computing a costly quadratic projection at each iteration. Unfortunately, according to the existing analysis, Leon requires tuning an additional hyperparameter in its preconditioner and cannot achieve dimension-independent convergence guarantees for convex optimization problems beyond the bounded gradients assumption. In this paper, we resolve this issue by proving certain stability properties of Leon's preconditioner. Using our improved analysis, we show that tuning the extra hyperparameter can be avoided and, more importantly, develop the first practical variant of One-sided Shampoo with Nesterov acceleration, which does not require computing projections at each iteration. As a side contribution, we obtain improved dimension-independent rates in the non-smooth non-convex setting and develop a unified analysis of the proposed algorithm, which yields accelerated projection-free adaptive SGD with (block-)diagonal preconditioners.
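作为"(块)对角预条件子的无投影自适应 SGD"这一更新形式的直观示例,下面给出 AdaGrad 型对角缩放的一步更新(示意草图,并非 Leon 或 One-sided Shampoo 的实现):
```python
# 逐坐标累积梯度平方作为对角预条件子,更新无需任何投影步骤。
import numpy as np

def diag_adaptive_sgd_step(w: np.ndarray, grad: np.ndarray, state: dict,
                           lr: float = 0.1, eps: float = 1e-8) -> np.ndarray:
    """state['G'] 保存累积的逐坐标梯度平方; 返回更新后的参数。"""
    state.setdefault('G', np.zeros_like(w))
    state['G'] += grad ** 2
    return w - lr * grad / np.sqrt(state['G'] + eps)
```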
强化学习(7篇)
【1】Generalization Limits of Reinforcement Learning Alignment
标题:强化学习对齐的一般化限制
链接:https://arxiv.org/abs/2604.02652
作者:Haruhi Shida,Koo Imai,Keigo Kansa
备注:7 pages, 2 figures, 2 tables, accepted at JSAI 2026
摘要:大型语言模型(LLM)的安全性依赖于诸如基于人类反馈的强化学习(RLHF)之类的对齐技术。然而,最近的理论分析表明,基于强化学习的训练并没有获得新的能力,而只是重新分配了现有能力被调用的概率。在这项研究中,我们提出了针对OpenAI gpt-oss-20b的"复合越狱",它利用了对齐的泛化失败。这种方法组合多种攻击技术(其中每种技术单独使用时均已被防御),以使指令层次结构的维护过程饱和。我们的评估表明,攻击成功率(ASR)从单独方法的14.3%提高到组合方法的71.4%。这些结果为"安全训练的泛化范围不及模型能力"这一假设提供了经验证据,凸显了使用复合攻击场景进行多方面安全评估的必要性。
摘要:The safety of large language models (LLMs) relies on alignment techniques such as reinforcement learning from human feedback (RLHF). However, recent theoretical analyses suggest that reinforcement learning-based training does not acquire new capabilities but merely redistributes the utilization probabilities of existing ones. In this study, we propose ``compound jailbreaks'' targeting OpenAI gpt-oss-20b, which exploit the generalization failures of alignment. This approach combines multiple attack techniques -- each individually defended against -- to saturate the instruction hierarchy maintenance process. Our evaluation shows that the attack success rate (ASR) increased from 14.3\% with individual methods to 71.4\% with the combined approach. These results provide empirical evidence for the hypothesis that safety training does not generalize as broadly as model capabilities, highlighting the need for multifaceted safety evaluations using compound attack scenarios.
【2】Interpretable Deep Reinforcement Learning for Element-level Bridge Life-cycle Optimization
标题:用于元素级桥梁生命周期优化的可解释深度强化学习
链接:https://arxiv.org/abs/2604.02528
作者:Seyyed Amirhossein Moayyedi,David Y. Yang
备注:under review
摘要:从2022年起生效的《国家桥梁清单规范》(SNBI)强调使用元素级条件状态(CS)进行基于风险的桥梁管理。与笼统的构件评级不同,元素级条件数据使用相对CS数量的阵列(即CS比例)来表示桥梁的状况。虽然这大大提高了桥梁状况数据的粒度,但由于状态空间从单一的分类整数扩展为四维概率阵列,也给建立最优生命周期策略带来了挑战。本研究提出一种新的可解释强化学习(RL)方法,以寻求基于元素级状态表示的最优生命周期策略。与现有强化学习方法相比,所提算法以节点数量和深度适中的斜决策树形式给出生命周期策略,使其可以直接被人理解和审计,并易于集成到当前的桥梁管理系统中。为了获得接近最优的策略,所提方法对现有RL方法做了三项主要改进:(a)使用可微的软决策树模型作为actor函数近似器,(b)训练期间的温度退火过程,以及(c)与剪枝规则配合的正则化以限制策略复杂度。总的来说,这些改进能够以确定性斜决策树的形式产生可解释的生命周期策略。这些技术的收益与权衡在监督学习和强化学习两种设置中均得到了演示。所得框架在钢梁桥的生命周期优化问题中进行了展示。
摘要:The new Specifications for the National Bridge Inventory (SNBI), in effect from 2022, emphasize the use of element-level condition states (CS) for risk-based bridge management. Instead of a general component rating, element-level condition data use an array of relative CS quantities (i.e., CS proportions) to represent the condition of a bridge. Although this greatly increases the granularity of bridge condition data, it introduces challenges to set up optimal life-cycle policies due to the expanded state space from one single categorical integer to four-dimensional probability arrays. This study proposes a new interpretable reinforcement learning (RL) approach to seek optimal life-cycle policies based on element-level state representations. Compared to existing RL methods, the proposed algorithm yields life-cycle policies in the form of oblique decision trees with reasonable amounts of nodes and depth, making them directly understandable and auditable by humans and easily implementable into current bridge management systems. To achieve near-optimal policies, the proposed approach introduces three major improvements to existing RL methods: (a) the use of differentiable soft tree models as actor function approximators, (b) a temperature annealing process during training, and (c) regularization paired with pruning rules to limit policy complexity. Collectively, these improvements can yield interpretable life-cycle policies in the form of deterministic oblique decision trees. The benefits and trade-offs from these techniques are demonstrated in both supervised and reinforcement learning settings. The resulting framework is illustrated in a life-cycle optimization problem for steel girder bridges.
【3】Mitigating Data Scarcity in Spaceflight Applications for Offline Reinforcement Learning Using Physics-Informed Deep Generative Models
标题:使用物理知识的深度生成模型缓解航天应用中的数据稀缺性进行离线强化学习
链接:https://arxiv.org/abs/2604.02438
作者:Alex E. Ballentine,Nachiket U. Bapat,Raghvendra V. Cowlagi
摘要:基于强化学习(RL)的控制器在物理系统上的部署通常受限于对真实世界场景的较差泛化能力,即所谓的仿真到现实(sim-to-real)差距。这一差距在航天领域尤其具有挑战性,因为高成本和有限的行星探测数据使得真实世界的训练数据十分稀缺。系统辨识和合成数据生成等传统方法依赖于充足的数据,并常常因建模假设或缺乏基于物理的约束而失效。我们提出通过在生成模型中引入基于物理的学习偏置来解决这种数据稀缺问题。具体来说,我们开发了基于互信息的分裂变分自动编码器(MI-VAE),这是一种物理信息VAE,用于学习观测到的系统轨迹与基于物理模型预测的轨迹之间的差异。MI-VAE的潜在空间使得能够生成遵守物理约束的合成数据集。我们在行星着陆器问题上评估MI-VAE,重点关注有限的真实世界数据和离线RL训练。结果表明,用MI-VAE样本扩充数据集显著提高了下游RL性能,在统计保真度、样本多样性和策略成功率方面优于标准VAE。这项工作展示了一种可扩展的策略,用于在复杂、数据受限的环境中提高自主控制器的鲁棒性。
摘要:The deployment of reinforcement learning (RL)-based controllers on physical systems is often limited by poor generalization to real-world scenarios, known as the simulation-to-reality (sim-to-real) gap. This gap is particularly challenging in spaceflight, where real-world training data are scarce due to high cost and limited planetary exploration data. Traditional approaches, such as system identification and synthetic data generation, depend on sufficient data and often fail due to modeling assumptions or lack of physics-based constraints. We propose addressing this data scarcity by introducing physics-based learning bias in a generative model. Specifically, we develop the Mutual Information-based Split Variational Autoencoder (MI-VAE), a physics-informed VAE that learns differences between observed system trajectories and those predicted by physics-based models. The latent space of the MI-VAE enables generation of synthetic datasets that respect physical constraints. We evaluate MI-VAE on a planetary lander problem, focusing on limited real-world data and offline RL training. Results show that augmenting datasets with MI-VAE samples significantly improves downstream RL performance, outperforming standard VAEs in statistical fidelity, sample diversity, and policy success rate. This work demonstrates a scalable strategy for enhancing autonomous controller robustness in complex, data-constrained environments.
【4】Prism: Policy Reuse via Interpretable Strategy Mapping in Reinforcement Learning
标题:棱镜:强化学习中通过可解释策略映射的策略重用
链接:https://arxiv.org/abs/2604.02353
作者:Thomas Pravetz
备注:13 pages, 3 figures, 5 tables
摘要:我们提出了PRISM(通过可解释策略映射的策略重用),这是一个将强化学习智能体的决策建立在离散的、经因果验证的概念之上,并将这些概念用作以不同算法训练的智能体之间零样本迁移接口的框架。PRISM通过K-means将每个智能体的编码器特征聚类为$K$个概念。因果干预表明这些概念直接驱动(而不仅仅是相关于)智能体行为:在69.4%的干预中,改写概念分配改变了所选动作($p = 8.6 \times 10^{-86}$,2500次干预)。概念的重要性与使用频率是解耦的:最常用的概念(C47,频率33.0%)被消融时仅导致9.4%的胜率下降,而消融C16(频率15.4%)使胜率从100%跌至51.8%。由于概念以因果方式编码策略,通过最优二分匹配对齐这些概念即可零样本迁移策略知识。在Go 7$\times$7上,对于三个独立训练的智能体,在两个成功的迁移对(10个种子)中,概念迁移对阵标准引擎分别取得69.5%$\pm$3.2%和76.4%$\pm$3.4%的胜率,而随机智能体的胜率为3.5%,不做对齐的胜率为9.2%。当源策略较强时迁移成功;几何对齐质量不具有任何预测力($R^2 \approx 0$)。该框架适用于战略状态天然离散的领域:同一管道在Atari Breakout上得到的瓶颈策略仅达到随机智能体的性能,证实Go上的结果反映了该领域的结构性质。
摘要:We present PRISM (Policy Reuse via Interpretable Strategy Mapping), a framework that grounds reinforcement learning agents' decisions in discrete, causally validated concepts and uses those concepts as a zero-shot transfer interface between agents trained with different algorithms. PRISM clusters each agent's encoder features into $K$ concepts via K-means. Causal intervention establishes that these concepts directly drive - not merely correlate with - agent behavior: overriding concept assignments changes the selected action in 69.4% of interventions ($p = 8.6 \times 10^{-86}$, 2500 interventions). Concept importance and usage frequency are dissociated: the most-used concept (C47, 33.0% frequency) causes only a 9.4% win-rate drop when ablated, while ablating C16 (15.4% frequency) collapses win rate from 100% to 51.8%. Because concepts causally encode strategy, aligning them via optimal bipartite matching transfers strategic knowledge zero-shot. On Go~7$\times$7 with three independently trained agents, concept transfer achieves 69.5%$\pm$3.2% and 76.4%$\pm$3.4% win rate against a standard engine across the two successful transfer pairs (10 seeds), compared to 3.5% for a random agent and 9.2% without alignment. Transfer succeeds when the source policy is strong; geometric alignment quality predicts nothing ($R^2 \approx 0$). The framework is scoped to domains where strategic state is naturally discrete: the identical pipeline on Atari Breakout yields bottleneck policies at random-agent performance, confirming that the Go results reflect a structural property of the domain.
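摘要中"K-means 概念提取 + 最优二分匹配对齐"的流程可以用现成工具简洁地示意如下(假设性草图:以概念中心的欧氏距离作匹配代价,这只是一种可能的代价选择):
```python
# 对编码器特征做 K-means 得到概念中心,再用匈牙利算法求最小代价的一一匹配。
import numpy as np
from sklearn.cluster import KMeans
from scipy.optimize import linear_sum_assignment

def extract_concepts(features: np.ndarray, k: int = 64, seed: int = 0) -> np.ndarray:
    """features: (N, d) 编码器特征 -> (k, d) 概念中心。"""
    return KMeans(n_clusters=k, random_state=seed, n_init=10).fit(features).cluster_centers_

def align_concepts(centers_a: np.ndarray, centers_b: np.ndarray) -> dict:
    """返回智能体 A 概念到智能体 B 概念的最优一一匹配。"""
    cost = np.linalg.norm(centers_a[:, None, :] - centers_b[None, :, :], axis=-1)
    row, col = linear_sum_assignment(cost)
    return dict(zip(row.tolist(), col.tolist()))
```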
【5】OPRIDE: Offline Preference-based Reinforcement Learning via In-Dataset Exploration
标题:OPRIDE:通过数据集内探索进行基于离线偏好的强化学习
链接:https://arxiv.org/abs/2604.02349
作者:Yiqin Yang,Hao Hu,Yihuan Mao,Jin Zhang,Chengjie Wu,Yuhua Jiang,Xu Yang,Runpeng Xie,Yi Fan,Bo Liu,Yang Gao,Bo Xu,Chongjie Zhang
摘要:基于偏好的强化学习(PbRL)有助于避免复杂的奖励设计,并更好地与人类意图保持一致,在各种现实世界应用中显示出巨大前景。然而,获取人类偏好反馈可能昂贵且耗时,这对PbRL构成了很大的障碍。在这项工作中,我们解决离线PbRL中查询效率低的问题,指出其两个主要原因:低效的探索和对所学奖励函数的过度优化。为应对这些挑战,我们提出了一种新算法:Offline PbRL via In-Dataset Exploration(OPRIDE),旨在提高离线PbRL的查询效率。OPRIDE包含两个关键组成部分:一个最大化查询信息量的有原则的探索策略,以及一个旨在缓解对所学奖励函数过度优化的折扣调度机制。通过实证评估,我们表明OPRIDE显著优于以前的方法,以明显更少的查询实现了强大的性能。此外,我们为算法的效率提供了理论保证。在各种运动、操纵和导航任务上的实验结果强调了我们方法的有效性和多功能性。
摘要:Preference-based reinforcement learning (PbRL) can help avoid sophisticated reward designs and align better with human intentions, showing great promise in various real-world applications. However, obtaining human feedback for preferences can be expensive and time-consuming, which forms a strong barrier for PbRL. In this work, we address the problem of low query efficiency in offline PbRL, pinpointing two primary reasons: inefficient exploration and overoptimization of learned reward functions. In response to these challenges, we propose a novel algorithm, \textbf{O}ffline \textbf{P}b\textbf{R}L via \textbf{I}n-\textbf{D}ataset \textbf{E}xploration (OPRIDE), designed to enhance the query efficiency of offline PbRL. OPRIDE consists of two key features: a principled exploration strategy that maximizes the informativeness of the queries and a discount scheduling mechanism aimed at mitigating overoptimization of the learned reward functions. Through empirical evaluations, we demonstrate that OPRIDE significantly outperforms prior methods, achieving strong performance with notably fewer queries. Moreover, we provide theoretical guarantees of the algorithm's efficiency. Experimental results across various locomotion, manipulation, and navigation tasks underscore the efficacy and versatility of our approach.
【6】Contextual Intelligence: The Next Leap for Reinforcement Learning
标题:上下文智能:强化学习的下一个飞跃
链接:https://arxiv.org/abs/2604.02348
作者:André Biedenkapp
备注:Accepted to AAMAS 2025 (Blue Sky Ideas Track)
摘要:强化学习(RL)在游戏、机器人和连续控制方面取得了惊人的成果。然而,尽管取得了这些成功,学到的策略往往无法泛化到其训练分布之外,限制了现实世界的影响。最近关于上下文RL(cRL)的工作表明,让智能体接触环境特征(即上下文)可以改善零样本迁移。到目前为止,社区一直将上下文视为单一的、静态的可观测量,这种做法限制了RL智能体的泛化能力。
为了实现上下文智能,我们首先提出了一种新的上下文分类法,将异体(环境强加的)因素与自体(智能体驱动的)因素区分开来。我们确定了要促成真正的上下文智能必须解决的三个基本研究方向:(1)异构上下文下的学习,显式利用分类法的各个层级,使智能体能够推理它们对世界的影响,反之亦然;(2)多时间尺度建模,以认识到异体变量演变缓慢或保持静态,而自体变量可能在单个回合内发生变化,可能需要不同的学习机制;(3)整合抽象的高层上下文,包括角色、资源与监管制度、不确定性以及其他对行为有关键影响的非物理描述符。
我们将上下文设想为一等建模原语,使智能体能够推理它们是谁、世界允许什么,以及两者如何随时间演变。通过这样做,我们旨在催生新一代可在现实世界中安全高效部署的上下文感知智能体。
摘要:Reinforcement learning (RL) has produced spectacular results in games, robotics, and continuous control. Yet, despite these successes, learned policies often fail to generalize beyond their training distribution, limiting real-world impact. Recent work on contextual RL (cRL) shows that exposing agents to environment characteristics -- contexts -- can improve zero-shot transfer. So far, the community has treated context as a monolithic, static observable, an approach that constrains the generalization capabilities of RL agents. To achieve contextual intelligence we first propose a novel taxonomy of contexts that separates allogenic (environment-imposed) from autogenic (agent-driven) factors. We identify three fundamental research directions that must be addressed to promote truly contextual intelligence: (1) Learning with heterogeneous contexts to explicitly exploit the taxonomy levels so agents can reason about their influence on the world and vice versa; (2) Multi-time-scale modeling to recognize that allogenic variables evolve slowly or remain static, whereas autogenic variables may change within an episode, potentially requiring different learning mechanisms; (3) Integration of abstract, high-level contexts to incorporate roles, resource & regulatory regimes, uncertainties, and other non-physical descriptors that crucially influence behavior. We envision context as a first-class modeling primitive, empowering agents to reason about who they are, what the world permits, and how both evolve over time. By doing so, we aim to catalyze a new generation of context-aware agents that can be deployed safely and efficiently in the real world.
【7】Reinforcement Learning from Human Feedback: A Statistical Perspective
标题:来自人类反馈的强化学习:统计角度
链接:https://arxiv.org/abs/2604.02507
作者:Pangpang Liu,Chengchun Shi,Will Wei Sun
摘要:基于人类反馈的强化学习(RLHF)已成为使大型语言模型(LLM)与人类偏好对齐的核心框架。尽管RLHF在实践中取得了成功,但它提出了一些基本的统计问题,因为它依赖于有噪声的、主观的且往往异质的反馈来学习奖励模型和优化策略。本综述从统计角度审视RLHF,主要聚焦于LLM对齐设置。我们介绍RLHF的主要组成部分,包括监督微调、奖励建模和策略优化,并将它们与熟悉的统计思想联系起来,例如Bradley-Terry-Luce(BTL)模型、潜在效用估计、主动学习、实验设计和不确定性量化。我们回顾了从成对偏好数据中学习奖励函数的方法,以及通过两阶段RLHF管道和诸如直接偏好优化等新兴单阶段方法优化策略的方法。我们进一步讨论了最近的扩展,包括来自AI反馈的强化学习、推理时算法和来自可验证奖励的强化学习,以及支持RLHF研究的基准数据集、评估协议和开源框架。最后,我们强调RLHF中尚待解决的挑战。附带的GitHub演示https://github.com/Pangpang-Liu/RLHF_demo展示了RLHF管道的关键组件。
摘要:Reinforcement learning from human feedback (RLHF) has emerged as a central framework for aligning large language models (LLMs) with human preferences. Despite its practical success, RLHF raises fundamental statistical questions because it relies on noisy, subjective, and often heterogeneous feedback to learn reward models and optimize policies. This survey provides a statistical perspective on RLHF, focusing primarily on the LLM alignment setting. We introduce the main components of RLHF, including supervised fine-tuning, reward modeling, and policy optimization, and relate them to familiar statistical ideas such as Bradley-Terry-Luce (BTL) model, latent utility estimation, active learning, experimental design, and uncertainty quantification. We review methods for learning reward functions from pairwise preference data and for optimizing policies through both two-stage RLHF pipelines and emerging one-stage approaches such as direct preference optimization. We further discuss recent extensions including reinforcement learning from AI feedback, inference-time algorithms, and reinforcement learning from verifiable rewards, as well as benchmark datasets, evaluation protocols, and open-source frameworks that support RLHF research. We conclude by highlighting open challenges in RLHF. An accompanying GitHub demo https://github.com/Pangpang-Liu/RLHF_demo illustrates key components of the RLHF pipeline.
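综述提到的 Bradley-Terry-Luce 模型在奖励建模中的标准用法,是对"被偏好/被拒绝"回复打分并最小化成对偏好损失;一个最小示意如下(张量形状为演示假设):
```python
# BTL 成对偏好损失:-log sigmoid(r_chosen - r_rejected)。
import torch
import torch.nn.functional as F

def btl_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """r_*: (B,) 奖励模型对两条回复的标量打分; 返回批均值损失。"""
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```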
符号|符号学习(1篇)
【1】Differentiable Symbolic Planning: A Neural Architecture for Constraint Reasoning with Learned Feasibility
标题:可区分符号规划:具有习得可行性的约束推理神经架构
链接:https://arxiv.org/abs/2604.02350
作者:Venkatakrishna Reddy Oruganti
备注:12 pages, 4 figures, 7 tables
摘要:神经网络擅长模式识别,但在约束推理(即判断某个配置是否满足逻辑或物理约束)方面存在困难。我们提出可微分符号规划(DSP),一种在保持完全可微的同时执行离散符号推理的神经架构。DSP维护一个可行性通道(phi),用于跟踪每个节点上的约束满足证据,通过学习到的规则加权组合将其聚合为全局可行性信号(Phi),并使用sparsemax注意力实现精确为零的离散规则选择。我们将DSP集成到一个通用认知内核(UCK)中,将图注意力与迭代约束传播相结合。在三个约束推理基准(图可达性、布尔可满足性和规划可行性)上的评估表明,UCK+DSP在4倍规模泛化下的规划准确率为97.4%(消融基线为59.7%),在2倍泛化下的SAT准确率为96.4%,并在标准神经方法失效的正类和负类上保持均衡的性能。消融研究表明,全局phi聚合至关重要:移除它会使准确率从98%下降到64%。学习到的phi信号表现出可解释的语义,可行情形的值约为+18,不可行情形约为-13,且这一语义在无监督条件下自发涌现。
摘要:Neural networks excel at pattern recognition but struggle with constraint reasoning -- determining whether configurations satisfy logical or physical constraints. We introduce Differentiable Symbolic Planning (DSP), a neural architecture that performs discrete symbolic reasoning while remaining fully differentiable. DSP maintains a feasibility channel (phi) that tracks constraint satisfaction evidence at each node, aggregates this into a global feasibility signal (Phi) through learned rule-weighted combination, and uses sparsemax attention to achieve exact-zero discrete rule selection. We integrate DSP into a Universal Cognitive Kernel (UCK) that combines graph attention with iterative constraint propagation. Evaluated on three constraint reasoning benchmarks -- graph reachability, Boolean satisfiability, and planning feasibility -- UCK+DSP achieves 97.4% accuracy on planning under 4x size generalization (vs. 59.7% for ablated baselines), 96.4% on SAT under 2x generalization, and maintains balanced performance on both positive and negative classes where standard neural approaches collapse. Ablation studies reveal that global phi aggregation is critical: removing it causes accuracy to drop from 98% to 64%. The learned phi signal exhibits interpretable semantics, with values of +18 for feasible cases and -13 for infeasible cases emerging without supervision.
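摘要中用于"精确为零的离散规则选择"的 sparsemax(Martins & Astudillo, 2016)有闭式的单纯形投影解,下面是一个一维 numpy 示意实现:
```python
# sparsemax(z) = argmin_p ||p - z||^2, s.t. p 在概率单纯形上;解会把部分坐标精确置零。
import numpy as np

def sparsemax(z: np.ndarray) -> np.ndarray:
    z_sorted = np.sort(z)[::-1]
    cssv = np.cumsum(z_sorted)
    k = np.arange(1, len(z) + 1)
    support = 1 + k * z_sorted > cssv          # 支撑集条件
    k_z = k[support][-1]
    tau = (cssv[support][-1] - 1) / k_z        # 阈值
    return np.maximum(z - tau, 0.0)

# 例:sparsemax(np.array([2.0, 1.0, -1.0])) == [1., 0., 0.],后两个坐标被精确置零。
```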
医学相关(2篇)
【1】PR3DICTR: A modular AI framework for medical 3D image-based detection and outcome prediction
标题:PR3DICTR:用于基于医学3D图像的检测和结果预测的模块化人工智能框架
链接:https://arxiv.org/abs/2604.03203
作者:Daniel C. MacRae,Luuk van der Hoek,Robert van der Wal,Suzanne P. M. de Vette,Hendrike Neh,Baoqiang Ma,Peter M. A. van Ooijen,Lisanne V. van Dijk
备注:16 pages, 6 figures and 1 table
摘要:三维医学图像数据和计算机辅助决策(特别是使用深度学习)在医学领域正变得越来越重要。为了助力这些发展,我们介绍PR3DICTR:3D图像分类与标准化训练研究平台(Platform for Research in 3D Image Classification and sTandardised tRaining)。PR3DICTR使用社区标准发行版(PyTorch和MONAI)构建,为预测模型开发提供了一个开放获取、灵活便捷的框架,明确聚焦于使用三维医学图像数据的分类任务。通过将模块化设计原则与标准化相结合,它旨在减轻开发负担,同时保持可调整性。它为用户提供了丰富的预置功能,例如模型架构设计选项、超参数解决方案和训练方法,但仍然给用户留出了"插入"自己的解决方案或模块的机会和自由。PR3DICTR可以应用于任何二元或基于事件的三维分类任务,只需两行代码即可工作。
摘要:Three-dimensional medical image data and computer-aided decision making, particularly using deep learning, are becoming increasingly important in the medical field. To aid in these developments we introduce PR3DICTR: Platform for Research in 3D Image Classification and sTandardised tRaining. Built using community-standard distributions (PyTorch and MONAI), PR3DICTR provides an open-access, flexible and convenient framework for prediction model development, with an explicit focus on classification using three-dimensional medical image data. By combining modular design principles and standardization, it aims to alleviate developmental burden whilst retaining adjustability. It provides users with a wealth of pre-established functionality, for instance in model architecture design options, hyper-parameter solutions and training methodologies, but still gives users the opportunity and freedom to ``plug in'' their own solutions or modules. PR3DICTR can be applied to any binary or event-based three-dimensional classification task and can work with as few as two lines of code.
【2】Overconfidence and Calibration in Medical VQA: Empirical Findings and Hallucination-Aware Mitigation
标题:医疗VQA中的过度自信与校准:经验发现与幻觉感知缓解
链接:https://arxiv.org/abs/2604.02543
作者:Ji Young Byun,Young-Jin Park,Jean-Philippe Corbeil,Asma Ben Abacha
摘要:随着视觉语言模型(VLM)越来越多地部署于临床决策支持,仅有准确性是不够的:知道何时信任其预测同样重要。然而,医疗领域仍然明显缺乏对这些模型过度自信问题的全面、系统的研究。我们通过一项关于VLM置信度校准的全面实证研究来填补这一空白,研究横跨三个模型家族(Qwen3-VL、InternVL3、LLaVA-NeXT)、三种模型规模(2B-38B)和多种置信度估计提示策略,覆盖三个医疗视觉问答(VQA)基准。我们的研究得到三个关键发现:第一,过度自信在各模型家族中普遍存在,并且无法通过扩大规模或提示(例如思维链和言语化置信度等变体)来解决。第二,简单的事后校准方法(如Platt缩放)可以降低校准误差,并始终优于基于提示的策略。第三,由于这些事后校准方法的(严格)单调性,它们在提高预测判别质量方面存在固有局限,使AUROC保持在同一水平。受这些发现的启发,我们研究了幻觉感知校准(HAC),它将基于视觉的幻觉检测信号作为补充输入来改进置信度估计。我们发现利用这些幻觉信号可以同时改善校准和AUROC,其中开放式问题上的增益最大。总体而言,我们的结果表明,医疗VLM部署时应以事后校准(而非原始置信度估计)作为标准做法,并强调了幻觉信号的实际用途,可以使VLM在医疗VQA中得到更可靠的使用。
摘要:As vision-language models (VLMs) are increasingly deployed in clinical decision support, more than accuracy is required: knowing when to trust their predictions is equally critical. Yet, a comprehensive and systematic investigation into the overconfidence of these models remains notably scarce in the medical domain. We address this gap through a comprehensive empirical study of confidence calibration in VLMs, spanning three model families (Qwen3-VL, InternVL3, LLaVA-NeXT), three model scales (2B--38B), and multiple confidence estimation prompting strategies, across three medical visual question answering (VQA) benchmarks. Our study yields three key findings: First, overconfidence persists across model families and is not resolved by scaling or prompting, such as chain-of-thought and verbalized confidence variants. Second, simple post-hoc calibration approaches, such as Platt scaling, reduce calibration error and consistently outperform the prompt-based strategy. Third, due to their (strict) monotonicity, these post-hoc calibration methods are inherently limited in improving the discriminative quality of predictions, leaving AUROC at the same level. Motivated by these findings, we investigate hallucination-aware calibration (HAC), which incorporates vision-grounded hallucination detection signals as complementary inputs to refine confidence estimates. We find that leveraging these hallucination signals improves both calibration and AUROC, with the largest gains on open-ended questions. Overall, our findings suggest post-hoc calibration as standard practice for medical VLM deployment over raw confidence estimates, and highlight the practical usefulness of hallucination signals to enable more reliable use of VLMs in medical VQA.
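摘要中作为标准做法推荐的 Platt 缩放,本质上是在验证集上用一维逻辑回归把原始置信度映射为校准后的正确概率;示意如下(数据与函数名均为演示假设):
```python
# 在 (置信度, 是否答对) 对上拟合逻辑回归,再用它校准新样本的置信度。
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_platt(conf_val: np.ndarray, correct_val: np.ndarray) -> LogisticRegression:
    """conf_val: (N,) 模型原始置信度; correct_val: (N,) 0/1 标签。"""
    return LogisticRegression().fit(conf_val.reshape(-1, 1), correct_val)

def calibrate(platt: LogisticRegression, conf: np.ndarray) -> np.ndarray:
    """返回校准后的置信度(预测为正确的概率)。"""
    return platt.predict_proba(conf.reshape(-1, 1))[:, 1]
```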
联邦学习|隐私保护|加密(2篇)
【1】Enhancing Robustness of Federated Learning via Server Learning
标题:通过服务器学习增强联邦学习的鲁棒性
链接:https://arxiv.org/abs/2604.03226
作者:Van Sy Mai,Kushal Chakrabarti,Richard J. La,Dipankar Maity
摘要:本文探讨了使用服务器学习来增强联邦学习对恶意攻击的鲁棒性,即使客户端的训练数据不是独立同分布的。我们提出了一种启发式算法,将服务器学习和客户端更新过滤与几何中位数聚合相结合。我们通过实验证明,即使恶意客户端比例很高(某些情况下甚至超过$50\%$),且服务器所用数据集很小、可以是合成的且其分布不一定接近客户端聚合数据的分布,这种方法仍能显著提高模型精度。
摘要:This paper explores the use of server learning for enhancing the robustness of federated learning against malicious attacks even when clients' training data are not independent and identically distributed. We propose a heuristic algorithm that uses server learning and client update filtering in combination with geometric median aggregation. We demonstrate via experiments that this approach can achieve significant improvement in model accuracy even when the fraction of malicious clients is high, even more than $50\%$ in some cases, and the dataset utilized by the server is small and could be synthetic with its distribution not necessarily close to that of the clients' aggregated data.
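摘要中的几何中位数聚合通常用 Weiszfeld 迭代近似计算;下面是一个最小示意(未包含论文中的服务器学习与客户端更新过滤步骤):
```python
# Weiszfeld 迭代:按到当前点距离的倒数加权平均,收敛到几何中位数,
# 对少数异常(恶意)更新具有稳健性。
import numpy as np

def geometric_median(updates: np.ndarray, iters: int = 100, eps: float = 1e-8) -> np.ndarray:
    """updates: (n_clients, d) 客户端更新向量 -> (d,) 几何中位数。"""
    z = updates.mean(axis=0)
    for _ in range(iters):
        d = np.linalg.norm(updates - z, axis=1)
        w = 1.0 / np.maximum(d, eps)                 # 距离越远权重越小
        z_new = (w[:, None] * updates).sum(axis=0) / w.sum()
        if np.linalg.norm(z_new - z) < eps:
            break
        z = z_new
    return z
```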
【2】MLFCIL: A Multi-Level Forgetting Mitigation Framework for Federated Class-Incremental Learning in LEO Satellites
标题:MLFCIL:LEO卫星中联邦类增量学习的多层遗忘缓解框架
链接:https://arxiv.org/abs/2604.02356
作者:Heng Zhang,Xiaohong Deng,Sijing Duan,Wu Ouyang,KM Mahfujul,Yiqin Deng,Zhigang Chen
备注:Submitted to IEEE Internet of Things Journal
摘要:低地球轨道(LEO)卫星星座正越来越多地执行星上计算。然而,在严格的内存和通信限制下不断出现的新类别给协作训练带来了重大挑战。联邦类增量学习(FCIL)无需共享原始数据即可实现分布式增量学习,但面临三个LEO特有的挑战:轨道动力学引起的非独立同分布数据异构性、聚合过程中被放大的灾难性遗忘,以及在有限资源下平衡稳定性与可塑性的需求。为应对这些挑战,我们提出了MLFCIL,一个多层次的遗忘缓解框架,它将灾难性遗忘分解为三个来源并在不同层级上加以解决:用类别重加权损失减少局部偏差,用带特征重放和原型引导漂移补偿的知识蒸馏保留跨任务知识,以及用类感知聚合减轻联邦聚合期间的遗忘。此外,我们设计了一种双粒度协调策略,将轮级自适应损失平衡与步级梯度投影相结合,以进一步改善稳定性-可塑性权衡。在NWPU-RESISC45数据集上的实验表明,MLFCIL在准确性和遗忘缓解方面都显著优于基线,同时引入的资源开销极小。
摘要:Low-Earth-orbit (LEO) satellite constellations are increasingly performing on-board computing. However, the continuous emergence of new classes under strict memory and communication constraints poses major challenges for collaborative training. Federated class-incremental learning (FCIL) enables distributed incremental learning without sharing raw data, but faces three LEO-specific challenges: non-independent and identically distributed data heterogeneity caused by orbital dynamics, amplified catastrophic forgetting during aggregation, and the need to balance stability and plasticity under limited resources. To tackle these challenges, we propose MLFCIL, a multi-level forgetting mitigation framework that decomposes catastrophic forgetting into three sources and addresses them at different levels: class-reweighted loss to reduce local bias, knowledge distillation with feature replay and prototype-guided drift compensation to preserve cross-task knowledge, and class-aware aggregation to mitigate forgetting during federation. In addition, we design a dual-granularity coordination strategy that combines round-level adaptive loss balancing with step-level gradient projection to further enhance the stability-plasticity trade-off. Experiments on the NWPU-RESISC45 dataset show that MLFCIL significantly outperforms baselines in both accuracy and forgetting mitigation, while introducing minimal resource overhead.
推理|分析|理解|解释(9篇)
【1】Real-Time Surrogate Modeling for Personalized Blood Flow Prediction and Hemodynamic Analysis
标题:用于个性化血流预测和血流动力学分析的实时代理建模
链接:https://arxiv.org/abs/2604.03197
作者:Sokratis J. Anagnostopoulos,George Rovas,Vasiliki Bikia,Theodore G. Papaioannou,Athanase D. Protogerou,Nikolaos Stergiopulos
摘要:在过去的几十年里,由于对健康跟踪和心血管疾病早期检测的需求不断增加,心血管建模得到了迅速发展。虽然一维动脉模型在计算效率与解的保真度之间提供了有吸引力的折衷,但将其应用于大规模人群或生成大型计算机模拟(in silico)队列仍然具有挑战性。某些血流动力学参数(如终末阻力/顺应性)难以在临床上估计,朴素采样时往往会产生非生理性的血流动力学,导致大部分模拟数据集被丢弃。在这项工作中,我们提出了一个用于训练机器学习(ML)模型的系统框架,能够进行即时的血流动力学预测和参数估计。我们首先生成一个参数化的虚拟患者队列,该队列基于大型Asklepios临床数据集中观察到的多变量相关性,确保生理参数分布得到遵守。然后,我们训练一个深度神经代理模型,它能够预测患者特定的动脉压和心输出量(CO),从而实现对输入参数的快速先验筛选。这允许立即拒绝非生理性的参数组合,并大大降低了定向合成数据集生成(例如高血压组)的成本。该模型还提供了一种有原则的终末阻力采样手段,以最小化不可测参数的不确定性。此外,通过评估模型的预测性能,我们确定了足以求解CO估计这一逆问题的理论信息。最后,我们将该代理模型应用于一个临床数据集,以估计中心主动脉血流动力学,即CO和主动脉收缩压(cSBP)。
摘要:Cardiovascular modeling has rapidly advanced over the past few decades due to the rising needs for health tracking and early detection of cardiovascular diseases. While 1-D arterial models offer an attractive compromise between computational efficiency and solution fidelity, their application on large populations or for generating large \emph{in silico} cohorts remains challenging. Certain hemodynamic parameters like the terminal resistance/compliance, are difficult to clinically estimate and often yield non-physiological hemodynamics when sampled naively, resulting in large portions of simulated datasets to be discarded. In this work, we present a systematic framework for training machine learning (ML) models, capable of instantaneous hemodynamic prediction and parameter estimation. We initially start with generating a parametric virtual cohort of patients which is based on the multivariate correlations observed in the large Asklepios clinical dataset, ensuring that physiological parameter distributions are respected. We then train a deep neural surrogate model, able to predict patient-specific arterial pressure and cardiac output (CO), enabling rapid a~priori screening of input parameters. This allows for immediate rejection of non-physiological combinations and drastically reduces the cost of targeted synthetic dataset generation (e.g. hypertensive groups). The model also provides a principled means of sampling the terminal resistance to minimize the uncertainties of unmeasurable parameters. Moreover, by assessing the model's predictive performance we determine the theoretical information which suffices for solving the inverse problem of estimating the CO. Finally, we apply the surrogate on a clinical dataset for the estimation of central aortic hemodynamics i.e. the CO and aortic systolic blood pressure (cSBP).
【2】Understanding the Role of Hallucination in Reinforcement Post-Training of Multimodal Reasoning Models
标题:了解幻觉在多模态推理模型强化后训练中的作用
链接:https://arxiv.org/abs/2604.03179
作者:Gengwei Zhang,Jie Peng,Zhen Tan,Mufan Qiu,Hossein Nourkhiz Mahjoub,Vaishnav Tadiparthi,Kwonjoon Lee,Yanyong Zhang,Tianlong Chen
备注:CVPR 2026
摘要:最近强化学习(RL)在大型推理模型中的成功,激发了人们越来越多地将RL用于多模态大型语言模型(MLLM)的后训练,以增强其视觉推理能力。尽管许多研究报告了性能的改善,但目前尚不清楚RL训练是否真的能够使模型从视觉信息中学习。在这项工作中,我们提出了幻觉作为线索(Hallucination-as-Cue)框架,这是一个分析框架,旨在从模型幻觉的角度研究基于RL的后训练对多模态推理模型的影响。具体来说,我们引入了诱导幻觉的、特定于模态的损坏,删除或替换导出正确答案所需的基本信息,从而迫使模型通过幻觉进行推理。通过在训练和评估过程中应用这些损坏,我们的框架为诊断RL训练动态和理解数据集的内在属性提供了独特的视角。通过对多个多模态推理基准的广泛实验和分析,我们揭示了模型幻觉在RL训练中的作用比以前认识到的更重要。例如,我们发现在纯粹诱导幻觉的设置下进行RL后训练仍然可以显著提高模型的推理性能,在某些情况下甚至优于标准训练。这些发现挑战了关于MLLM推理训练的普遍假设,并推动了更具模态感知的基于RL的训练设计的发展。
摘要:The recent success of reinforcement learning (RL) in large reasoning models has inspired the growing adoption of RL for post-training Multimodal Large Language Models (MLLMs) to enhance their visual reasoning capabilities. Although many studies have reported improved performance, it remains unclear whether RL training truly enables models to learn from visual information. In this work, we propose the Hallucination-as-Cue Framework, an analytical framework designed to investigate the effects of RL-based post-training on multimodal reasoning models from the perspective of model hallucination. Specifically, we introduce hallucination-inductive, modality-specific corruptions that remove or replace essential information required to derive correct answers, thereby forcing the model to reason by hallucination. By applying these corruptions during both training and evaluation, our framework provides a unique perspective for diagnosing RL training dynamics and understanding the intrinsic properties of datasets. Through extensive experiments and analyses across multiple multimodal reasoning benchmarks, we reveal that the role of model hallucination for RL-training is more significant than previously recognized. For instance, we find that RL post-training under purely hallucination-inductive settings can still significantly improve models' reasoning performance, and in some cases even outperform standard training. These findings challenge prevailing assumptions about MLLM reasoning training and motivate the development of more modality-aware RL-based training designs.
【3】Explainable Machine Learning Reveals 12-Fold Ucp1 Upregulation and Thermogenic Reprogramming in Female Mouse White Adipose Tissue After 37 Days of Microgravity: First AI/ML Analysis of NASA OSD-970
标题:可解释的机器学习揭示微重力37天后雌性小鼠白色脂肪组织Ucp1上调12倍和产热重编程:NASA OSD-970的首次AI/ML分析
链接:https://arxiv.org/abs/2604.02942
作者:Md. Rashadul Islam
备注:11 pages, 9 figures, 5 tables. First AI/ML analysis of NASA OSD-970 (GLDS-790). Code available at https://github.com/Rashadul22/NASA_OSD970_Complete_Output
摘要:微重力在哺乳动物生理学中诱导了深刻的代谢适应,但支配雌性白色脂肪组织(WAT)产热的分子机制仍然缺乏充分表征。本文介绍了NASA开放科学数据库(OSDR)数据集OSD-970的首个机器学习(ML)分析,该数据集来自啮齿动物研究-1(RR-1)任务。使用来自16只雌性C57BL/6J小鼠(8只飞行,8只地面对照)在国际空间站(ISS)上37天后性腺WAT中89个脂肪生成和产热途径基因的RT-qPCR数据,我们应用了差异表达分析、带留一交叉验证(LOO-CV)的多个ML分类器,以及通过SHapley加法解释(SHAP)的可解释AI。最引人注目的发现是在微重力暴露的WAT中Ucp1显著上调12.21倍(Delta-Delta-Ct = -3.61,p = 0.0167),伴随着产热途径的显著激活(平均途径倍数变化 = 3.24)。通过LOO-CV,表现最好的模型(具有前20个特征的随机森林)达到AUC = 0.922,准确度 = 0.812,F1 = 0.824。SHAP分析始终将Ucp1列为最佳预测特征之一,而Angpt2、Irs2、Jun和Klf家族转录因子成为主要的共识分类特征。主成分分析(PCA)显示飞行和地面样本之间有明显的分离,PC1解释了69.1%的方差。这些结果表明,雌性WAT的快速产热重编程是对微重力的一种补偿性反应。这项研究展示了可解释人工智能在重新分析新发布的NASA空间生物学数据集方面的力量,对长期任务中的女宇航员健康以及地球上的肥胖和代谢疾病研究具有直接影响。
摘要:Microgravity induces profound metabolic adaptations in mammalian physiology, yet the molecular mechanisms governing thermogenesis in female white adipose tissue (WAT) remain poorly characterized. This paper presents the first machine learning (ML) analysis of NASA Open Science Data Repository (OSDR) dataset OSD-970, derived from the Rodent Research-1 (RR-1) mission. Using RT-qPCR data from 89 adipogenesis and thermogenesis pathway genes in gonadal WAT of 16 female C57BL/6J mice (8 flight, 8 ground control) following 37 days aboard the International Space Station (ISS), we applied differential expression analysis, multiple ML classifiers with Leave-One-Out Cross-Validation (LOO-CV), and Explainable AI via SHapley Additive exPlanations (SHAP). The most striking finding is a dramatic 12.21-fold upregulation of Ucp1 (Delta-Delta-Ct = -3.61, p = 0.0167) in microgravity-exposed WAT, accompanied by significant activation of the thermogenesis pathway (mean pathway fold-change = 3.24). The best-performing model (Random Forest with top-20 features) achieved AUC = 0.922, Accuracy = 0.812, and F1 = 0.824 via LOO-CV. SHAP analysis consistently ranked Ucp1 among the top predictive features, while Angpt2, Irs2, Jun, and Klf-family transcription factors emerged as dominant consensus classifiers. Principal component analysis (PCA) revealed clear separation between flight and ground samples, with PC1 explaining 69.1% of variance. These results suggest rapid thermogenic reprogramming in female WAT as a compensatory response to microgravity. This study demonstrates the power of explainable AI for re-analysis of newly released NASA space biology datasets, with direct implications for female astronaut health on long-duration missions and for Earth-based obesity and metabolic disease research.
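A minimal sketch of the evaluation protocol this abstract describes (Leave-One-Out cross-validation with a Random Forest scored by AUC and F1), using random stand-in data in place of the actual 16-mouse, 89-gene Delta-Ct matrix; scikit-learn only, with the SHAP step omitted.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import LeaveOneOut
from sklearn.metrics import roc_auc_score, f1_score

# Hypothetical stand-in for the OSD-970 expression matrix (16 mice x 89 genes).
rng = np.random.default_rng(42)
X = rng.normal(size=(16, 89))
y = np.array([0] * 8 + [1] * 8)      # 8 ground controls, 8 flight mice

# Leave-One-Out CV: with n=16, each fold holds out a single mouse.
scores = np.zeros(len(y))
for train_idx, test_idx in LeaveOneOut().split(X):
    clf = RandomForestClassifier(n_estimators=200, random_state=0)
    clf.fit(X[train_idx], y[train_idx])
    scores[test_idx] = clf.predict_proba(X[test_idx])[:, 1]

print("AUC:", roc_auc_score(y, scores))
print("F1 :", f1_score(y, (scores > 0.5).astype(int)))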
【4】Understanding Latent Diffusability via Fisher Geometry
标题:通过Fisher几何理解潜在扩散性
链接:https://arxiv.org/abs/2604.02751
作者:Jing Gu,Morteza Mardani,Wonjun Lee,Dongmian Zou,Gilad Lerman
摘要:当在潜在空间中训练时,扩散模型通常会降级(例如,VAE),但正式的原因仍然知之甚少。我们通过沿扩散轨迹的最小均方误差(MMSE)的变化率来量化潜在空间的扩散性。我们的框架将MMSE速率分解为Fisher信息(FI)和Fisher信息速率(FIR)的贡献。我们证明,虽然全球等距确保FI对齐,FIR是由编码器的局部几何属性。我们的分析明确地将潜在的几何失真分成三个可测量的惩罚:维度压缩,切向失真和曲率注入。我们推导出FIR保存跨空间的理论条件,确保保持扩散性。不同的自动编码架构的实验验证了我们的框架,并建立了这些有效的FI和FIR指标作为一个强大的诊断套件,用于识别和减轻潜在的扩散故障。
摘要:Diffusion models often degrade when trained in latent spaces (e.g., VAEs), yet the formal causes remain poorly understood. We quantify latent-space diffusability through the rate of change of the Minimum Mean Squared Error (MMSE) along the diffusion trajectory. Our framework decomposes this MMSE rate into contributions from Fisher Information (FI) and Fisher Information Rate (FIR). We demonstrate that while global isometry ensures FI alignment, FIR is governed by the encoder's local geometric properties. Our analysis explicitly decouples latent geometric distortion into three measurable penalties: dimensional compression, tangential distortion, and curvature injection. We derive theoretical conditions for FIR preservation across spaces, ensuring maintained diffusability. Experiments across diverse autoencoding architectures validate our framework and establish these efficient FI and FIR metrics as a robust diagnostic suite for identifying and mitigating latent diffusion failure.
【5】Contrastive Language-Colored Pointmap Pretraining for Unified 3D Scene Understanding
标题:用于统一3D场景理解的对比语言-彩色点图预训练
链接:https://arxiv.org/abs/2604.02546
作者:Ye Mao,Weixun Luo,Ranran Huang,Junpeng Jing,Krystian Mikolajczyk
备注:24 pages
摘要:通过与对比语言图像预训练(CLIP)对齐来预训练3D编码器,已经成为学习可泛化的3D场景理解表示的一个有希望的方向。在本文中,我们提出了UniScene3D,一个基于Transformer的编码器,它从多视图彩色点图中学习统一的场景表示,联合建模图像外观和几何。为了实现鲁棒的彩色点图表示学习,我们引入了新的跨视图几何对齐和接地视图对齐,以加强跨视图的几何和语义一致性。在视点定位、场景检索、场景类型分类和3D VQA上的大量少样本和特定任务微调评估证明了我们最先进的性能。这些结果突出了我们的方法在统一3D场景理解方面的有效性。https://yebulabula.github.io/UniScene3D/
摘要:Pretraining 3D encoders by aligning with Contrastive Language Image Pretraining (CLIP) has emerged as a promising direction to learn generalizable representations for 3D scene understanding. In this paper, we propose UniScene3D, a transformer-based encoder that learns unified scene representations from multi-view colored pointmaps, jointly modeling image appearance and geometry. For robust colored pointmap representation learning, we introduce novel cross-view geometric alignment and grounded view alignment to enforce cross-view geometry and semantic consistency. Extensive low-shot and task-specific fine-tuning evaluations on viewpoint grounding, scene retrieval, scene type classification, and 3D VQA demonstrate our state-of-the-art performance. These results highlight the effectiveness of our approach for unified 3D scene understanding. https://yebulabula.github.io/UniScene3D/
【6】Re-analysis of the Human Transcription Factor Atlas Recovers TF-Specific Signatures from Pooled Single-Cell Screens with Missing Controls
标题:人类转录因子图谱的重新分析从缺失对照的合并单细胞筛选中恢复TF特异性特征
链接:https://arxiv.org/abs/2604.02511
作者:Arka Jain,Umesh Sharma
摘要:公共汇集的单细胞扰动图谱是研究转录因子(TF)功能的宝贵资源,但下游再分析可能受到不完整的存储元数据和缺少内部对照的限制。在这里,我们重新分析了人类TF Atlas数据集(GSE216481),这是一个基于MORF的合并过表达筛选,涵盖3,550个TF开放阅读框和254,519个细胞,具有可重复的质量控制、MORF条形码解复用、每TF差异表达和功能富集的管道。从合并筛选中的77,018个细胞中,我们将60,997个(79.2%)分配给87个TF身份。因为所沉积的条形码映射缺少原始文库中存在的GFP和mCherry阴性对照,所以我们使用胚状体(EB)细胞作为外部基线,并通过背景扣除去除共享的批次/转导伪影。该策略恢复了61个可测试TF中59个的TF特异性特征,而单独使用一对其余(one-vs-rest)方法仅检测到27个,表明尽管缺少池内对照,仍可以挽救稳健的TF水平信号。HOPX、MAZ、PAX6、FOS和FEZF2成为最强的转录重塑因子,而每TF富集将FEZF2与分化调节、EGR1与Hippo和心脏程序、FOS与粘着斑、NFIC与胶原蛋白生物合成联系起来。条件水平分析揭示了Wnt、神经源性、EMT和Hippo特征的收敛性,Harmony表明合并重复中的混杂批次效应最小。我们的每TF效应量与Joung et al.的公布排名显著一致(斯皮尔曼$ρ = -0.316$,$p = 0.013$;为负是因为较低的排名表明更强的效果)。总之,这些结果表明,当与有原则的外部对照、伪影去除和可再现计算配合使用时,所存放的TF图谱数据可以支持经验证的TF特异性转录和途径分析。
摘要:Public pooled single-cell perturbation atlases are valuable resources for studying transcription factor (TF) function, but downstream re-analysis can be limited by incomplete deposited metadata and missing internal controls. Here we re-analyze the human TF Atlas dataset (GSE216481), a MORF-based pooled overexpression screen spanning 3,550 TF open reading frames and 254,519 cells, with a reproducible pipeline for quality control, MORF barcode demultiplexing, per-TF differential expression, and functional enrichment. From 77,018 cells in the pooled screen, we assign 60,997 (79.2\%) to 87 TF identities. Because the deposited barcode mapping lacks the GFP and mCherry negative controls present in the original library, we use embryoid body (EB) cells as an external baseline and remove shared batch/transduction artifacts by background subtraction. This strategy recovers TF-specific signatures for 59 of 61 testable TFs, compared with 27 detected by one-vs-rest alone, showing that robust TF-level signal can be rescued despite missing intra-pool controls. HOPX, MAZ, PAX6, FOS, and FEZF2 emerge as the strongest transcriptional remodelers, while per-TF enrichment links FEZF2 to regulation of differentiation, EGR1 to Hippo and cardiac programs, FOS to focal adhesion, and NFIC to collagen biosynthesis. Condition-level analyses reveal convergent Wnt, neurogenic, EMT, and Hippo signatures, and Harmony indicates minimal confounding batch effects across pooled replicates. Our per-TF effect sizes significantly agree with Joung et al.'s published rankings (Spearman $ρ= -0.316$, $p = 0.013$; negative because lower rank indicates stronger effect). Together, these results show that the deposited TF Atlas data can support validated TF-specific transcriptional and pathway analyses when paired with principled external controls, artifact removal, and reproducible computation.
【7】TRACE: Traceroute-based Internet Route change Analysis with Ensemble Learning
标题:TRACE:使用集成学习的基于Traceroute的互联网路由变化分析
链接:https://arxiv.org/abs/2604.02361
作者:Raul Suzuki,Rodrigo Moreira,Pedro Henrique A. Damaso de Melo,Larissa F. Rodrigues Moreira,Flávio de Oliveira Silva
备注:Paper accepted for publication in Simpósio Brasileiro de Redes de Computadores e Sistemas Distribuídos (SBRC) 2026
摘要:检测互联网路由不稳定性是一项关键但具有挑战性的任务,特别是在仅依赖于端点主动测量时。本研究介绍了TRACE,这是一种机器学习(ML)管道,旨在仅使用traceroute延迟数据来识别路由更改,从而确保与控制平面信息无关。我们提出了一个强大的功能工程策略,捕捉时间动态使用滚动统计和聚合上下文模式。该架构利用了由超参数优化的元学习器优化的梯度提升决策树的堆叠集合。通过严格校准决策阈值来解决罕见路由事件的固有类别不平衡,TRACE实现了卓越的F1得分性能,显著优于传统的基线模型,并在检测互联网上的路由变化方面表现出强大的有效性。
摘要:Detecting Internet routing instability is a critical yet challenging task, particularly when relying solely on endpoint active measurements. This study introduces TRACE, a Machine Learning (ML) pipeline designed to identify route changes using only traceroute latency data, thereby ensuring independence from control plane information. We propose a robust feature engineering strategy that captures temporal dynamics using rolling statistics and aggregated context patterns. The architecture leverages a stacked ensemble of Gradient Boosted Decision Trees refined by a hyperparameter-optimized meta-learner. By strictly calibrating decision thresholds to address the inherent class imbalance of rare routing events, TRACE achieves a superior F1-score performance, significantly outperforming traditional baseline models and demonstrating strong effectiveness in detecting routing changes on the Internet.
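A hedged sketch of the pipeline's main ingredients as described above: rolling-statistics features computed from latency alone, a gradient-boosted classifier, and F1-based decision-threshold calibration. The data, window sizes, and single-model setup are toy simplifications; the paper uses a stacked ensemble with a meta-learner, and in practice the threshold would be calibrated on a separate validation split.

import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import f1_score

# Hypothetical traceroute RTT series with level shifts (route changes)
# at t=300 (in the training span) and t=850 (in the test span).
rng = np.random.default_rng(1)
rtt = np.concatenate([rng.normal(40, 2, 300), rng.normal(55, 2, 550),
                      rng.normal(42, 2, 150)])
y = np.zeros(1000, dtype=int)
y[300:310] = 1
y[850:860] = 1                       # label a short window after each shift

# Rolling statistics capture temporal dynamics from latency alone.
s = pd.Series(rtt)
X = pd.DataFrame({
    "rtt": s,
    "roll_mean": s.rolling(30, min_periods=1).mean(),
    "roll_std": s.rolling(30, min_periods=1).std().fillna(0.0),
    "diff_mean": s - s.rolling(30, min_periods=1).mean(),
}).values

clf = GradientBoostingClassifier().fit(X[:800], y[:800])
proba = clf.predict_proba(X[800:])[:, 1]

# Calibrate the decision threshold for the rare positive class
# (shown on the held-out slice only for brevity).
thresholds = np.linspace(0.05, 0.95, 19)
best = max(thresholds,
           key=lambda t: f1_score(y[800:], proba > t, zero_division=0))
print("calibrated threshold:", best)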
【8】State estimations and noise identifications with intermittent corrupted observations via Bayesian variational inference
标题:通过Bayesian变分推理进行间歇性损坏观测的状态估计和噪音识别
链接:https://arxiv.org/abs/2604.02738
作者:Peng Sun,Ruoyu Wang,Xue Luo
备注:8 pages, 6 figures
摘要:研究了分布式传感器网络中间歇性丢包、观测值受损和噪声协方差未知共存的状态估计问题。为了解决这一挑战,我们将系统状态、噪声参数和网络可靠性的联合估计表述为贝叶斯变分推理问题,并提出了一种新的变分贝叶斯自适应卡尔曼滤波器(VB-AKF)来近似潜在参数的联合后验概率密度。与单独处理缺失数据和测量异常值的现有AKF不同,所提出的VB-AKF采用了具有两个独立伯努利随机变量的双掩码生成模型,明确地表征了可观察的通信损失和潜在的数据真实性。此外,VB-AKF将多个并发观测集成到自适应滤波框架中,这显著增强了统计可识别性。数值实验验证了该方法的有效性和渐近最优性,表明随着传感器数目的增加,参数辨识和状态估计均渐近收敛于理论最优下界。
摘要:This paper focuses on the state estimation problem in distributed sensor networks, where intermittent packet dropouts, corrupted observations, and unknown noise covariances coexist. To tackle this challenge, we formulate the joint estimation of system states, noise parameters, and network reliability as a Bayesian variational inference problem, and propose a novel variational Bayesian adaptive Kalman filter (VB-AKF) to approximate the joint posterior probability densities of the latent parameters. Unlike existing AKF that separately handle missing data and measurement outliers, the proposed VB-AKF adopts a dual-mask generative model with two independent Bernoulli random variables, explicitly characterizing both observable communication losses and latent data authenticity. Additionally, the VB-AKF integrates multiple concurrent multiple observations into the adaptive filtering framework, which significantly enhances statistical identifiability. Comprehensive numerical experiments verify the effectiveness and asymptotic optimality of the proposed method, showing that both parameter identification and state estimation asymptotically converge to the theoretical optimal lower bound with the increase in the number of sensors.
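The following NumPy sketch illustrates only the simplest piece of this setting: a scalar Kalman filter under Bernoulli packet dropouts. The paper's VB-AKF additionally infers noise covariances and a second latent authenticity mask via variational inference, neither of which is modeled here; all values are toy assumptions.

import numpy as np

rng = np.random.default_rng(0)
A, H = 0.95, 1.0          # scalar dynamics and observation model
Q, R = 0.1, 0.5           # process / measurement noise variances
p_receive = 0.7           # Bernoulli probability that a packet arrives

x_true, x_hat, P = 0.0, 0.0, 1.0
for t in range(200):
    # Simulate the system and an intermittently received measurement.
    x_true = A * x_true + rng.normal(0, np.sqrt(Q))
    received = rng.random() < p_receive          # observable comm. loss
    y = H * x_true + rng.normal(0, np.sqrt(R)) if received else None

    # Predict step (always runs).
    x_hat = A * x_hat
    P = A * P * A + Q

    # Update step only when a packet arrives; a VB treatment would also
    # weight the update by the inferred authenticity of the measurement.
    if y is not None:
        S = H * P * H + R
        K = P * H / S
        x_hat = x_hat + K * (y - H * x_hat)
        P = (1 - K * H) * P

print("final estimate %.3f, truth %.3f" % (x_hat, x_true))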
【9】Neural posterior estimation for scalable and accurate inverse parameter inference in Li-ion batteries
标题:神经后验估计用于锂离子电池中可扩展且准确的逆参数推断
链接:https://arxiv.org/abs/2604.02520
作者:Malik Hassanaly,Corey R. Randall,Peter J. Weddle,Paul J. Gasper,Conlain Kelly,Tanvir R. Tanim,Kandler Smith
摘要:诊断锂离子电池的内部状态对于电池研究、实际系统的运行以及剩余寿命的预测评估至关重要。通过使用基于物理的模型来执行概率参数估计,通过贝叶斯校准,诊断可以解释由于模型适应性,数据噪声和任何给定参数的可观测性而导致的不确定性。然而,使用电化学数据在锂离子电池中进行贝叶斯校准是计算密集型的,即使在使用快速替代物代替基于物理的模型时也是如此,需要数千次模型评估。一个完全摊销的替代方案是神经后验估计(NPE)。NPE将计算负担从参数估计步骤转移到数据生成和模型训练,将参数估计时间从几分钟减少到几毫秒,从而实现实时应用。目前的工作表明,NPE校准参数相等或更准确地比贝叶斯校准,我们表明,更高的计算成本的数据生成是易于处理的,即使在高维的情况下(范围从6到27估计参数),但NPE方法可能会导致更高的电压预测误差。与贝叶斯校准相比,NPE方法还提供了几个可解释性优势,例如局部参数对电压曲线特定区域的敏感性。NPE方法使用实验快速充电数据集进行了演示,参数估计值针对锂库存损失和活性材料损失的测量进行了验证。该实现在配套存储库(https://github.com/NatLabRockies/BatFIT)中提供。
摘要:Diagnosing the internal state of Li-ion batteries is critical for battery research, operation of real-world systems, and prognostic evaluation of remaining lifetime. By using physics-based models to perform probabilistic parameter estimation via Bayesian calibration, diagnostics can account for the uncertainty due to model fitness, data noise, and the observability of any given parameter. However, Bayesian calibration in Li-ion batteries using electrochemical data is computationally intensive even when using a fast surrogate in place of physics-based models, requiring many thousands of model evaluations. A fully amortized alternative is neural posterior estimation (NPE). NPE shifts the computational burden from the parameter estimation step to data generation and model training, reducing the parameter estimation time from minutes to milliseconds, enabling real-time applications. The present work shows that NPE calibrates parameters equally or more accurately than Bayesian calibration, and we demonstrate that the higher computational costs for data generation are tractable even in high-dimensional cases (ranging from 6 to 27 estimated parameters), but the NPE method can lead to higher voltage prediction errors. The NPE method also offers several interpretability advantages over Bayesian calibration, such as local parameter sensitivity to specific regions of the voltage curve. The NPE method is demonstrated using an experimental fast charge dataset, with parameter estimates validated against measurements of loss of lithium inventory and loss of active material. The implementation is made available in a companion repository (https://github.com/NatLabRockies/BatFIT).
检测相关(2篇)
【1】Matrix Profile for Time-Series Anomaly Detection: A Reproducible Open-Source Benchmark on TSB-AD
标题:用于时间序列异常检测的Matrix Profile:TSB-AD上的可复现开源基准
链接:https://arxiv.org/abs/2604.02445
作者:Chin-Chia Michael Yeh
摘要:Matrix Profile(MP)方法是一类可解释且可扩展的基于距离的时间序列异常检测方法,但强大的基准性能仍然取决于超越朴素最近邻轮廓的设计选择。本技术报告记录了提交给TSB-AD(一个涵盖单变量和多变量时间序列的基准)的用于异常检测的开源Matrix Profile(MMPAD)系统。提交的系统结合了预排序的多维聚合、针对重复异常的高效排除区感知k最近邻(kNN)检索,以及移动平均后处理。为了作为TSB-AD上基于MP的异常检测的可复现参考,我们详细介绍了已发布的实现、单变量和多变量赛道的超参数设置以及相应的基准测试结果。我们进一步分析了该系统在聚合排行榜和特定数据集特征上的表现。开源实现可在https://github.com/mcyeh/mmpad_tsb上获得。
摘要:Matrix Profile (MP) methods are an interpretable and scalable family of distance-based methods for time-series anomaly detection, but strong benchmark performance still depends on design choices beyond a vanilla nearest-neighbor profile. This technical report documents an open-source Matrix Profile for Anomaly Detection (MMPAD) submission to TSB-AD, a benchmark that covers both univariate and multivariate time series. The submitted system combines pre-sorted multidimensional aggregation, efficient exclusion-zone-aware k-nearest-neighbor (kNN) retrieval for repeated anomalies, and moving-average post-processing. To serve as a reproducible reference for MP-based anomaly detection on TSB-AD, we detail the released implementation, the hyperparameter settings for the univariate and multivariate tracks, and the corresponding benchmark results. We further analyze how the system performs on the aggregate leaderboard and across specific dataset characteristics. The open-source implementation is available at https://github.com/mcyeh/mmpad_tsb.
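For orientation, here is a naive O(n^2) NumPy matrix profile with an exclusion zone, a kNN distance (k>1 targets repeated anomalies), and moving-average post-processing, mirroring the three design choices listed above; the released MMPAD implementation is far more efficient and this toy version is not it.

import numpy as np

def matrix_profile(ts, m, k=1):
    """Naive O(n^2) matrix profile. For each window of length m, stores the
    z-normalized distance to its k-th nearest non-trivial neighbor; large
    values flag anomalous subsequences."""
    n = len(ts) - m + 1
    windows = np.array([ts[i:i + m] for i in range(n)])
    windows = (windows - windows.mean(axis=1, keepdims=True)) \
              / (windows.std(axis=1, keepdims=True) + 1e-12)
    excl = m // 2
    profile = np.empty(n)
    for i in range(n):
        d = np.linalg.norm(windows - windows[i], axis=1)
        d[max(0, i - excl):i + excl + 1] = np.inf   # exclusion zone
        profile[i] = np.sort(d)[k - 1]              # kNN distance (k>1 for repeats)
    return profile

# Sine wave with an injected anomaly around index 500.
t = np.linspace(0, 50, 1000)
ts = np.sin(t)
ts[500:510] += 2.0
mp = matrix_profile(ts, m=40, k=1)
smoothed = np.convolve(mp, np.ones(10) / 10, mode="same")  # moving-average post-processing
print("anomaly near index:", smoothed.argmax())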
【2】Financial Anomaly Detection for the Canadian Market
标题:加拿大市场的财务异常检测
链接:https://arxiv.org/abs/2604.02549
作者:Luigi Caputi,Nicholas Meadows
摘要:在这项工作中,我们评估了三类方法检测金融异常的性能:拓扑数据分析(TDA),主成分分析(PCA),和基于神经网络的方法。我们将这些方法应用于TSX-60数据,以确定加拿大股市的主要金融压力事件。我们展示了基于神经网络的方法(如GlocalKD和One-Shot GIN(E))和TDA方法如何实现最强的性能。TDA在发现金融异常方面的有效性表明,全局拓扑性质在区分金融压力事件方面具有重要意义。
摘要:In this work we evaluate the performance of three classes of methods for detecting financial anomalies: topological data analysis (TDA), principal component analysis (PCA), and Neural Network-based approaches. We apply these methods to the TSX-60 data to identify major financial stress events in the Canadian stock market. We show how neural network-based methods (such as GlocalKD and One-Shot GIN(E)) and TDA methods achieve the strongest performance. The effectiveness of TDA in detecting financial anomalies suggests that global topological properties are meaningful in distinguishing financial stress events.
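A minimal sketch of the PCA branch on synthetic returns: score each day by its reconstruction error under a low-rank model fit on a calm window, so stress days that break the usual factor structure stand out. The TDA and neural detectors the paper favors are not shown, and all data here are hypothetical.

import numpy as np
from sklearn.decomposition import PCA

# Hypothetical daily returns for 60 tickers; a stress event at t=400
# injects a large common shock across all assets.
rng = np.random.default_rng(7)
returns = rng.normal(0, 0.01, size=(500, 60))
returns[400] += 0.05

# Fit a low-rank PCA model on a calm window, then score every day by
# its reconstruction error under that model.
pca = PCA(n_components=5).fit(returns[:300])
recon = pca.inverse_transform(pca.transform(returns))
scores = np.linalg.norm(returns - recon, axis=1)
print("most anomalous day:", scores.argmax())   # expect 400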
分类|识别(1篇)
【1】Self-Directed Task Identification
标题:自主任务识别
链接:https://arxiv.org/abs/2604.02430
作者:Timothy Gould,Sidike Paheding
备注:9 pages, 3 figures, 3 tables, 17 equations
摘要:在这项工作中,我们提出了一种新的机器学习框架,称为自导向任务识别(SDTI),它使模型能够在没有预先训练的情况下,在zero-shot设置中自主识别每个数据集的正确目标变量。SDTI是一个最小的、可解释的框架,展示了将核心机器学习概念重新用于新任务结构的可行性。据我们所知,没有现有的架构已经证明了这种能力。传统方法缺乏这种能力,使得数据注释成为一个非常耗时的过程,严重依赖于人工。只使用标准的神经网络组件,我们表明,SDTI可以通过适当的问题制定和架构设计。我们在一系列基准任务上评估了所提出的框架,并证明了其在可靠地识别一组潜在目标变量的地面真相方面的有效性。SDTI在综合任务识别基准测试中的F1得分比基线架构高出14%。这些概念验证实验突出了SDTI未来的潜力,以减少对手动注释的依赖,并提高自主学习系统在现实世界中的应用的可扩展性。
摘要:In this work, we present a novel machine learning framework called Self-Directed Task Identification (SDTI), which enables models to autonomously identify the correct target variable for each dataset in a zero-shot setting without pre-training. SDTI is a minimal, interpretable framework demonstrating the feasibility of repurposing core machine learning concepts for a novel task structure. To our knowledge, no existing architectures have demonstrated this ability. Traditional approaches lack this capability, leaving data annotation as a time-consuming process that relies heavily on human effort. Using only standard neural network components, we show that SDTI can be achieved through appropriate problem formulation and architectural design. We evaluate the proposed framework on a range of benchmark tasks and demonstrate its effectiveness in reliably identifying the ground truth out of a set of potential target variables. SDTI outperformed baseline architectures by 14% in F1 score on synthetic task identification benchmarks. These proof-of-concept experiments highlight the future potential of SDTI to reduce dependence on manual annotation and to enhance the scalability of autonomous learning systems in real-world applications.
表征(2篇)
【1】On Data-Driven Koopman Representations of Nonlinear Delay Differential Equations
标题:非线性延迟微分方程的数据驱动Koopman表示
链接:https://arxiv.org/abs/2604.03086
作者:Santosh Mohan Rajkumar,Dibyasri Barman,Kumar Vikram Singh,Debdipta Goswami
备注:Github: https://github.com/santoshrajkumar/koopman-dde-kEDMD
摘要:这项工作在无限维延迟动力学和有限维Koopman学习之间建立了一个严格的桥梁,具有明确和可解释的误差保证。虽然Koopman分析在常微分方程(ODE)中已得到充分发展,在偏微分方程(PDE)中也有部分发展,但由于延迟微分方程(DDE)的无限维相空间,其向DDE的扩展仍然受到限制。我们提出了一个基于历史离散化和合适重建算子的有限维Koopman近似框架,使得可以通过基于核的扩展动态模式分解(kEDMD)来易于处理地表示Koopman算子。我们为所学习的预测器推导出确定性误差界,将总误差分解为历史离散化、核插值和数据驱动回归的贡献。此外,我们开发了一种基于核的重建方法,从提升的Koopman坐标恢复离散化状态,并具有可证明的保证。数值结果表明,所学习的预测器随离散化分辨率和训练数据量收敛,支持对延迟系统的可靠预测和控制。
摘要:This work establishes a rigorous bridge between infinite-dimensional delay dynamics and finite-dimensional Koopman learning, with explicit and interpretable error guarantees. While Koopman analysis is well-developed for ordinary differential equations (ODEs) and partially for partial differential equations (PDEs), its extension to delay differential equations (DDEs) remains limited due to the infinite-dimensional phase space of DDEs. We propose a finite-dimensional Koopman approximation framework based on history discretization and a suitable reconstruction operator, enabling a tractable representation of the Koopman operator via kernel-based extended dynamic mode decomposition (kEDMD). Deterministic error bounds are derived for the learned predictor, decomposing the total error into contributions from history discretization, kernel interpolation, and data-driven regression. Additionally, we develop a kernel-based reconstruction method to recover discretized states from lifted Koopman coordinates, with provable guarantees. Numerical results demonstrate convergence of the learned predictor with respect to both discretization resolution and training data, supporting reliable prediction and control of delay systems.
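A toy sketch of the history-discretization idea: stack a sampled history segment into a finite-dimensional state and fit a one-step Koopman matrix by least squares. Identity observables (plain DMD on history vectors) stand in for the paper's kernel features, and the linear test DDE is an assumption chosen so the demo is exact up to conditioning.

import numpy as np

# Simulate the linear DDE  x'(t) = -x(t) + 0.5 * x(t - 1)  with explicit Euler.
dt = 0.02
d = int(1.0 / dt)                       # number of samples spanning the delay
rng = np.random.default_rng(3)
x = list(rng.normal(size=d + 1))        # random initial history
for _ in range(1500):
    x.append(x[-1] + dt * (-x[-1] + 0.5 * x[-1 - d]))
x = np.asarray(x)

# History discretization: the finite-dimensional "state" is a sampled
# history segment of length d+1 (oldest entry first).
n_train = 1200
S = np.array([x[i:i + d + 1] for i in range(n_train)])
Y = np.array([x[i + 1:i + d + 2] for i in range(n_train)])

# EDMD with identity observables (plain DMD on history vectors); the paper
# lifts S through a kernel feature map (kEDMD) instead.
K = np.linalg.lstsq(S, Y, rcond=None)[0]

# Autoregressive rollout from the history segment just past the training rows.
s = x[n_train:n_train + d + 1].copy()
preds = []
for _ in range(200):
    s = s @ K
    preds.append(s[-1])                 # newest point of the predicted history
truth = x[n_train + d + 1:n_train + d + 1 + 200]
print("max rollout error:", np.abs(np.asarray(preds) - truth).max())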
【2】VALOR: Value-Aware Revenue Uplift Modeling with Treatment-Gated Representation for B2B Sales
标题:VALOR:采用治疗门控表示的B2B销售价值感知收入提升(Uplift)建模
链接:https://arxiv.org/abs/2604.02472
作者:Vamshi Guduguntla,Kavin Soni,Debanshu Das
摘要:B2B销售组织必须在零膨胀的收入分配中确定"可说服"的帐户,以优化昂贵的人力资源分配。标准的提升框架与高维空间中的治疗信号崩溃以及回归校准和高价值"鲸鱼"排名之间的不一致作斗争。我们介绍了VALOR(优化(B2B)收入的价值感知学习),这是一个统一的框架,具有一个治疗门控稀疏收入网络,使用双线性相互作用来防止因果信号崩溃。该框架通过一个新的成本敏感的焦点ZILN目标进行优化,该目标将用于分布鲁棒性的焦点机制与根据财务规模调整罚项的价值加权排名损失相结合。为了提供高接触销售计划的可解释性,我们进一步推导出鲁棒ZILN-GBDT,一个基于树的变体,利用自定义的分裂标准来处理提升异质性。广泛的评估证实了VALOR的主导地位,在公共基准测试中,与最先进的方法相比,排名能力提高了20%,并在严格的4个月生产A/B测试中,每个账户的增量收入增加了2.7倍。
摘要:B2B sales organizations must identify "persuadable" accounts within zero-inflated revenue distributions to optimize expensive human resource allocation. Standard uplift frameworks struggle with treatment signal collapse in high-dimensional spaces and a misalignment between regression calibration and the ranking of high-value "whales." We introduce VALOR (Value Aware Learning of Optimized (B2B) Revenue), a unified framework featuring a Treatment-Gated Sparse-Revenue Network that uses bilinear interaction to prevent causal signal collapse. The framework is optimized via a novel Cost-Sensitive Focal-ZILN objective that combines a focal mechanism for distributional robustness with a value-weighted ranking loss that scales penalties based on financial magnitude. To provide interpretability for high-touch sales programs, we further derive Robust ZILN-GBDT, a tree based variant utilizing a custom splitting criterion for uplift heterogeneity. Extensive evaluations confirm VALOR's dominance, achieving a 20% improvement in rankability over state-of-the-art methods on public benchmarks and delivering a validated 2.7x increase in incremental revenue per account in a rigorous 4-month production A/B test.
3D|3D重建等相关(1篇)
【1】Convolutional Surrogate for 3D Discrete Fracture-Matrix Tensor Upscaling
标题:用于3D离散裂缝-基质张量升尺度的卷积代理模型
链接:https://arxiv.org/abs/2604.02335
作者:Martin Špetlík,Jan Březina
备注:28 pages, 9 figures, published, https://github.com/martinspetlik/MLMC-DFM/tree/MS_3d
摘要:三维裂隙结晶介质中地下水流的模拟需要考虑裂隙引起的强烈空间异质性。精细尺度离散裂缝-基质(DFM)模拟可以捕捉这种复杂性,但计算成本高,特别是当需要重复评估时。为了解决这个问题,我们的目标是采用多级蒙特卡罗(MLMC)框架,其中数值均匀化用于在准确度水平之间过渡时对亚分辨率裂缝的影响进行升尺度。 为了降低传统三维数值均匀化的成本,我们开发了一个代理模型,从一个表示基质和裂缝传导率的张量值随机场的体素化三维域中预测等效水力传导率张量Keq。裂缝的大小、方向和孔径是从自然观测得到的分布中采样的。 代理架构将3D卷积神经网络与前馈层相结合,使其能够捕获局部空间特征和全局交互。三个代理模型在DFM模拟生成的数据上进行训练,每个对应于一个不同的裂缝与基质的电导率对比。性能在广泛的裂缝网络参数和基质场相关长度范围内进行评估。 经过训练的模型实现了高精度,在大多数测试用例中,归一化均方根误差低于0.22。通过在两个宏观尺度问题(计算等效电导率张量和预测约束三维域的流出量)中比较数值均匀化的电导率与代理预测,证明了其实用性。在这两种情况下,基于代理的升尺度保留了准确性,同时大大降低了计算成本,在GPU上执行推理时实现了超过100倍的加速。
摘要:Modeling groundwater flow in three-dimensional fractured crystalline media requires accounting for strong spatial heterogeneity induced by fractures. Fine-scale discrete fracture-matrix (DFM) simulations can capture this complexity but are computationally expensive, especially when repeated evaluations are needed. To address this, we aim to employ a multilevel Monte Carlo (MLMC) framework in which numerical homogenization is used to upscale sub-resolution fracture effects when transitioning between accuracy levels. To reduce the cost of conventional 3D numerical homogenization, we develop a surrogate model that predicts the equivalent hydraulic conductivity tensor Keq from a voxelized 3D domain representing tensor-valued random fields of matrix and fracture conductivities. Fracture size, orientation, and aperture are sampled from distributions informed by natural observations. The surrogate architecture combines a 3D convolutional neural network with feed-forward layers, enabling it to capture both local spatial features and global interactions. Three surrogates are trained on data generated by DFM simulations, each corresponding to a different fracture-to-matrix conductivity contrast. Performance is evaluated across a wide range of fracture network parameters and matrix-field correlation lengths. The trained models achieve high accuracy, with normalized root-mean-square errors below 0.22 across most test cases. Practical applicability is demonstrated by comparing numerically homogenized conductivities with surrogate predictions in two macro-scale problems: computing equivalent conductivity tensors and predicting outflow from a constrained 3D domain. In both cases, surrogate-based upscaling preserves accuracy while substantially reducing computational cost, achieving speedups exceeding 100x when inference is performed on a GPU.
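A hedged PyTorch sketch of the surrogate's overall shape: a 3D convolutional trunk followed by a feed-forward head mapping a voxelized domain to the 6 independent entries of the symmetric tensor Keq. The channel layout, layer sizes, and output parametrization are illustrative assumptions, not the paper's architecture.

import torch
import torch.nn as nn

class KeqSurrogate(nn.Module):
    """3D CNN + feed-forward head mapping a voxelized conductivity field
    to the 6 independent entries of the symmetric tensor Keq."""
    def __init__(self, in_channels=2):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv3d(in_channels, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv3d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv3d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),              # global spatial context
        )
        self.head = nn.Sequential(
            nn.Flatten(), nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 6),
        )

    def forward(self, voxels):                    # (B, C, D, H, W)
        return self.head(self.conv(voxels))      # (B, 6): Kxx,Kyy,Kzz,Kxy,Kxz,Kyz

# One hypothetical channel for matrix conductivity, one for fracture indicator.
model = KeqSurrogate(in_channels=2)
dummy = torch.randn(4, 2, 32, 32, 32)
print(model(dummy).shape)                         # torch.Size([4, 6])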
优化|敛散性(6篇)
【1】Reflective Context Learning: Studying the Optimization Primitives of Context Space
标题:反思性上下文学习:研究上下文空间的优化基元
链接:https://arxiv.org/abs/2604.03189
作者:Nikita Vassilyev,William Berrios,Ruowang Zhang,Bo Han,Douwe Kiela,Shikib Mehri
备注:Under review at COLM. Github: https://github.com/nvassilyev/RCL
摘要:一般来说,有能力的智能体必须以能跨任务和环境泛化的方式从经验中学习。学习的基本问题,包括信用分配、过拟合、遗忘、局部最优和高方差学习信号,无论学习对象位于参数空间还是上下文空间都同样存在。虽然这些挑战在经典的机器学习优化中得到了很好的理解,但它们在上下文空间中仍然没有得到充分的探索,导致当前的方法是碎片化和临时的。我们提出了反思性上下文学习(RCL),这是一个统一的框架,用于通过反复交互、对行为和失败模式的反思以及对上下文的迭代更新来学习的智能体。在RCL中,反思将轨迹和当前上下文转换为类似于梯度的方向性更新信号,而突变应用该信号来改善上下文空间中的未来行为。我们将最近的上下文优化方法重铸为这一共享学习问题的实例,并系统地用经典的优化原语扩展它们,包括批处理、改进的信用分配信号、辅助损失、故障重放,以及用于方差减少的分组推出。在AppWorld、BrowseComp+和RewardBench2上,这些原语在强大的基线上得到了改进,其相对重要性在不同任务体系中发生变化。我们进一步分析了对初始化的鲁棒性,批量大小、抽样和课程策略的影响,优化器状态变量的影响,以及将更强或更弱的模型分配给不同优化组件的影响。我们的研究结果表明,通过上下文更新的学习不应被视为一组孤立的算法,而应作为一个优化问题,其机制可以被系统地研究并通过可转移的原则加以改进。
摘要:Generally capable agents must learn from experience in ways that generalize across tasks and environments. The fundamental problems of learning, including credit assignment, overfitting, forgetting, local optima, and high-variance learning signals, persist whether the learned object lies in parameter space or context space. While these challenges are well understood in classical machine learning optimization, they remain underexplored in context space, leading current methods to be fragmented and ad hoc. We present Reflective Context Learning (RCL), a unified framework for agents that learn through repeated interaction, reflection on behavior and failure modes, and iterative updates to context. In RCL, reflection converts trajectories and current context into a directional update signal analogous to gradients, while mutation applies that signal to improve future behavior in context space. We recast recent context-optimization approaches as instances of this shared learning problem and systematically extend them with classical optimization primitives, including batching, improved credit-assignment signal, auxiliary losses, failure replay, and grouped rollouts for variance reduction. On AppWorld, BrowseComp+, and RewardBench2, these primitives improve over strong baselines, with their relative importance shifting across task regimes. We further analyze robustness to initialization, the effects of batch size, sampling and curriculum strategy, optimizer-state variants, and the impact of allocating stronger or weaker models to different optimization components. Our results suggest that learning through context updates should be treated not as a set of isolated algorithms, but as an optimization problem whose mechanisms can be studied systematically and improved through transferable principles.
【2】FedSQ: Optimized Weight Averaging via Fixed Gating
标题:FedSQ:通过固定门控优化权重平均
链接:https://arxiv.org/abs/2604.02990
作者:Cristian Pérez-Corral,Jose I. Mestre,Alberto Fernández-Hernández,Manuel F. Dolz,José Duato,Enrique S. Quintana-Ortí
摘要:联邦学习(FL)可以在不共享原始数据的情况下实现跨组织的协作训练,但它受到统计异质性(非i.i.d.客户端数据)以及客户端漂移下朴素权重平均不稳定性的阻碍。在许多跨竖井部署中,FL从强大的预训练骨干(例如ImageNet-1K)热启动,然后适应本地域。最近的证据表明,类似ReLU的门控机制(结构知识)比其余参数值(定量知识)更早稳定,受此启发,我们提出了FedSQ(联邦结构-定量学习),这是一种迁移初始化的神经联邦过程,基于深度网络的DualCopy分段线性视图。FedSQ冻结预训练模型的结构副本,以在联邦微调期间诱导固定的二进制门控掩码,而只有定量副本在本地进行优化并跨轮次聚合。固定门控将学习减少到机制内的仿射细化,这在异构分区下稳定了聚合。在i.i.d.和Dirichlet分裂下对两个卷积神经网络骨干的实验表明,FedSQ提高了鲁棒性,并可以相对于标准基线减少达到最佳验证性能所需的轮数,同时在迁移设置中保持准确性。
摘要:Federated learning (FL) enables collaborative training across organizations without sharing raw data, but it is hindered by statistical heterogeneity (non-i.i.d. client data) and by instability of naive weight averaging under client drift. In many cross-silo deployments, FL is warm-started from a strong pretrained backbone (e.g., ImageNet-1K) and then adapted to local domains. Motivated by recent evidence that ReLU-like gating regimes (structural knowledge) stabilize earlier than the remaining parameter values (quantitative knowledge), we propose FedSQ (Federated Structural-Quantitative learning), a transfer-initialized neural federated procedure based on a DualCopy, piecewise-linear view of deep networks. FedSQ freezes a structural copy of the pretrained model to induce fixed binary gating masks during federated fine-tuning, while only a quantitative copy is optimized locally and aggregated across rounds. Fixing the gating reduces learning to within-regime affine refinements, which stabilizes aggregation under heterogeneous partitions. Experiments on two convolutional neural network backbones under i.i.d. and Dirichlet splits show that FedSQ improves robustness and can reduce rounds-to-best validation performance relative to standard baselines while preserving accuracy in the transfer setting.
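A minimal PyTorch sketch of the DualCopy idea for an MLP: a frozen structural copy computes the binary ReLU gates, while only the quantitative copy's weights receive gradients. The module names and the Linear-layer extraction are illustrative simplifications under stated assumptions, not the paper's code.

import copy
import torch
import torch.nn as nn

class FixedGateMLP(nn.Module):
    """DualCopy sketch: the frozen structural copy decides the binary ReLU
    gates; only the quantitative copy's weights are trained."""
    def __init__(self, pretrained: nn.Module):
        super().__init__()
        self.structural = copy.deepcopy(pretrained)
        for p in self.structural.parameters():
            p.requires_grad_(False)               # gates are fixed
        self.quantitative = copy.deepcopy(pretrained)

    def forward(self, x):
        h_s, h_q = x, x
        layers_s = [m for m in self.structural.modules() if isinstance(m, nn.Linear)]
        layers_q = [m for m in self.quantitative.modules() if isinstance(m, nn.Linear)]
        for i, (ls, lq) in enumerate(zip(layers_s, layers_q)):
            h_s, h_q = ls(h_s), lq(h_q)
            if i < len(layers_s) - 1:             # hidden layers only
                gate = (h_s > 0).float()          # binary mask from the frozen copy
                h_s, h_q = h_s * gate, h_q * gate
        return h_q

backbone = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))
model = FixedGateMLP(backbone)
print(model(torch.randn(2, 8)).shape)             # torch.Size([2, 4])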
【3】Product-Stability: Provable Convergence for Gradient Descent on the Edge of Stability
标题:乘积稳定性:梯度下降在稳定边缘的可证收敛性
链接:https://arxiv.org/abs/2604.02653
作者:Eric Gan
摘要:从经验上讲,现代深度学习训练通常发生在稳定边缘(EoS),此时损失的锐度超过了经典收敛分析所适用的阈值。尽管最近取得了进展,现有的理论解释EoS要么依赖于限制性假设或专注于特定的平方损失型目标。在这项工作中,我们介绍和研究的损失函数的结构性质,我们长期的产品稳定性。我们表明,对于具有产品稳定最小值的损失,应用于形式$(x,y)\mapsto l(xy)$的目标的梯度下降可以证明收敛到局部最小值,即使在EoS制度中进行训练。这个框架基本上概括了以前的结果,并适用于广泛的一类损失,包括二进制交叉熵。使用分叉图,我们描述了由此产生的训练动态,解释了稳定振荡的出现,并精确地量化了收敛时的锐度。总之,我们的结果提供了一个原则性的解释稳定的EoS培训更广泛的损失函数。
摘要:Empirically, modern deep learning training often occurs at the Edge of Stability (EoS), where the sharpness of the loss exceeds the threshold below which classical convergence analysis applies. Despite recent progress, existing theoretical explanations of EoS either rely on restrictive assumptions or focus on specific squared-loss-type objectives. In this work, we introduce and study a structural property of loss functions that we term product-stability. We show that for losses with product-stable minima, gradient descent applied to objectives of the form $(x,y) \mapsto l(xy)$ can provably converge to the local minimum even when training in the EoS regime. This framework substantially generalizes prior results and applies to a broad class of losses, including binary cross entropy. Using bifurcation diagrams, we characterize the resulting training dynamics, explain the emergence of stable oscillations, and precisely quantify the sharpness at convergence. Together, our results offer a principled explanation for stable EoS training for a wider class of loss functions.
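A small NumPy experiment in the spirit of this setting: gradient descent on f(x, y) = l(xy) with l the binary cross entropy, started so the initial sharpness exceeds the classical 2/lr stability threshold. The Hessian formula follows from the chain rule on f; the init, step size, and printout are assumptions for illustration, not the paper's experiments.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_grad_sharpness(x, y):
    """For f(x, y) = l(x*y) with l(z) = log(1 + exp(-z)):
    grad = l'(z) * [y, x], Hessian = l''(z) [y,x][y,x]^T + l'(z) [[0,1],[1,0]]."""
    z = x * y
    l1 = -sigmoid(-z)                  # l'(z)
    l2 = sigmoid(z) * sigmoid(-z)      # l''(z)
    g = np.array([l1 * y, l1 * x])
    H = l2 * np.outer([y, x], [y, x]) + l1 * np.array([[0.0, 1.0], [1.0, 0.0]])
    return np.logaddexp(0.0, -z), g, np.linalg.eigvalsh(H)[-1]

w, lr = np.array([4.0, 0.05]), 2.0     # unbalanced init, large step size
for t in range(3000):
    loss, g, sharp = loss_grad_sharpness(*w)
    if t % 500 == 0:
        # The first print shows sharpness above 2/lr, yet the iterates
        # do not diverge and the loss keeps decreasing.
        print(f"t={t:4d} loss={loss:.2e} sharpness={sharp:.2e} 2/lr={2/lr:.2f}")
    w = w - lr * g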
【4】Robust Learning with Optimal Error
标题:具有最佳误差的鲁棒学习
链接:https://arxiv.org/abs/2604.02555
作者:Guy Blanc
摘要:我们构造了对抗性噪声学习中具有最优误差的算法。这项工作的首要主题是,使用随机化假设可以大大改进确定性假设所能达到的最佳错误率。 - 对于$η$率恶意噪声,我们证明了最优误差为$\frac{1}{2} \cdot η/(1-η)$,将确定性假设的最优误差改进了$1/2$倍。这回答了Cesa-Bianchi等人(JACM 1999)提出的一个开放问题,他们表明随机性可以将误差改进$6/7$倍。 - 对于$η$率讨厌噪声,我们证明了与分布无关的学习器的最优误差为$\frac{3}{2} \cdot η$,固定分布学习器为$η$,均改进了确定性假设的最优$2η$误差。这弥补了Bshouty等人(Theoretical Computer Science 2002)在引入讨厌噪声时首先注意到、并在Klivans等人(NeurIPS 2025)和Blanc等人(SODA 2026)的近期工作中重申的差距。 - 对于$η$率不可知噪声和密切相关的讨厌分类噪声模型,我们证明最优误差为$η$,改进了确定性假设的最优$2η$误差。 我们所有学习器的样本复杂度在概念类的VC维中是线性的,在逆超额误差中是多项式的。在可以访问经验风险最小化预言机的情况下,除固定分布的讨厌噪声学习器外,所有学习器都是时间高效的。
摘要:We construct algorithms with optimal error for learning with adversarial noise. The overarching theme of this work is that the use of \textsl{randomized} hypotheses can substantially improve upon the best error rates achievable with deterministic hypotheses. - For $η$-rate malicious noise, we show the optimal error is $\frac{1}{2} \cdot η/(1-η)$, improving on the optimal error of deterministic hypotheses by a factor of $1/2$. This answers an open question of Cesa-Bianchi et al. (JACM 1999) who showed randomness can improve error by a factor of $6/7$. - For $η$-rate nasty noise, we show the optimal error is $\frac{3}{2} \cdot η$ for distribution-independent learners and $η$ for fixed-distribution learners, both improving upon the optimal $2 η$ error of deterministic hypotheses. This closes a gap first noted by Bshouty et al. (Theoretical Computer Science 2002) when they introduced nasty noise and reiterated in the recent works of Klivans et al. (NeurIPS 2025) and Blanc et al. (SODA 2026). - For $η$-rate agnostic noise and the closely related nasty classification noise model, we show the optimal error is $η$, improving upon the optimal $2η$ error of deterministic hypotheses. All of our learners have sample complexity linear in the VC-dimension of the concept class and polynomial in the inverse excess error. All except for the fixed-distribution nasty noise learner are time efficient given access to an oracle for empirical risk minimization.
【5】Scalable Mean-Variance Portfolio Optimization via Subspace Embeddings and GPU-Friendly Nesterov-Accelerated Projected Gradient
标题:通过子空间嵌入和对图形处理器友好的Nesterov加速投影梯度的可扩展均值-方差投资组合优化
链接:https://arxiv.org/abs/2604.02917
作者:Yi-Shuai Niu,Yajuan Wang
备注:28 pages, 7 figures
摘要:我们开发了一种基于草图的因子约简和带GPU加速的Nesterov加速投影梯度算法(NPGA),为大规模约束均值-方差投资组合优化产生一个双重加速的求解器。该方法从样本协方差因子$L$出发,结合随机子空间嵌入、谱截断和岭稳定等方法构造有效因子$L_{eff}$。然后,它通过标量对偶搜索和GPU友好的矩阵-向量内核计算的结构化投影来解决由此产生的约束问题,为基线、草图和草图-截断-岭(STR)正则化模型产生一个统一的计算管道。我们还为草图和STR模型建立了近似、条件数和稳定性保证,包括在$(\varepsilon,δ)$-子空间嵌入下协方差近似、最优值误差和解扰动的明确$O(\varepsilon)$界。在模拟和真实股票收益数据上的实验表明,该方法在保持目标函数准确性的同时大大减少了运行时间。在5440个资产、48374个训练周期的真实数据基准测试中,NPGA-GPU在2.80秒内解决了未约简的完整模型,而Gurobi则为64.84秒,而优化的压缩GPU变体仍然处于低个位数秒的状态。这些结果表明,完整稠密模型在现代GPU上已经是实用的,并且在压缩之后,剩下的瓶颈是投影而不是矩阵-向量乘法。
摘要:We develop a sketch-based factor reduction and a Nesterov-accelerated projected gradient algorithm (NPGA) with GPU acceleration, yielding a doubly accelerated solver for large-scale constrained mean-variance portfolio optimization. Starting from the sample covariance factor $L$, the method combines randomized subspace embedding, spectral truncation, and ridge stabilization to construct an effective factor $L_{eff}$. It then solves the resulting constrained problem with a structured projection computed by scalar dual search and GPU-friendly matrix-vector kernels, yielding one computational pipeline for the baseline, sketched, and Sketch-Truncate-Ridge (STR)-regularized models. We also establish approximation, conditioning, and stability guarantees for the sketching and STR models, including explicit $O(\varepsilon)$ bounds for the covariance approximation, the optimal value error, and the solution perturbation under $(\varepsilon,δ)$-subspace embeddings. Experiments on synthetic and real equity-return data show that the method preserves objective accuracy while reducing runtime substantially. On a 5440-asset real-data benchmark with 48374 training periods, NPGA-GPU solves the unreduced full model in 2.80 seconds versus 64.84 seconds for Gurobi, while the optimized compressed GPU variants remain in the low-single-digit-second regime. These results show that the full dense model is already practical on modern GPUs and that, after compression, the remaining bottleneck is projection rather than matrix-vector multiplication.
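A self-contained NumPy sketch of two computational pieces named above: Euclidean projection onto the simplex by scalar dual (threshold) search, inside a Nesterov-accelerated projected gradient loop for a toy long-only mean-variance objective. The problem sizes, the risk-aversion parameter, and the objective scaling are assumptions for illustration, not the paper's setup.

import numpy as np

def project_simplex(v):
    """Euclidean projection onto {w >= 0, sum w = 1} via the standard
    sorted scalar threshold search (Duchi et al. style)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    rho = np.nonzero(u - css / np.arange(1, len(v) + 1) > 0)[0][-1]
    theta = css[rho] / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

def npga(Sigma, mu, gamma=5.0, iters=500):
    """Nesterov-accelerated projected gradient for the toy long-only
    mean-variance problem  min_w  w'Sigma w - gamma * mu'w  on the simplex."""
    n = len(mu)
    L = 2.0 * np.linalg.eigvalsh(Sigma)[-1] + 1e-12   # gradient Lipschitz const.
    w = z = np.full(n, 1.0 / n)
    t = 1.0
    for _ in range(iters):
        g = 2.0 * Sigma @ z - gamma * mu
        w_new = project_simplex(z - g / L)
        t_new = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))
        z = w_new + (t - 1.0) / t_new * (w_new - w)
        w, t = w_new, t_new
    return w

rng = np.random.default_rng(0)
F = rng.normal(size=(50, 10))
Sigma = F @ F.T / 10 + 0.1 * np.eye(50)   # factor-structured covariance
mu = rng.normal(0.05, 0.02, size=50)
w = npga(Sigma, mu)
print(w.sum(), (w >= 0).all())            # 1.0 True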
【6】Structure-Preserving Multi-View Embedding Using Gromov-Wasserstein Optimal Transport
标题:使用Gromov-Wasserstein最优传输的结构保持多视图嵌入
链接:https://arxiv.org/abs/2604.02610
作者:Rafael Pereira Eufrazio,Eduardo Fernandes Montesuma,Charles Casimiro Cavalcante
备注:This manuscript is currently under review for possible publication in the journal Signal Processing (ELSEVIER)
摘要:多视图数据分析试图整合相同样本的多个表示,以恢复一致的低维结构。经典的方法通常依赖于特征拼接或显式对齐假设,这在异构几何或非线性失真下变得具有限制性。在这项工作中,我们提出了两个几何感知的多视图嵌入策略接地Gromov-Wasserstein(GW)的最佳运输。第一,称为平均GWMDS,聚合视图特定的关系信息,平均距离矩阵和应用基于GW的多维缩放,以获得一个代表性的嵌入。第二种策略,称为多GWMDS,采用基于选择的范式,其中多个几何一致的候选嵌入通过基于GW的对齐生成,并选择一个代表性的嵌入。在人工流形和真实数据集上的实验表明,该方法有效地保持了视图间的内在关系结构。这些结果突出了基于GW的方法作为多视图表示学习的灵活和原则性框架。
摘要:Multi-view data analysis seeks to integrate multiple representations of the same samples in order to recover a coherent low-dimensional structure. Classical approaches often rely on feature concatenation or explicit alignment assumptions, which become restrictive under heterogeneous geometries or nonlinear distortions. In this work, we propose two geometry-aware multi-view embedding strategies grounded in Gromov-Wasserstein (GW) optimal transport. The first, termed Mean-GWMDS, aggregates view-specific relational information by averaging distance matrices and applying GW-based multidimensional scaling to obtain a representative embedding. The second strategy, referred to as Multi-GWMDS, adopts a selection-based paradigm in which multiple geometry-consistent candidate embeddings are generated via GW-based alignment and a representative embedding is selected. Experiments on synthetic manifolds and real-world datasets show that the proposed methods effectively preserve intrinsic relational structure across views. These results highlight GW-based approaches as a flexible and principled framework for multi-view representation learning.
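To make the Mean-GWMDS recipe concrete, here is a short sketch of its first step, averaging normalized per-view distance matrices, with classical metric MDS standing in for the Gromov-Wasserstein-based scaling the paper actually uses. The data and views are synthetic placeholders.

import numpy as np
from sklearn.manifold import MDS
from sklearn.metrics import pairwise_distances

# Two hypothetical "views" of the same 100 samples: different feature
# spaces and nonlinear distortions, but a shared 2-D latent structure.
rng = np.random.default_rng(0)
latent = rng.normal(size=(100, 2))
view1 = latent @ rng.normal(size=(2, 10)) + 0.1 * rng.normal(size=(100, 10))
view2 = np.tanh(latent @ rng.normal(size=(2, 25))) + 0.1 * rng.normal(size=(100, 25))

# Step 1 (Mean-GWMDS idea): aggregate per-view relational information by
# averaging scale-normalized distance matrices.
D = sum(pairwise_distances(v) / pairwise_distances(v).mean()
        for v in (view1, view2)) / 2

# Step 2: embed the averaged dissimilarities; the paper replaces this
# classical metric MDS with a GW-based scaling step.
emb = MDS(n_components=2, dissimilarity="precomputed",
          random_state=0).fit_transform(D)
print(emb.shape)   # (100, 2)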
预测|估计(2篇)
【1】Toward an Operational GNN-Based Multimesh Surrogate for Fast Flood Forecasting
标题:迈向用于快速洪水预报的可操作的基于GNN的多网格代理模型
链接:https://arxiv.org/abs/2604.02876
作者:Valentin Mercier,Serge Gratton,Lapeyre Corentin,Gwenaël Chevallet
摘要:业务洪水预报仍然依赖于高保真的二维水力求解器,但它们的运行时间可能会限制大型城市洪泛区的快速决策支持。同时,基于人工智能的代理模型在计算物理的几个领域显示出强大的潜力,可以加速昂贵的高保真模拟。我们在泰特河(法国)下游解决这个问题,从一个生产级的Telemac2D模型出发,该模型定义在一个超过$4\times 10^5$个节点的高分辨率非结构化有限元网格上。从这个设置出发,我们建立了一个可用于学习的数据库,包含合成但贴近业务实际的洪水事件,涵盖几个代表性的水文过程线族和峰值流量。在此数据库之上,我们开发了一个基于投影网格和多网格连接的图神经代理模型。投影网格策略保持训练易于处理,同时保留来自原始Telemac模拟的高保真监督,并且多网格构建在不增加网络深度的情况下扩大了有效的空间感受野。我们进一步研究了显式流量特征$Q(t)$和前推(pushforward)训练对长期自回归滚动预测的影响。实验表明,在这种边界驱动的设置中,对$Q(t)$的调节是必不可少的,一旦模型得到适当的调节,多网格连接就会带来额外的增益,并且前推进一步提高了滚动预测的稳定性。在测试的配置中,$Q(t)$、多网格连接和前推的组合提供了最佳的总体结果。这些增益既体现在代理网格上的水力变量上,也体现在插值到一个共同的$25\,\mathrm{m}$规则网格上并与原始高分辨率Telemac解进行比较的淹没地图上。在所研究的案例中,学习的代理模型在单个NVIDIA A100 GPU上以约$0.4\,\mathrm{s}$的时间生成6小时的预测,而在56个CPU内核上的参考模拟约为$180\,\mathrm{min}$。这些结果支持将基于图的代理模型作为工业水力求解器在业务洪水制图中的实用补充。
摘要:Operational flood forecasting still relies on high-fidelity two-dimensional hydraulic solvers, but their runtime can be prohibitive for rapid decision support on large urban floodplains. In parallel, AI-based surrogate models have shown strong potential in several areas of computational physics for accelerating otherwise expensive high-fidelity simulations. We address this issue on the lower Têt River (France), starting from a production-grade Telemac2D model defined on a high-resolution unstructured finite-element mesh with more than $4\times 10^5$ nodes. From this setup, we build a learning-ready database of synthetic but operationally grounded flood events covering several representative hydrograph families and peak discharges. On top of this database, we develop a graph-neural surrogate based on projected meshes and multimesh connectivity. The projected-mesh strategy keeps training tractable while preserving high-fidelity supervision from the original Telemac simulations, and the multimesh construction enlarges the effective spatial receptive field without increasing network depth. We further study the effect of an explicit discharge feature $Q(t)$ and of pushforward training for long autoregressive rollouts. The experiments show that conditioning on $Q(t)$ is essential in this boundary-driven setting, that multimesh connectivity brings additional gains once the model is properly conditioned, and that pushforward further improves rollout stability. Among the tested configurations, the combination of $Q(t)$, multimesh connectivity, and pushforward provides the best overall results. These gains are observed both on hydraulic variables over the surrogate mesh and on inundation maps interpolated onto a common $25\,\mathrm{m}$ regular grid and compared against the original high-resolution Telemac solution. On the studied case, the learned surrogate produces 6-hour predictions in about $0.4\,\mathrm{s}$ on a single NVIDIA A100 GPU, compared with about $180\,\mathrm{min}$ on 56 CPU cores for the reference simulation. These results support graph-based surrogates as practical complements to industrial hydraulic solvers for operational flood mapping.
【2】YC Bench: a Live Benchmark for Forecasting Startup Outperformance in Y Combinator Batches
标题:YC Bench:预测Y Combinator批次中初创公司表现优异的实时基准
链接:https://arxiv.org/abs/2604.02378
作者:Mostapha Benhenda
摘要:预测创业公司的成功是出了名的困难,部分原因是有意义的结果,如退出、大规模融资和持续的收入增长,是罕见的,可能需要数年时间才能实现。因此,信号稀疏,评估周期缓慢。Y Combinator的批次提供了一种独特的缓解方法:每一批都包括大约200家初创公司,同时获得资助,仅在三个月后的演示日进行评估。我们引入YC Bench,这是一个实时基准,用于预测YC批次中的早期表现。使用YC W26批次作为案例研究(196家初创公司),我们使用Pre-Demo Day Score来衡量表现,这是一个结合了公开可用的牵引信号和网络可见性的KPI。这种短期指标可以快速评估预测模型。作为基线,我们在YC W26申请截止日期之前获得了Google提及,这是先前品牌认知度的简单代理,在YC演示日恢复了11个表现最佳者中的6个(55%召回率)。YC Bench为研究创业成功预测提供了一个实时基准,迭代周期以月而不是年为单位。代码和数据可在GitHub上获得:https://github.com/benstaf/ycbench
摘要:Forecasting startup success is notoriously difficult, partly because meaningful outcomes, such as exits, large funding rounds, and sustained revenue growth, are rare and can take years to materialize. As a result, signals are sparse and evaluation cycles are slow. Y Combinator batches offer a unique mitigation: each batch comprises around 200 startups, funded simultaneously, with evaluation at Demo Day only three months later. We introduce YC Bench, a live benchmark for forecasting early outperformance within YC batches. Using the YC W26 batch as a case study (196 startups), we measure outperformance with a Pre-Demo Day Score, a KPI combining publicly available traction signals and web visibility. This short-term metric enables rapid evaluation of forecasting models. As a baseline, we take Google mentions prior to the YC W26 application deadline, a simple proxy for prior brand recognition, recovering 6 of 11 top performers at YC Demo Day (55% recall). YC Bench provides a live benchmark for studying startup success forecasting, with iteration cycles measured in months rather than years. Code and Data are available on GitHub: https://github.com/benstaf/ycbench
其他神经网络|深度学习|模型|建模(19篇)
【1】Hierarchical Planning with Latent World Models
标题:潜世界模型的分层规划
链接:https://arxiv.org/abs/2604.03208
作者:Wancong Zhang,Basile Terver,Artem Zholus,Soham Chitnis,Harsh Sutaria,Mido Assran,Randall Balestriero,Amir Bar,Adrien Bardes,Yann LeCun,Nicolas Ballas
摘要:具有学习世界模型的模型预测控制(MPC)已经成为一种有前途的具身控制范式,特别是因为其在部署到新环境时的zero-shot泛化能力。然而,由于预测误差的积累和指数增长的搜索空间,学习的世界模型通常难以进行长时域控制。在这项工作中,我们通过在多个时间尺度上学习潜在世界模型并在这些尺度上执行分层规划来解决这些挑战,从而实现长时域推理,同时大大降低推理时的规划复杂性。我们的方法作为一个模块化的规划抽象,适用于不同的潜在世界模型架构和领域。我们证明,这种分层方法使现实世界中非贪婪机器人任务的zero-shot控制成为可能,在仅使用最终目标规范的抓取与放置任务上实现了70%的成功率,而单级世界模型为0%。此外,在包括推动操作和迷宫导航在内的基于物理的模拟环境中,分层规划实现了更高的成功率,同时所需的规划时间计算最多减少至1/4。
摘要:Model predictive control (MPC) with learned world models has emerged as a promising paradigm for embodied control, particularly for its ability to generalize zero-shot when deployed in new environments. However, learned world models often struggle with long-horizon control due to the accumulation of prediction errors and the exponentially growing search space. In this work, we address these challenges by learning latent world models at multiple temporal scales and performing hierarchical planning across these scales, enabling long-horizon reasoning while substantially reducing inference-time planning complexity. Our approach serves as a modular planning abstraction that applies across diverse latent world-model architectures and domains. We demonstrate that this hierarchical approach enables zero-shot control on real-world non-greedy robotic tasks, achieving a 70% success rate on pick-&-place using only a final goal specification, compared to 0% for a single-level world model. In addition, across physics-based simulated environments including push manipulation and maze navigation, hierarchical planning achieves higher success while requiring up to 4x less planning-time compute.
【2】Learning Contractive Integral Operators with Fredholm Integral Neural Operators
标题:用Fredholm积分神经算子学习压缩积分算子
链接:https://arxiv.org/abs/2604.03034
作者:Kyriakos C. Georgiou,Constantinos Siettos,Athanasios N. Yannacopoulos
摘要:我们推广了Fredholm神经网络的框架,以学习任意维数下第二类Fredholm积分方程(FIE)中出现的非扩张积分算子。我们首先提出了Fredholm积分神经算子(FREDINOs),并证明它们是线性和非线性积分算子及相应解算子的通用逼近器。我们还证明了所学习的算子保证是压缩的,从而严格满足不动点格式收敛所需的数学性质。最后,我们还展示了如何通过边界积分方程(BIE)的形式,利用FREDINOs学习非线性椭圆偏微分方程的解算子。我们通过若干基准问题对所提出的方法进行数值评估:任意维数下的线性和非线性FIE,以及一个二维非线性椭圆偏微分方程。基于量身定制的数学/数值分析理论,FREDINOs提供高精度近似和可解释的格式,使其非常适合科学机器学习/数值分析计算。
摘要:We generalize the framework of Fredholm Neural Networks, to learn non-expansive integral operators arising in Fredholm Integral Equations (FIEs) of the second kind in arbitrary dimensions. We first present the proposed Fredholm Integral Neural Operators (FREDINOs), for FIEs and prove that they are universal approximators of linear and non-linear integral operators and corresponding solution operators. We furthermore prove that the learned operators are guaranteed to be contractive, thereby strictly satisfying the mathematical property required for the convergence of the fixed point scheme. Finally, we also demonstrate how FREDINOs can be used to learn the solution operator of non-linear elliptic PDEs, via a Boundary Integral Equation (BIE) formulation. We assess the proposed methodology numerically, via several benchmark problems: linear and non-linear FIEs in arbitrary dimensions, as well as a non-linear elliptic PDE in 2D. Built on tailored mathematical/numerical analysis theory, FREDINOs offer high-accuracy approximations and interpretable schemes, making them well suited for scientific machine learning/numerical analysis computations.
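A numerical aside on why contractivity matters here: on a Nystrom grid, Picard (fixed-point) iteration for a second-kind Fredholm equation converges precisely when the integral operator is contractive, which is the property FREDINOs are proved to preserve for the learned operator. The kernel below is a hand-picked contractive example, not a learned one.

import numpy as np

# Solve the second-kind Fredholm equation u(x) = f(x) + int_0^1 k(x,s) u(s) ds
# by fixed-point (Picard) iteration on a Nystrom grid.
n = 200
x = np.linspace(0.0, 1.0, n)
w = np.full(n, 1.0 / n)                               # simple quadrature weights
K = 0.5 * np.exp(-np.abs(x[:, None] - x[None, :]))    # contractive kernel (||K|| < 1)
f = np.sin(np.pi * x)

u = np.zeros(n)
for it in range(200):
    u_new = f + K @ (w * u)             # one application of the integral operator
    if np.max(np.abs(u_new - u)) < 1e-12:
        break
    u = u_new
print("converged in", it, "iterations")

# Check against the direct Nystrom solve (I - K W) u = f.
u_direct = np.linalg.solve(np.eye(n) - K * w[None, :], f)
print("max deviation:", np.abs(u - u_direct).max())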
【3】Learning from Synthetic Data via Provenance-Based Input Gradient Guidance
标题:通过基于出处的输入梯度指导从合成数据中学习
链接:https://arxiv.org/abs/2604.02946
作者:Koshiro Nagano,Ryo Fujii,Ryo Hachiuma,Fumiaki Sato,Taiki Sekii,Hideo Saito
备注:CVPR 2026
摘要:使用合成数据的学习方法作为一种有效的方法引起了人们的注意,这种方法可以增加训练数据的多样性,同时降低收集成本,从而提高模型判别的鲁棒性。然而,许多现有的方法只能通过训练样本的多样化间接地提高鲁棒性,并且没有明确地教导模型输入空间中的哪些区域真正有助于区分;因此,模型可能会学习由合成偏差和伪影引起的虚假相关性。出于这种局限性,本文提出了一种学习框架,它使用训练数据合成过程中获得的出处信息,指示输入空间中的每个区域是否来自目标对象,作为辅助监督信号,以促进采集集中在目标区域的表示。具体地,在合成期间基于关于目标和非目标区域的信息来分解输入梯度,并且引入输入梯度引导以抑制非目标区域上的梯度。这抑制了模型对非目标区域的依赖,并直接促进了对目标区域的区分性表示的学习。实验证明了所提出的方法在多个任务和模式,包括弱监督对象定位,时空动作定位和图像分类的有效性和通用性。
摘要:Learning methods using synthetic data have attracted attention as an effective approach for increasing the diversity of training data while reducing collection costs, thereby improving the robustness of model discrimination. However, many existing methods improve robustness only indirectly through the diversification of training samples and do not explicitly teach the model which regions in the input space truly contribute to discrimination; consequently, the model may learn spurious correlations caused by synthesis biases and artifacts. Motivated by this limitation, this paper proposes a learning framework that uses provenance information obtained during the training data synthesis process, indicating whether each region in the input space originates from the target object, as an auxiliary supervisory signal to promote the acquisition of representations focused on target regions. Specifically, input gradients are decomposed based on information about target and non-target regions during synthesis, and input gradient guidance is introduced to suppress gradients over non-target regions. This suppresses the model's reliance on non-target regions and directly promotes the learning of discriminative representations for target regions. Experiments demonstrate the effectiveness and generality of the proposed method across multiple tasks and modalities, including weakly supervised object localization, spatio-temporal action localization, and image classification.
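A hedged PyTorch sketch of the core mechanism described above: a differentiable penalty on input gradients over non-target regions, where the mask comes from synthesis-time provenance. The model, mask, and weighting below are placeholders, not the paper's configuration.

import torch
import torch.nn as nn
import torch.nn.functional as F

def guided_loss(model, x, y, target_mask, lam=1.0):
    """Cross entropy plus a penalty on input gradients over non-target
    regions, using synthesis-time provenance as the mask.

    target_mask: 1 where a pixel originates from the target object,
    0 elsewhere (known by construction for synthetic data)."""
    x = x.clone().requires_grad_(True)
    logits = model(x)
    ce = F.cross_entropy(logits, y)
    # Input gradient of the task loss, kept in the graph so the penalty
    # itself is differentiable w.r.t. the model parameters.
    g = torch.autograd.grad(ce, x, create_graph=True)[0]
    penalty = (g * (1.0 - target_mask)).pow(2).mean()
    return ce + lam * penalty

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
x = torch.randn(8, 3, 32, 32)
y = torch.randint(0, 10, (8,))
mask = (torch.rand(8, 1, 32, 32) > 0.5).float()   # hypothetical provenance mask
loss = guided_loss(model, x, y, mask)
loss.backward()
print(loss.item())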
【4】Structure-Aware Commitment Reduction for Network-Constrained Unit Commitment with Solver-Preserving Guarantees
标题:具有求解器保持保证的网络约束机组组合的结构感知承诺约简
链接:https://arxiv.org/abs/2604.02788
作者:Guangwen Wang,Jiaqi Wu,Yang Weng,Baosen Zhang
备注:10 pages
摘要:单个发电机组、混合资源和安全约束的数量不断增加,显著增加了网络约束机组组合(UC)的计算负担,其中大部分求解时间都花在探索机组-小时二进制变量的分支定界树上。为了减少这种组合负担,最近的方法探索了基于学习的指导,以协助承诺决策。然而,直接使用诸如大型语言模型(LLM)之类的工具来预测完整承诺时间表是不可靠的,因为不可行或不一致的二元决策可能违反跨时段约束并降低经济最优性。本文提出了一个与求解器兼容的UC降维框架,利用承诺决策中的结构规律性。该框架不是生成完整的时间表,而是在优化之前确定一个结构稳定的承诺二进制变量的稀疏子集并将其固定。一种实现使用LLM来选择这些变量。LLM不会取代优化过程,而是提供部分变量限制,所有约束和剩余决策仍由原始MILP求解器处理,该求解器继续执行网络、爬坡、备用和安全约束。我们正式表明,掩蔽问题定义了原始UC模型的简化可行域,从而保留了可行性,并在受限空间内实现了求解器认证的最优性。在IEEE 57节点、RTS 73节点、IEEE 118节点以及包括安全约束变体在内的增强大规模算例上的实验表明,分支定界节点数和求解时间持续减少,在高复杂度算例上实现了数量级的加速,同时保持接近最优的目标值。
摘要:The growing number of individual generating units, hybrid resources, and security constraints has significantly increased the computational burden of network-constrained unit commitment (UC), where most solution time is spent exploring branch-and-bound trees over unit-hour binary variables. To reduce this combinatorial burden, recent approaches have explored learning-based guidance to assist commitment decisions. However, directly using tools such as large language models (LLMs) to predict full commitment schedules is unreliable, as infeasible or inconsistent binary decisions can violate inter-temporal constraints and degrade economic optimality. This paper proposes a solver-compatible dimensionality reduction framework for UC that exploits structural regularities in commitment decisions. Instead of generating complete schedules, the framework identifies a sparse subset of structurally stable commitment binaries to fix prior to optimization. One implementation uses an LLM to select these variables. The LLM does not replace the optimization process but provides partial variable restriction, while all constraints and remaining decisions are handled by the original MILP solver, which continues to enforce network, ramping, reserve, and security constraints. We formally show that the masked problem defines a reduced feasible region of the original UC model, thereby preserving feasibility and enabling solver-certified optimality within the restricted space. Experiments on IEEE 57-bus, RTS 73-bus, IEEE 118-bus, and augmented large-scale cases, including security-constrained variants, demonstrate consistent reductions in branch-and-bound nodes and solution time, achieving order-of-magnitude speedups on high-complexity instances while maintaining near-optimal objective values.
【5】Towards Realistic Class-Incremental Learning with Free-Flow Increments
标题:面向具有自由流增量的现实类增量学习
链接:https://arxiv.org/abs/2604.02765
作者:Zhiming Xu,Baile Xu,Jian Zhao,Furao Shen,Suorong Yang
备注:15pages, 5figures, 3 tables
摘要:类增量学习(CIL)通常在预定义的时间表下进行评估,任务大小相等,留下更现实和复杂的情况未被探索。然而,一个实用的CIL系统应该在任何数量的新类到达时立即学习,而不强制执行固定大小的任务。我们将这种设置形式化为自由流类增量学习(FFCIL),其中数据作为更真实的流到达,每一步都有高度可变数量的未见类。这会使许多现有的CIL方法变得脆弱,并导致明显的性能下降。我们提出了一个模型无关的框架,用于在自由流到达下进行鲁棒的CIL学习。它包括一个类别级均值(CWM)目标,用均匀聚合的类条件监督取代按采样频率加权的损失,从而在自由流类增量下稳定学习信号,以及针对各方法的调整,以提高代表性CIL范式的鲁棒性。具体来说,我们将蒸馏限制在重放数据上,规范化对比和知识转移损失的规模,并引入动态干预权重调整(DIWA)来防止由于小类增量的不稳定统计量而导致的过度调整。实验证实,在FFCIL下,各种CIL基线的性能明显下降,而我们的策略产生一致的收益。
摘要:Class-incremental learning (CIL) is typically evaluated under predefined schedules with equal-sized tasks, leaving more realistic and complex cases unexplored. However, a practical CIL system should learns immediately when any number of new classes arrive, without forcing fixed-size tasks. We formalize this setting as Free-Flow Class-Incremental Learning (FFCIL), where data arrives as a more realistic stream with a highly variable number of unseen classes each step. It will make many existing CIL methods brittle and lead to clear performance degradation. We propose a model-agnostic framework for robust CIL learning under free-flow arrivals. It comprises a class-wise mean (CWM) objective that replaces sample frequency weighted loss with uniformly aggregated class-conditional supervision, thereby stabilizing the learning signal across free-flow class increments, as well as method-wise adjustments that improve robustness for representative CIL paradigms. Specifically, we constrain distillation to replayed data, normalize the scale of contrastive and knowledge transfer losses, and introduce Dynamic Intervention Weight Alignment (DIWA) to prevent over-adjustment caused by unstable statistics from small class increments. Experiments confirm a clear performance degradation across various CIL baselines under FFCIL, while our strategies yield consistent gains.
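A minimal PyTorch sketch of the class-wise mean (CWM) objective named above: average the per-sample loss within each class first, then uniformly over the classes present in the batch, so a class arriving with few samples is not drowned out by frequent ones. The batch below is a toy illustration.

import torch
import torch.nn.functional as F

def class_wise_mean_loss(logits, targets):
    """CWM objective sketch: per-class averaging first, then a uniform
    average over the classes present in the batch."""
    per_sample = F.cross_entropy(logits, targets, reduction="none")
    class_losses = [per_sample[targets == c].mean() for c in targets.unique()]
    return torch.stack(class_losses).mean()

logits = torch.randn(32, 10, requires_grad=True)
targets = torch.cat([torch.zeros(28, dtype=torch.long),       # frequent class
                     torch.full((4,), 7, dtype=torch.long)])  # rare new class
loss = class_wise_mean_loss(logits, targets)
loss.backward()
print(loss.item())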
【6】STDDN: A Physics-Guided Deep Learning Framework for Crowd Simulation
标题:STDDN:用于人群模拟的物理引导深度学习框架
链接:https://arxiv.org/abs/2604.02756
作者:Zijin Liu,Xu Geng,Wenshuai Xu,Xiang Zhao,Yan Xia,You Song
摘要:准确的人群模拟对于公共安全管理、紧急疏散规划和智能交通系统至关重要。然而,现有的方法,通常模型的人群作为一个独立的个人轨迹的集合,是有限的,在他们的能力来捕捉宏观物理规律。这种微观方法往往会导致误差积累,并损害模拟的稳定性。此外,深度学习驱动的方法往往存在推理效率低和计算开销高的问题,使得它们对于大规模、高效的模拟来说不切实际。为了应对这些挑战,我们提出了时空解耦微分方程网络(STDDN),这是一种新的框架,可以用宏观物理指导微观轨迹预测。我们创新性地引入了流体动力学的连续性方程作为强物理约束。一个神经常微分方程(神经ODE)被用来模拟由个体运动驱动的宏观密度演化,从而物理正则化的微观轨迹预测模型。我们设计了一个密度-速度耦合的动态图学习模块,用于在神经常微分方程中表达密度场的导数,有效地减少了误差积累。我们还提出了一个可微密度映射模块,以消除离散化造成的不连续梯度,并引入了跨网格检测模块,以准确地模拟个人跨网格运动对局部密度变化的影响。与四个真实世界数据集上的长期任务的最先进方法相比,所提出的STDDN方法具有显着优越的模拟性能,并且大大减少了推理延迟。
摘要:Accurate crowd simulation is crucial for public safety management, emergency evacuation planning, and intelligent transportation systems. However, existing methods, which typically model crowds as a collection of independent individual trajectories, are limited in their ability to capture macroscopic physical laws. This microscopic approach often leads to error accumulation and compromises simulation stability. Furthermore, deep learning-driven methods tend to suffer from low inference efficiency and high computational overhead, making them impractical for large-scale, efficient simulations. To address these challenges, we propose the Spatio-Temporal Decoupled Differential Equation Network (STDDN), a novel framework that guides microscopic trajectory prediction with macroscopic physics. We innovatively introduce the continuity equation from fluid dynamics as a strong physical constraint. A Neural Ordinary Differential Equation (Neural ODE) is employed to model the macroscopic density evolution driven by individual movements, thereby physically regularizing the microscopic trajectory prediction model. We design a density-velocity coupled dynamic graph learning module to formulate the derivative of the density field within the Neural ODE, effectively mitigating error accumulation. We also propose a differentiable density mapping module to eliminate discontinuous gradients caused by discretization and introduce a cross-grid detection module to accurately model the impact of individual cross-grid movements on local density changes. The proposed STDDN method has demonstrated significantly superior simulation performance compared to state-of-the-art methods on long-term tasks across four real-world datasets, as well as a major reduction in inference latency.
【7】MOMO: Mars Orbital Model Foundation Model for Mars Orbital Applications
标题:MOMO:用于火星轨道应用的火星轨道模型基础模型
链接:https://arxiv.org/abs/2604.02719
作者:Mirali Purohit,Bimal Gajera,Irish Mehta,Bhanu Tokas,Jacob Adler,Steven Lu,Scott Dickenshied,Serina Diniega,Brian Bue,Umaa Rebbapragada,Hannah Kerner
备注:Accepted at CVPR 2026 (Main Track)
摘要:我们介绍MOMO,第一个面向火星遥感的多传感器基础模型。MOMO使用模型合并来整合从三个关键火星传感器(HiRISE、CTX和THEMIS)独立学习的表示,分辨率覆盖0.25米/像素到100米/像素。我们方法的核心是新颖的相等验证损失(EVL)策略,它在通过任务算术(task arithmetic)融合之前,根据验证损失相似性跨传感器对齐检查点。这确保了模型在兼容的收敛阶段合并,从而提高稳定性和泛化能力。我们在一个从火星轨道数据中策划的、包含$\sim 12$百万样本的大规模高质量语料库上训练MOMO,并在来自Mars-Bench的9个下游任务上对其进行评估。与ImageNet预训练、地球观测基础模型、传感器特定预训练和完全监督的基线相比,MOMO实现了更好的整体性能。特别是在分割任务上,MOMO表现出一致且显著的性能改进。我们的结果表明,通过最优检查点选择策略进行模型合并,为构建多分辨率数据的基础模型提供了一种有效方法。模型权重、预训练代码、预训练数据和评估代码可在https://github.com/kerner-lab/MOMO上获得。
摘要:We introduce MOMO, the first multi-sensor foundation model for Mars remote sensing. MOMO uses model merge to integrate representations learned independently from three key Martian sensors (HiRISE, CTX, and THEMIS), spanning resolutions from 0.25 m/pixel to 100 m/pixel. Central to our method is our novel Equal Validation Loss (EVL) strategy, which aligns checkpoints across sensors based on validation loss similarity before fusion via task arithmetic. This ensures models are merged at compatible convergence stages, leading to improved stability and generalization. We train MOMO on a large-scale, high-quality corpus of $\sim 12$ million samples curated from Mars orbital data and evaluate it on 9 downstream tasks from Mars-Bench. MOMO achieves better overall performance compared to ImageNet pre-trained, earth observation foundation model, sensor-specific pre-training, and fully-supervised baselines. Particularly on segmentation tasks, MOMO shows consistent and significant performance improvement. Our results demonstrate that model merging through an optimal checkpoint selection strategy provides an effective approach for building foundation models for multi-resolution data. The model weights, pretraining code, pretraining data, and evaluation code are available at: https://github.com/kerner-lab/MOMO.
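As a rough illustration of the merging recipe described above, the sketch below pairs a validation-loss-based checkpoint selection with standard task arithmetic. The target-loss matching rule, the scaling constant, and all names are illustrative assumptions; the paper's EVL criterion and merge details may differ.

```python
import numpy as np

def select_by_equal_val_loss(checkpoints, target_loss):
    """Pick, per sensor, the checkpoint whose validation loss is closest
    to a shared target (a rough stand-in for the EVL alignment step)."""
    return {
        sensor: min(ckpts, key=lambda c: abs(c["val_loss"] - target_loss))
        for sensor, ckpts in checkpoints.items()
    }

def task_arithmetic_merge(base, selected, scale=0.5):
    """Merge sensor-specific weights into the base via task arithmetic:
    merged = base + scale * sum_s (theta_s - base)."""
    merged = {k: v.copy() for k, v in base.items()}
    for ckpt in selected.values():
        for k in merged:
            merged[k] += scale * (ckpt["weights"][k] - base[k])
    return merged

# Toy usage with random 2x2 "weights" for three sensors.
rng = np.random.default_rng(0)
base = {"w": rng.normal(size=(2, 2))}
checkpoints = {
    s: [{"val_loss": rng.uniform(0.5, 1.5),
         "weights": {"w": base["w"] + rng.normal(scale=0.1, size=(2, 2))}}
        for _ in range(4)]
    for s in ["HiRISE", "CTX", "THEMIS"]
}
selected = select_by_equal_val_loss(checkpoints, target_loss=1.0)
merged = task_arithmetic_merge(base, selected)
print(merged["w"].shape)
```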
【8】LieTrunc-QNN: Lie Algebra Truncation and Quantum Expressivity Phase Transition from LiePrune to Provably Stable Quantum Neural Networks
标题:LieTrunc-QNN:从LiePrune到可证明稳定的量子神经网络的李代数截断与量子表达性相变
链接:https://arxiv.org/abs/2604.02697
作者:Haijian Shao,Dalong Zhao,Xing Deng,Wenzheng Zhu,Yingtao Jiang
备注:9 pages, 4 figures, 1 table
摘要:量子机器学习(QML)从根本上受到两个挑战的限制:贫瘠的高原(指数消失的梯度)和噪声下参数化量子电路的脆弱性。尽管进行了大量的实证研究,但仍然缺乏统一的理论框架。 我们介绍LieTrunc-QNN,一个代数几何框架,通过Lie生成的动力学来表征可训练性。参数化的量子电路被建模为u(2^n)的李子代数,其作用诱导出可达量子态的黎曼流形。表达性被重新解释为内在的流形维数和几何。 我们建立了一个几何容量平台原理:增加有效维数导致指数梯度抑制由于浓度的措施。通过限制到结构化李子代数(LieTrunc),流形被收缩,防止集中和保持非退化梯度。 我们证明了两个主要结果:(1)LieTrunc-QNN的可训练性下界,以及(2)Fubini-Study度量秩由生成器的代数跨度限制,表明表达性由结构而不是参数计数决定。紧李子代数也提供了固有的鲁棒性扰动。 重要的是,我们建立了一个多项式可训练性机制,其中梯度方差多项式衰减而不是指数衰减。 实验(n=2-6)验证了理论:LieTrunc-QNN保持了稳定的梯度和高有效维数,而随机截断导致度量秩崩溃。在n=6时,保留全度量秩(秩=16)。结果支持梯度方差和有效维数之间的标度律。 这项工作提供了一个统一的几何框架QNN设计,连接李代数,流形几何和优化。
摘要:Quantum Machine Learning (QML) is fundamentally limited by two challenges: barren plateaus (exponentially vanishing gradients) and the fragility of parameterized quantum circuits under noise. Despite extensive empirical studies, a unified theoretical framework remains lacking. We introduce LieTrunc-QNN, an algebraic-geometric framework that characterizes trainability via Lie-generated dynamics. Parameterized quantum circuits are modeled as Lie subalgebras of u(2^n), whose action induces a Riemannian manifold of reachable quantum states. Expressivity is reinterpreted as intrinsic manifold dimension and geometry. We establish a geometric capacity-plateau principle: increasing effective dimension leads to exponential gradient suppression due to concentration of measure. By restricting to structured Lie subalgebras (LieTrunc), the manifold is contracted, preventing concentration and preserving non-degenerate gradients. We prove two main results: (1) a trainability lower bound for LieTrunc-QNN, and (2) that the Fubini-Study metric rank is bounded by the algebraic span of generators, showing expressivity is governed by structure rather than parameter count. Compact Lie subalgebras also provide inherent robustness to perturbations. Importantly, we establish a polynomial trainability regime where gradient variance decays polynomially instead of exponentially. Experiments (n=2-6) validate the theory: LieTrunc-QNN maintains stable gradients and high effective dimension, while random truncation leads to metric rank collapse. At n=6, full metric rank is preserved (rank=16). Results support a scaling law between gradient variance and effective dimension. This work provides a unified geometric framework for QNN design, linking Lie algebra, manifold geometry, and optimization.
【9】A Numerical Method for Coupling Parameterized Physics-Informed Neural Networks and FDM for Advanced Thermal-Hydraulic System Simulation
标题:一种耦合参数化物理信息神经网络与有限差分法的数值方法,用于先进热工水力系统模拟
链接:https://arxiv.org/abs/2604.02663
作者:Jeesuk Shin,Donggyun Seo,Sihyeong Yu,Joongoo Jeon
备注:37 pages, 7 figures
摘要:使用MELCOR等系统级程序进行严重事故分析对于核安全评估不可或缺,但重复模拟的计算成本对参数研究和不确定性量化构成了重大瓶颈。现有的代理模型能加速这些分析,但依赖大量仿真数据;而物理信息神经网络(PINN)可以实现无数据训练,却必须针对问题参数的每次变化重新训练。本研究通过开发参数化PINN与FDM耦合(P2F)方法来同时解决这两个限制,这是一个面向MELCOR控制体积流体动力学/流路(CVH/FP)模块的节点分配混合框架。在P2F方法中,参数化的节点分配PINN(NA-PINN)以水位差、初始速度和时间作为输入,学习解流形,使单个训练好的网络无需重新训练即可作为所有流动路径上动量守恒方程的无数据代理。该PINN与一个在每个时间步推进质量守恒方程的有限差分法(FDM)求解器耦合,在用单次前向传播取代迭代非线性动量求解的同时,确保精确的离散质量守恒。在六水箱重力排水场景上的验证表明,在标称条件($\Delta t = 1.0$ s)下,水位平均绝对误差为$7.85 \times 10^{-5}$ m,流速平均绝对误差为$3.21 \times 10^{-3}$ m/s。该框架在0.2至1.0 s的时间步长范围内保持一致的精度,并可推广到五种不同的初始条件,而无需再训练或仿真数据。本文介绍了一种在核热工水力系统程序框架内将参数化PINN与FDM集成的数值耦合方法。
摘要:Severe accident analysis using system-level codes such as MELCOR is indispensable for nuclear safety assessment, yet the computational cost of repeated simulations poses a significant bottleneck for parametric studies and uncertainty quantification. Existing surrogate models accelerate these analyses but depend on large volumes of simulation data, while physics-informed neural networks (PINNs) enable data-free training but must be retrained for every change in problem parameters. This study addresses both limitations by developing the Parameterized PINNs coupled with FDM (P2F) method, a node-assigned hybrid framework for MELCOR's Control Volume Hydrodynamics/Flow Path (CVH/FP) module. In the P2F method, a parameterized Node-Assigned PINN (NA-PINN) accepts the water-level difference, initial velocity, and time as inputs, learning a solution manifold so that a single trained network serves as a data-free surrogate for the momentum conservation equation across all flow paths without retraining. This PINN is coupled with a finite difference method (FDM) solver that advances the mass conservation equation at each time step, ensuring exact discrete mass conservation while replacing the iterative nonlinear momentum solve with a single forward pass. Verification on a six-tank gravity-driven draining scenario yields a water level mean absolute error of $7.85 \times 10^{-5}$ m and a velocity mean absolute error of $3.21 \times 10^{-3}$ m/s under the nominal condition with $\Delta t = 1.0$ s. The framework maintains consistent accuracy across time steps ranging from 0.2 to 1.0 s and generalizes to five distinct initial conditions, all without retraining or simulation data. This work introduces a numerical coupling methodology for integrating parameterized PINNs with FDM within a nuclear thermal-hydraulic system code framework.
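To make the hybrid stepping concrete, here is a minimal sketch in which a tiny (untrained, randomly initialized) network stands in for the NA-PINN velocity surrogate while an explicit finite-difference mass balance advances two tank levels. The tank geometry, network shape, and parameter names are assumptions, not the paper's MELCOR coupling; the point is that mass is conserved exactly regardless of the surrogate's accuracy.

```python
import numpy as np

def velocity_surrogate(dh, v0, t, W1, b1, W2, b2):
    """Stand-in for the parameterized NA-PINN: a tiny MLP mapping
    (level difference, initial velocity, time) -> flow velocity."""
    x = np.array([dh, v0, t])
    h = np.tanh(W1 @ x + b1)
    return float((W2 @ h + b2)[0])

rng = np.random.default_rng(1)
W1, b1 = rng.normal(size=(8, 3)), np.zeros(8)
W2, b2 = rng.normal(size=(1, 8)), np.zeros(1)

# Two connected tanks: explicit FDM mass balance, exact by construction.
area, a_pipe, dt = 1.0, 0.01, 1.0
h = np.array([2.0, 0.5])          # water levels [m]
v = 0.0
for step in range(10):
    v = velocity_surrogate(h[0] - h[1], v, step * dt, W1, b1, W2, b2)
    q = a_pipe * v                # volumetric flow [m^3/s]
    h[0] -= q * dt / area         # discrete mass conservation:
    h[1] += q * dt / area         # what leaves tank 0 enters tank 1
print(h, h.sum())                  # total volume is preserved exactly
```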
【10】Low-Rank Compression of Pretrained Models via Randomized Subspace Iteration
标题:通过随机子空间迭代对预训练模型进行低秩压缩
链接:https://arxiv.org/abs/2604.02659
作者:Farhad Pourkamali-Anaraki
备注:13 pages
摘要:大规模的预训练模型使得有效的压缩对于实际部署至关重要。基于奇异值分解(SVD)的低秩分解为模型降阶提供了一种原则性的方法,但对于大的权矩阵,其精确计算是昂贵的。随机SVD(RSVD)等随机化替代方案提高了效率,但当奇异值谱缓慢衰减时,它们可能会受到近似质量差的影响,这是现代预训练模型中常见的一种情况。在这项工作中,我们从理论和经验的角度来解决这个限制。首先,我们通过分析softmax扰动,建立了低秩近似误差和预测性能之间的联系,表明类概率的偏差由压缩权重的谱误差控制。其次,我们证明了RSVD是不够的,我们提出了随机子空间迭代(RSI)作为一个更有效的替代方案。通过合并多个幂迭代,RSI改善了频谱分离,并提供了一个可控的机制,以提高近似质量。我们在卷积网络和基于transformer的架构上评估了我们的方法。我们的研究结果表明,RSI实现了接近最佳的近似质量,同时在积极压缩下的预测精度优于RSVD,从而实现了有效的模型压缩。
摘要:The massive scale of pretrained models has made efficient compression essential for practical deployment. Low-rank decomposition based on the singular value decomposition (SVD) provides a principled approach for model reduction, but its exact computation is expensive for large weight matrices. Randomized alternatives such as randomized SVD (RSVD) improve efficiency, yet they can suffer from poor approximation quality when the singular value spectrum decays slowly, a regime commonly observed in modern pretrained models. In this work, we address this limitation from both theoretical and empirical perspectives. First, we establish a connection between low-rank approximation error and predictive performance by analyzing softmax perturbations, showing that deviations in class probabilities are controlled by the spectral error of the compressed weights. Second, we demonstrate that RSVD is inadequate, and we propose randomized subspace iteration (RSI) as a more effective alternative. By incorporating multiple power iterations, RSI improves spectral separation and provides a controllable mechanism for enhancing approximation quality. We evaluate our approach on both convolutional networks and transformer-based architectures. Our results show that RSI achieves near-optimal approximation quality while outperforming RSVD in predictive accuracy under aggressive compression, enabling efficient model compression.
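The abstract's core algorithm, randomized subspace iteration, is standard; a minimal NumPy sketch (with illustrative rank, oversampling, and power-iteration counts) looks like this:

```python
import numpy as np

def randomized_subspace_iteration(W, rank, n_power=2, oversample=10, seed=0):
    """Low-rank approximation of W via randomized subspace iteration.
    Power iterations sharpen the spectral separation, which matters when
    the singular values decay slowly (the regime discussed above)."""
    rng = np.random.default_rng(seed)
    m, n = W.shape
    Y = W @ rng.normal(size=(n, rank + oversample))
    Q, _ = np.linalg.qr(Y)
    for _ in range(n_power):
        Q, _ = np.linalg.qr(W.T @ Q)   # re-orthonormalize for stability
        Q, _ = np.linalg.qr(W @ Q)
    B = Q.T @ W
    U_b, s, Vt = np.linalg.svd(B, full_matrices=False)
    U = Q @ U_b
    return U[:, :rank], s[:rank], Vt[:rank]

W = np.random.default_rng(2).normal(size=(512, 256))
U, s, Vt = randomized_subspace_iteration(W, rank=32)
W_hat = (U * s) @ Vt
print(np.linalg.norm(W - W_hat) / np.linalg.norm(W))  # relative error
```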
【11】WGFINNs: Weak formulation-based GENERIC formalism informed neural networks
标题:WGFINNs:基于弱形式的GENERIC形式主义信息神经网络
链接:https://arxiv.org/abs/2604.02601
作者:Jun Sur Richard Park,Auroni Huque Hashim,Siu Wun Cheung,Youngsoo Choi,Yeonjong Shin
摘要:从噪声观测中以数据驱动方式发现控制方程仍然是科学机器学习中的一个基本挑战。虽然GENERIC形式主义信息神经网络(GFINNs)提供了一个通过构造强制满足热力学定律的原则性框架,但其对强形式损失的依赖使它们对测量噪声高度敏感。为了解决这一局限,我们提出了基于弱形式的GENERIC形式主义信息神经网络(WGFINNs),它将动力系统的弱形式与GFINNs的结构保持架构相结合。WGFINNs显著增强了对噪声数据的鲁棒性,同时保留了对GENERIC退化条件和对称条件的精确满足。我们进一步引入了按状态加权的损失和基于残差的注意力机制,以缓解状态变量之间的尺度不平衡。理论分析对比了强形式与弱形式估计量之间的定量差异:主要结论是,在噪声存在时,强形式估计量随时间步长减小而发散,而当测试函数满足一定条件时,弱形式估计量即使在含噪数据下也能保持准确。数值实验表明,WGFINNs在不同噪声水平下始终优于GFINNs,实现了更准确的预测和对物理量更可靠的恢复。
摘要:Data-driven discovery of governing equations from noisy observations remains a fundamental challenge in scientific machine learning. While GENERIC formalism informed neural networks (GFINNs) provide a principled framework that enforces the laws of thermodynamics by construction, their reliance on strong-form loss formulations makes them highly sensitive to measurement noise. To address this limitation, we propose weak formulation-based GENERIC formalism informed neural networks (WGFINNs), which integrate the weak formulation of dynamical systems with the structure-preserving architecture of GFINNs. WGFINNs significantly enhance robustness to noisy data while retaining exact satisfaction of GENERIC degeneracy and symmetry conditions. We further incorporate a state-wise weighted loss and a residual-based attention mechanism to mitigate scale imbalance across state variables. Theoretical analysis contrasts quantitative differences between the strong-form and the weak-form estimators. Mainly, the strong-form estimator diverges as the time step decreases in the presence of noise, while the weak-form estimator can be accurate even with noisy data if test functions satisfy certain conditions. Numerical experiments demonstrate that WGFINNs consistently outperform GFINNs at varying noise levels, achieving more accurate predictions and reliable recovery of physical quantities.
【12】ROMAN: A Multiscale Routing Operator for Convolutional Time Series Models
标题:ROMAN:卷积时间序列模型的多尺度路由算子
链接:https://arxiv.org/abs/2604.02577
作者:Gonzalo Uribarri
备注:16 pages, appendix, 4 figures, 3 tables
摘要:我们引入ROMAN(ROuting Multiscale representAtioN),一种时间序列的确定性算子,它在减少序列长度的同时,将时间尺度和粗略时间位置映射到显式的通道结构中。ROMAN构建一个抗锯齿的多尺度金字塔,从每个尺度提取固定长度的窗口,并将它们堆叠为伪通道,从而产生一种标准卷积分类器可以直接处理的紧凑表示。通过这种方式,ROMAN提供了一种控制下游模型归纳偏置的简单机制:它可以降低时间不变性,使时间池化隐式地感知粗略位置,并通过通道混合暴露多尺度交互,同时通常通过缩短处理的时间轴来提高计算效率。我们对ROMAN算子进行了形式化分析,然后通过测量其作为四个代表性卷积分类器(MiniRocket、MultiRocket、标准CNN分类器和全卷积网络(FCN)分类器)预处理步骤的影响,以两种互补的方式对其进行评估。首先,我们设计了分别隔离粗略位置感知、长程相关性、多尺度交互和完全位置不变性的合成时间序列分类任务,表明ROMAN的行为与其预期机制一致,并且当类别信息依赖于标准池化卷积往往会抑制的时间结构时最为有用。其次,我们在UCR和UEA档案的长序列子集上对带有和不带ROMAN的相同模型进行基准测试,表明ROMAN提供了一种实际有用的替代表示:它对准确性的影响与任务相关,而对效率的影响通常是有利的。代码可在https://github.com/gon-uri/ROMAN上获得
摘要:We introduce ROMAN (ROuting Multiscale representAtioN), a deterministic operator for time series that maps temporal scale and coarse temporal position into an explicit channel structure while reducing sequence length. ROMAN builds an anti-aliased multiscale pyramid, extracts fixed-length windows from each scale, and stacks them as pseudochannels, yielding a compact representation on which standard convolutional classifiers can operate. In this way, ROMAN provides a simple mechanism to control the inductive bias of downstream models: it can reduce temporal invariance, make temporal pooling implicitly coarse-position-aware, and expose multiscale interactions through channel mixing, while often improving computational efficiency by shortening the processed time axis. We formally analyze the ROMAN operator and then evaluate it in two complementary ways by measuring its impact as a preprocessing step for four representative convolutional classifiers: MiniRocket, MultiRocket, a standard CNN-based classifier, and a fully convolutional network (FCN) classifier. First, we design synthetic time series classification tasks that isolate coarse position awareness, long-range correlation, multiscale interaction, and full positional invariance, showing that ROMAN behaves consistently with its intended mechanism and is most useful when class information depends on temporal structure that standard pooled convolution tends to suppress. Second, we benchmark the same models with and without ROMAN on long-sequence subsets of the UCR and UEA archives, showing that ROMAN provides a practically useful alternative representation whose effect on accuracy is task-dependent, but whose effect on efficiency is often favorable. Code is available at https://github.com/gon-uri/ROMAN
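A toy sketch of the operator's structure, assuming a simple 3-tap low-pass filter for anti-aliasing and stride-2 decimation; the actual ROMAN windowing and filtering details may differ:

```python
import numpy as np

def roman_like(x, n_scales=3, window=64):
    """Toy version of the multiscale routing idea: build an anti-aliased
    pyramid (smooth, then decimate), cut a fixed-length window from each
    scale, and stack the windows as pseudochannels."""
    kernel = np.array([0.25, 0.5, 0.25])      # simple low-pass filter
    channels, cur = [], x
    for _ in range(n_scales):
        channels.append(cur[:window])          # fixed-length window
        smoothed = np.convolve(cur, kernel, mode="same")
        cur = smoothed[::2]                    # decimate by 2
    return np.stack(channels)                  # (n_scales, window)

rng = np.random.default_rng(3)
x = np.sin(np.linspace(0, 20 * np.pi, 1024)) + 0.1 * rng.normal(size=1024)
feats = roman_like(x)
print(feats.shape)   # (3, 64): time axis shortened, scale -> channels
```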
【13】Communication-Efficient Distributed Learning with Differential Privacy
标题:具有差分隐私的通信高效分布式学习
链接:https://arxiv.org/abs/2604.02558
作者:Xiaoxing Ren,Yuwen Ma,Nicola Bastianello,Karl H. Johansson,Thomas Parisini,Andreas A. Malikopoulos
摘要:我们研究无向网络上的非凸学习问题。特别是,我们专注于设计一种既通信高效又能保证各代理数据隐私的算法这一挑战。第一个目标通过减少通信频率的本地训练方法实现。第二个目标通过在本地训练期间扰动梯度实现,具体而言是通过梯度裁剪和加性噪声。我们证明了所得算法收敛到距问题驻点有界距离之内。此外,我们在差分隐私框架内提供理论隐私保证,确保无法从网络上共享的训练模型中推断出代理的训练数据。与最先进的方法相比,在相同的隐私预算下,我们展示了该算法在分类任务上的优越性能。
摘要:We address nonconvex learning problems over undirected networks. In particular, we focus on the challenge of designing an algorithm that is both communication-efficient and that guarantees the privacy of the agents' data. The first goal is achieved through a local training approach, which reduces communication frequency. The second goal is achieved by perturbing gradients during local training, specifically through gradient clipping and additive noise. We prove that the resulting algorithm converges to a stationary point of the problem within a bounded distance. Additionally, we provide theoretical privacy guarantees within a differential privacy framework that ensure agents' training data cannot be inferred from the trained model shared over the network. We show the algorithm's superior performance on a classification task under the same privacy budget, compared with state-of-the-art methods.
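The gradient perturbation mechanism described above (clipping plus additive Gaussian noise during local training) can be sketched as follows; the objective, step size, and noise multiplier are illustrative, and the paper's consensus step and privacy accounting are not reproduced:

```python
import numpy as np

def private_local_step(w, grad_fn, lr=0.1, clip=1.0, noise_mult=1.0, seed=None):
    """One differentially private local update: clip the gradient to
    norm <= clip, then add Gaussian noise scaled to the clipping bound."""
    rng = np.random.default_rng(seed)
    g = grad_fn(w)
    g = g * min(1.0, clip / (np.linalg.norm(g) + 1e-12))       # clipping
    g = g + rng.normal(scale=noise_mult * clip, size=g.shape)  # noise
    return w - lr * g

# Toy quadratic objective on one agent; several local steps between
# (not shown) communication rounds reduce communication frequency.
grad_fn = lambda w: 2.0 * (w - np.ones_like(w))
w = np.zeros(5)
for k in range(20):
    w = private_local_step(w, grad_fn, seed=k)
print(w)
```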
【14】Do We Need Frontier Models to Verify Mathematical Proofs?
标题:我们需要前沿模型来验证数学证明吗?
链接:https://arxiv.org/abs/2604.02450
作者:Aaditya Naik,Guruprerana Shabadi,Rajeev Alur,Mayur Naik
备注:21 pages, 11 figures
摘要:训练、后训练和推理时方法的进步使前沿推理模型能够在数学竞赛中赢得金牌并解决具有挑战性的开放问题。要信任这些模型的回答,需要对自然语言证明进行错误检查。为满足评估此类证明日益增长的需求,LLM评判器正被越来越多地采用。虽然验证通常被认为比生成更容易,但可靠的验证实际上需要什么样的模型能力?我们在由人工评分的竞赛级问题自然语言证明数据集上,系统地评估了四个开源LLM和两个前沿LLM。我们考虑两个关键指标:验证器准确率和自我一致性(对同一证明重复判断的一致率)。我们观察到,较小的开源模型在准确率上仅落后前沿模型约10%,但其不一致程度最多高出约25%。此外,我们发现所有模型的验证器准确率都对提示词的选择敏感。随后我们证明,较小的模型实际上确实具备在前沿模型水平上验证证明的数学能力,但使用通用的评判提示词难以可靠地激发这些能力。通过LLM引导的提示词搜索,我们合成了一组克服较小模型特定失败模式的专门提示词,使其准确率最多提升9.1%,自我一致性最多提升15.9%。这些增益在多个模型和数据集上均可实现,使Qwen3.5-35B这样的模型在证明验证上可与Gemini 3.1 Pro等前沿模型相媲美。
摘要:Advances in training, post-training, and inference-time methods have enabled frontier reasoning models to win gold medals in math competitions and settle challenging open problems. Gaining trust in the responses of these models requires that natural language proofs be checked for errors. LLM judges are increasingly being adopted to meet the growing demand for evaluating such proofs. While verification is considered easier than generation, what model capability does reliable verification actually require? We systematically evaluate four open-source and two frontier LLMs on datasets of human-graded natural language proofs of competition-level problems. We consider two key metrics: verifier accuracy and self-consistency (the rate of agreement across repeated judgments on the same proof). We observe that smaller open-source models are only up to ~10% behind frontier models in accuracy but they are up to ~25% more inconsistent. Furthermore, we see that verifier accuracy is sensitive to prompt choice across all models. We then demonstrate that the smaller models, in fact, do possess the mathematical capabilities to verify proofs at the level of frontier models, but they struggle to reliably elicit these capabilities with general judging prompts. Through an LLM-guided prompt search, we synthesize an ensemble of specialized prompts that overcome the specific failure modes of smaller models, boosting their performance by up to 9.1% in accuracy and 15.9% in self-consistency. These gains are realized across models and datasets, allowing models like Qwen3.5-35B to perform on par with frontier models such as Gemini 3.1 Pro for proof verification.
【15】Photonic convolutional neural network with pre-trained in-situ training
标题:采用预训练加原位训练的光子卷积神经网络
链接:https://arxiv.org/abs/2604.02429
作者:Saurabh Ranjan,Sonika Thakral,Amit Sehgal
备注:7 pages, 3 figures, 4 tables
摘要:光子计算是一种极具潜力的计算范式,有望克服电子冯·诺依曼体系结构的能量瓶颈。吞吐量和功耗是互补金属氧化物半导体(CMOS)芯片的基本限制,而卷积神经网络(CNN)正在彻底改变机器学习、计算机视觉和其他基于图像的应用。在这项工作中,我们提出并验证了一个完全在光域中执行MNIST图像分类的全光子卷积神经网络(PCNN),实现了94%的测试准确率。与依赖光-电-光(O/E/O)频繁转换的现有架构不同,我们的系统利用马赫-曾德尔干涉仪(MZI)网格、波分复用(WDM)池化和基于微环谐振器的非线性来保持相干处理。最大池化单元完全在硅光子学上实现,不需要光电或电学转换。为了克服训练物理移相器参数的挑战,我们引入了一种混合训练方法:先部署一个数学上精确可微的数字孪生进行非原位反向传播,然后通过同步扰动随机近似(SPSA)算法进行原位微调。我们的评估表明,该系统对热串扰具有显著的鲁棒性(在严重耦合下精度仅下降0.43%),并且在单图像推理方面,其能效比最先进的电子GPU高出100至242倍。
摘要:Photonic computing is a computing paradigm that has great potential to overcome the energy bottlenecks of the electronic von Neumann architecture. Throughput and power consumption are fundamental limitations of complementary-metal-oxide-semiconductor (CMOS) chips, while convolutional neural networks (CNNs) are revolutionising machine learning, computer vision and other image based applications. In this work, we propose and validate a fully photonic convolutional neural network (PCNN) that performs MNIST image classification entirely in the optical domain, achieving 94 percent test accuracy. Unlike existing architectures that rely on frequent in-between conversions from optical to electrical and back to optical (O/E/O), our system maintains coherent processing utilizing Mach-Zehnder interferometer (MZI) meshes, wavelength-division multiplexed (WDM) pooling, and microring resonator-based nonlinearities. The max pooling unit is fully implemented on silicon photonics, which does not require opto-electrical or electrical conversions. To overcome the challenges of training physical phase shifter parameters, we introduce a hybrid training methodology deploying a mathematically exact differentiable digital twin for ex-situ backpropagation, followed by in-situ fine-tuning via the Simultaneous Perturbation Stochastic Approximation (SPSA) algorithm. Our evaluation demonstrates significant robustness to thermal crosstalk (only 0.43 percent accuracy degradation at severe coupling) and achieves 100 to 242 times better energy efficiency than state-of-the-art electronic GPUs for single-image inference.
【16】Modeling and Controlling Deployment Reliability under Temporal Distribution Shift
标题:时间分布变化下的部署可靠性建模与控制
链接:https://arxiv.org/abs/2604.02351
作者:Naimur Rahman,Naazreen Tabassum
备注:19 pages, 5 figures, 7 tables. Empirical study on temporally indexed credit-risk dataset (1.35M samples, 2007-2018)
摘要:部署在非平稳环境中的机器学习模型会受到时间分布偏移的影响,这会随着时间的推移削弱预测可靠性。虽然定期再训练和重新校准等常见缓解策略旨在保持性能,但它们通常关注在孤立时间点评估的平均指标,并没有显式建模部署期间可靠性如何演变。 我们提出了一个以部署为中心的框架,将可靠性视为由判别力和校准组成的动态状态。该状态在连续评估窗口上的轨迹诱导出一种可度量的波动性概念,使部署适应可以被表述为一个多目标控制问题,在可靠性稳定性与累积干预成本之间进行权衡。 在该框架内,我们定义了一族依赖于状态的干预策略,并对由此产生的成本-波动性帕累托前沿进行了实证刻画。在一个大规模的、带时间索引的信用风险数据集(135万笔贷款,2007-2018年)上的实验表明,选择性的、由漂移触发的干预可以实现比连续滚动再训练更平滑的可靠性轨迹,同时大幅降低运营成本。 这些发现将时间偏移下的部署可靠性定位为一个可控的多目标系统,并突出了策略设计在高风险表格应用中塑造稳定性-成本权衡的作用。
摘要:Machine learning models deployed in non-stationary environments are exposed to temporal distribution shift, which can erode predictive reliability over time. While common mitigation strategies such as periodic retraining and recalibration aim to preserve performance, they typically focus on average metrics evaluated at isolated time points and do not explicitly model how reliability evolves during deployment. We propose a deployment-centric framework that treats reliability as a dynamic state composed of discrimination and calibration. The trajectory of this state across sequential evaluation windows induces a measurable notion of volatility, allowing deployment adaptation to be formulated as a multi-objective control problem that balances reliability stability against cumulative intervention cost. Within this framework, we define a family of state-dependent intervention policies and empirically characterize the resulting cost-volatility Pareto frontier. Experiments on a large-scale, temporally indexed credit-risk dataset (1.35M loans, 2007-2018) show that selective, drift-triggered interventions can achieve smoother reliability trajectories than continuous rolling retraining while substantially reducing operational cost. These findings position deployment reliability under temporal shift as a controllable multi-objective system and highlight the role of policy design in shaping stability-cost trade-offs in high-stakes tabular applications.
【17】SIEVE: Sample-Efficient Parametric Learning from Natural Language
标题:SIEVE:从自然语言进行样本高效的参数学习
链接:https://arxiv.org/abs/2604.02339
作者:Parth Asawa,Alexandros G. Dimakis,Matei Zaharia
摘要:自然语言上下文(如指令、知识或反馈)包含可用于调整语言模型的丰富信号。虽然上下文学习通过提示提供自适应,但参数学习会持久化到模型权重中并可进一步提升性能,只是其对数据需求量大,并且严重依赖高质量轨迹或自动验证器。我们提出了SIEVE,一种从自然语言上下文中进行样本高效参数学习的方法,只需三个查询示例。SIEVE使用了一种新的合成数据生成管道SIEVE-GEN,它利用了上下文可分解这一洞察。分解上下文使我们能够将合成查询仅与适用的上下文而非全部上下文配对,从而生成更高质量的推演(rollouts),然后使用上下文蒸馏将上下文内化到模型中。我们在需要上下文的推理场景中进行评估,包括自定义领域以及RuleArena和Machine Translation from One Book任务。我们的结果表明,SIEVE仅使用三个查询示例就优于先前的上下文蒸馏方法,展示了如何从自然语言实现样本高效的参数学习。
摘要:Natural language context-such as instructions, knowledge, or feedback-contains rich signal for adapting language models. While in-context learning provides adaptation via the prompt, parametric learning persists into model weights and can improve performance further, though is data hungry and heavily relies on either high-quality traces or automated verifiers. We propose SIEVE, a method for sample-efficient parametric learning from natural language context that requires as few as three query examples. SIEVE uses a novel synthetic data generation pipeline, SIEVE-GEN, that leverages the insight that context is decomposable. Decomposing context allows us to generate higher quality rollouts by pairing synthetic queries with only the applicable context rather than the entirety, then using context distillation to internalize context into the model. We evaluate in reasoning settings where context is necessary, including custom domains and the RuleArena and Machine Translation from One Book tasks. Our results show that SIEVE outperforms prior context distillation methods using just three query examples, demonstrating how to achieve sample-efficient parametric learning from natural language.
【18】LiME: Lightweight Mixture of Experts for Efficient Multimodal Multi-task Learning
标题:LiME:用于高效多模态多任务学习的轻量级专家混合
链接:https://arxiv.org/abs/2604.02338
作者:Md Kowsher,Haris Mansoor,Nusrat Jahan Prottasha,Ozlem Garibay,Victor Zhu,Zhengping Ji,Chen Chen
摘要:MoE-PEFT方法将专家混合与参数高效微调相结合以实现多任务适应,但每个专家需要单独的适配器,导致可训练参数随专家数量线性增长,并将适用范围局限于基于适配器的架构。我们提出了LiME(轻量级专家混合),它通过轻量级调制而非复制适配器来实现专家专业化。LiME不使用单独的适配器,而是使用单个共享PEFT模块,并用轻量级专家向量调制其输出,在减少专家参数的同时可推广到任何PEFT方法。值得注意的是,LiME通过利用现有的冻结表示和适应后表示引入了零参数路由,消除了通常每层所需的可学习路由器参数。理论上,我们证明了(i)更多的专家保留更多与任务相关的信息,以及(ii)调制能以有界误差近似完整的专家特定PEFT。LiME还结合了n-gram窗口路由和基于路由置信度的自适应专家选择(Auto Top-K)。在MMT-47(一个包含47个跨文本、图像和视频任务的多模态多任务基准)上的实验表明,与相应的MoE-PEFT基线相比,LiME在可训练参数最多减少4倍、训练速度最多加快29%的同时,实现了有竞争力或更优的性能。
摘要:MoE-PEFT methods combine Mixture of Experts with parameter-efficient fine-tuning for multi-task adaptation, but require separate adapters per expert causing trainable parameters to scale linearly with expert count and limiting applicability to adapter-based architectures. We propose LiME (Lightweight Mixture of Experts), which achieves expert specialization through lightweight modulation rather than adapter replication. Instead of separate adapters, LiME uses a single shared PEFT module and modulates its output with lightweight expert vectors, reducing expert parameters while generalizing to any PEFT method. Notably, LiME introduces zero-parameter routing by leveraging existing frozen and adapted representations eliminating learned router parameters typically required per layer. Theoretically, we prove that (i) more experts preserve more task-relevant information and (ii) modulation approximates full expert-specific PEFT with bounded error. LiME further incorporates n-gram windowed routing and adaptive expert selection (Auto Top-K) based on routing confidence. Experiments on MMT-47, a multimodal multi-task benchmark with 47 tasks spanning text, image, and video, demonstrate that LiME achieves competitive or superior performance while using up to 4x fewer trainable parameters and up to 29% faster training compared to corresponding MoE-PEFT baselines.
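A minimal PyTorch sketch of the modulation idea: one shared low-rank adapter whose output is scaled by a gated mixture of per-expert vectors. The learned softmax router here is a placeholder assumption; the paper's zero-parameter routing, n-gram windowing, and Auto Top-K are not reproduced.

```python
import torch
import torch.nn as nn

class LiMELayer(nn.Module):
    """Illustrative LiME-style layer: a single shared low-rank adapter
    is modulated by per-expert vectors instead of replicating adapters."""
    def __init__(self, dim, rank=8, n_experts=4):
        super().__init__()
        self.down = nn.Linear(dim, rank, bias=False)   # shared adapter
        self.up = nn.Linear(rank, dim, bias=False)
        self.expert_vecs = nn.Parameter(torch.ones(n_experts, dim))
        self.router = nn.Linear(dim, n_experts, bias=False)  # placeholder

    def forward(self, x):                      # x: (batch, dim)
        gate = self.router(x).softmax(dim=-1)  # (batch, n_experts)
        mod = gate @ self.expert_vecs          # mixture of expert vectors
        return x + mod * self.up(self.down(x)) # modulated shared adapter

layer = LiMELayer(dim=32)
print(layer(torch.randn(4, 32)).shape)
```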
【19】Learning interacting particle systems from unlabeled data
标题:从未标记数据中学习相互作用粒子系统
链接:https://arxiv.org/abs/2604.02581
作者:Viska Wei,Fei Lu
备注:39 pages, 7 figures
摘要:学习相互作用粒子系统的势函数是各科学学科中的一项基本任务。一个主要的挑战是,由于数据收集方式的限制或隐私约束,在离散时间点收集的未标记数据缺乏轨迹信息。我们通过引入一种无轨迹的自测试损失函数来应对这一挑战,该函数利用了经验分布的弱形式随机演化方程。该损失函数关于势函数是二次的,支持参数与非参数回归算法,可进行鲁棒估计并扩展到具有大数据的大型高维系统。系统的数值试验表明,我们的方法优于对通过标签匹配恢复的轨迹进行回归的基线方法,并能容忍较大的观测时间步长。我们建立了参数估计量随样本量增加的收敛性,为所提出的方法提供了理论基础。
摘要:Learning the potentials of interacting particle systems is a fundamental task across various scientific disciplines. A major challenge is that unlabeled data collected at discrete time points lack trajectory information due to limitations in data collection methods or privacy constraints. We address this challenge by introducing a trajectory-free self-test loss function that leverages the weak-form stochastic evolution equation of the empirical distribution. The loss function is quadratic in potentials, supporting parametric and nonparametric regression algorithms for robust estimation that scale to large, high-dimensional systems with big data. Systematic numerical tests show that our method outperforms baseline methods that regress on trajectories recovered via label matching, tolerating large observation time steps. We establish the convergence of parametric estimators as the sample size increases, providing a theoretical foundation for the proposed approach.
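A simplified 1-D, non-interacting instance of the weak-form idea, assuming a known diffusion coefficient and the parametric drift $-\theta x$: snapshots are shuffled so no trajectory labels exist, yet the test-function identity still recovers $\theta$ (approximately, up to discretization bias) by least squares. The paper's interacting, nonparametric setting is substantially more general.

```python
import numpy as np

# Simulate unlabeled snapshots of dX = -theta * X dt + sigma dW,
# observed only as particle clouds (order shuffled each snapshot).
rng = np.random.default_rng(4)
theta_true, sigma, dt, n = 1.5, 0.5, 0.05, 20000
X = rng.normal(size=n)
snaps = [X.copy()]
for _ in range(40):
    X = X - theta_true * X * dt + sigma * np.sqrt(dt) * rng.normal(size=n)
    snaps.append(rng.permutation(X))   # shuffle: no trajectory labels

# Weak form with test function phi(x) = x^2 and V(x) = x^2 / 2:
# d/dt E[phi] = -theta * E[V'(x) phi'(x)] + (sigma^2/2) E[phi''(x)]
#             = -2 theta E[x^2] + sigma^2
lhs, rhs = [], []
for a, b in zip(snaps[:-1], snaps[1:]):
    lhs.append((np.mean(b**2) - np.mean(a**2)) / dt - sigma**2)
    rhs.append(-2.0 * np.mean(a**2))
theta_hat = np.dot(rhs, lhs) / np.dot(rhs, rhs)   # 1-D least squares
print(theta_true, theta_hat)
```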
其他(29篇)
【1】Gradient Boosting within a Single Attention Layer
标题:单一注意力层内的梯度提升
链接:https://arxiv.org/abs/2604.03190
作者:Saleh Sargolzaei
摘要:Transformer注意力计算值的单个softmax加权平均——这是一种无法纠正自身错误的单遍估计。我们引入了梯度提升注意力,它在单个注意力层内应用梯度提升的原理:第二次注意力传递带有自己学习的投影,关注第一次传递的预测误差并施加门控校正。在平方重建目标下,该构造映射到Friedman的梯度提升机上,每次注意力传递作为基学习器,逐维度的门作为收缩参数。我们表明,单次Hopfield式更新会擦除与存储模式子空间正交的所有查询信息,并且在局部收缩下的进一步迭代可能使同一区域内不同的查询坍缩到同一个不动点。我们还表明,校正传递使用独立的投影可以恢复Tukey twicing的共享投影方法无法获取的残差信息。在WikiText-103的10M令牌子集上,梯度提升注意力实现了$67.9$的测试困惑度,而标准注意力为$72.2$,Twicing注意力为$69.6$,参数匹配的更宽基线为$69.0$,其中两轮提升已获得大部分收益。
摘要:We introduce gradient-boosted attention, which applies the principle of gradient boosting within a single attention layer: a second attention pass, with its own learned projections, attends to the prediction error of the first and applies a gated correction. Under a squared reconstruction objective, the construction maps onto Friedman's gradient boosting machine, with each attention pass as a base learner and the per-dimension gate as the shrinkage parameter. We show that a single Hopfield-style update erases all query information orthogonal to the stored-pattern subspace, and that further iteration under local contraction can collapse distinct queries in the same region to the same fixed point. We also show that separate projections for the correction pass can recover residual information inaccessible to the shared-projection approach of Tukey's twicing. On a 10M-token subset of WikiText-103, gradient-boosted attention achieves a test perplexity of $67.9$ compared to $72.2$ for standard attention, $69.6$ for Twicing Attention, and $69.0$ for a parameter-matched wider baseline, with two rounds capturing most of the benefit.
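A PyTorch sketch of the two-pass structure: the second pass has its own projections, attends to a residual, and adds a per-dimension gated (shrinkage-like) correction. Defining the residual as x minus the first pass's output is an assumption for illustration; the paper's reconstruction objective fixes the exact form.

```python
import torch
import torch.nn as nn

class GradientBoostedAttention(nn.Module):
    """Two-pass attention sketch: pass 2 has its own projections, attends
    to pass 1's (assumed) reconstruction residual, and adds a gated
    correction whose per-dimension gate plays the shrinkage role."""
    def __init__(self, dim):
        super().__init__()
        self.q1, self.k1, self.v1 = (nn.Linear(dim, dim) for _ in range(3))
        self.q2, self.k2, self.v2 = (nn.Linear(dim, dim) for _ in range(3))
        self.gate = nn.Parameter(torch.zeros(dim))  # sigmoid(0) = 0.5
        self.scale = dim ** -0.5

    def attend(self, q, k, v):
        return torch.softmax(q @ k.transpose(-2, -1) * self.scale, -1) @ v

    def forward(self, x):                      # x: (batch, seq, dim)
        out1 = self.attend(self.q1(x), self.k1(x), self.v1(x))
        resid = x - out1                       # assumed residual definition
        corr = self.attend(self.q2(x), self.k2(resid), self.v2(resid))
        return out1 + torch.sigmoid(self.gate) * corr

attn = GradientBoostedAttention(dim=16)
print(attn(torch.randn(2, 10, 16)).shape)
```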
【2】HyperFitS -- Hypernetwork Fitting Spectra for metabolic quantification of ${}^1$H MR spectroscopic imaging
标题:HyperFitS -- 用于${}^1$H磁共振波谱成像代谢定量的超网络光谱拟合
链接:https://arxiv.org/abs/2604.03150
作者:Paul J. Weiser,Gulnur Ungan,Amirmohammad Shamaei,Georg Langs,Wolfgang Bogner,Malte Hoffmann,Antoine Klauser,Ovidiu C. Andronesi
摘要:目的:质子磁共振波谱成像($^1$H MRSI)能够在体内绘制全脑代谢物浓度。然而,其临床适用性的一个长期存在的问题是代谢定量,这可能需要大量的时间进行光谱拟合。最近,深度学习方法已经能够在几秒钟内提供全脑代谢量化。然而,神经网络实现通常缺乏可配置性,并且需要重新训练以改变预定义的参数设置。研究方法:我们引入了HyperFitS,这是一个用于全脑$^1$H MRSI中代谢物量化的光谱拟合超网络,可灵活适应广泛的基线校正和水抑制因子。采用HyperFitS对在3T和7T下通过水抑制和水非抑制MRSI采集的各向同性分辨率为10 mm、3.4 mm和2 mm的人类受试者代谢物图谱进行定量,并与传统LCModel拟合进行比较。结果:代谢图显示新方法和金标准方法之间的基本一致性,HyperFitS的拟合时间明显更快。定量结果进一步强调了基线参数化对代谢定量的影响,这可能会改变高达30%的结果。结论:HyperFitS与最先进的传统方法高度一致,同时将处理时间从数小时缩短到几秒钟。与之前基于深度学习的光谱拟合方法相比,HyperFitS具有广泛的可配置性,并且可以适应使用多种协议和场强获取的数据质量,而无需重新训练。
摘要:Purpose: Proton magnetic resonance spectroscopic imaging ($^1$H MRSI) enables the mapping of whole-brain metabolites concentrations in-vivo. However, a long-standing problem for its clinical applicability is the metabolic quantification, which can require extensive time for spectral fitting. Recently, deep learning methods have been able to provide whole-brain metabolic quantification in only a few seconds. However, neural network implementations often lack configurability and require retraining to change predefined parameter settings. Methods: We introduce HyperFitS, a hypernetwork for spectral fitting for metabolite quantification in whole-brain $^1$H MRSI that flexibly adapts to a broad range of baseline corrections and water suppression factors. Metabolite maps of human subjects acquired at 3T and 7T with isotropic resolutions of 10 mm, 3.4 mm and 2 mm by water-suppressed and water-unsuppressed MRSI were quantified with HyperFitS and compared to conventional LCModel fitting. Results: Metabolic maps show a substantial agreement between the new and gold-standard methods, with significantly faster fitting times by HyperFitS. Quantitative results further highlight the impact of baseline parametrization on metabolic quantification, which can alter results by up to 30%. Conclusion: HyperFitS shows strong agreement with state-of-the-art conventional methods, while reducing processing times from hours to a few seconds. Compared to prior deep learning based spectral fitting methods, HyperFitS enables a wide range of configurability and can adapt to data quality acquired with multiple protocols and field strengths without retraining.
【3】Self-Distilled RLVR
标题:自蒸馏RLVR
链接:https://arxiv.org/abs/2604.03128
作者:Chenxu Yang,Chuanyu Qin,Qingyi Si,Minghui Chen,Naibin Gu,Dingyu Yao,Zheng Lin,Weiping Wang,Jiaqi Wang,Nan Duan
备注:Work in progress
摘要:同策略蒸馏(on-policy distillation, OPD)已成为LLM社区中流行的训练范式。这种范式选择一个更大的模型作为教师,为每条采样轨迹提供密集的细粒度信号;与之相对,具有可验证奖励的强化学习(RLVR)只能从环境中的可验证结果获得稀疏信号。最近,社区探索了同策略自蒸馏(OPSD),其中同一模型同时作为教师和学生,教师接收额外的特权信息(如参考答案)以实现自我进化。本文证明,仅来自特权教师的学习信号会导致严重的信息泄漏和不稳定的长期训练。因此,我们确定了自蒸馏的最佳定位,并提出了RLSD(RLVR with Self-Distillation)。具体来说,我们利用自蒸馏获得令牌级别的策略差异,以确定细粒度的更新幅度,同时继续使用RLVR从环境反馈(例如响应正确性)中获得可靠的更新方向。这使RLSD能够同时利用RLVR和OPSD的优势,实现更高的收敛上限和卓越的训练稳定性。
摘要:On-policy distillation (OPD) has become a popular training paradigm in the LLM community. This paradigm selects a larger model as the teacher to provide dense, fine-grained signals for each sampled trajectory, in contrast to reinforcement learning with verifiable rewards (RLVR), which only obtains sparse signals from verifiable outcomes in the environment. Recently, the community has explored on-policy self-distillation (OPSD), where the same model serves as both teacher and student, with the teacher receiving additional privileged information such as reference answers to enable self-evolution. This paper demonstrates that learning signals solely derived from the privileged teacher result in severe information leakage and unstable long-term training. Accordingly, we identify the optimal niche for self-distillation and propose RLSD (RLVR with Self-Distillation). Specifically, we leverage self-distillation to obtain token-level policy differences for determining fine-grained update magnitudes, while continuing to use RLVR to derive reliable update directions from environmental feedback (e.g., response correctness). This enables RLSD to simultaneously harness the strengths of both RLVR and OPSD, achieving a higher convergence ceiling and superior training stability.
【4】Co-Evolution of Policy and Internal Reward for Language Agents
标题:语言代理的策略与内部奖励的共同演化
链接:https://arxiv.org/abs/2604.03098
作者:Xinyu Wang,Hanwei Wu,Jingwei Song,Shuyuan Zhang,Jiayi Zhang,Fanqi Kong,Tung Sum Thomas Kwok,Xiao-Wen Chang,Yuyu Luo,Chenglin Wu,Bang Liu
备注:20 pages, 13 figures
摘要:大型语言模型(LLM)代理通过与环境交互来学习,但长时程训练仍然从根本上受到稀疏和延迟奖励的制约。现有方法通常通过事后信用分配或外部奖励模型来应对这一挑战,它们在推理时提供的指导有限,并且往往将奖励改进与策略改进分离。我们提出了Self-Guide,一种语言代理自生成的内部奖励,同时支持推理时指导和训练时监督。具体来说,代理在推理过程中使用Self-Guide作为简短的自我指导信号来引导下一个动作,并在训练过程中将同一信号转换为步骤级内部奖励,以实现更密集的策略优化。这形成了一个共同进化的循环:更好的策略产生更好的指导,而更好的指导作为内部奖励进一步改进策略。在三个代理基准上,推理时自我指导已经带来明显的收益,而使用GRPO联合演化策略和内部奖励,相比仅用环境奖励训练的基线带来了进一步的改进(8%)。总的来说,我们的结果表明,语言代理不仅可以通过收集更多经验来提升,还可以通过学习在行动和学习过程中生成并完善自己的内部奖励来提升。
摘要:Large language model (LLM) agents learn by interacting with environments, but long-horizon training remains fundamentally bottlenecked by sparse and delayed rewards. Existing methods typically address this challenge through post-hoc credit assignment or external reward models, which provide limited guidance at inference time and often separate reward improvement from policy improvement. We propose Self-Guide, a self-generated internal reward for language agents that supports both inference-time guidance and training-time supervision. Specifically, the agent uses Self-Guide as a short self-guidance signal to steer the next action during inference, and converts the same signal into step-level internal reward for denser policy optimization during training. This creates a co-evolving loop: better policy produces better guidance, and better guidance further improves policy as internal reward. Across three agent benchmarks, inference-time self-guidance already yields clear gains, while jointly evolving policy and internal reward with GRPO brings further improvements (8\%) over baselines trained solely with environment reward. Overall, our results suggest that language agents can improve not only by collecting more experience, but also by learning to generate and refine their own internal reward during acting and learning.
【5】SkillRT: Compiling Skills for Efficient Execution Everywhere
标题:SkillRT:编译技能以实现随处高效执行
链接:https://arxiv.org/abs/2604.03088
作者:Le Chen,Erhu Feng,Yubin Xia,Haibo Chen
摘要:LLM代理越来越多地采用技能作为可重用的组合单元。虽然技能在不同的代理平台上共享,但当前的系统将它们视为原始上下文,导致相同的技能在不同的代理中表现不一致。这种脆弱性破坏了技能的可移植性和执行效率。 为了应对这一挑战,我们分析了118,000种技能,并从传统的编译器设计中汲取灵感。我们将技能视为代码,将LLM视为异构处理器。为了使可移植性可行,我们将技能需求分解为一组基本功能,并测量每个模型-线束对支持它们的程度。基于这些能力配置文件,我们提出了SkillRT,一个编译和运行时系统设计的便携式和高效的技能执行。在编译时,SkillRT执行基于功能的编译、环境绑定和并发提取。在运行时,SkillRT应用JIT代码固化和自适应重新编译来优化性能。 我们评估SkillRT在八个不同规模的LLM和三个代理线束,涵盖SkillsBench和代表性的技能任务。结果表明,SkillRT显著提高了不同模型和环境中的任务完成率,同时将令牌消耗减少了40%。在性能方面,SkillRT通过增强的并行性实现了高达3.2倍的加速,并通过代码固化减少了19- 50倍的延迟。
摘要:LLM agents increasingly adopt skills as a reusable unit of composition. While skills are shared across diverse agent platforms, current systems treat them as raw context, causing the same skill to behave inconsistently for different agents. This fragility undermines skill portability and execution efficiency. To address this challenge, we analyze 118,000 skills and draw inspiration from traditional compiler design. We treat skills as code and LLMs as heterogeneous processors. To make portability actionable, we decompose a skill's requirements into a set of primitive capabilities, and measure how well each model-harness pair supports them. Based on these capability profiles, we propose SkillRT, a compilation and runtime system designed for portable and efficient skill execution. At compile time, SkillRT performs capability-based compilation, environment binding, and concurrency extraction. At runtime, SkillRT applies JIT code solidification and adaptive recompilation for performance optimization. We evaluate SkillRT across eight LLMs of varying scales and three agent harnesses, covering SkillsBench and representative skill tasks. Results demonstrate that SkillRT significantly improves task completion rates across different models and environments while reducing token consumption by up to 40%. In terms of performance, SkillRT achieves up to 3.2x speedup with enhanced parallelism, and 19-50x latency reduction through code solidification.
【6】Mitigating Reward Hacking in RLHF via Advantage Sign Robustness
标题:通过优势符号鲁棒性减轻RLHF中的奖励黑客攻击
链接:https://arxiv.org/abs/2604.02986
作者:Shinnosuke Ono,Johannes Ackermann,Soichiro Nishimori,Takashi Ishida,Masashi Sugiyama
备注:27 pages, 7 figures
摘要:用于从人类反馈进行强化学习(RLHF)的奖励模型(RM)容易受到奖励黑客攻击:当策略最大化学习到的代理奖励时,真实质量会停滞甚至下降。我们假设奖励黑客通常由翻转的优势符号引起:翻转的符号不会降低不良响应的可能性,反而会导致更新提高它。通过考虑RM参数空间中的对抗性扰动,我们可以推导出经认证的符号保持半径,即在策略优化期间能够翻转优势符号的最小扰动。基于这一表述,我们提出了符号认证策略优化(SignCert-PO),在策略梯度更新中降低非鲁棒完成(completion)的权重。与需要多个RM或访问RM训练数据的先前方法不同,SignCert-PO是轻量级的,仅使用RM参数和同策略完成,纯粹在策略优化阶段运行。在TL;DR摘要和AlpacaFarm基准上,SignCert-PO始终取得优于基线的胜率,并减少了奖励黑客现象。
摘要:Reward models (RMs) used in reinforcement learning from human feedback (RLHF) are vulnerable to reward hacking: as the policy maximizes a learned proxy reward, true quality plateaus or degrades. We make the assumption that reward hacking is often caused by flipped advantage signs: instead of reducing the likelihood of a bad response, a flipped sign causes the update to increase it. By considering an adversarial perturbation in the RM parameter space, we can derive a certified sign-preservation radius, which is the smallest perturbation that can flip the advantage sign during policy optimization. Based on this formulation, we propose Sign-Certified Policy Optimization (SignCert-PO), down-weighting non-robust completions in the policy gradient update. Unlike prior approaches that require multiple RMs or access to the RM training data, SignCert-PO is lightweight and operates purely at the policy optimization stage using only the RM parameters and on-policy completions. On TL;DR summarization and AlpacaFarm benchmarks, SignCert-PO consistently achieves a better win rate than baselines and reduces reward hacking.
【7】Towards Near-Real-Time Telemetry-Aware Routing with Neural Routing Algorithms
标题:利用神经路由算法实现近实时遥测感知路由
链接:https://arxiv.org/abs/2604.02927
作者:Andreas Boltres,Niklas Freymuth,Benjamin Schichtholz,Michael König,Gerhard Neumann
备注:Submitted to TMLR
摘要:路由算法对于有效的计算机网络操作至关重要,在许多设置中,它们必须能够在毫秒内对流量突发做出反应。实时遥测数据可以为路由算法提供信息信号,最近的工作已经训练神经网络利用这些信号进行流量感知路由。然而,聚合网络范围的信息受到通信延迟的影响,现有的神经方法要么假设不切实际的无延迟全局状态,要么将路由器限制为纯粹的本地遥测。这使得它们在现实环境中的可部署性不清楚。我们将遥测感知路由作为延迟感知闭环控制问题,并引入一个训练和评估神经路由算法的框架,同时明确建模通信和推理延迟。在这个框架之上,我们提出了LOGGIA,一个可扩展的图神经路由算法,预测属性拓扑和遥测图的日志空间链接权重。它利用数据驱动的预训练阶段,然后是基于策略的强化学习。在合成和真实网络拓扑以及看不见的混合TCP/UDP流量序列中,LOGGIA始终优于最短路径基线,而一旦实施实际延迟,神经基线就会失败。我们的实验进一步表明,像LOGGIA这样的神经路由算法在完全本地部署时表现最好,即,观察网络状态并在每个路由器上单独推断动作,而不是集中决策。
摘要:Routing algorithms are crucial for efficient computer network operations, and in many settings they must be able to react to traffic bursts within milliseconds. Live telemetry data can provide informative signals to routing algorithms, and recent work has trained neural networks to exploit such signals for traffic-aware routing. Yet, aggregating network-wide information is subject to communication delays, and existing neural approaches either assume unrealistic delay-free global states, or restrict routers to purely local telemetry. This leaves their deployability in real-world environments unclear. We cast telemetry-aware routing as a delay-aware closed-loop control problem and introduce a framework that trains and evaluates neural routing algorithms, while explicitly modeling communication and inference delays. On top of this framework, we propose LOGGIA, a scalable graph neural routing algorithm that predicts log-space link weights from attributed topology-and-telemetry graphs. It utilizes a data-driven pre-training stage, followed by on-policy Reinforcement Learning. Across synthetic and real network topologies, and unseen mixed TCP/UDP traffic sequences, LOGGIA consistently outperforms shortest-path baselines, whereas neural baselines fail once realistic delays are enforced. Our experiments further suggest that neural routing algorithms like LOGGIA perform best when deployed fully locally, i.e., observing network states and inferring actions at every router individually, as opposed to centralized decision making.
【8】Efficient Logistic Regression with Mixture of Sigmoids
标题:使用Sigmoid混合的高效逻辑回归
链接:https://arxiv.org/abs/2604.02920
作者:Federico Di Gennaro,Saptarshi Chakraborty,Nikita Zhivotovskiy
摘要:本文研究了具有各向同性高斯先验的指数加权(EW)在线Logistic回归算法。我们证明了Kakade和Ng(2005)针对范数至多为$B$的最佳线性预测器建立的EW近最优最坏情况遗憾界$O(d\log(Bn))$,可以在总最坏情况计算复杂度$O(B^3 n^5)$下实现。这大大改进了之前实现相同保证的工作的$O(B^{18}n^{37})$复杂度(Foster等人,2018年)。除了效率之外,我们还分析了线性可分情形下的大$B$机制:按$B$重新缩放后,EW后验在$B\to\infty$时收敛到截断在版本锥上的标准高斯分布。因此,预测器收敛到对分离方向的立体角投票,并且在该锥的每个固定间隔切片上,对应截断高斯的众数与硬间隔SVM方向对齐。利用这种几何结构,我们推导出非渐近遗憾界,表明一旦$B$超过一个依赖间隔的阈值,遗憾就与$B$无关,并且仅随逆间隔对数增长。总体而言,我们的结果表明,EW在在线分类中既可以在计算上易于处理,又具有几何自适应性。
摘要:This paper studies the Exponential Weights (EW) algorithm with an isotropic Gaussian prior for online logistic regression. We show that the near-optimal worst-case regret bound $O(d\log(Bn))$ for EW, established by Kakade and Ng (2005) against the best linear predictor of norm at most $B$, can be achieved with total worst-case computational complexity $O(B^3 n^5)$. This substantially improves on the $O(B^{18}n^{37})$ complexity of prior work achieving the same guarantee (Foster et al., 2018). Beyond efficiency, we analyze the large-$B$ regime under linear separability: after rescaling by $B$, the EW posterior converges as $B\to\infty$ to a standard Gaussian truncated to the version cone. Accordingly, the predictor converges to a solid-angle vote over separating directions and, on every fixed-margin slice of this cone, the mode of the corresponding truncated Gaussian is aligned with the hard-margin SVM direction. Using this geometry, we derive non-asymptotic regret bounds showing that once $B$ exceeds a margin-dependent threshold, the regret becomes independent of $B$ and grows only logarithmically with the inverse margin. Overall, our results show that EW can be both computationally tractable and geometrically adaptive in online classification.
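For intuition, here is the Exponential Weights mixture forecaster with the continuous Gaussian/ball prior crudely replaced by a finite sample of candidate weight vectors of norm at most $B$; the paper's contribution is making the continuous version computationally tractable, which this sketch does not attempt.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Exponential Weights for online logistic regression over a finite set of
# candidates drawn uniformly from the radius-B ball (a crude discretization
# of the continuous prior analyzed in the paper).
rng = np.random.default_rng(5)
d, B, n_candidates = 3, 2.0, 2000
W = rng.normal(size=(n_candidates, d))
W *= B * rng.uniform(size=(n_candidates, 1)) ** (1 / d) \
     / np.linalg.norm(W, axis=1, keepdims=True)
log_w = np.zeros(n_candidates)               # log-weights (uniform prior)

w_star = np.array([1.0, -1.0, 0.5])
total_loss = 0.0
for t in range(500):
    x = rng.normal(size=d)
    y = float(rng.uniform() < sigmoid(w_star @ x))
    p = sigmoid(W @ x)
    # Mixture prediction = posterior-averaged probability.
    weights = np.exp(log_w - log_w.max()); weights /= weights.sum()
    p_mix = float(weights @ p)
    total_loss += -(y * np.log(p_mix) + (1 - y) * np.log(1 - p_mix))
    # Posterior update: multiply each candidate's weight by its likelihood.
    log_w += y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12)
print(total_loss / 500)   # average log loss of the EW forecaster
```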
【9】Split and Conquer Partial Deepfake Speech
标题:分裂并征服部分Deepfake语音
链接:https://arxiv.org/abs/2604.02913
作者:Inbal Rimon,Oren Gal,Haim Permuter
摘要:部分deepfake语音检测需要识别可能出现在原本真实话语的短时间片段内的操纵区域,这使该任务对传统的话语级分类器尤其具有挑战性。我们提出了一个分而治之的框架,将问题分解为两个阶段:边界检测和片段级分类。专用的边界检测器首先识别时间转变点,从而将音频信号划分为预期包含声学一致内容的片段。然后对每个片段独立评估,以判断它对应真实语音还是伪造语音。 这一表述通过显式地将时间定位与真实性评估分离来简化学习目标,使每个组件专注于定义明确的任务。为进一步提高鲁棒性,我们引入了一种基于反射的多长度训练策略,将可变时长的片段转换为若干固定输入长度,从而产生多样化的特征空间表示。每个阶段都使用具有不同特征提取器和增强策略的多种配置进行训练,并融合它们互补的预测以获得改进的最终模型。 在PartialSpoof基准上的实验表明,该方法在多个时间分辨率以及话语级别上均达到最先进的性能,在准确检测和定位欺骗区域方面有实质性改进。此外,该方法在Half-Truth数据集上也实现了最先进的性能,进一步证实了框架的鲁棒性和泛化能力。
摘要:Partial deepfake speech detection requires identifying manipulated regions that may occur within short temporal portions of an otherwise bona fide utterance, making the task particularly challenging for conventional utterance-level classifiers. We propose a split-and-conquer framework that decomposes the problem into two stages: boundary detection and segment-level classification. A dedicated boundary detector first identifies temporal transition points, allowing the audio signal to be divided into segments that are expected to contain acoustically consistent content. Each resulting segment is then evaluated independently to determine whether it corresponds to bona fide or fake speech. This formulation simplifies the learning objective by explicitly separating temporal localization from authenticity assessment, allowing each component to focus on a well-defined task. To further improve robustness, we introduce a reflection-based multi-length training strategy that converts variable-duration segments into several fixed input lengths, producing diverse feature-space representations. Each stage is trained using multiple configurations with different feature extractors and augmentation strategies, and their complementary predictions are fused to obtain improved final models. Experiments on the PartialSpoof benchmark demonstrate state-of-the-art performance across multiple temporal resolutions as well as at the utterance level, with substantial improvements in the accurate detection and localization of spoofed regions. In addition, the proposed method achieves state-of-the-art performance on the Half-Truth dataset, further confirming the robustness and generalization capability of the framework.
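The reflection-based multi-length idea can be sketched as palindromic tiling of a segment out to several fixed lengths; the exact padding scheme and the target lengths here are assumptions:

```python
import numpy as np

def reflect_to_length(segment, target_len):
    """Map a variable-duration segment to a fixed length by reflection-
    style tiling: repeat the segment-plus-mirror period until the target
    length is reached (a sketch of the multi-length training idea)."""
    if len(segment) >= target_len:
        return segment[:target_len]
    ext = np.concatenate([segment, segment[-2:0:-1]])  # reflect period
    reps = int(np.ceil(target_len / len(ext)))
    return np.tile(ext, reps)[:target_len]

seg = np.arange(5, dtype=float)        # a short variable-length segment
for L in (8, 12, 16):                  # several fixed input lengths
    print(L, reflect_to_length(seg, L))
```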
【10】FluxMoE: Decoupling Expert Residency for High-Performance MoE Serving
标题:FluxMoE:解耦专家驻留以实现高性能MoE服务
链接:https://arxiv.org/abs/2604.02715
作者:Qingxiu Liu,Cyril Y. He,Hanser Jiang,Zion Wang,Alan Zhao,Patrick P. C. Lee
摘要:混合专家(MoE)模型已成为扩展大型语言模型的主要范式,但其快速增长的参数规模在推理过程中引入了根本性的低效:大多数专家权重在GPU内存中保持空闲,同时与键值(KV)缓存等性能关键的运行时状态竞争。由于KV缓存容量直接决定服务吞吐量,这种不匹配会导致内存利用不足和性能下降。在本文中,我们提出了FluxMoE,一个将专家参数与持久GPU驻留解耦的新型MoE推理系统。FluxMoE引入了一种专家分页抽象,将专家权重视为流式传输的瞬态资源,按需将其具体化并在使用后立即将其驱逐,从而允许将GPU内存优先分配给对吞吐量至关重要的运行时状态。我们在vLLM之上实现了FluxMoE,以便在严重内存限制下进行高效的MoE推理。实验结果表明,在内存紧张的情况下,FluxMoE在不影响模型保真度的前提下,相比vLLM实现了高达3.0$\times$的吞吐量提升。
摘要:Mixture-of-Experts (MoE) models have become a dominant paradigm for scaling large language models, but their rapidly growing parameter sizes introduce a fundamental inefficiency during inference: most expert weights remain idle in GPU memory while competing with performance-critical runtime state such as the key-value (KV) cache. Since KV cache capacity directly determines serving throughput, this mismatch leads to underutilized memory and degraded performance. In this paper, we present FluxMoE, a new MoE inference system that decouples expert parameters from persistent GPU residency. FluxMoE introduces an expert paging abstraction that treats expert weights as streamed, transient resources, materializing them on demand and evicting them immediately after use, allowing GPU memory to be preferentially allocated to throughput-critical runtime state. We implement FluxMoE atop vLLM to enable efficient MoE inference under severe memory constraints. Experimental results demonstrate that FluxMoE achieves up to 3.0$\times$ throughput gains over vLLM in memory-intensive regimes, without compromising model fidelity.
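A toy sketch of the expert-paging abstraction: weights live on the host and are materialized on the device only when routed to. A small LRU cache stands in for the paper's immediate-eviction policy; names and capacities are illustrative.

```python
import collections
import torch

class ExpertPager:
    """Sketch of on-demand expert paging: expert weights stay on the host,
    are copied to the device only when a token is routed to them, and are
    evicted quickly so runtime state (e.g., KV cache) can keep the memory."""
    def __init__(self, cpu_experts, device, capacity=2):
        self.cpu_experts = cpu_experts        # {id: weight tensor on CPU}
        self.device = device
        self.capacity = capacity
        self.resident = collections.OrderedDict()

    def fetch(self, expert_id):
        if expert_id in self.resident:
            self.resident.move_to_end(expert_id)
        else:
            if len(self.resident) >= self.capacity:
                self.resident.popitem(last=False)   # evict least recent
            self.resident[expert_id] = self.cpu_experts[expert_id].to(self.device)
        return self.resident[expert_id]

device = "cuda" if torch.cuda.is_available() else "cpu"
experts = {i: torch.randn(64, 64) for i in range(8)}
pager = ExpertPager(experts, device)
x = torch.randn(4, 64, device=device)
for eid in [0, 3, 0, 5, 7]:                  # router decisions per step
    x = x @ pager.fetch(eid)
print(x.shape, list(pager.resident))         # only a few experts resident
```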
【11】Finding Belief Geometries with Sparse Autoencoders
标题:使用稀疏自动编码器寻找信念几何
链接:https://arxiv.org/abs/2604.02685
作者:Matthew Levinson
摘要:理解内部表征的几何结构是机制可解释性的核心目标。先前的工作表明,在隐马尔可夫模型生成的序列上训练的Transformer会在其残差流中将概率信念状态编码为单纯形几何,其顶点对应于潜在的生成状态。在自然文本上训练的大型语言模型是否会发展出类似的几何表示仍然是一个悬而未决的问题。 我们引入了一个在Transformer表示中发现候选单纯形结构子空间的管道,它结合了稀疏自编码器(SAE)、对SAE特征进行$k$-子空间聚类,以及使用AANet进行单纯形拟合。我们在一个基于多部分隐马尔可夫模型训练、信念状态几何已知的Transformer上验证了该管道。将其应用到Gemma-2-9B,我们识别出13个表现出候选单纯形几何($K \geq 3$)的优先聚类。 一个关键的挑战是区分真正的信念状态编码和平铺伪影:潜变量可以张成一个单纯形形状的子空间,而混合坐标并不携带超出任何单个特征的预测信号。因此,我们采用重心预测作为主要的判别性检验。在13个优先聚类中,3个在近顶点样本上表现出高度显著的优势(Wilcoxon $p < 10^{-14}$),4个在单纯形内部样本上表现出高度显著的优势。共有5个不同的真实聚类至少通过了一项检验,而没有任何空聚类通过任一检验。其中一个聚类768_596还在数据集中获得了最高的因果转向得分,这是被动预测和主动干预相互印证的唯一情形。我们将这些发现作为Gemma-2-9B表示空间中存在真正的类信念几何的初步证据,并指出了确认这一解释所需的结构化评估。
摘要:Understanding the geometric structure of internal representations is a central goal of mechanistic interpretability. Prior work has shown that transformers trained on sequences generated by hidden Markov models encode probabilistic belief states as simplex-shaped geometries in their residual stream, with vertices corresponding to latent generative states. Whether large language models trained on naturalistic text develop analogous geometric representations remains an open question. We introduce a pipeline for discovering candidate simplex-structured subspaces in transformer representations, combining sparse autoencoders (SAEs), $k$-subspace clustering of SAE features, and simplex fitting using AANet. We validate the pipeline on a transformer trained on a multipartite hidden Markov model with known belief-state geometry. Applied to Gemma-2-9B, we identify 13 priority clusters exhibiting candidate simplex geometry ($K \geq 3$). A key challenge is distinguishing genuine belief-state encoding from tiling artifacts: latents can span a simplex-shaped subspace without the mixture coordinates carrying predictive signal beyond any individual feature. We therefore adopt barycentric prediction as our primary discriminating test. Among the 13 priority clusters, 3 exhibit a highly significant advantage on near-vertex samples (Wilcoxon $p < 10^{-14}$) and 4 on simplex-interior samples. Together 5 distinct real clusters pass at least one split, while no null cluster passes either. One cluster, 768_596, additionally achieves the highest causal steering score in the dataset. This is the only case where passive prediction and active intervention converge. We present these findings as preliminary evidence that genuine belief-like geometry exists in Gemma-2-9B's representation space, and identify the structured evaluation that would be required to confirm this interpretation.
【12】Communication-free Sampling and 4D Hybrid Parallelism for Scalable Mini-batch GNN Training
标题:用于可扩展小批量GNN训练的无通信采样和4D混合并行
链接:https://arxiv.org/abs/2604.02651
作者:Cunyang Wei,Siddharth Singh,Aishwarya Sarkar,Daniel Nichols,Tisha Patel,Aditya K. Ranjan,Sayan Ghosh,Ali Jannesari,Nathan R. Tallent,Abhinav Bhatele
摘要:图神经网络(GNN)被广泛用于从各种真实世界场景导出的图数据集上进行学习。从超大规模图中学习需要分布式训练,而带采样的小批量(mini-batch)训练是并行化GNN训练的流行方法。由于采样方法代价高昂且数据并行的扩展性有限,现有的分布式小批量方法存在显著的性能瓶颈。在这项工作中,我们提出了ScaleGNN,一个用于可扩展小批量GNN训练的4D并行框架,它结合了无通信分布式采样、3D并行矩阵乘法(PMM)和数据并行。ScaleGNN引入了一种均匀顶点采样算法,使每个进程(GPU设备)能够在没有任何进程间通信的情况下构建其本地小批量,即子图分区。3D PMM能够将小批量训练扩展到比普通数据并行大得多的GPU数量,同时显著降低通信开销。我们还提出了额外的优化,将采样与训练重叠,并通过以较低精度发送数据、内核融合和通信计算重叠来减少通信开销。我们在五个图数据集上评估了ScaleGNN,并在Perlmutter上展示了高达2048个GPU、在Frontier上2048个GCD、在Tuolumne上1024个GPU的强扩展能力。在Perlmutter上,ScaleGNN在ogbn-products上相比SOTA基线实现了3.5倍的端到端训练加速。
摘要:Graph neural networks (GNNs) are widely used for learning on graph datasets derived from various real-world scenarios. Learning from extremely large graphs requires distributed training, and mini-batching with sampling is a popular approach for parallelizing GNN training. Existing distributed mini-batch approaches have significant performance bottlenecks due to expensive sampling methods and limited scaling when using data parallelism. In this work, we present ScaleGNN, a 4D parallel framework for scalable mini-batch GNN training that combines communication-free distributed sampling, 3D parallel matrix multiplication (PMM), and data parallelism. ScaleGNN introduces a uniform vertex sampling algorithm, enabling each process (GPU device) to construct its local mini-batch, i.e., subgraph partitions without any inter-process communication. 3D PMM enables scaling mini-batch training to much larger GPU counts than vanilla data parallelism with significantly lower communication overheads. We also present additional optimizations to overlap sampling with training, reduce communication overhead by sending data in lower precision, kernel fusion, and communication-computation overlap. We evaluate ScaleGNN on five graph datasets and demonstrate strong scaling up to 2048 GPUs on Perlmutter, 2048 GCDs on Frontier, and 1024 GPUs on Tuolumne. On Perlmutter, ScaleGNN achieves 3.5x end-to-end training speedup over the SOTA baseline on ogbn-products.
【13】Conditional Sampling via Wasserstein Autoencoders and Triangular Transport
标题:通过Wasserstein自动编码器和三角传输进行条件采样
链接:https://arxiv.org/abs/2604.02644
作者:Mohammad Al-Jarrah,Michele Martino,Marcus Yim,Bamdad Hosseini,Amirhossein Taghvaei
备注:8 pages, 5 figures
摘要:我们提出了条件Wasserstein自编码器(CWAE),这是一个同时利用被条件变量与条件变量中低维结构的条件模拟框架。其关键思想是修改Wasserstein自编码器,使用(块)三角解码器,并对潜变量施加适当的独立性假设。我们表明,由此产生的模型给出了一个可以利用低维结构的自编码器,同时其解码器可用于条件模拟。我们探讨了CWAE的各种理论性质,包括它们与条件最优传输(OT)问题的联系。我们还提出了替代表述,由此得到构成我们算法基础的三种架构变体。我们展示了一系列数值实验,表明我们不同的CWAE变体相对于低秩集合卡尔曼滤波器(LREnKF)大幅减少了近似误差,尤其是在条件测度的支撑确实是低维的问题中。
摘要:We present Conditional Wasserstein Autoencoders (CWAEs), a framework for conditional simulation that exploits low-dimensional structure in both the conditioned and the conditioning variables. The key idea is to modify a Wasserstein autoencoder to use a (block-) triangular decoder and impose an appropriate independence assumption on the latent variables. We show that the resulting model gives an autoencoder that can exploit low-dimensional structure while simultaneously the decoder can be used for conditional simulation. We explore various theoretical properties of CWAEs, including their connections to conditional optimal transport (OT) problems. We also present alternative formulations that lead to three architectural variants forming the foundation of our algorithms. We present a series of numerical experiments that demonstrate that our different CWAE variants achieve substantial reductions in approximation error relative to the low-rank ensemble Kalman filter (LREnKF), particularly in problems where the support of the conditional measures is truly low-dimensional.
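The block-triangular decoder can be sketched as below: y is decoded from z_y alone while x may depend on both latent blocks, so with independent latents one can fix z_y and resample z_x to draw from an approximate conditional x given y. The training losses (Wasserstein/MMD terms) are omitted, and all layer sizes are assumptions.

```python
import torch
import torch.nn as nn

class TriangularDecoder(nn.Module):
    """Block-triangular decoder sketch: the conditioning variable y is
    reconstructed from z_y alone, while x may depend on both (z_y, z_x)."""
    def __init__(self, dy, dx, dzy, dzx, hidden=64):
        super().__init__()
        self.dec_y = nn.Sequential(nn.Linear(dzy, hidden), nn.ReLU(),
                                   nn.Linear(hidden, dy))
        self.dec_x = nn.Sequential(nn.Linear(dzy + dzx, hidden), nn.ReLU(),
                                   nn.Linear(hidden, dx))

    def forward(self, zy, zx):
        y_hat = self.dec_y(zy)                       # upper block: y(z_y)
        x_hat = self.dec_x(torch.cat([zy, zx], -1))  # lower block: x(z_y, z_x)
        return y_hat, x_hat

dec = TriangularDecoder(dy=3, dx=5, dzy=2, dzx=2)
zy = torch.randn(8, 2)                 # held fixed (encodes an observed y)
samples = [dec(zy, torch.randn(8, 2))[1] for _ in range(4)]
print(samples[0].shape)                 # conditional draws of x given y
```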
【14】AXELRAM: Quantize Once, Never Dequantize
标题:AXELRAM:一次量化,永不去量化
链接:https://arxiv.org/abs/2604.02638
作者:Yasushi Nishida
备注:6 pages, 3 figures, 3 tables. Code: https://github.com/Axelidea/AXELRAM
摘要:我们提出AXELRAM,一种智能SRAM宏架构,可直接从量化的KV缓存索引计算注意力分数而无需去量化。其关键使能器是设计时固定的码本:基于正交变换的量化将每个坐标的分布集中为N(0,1/d),因此最优量化器仅取决于维度d和位宽b,而与输入数据无关。非对称路径设计(写时变换、读时查表、无需逆变换)将每次查询的乘法次数减少了102.4倍(数学恒等式)。通过多种子评估(10个种子 x 3个模型),我们发现符号模式敏感性会在某些模型(Qwen2.5-3B)上导致灾难性的PPL尖峰(Delta > 50),而其他模型(LLaMA-3.1-8B)则完全稳定。这一现象将SpinQuant在权重量化中对旋转方差的观察扩展到了KV缓存领域,且在该领域中影响在性质上更为严重。我们将根本原因追溯到逐层范数的异质性,并提出一种无梯度的符号模式选择(200个候选、8个校准样本、一次性完成),在零额外硬件成本下消除了灾难性尖峰。所有源代码可在https://github.com/Axelidea/AXELRAM获取。
摘要:We propose AXELRAM, a smart SRAM macro architecture that computes attention scores directly from quantized KV cache indices without dequantization. The key enabler is a design-time fixed codebook: orthogonal-transform-based quantization concentrates each coordinate's distribution to N(0,1/d), so the optimal quantizer depends only on dimension d and bit-width b, not on input data. The asymmetric path design -- transform on write, table-lookup on read with no inverse transform -- reduces per-query multiplications by 102.4x (a mathematical identity). Through multi-seed evaluation (10 seeds x 3 models), we discover that sign pattern sensitivity causes catastrophic PPL spikes (Delta > 50) on certain models (Qwen2.5-3B), while others (LLaMA-3.1-8B) are fully stable. This phenomenon extends SpinQuant's observation of rotation variance in weight quantization to the KV cache domain, where the effect is qualitatively more severe. We trace the root cause to layer-wise norm heterogeneity and propose a gradient-free sign pattern selection (200 candidates, 8 calibration samples, one-time) that eliminates catastrophic spikes with zero additional hardware cost. All source code is available at https://github.com/Axelidea/AXELRAM.
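下面的NumPy草图示意"设计时固定码本 + 查表点积"的机制:码本只由(d, b)决定,写入时只存索引,读出时用预计算的码字乘积表累加,全程不做去量化。此处用均匀格点近似最优码本,规模与取值均为演示假设,并非AXELRAM的实际电路或码本设计:
```python
import numpy as np

d, b = 64, 4                                       # 维度与位宽(演示假设)
codebook = np.linspace(-3, 3, 2**b) / np.sqrt(d)   # 设计时固定:只依赖(d, b),与数据无关
prod_table = np.outer(codebook, codebook)          # 预计算全部码字两两乘积

def quantize(v):                                   # 写入KV缓存:只存索引
    return np.abs(v[:, None] - codebook[None, :]).argmin(axis=1)

rng = np.random.default_rng(0)
q, k = rng.standard_normal(d) / np.sqrt(d), rng.standard_normal(d) / np.sqrt(d)
score = prod_table[quantize(q), quantize(k)].sum() # 读出:查表累加,全程无去量化乘法
print(score, q @ k)                                # 查表得分应接近真实点积
```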
【15】Complex-Valued GNNs for Distributed Basis-Invariant Control of Planar Systems
标题:用于平面系统分布式基不变控制的复值GNN
链接:https://arxiv.org/abs/2604.02615
作者:Samuel Honor,Mohamed Abdelnaby,Kevin Leahy
备注:8 pages, 6 figures, submitted to CDC 2026 main track
摘要:由于能够以分布式方式部署,图神经网络(GNN)是网络化动力系统学习控制中备受推崇的工具。然而,当前的分布式GNN架构假设网络中所有节点都在相容的基下采集几何观测,这限制了此类控制器在GPS受限和罗盘受限环境中的实用性。本文提出了一种对局部基的选择全局不变的GNN参数化方法:二维几何特征及基之间的变换在复数域中表达;在每个GNN层内部,使用带有相位等变激活函数的复值线性层。从固定的全局坐标系来看,该架构学到的所有策略都严格不随局部坐标系的选择而变。在一个模仿学习的群集(flocking)任务上,与实值基线相比,该架构被证明能提高学习控制的数据效率、跟踪性能和泛化能力。
摘要:Graph neural networks (GNNs) are a well-regarded tool for learned control of networked dynamical systems due to their ability to be deployed in a distributed manner. However, current distributed GNN architectures assume that all nodes in the network collect geometric observations in compatible bases, which limits the usefulness of such controllers in GPS-denied and compass-denied environments. This paper presents a GNN parametrization that is globally invariant to choice of local basis. 2D geometric features and transformations between bases are expressed in the complex domain. Inside each GNN layer, complex-valued linear layers with phase-equivariant activation functions are used. When viewed from a fixed global frame, all policies learned by this architecture are strictly invariant to choice of local frames. This architecture is shown to increase the data efficiency, tracking performance, and generalization of learned control when compared to a real-valued baseline on an imitation learning flocking task.
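相位等变性是该方法的核心:局部基旋转对应于复数乘以单位模因子e^{iθ},而modReLU型激活只作用于模长、保持相位,因此整层对旋转等变。以下NumPy草图验证这一性质(权重与激活形式为演示假设,非论文原始架构):
```python
import numpy as np

rng = np.random.default_rng(0)
W = (rng.standard_normal((8, 8)) + 1j * rng.standard_normal((8, 8))) / 4  # 复值线性层权重(演示假设)
b = -0.1

def mod_relu(z, b):
    r = np.abs(z)
    return np.maximum(r + b, 0.0) * z / np.maximum(r, 1e-12)  # 只作用于模长,相位保持不变

def layer(z):
    return mod_relu(W @ z, b)

z = rng.standard_normal(8) + 1j * rng.standard_normal(8)
rot = np.exp(1j * 0.7)                              # 局部基旋转0.7弧度,即乘以单位模复数
print(np.allclose(layer(rot * z), rot * layer(z)))  # True:逐层相位等变
```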
【16】Steerable but Not Decodable: Function Vectors Operate Beyond the Logit Lens
标题:可操纵但不可解码:函数向量的作用超出logit透镜的范围
链接:https://arxiv.org/abs/2604.02608
作者:Mohammed Suhail B Nadaf
备注:30 pages, 7 figures
摘要:函数向量(FV),即从上下文学习演示中提取的均值差方向,在加入残差流时可以操控大型语言模型的行为。我们曾假设FV转向的失败反映了任务相关信息的缺失:logit透镜会与转向一同失效。我们错了。在迄今最全面的跨模板FV迁移研究中(12个任务共4,032个对,来自3个家族的6个模型(Llama-3.1-8B、Gemma-2-9B、Mistral-7B-v0.3;基础版与指令微调版),每个任务8个模板),我们发现了相反的解离:即使logit透镜在任何一层都无法解码出正确答案,FV转向依然成功。这种"可操控但不可解码"的模式是普遍的:在每个模型的每个任务上,转向精度都超过logit透镜精度,差距最大达-0.91。72个任务-模型实例中只有3个呈现出所预测的"可解码但不可操控"模式,且全部出现在Mistral中。FV词汇投影表明,转向精度超过0.90的FV仍投影到不连贯的token分布上,说明FV编码的是计算指令而非答案方向。FV在早期层(L2-L8)干预效果最佳;logit透镜仅在后期层(L28-L32)检测到正确答案。先前报道的负的余弦-迁移相关性(r=-0.572)在规模化后消失:合并后的r介于-0.199与+0.126之间,且在任务身份之外,余弦对R平方的增益小于0.011。转向后分析揭示了模型家族间的分歧:Mistral的FV会重写中间表示;Llama/Gemma的FV尽管转向成功,产生的变化却接近于零。激活修补证实了因果定位:简单任务在目标层实现完美恢复;困难任务在任何层都无法恢复。
摘要:Function vectors (FVs), mean-difference directions extracted from in-context learning demonstrations, can steer large language model behavior when added to the residual stream. We hypothesized that FV steering failures reflect an absence of task-relevant information: the logit lens would fail alongside steering. We were wrong. In the most comprehensive cross-template FV transfer study to date (4,032 pairs across 12 tasks, 6 models from 3 families, namely Llama-3.1-8B, Gemma-2-9B, and Mistral-7B-v0.3 in base and instruction-tuned variants, with 8 templates per task), we find the opposite dissociation: FV steering succeeds even when the logit lens cannot decode the correct answer at any layer. This steerability-without-decodability pattern is universal: steering exceeds logit lens accuracy for every task on every model, with gaps as large as -0.91. Only 3 of 72 task-model instances show the predicted decodable-without-steerable pattern, all in Mistral. FV vocabulary projection reveals that FVs achieving over 0.90 steering accuracy still project to incoherent token distributions, indicating FVs encode computational instructions rather than answer directions. FVs intervene optimally at early layers (L2-L8); the logit lens detects correct answers only at late layers (L28-L32). The previously reported negative cosine-transfer correlation (r=-0.572) dissolves at scale: pooled r ranges from -0.199 to +0.126, and cosine adds less than 0.011 in R-squared beyond task identity. Post-steering analysis reveals a model-family divergence: Mistral FVs rewrite intermediate representations; Llama/Gemma FVs produce near-zero changes despite successful steering. Activation patching confirms causal localization: easy tasks achieve perfect recovery at targeted layers; hard tasks show zero recovery everywhere.
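下面用一个玩具残差网络示意"把FV加入残差流"与"logit透镜逐层解码"这两种操作的机制(PyTorch)。真实实验对象是Llama/Gemma/Mistral等LLM;此处的层结构、维度与FV取法均为演示假设:
```python
import torch, torch.nn as nn

d, L, V = 32, 6, 100                             # 维度、层数、词表大小(演示假设)
blocks = nn.ModuleList([nn.Linear(d, d) for _ in range(L)])
unembed = nn.Linear(d, V, bias=False)            # logit透镜复用的解嵌入矩阵
fv = torch.randn(d) * 0.5                        # 假设:由ICL演示的均值差得到的函数向量

def forward(x, steer_layer=None):
    lens_logits = []
    for i, blk in enumerate(blocks):
        x = x + torch.tanh(blk(x))               # 残差流
        if i == steer_layer:
            x = x + fv                           # 转向:把FV直接加到残差流
        lens_logits.append(unembed(x))           # logit透镜:对每层中间状态解码
    return x, lens_logits

x = torch.randn(1, d)
_, lens_clean = forward(x)
_, lens_steered = forward(x, steer_layer=2)      # 论文发现早期层(L2-L8)干预最优
```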
【17】A Spectral Framework for Multi-Scale Nonlinear Dimensionality Reduction
标题:面向多尺度非线性降维的谱框架
链接:https://arxiv.org/abs/2604.02535
作者:Zeyang Huang,Angelos Chatzimparmpas,Thomas Höllt,Takanori Fujiwara
摘要:非线性降维(DR)面临两个长期存在的权衡。其一是全局-局部保持之间的张力:t-SNE和UMAP等方法优先保持局部邻域,但可能扭曲全局流形结构,而拉普拉斯特征映射等方法保持全局几何,却往往只能提供有限的局部分离。其二是表达能力与分析透明性之间的差距:许多非线性DR方法产生的嵌入与底层高维结构之间没有显式联系,限制了对嵌入过程的洞察。本文提出一个解决这些挑战的非线性DR谱框架。我们的方法使用谱基结合交叉熵优化来嵌入高维数据,实现桥接全局与局部结构的多尺度表示。利用线性谱分解,该框架还支持从图频率的视角分析嵌入,从而考察各谱模式如何影响最终嵌入。我们用基于字形(glyph)的散点图增强来补充这一分析,以支持可视化探索。定量评估和案例研究表明,我们的框架在提升流形连续性的同时,还能通过谱模式贡献对嵌入结构进行更深入的分析。
摘要:Dimensionality reduction (DR) is characterized by two longstanding trade-offs. First, there is a global-local preservation tension: methods such as t-SNE and UMAP prioritize local neighborhood preservation, yet may distort global manifold structure, while methods such as Laplacian Eigenmaps preserve global geometry but often yield limited local separation. Second, there is a gap between expressiveness and analytical transparency: many nonlinear DR methods produce embeddings without an explicit connection to the underlying high-dimensional structure, limiting insight into the embedding process. In this paper, we introduce a spectral framework for nonlinear DR that addresses these challenges. Our approach embeds high-dimensional data using a spectral basis combined with cross-entropy optimization, enabling multi-scale representations that bridge global and local structure. Leveraging linear spectral decomposition, the framework further supports analysis of embeddings through a graph-frequency perspective, enabling examination of how spectral modes influence the resulting embedding. We complement this analysis with glyph-based scatterplot augmentations for visual exploration. Quantitative evaluations and case studies demonstrate that our framework improves manifold continuity while enabling deeper analysis of embedding structure through spectral mode contributions.
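以下NumPy/SciPy草图示意该框架的骨架:由kNN亲和图构造图拉普拉斯,取其低频特征向量作为谱基,嵌入是谱模式的线性组合。论文中各模式的权重由交叉熵优化学习,此处用固定权重代替;近邻数、带宽取法等均为演示假设:
```python
import numpy as np
from scipy.linalg import eigh

def spectral_basis(X, k=10, m=8):
    D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # 成对平方距离
    W = np.exp(-D2 / np.median(D2))                       # 高斯亲和度(带宽为演示取法)
    far = np.argsort(D2, axis=1)[:, k + 1:]               # 每行只保留k个最近邻
    np.put_along_axis(W, far, 0.0, axis=1)
    W = (W + W.T) / 2
    Lap = np.diag(W.sum(1)) - W                           # 图拉普拉斯
    _, vecs = eigh(Lap)
    return vecs[:, 1:m + 1]                               # 丢弃常值模式,取前m个低频谱模式

X = np.random.default_rng(0).standard_normal((200, 10))
Phi = spectral_basis(X)                                   # (n, m) 谱基
alpha = np.linspace(1.0, 0.2, Phi.shape[1])               # 模式权重:论文中经交叉熵优化学习
Y = Phi[:, :2] * alpha[:2]                                # 2D嵌入 = 谱模式的加权线性组合
```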
【18】Feature Attribution Stability Suite: How Stable Are Post-Hoc Attributions?
标题:特征归因稳定性套件:事后归因的稳定性如何?
链接:https://arxiv.org/abs/2604.02532
作者:Kamalasankari Subramaniakuppusamy,Jugal Gajjar
备注:Accepted in the proceedings track of XAI4CV Workshop at CVPR 2026. It has 2 images, 5 tables, 6 equations, and 35 references in the main paper and 12 figures, 15 tables, and 3 references in the supplementary material
摘要:事后特征归因方法已广泛部署于安全关键的视觉系统,但它们在现实输入扰动下的稳定性仍缺乏充分刻画。现有指标主要在加性噪声下评估解释,将稳定性压缩为单一标量,且未以预测保持为条件,从而把解释的脆弱性与模型的敏感性混为一谈。我们提出特征归因稳定性套件(FASS),这一基准强制执行预测不变性过滤,将稳定性分解为三个互补指标(结构相似性、秩相关性和top-k Jaccard重叠),并在几何、光度和压缩扰动下进行评估。在四种架构和三个数据集(ImageNet-1K、MS COCO和CIFAR-10)上评估四种归因方法(Integrated Gradients、GradientSHAP、Grad-CAM、LIME)后,FASS表明稳定性估计关键取决于扰动族与预测不变性过滤。几何扰动暴露出的归因不稳定性远大于光度变化;若不以预测保持为条件,高达99%的被评估样本对涉及预测改变。在这种受控评估下,我们观察到一致的方法级趋势,其中Grad-CAM在各数据集上达到最高的稳定性。
摘要:Post-hoc feature attribution methods are widely deployed in safety-critical vision systems, yet their stability under realistic input perturbations remains poorly characterized. Existing metrics evaluate explanations primarily under additive noise, collapse stability to a single scalar, and fail to condition on prediction preservation, conflating explanation fragility with model sensitivity. We introduce the Feature Attribution Stability Suite (FASS), a benchmark that enforces prediction-invariance filtering, decomposes stability into three complementary metrics (structural similarity, rank correlation, and top-k Jaccard overlap), and evaluates across geometric, photometric, and compression perturbations. Evaluating four attribution methods (Integrated Gradients, GradientSHAP, Grad-CAM, LIME) across four architectures and three datasets (ImageNet-1K, MS COCO, and CIFAR-10), FASS shows that stability estimates depend critically on perturbation family and prediction-invariance filtering. Geometric perturbations expose substantially greater attribution instability than photometric changes, and without conditioning on prediction preservation, up to 99% of evaluated pairs involve changed predictions. Under this controlled evaluation, we observe consistent method-level trends, with Grad-CAM achieving the highest stability across datasets.
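FASS三项指标中的秩相关与top-k Jaccard重叠可以用几行代码实现,关键是先做预测不变性过滤(预测改变的样本对不计入稳定性统计)。以下为假设性草图;结构相似性(SSIM)可用skimage等库计算,此处从略:
```python
import numpy as np
from scipy.stats import spearmanr

def topk_jaccard(a, b, k=100):
    ta, tb = set(np.argsort(a.ravel())[-k:]), set(np.argsort(b.ravel())[-k:])
    return len(ta & tb) / len(ta | tb)

def stability(attr_clean, attr_pert, pred_clean, pred_pert):
    if pred_clean != pred_pert:           # 预测不变性过滤:预测改变的样本对不计入
        return None
    rho, _ = spearmanr(attr_clean.ravel(), attr_pert.ravel())
    return {"rank_corr": rho, "topk_jaccard": topk_jaccard(attr_clean, attr_pert)}

a = np.random.default_rng(0).random((32, 32))             # 原始归因图(演示数据)
b = a + 0.05 * np.random.default_rng(1).random((32, 32))  # 扰动后的归因图
print(stability(a, b, pred_clean=3, pred_pert=3))
```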
【19】AdaHOP: Fast and Accurate Low-Precision Training via Outlier-Pattern-Aware Rotation
标题:AdaHOP:通过离群模式感知旋转实现快速而准确的低精度训练
链接:https://arxiv.org/abs/2604.02525
作者:Seonggon Kim,Alireza Khodamoradi,Kristof Denolf,Eunhyeok Park
备注:21 pages, 7 figures
摘要:低精度训练(LPT)通常采用Hadamard变换来抑制离群值并减轻大型语言模型(LLM)中的量化误差。然而,现有方法不顾张量间离群值结构的巨大差异,一律均匀地应用固定变换。通过对LLM的权重、激活和梯度中离群模式的首次系统研究,我们表明这种策略存在根本缺陷:基于Hadamard的抑制是否有效,取决于变换的平滑方向如何与每个操作数的离群结构对齐,而这一属性在不同层和计算路径之间差异很大。我们将这些模式归为三类:行方向(Row-wise)、列方向(Column-wise)和无(None),每一种都需要定制的变换方向或离群值处理策略来最小化量化误差。基于这一洞察,我们提出AdaHOP(具有离群模式感知策略的自适应Hadamard变换),为每个矩阵乘法分配最优策略:在内维平滑有效时采用内Hadamard变换(IHT);在其无效时,将IHT与选择性离群值提取(OE)结合,把主导离群值路由到高精度路径。结合硬件感知的Triton内核,AdaHOP在MXFP4精度下达到BF16的训练质量,同时相对BF16全精度训练提供高达3.6倍的内存压缩和1.8倍的内核加速。
摘要:Low-precision training (LPT) commonly employs Hadamard transforms to suppress outliers and mitigate quantization error in large language models (LLMs). However, prior methods apply a fixed transform uniformly, despite substantial variation in outlier structures across tensors. Through the first systematic study of outlier patterns across weights, activations, and gradients of LLMs, we show that this strategy is fundamentally flawed: the effectiveness of Hadamard-based suppression depends on how the transform's smoothing direction aligns with the outlier structure of each operand -- a property that varies substantially across layers and computation paths. We characterize these patterns into three types: Row-wise, Column-wise, and None. Each pattern requires a tailored transform direction or outlier handling strategy to minimize quantization error. Based on this insight, we propose AdaHOP (Adaptive Hadamard transform with Outlier-Pattern-aware strategy), which assigns each matrix multiplication its optimal strategy: Inner Hadamard Transform (IHT) where inner-dimension smoothing is effective, or IHT combined with selective Outlier Extraction (OE) -- routing dominant outliers to a high-precision path -- where it is not. Combined with hardware-aware Triton kernels, AdaHOP achieves BF16 training quality at MXFP4 precision while delivering up to 3.6X memory compression and 1.8X kernel acceleration over BF16 full-precision training.
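下面的草图示意两个构件:按离群点在行/列上的集中程度判别离群模式,以及沿内维做正交Hadamard变换以摊平离群幅值。判别统计量与阈值均为演示用设定,并非AdaHOP论文的精确规则:
```python
import numpy as np
from scipy.linalg import hadamard

def outlier_pattern(X, z=6.0):
    mask = np.abs(X) > z * X.std()
    if not mask.any():
        return "none"
    row_frac = mask.sum(axis=1).max() / mask.sum()   # 离群点向某一行集中的程度
    col_frac = mask.sum(axis=0).max() / mask.sum()
    return "row-wise" if row_frac >= col_frac else "column-wise"

def inner_hadamard(X):
    n = X.shape[1]                        # 要求n为2的幂
    H = hadamard(n) / np.sqrt(n)          # 正交归一Hadamard矩阵
    return X @ H                          # 沿内维平滑

rng = np.random.default_rng(0)
X = rng.standard_normal((64, 64))
X[7, rng.choice(64, size=3, replace=False)] = 40.0       # 在第7行注入离群值
print(outlier_pattern(X))                                # -> "row-wise"
print(np.abs(X).max(), np.abs(inner_hadamard(X)).max())  # 变换后最大幅值明显下降
```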
【20】Causal-Audit: A Framework for Risk Assessment of Assumption Violations in Time-Series Causal Discovery
标题:因果审计:时间序列因果发现中假设违反的风险评估框架
链接:https://arxiv.org/abs/2604.02488
作者:Marco Ruiz,Miguel Arana-Catania,David R. Ardila,Rodrigo Ventura
备注:28 pages, 10 figures, 15 tables. Being submitted to Journal of Causal Inference JCI
摘要:时间序列因果发现方法依赖于平稳性、规则采样和有界时间依赖等假设。当这些假设被违反时,结构学习可能在毫无警示的情况下产生自信却具有误导性的因果图。我们提出Causal-Audit,一个将假设验证形式化为校准风险评估的框架。该框架在五个假设族(平稳性、不规则性、持久性、非线性和混杂代理)上计算效应量诊断,将其聚合为四个带不确定性区间的校准风险分数,并应用弃权感知(abstention-aware)的决策策略,仅在证据支持可靠推断时才推荐方法(例如PCMCI+、基于VAR的Granger因果)。其半自动诊断阶段也可独立用于个体研究中的结构化假设审核。在涵盖10个违反族的500个数据生成过程(DGP)的合成图谱上的评估表明,风险分数校准良好(AUROC > 0.95),在被推荐的数据集中假阳性减少62%,并在严重违反的案例中以78%的比例弃权。在来自TimeGraph(18个类别)和CausalTime(3个领域)的21个外部评估中,推荐或弃权的决定在所有情况下都与基准规范一致。我们框架的开源实现已公开。
摘要:Time-series causal discovery methods rely on assumptions such as stationarity, regular sampling, and bounded temporal dependence. When these assumptions are violated, structure learning can produce confident but misleading causal graphs without warning. We introduce Causal-Audit, a framework that formalizes assumption validation as calibrated risk assessment. The framework computes effect-size diagnostics across five assumption families (stationarity, irregularity, persistence, nonlinearity, and confounding proxies), aggregates them into four calibrated risk scores with uncertainty intervals, and applies an abstention-aware decision policy that recommends methods (e.g., PCMCI+, VAR-based Granger causality) only when evidence supports reliable inference. The semi-automatic diagnostic stage can also be used independently for structured assumption auditing in individual studies. Evaluation on a synthetic atlas of 500 data-generating processes (DGPs) spanning 10 violation families demonstrates well-calibrated risk scores (AUROC > 0.95), a 62% false positive reduction among recommended datasets, and 78% abstention on severe-violation cases. On 21 external evaluations from TimeGraph (18 categories) and CausalTime (3 domains), recommend-or-abstain decisions are consistent with benchmark specifications in all cases. An open-source implementation of our framework is available.
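框架的"诊断、风险分、推荐或弃权"流程可以压缩成一个玩具版本:下面只用平稳性的一个效应量诊断(分段均值漂移相对总体波动)演示决策逻辑。真实框架覆盖五个假设族并做校准聚合,此处的统计量与阈值均为演示假设:
```python
import numpy as np

def stationarity_effect_size(x, n_chunks=4):
    means = np.array([c.mean() for c in np.array_split(x, n_chunks)])
    return means.std() / (x.std() + 1e-12)       # 分段均值漂移 / 总体波动

def decide(x, low=0.2, high=0.5):
    risk = stationarity_effect_size(x)           # 论文中为多族诊断的校准聚合分
    if risk < low:
        return risk, "recommend: PCMCI+ / VAR-Granger"
    if risk < high:
        return risk, "recommend with caution"
    return risk, "abstain"                       # 证据不足以支持可靠推断时弃权

rng = np.random.default_rng(0)
print(decide(rng.standard_normal(2000)))                            # 平稳序列:低风险,推荐
print(decide(rng.standard_normal(2000) + np.linspace(0, 5, 2000)))  # 均值漂移:高风险,弃权
```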
【21】Dynamical structure of vanishing gradient and overfitting in multi-layer perceptrons
标题:多层感知器中梯度消失与过拟合的动力学结构
链接:https://arxiv.org/abs/2604.02393
作者:Alex Alì Maleknia,Yuzuru Sato
摘要:梯度消失与过拟合是机器学习文献中研究最广泛的两个问题。然而,它们通常是在某种渐近设定下被考虑,这掩盖了导致其出现的底层动力学机制。本文旨在对多层感知器中的学习给出清晰的动力学刻画。为此,我们受Fukumizu与Amari研究的启发,引入一个最小模型,用以研究经梯度下降训练的MLP中的梯度消失与过拟合。在该模型中,我们证明学习动力学在训练过程中可能先后经过高原区域和近最优区域(两者均由鞍点结构组成),最终收敛到过拟合区域。在训练数据集满足适当条件时,我们证明,在高概率下,过拟合区域在模掉对称性后坍缩为单一吸引子,而它正对应于过拟合。此外,我们证明在有限含噪数据集上训练的任何MLP都无法收敛到理论最优,而必然收敛到一个过拟合解。
摘要:Vanishing gradient and overfitting are two of the most extensively studied problems in the literature on machine learning. However, they are frequently considered in some asymptotic setting, which obscures the underlying dynamical mechanisms responsible for their emergence. In this paper, we aim to provide a clear dynamical description of learning in multi-layer perceptrons. To this end, we introduce a minimal model, inspired by studies by Fukumizu and Amari, to investigate vanishing gradients and overfitting in MLPs trained via gradient descent. Within this model, we show that the learning dynamics may pass through plateau regions and near-optimal regions during training, both of which consist of saddle structures, before ultimately converging to the overfitting region. Under suitable conditions on the training dataset, we prove that, with high probability, the overfitting region collapses to a single attractor modulo symmetry, which corresponds to overfitting. Moreover, we show that any MLP trained on a finite noisy dataset cannot converge to the theoretical optimum and instead necessarily converges to an overfitting solution.
【22】An Initial Exploration of Contrastive Prompt Tuning to Generate Energy-Efficient Code
标题:利用对比提示调优生成节能代码的初步探索
链接:https://arxiv.org/abs/2604.02352
作者:Sophie Weidmann,Fernando Castor
备注:Published at the Third International Workshop on Large Language Models for Code (LLM4Code 2026)
摘要:尽管LLM能够生成功能正确的代码,但与人类编写的方案相比,它们也倾向于生成能效较低的代码。由于这些低效会带来更高的计算开销,它们与旨在降低代码能耗的绿色软件开发(GSD)工作直接冲突。为支持这些工作,本研究旨在考察LLM是否以及如何被优化以促进节能代码的生成。为此,我们采用对比提示调优(CPT)。CPT结合了对比学习技术(帮助模型区分高效与低效代码)与提示调优(一种参数高效微调即PEFT方法,仅需传统微调成本的一小部分)。本研究在Python、Java和C++编程问题上、在三个不同模型上评估CPT,以提供全面的评估。该方法在其中两个模型上实现了代码准确性的一致提升,但效率增益随模型、语言和任务复杂度而异,表明改进并非普遍可靠。
摘要:Although LLMs are capable of generating functionally correct code, they also tend to produce less energy-efficient code in comparison to human-written solutions. As these inefficiencies lead to higher computational overhead, they are in direct conflict with Green Software Development (GSD) efforts, which aim to reduce the energy consumption of code. To support these efforts, this study aims to investigate whether and how LLMs can be optimized to promote the generation of energy-efficient code. To this end, we employ Contrastive Prompt Tuning (CPT). CPT combines Contrastive Learning techniques, which help the model to distinguish between efficient and inefficient code, and Prompt Tuning, a Parameter-Efficient Fine Tuning (PEFT) approach that requires only a fraction of the cost of traditional fine tuning. This study evaluates CPT on Python, Java and C++ coding problems across three different models to provide a comprehensive evaluation. The method achieves consistent improvements in code accuracy for two models but efficiency gains vary by model, language and task complexity, indicating that improvements are not uniformly reliable.
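以下PyTorch草图示意CPT核心训练步的思路:冻结主干,仅训练前置的软提示嵌入,用InfoNCE式对比损失拉近高效代码、推远低效代码。此处用GRU占位真实的LLM,编码方式、温度等均为演示假设,并非论文的原始实现:
```python
import torch, torch.nn as nn, torch.nn.functional as F

d, n_prompt = 64, 8
soft_prompt = nn.Parameter(torch.randn(n_prompt, d) * 0.02)  # 唯一可训练参数(PEFT)
encoder = nn.GRU(d, d, batch_first=True)                     # 占位:真实设置为冻结的LLM
for p in encoder.parameters():
    p.requires_grad_(False)

def embed(tokens):
    seq = torch.cat([soft_prompt.unsqueeze(0), tokens], dim=1)  # 前置软提示
    _, h = encoder(seq)
    return F.normalize(h[-1], dim=-1)

anchor = embed(torch.randn(1, 16, d))        # 任务/问题表示(演示用随机张量)
pos = embed(torch.randn(1, 16, d))           # 高效实现(正例)
neg = embed(torch.randn(1, 16, d))           # 低效实现(负例)
logits = torch.cat([anchor @ pos.T, anchor @ neg.T], dim=1) / 0.07  # 温度为常用取值
loss = F.cross_entropy(logits, torch.zeros(1, dtype=torch.long))    # InfoNCE:正例在索引0
loss.backward()                              # 梯度只流向soft_prompt
```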
【23】UI-Oceanus: Scaling GUI Agents with Synthetic Environmental Dynamics
标题:UI-Oceanus:利用合成环境动力学扩展图形用户界面代理
链接:https://arxiv.org/abs/2604.02345
作者:Mengzhou Wu,Yuzhe Guo,Yuan Cao,Haochuan Lu,Songhe Zhu,Pingzhe Qu,Xin Chen,Kang Qin,Zhongpu Wang,Xiaode Zhang,Xinyi Wang,Wei Dai,Gang Cao,Yuetang Deng,Zhi Gong,Dezhi Ran,Linyi Li,Wei Yang,Tao Xie
摘要:通用GUI代理的扩展受制于两重限制:昂贵人工演示带来的数据可扩展性瓶颈,以及合成教师监督的"蒸馏天花板"。为突破这些限制,我们提出UI-Oceanus,一个把学习重心从模仿高层轨迹转向借助真实环境反馈掌握交互物理的框架。通过对自监督目标的系统考察,我们发现前向动力学(定义为对未来界面状态的生成式预测)是可扩展性的主要驱动力,其作用显著超过逆向推断。UI-Oceanus利用这一洞察,将由系统执行直接验证的低成本自主探索转化为高密度的生成式监督,以构建稳健的内部世界模型。在一系列模型上的实验评估证明了我们方法的决定性优势:在合成动力学上进行持续预训练(CPT)的模型优于非CPT基线,在离线基准上平均成功率提升7%,该增益在真实世界在线导航中放大至16.8%。此外,我们观察到导航性能随合成数据量而扩展。这些结果证实,将代理奠基于前向预测建模,为具备稳健跨域适应性与组合泛化能力的可扩展GUI自动化提供了一条更优路径。
摘要:Scaling generalist GUI agents is hindered by the data scalability bottleneck of expensive human demonstrations and the "distillation ceiling" of synthetic teacher supervision. To transcend these limitations, we propose UI-Oceanus, a framework that shifts the learning focus from mimicking high-level trajectories to mastering interaction physics via ground-truth environmental feedback. Through a systematic investigation of self-supervised objectives, we identify that forward dynamics, defined as the generative prediction of future interface states, acts as the primary driver for scalability and significantly outweighs inverse inference. UI-Oceanus leverages this insight by converting low-cost autonomous exploration, which is verified directly by system execution, into high-density generative supervision to construct a robust internal world model. Experimental evaluations across a series of models demonstrate the decisive superiority of our approach: models utilizing Continual Pre-Training (CPT) on synthetic dynamics outperform non-CPT baselines with an average success rate improvement of 7% on offline benchmarks, which amplifies to a 16.8% gain in real-world online navigation. Furthermore, we observe that navigation performance scales with synthetic data volume. These results confirm that grounding agents in forward predictive modeling offers a superior pathway to scalable GUI automation with robust cross-domain adaptability and compositional generalization.
【24】Characterization of Gaussian Universality Breakdown in High-Dimensional Empirical Risk Minimization
标题:高维经验风险最小化中高斯普适性失效的刻画
链接:https://arxiv.org/abs/2604.03146
作者:Chiheb Yaakoubi,Cosme Louart,Malik Tiomoko,Zhenyu Liao
备注:27 pages, 4 figures
摘要:我们研究一般非高斯数据设计下的高维凸经验风险最小化(ERM)。通过将凸高斯极小-极大定理(CGMT)启发式地推广到非高斯设定,我们推导出关键统计量的渐近极小-极大刻画,从而能够近似ERM估计量$\hat{\theta}$的均值$\mu_{\hat{\theta}}$和协方差$C_{\hat{\theta}}$。具体而言,在数据矩阵的集中性假设以及损失和正则项的标准正则性条件下,我们证明:对于独立于训练数据的测试协变量$x$,投影$\hat{\theta}^\top x$近似服从$\mu_{\hat{\theta}}^\top x$的(通常非高斯)分布与一个方差为$\text{Tr}(C_{\hat{\theta}}\mathbb{E}[xx^\top])$的独立中心高斯变量的卷积。这一结果澄清了ERM的高斯普适性的适用范围与局限。此外,我们证明任何$\mathcal{C}^2$正则项都渐近等价于一个仅由其在零点的Hessian和在$\mu_{\hat{\theta}}$处的梯度确定的二次型。我们提供了涵盖多种损失和模型的数值模拟,以验证理论预测和定性见解。
摘要:We study high-dimensional convex empirical risk minimization (ERM) under general non-Gaussian data designs. By heuristically extending the Convex Gaussian Min-Max Theorem (CGMT) to non-Gaussian settings, we derive an asymptotic min-max characterization of key statistics, enabling approximation of the mean $\mu_{\hat{\theta}}$ and covariance $C_{\hat{\theta}}$ of the ERM estimator $\hat{\theta}$. Specifically, under a concentration assumption on the data matrix and standard regularity conditions on the loss and regularizer, we show that for a test covariate $x$ independent of the training data, the projection $\hat{\theta}^\top x$ approximately follows the convolution of the (generally non-Gaussian) distribution of $\mu_{\hat{\theta}}^\top x$ with an independent centered Gaussian variable of variance $\text{Tr}(C_{\hat{\theta}}\mathbb{E}[xx^\top])$. This result clarifies the scope and limits of Gaussian universality for ERMs. Additionally, we prove that any $\mathcal{C}^2$ regularizer is asymptotically equivalent to a quadratic form determined solely by its Hessian at zero and gradient at $\mu_{\hat{\theta}}$. Numerical simulations across diverse losses and models are provided to validate our theoretical predictions and qualitative insights.
【25】A semicontinuous relaxation of Saito's criterion and freeness as angular minimization
标题:Saito准则的半连续松弛与作为角度最小化的自由性
链接:https://arxiv.org/abs/2604.02995
作者:Tomás S. R. Silva
备注:This manuscript is a working paper, and an updated version will be posted later. 26 pages
摘要:我们在$\mathbb{P}^2$中的直线排列空间上引入一个非负泛函,它恰好在自由排列上取零,该泛函作为Saito自由性准则的半连续松弛而得到。给定由$n$条直线组成、候选指数为$(d_1, d_2)$的排列$\mathcal{A}$,我们通过相应导子矩阵的零空间来参数化次数为$d_1$和$d_2$的对数导子空间,并将Saito行列式表达为到$n$次多项式空间的双线性映射。该泛函于是具有自然的几何解释:它度量该双线性映射的像与系数空间中定义多项式$Q(\mathcal{A})$方向之间夹角的正弦平方,且当且仅当其像包含由$Q(\mathcal{A})$张成的直线时取零。这提供了一个可计算的度量,衡量给定排列距离拥有预期次数的对数导子自由基底有多远。以该泛函为奖励信号,我们开发了一个逐次构造过程:每次添加一条直线以最小化到自由性的角距离,并通过在排列规模和指数类型上带自适应课程的强化学习来实现。我们的结果表明,植根于多项式系数空间几何的半连续松弛技术,为直线排列理论中自由性的计算探索提供了一条可行途径。
摘要:We introduce a nonnegative functional on the space of line arrangements in $\mathbb{P}^2$ that vanishes precisely on free arrangements, obtained as a semicontinuous relaxation of Saito's criterion for freeness. Given an arrangement $\mathcal{A}$ of $n$ lines with candidate exponents $(d_1, d_2)$, we parameterize the spaces of logarithmic derivations of degrees $d_1$ and $d_2$ via the null spaces of the associated derivation matrices and express the Saito determinant as a bilinear map into the space of degree $n$ polynomials. The functional then admits a natural geometric interpretation: it measures the squared sine of the angle between the image of this bilinear map and the direction of the defining polynomial $Q(\mathcal{A})$ in coefficient space, and equals zero if and only if its image contains the line spanned by $Q(\mathcal{A})$. This provides a computable measure of how far a given arrangement is from admitting a free basis of logarithmic derivations of the expected degrees. Using this functional as a reward signal, we develop a sequential construction procedure in which lines are added one at a time so as to minimize the angular distance to freeness, implemented via reinforcement learning with an adaptive curriculum over arrangement sizes and exponent types. Our results suggest that semicontinuous relaxation techniques, grounded in the geometry of polynomial coefficient spaces, offer a viable approach to the computational exploration of freeness in the theory of line arrangements.
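该泛函的几何内核是"向量到子空间夹角的正弦平方":给定像空间的一组生成元B与目标方向q(即$Q(\mathcal{A})$的系数向量),正弦平方为零当且仅当q落在像空间中。以下NumPy草图演示这一计算;B与q均为随机演示数据,并非真实的导子矩阵:
```python
import numpy as np

def squared_sine_to_subspace(q, B):
    # q 与 span(B) 夹角的正弦平方:为0当且仅当 q 位于像空间中(即"自由")
    Q_orth, _ = np.linalg.qr(B)
    proj = Q_orth @ (Q_orth.T @ q)
    return 1.0 - (proj @ proj) / (q @ q)

rng = np.random.default_rng(0)
B = rng.standard_normal((10, 4))        # 列向量:双线性映射像空间的生成元(演示数据)
q_in = B @ rng.standard_normal(4)       # 位于像空间内:泛函约为0
q_out = rng.standard_normal(10)         # 一般位置:泛函严格大于0
print(squared_sine_to_subspace(q_in, B), squared_sine_to_subspace(q_out, B))
```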
【26】Inversion-Free Natural Gradient Descent on Riemannian Manifolds
标题:黎曼流形上的无逆自然梯度下降
链接:https://arxiv.org/abs/2604.02969
作者:Dario Draca,Takuo Matsubara,Minh-Ngoc Tran
备注:73 pages, 3 figures
摘要:自然梯度法广泛应用于统计优化,但其标准表述假设参数空间为欧氏空间。本文针对参数位于黎曼流形上的概率分布提出一种无需矩阵求逆的随机自然梯度法。流形设定具有多项优势:可以隐式地施加正定性、正交性等参数约束,确保参数可识别,或保证目标函数的正则性(如测地凸性)。基于流形上Fisher信息矩阵(FIM)的内蕴表述,我们的方法维护逆FIM的在线近似,并利用在相继迭代点采样的得分向量以二次成本高效更新。在黎曼设定中,这些得分向量属于不同的切空间,必须借助传输(transport)运算加以组合。当步长指数$\alpha>2/3$时,我们证明到极小点的平方距离具有$O(\log{s}/s^\alpha)$的几乎必然收敛速度。我们还为近似FIM建立了几乎必然的收敛速度,此时它会累积基于传输的误差。我们提出了一个存储复杂度低于二次的有限内存变体。最后,我们在采用高斯近似的变分贝叶斯和规范化流上展示了相对于欧氏对应方法的有效性。
摘要:The natural gradient method is widely used in statistical optimization, but its standard formulation assumes a Euclidean parameter space. This paper proposes an inversion-free stochastic natural gradient method for probability distributions whose parameters lie on a Riemannian manifold. The manifold setting offers several advantages: one can implicitly enforce parameter constraints such as positive definiteness and orthogonality, ensure parameters are identifiable, or guarantee regularity properties of the objective like geodesic convexity. Building on an intrinsic formulation of the Fisher information matrix (FIM) on a manifold, our method maintains an online approximation of the inverse FIM, which is efficiently updated at quadratic cost using score vectors sampled at successive iterates. In the Riemannian setting, these score vectors belong to different tangent spaces and must be combined using transport operations. We prove almost-sure convergence rates of $O(\log{s}/s^\alpha)$ for the squared distance to the minimizer when the step size exponent $\alpha>2/3$. We also establish almost-sure rates for the approximate FIM, which now accumulates transport-based errors. A limited-memory variant of the algorithm with sub-quadratic storage complexity is proposed. Finally, we demonstrate the effectiveness of our method relative to its Euclidean counterparts on variational Bayes with Gaussian approximations and normalizing flows.
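算法的核心递推是用得分向量对逆FIM做在线Sherman-Morrison更新,从而避免显式矩阵求逆。以下NumPy草图在欧氏空间(传输算子退化为恒等)验证该递推与直接求逆一致;黎曼情形还需把历史得分向量传输到当前切空间,γ等取值为演示参数:
```python
import numpy as np

def sherman_morrison_update(A_inv, s, gamma=0.1):
    # 若 F_new = (1-gamma)*F + gamma*s s^T,则可直接递推其逆,避免显式求逆
    B = A_inv / (1.0 - gamma)
    Bs = B @ s
    return B - gamma * np.outer(Bs, Bs) / (1.0 + gamma * (s @ Bs))

rng = np.random.default_rng(0)
d = 5
F, F_inv = np.eye(d), np.eye(d)
for _ in range(500):
    s = rng.standard_normal(d)              # 在当前迭代点采样的得分向量
    F = 0.9 * F + 0.1 * np.outer(s, s)      # 真值,仅用于核对
    F_inv = sherman_morrison_update(F_inv, s)
print(np.abs(F_inv - np.linalg.inv(F)).max())   # 应接近机器精度
```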
【27】Rethinking Forward Processes for Score-Based Data Assimilation in High Dimensions
标题:重新思考高维场景下基于分数的数据同化的前向过程
链接:https://arxiv.org/abs/2604.02889
作者:Eunbi Yoon,Donghan Kim,Dae Wook Kim
摘要:数据同化是通过整合模式预测和含噪观测来估计动力系统随时间演化状态的过程。它通常被表述为贝叶斯滤波,但经典滤波器在高维情形下常常难以兼顾精度或计算可行性。最近,基于分数的生成模型已成为高维数据同化的一种可扩展方法,能够对复杂分布进行精确建模和采样。然而,现有的基于分数的滤波器通常独立于数据同化来指定前向过程。因此,观测更新步骤依赖于对似然得分的启发式近似,可能随时间累积误差并降低性能。在此,我们提出一种观测感知的基于分数的滤波器(MASF),它直接从观测方程出发定义观测感知的前向过程。这一构造使似然得分解析可解:对于线性观测,我们推导出精确的似然得分,并将其与学习到的先验得分结合以得到后验得分。涵盖多种设定(包括高维数据集)的数值实验表明,其精度和稳定性优于现有的基于分数的滤波器。
摘要:Data assimilation is the process of estimating the time-evolving state of a dynamical system by integrating model predictions and noisy observations. It is commonly formulated as Bayesian filtering, but classical filters often struggle with accuracy or computational feasibility in high dimensions. Recently, score-based generative models have emerged as a scalable approach for high-dimensional data assimilation, enabling accurate modeling and sampling of complex distributions. However, existing score-based filters often specify the forward process independently of the data assimilation. As a result, the measurement-update step depends on heuristic approximations of the likelihood score, which can accumulate errors and degrade performance over time. Here, we propose a measurement-aware score-based filter (MASF) that defines a measurement-aware forward process directly from the measurement equation. This construction makes the likelihood score analytically tractable: for linear measurements, we derive the exact likelihood score and combine it with a learned prior score to obtain the posterior score. Numerical experiments covering a range of settings, including high-dimensional datasets, demonstrate improved accuracy and stability over existing score-based filters.
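对线性观测,似然得分有解析形式,这正是摘要所说"似然得分解析可解"的基础。以下草图写出高斯噪声线性观测下的该公式,并示意其与先验得分相加得到后验得分;此处先验得分用标准正态占位,实际MASF中由训练好的分数网络给出,维度与σ均为演示假设:
```python
import numpy as np

def likelihood_score(x, y, H, sigma):
    # 对 y = Hx + eps, eps ~ N(0, sigma^2 I):grad_x log p(y|x) = H^T (y - Hx) / sigma^2
    return H.T @ (y - H @ x) / sigma**2

rng = np.random.default_rng(0)
dx, dy, sigma = 8, 3, 0.5                    # 维度与噪声水平(演示假设)
H = rng.standard_normal((dy, dx))
x = rng.standard_normal(dx)
y = H @ x + sigma * rng.standard_normal(dy)

prior_score = -x                             # 演示占位:N(0, I)先验的得分;实际由分数网络给出
posterior_score = prior_score + likelihood_score(x, y, H, sigma)
print(posterior_score)
```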
【28】Lipschitz bounds for integral kernels
标题:积分核的Lipschitz界
链接:https://arxiv.org/abs/2604.02887
作者:Justin Reverdi,Sixin Zhang,Fabrice Gamboa,Serge Gratton
摘要:与正定核相关的特征映射在核方法和学习理论中处于核心地位,其中Lipschitz连续性等正则性与鲁棒性和稳定性保证密切相关。尽管十分重要,核特征映射的Lipschitz常数的显式刻画仅在少数情形下可得。本文在可微性假设下研究与积分核相关的特征映射的Lipschitz正则性。我们首先给出确保Lipschitz连续性的充分条件,并推导相应Lipschitz常数的显式公式;随后刻画了特征映射不满足Lipschitz连续性的一个条件,并将这些结果应用于若干重要的核类别。对于权重服从各向同性高斯分布的无限宽两层神经网络,我们证明相关核的Lipschitz常数可以表示为一个二维积分的上确界,从而对高斯核与ReLU随机神经网络核给出显式刻画。我们还研究了高斯核、拉普拉斯核和Matérn核等连续且平移不变的核,它们可解释为带余弦激活函数的神经网络。在这一设定下,我们证明特征映射是Lipschitz连续的当且仅当权重分布具有有限的二阶矩,并进而推导其Lipschitz常数。最后,我们提出一个关于有限宽度神经网络中Lipschitz常数收敛的渐近行为的公开问题,并给出支持该行为的数值实验。
摘要:Feature maps associated with positive definite kernels play a central role in kernel methods and learning theory, where regularity properties such as Lipschitz continuity are closely related to robustness and stability guarantees. Despite their importance, explicit characterizations of the Lipschitz constant of kernel feature maps are available only in a limited number of cases. In this paper, we study the Lipschitz regularity of feature maps associated with integral kernels under differentiability assumptions. We first provide sufficient conditions ensuring Lipschitz continuity and derive explicit formulas for the corresponding Lipschitz constants. We then identify a condition under which the feature map fails to be Lipschitz continuous and apply these results to several important classes of kernels. For infinite width two-layer neural network with isotropic Gaussian weight distributions, we show that the Lipschitz constant of the associated kernel can be expressed as the supremum of a two-dimensional integral, leading to an explicit characterization for the Gaussian kernel and the ReLU random neural network kernel. We also study continuous and shift-invariant kernels such as Gaussian, Laplace, and Matérn kernels, which admit an interpretation as neural network with cosine activation function. In this setting, we prove that the feature map is Lipschitz continuous if and only if the weight distribution has a finite second-order moment, and we then derive its Lipschitz constant. Finally, we raise an open question concerning the asymptotic behavior of the convergence of the Lipschitz constant in finite width neural networks. Numerical experiments are provided to support this behavior.
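特征映射之间的距离可以不显式构造φ而由核函数算出:||φ(x)-φ(y)||² = k(x,x)+k(y,y)-2k(x,y)。以下草图据此数值估计高斯核特征映射的Lipschitz常数(用近距离点对逼近上确界;采样方式为演示假设),结果应接近1/σ:
```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma**2))

def feature_map_ratio(x, y, k):
    feat_dist = np.sqrt(max(k(x, x) + k(y, y) - 2 * k(x, y), 0.0))  # ||phi(x)-phi(y)||
    return feat_dist / np.linalg.norm(x - y)

rng = np.random.default_rng(0)
ratios = []
for _ in range(10_000):
    x = rng.standard_normal(3)
    y = x + 1e-3 * rng.standard_normal(3)    # 近距离点对逼近Lipschitz常数的上确界
    ratios.append(feature_map_ratio(x, y, gaussian_kernel))
print(max(ratios))                           # 对sigma=1的高斯核应接近1
```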
【29】AQVolt26: High-Temperature r$^2$SCAN Halide Dataset for Universal ML Potentials and Solid-State Batteries
标题:AQVolt26:面向通用机器学习势与固态电池的高温r$^2$SCAN卤化物数据集
链接:https://arxiv.org/abs/2604.02524
作者:Jiyoon Kim,Chuhong Wang,Aayush R. Singh,Tyler Sours,Shivang Agarwal,AJ Nish,Paul Abruzzo,Ang Xiao,Omar Allam
摘要:对安全、高能量密度电池的需求使卤化物固态电解质备受关注,它们有望带来更高的离子迁移率、电化学稳定性和界面可变形性。加速其发现需要大规模分子动力学模拟,而这越来越多地由在基础数据集上训练的通用机器学习原子间势来实现。然而,卤化物在动力学上的柔软性提出了一个严苛的检验:通用模型能否在探测离子输运所需的高度畸变、高温状态下可靠地替代第一性原理计算。在此,我们提出AQVolt26,一个包含322,656个锂卤化物r$^2$SCAN单点计算的数据集,通过对约5K个结构进行高温构型采样生成。我们证明,基础数据集为稳定的卤化物化学体系提供了强基线,对局部力的迁移表现良好,但在畸变的高温状态下绝对能量预测会退化;与AQVolt26联合训练解决了这一盲区。此外,加入Materials Project弛豫数据可提升近平衡性能,却在不提高高温力精度的情况下降低了极端应变下的鲁棒性。这些结果表明,面向特定领域的构型采样对于卤化物电解质的可靠动力学筛选至关重要。我们的发现还表明,尽管基础模型提供了稳健的基础,但对于动力学上柔软的固态化学体系,用有针对性的高温数据加以增强时效果最佳。最后,我们表明近平衡弛豫数据是一种面向特定任务的补充,而非普遍有益的添加。
摘要:The demand for safe, high-energy-density batteries has spotlighted halide solid-state electrolytes, which offer the potential for enhanced ionic mobility, electrochemical stability, and interfacial deformability. Accelerating their discovery requires extensive molecular dynamics, which has been increasingly enabled by universal machine learning interatomic potentials trained on foundational datasets. However, the dynamic softness of halides poses a stringent test of whether general-purpose models can reliably replace first-principles calculations under the highly distorted, elevated-temperature regimes necessary to probe ion transport. Here, we present AQVolt26, a dataset of 322,656 r$^2$SCAN single-point calculations for lithium halides, generated via high-temperature configurational sampling across $\sim$5K structures. We demonstrate that foundational datasets provide a strong baseline for stable halide chemistries and transfer local forces well, however absolute energy predictions degrade in distorted higher-temperature regimes. Co-training with AQVolt26 resolves this blind spot. Furthermore, incorporating Materials Project relaxation data improves near-equilibrium performance but degrades extreme-strain robustness without enhancing high-temperature force accuracy. These results demonstrate that domain-specific configurational sampling is essential for the reliable dynamic screening of halide electrolytes. Furthermore, our findings suggest that while foundational models provide a robust base, they are most effective for dynamically soft solid-state chemistries when augmented with targeted, high-temperature data. Finally, we show that near-equilibrium relaxation data serves as a task-specific complement rather than a universally beneficial addition.
机器翻译由腾讯交互翻译提供,仅供参考
点击“阅读原文”获取带摘要的学术速递