
Machine Learning Academic Digest [1.12]

arXiv Daily Academic Digest • 3 weeks ago



cs.LG: 111 papers today


Large Models (21 papers)

【1】Illusions of Confidence? Diagnosing LLM Truthfulness via Neighborhood Consistency
Link: https://arxiv.org/abs/2601.05905

Authors: Haoming Xu, Ningyuan Zhao, Yunzhi Yao, Weihong Xu, Hongru Wang, Xinle Deng, Shumin Deng, Jeff Z. Pan, Huajun Chen, Ningyu Zhang
Note: Work in progress
Abstract: As Large Language Models (LLMs) are increasingly deployed in real-world settings, correctness alone is insufficient. Reliable deployment requires maintaining truthful beliefs under contextual perturbations. Existing evaluations largely rely on point-wise confidence measures like Self-Consistency, which can mask brittle beliefs. We show that even facts answered with perfect self-consistency can rapidly collapse under mild contextual interference. To address this gap, we propose Neighbor-Consistency Belief (NCB), a structural measure of belief robustness that evaluates response coherence across a conceptual neighborhood. To validate the effectiveness of NCB, we introduce a new cognitive stress-testing protocol that probes output stability under contextual interference. Experiments across multiple LLMs show that performance on high-NCB data is relatively more resistant to interference. Finally, we present Structure-Aware Training (SAT), which optimizes context-invariant belief structure and reduces long-tail knowledge brittleness by approximately 30%. Code will be available at https://github.com/zjunlp/belief.
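The contrast between point-wise self-consistency and a neighborhood-based score can be sketched in a few lines. This is a hypothetical reading of NCB (the abstract does not give the exact formula): `brittle_model`, the neighborhood queries, and the expected-answer map are invented for illustration.

```python
def self_consistency(model, query, n_samples=8):
    """Point-wise confidence: agreement rate of repeated samples on ONE query."""
    answers = [model(query) for _ in range(n_samples)]
    top = max(set(answers), key=answers.count)
    return answers.count(top) / n_samples

def neighbor_consistency_belief(model, neighbors, expected):
    """Hypothetical NCB sketch: instead of resampling one query, probe a
    conceptual neighborhood of related queries and score how coherently
    the model answers across it.  `expected` maps each neighbor query to
    the answer entailed by a truthful belief."""
    hits = sum(model(q) == expected[q] for q in neighbors)
    return hits / len(neighbors)

# A brittle "belief": perfectly self-consistent on the probe query,
# yet incoherent across its conceptual neighborhood.
def brittle_model(q):
    return {"capital of France?": "Paris"}.get(q, "unsure")

neighbors = ["capital of France?", "Is Paris in France?",
             "Which country is Paris the capital of?"]
expected = {"capital of France?": "Paris", "Is Paris in France?": "yes",
            "Which country is Paris the capital of?": "France"}

sc = self_consistency(brittle_model, "capital of France?")   # perfect: 1.0
ncb = neighbor_consistency_belief(brittle_model, neighbors, expected)  # low: 1/3
```

The gap between `sc` and `ncb` is exactly the "illusion of confidence" the title points at: point-wise resampling cannot see that the surrounding belief structure is incoherent.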


【2】EnvScaler: Scaling Tool-Interactive Environments for LLM Agent via Programmatic Synthesis
Link: https://arxiv.org/abs/2601.05808

Authors: Xiaoshuai Song, Haofei Chang, Guanting Dong, Yutao Zhu, Zhicheng Dou, Ji-Rong Wen
Note: Work in progress
Abstract: Large language models (LLMs) are expected to be trained to act as agents in various real-world environments, but this process relies on rich and varied tool-interaction sandboxes. However, access to real systems is often restricted; LLM-simulated environments are prone to hallucinations and inconsistencies; and manually built sandboxes are hard to scale. In this paper, we propose EnvScaler, an automated framework for scaling tool-interaction environments via programmatic synthesis. EnvScaler comprises two components. First, SkelBuilder constructs diverse environment skeletons through topic mining, logic modeling, and quality evaluation. Then, ScenGenerator generates multiple task scenarios and rule-based trajectory validation functions for each environment. With EnvScaler, we synthesize 191 environments and about 7K scenarios, and apply them to Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) for Qwen3 series models. Results on three benchmarks show that EnvScaler significantly improves LLMs' ability to solve tasks in complex environments involving multi-turn, multi-tool interactions. We release our code and data at https://github.com/RUC-NLPIR/EnvScaler.
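The "rule-based trajectory validation functions" ScenGenerator produces suggest checkers of roughly the following shape. Everything below (tool names, rules, trajectory format) is invented for illustration; it is a minimal sketch, not EnvScaler's actual interface.

```python
def validate_trajectory(trajectory, rules):
    """Check an agent trajectory (a list of (tool_name, args) calls)
    against a set of scenario-specific rule predicates."""
    return all(rule(trajectory) for rule in rules)

# Hypothetical scenario rules: a booking task must search before booking,
# and must book exactly once.
def searched_before_booking(traj):
    tools = [name for name, _ in traj]
    return ("search_flights" in tools and "book_flight" in tools
            and tools.index("search_flights") < tools.index("book_flight"))

def booked_once(traj):
    return sum(1 for name, _ in traj if name == "book_flight") == 1

good = [("search_flights", {"to": "NRT"}), ("book_flight", {"id": 42})]
bad = [("book_flight", {"id": 42}), ("search_flights", {"to": "NRT"})]

ok = validate_trajectory(good, [searched_before_booking, booked_once])
not_ok = validate_trajectory(bad, [searched_before_booking, booked_once])
```

Because the rules are plain predicates over the call sequence, they can serve both as SFT data filters and as RL reward signals without an LLM in the loop.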


【3】Simplify-This: A Comparative Analysis of Prompt-Based and Fine-Tuned LLMs
Link: https://arxiv.org/abs/2601.05794

Authors: Eilam Cohen, Itamar Bul, Danielle Inbar, Omri Loewenbach
Abstract: Large language models (LLMs) enable strong text generation, but in practice there is a tradeoff between fine-tuning and prompt engineering. We introduce Simplify-This, a comparative study evaluating both paradigms for text simplification with encoder-decoder LLMs across multiple benchmarks, using a range of evaluation metrics. Fine-tuned models consistently deliver stronger structural simplification, whereas prompting often attains higher semantic similarity scores yet tends to copy inputs. A human evaluation favors fine-tuned outputs overall. We release code, a cleaned derivative dataset used in our study, checkpoints of fine-tuned models, and prompt templates to facilitate reproducibility and future work.


【4】FLRQ: Faster LLM Quantization with Flexible Low-Rank Matrix Sketching
Link: https://arxiv.org/abs/2601.05684

Authors: Hongyaoxing Gul, Lijuan Hu, Shuzi Niu, Fangfang Liu
Abstract: Traditional post-training quantization (PTQ) is considered an effective approach to reduce model size and accelerate inference of large-scale language models (LLMs). However, existing low-rank PTQ methods require costly fine-tuning to determine a compromise rank for diverse data and layers in large models, failing to exploit their full potential. Additionally, current SVD-based low-rank approximation compounds the computational overhead. In this work, we thoroughly analyze the varying effectiveness of low-rank approximation across different layers in representative models. Accordingly, we introduce \underline{F}lexible \underline{L}ow-\underline{R}ank \underline{Q}uantization (FLRQ), a novel solution designed to quickly identify accuracy-optimal ranks and aggregate them to achieve minimal storage combinations. FLRQ comprises two components: Rank1-Sketch-based Flexible Rank Selection (R1-FLR) and Best Low-rank Approximation under Clipping (BLC). R1-FLR applies the R1-Sketch with Gaussian projection for fast low-rank approximation, enabling outlier-aware rank extraction for each layer. Meanwhile, BLC minimizes the low-rank quantization error under the scaling and clipping strategy through an iterative method. FLRQ demonstrates strong effectiveness and robustness in comprehensive experiments, achieving state-of-the-art performance in both quantization quality and algorithm efficiency.


【5】Do Sparse Autoencoders Identify Reasoning Features in Language Models?
Link: https://arxiv.org/abs/2601.05679

Authors: George Ma, Zhongyuan Liang, Irene Y. Chen, Somayeh Sojoudi
Abstract: We investigate whether sparse autoencoders (SAEs) identify genuine reasoning features in large language models (LLMs). Starting from features selected using standard contrastive activation methods, we introduce a falsification-oriented framework that combines causal token injection experiments and LLM-guided falsification to test whether feature activation reflects reasoning processes or superficial linguistic correlates. Across 20 configurations spanning multiple model families, layers, and reasoning datasets, we find that identified reasoning features are highly sensitive to token-level interventions. Injecting a small number of feature-associated tokens into non-reasoning text is sufficient to elicit strong activation for 59% to 94% of features, indicating reliance on lexical artifacts. For the remaining features that are not explained by simple token triggers, LLM-guided falsification consistently produces non-reasoning inputs that activate the feature and reasoning inputs that do not, with no analyzed feature satisfying our criteria for genuine reasoning behavior. Steering these features yields minimal changes or slight degradations in benchmark performance. Together, these results suggest that SAE features identified by contrastive approaches primarily capture linguistic correlates of reasoning rather than the underlying reasoning computations themselves.


【6】Multilingual Amnesia: On the Transferability of Unlearning in Multilingual LLMs
Link: https://arxiv.org/abs/2601.05641

Authors: Alireza Dehghanpour Farashah, Aditi Khandelwal, Marylou Fauchard, Zhuan Shi, Negar Rostamzadeh, Golnoosh Farnadi
Abstract: As multilingual large language models become more widely used, ensuring their safety and fairness across diverse linguistic contexts presents unique challenges. While existing research on machine unlearning has primarily focused on monolingual settings, typically English, multilingual environments introduce additional complexities due to cross-lingual knowledge transfer and biases embedded in both pretraining and fine-tuning data. In this work, we study multilingual unlearning using the Aya-Expanse 8B model under two settings: (1) data unlearning and (2) concept unlearning. We extend benchmarks for factual knowledge and stereotypes to ten languages through translation: English, French, Arabic, Japanese, Russian, Farsi, Korean, Hindi, Hebrew, and Indonesian. These languages span five language families and a wide range of resource levels. Our experiments show that unlearning in high-resource languages is generally more stable, with asymmetric transfer effects observed between typologically related languages. Furthermore, our analysis of linguistic distances indicates that syntactic similarity is the strongest predictor of cross-lingual unlearning behavior.


【7】Dual-Phase LLM Reasoning: Self-Evolved Mathematical Frameworks
Link: https://arxiv.org/abs/2601.05616

Authors: ShaoZhen Liu, Xinting Huang, Houwen Peng, Xin Chen, Xinyang Song, Qi Li, Zhenan Sun
Abstract: In recent years, large language models (LLMs) have demonstrated significant potential in complex reasoning tasks like mathematical problem-solving. However, existing research predominantly relies on reinforcement learning (RL) frameworks while overlooking supervised fine-tuning (SFT) methods. This paper proposes a new two-stage training framework that enhances models' self-correction capabilities through self-generated long chain-of-thought (CoT) data. During the first stage, a multi-turn dialogue strategy guides the model to generate CoT data incorporating verification, backtracking, subgoal decomposition, and backward reasoning, with predefined rules filtering high-quality samples for supervised fine-tuning. The second stage employs a difficulty-aware rejection sampling mechanism to dynamically optimize the data distribution, strengthening the model's ability to handle complex problems. The approach generates reasoning chains over 4 times longer while maintaining strong scalability, demonstrating that SFT effectively activates models' intrinsic reasoning capabilities and provides a resource-efficient pathway for complex task optimization. Experimental results demonstrate performance improvements on mathematical benchmarks including GSM8K and MATH500, with the fine-tuned model achieving a substantial improvement on competition-level problems like AIME24. Code will be open-sourced.


【8】Understanding LLM-Driven Test Oracle Generation
Link: https://arxiv.org/abs/2601.05542

Authors: Adam Bodicoat, Gunel Jahangirova, Valerio Terragni
Note: Accepted for presentation at the 2nd ACM/IEEE International Conference on AI-powered Software (AIware 2025)
Abstract: Automated unit test generation aims to improve software quality while reducing the time and effort required for creating tests manually. However, existing techniques primarily generate regression oracles that assert the implemented behavior of the class under test. They do not address the oracle problem: the challenge of distinguishing correct from incorrect program behavior. With the rise of Foundation Models (FMs), particularly Large Language Models (LLMs), there is a new opportunity to generate test oracles that reflect intended behavior. This positions LLMs as enablers of Promptware, where software creation and testing are driven by natural-language prompts. This paper presents an empirical study on the effectiveness of LLMs in generating test oracles that expose software failures. We investigate how different prompting strategies and levels of contextual input impact the quality of LLM-generated oracles. Our findings offer insights into the strengths and limitations of LLM-based oracle generation in the FM era, improving our understanding of their capabilities and fostering future research in this area.


【9】Over-Searching in Search-Augmented Large Language Models
Link: https://arxiv.org/abs/2601.05503

Authors: Roy Xie, Deepak Gopinath, David Qiu, Dong Lin, Haitian Sun, Saloni Potdar, Bhuwan Dhingra
Note: Accepted to EACL 2026 Main Conference
Abstract: Search-augmented large language models (LLMs) excel at knowledge-intensive tasks by integrating external retrieval. However, they often over-search -- unnecessarily invoking the search tool even when it does not improve response quality, which leads to computational inefficiency and hallucinations from incorporating irrelevant context. In this work, we conduct a systematic evaluation of over-searching across multiple dimensions, including query types, model categories, retrieval conditions, and multi-turn conversations. Our findings show: (i) search generally improves answer accuracy on answerable queries but harms abstention on unanswerable ones; (ii) over-searching is more pronounced in complex reasoning models and deep research systems, is exacerbated by noisy retrieval, and compounds across turns in multi-turn conversations; and (iii) the composition of retrieved evidence is crucial, as the presence of negative evidence improves abstention. To quantify over-searching, we introduce Tokens Per Correctness (TPC), an evaluation metric that captures the performance-cost trade-off for search-augmented LLMs. Lastly, we investigate mitigation approaches at both the query and retrieval levels and release OverSearchQA to foster continued research into efficient search-augmented LLMs.
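The abstract names Tokens Per Correctness but does not spell out its formula; the most direct reading of the name (a hypothetical reconstruction, not necessarily the paper's exact definition) is total tokens spent divided by the number of correct answers:

```python
def tokens_per_correctness(token_counts, correct_flags):
    """Tokens Per Correctness (TPC), read literally: total tokens spent
    per correct answer.  Lower is better -- each correct answer is
    bought with fewer search/reasoning tokens.  The paper's exact
    formula may differ; this is a sketch of the name's plain meaning."""
    n_correct = sum(correct_flags)
    if n_correct == 0:
        return float("inf")  # no correct answers: unbounded cost per success
    return sum(token_counts) / n_correct

# A system that always searches vs. one that searches selectively,
# with the same accuracy (3 of 4 correct):
always = tokens_per_correctness([500, 520, 480, 510], [1, 1, 0, 1])
selective = tokens_per_correctness([500, 120, 480, 110], [1, 1, 0, 1])
```

Under this reading, two systems with identical accuracy but different search budgets get different TPC, which is exactly the performance-cost trade-off the metric is meant to capture.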


【10】Hi-ZFO: Hierarchical Zeroth- and First-Order LLM Fine-Tuning via Importance-Guided Tensor Selection
Link: https://arxiv.org/abs/2601.05501

Authors: Feihu Jin, Ying Tan
Note: 13 pages, 4 figures
Abstract: Fine-tuning large language models (LLMs) using standard first-order (FO) optimization often drives training toward sharp, poorly generalizing minima. Conversely, zeroth-order (ZO) methods offer stronger exploratory behavior without relying on explicit gradients, yet suffer from slow convergence. More critically, our analysis reveals that in generative tasks, the vast output and search space significantly amplifies estimation variance, rendering ZO methods both noisy and inefficient. To address these challenges, we propose \textbf{Hi-ZFO} (\textbf{Hi}erarchical \textbf{Z}eroth- and \textbf{F}irst-\textbf{O}rder optimization), a hybrid framework designed to synergize the precision of FO gradients with the exploratory capability of ZO estimation. Hi-ZFO adaptively partitions the model through layer-wise importance profiling, applying precise FO updates to critical layers while leveraging ZO optimization for less sensitive ones. Notably, ZO in Hi-ZFO is not merely a memory-saving surrogate; it is intentionally introduced as a source of "beneficial stochasticity" to help the model escape the local minima where pure FO optimization tends to stagnate. Validated across diverse generative, mathematical, and code reasoning tasks, Hi-ZFO consistently achieves superior performance while significantly reducing training time. These results demonstrate the effectiveness of hierarchical hybrid optimization for LLM fine-tuning.


【11】Knowledge-Driven Multi-Turn Jailbreaking on Large Language Models
Link: https://arxiv.org/abs/2601.05445

Authors: Songze Li, Ruishi He, Xiaojun Jia, Jun Wang, Zhihui Fu
Abstract: Large Language Models (LLMs) face a significant threat from multi-turn jailbreak attacks, where adversaries progressively steer conversations to elicit harmful outputs. However, the practical effectiveness of existing attacks is undermined by several critical limitations: they struggle to maintain a coherent progression over long interactions, often losing track of what has been accomplished and what remains to be done; and they rely on rigid or pre-defined patterns, failing to adapt to the LLM's dynamic and unpredictable conversational state. To address these shortcomings, we introduce Mastermind, a multi-turn jailbreak framework that adopts a dynamic and self-improving approach. Mastermind operates in a closed loop of planning, execution, and reflection, enabling it to autonomously build and refine its knowledge of model vulnerabilities through interaction. It employs a hierarchical planning architecture that decouples high-level attack objectives from low-level tactical execution, ensuring long-term focus and coherence. This planning is guided by a knowledge repository that autonomously discovers and refines effective attack patterns by reflecting on interactive experiences. Mastermind leverages this accumulated knowledge to dynamically recombine and adapt attack vectors, dramatically improving both effectiveness and resilience. We conduct comprehensive experiments against state-of-the-art models, including GPT-5 and Claude 3.7 Sonnet. The results demonstrate that Mastermind significantly outperforms existing baselines, achieving substantially higher attack success rates and harmfulness ratings. Moreover, our framework exhibits notable resilience against multiple advanced defense mechanisms.


【12】Efficient Inference for Noisy LLM-as-a-Judge Evaluation
Link: https://arxiv.org/abs/2601.05420

Authors: Yiqun T Chen, Sizhu Lu, Sijia Li, Moran Guo, Shengyi Li
Abstract: Large language models (LLMs) are increasingly used as automatic evaluators of generative AI outputs, a paradigm often referred to as "LLM-as-a-judge." In practice, LLM judges produce imperfect predictions of the underlying truth and can exhibit systematic, non-random errors. Two main approaches have recently been proposed to address this issue: (i) direct measurement-error correction based on misclassification models such as Rogan-Gladen-style estimators, and (ii) surrogate-outcome approaches such as prediction-powered inference (PPI), which correct bias by calibrating prediction residuals on a small set of gold-standard human labels. In this paper, we systematically study the performance of these two approaches for estimating mean parameters (e.g., average benchmark scores or pairwise win rates). Leveraging tools from semiparametric efficiency theory, we unify the two classes of estimators by deriving explicit forms of efficient influence function (EIF)-based efficient estimators and characterize conditions under which PPI-style estimators attain strictly smaller asymptotic variance than measurement-error corrections. We verify our theoretical results in simulations and demonstrate the methods on real-data examples. We provide an implementation of the benchmarked methods and comparison utilities at https://github.com/yiqunchen/debias-llm-as-a-judge.
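For mean estimation, the PPI idea the abstract refers to has a very compact form: average the judge's predictions on the large unlabeled set, then add a bias correction (the "rectifier") estimated on the small gold-labeled set. The sketch below shows that standard estimator; the synthetic judge and scores are invented for illustration.

```python
import random

def ppi_mean(labeled_preds, labeled_gold, unlabeled_preds):
    """Prediction-powered inference (PPI) point estimate of a mean:
    mean(f(X)) over the big unlabeled set, plus the rectifier
    mean(Y - f(X)) fit on the small human-labeled set."""
    rectifier = sum(y - f for y, f in zip(labeled_gold, labeled_preds)) / len(labeled_gold)
    return sum(unlabeled_preds) / len(unlabeled_preds) + rectifier

# Synthetic demo: an LLM judge that systematically over-scores by +0.1.
rng = random.Random(0)
gold = [rng.random() for _ in range(100)]            # small human-labeled set
labeled_preds = [y + 0.1 for y in gold]              # judge's scores on that set
unlabeled_preds = [rng.random() + 0.1 for _ in range(10000)]  # judge-only scores

naive = sum(unlabeled_preds) / len(unlabeled_preds)  # biased upward (~0.6)
corrected = ppi_mean(labeled_preds, gold, unlabeled_preds)  # near true mean (~0.5)
```

The naive judge-only average inherits the judge's systematic +0.1 bias, while the rectified estimate recovers the true benchmark mean; the paper's contribution is characterizing when such PPI-style estimators beat explicit measurement-error corrections in asymptotic variance.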


【13】Lost in Execution: On the Multilingual Robustness of Tool Calling in Large Language Models
Link: https://arxiv.org/abs/2601.05366

Authors: Zheng Luo, T Pranav Kutralingam, Ogochukwu N Okoani, Wanpeng Xu, Hua Wei, Xiyang Hu
Abstract: Large Language Models (LLMs) are increasingly deployed as agents that invoke external tools through structured function calls. While recent work reports strong tool-calling performance under standard English-centric evaluations, the robustness of tool calling under multilingual user interactions remains underexplored. In this work, we introduce MLCL, a diagnostic benchmark, and conduct a systematic evaluation of multilingual tool calling across Chinese, Hindi, and the low-resource language Igbo. Through fine-grained error analysis, we show that many failures occur despite correct intent understanding and tool selection. We identify parameter value language mismatch as a dominant failure mode, where models generate semantically appropriate parameter values in the user's language, violating language-invariant execution conventions. We further evaluate several inference-time system strategies and find that while these strategies substantially reduce language-induced execution errors, none of them fully recovers English-level performance.
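The "parameter value language mismatch" failure mode is easy to make concrete. The toy schema and tool call below are invented (not from the MLCL benchmark): the model picks the right tool and a semantically correct value, but localizes an enum that the executor expects in a language-invariant form.

```python
# Hypothetical tool schema: enum parameters must use language-invariant
# values regardless of the language the user spoke.
TOOL_SCHEMA = {"get_weather": {"unit": {"celsius", "fahrenheit"}}}

def find_language_mismatches(tool_name, arguments):
    """Return enum parameters whose value falls outside the schema --
    e.g. a semantically correct but localized value."""
    schema = TOOL_SCHEMA.get(tool_name, {})
    return [p for p, allowed in schema.items()
            if p in arguments and arguments[p] not in allowed]

# The model understood a Chinese user's intent and chose the right tool,
# but emitted the localized enum value "摄氏度" (Celsius): execution fails
# even though intent understanding and tool selection were both correct.
bad = find_language_mismatches("get_weather", {"unit": "摄氏度"})   # ["unit"]
ok = find_language_mismatches("get_weather", {"unit": "celsius"})  # []
```

A schema-level check like this is one of the inference-time guardrails the paper's findings motivate, though per its results no such strategy fully restores English-level performance.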


【14】On the Limits of Self-Improving in LLMs and Why AGI, ASI and the Singularity Are Not Near Without Symbolic Model Synthesis
Link: https://arxiv.org/abs/2601.05280

Authors: Hector Zenil
Note: 26 pages
Abstract: We formalise recursive self-training in Large Language Models (LLMs) and Generative AI as a discrete-time dynamical system and prove that, as training data become increasingly self-generated ($\alpha_t \to 0$), the system inevitably undergoes degenerative dynamics. We derive two fundamental failure modes: (1) Entropy Decay, where finite sampling effects cause a monotonic loss of distributional diversity (mode collapse), and (2) Variance Amplification, where the loss of external grounding causes the model's representation of truth to drift as a random walk, bounded only by the support diameter. We show these behaviours are not contingent on architecture but are consequences of distributional learning on finite samples. We further argue that Reinforcement Learning with imperfect verifiers suffers a similar semantic collapse. To overcome these limits, we propose a path involving symbolic regression and program synthesis guided by Algorithmic Probability. The Coding Theorem Method (CTM) allows for identifying generative mechanisms rather than mere correlations, escaping the data-processing inequality that binds standard statistical learning. We conclude that while purely distributional learning leads to model collapse, hybrid neurosymbolic approaches offer a coherent framework for sustained self-improvement.
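The Entropy Decay mode can be illustrated with a toy finite-sampling simulation, a Wright-Fisher-style drift process. This is an illustration of the claimed mechanism under invented parameters, not the paper's formalism: each generation draws a finite sample from the current distribution and refits the empirical distribution as the next generation's "training data."

```python
import math
import random

def entropy_bits(p):
    return -sum(x * math.log2(x) for x in p if x > 0)

def self_train_step(p, n, rng):
    """One generation of purely self-generated data (the alpha_t -> 0
    regime): draw a finite sample of size n from the current
    distribution and refit its empirical distribution."""
    draws = rng.choices(range(len(p)), weights=p, k=n)
    return [draws.count(i) / n for i in range(len(p))]

rng = random.Random(0)
p = [0.25, 0.25, 0.25, 0.25]     # 2 bits of diversity to start
h0 = entropy_bits(p)
for _ in range(200):
    p = self_train_step(p, n=50, rng=rng)
h_final = entropy_bits(p)        # diversity decays toward mode collapse;
                                 # the decay is monotone in expectation,
                                 # individual runs fluctuate
```

With no external data re-injected, sampling noise alone drives the distribution toward a single mode, the finite-sample mechanism behind the paper's Entropy Decay result.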


【15】Enhancing Foundation Models in Transaction Understanding with LLM-based Sentence Embeddings
Link: https://arxiv.org/abs/2601.05271

Authors: Xiran Fan, Zhimeng Jiang, Chin-Chia Michael Yeh, Yuzhong Chen, Yingtong Dou, Menghai Pan, Yan Zheng
Abstract: The ubiquity of payment networks generates vast transactional data encoding rich consumer and merchant behavioral patterns. Recent foundation models for transaction analysis process tabular data sequentially but rely on index-based representations for categorical merchant fields, causing substantial semantic information loss by converting rich textual data into discrete tokens. While Large Language Models (LLMs) can address this limitation through superior semantic understanding, their computational overhead challenges real-time financial deployment. We introduce a hybrid framework that uses LLM-generated embeddings as semantic initializations for lightweight transaction models, balancing interpretability with operational efficiency. Our approach employs multi-source data fusion to enrich merchant categorical fields and a one-word constraint principle for consistent embedding generation across LLM architectures. We systematically address data quality through noise filtering and context-aware enrichment. Experiments on large-scale transaction datasets demonstrate significant performance improvements across multiple transaction understanding tasks.


【16】Transforming User Defined Criteria into Explainable Indicators with an Integrated LLM AHP System
Link: https://arxiv.org/abs/2601.05267

Authors: Geonwoo Bang, Dongho Kim, Moohong Min
Abstract: Evaluating complex texts across domains requires converting user-defined criteria into quantitative, explainable indicators, which is a persistent challenge in search and recommendation systems. Single-prompt LLM evaluations suffer from complexity and latency issues, while criterion-specific decomposition approaches rely on naive averaging or opaque black-box aggregation methods. We present an interpretable aggregation framework combining LLM scoring with the Analytic Hierarchy Process (AHP). Our method generates criterion-specific scores via LLM-as-judge, measures discriminative power using Jensen-Shannon distance, and derives statistically grounded weights through AHP pairwise comparison matrices. Experiments on Amazon review quality assessment and depression-related text scoring demonstrate that our approach achieves high explainability and operational efficiency while maintaining comparable predictive power, making it suitable for real-time, latency-sensitive web services.
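The AHP weighting step operates on a pairwise comparison matrix whose entry a_ij says how many times more important criterion i is than criterion j; weights come from (an approximation of) the matrix's principal eigenvector. Below is the standard geometric-mean approximation with an invented three-criterion example; the paper's exact setup is not shown in the abstract.

```python
import math

def ahp_weights(M):
    """Derive criterion weights from an AHP pairwise comparison matrix
    using the geometric-mean (logarithmic least squares) approximation
    to the principal eigenvector: w_i is proportional to the geometric
    mean of row i, normalized to sum to 1."""
    gm = [math.prod(row) ** (1.0 / len(row)) for row in M]
    s = sum(gm)
    return [g / s for g in gm]

# Hypothetical judgments: relevance is 3x more important than style and
# 2x more important than freshness; freshness is 2x style.
M = [
    [1.0,   3.0, 2.0],   # relevance
    [1 / 3, 1.0, 0.5],   # style
    [0.5,   2.0, 1.0],   # freshness
]
weights = ahp_weights(M)   # relevance > freshness > style
```

These weights then combine the per-criterion LLM-as-judge scores into one indicator, which is what makes the aggregation auditable rather than a black box.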


【17】Retrieval-Augmented Multi-LLM Ensemble for Industrial Part Specification Extraction
Link: https://arxiv.org/abs/2601.05266

Authors: Muzakkiruddin Ahmed Mohammed, John R. Talburt, Leon Claasssens, Adriaan Marais
Note: The 17th International Conference on Knowledge and Systems Engineering
Abstract: Industrial part specification extraction from unstructured text remains a persistent challenge in manufacturing, procurement, and maintenance, where manual processing is both time-consuming and error-prone. This paper introduces RAGsemble, a retrieval-augmented multi-LLM ensemble framework that orchestrates nine state-of-the-art Large Language Models (LLMs) within a structured three-phase pipeline. RAGsemble addresses key limitations of single-model systems by combining the complementary strengths of model families including Gemini (2.0, 2.5, 1.5), OpenAI (GPT-4o, o4-mini), Mistral Large, and Gemma (1B, 4B, 3n-e4b), while grounding outputs in factual data using FAISS-based semantic retrieval. The system architecture consists of three stages: (1) parallel extraction by diverse LLMs, (2) targeted research augmentation leveraging high-performing models, and (3) intelligent synthesis with conflict resolution and confidence-aware scoring. RAG integration provides real-time access to structured part databases, enabling the system to validate, refine, and enrich outputs through similarity-based reference retrieval. Experimental results using real industrial datasets demonstrate significant gains in extraction accuracy, technical completeness, and structured output quality compared to leading single-LLM baselines. Key contributions include a scalable ensemble architecture for industrial domains, seamless RAG integration throughout the pipeline, comprehensive quality assessment mechanisms, and a production-ready solution suitable for deployment in knowledge-intensive manufacturing environments.


【18】Quantifying Document Impact in RAG-LLMs
标题:量化RAG-LLM中的文档影响
链接:https://arxiv.org/abs/2601.05260

作者:Armin Gerami,Kazem Faghih,Ramani Duraiswami
摘要:检索增强生成(RAG)通过将大型语言模型(LLM)连接到外部知识来增强LLM,提高准确性并减少过时信息。然而,这引入了诸如事实不一致、源冲突、偏差传播和安全漏洞等挑战,这些挑战破坏了RAG系统的可信度。目前RAG评估的一个关键差距是缺乏一个衡量标准来量化单个检索到的文档对最终输出的贡献。为了解决这个问题,我们引入了影响分数(IS),一种基于部分信息分解的新度量,它可以衡量每个检索到的文档对生成的响应的影响。我们通过两个实验来验证IS。首先,在三个数据集上的投毒攻击模拟表明,IS在86%的情况下正确地将恶意文档识别为最具影响力的文档。其次,一项消融研究表明,仅使用IS排名靠前的文档生成的响应始终被认为比从其余文档生成的响应更接近原始响应。这些结果证实了IS在隔离和量化文档影响方面的功效,为提高RAG系统的透明度和可靠性提供了一个有价值的工具。
摘要:Retrieval Augmented Generation (RAG) enhances Large Language Models (LLMs) by connecting them to external knowledge, improving accuracy and reducing outdated information. However, this introduces challenges such as factual inconsistencies, source conflicts, bias propagation, and security vulnerabilities, which undermine the trustworthiness of RAG systems. A key gap in current RAG evaluation is the lack of a metric to quantify the contribution of individual retrieved documents to the final output. To address this, we introduce the Influence Score (IS), a novel metric based on Partial Information Decomposition that measures the impact of each retrieved document on the generated response. We validate IS through two experiments. First, a poison attack simulation across three datasets demonstrates that IS correctly identifies the malicious document as the most influential in $86\%$ of cases. Second, an ablation study shows that a response generated using only the top-ranked documents by IS is consistently judged more similar to the original response than one generated from the remaining documents. These results confirm the efficacy of IS in isolating and quantifying document influence, offering a valuable tool for improving the transparency and reliability of RAG systems.
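The paper's Influence Score is grounded in Partial Information Decomposition; as a far simpler stand-in, a leave-one-out ablation conveys the same intuition that a document's influence is how much the output moves when that document is removed. Everything here (the mean-embedding "response" and the toy documents) is an invented illustration, not the paper's method:

```python
import numpy as np

def loo_influence(doc_embeds):
    """Leave-one-out proxy for document influence.
    A 'response' is modeled as the mean of the retrieved document embeddings;
    a document's influence is how far the response moves when it is removed."""
    full = doc_embeds.mean(axis=0)
    scores = []
    for i in range(len(doc_embeds)):
        without = np.delete(doc_embeds, i, axis=0).mean(axis=0)
        scores.append(float(np.linalg.norm(full - without)))
    return scores

docs = np.array([[1.0, 0.0], [1.1, 0.1], [0.0, 5.0]])  # third doc is an outlier
s = loo_influence(docs)
print(s.index(max(s)))   # the outlier moves the response most
```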


【19】Automating Deception: Scalable Multi-Turn LLM Jailbreaks
标题:自动欺骗:可扩展多回合LLM越狱
链接:https://arxiv.org/abs/2511.19517

作者:Adarsh Kumarappan,Ananya Mujoo
摘要:多回合会话攻击利用Foot-in-the-Door(FITD)等心理学原理(即一个小的初始请求为更重要的请求铺平道路)来绕过安全对齐,对大型语言模型(LLM)构成持续威胁。防御这些攻击的进展受到依赖手动、难以扩展的数据集创建的阻碍。本文介绍了一种新的自动化管道,用于生成大规模的、基于心理学的多轮越狱数据集。我们系统地将FITD技术操作化为可复现的模板,创建了一个涵盖非法活动和攻击性内容的1,500个场景的基准。我们评估了来自三个主要LLM家族的七个模型在多轮(有历史)和单轮(无历史)条件下的表现。我们的研究结果揭示了上下文鲁棒性的明显差异:GPT家族中的模型表现出对会话历史的显著脆弱性,攻击成功率(ASR)增加了多达32个百分点。相比之下,谷歌的Gemini 2.5 Flash表现出非凡的弹性,几乎对这些攻击免疫,而Anthropic的Claude 3 Haiku表现出强大但不完美的抵抗力。这些发现突出了当前安全架构在处理会话上下文方面的关键分歧,并强调了对能够抵御基于叙述的操纵的防御的需求。
摘要:Multi-turn conversational attacks, which leverage psychological principles like Foot-in-the-Door (FITD), where a small initial request paves the way for a more significant one, to bypass safety alignments, pose a persistent threat to Large Language Models (LLMs). Progress in defending against these attacks is hindered by a reliance on manual, hard-to-scale dataset creation. This paper introduces a novel, automated pipeline for generating large-scale, psychologically-grounded multi-turn jailbreak datasets. We systematically operationalize FITD techniques into reproducible templates, creating a benchmark of 1,500 scenarios across illegal activities and offensive content. We evaluate seven models from three major LLM families under both multi-turn (with history) and single-turn (without history) conditions. Our results reveal stark differences in contextual robustness: models in the GPT family demonstrate a significant vulnerability to conversational history, with Attack Success Rates (ASR) increasing by as much as 32 percentage points. In contrast, Google's Gemini 2.5 Flash exhibits exceptional resilience, proving nearly immune to these attacks, while Anthropic's Claude 3 Haiku shows strong but imperfect resistance. These findings highlight a critical divergence in how current safety architectures handle conversational context and underscore the need for defenses that can resist narrative-based manipulation.


【20】Open World Knowledge Aided Single-Cell Foundation Model with Robust Cross-Modal Cell-Language Pre-training
标题:具有稳健跨模式细胞语言预训练的开放世界知识辅助单细胞基础模型
链接:https://arxiv.org/abs/2601.05648

作者:Haoran Wang,Xuanyi Zhang,Shuangsang Fang,Longke Ran,Ziqing Deng,Yong Zhang,Yuxiang Li,Shaoshuai Li
备注:41 pages
摘要:单细胞多组学(特别是RNA-seq)的最新进展为细胞异质性和基因调控提供了深刻的见解。虽然基于预训练语言模型(PLM)范式的单细胞基础模型已经显示出了希望,但它们仍然受到深度个体谱整合不足以及忽视多模态数据中噪声影响的限制。为了解决这两个问题,我们提出了一个开放世界语言知识辅助的鲁棒单细胞基础模型(OKR-CELL)。它基于跨模态的细胞-语言预训练框架,包括两个关键创新:(1)利用基于大语言模型(LLM)的工作流与检索增强生成(RAG),使用开放世界知识丰富细胞文本描述;(2)设计一个跨模态鲁棒对齐(CRA)目标,该目标包括样本可靠性评估、课程学习以及耦合动量对比学习,以增强模型对噪声数据的抵抗力。在对3200万个细胞-文本对进行预训练后,OKR-CELL在6个评估任务中获得了最先进的结果。除了细胞聚类、细胞类型注释、批次效应校正和少样本注释等标准基准外,该模型还在更广泛的多模态应用中表现出优越性能,包括零样本细胞类型注释和双向细胞-文本检索。
摘要:Recent advancements in single-cell multi-omics, particularly RNA-seq, have provided profound insights into cellular heterogeneity and gene regulation. While pre-trained language model (PLM) paradigm based single-cell foundation models have shown promise, they remain constrained by insufficient integration of in-depth individual profiles and neglecting the influence of noise within multi-modal data. To address both issues, we propose an Open-world Language Knowledge-Aided Robust Single-Cell Foundation Model (OKR-CELL). It is built based on a cross-modal Cell-Language pre-training framework, which comprises two key innovations: (1) leveraging Large Language Models (LLMs) based workflow with retrieval-augmented generation (RAG) enriches cell textual descriptions using open-world knowledge; (2) devising a Cross-modal Robust Alignment (CRA) objective that incorporates sample reliability assessment, curriculum learning, and coupled momentum contrastive learning to strengthen the model's resistance to noisy data. After pretraining on 32M cell-text pairs, OKR-CELL obtains cutting-edge results across 6 evaluation tasks. Beyond standard benchmarks such as cell clustering, cell-type annotation, batch-effect correction, and few-shot annotation, the model also demonstrates superior performance in broader multi-modal applications, including zero-shot cell-type annotation and bidirectional cell-text retrieval.


【21】DNATokenizer: A GPU-First Byte-to-Identifier Tokenizer for High-Throughput DNA Language Models
标题:DNATokenizer:一个用于高吞吐量DNA语言模型的GPU优先字节到标识符Tokenizer
链接:https://arxiv.org/abs/2601.05531

作者:Eliatan Niktab,Hardip Patel
摘要:令牌化位于高吞吐量基因组输入和GPU计算之间的边界,在算法设计和系统吞吐量方面都提出了挑战。重叠的k-mer标记化可能会在掩码语言建模(MLM)下引入信息泄漏,并可能降低下游准确性。单核苷酸标记化避免了泄漏并保留了每个碱基的保真度,但它大大增加了基于注意力的架构的序列长度。非重叠k-mer和字节对编码(BPE)提供压缩并避免泄漏,但以边界敏感性或降低的可解释性为代价。根据经验,标记化的选择与模型架构和任务需求有很强的交互作用。然而,在系统级别,一旦输入达到数十亿个碱基,标准字符串标记器和主机端词表查找就会占据挂钟时间的主要部分,而与标记化算法无关。我们提出了DNATok,一个高性能的GPU优先令牌化系统,它使用基于字节查找表(LUT)的标识符流,以及利用固定内存与架构并行性的重叠主机到设备(H2D)/计算流水线,来取代通用字符串处理。DNATok与词表无关:它加速单核苷酸、非重叠k-mer和BPE标记化,并作为即插即用的系统层集成在基因组基础模型之下。DNATok的编码吞吐量比优化的Hugging Face基线高84-95倍,H2D吞吐量最高高1.9倍。端到端流传输速度可达1.27-1.84e8 tokens/s(取决于配置),有效地消除了令牌化作为生产规模训练和推理的瓶颈。
摘要:Tokenization sits at the boundary between high-throughput genomic input and GPU compute, posing challenges in both algorithm design and system throughput. Overlapping k-mer tokenization can introduce information leakage under masked language modeling (MLM) and may degrade downstream accuracy. Single-nucleotide tokenization avoids leakage and preserves per-base fidelity, but it greatly increases sequence length for attention-based architectures. Non-overlapping k-mers and byte-pair encoding (BPE) provide compression and avoid leakage, at the cost of boundary sensitivity or reduced interpretability. Empirically, the choice of tokenization interacts strongly with model architecture and task requirements. At the system level, however, standard string tokenizers and host-bound vocabulary lookups dominate wall-clock time once inputs reach billions of bases, regardless of the tokenization algorithm. We present DNATok, a high-performance, GPU-first tokenization system that replaces general-purpose string processing with byte lookup table (LUT)-based identifier streaming and an overlapped host-to-device (H2D)/compute pipeline using pinned memory and architectural parallelism. DNATok is vocabulary-agnostic: it accelerates single-nucleotide, non-overlapping k-mer, and BPE tokenization, and integrates as a drop-in systems layer beneath genomic foundation models. DNATok achieves 84-95x higher encoding throughput than optimized Hugging Face baselines and up to 1.9x higher H2D throughput. End-to-end streaming reaches 1.27-1.84e8 tokens/s depending on configuration, effectively removing tokenization as a bottleneck for production-scale training and inference.
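The core idea of LUT-based byte-to-identifier tokenization can be sketched on the CPU with NumPy. The 4-symbol vocabulary and the -1 sentinel for unknown bytes are assumptions for illustration, not DNATok's actual tables, and the real system performs the lookup and streaming on the GPU:

```python
import numpy as np

# Hypothetical 4-symbol single-nucleotide vocabulary; -1 marks non-ACGT bytes.
LUT = np.full(256, -1, dtype=np.int64)
for i, base in enumerate(b"ACGT"):
    LUT[base] = i

def tokenize(seq: bytes) -> np.ndarray:
    """Map raw DNA bytes to token ids with a single vectorized table lookup."""
    return LUT[np.frombuffer(seq, dtype=np.uint8)]

ids = tokenize(b"ACGTN")
print(ids.tolist())   # 'N' is not in the table, so it maps to -1
```

Because the whole sequence is tokenized as one array indexing operation, there is no per-character Python or string-processing overhead, which is the property the GPU version exploits at scale.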


Graph相关(图学习|图神经网络|图优化等)(5篇)

【1】CyberGFM: Graph Foundation Models for Lateral Movement Detection in Enterprise Networks
标题:CyberGFM:企业网络中横向移动检测的图形基础模型
链接:https://arxiv.org/abs/2601.05988

作者:Isaiah J. King,Bernardo Trindade,Benjamin Bowman,H. Howie Huang
备注:17 pages; 11 figures; 8 tables
摘要:将网络表示为图,并利用良性连接训练链接预测模型,是基于异常的入侵检测的一种有效方法。采用这种技术的现有工作已经取得了巨大成功,使用了时间图神经网络和在随机游走上基于skip-gram的方法。然而,基于随机游走的方法无法融入丰富的边数据,而基于GNN的方法需要大量内存来训练。在这项工作中,我们建议将基于随机游走的skip-gram的原始洞见(即图上的随机游走类似于语料库中的句子)扩展到更现代的基于transformer的基础模型。借助利用GPU优化的语言模型,我们可以快速训练图基础模型,以预测计算机网络随机游走中缺失的标记。然后,对图基础模型进行微调,用于链接预测,并用作网络异常检测器。这种新方法使我们能够结合基于随机游走方法的效率和深度学习方法的丰富语义表示。我们称之为CyberGFM的这个系统在三个广泛使用的网络异常检测数据集上取得了最先进的结果,平均精度最高提高了2$\times$。我们发现,在用于网络异常检测的无监督链接预测任务上,CyberGFM在参数量相同的情况下优于所有先前的工作,并且效率与此前最佳方法相当或更高。
摘要:Representing networks as a graph and training a link prediction model using benign connections is an effective method of anomaly-based intrusion detection. Existing works using this technique have shown great success using temporal graph neural networks and skip-gram-based approaches on random walks. However, random walk-based approaches are unable to incorporate rich edge data, while the GNN-based approaches require large amounts of memory to train. In this work, we propose extending the original insight from random walk-based skip-grams--that random walks through a graph are analogous to sentences in a corpus--to the more modern transformer-based foundation models. Using language models that take advantage of GPU optimizations, we can quickly train a graph foundation model to predict missing tokens in random walks through a network of computers. The graph foundation model is then finetuned for link prediction and used as a network anomaly detector. This new approach allows us to combine the efficiency of random walk-based methods and the rich semantic representation of deep learning methods. This system, which we call CyberGFM, achieved state-of-the-art results on three widely used network anomaly detection datasets, delivering up to a 2$\times$ improvement in average precision. We found that CyberGFM outperforms all prior works in unsupervised link prediction for network anomaly detection, using the same number of parameters, and with equal or better efficiency than the previous best approaches.
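The "random walks as sentences" analogy underlying this line of work can be illustrated in a few lines of Python: walks over a host-connection graph become whitespace-joined strings that a language model could consume. The toy graph and walk parameters are invented:

```python
import random

def random_walk_sentences(adj, num_walks, walk_len, seed=0):
    """Generate random walks over an adjacency dict {node: [neighbors]};
    each walk is rendered as a space-joined 'sentence' of node tokens."""
    rng = random.Random(seed)
    nodes = sorted(adj)
    walks = []
    for _ in range(num_walks):
        walk = [rng.choice(nodes)]
        for _ in range(walk_len - 1):
            nbrs = adj[walk[-1]]
            if not nbrs:
                break
            walk.append(rng.choice(sorted(nbrs)))
        walks.append(" ".join(walk))
    return walks

graph = {"host_a": ["host_b"], "host_b": ["host_a", "host_c"], "host_c": ["host_b"]}
for sentence in random_walk_sentences(graph, num_walks=2, walk_len=4):
    print(sentence)
```

A masked-token objective over such sentences is what lets a transformer learn which connections are plausible, so that unlikely edges later stand out as anomalies.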


【2】SceneAlign: Aligning Multimodal Reasoning to Scene Graphs in Complex Visual Scenes
标题:SceneAlign:将多模式推理与复杂视觉场景中的场景图对齐
链接:https://arxiv.org/abs/2601.05600

作者:Chuhan Wang,Xintong Li,Jennifer Yuntong Zhang,Junda Wu,Chengkai Huang,Lina Yao,Julian McAuley,Jingbo Shang
备注:Preprint
摘要:多模态大型语言模型通常在复杂的视觉场景中难以进行忠实的推理,其中复杂的实体和关系在每一步都需要精确的视觉基础。这种推理的不忠实经常表现为幻觉实体、错误接地的关系、跳过的步骤和过度指定的推理。现有的基于偏好的方法通常依赖于文本扰动或以答案为条件的推理,无法解决这一挑战,因为它们允许模型利用语言先验来绕过视觉接地。为了解决这个问题,我们提出了SceneAlign,一个利用场景图作为结构化视觉信息来执行可控结构干预的框架。通过识别推理关键节点,并通过模拟典型接地失败的四种有针对性的策略对其进行扰动,SceneAlign构建了困难负例推理(hard negative rationales),这些推理在语言上仍然合理,但基于不准确的视觉事实。这些对比对用于直接偏好优化,以将模型引向细粒度、结构忠实的推理。在七个视觉推理基准中,SceneAlign始终提高了答案的准确性和推理的忠实性,突出了多模态推理的接地感知对齐的有效性。
摘要:Multimodal large language models often struggle with faithful reasoning in complex visual scenes, where intricate entities and relations require precise visual grounding at each step. This reasoning unfaithfulness frequently manifests as hallucinated entities, mis-grounded relations, skipped steps, and over-specified reasoning. Existing preference-based approaches, typically relying on textual perturbations or answer-conditioned rationales, fail to address this challenge as they allow models to exploit language priors to bypass visual grounding. To address this, we propose SceneAlign, a framework that leverages scene graphs as structured visual information to perform controllable structural interventions. By identifying reasoning-critical nodes and perturbing them through four targeted strategies that mimic typical grounding failures, SceneAlign constructs hard negative rationales that remain linguistically plausible but are grounded in inaccurate visual facts. These contrastive pairs are used in Direct Preference Optimization to steer models toward fine-grained, structure-faithful reasoning. Across seven visual reasoning benchmarks, SceneAlign consistently improves answer accuracy and reasoning faithfulness, highlighting the effectiveness of grounding-aware alignment for multimodal reasoning.


【3】Scalable Heterogeneous Graph Learning via Heterogeneous-aware Orthogonal Prototype Experts
标题:通过异构感知正交原型专家实现可扩展的异构图学习
链接:https://arxiv.org/abs/2601.05537

作者:Wei Zhou,Hong Huang,Ruize Shi,Bang Liu
摘要:异构图神经网络(HGNN)的进步主要来自更好的编码器,但它们的解码/投影阶段仍然依赖于单个共享线性头,并假设它可以将丰富的节点嵌入映射到标签。我们称之为线性投影瓶颈:在异构图中,上下文多样性和长尾偏移使全局头错过精细语义、过度拟合枢纽节点并忽视尾部节点。虽然专家混合(MoE)可能有所帮助,但天真地应用它会与结构失衡相冲突,并有专家崩溃的风险。我们提出了一个名为HOPE的异构感知正交原型专家框架,它是标准预测头的即插即用替代品。HOPE使用可学习的基于原型的路由,根据相似性将实例分配给专家,让专家使用遵循自然的长尾分布,并加入专家正交化以鼓励多样性并防止崩溃。在四个真实数据集上的实验表明,HOPE在SOTA HGNN主干上以最小的开销带来一致的增益。
摘要:Heterogeneous Graph Neural Networks(HGNNs) have advanced mainly through better encoders, yet their decoding/projection stage still relies on a single shared linear head, assuming it can map rich node embeddings to labels. We call this the Linear Projection Bottleneck: in heterogeneous graphs, contextual diversity and long-tail shifts make a global head miss fine semantics, overfit hub nodes, and underserve tail nodes. While Mixture-of-Experts(MoE) could help, naively applying it clashes with structural imbalance and risks expert collapse. We propose a Heterogeneous-aware Orthogonal Prototype Experts framework named HOPE, a plug-and-play replacement for the standard prediction head. HOPE uses learnable prototype-based routing to assign instances to experts by similarity, letting expert usage follow the natural long-tail distribution, and adds expert orthogonalization to encourage diversity and prevent collapse. Experiments on four real datasets show consistent gains across SOTA HGNN backbones with minimal overhead.
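Prototype-based routing and an orthogonalization penalty can be sketched as follows, assuming dot-product similarity on embeddings and a Gram-matrix penalty that is zero for orthonormal prototypes; HOPE's exact objective may differ:

```python
import numpy as np

def route_to_experts(h, prototypes):
    """Assign each instance embedding (rows of h) to its most similar expert prototype."""
    sims = h @ prototypes.T          # similarity of each instance to each prototype
    return np.argmax(sims, axis=1)

def orthogonality_penalty(prototypes):
    """Encourage expert diversity: 0 exactly when prototypes are orthonormal."""
    gram = prototypes @ prototypes.T
    return float(np.linalg.norm(gram - np.eye(len(prototypes))) ** 2)

P = np.eye(3)                                   # three orthonormal prototypes
H = np.array([[0.9, 0.1, 0.0], [0.0, 0.2, 0.8]])
print(route_to_experts(H, P).tolist())          # instance 0 -> expert 0, instance 1 -> expert 2
print(orthogonality_penalty(P))                 # orthonormal, so the penalty is 0
```

Adding the penalty to the training loss pushes prototypes apart, which is one standard way to prevent the expert-collapse failure mode the abstract mentions.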


【4】DynaSTy: A Framework for SpatioTemporal Node Attribute Prediction in Dynamic Graphs
标题:DynaSTy:动态图中时空节点属性预测框架
链接:https://arxiv.org/abs/2601.05391

作者:Namrata Banerji,Tanya Berger-Wolf
摘要:动态图上节点级属性的准确多步预测对于从金融信任网络到生物网络的应用都是至关重要的。现有的时空图神经网络通常假设一个静态的邻接矩阵。在这项工作中,我们提出了一个端到端的动态边缘偏置时空模型,摄取一个多维的时间序列的节点属性和时间序列的邻接矩阵,预测多个未来的步骤的节点属性。在每个时间步,我们的基于transformer的模型将给定的邻接作为可适应的注意力偏差注入,使模型能够随着图形的演变而关注相关的邻居。我们进一步部署了一个掩蔽的节点时间预训练目标,该目标使编码器能够重建丢失的特征,并使用预定的采样和水平加权损失进行训练,以减轻长期范围内的复合误差。与以前的工作不同,我们的模型可以适应不同输入样本的动态图,从而能够在多系统环境中进行预测,例如不同主题的大脑网络,不同背景下的金融系统或不断发展的社会系统。实证结果表明,我们的方法始终优于强基线的均方根误差(RMSE)和平均绝对误差(MAE)。
摘要:Accurate multistep forecasting of node-level attributes on dynamic graphs is critical for applications ranging from financial trust networks to biological networks. Existing spatiotemporal graph neural networks typically assume a static adjacency matrix. In this work, we propose an end-to-end dynamic edge-biased spatiotemporal model that ingests a multi-dimensional timeseries of node attributes and a timeseries of adjacency matrices, to predict multiple future steps of node attributes. At each time step, our transformer-based model injects the given adjacency as an adaptable attention bias, allowing the model to focus on relevant neighbors as the graph evolves. We further deploy a masked node-time pretraining objective that primes the encoder to reconstruct missing features, and train with scheduled sampling and a horizon-weighted loss to mitigate compounding error over long horizons. Unlike prior work, our model accommodates dynamic graphs that vary across input samples, enabling forecasting in multi-system settings such as brain networks across different subjects, financial systems in different contexts, or evolving social systems. Empirical results demonstrate that our method consistently outperforms strong baselines on Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE).
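A horizon-weighted loss can be sketched as below. The geometric decay over forecast steps is an assumption for illustration, since the abstract does not specify the weighting scheme:

```python
import numpy as np

def horizon_weighted_mse(pred, target, decay=0.9):
    """MSE over a forecast horizon H, with per-step weights that decay geometrically.
    pred, target: arrays of shape (H, num_nodes)."""
    H = pred.shape[0]
    w = decay ** np.arange(H)        # assumed schedule: later steps weighted less
    w = w / w.sum()                  # normalize so weights sum to 1
    per_step = ((pred - target) ** 2).mean(axis=1)
    return float((w * per_step).sum())

pred = np.zeros((3, 2))
target = np.ones((3, 2))
print(horizon_weighted_mse(pred, target))   # every step has error 1.0, so the loss is 1.0
```

Down-weighting distant steps is one way to keep early, better-conditioned predictions from being swamped by the compounding error the paper aims to mitigate; the opposite (up-weighting the horizon) is also used in practice.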


【5】Manifold limit for the training of shallow graph convolutional neural networks
标题:浅图卷积神经网络训练的流形极限
链接:https://arxiv.org/abs/2601.06025

作者:Johanna Tengler,Christoph Brune,José A. Iglesias
备注:44 pages, 0 figures, 1 table
摘要:我们研究了在流形假设下,浅图卷积神经网络(GCNN)在采样点云的邻近图上训练的离散到连续一致性。图卷积通过图拉普拉斯算子在谱上定义,其低频谱近似于底层光滑流形的Laplace-Beltrami算子的低频谱,并且可能无限宽的浅GCNN是参数空间上的测度空间上的线性泛函。从泛函分析的角度来看,图信号被视为流形上函数的空间离散化,这导致了训练数据在图分辨率上一致的自然概念。为了实现收敛结果,连续统参数空间被选择为单位球的弱紧积,其中Sobolev正则性强加于输出权重和偏置,但不施加于卷积参数。相应的离散参数空间继承了相应的谱衰减,并且还受到适用于图拉普拉斯算子的信息谱窗口的频率截止的限制。在这些假设下,我们证明了正则化经验风险最小化泛函在紧集上的参数测度弱收敛和函数一致收敛的意义下的$Γ$-收敛性和相应的全局极小值的收敛性。这为训练这种网络提供了网格和样本独立性的形式化。
摘要:We study the discrete-to-continuum consistency of the training of shallow graph convolutional neural networks (GCNNs) on proximity graphs of sampled point clouds under a manifold assumption. Graph convolution is defined spectrally via the graph Laplacian, whose low-frequency spectrum approximates that of the Laplace-Beltrami operator of the underlying smooth manifold, and shallow GCNNs of possibly infinite width are linear functionals on the space of measures on the parameter space. From this functional-analytic perspective, graph signals are seen as spatial discretizations of functions on the manifold, which leads to a natural notion of training data consistent across graph resolutions. To enable convergence results, the continuum parameter space is chosen as a weakly compact product of unit balls, with Sobolev regularity imposed on the output weight and bias, but not on the convolutional parameter. The corresponding discrete parameter spaces inherit the corresponding spectral decay, and are additionally restricted by a frequency cutoff adapted to the informative spectral window of the graph Laplacians. Under these assumptions, we prove $Γ$-convergence of regularized empirical risk minimization functionals and corresponding convergence of their global minimizers, in the sense of weak convergence of the parameter measures and uniform convergence of the functions over compact sets. This provides a formalization of mesh and sample independence for the training of such networks.


Transformer(7篇)

【1】LookAroundNet: Extending Temporal Context with Transformers for Clinically Viable EEG Seizure Detection
标题:LookAroundNet:使用Transformers扩展时间上下文,以实现临床可行的脑电癫痫发作检测
链接:https://arxiv.org/abs/2601.06016

作者:Þór Sverrisson,Steinn Guðmundsson
摘要:由于患者、记录条件和临床设置之间癫痫发作动力学的巨大差异,从脑电图(EEG)自动检测癫痫发作仍然很困难。我们介绍了LookAroundNet,一个基于transformer的癫痫发作检测器,它使用更宽的EEG数据时间窗口来建模癫痫发作活动。该检测器在感兴趣的片段之前和之后合并EEG信号,反映临床医生在解释EEG记录时如何使用周围环境。我们在跨越不同临床环境、患者人群和记录方式(包括常规临床EEG和长期动态记录)的多个EEG数据集上评估了所提出的方法,以研究在不同数据分布下的性能。评估包括公开可用的数据集以及大量专有的家庭EEG记录,为受控临床数据和不受约束的家庭监测条件提供了互补视角。我们的结果表明,LookAroundNet在各数据集上实现了强劲的性能,很好地泛化到以前未见过的记录条件,并且计算成本与现实世界的临床部署兼容。结果表明,扩展的时间上下文、增加训练数据的多样性和模型集成是提高性能的关键因素。这项工作有助于将自动癫痫发作检测模型推向临床可行的解决方案。
摘要:Automated seizure detection from electroencephalography (EEG) remains difficult due to the large variability of seizure dynamics across patients, recording conditions, and clinical settings. We introduce LookAroundNet, a transformer-based seizure detector that uses a wider temporal window of EEG data to model seizure activity. The seizure detector incorporates EEG signals before and after the segment of interest, reflecting how clinicians use surrounding context when interpreting EEG recordings. We evaluate the proposed method on multiple EEG datasets spanning diverse clinical environments, patient populations, and recording modalities, including routine clinical EEG and long-term ambulatory recordings, in order to study performance across varying data distributions. The evaluation includes publicly available datasets as well as a large proprietary collection of home EEG recordings, providing complementary views of controlled clinical data and unconstrained home-monitoring conditions. Our results show that LookAroundNet achieves strong performance across datasets, generalizes well to previously unseen recording conditions, and operates with computational costs compatible with real-world clinical deployment. The results indicate that extended temporal context, increased training data diversity, and model ensembling are key factors for improving performance. This work contributes to moving automatic seizure detection models toward clinically viable solutions.
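Extracting a segment of interest together with its surrounding context can be sketched as array slicing with zero-padding at the recording edges. Shapes and the padding policy are illustrative assumptions, not the paper's exact preprocessing:

```python
import numpy as np

def window_with_context(eeg, start, seg_len, context_len):
    """Extract a segment plus `context_len` samples of context on each side,
    zero-padding where the window runs past the recording boundary.
    eeg: array of shape (channels, time)."""
    C, T = eeg.shape
    lo, hi = start - context_len, start + seg_len + context_len
    out = np.zeros((C, hi - lo), dtype=eeg.dtype)
    src_lo, src_hi = max(lo, 0), min(hi, T)
    out[:, src_lo - lo: src_hi - lo] = eeg[:, src_lo:src_hi]
    return out

eeg = np.arange(20, dtype=float).reshape(1, 20)   # toy 1-channel recording
win = window_with_context(eeg, start=2, seg_len=4, context_len=4)
print(win.shape)        # (1, 12): 4 context + 4 segment + 4 context
```

The model then sees the whole padded window while the label still refers only to the central segment, which is the "look around" idea in the title.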


【2】Fusion Matters: Length-Aware Analysis of Positional-Encoding Fusion in Transformers
标题:融合很重要:Transformer中位置编码融合的长度感知分析
链接:https://arxiv.org/abs/2601.05807

作者:Mohamed Amine Hallam,Kuo-Kun Tseng
备注:10 pages, 5 figures. Code and reproduction materials available on GitHub
摘要:Transformers需要位置编码来表示序列顺序,但大多数先前的工作集中在设计新的位置编码,而不是研究位置信息如何与令牌嵌入融合。在本文中,我们研究了融合机制本身是否会影响性能,特别是在长序列设置中。我们进行了一项受控的实证研究,在相同的Transformer架构、数据划分和随机种子下比较三种典型的融合策略:逐元素加法、拼接加投影和标量门控融合。在涵盖短(AG News)、中(IMDB)和长(ArXiv)序列的三个文本分类数据集上的实验表明,融合选择对短文本的影响可以忽略不计,但对长文档产生一致的收益。为了验证这些收益是结构性的而非随机的,我们进行了配对种子分析和跨序列长度区间的跨数据集比较。在ArXiv数据集上的其他实验表明,可学习融合的好处可以推广到多个位置编码家族。最后,我们探索了一种轻量级的卷积门控机制,该机制在融合层面引入局部归纳偏差,仅在长文档上进行评估。我们的研究结果表明,位置编码融合是长序列Transformers中一个非平凡的设计选择,应被视为一个明确的建模决策,而不是一个固定的默认值。
摘要:Transformers require positional encodings to represent sequence order, yet most prior work focuses on designing new positional encodings rather than examining how positional information is fused with token embeddings. In this paper, we study whether the fusion mechanism itself affects performance, particularly in long-sequence settings. We conduct a controlled empirical study comparing three canonical fusion strategies--element-wise addition, concatenation with projection, and scalar gated fusion--under identical Transformer architectures, data splits, and random seeds. Experiments on three text classification datasets spanning short (AG News), medium (IMDB), and long (ArXiv) sequences show that fusion choice has negligible impact on short texts but produces consistent gains on long documents. To verify that these gains are structural rather than stochastic, we perform paired-seed analysis and cross-dataset comparison across sequence-length regimes. Additional experiments on the ArXiv dataset indicate that the benefit of learnable fusion generalizes across multiple positional encoding families. Finally, we explore a lightweight convolutional gating mechanism that introduces local inductive bias at the fusion level, evaluated on long documents only. Our results indicate that positional-encoding fusion is a non-trivial design choice for long-sequence Transformers and should be treated as an explicit modeling decision rather than a fixed default.
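The three fusion strategies compared in the paper can be sketched in a few lines of NumPy. The random matrices stand in for learned parameters, so this only shows the shape and form of each fusion, not trained behavior:

```python
import numpy as np

rng = np.random.default_rng(0)
d, L = 8, 5
tok = rng.normal(size=(L, d))                    # token embeddings
pos = rng.normal(size=(L, d))                    # positional encodings

# (1) element-wise addition (the usual Transformer default)
add = tok + pos

# (2) concatenation followed by a projection back to d dimensions
W = rng.normal(size=(2 * d, d))                  # stand-in for a learned projection
concat = np.concatenate([tok, pos], axis=-1) @ W

# (3) scalar gated fusion: a learnable scalar gate g in (0, 1)
g = 1.0 / (1.0 + np.exp(-0.3))                   # sigmoid of a scalar parameter
gated = g * tok + (1.0 - g) * pos

print(add.shape, concat.shape, gated.shape)      # all three produce (L, d)
```

All three yield the same output shape, which is why they are drop-in alternatives at the embedding layer; they differ only in how much mixing between content and position the model can learn.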


【3】Weights to Code: Extracting Interpretable Algorithms from the Discrete Transformer
标题:权重到代码:从离散Transformer中提取可解释算法
链接:https://arxiv.org/abs/2601.05770

作者:Yifan Zhang,Wei Bi,Kechi Zhang,Dongming Jin,Jie Fu,Zhi Jin
摘要:算法提取旨在直接从针对特定算法任务训练的模型中合成可执行程序,从而实现从头算法发现,而无需依赖人类编写的代码。然而,将此范式扩展到Transformer受到叠加(superposition)的阻碍,其中以重叠方向编码的纠缠特征阻碍了符号表达式的提取。在这项工作中,我们提出了离散Transformer,一种明确设计用于弥合连续表示和离散符号逻辑之间差距的架构。通过实施严格的功能解纠缠(将数值注意力约束于信息路由、数值MLP约束于逐元素算术),并采用温度退火采样,我们的方法有效地促进了人类可读程序的提取。从经验上讲,离散Transformer不仅实现了与基于RNN的基线相当的性能,而且关键地将可解释性扩展到连续变量域。此外,对退火过程的分析表明,高效的离散搜索经历了一个从探索到利用的明确阶段过渡。我们进一步证明,我们的方法通过施加归纳偏置,能够对合成程序进行细粒度控制。总的来说,这些发现确立了离散Transformer作为无需演示的算法发现的强大框架,为Transformer的可解释性提供了一条严格的途径。
摘要:Algorithm extraction aims to synthesize executable programs directly from models trained on specific algorithmic tasks, enabling de novo algorithm discovery without relying on human-written code. However, extending this paradigm to Transformer is hindered by superposition, where entangled features encoded in overlapping directions obstruct the extraction of symbolic expressions. In this work, we propose the Discrete Transformer, an architecture explicitly engineered to bridge the gap between continuous representations and discrete symbolic logic. By enforcing a strict functional disentanglement, which constrains Numerical Attention to information routing and Numerical MLP to element-wise arithmetic, and employing temperature-annealed sampling, our method effectively facilitates the extraction of human-readable programs. Empirically, the Discrete Transformer not only achieves performance comparable to RNN-based baselines but crucially extends interpretability to continuous variable domains. Moreover, our analysis of the annealing process shows that the efficient discrete search undergoes a clear phase transition from exploration to exploitation. We further demonstrate that our method enables fine-grained control over synthesized programs by imposing inductive biases. Collectively, these findings establish the Discrete Transformer as a robust framework for demonstration-free algorithm discovery, offering a rigorous pathway toward Transformer interpretability.
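Temperature-annealed sampling, the mechanism behind the exploration-to-exploitation transition mentioned above, can be sketched as a softmax whose temperature is lowered over time; the logits and schedule here are invented:

```python
import numpy as np

def sample_annealed(logits, temperature, rng):
    """Sample a discrete choice; low temperature concentrates mass on the argmax."""
    z = logits / temperature
    p = np.exp(z - z.max())          # subtract max for numerical stability
    p /= p.sum()
    return rng.choice(len(logits), p=p)

logits = np.array([2.0, 1.0, 0.1])
rng = np.random.default_rng(0)
for t in (2.0, 0.5, 0.05):           # annealing schedule: high -> low temperature
    draws = [sample_annealed(logits, t, rng) for _ in range(200)]
    print(t, draws.count(0) / 200)   # argmax frequency rises as t drops
```

At high temperature the distribution is nearly uniform (exploration); as the temperature anneals toward zero the sampler effectively commits to the argmax (exploitation), which is the phase transition the paper observes.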


【4】ViTNT-FIQA: Training-Free Face Image Quality Assessment with Vision Transformers
标题:ViTNT-FIQA:使用Vision Transformers进行免训练面部图像质量评估
链接:https://arxiv.org/abs/2601.05741

作者:Guray Ozgur,Eduarda Caldeira,Tahar Chettaoui,Jan Niklas Kolf,Marco Huber,Naser Damer,Fadi Boutros
备注:Accepted at WACV Workshops
摘要:人脸图像质量评估(FIQA)是可靠的人脸识别系统的基础。目前的方法主要只利用最终层表示,而免训练的方法需要多次前向传递或反向传播。我们提出了ViTNT-FIQA,一种免训练方法,它测量中间Vision Transformer(ViT)块之间补丁嵌入演化的稳定性。我们证明了高质量的人脸图像在各块之间表现出稳定的特征细化轨迹,而退化的图像显示出不稳定的变换。我们的方法计算来自连续Transformer块的L2归一化补丁嵌入之间的欧几里得距离,并将它们聚合成图像级质量分数。我们在具有受控退化水平的带质量标注的合成数据集上实证验证了这种相关性。与现有的免训练方法不同,ViTNT-FIQA只需要一次前向传递,无需反向传播或架构修改。通过对八个基准(LFW、AgeDB-30、CFP-FP、CALFW、Adience、CPLFW、XQLFW、IJB-C)的广泛评估,我们表明ViTNT-FIQA可以达到与最先进方法相竞争的性能,同时保持计算效率并可立即应用于任何预训练的基于ViT的人脸识别模型。
摘要:Face Image Quality Assessment (FIQA) is essential for reliable face recognition systems. Current approaches primarily exploit only final-layer representations, while training-free methods require multiple forward passes or backpropagation. We propose ViTNT-FIQA, a training-free approach that measures the stability of patch embedding evolution across intermediate Vision Transformer (ViT) blocks. We demonstrate that high-quality face images exhibit stable feature refinement trajectories across blocks, while degraded images show erratic transformations. Our method computes Euclidean distances between L2-normalized patch embeddings from consecutive transformer blocks and aggregates them into image-level quality scores. We empirically validate this correlation on a quality-labeled synthetic dataset with controlled degradation levels. Unlike existing training-free approaches, ViTNT-FIQA requires only a single forward pass without backpropagation or architectural modifications. Through extensive evaluation on eight benchmarks (LFW, AgeDB-30, CFP-FP, CALFW, Adience, CPLFW, XQLFW, IJB-C), we show that ViTNT-FIQA achieves competitive performance with state-of-the-art methods while maintaining computational efficiency and immediate applicability to any pre-trained ViT-based face recognition model.
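The quality score can be sketched directly from its description: L2-normalize patch embeddings per block, measure consecutive-block Euclidean distances, and aggregate. The toy dimensions and the sign convention (stability mapped to a higher score) are assumptions:

```python
import numpy as np

def vitnt_quality(block_embeds):
    """block_embeds: list of (num_patches, dim) patch embeddings, one per ViT block.
    Smaller mean consecutive-block distance = more stable refinement = higher quality."""
    dists = []
    for a, b in zip(block_embeds[:-1], block_embeds[1:]):
        a = a / np.linalg.norm(a, axis=1, keepdims=True)
        b = b / np.linalg.norm(b, axis=1, keepdims=True)
        dists.append(np.linalg.norm(a - b, axis=1).mean())
    return -float(np.mean(dists))    # negate so a larger score means better quality

rng = np.random.default_rng(0)
base = rng.normal(size=(4, 16))      # 4 patches, 16-dim embeddings
stable = [base + 0.01 * rng.normal(size=base.shape) for _ in range(5)]
erratic = [rng.normal(size=base.shape) for _ in range(5)]
print(vitnt_quality(stable) > vitnt_quality(erratic))   # stable trajectory scores higher
```

Because the score only reads intermediate activations, it works on any pre-trained ViT with a single forward pass, which is the efficiency claim in the abstract.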


【5】Tracing Stereotypes in Pre-trained Transformers: From Biased Neurons to Fairer Models
标题:追踪预训练Transformer中的刻板印象:从有偏见的神经元到更公平的模型
链接:https://arxiv.org/abs/2601.05663

作者:Gianmario Voria,Moses Openja,Foutse Khomh,Gemma Catolino,Fabio Palomba
摘要:基于transformer的语言模型的出现重塑了人工智能系统处理和生成文本的方式。在软件工程(SE)中,这些模型现在支持各种活动,加速自动化和决策。然而,有证据表明,这些模式可以复制或放大社会偏见,引起公平问题。最近关于神经元编辑的研究表明,预先训练的Transformers中的内部激活可以被跟踪和修改,以改变模型的行为。基于知识神经元的概念,编码事实信息的神经元,我们假设存在偏见的神经元,这些神经元在预先训练的Transformers中捕获刻板印象。为了检验这个假设,我们建立了一个有偏关系的数据集,即,三元组编码九种偏见类型的刻板印象,并适应神经元归因策略,以跟踪和抑制BERT模型中的偏见神经元。然后,我们评估抑制对SE任务的影响。我们的研究结果表明,有偏见的知识是本地化的小神经元子集,并抑制他们大大减少偏见,最小的性能损失。这表明,Transformers中的偏差可以在神经元级别上跟踪和减轻,为SE中的公平性提供了一种可解释的方法。
摘要:The advent of transformer-based language models has reshaped how AI systems process and generate text. In software engineering (SE), these models now support diverse activities, accelerating automation and decision-making. Yet, evidence shows that these models can reproduce or amplify social biases, raising fairness concerns. Recent work on neuron editing has shown that internal activations in pre-trained transformers can be traced and modified to alter model behavior. Building on the concept of knowledge neurons, neurons that encode factual information, we hypothesize the existence of biased neurons that capture stereotypical associations within pre-trained transformers. To test this hypothesis, we build a dataset of biased relations, i.e., triplets encoding stereotypes across nine bias types, and adapt neuron attribution strategies to trace and suppress biased neurons in BERT models. We then assess the impact of suppression on SE tasks. Our findings show that biased knowledge is localized within small neuron subsets, and suppressing them substantially reduces bias with minimal performance loss. This demonstrates that bias in transformers can be traced and mitigated at the neuron level, offering an interpretable approach to fairness in SE.


【6】Transformer Is Inherently a Causal Learner
标题:Transformer本质上是一个因果学习者
链接:https://arxiv.org/abs/2601.05647

作者:Xinyue Wang,Stephen Wang,Biwei Huang
摘要:我们发现,以自回归方式训练的Transformers在其学习的表征中自然地编码时间延迟的因果结构。当预测多变量时间序列中的未来值时,Transformer输出相对于过去输入的梯度灵敏度直接恢复底层因果图,而没有任何明确的因果目标或结构约束。我们证明了这种连接理论上在标准的可识别性条件下,并开发了一个实用的提取方法,使用聚合梯度属性。在具有挑战性的情况下,如非线性动力学,长期依赖性和非平稳系统,这种方法大大超过了最先进的发现算法的性能,特别是随着数据异质性的增加,表现出扩展潜力,其中因果准确性随着数据量和异质性而提高,这是传统方法所缺乏的。这种统一的观点为未来的范式奠定了基础,在未来的范式中,因果发现通过基础模型的镜头进行操作,基础模型通过因果关系的镜头获得可解释性和增强。
摘要:We reveal that transformers trained in an autoregressive manner naturally encode time-delayed causal structures in their learned representations. When predicting future values in multivariate time series, the gradient sensitivities of transformer outputs with respect to past inputs directly recover the underlying causal graph, without any explicit causal objectives or structural constraints. We prove this connection theoretically under standard identifiability conditions and develop a practical extraction method using aggregated gradient attributions. On challenging cases such as nonlinear dynamics, long-term dependencies, and non-stationary systems, this approach greatly surpasses the performance of state-of-the-art discovery algorithms, especially as data heterogeneity increases, exhibiting scaling potential where causal accuracy improves with data volume and heterogeneity, a property traditional methods lack. This unifying view lays the groundwork for a future paradigm where causal discovery operates through the lens of foundation models, and foundation models gain interpretability and enhancement through the lens of causality.
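The principle that a one-step predictor's input sensitivities recover a time-delayed causal graph can be illustrated on a linear VAR(1) process, where least-squares coefficients play the role of the gradient attributions. This toy setup is far simpler than the paper's transformer setting, and the ground-truth matrix and threshold are invented:

```python
import numpy as np

# Toy VAR(1) process: x_t = A @ x_{t-1} + noise. For a linear one-step predictor,
# the gradient of the output with respect to the past input is exactly A, so
# thresholding the fitted coefficients recovers the time-delayed causal graph.
rng = np.random.default_rng(0)
A = np.array([[0.8, 0.0, 0.0],
              [0.5, 0.7, 0.0],
              [0.0, 0.4, 0.6]])                  # ground-truth causal weights
T = 5000
X = np.zeros((T, 3))
for t in range(1, T):
    X[t] = X[t - 1] @ A.T + 0.1 * rng.normal(size=3)

# Fit the one-step predictor by least squares; X[:-1] @ A_hat ~ X[1:], so A_hat ~ A.T.
A_hat, *_ = np.linalg.lstsq(X[:-1], X[1:], rcond=None)
graph = (np.abs(A_hat.T) > 0.2).astype(int)      # aggregate + threshold sensitivities
print(graph.tolist())                            # matches the sparsity pattern of A
```

The paper's claim is that the same gradient-attribution readout, applied to an autoregressively trained transformer, recovers causal structure in much less restrictive nonlinear and non-stationary settings.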


【7】A universal vision transformer for fast calorimeter simulations
标题:用于快速量热计模拟的通用视觉Transformer
链接:https://arxiv.org/abs/2601.05289

作者:Luigi Favaro,Andrea Giammanco,Claudius Krause
备注:37 pages, 15 figures, 8 tables
摘要:探测器的高维复杂性使得快速量热计模拟成为现代生成式机器学习的主要应用。Vision Transformers(ViTs)可以以无与伦比的精度模拟Geant4响应,并且不限于规则的几何形状。从CaloDREAM架构开始,我们展示了ViTs在规则和不规则几何形状以及多个检测器上的鲁棒性和可扩展性。我们的研究结果表明,在多个评估指标上,ViTs产生的电磁和强子簇射在统计上与Geant4不可区分,同时在单个GPU上将生成时间保持在$\mathcal{O}(10-100)$ ms。此外,我们表明,在大型数据集上进行预训练并对目标几何形状进行微调可以降低训练成本、提高数据效率,或全面提高生成簇射的保真度。
摘要:The high-dimensional complex nature of detectors makes fast calorimeter simulations a prime application for modern generative machine learning. Vision transformers (ViTs) can emulate the Geant4 response with unmatched accuracy and are not limited to regular geometries. Starting from the CaloDREAM architecture, we demonstrate the robustness and scalability of ViTs on regular and irregular geometries, and multiple detectors. Our results show that ViTs generate electromagnetic and hadronic showers statistically indistinguishable from Geant4 in multiple evaluation metrics, while maintaining the generation time in the $\mathcal{O}(10-100)$ ms on a single GPU. Furthermore, we show that pretraining on a large dataset and fine-tuning on the target geometry leads to reduced training costs and higher data efficiency, or altogether improves the fidelity of generated showers.


GAN|对抗|攻击|生成相关(4篇)

【1】SceneFoundry: Generating Interactive Infinite 3D Worlds
标题:SceneFoundry:生成交互式无限3D世界
链接:https://arxiv.org/abs/2601.05810

作者:ChunTeng Chen,YiChen Hsu,YiWen Liu,WeiFang Sun,TsaiChing Ni,ChunYi Lee,Min Sun,YuanFu Yang
备注:15 pages
摘要:自动生成大规模、交互式且物理逼真的3D环境的能力,对于推进机器人学习和具身智能至关重要。然而,现有的生成方法往往无法捕捉真实世界室内环境的功能复杂性,特别是那些包含铰接物体的场景,其可移动部件对操纵和导航必不可少。本文介绍SceneFoundry,一个语言引导的扩散框架,可生成具有功能性铰接家具和语义多样布局的公寓级3D世界,用于机器人训练。给定自然语言提示,LLM模块控制楼层布局的生成,而基于扩散的后验采样则利用大规模3D资产库中的铰接资产高效填充场景。为确保物理可用性,SceneFoundry采用可微分引导函数来调节物体数量、防止铰接碰撞,并为机器人导航保留足够的可行走空间。大量实验表明,我们的框架能在不同场景类型和条件下生成结构有效、语义一致且功能可交互的环境,从而支持可扩展的具身AI研究。
摘要:The ability to automatically generate large-scale, interactive, and physically realistic 3D environments is crucial for advancing robotic learning and embodied intelligence. However, existing generative approaches often fail to capture the functional complexity of real-world interiors, particularly those containing articulated objects with movable parts essential for manipulation and navigation. This paper presents SceneFoundry, a language-guided diffusion framework that generates apartment-scale 3D worlds with functionally articulated furniture and semantically diverse layouts for robotic training. From natural language prompts, an LLM module controls floor layout generation, while diffusion-based posterior sampling efficiently populates the scene with articulated assets from large-scale 3D repositories. To ensure physical usability, SceneFoundry employs differentiable guidance functions to regulate object quantity, prevent articulation collisions, and maintain sufficient walkable space for robotic navigation. Extensive experiments demonstrate that our framework generates structurally valid, semantically coherent, and functionally interactive environments across diverse scene types and conditions, enabling scalable embodied AI research.


【2】AGDC: Autoregressive Generation of Variable-Length Sequences with Joint Discrete and Continuous Spaces
标题:AGDC:具有联合离散和连续空间的变长序列的自回归生成
链接:https://arxiv.org/abs/2601.05680

作者:Yeonsang Shin,Insoo Kim,Bongkeun Kim,Keonwoo Bae,Bohyung Han
摘要:基于transformer的自回归模型在数据生成方面表现出色,但本质上受到其对离散化令牌的依赖的限制,这限制了其以高精度表示连续值的能力。我们分析了现有的基于离散化的方法生成混合离散-连续序列的可扩展性限制,特别是在高精度领域,如半导体电路设计,精度损失可能导致功能故障。为了应对这一挑战,我们提出了AGDC,这是一个新的统一框架,它联合建模可变长度序列的离散和连续值。AGDC采用了一种混合方法,将离散值的分类预测与连续值的基于扩散的建模相结合,包含两个关键技术组件:序列结束(EOS)logit调整机制,该机制使用MLP基于序列上下文动态调整EOS令牌logits,以及集成到损失函数中的长度正则化项。此外,我们还介绍了ContLayNet,这是一个大规模的基准测试,包括334 K高精度半导体布局样本,具有专门的评估指标,可以捕获精度误差显著影响性能的功能正确性。半导体布局(ContLayNet),图形布局和SVG上的实验表明,与基于离散化和固定模式的基线相比,AGDC在生成高保真混合矢量表示方面具有卓越的性能,实现了跨不同领域的可扩展高精度生成。
摘要:Transformer-based autoregressive models excel in data generation but are inherently constrained by their reliance on discretized tokens, which limits their ability to represent continuous values with high precision. We analyze the scalability limitations of existing discretization-based approaches for generating hybrid discrete-continuous sequences, particularly in high-precision domains such as semiconductor circuit designs, where precision loss can lead to functional failure. To address the challenge, we propose AGDC, a novel unified framework that jointly models discrete and continuous values for variable-length sequences. AGDC employs a hybrid approach that combines categorical prediction for discrete values with diffusion-based modeling for continuous values, incorporating two key technical components: an end-of-sequence (EOS) logit adjustment mechanism that uses an MLP to dynamically adjust EOS token logits based on sequence context, and a length regularization term integrated into the loss function. Additionally, we present ContLayNet, a large-scale benchmark comprising 334K high-precision semiconductor layout samples with specialized evaluation metrics that capture functional correctness where precision errors significantly impact performance. Experiments on semiconductor layouts (ContLayNet), graphic layouts, and SVGs demonstrate AGDC's superior performance in generating high-fidelity hybrid vector representations compared to discretization-based and fixed-schema baselines, achieving scalable high-precision generation across diverse domains.


【3】RingSQL: Generating Synthetic Data with Schema-Independent Templates for Text-to-SQL Reasoning Models
标题:RingSQL:利用模式无关模板为文本到SQL推理模型生成合成数据
链接:https://arxiv.org/abs/2601.05451

作者:Marko Sterbentz,Kevin Cushing,Cameron Barrie,Kristian J. Hammond
摘要:文本到SQL系统的最新进展是由更大的模型和改进的数据集驱动的,但进展仍然受到高质量训练数据稀缺的限制。手动创建数据是昂贵的,现有的合成方法权衡了可靠性和可扩展性。基于模板的方法可确保正确的SQL,但需要特定于模式的模板,而基于LLM的生成很容易扩展,但缺乏质量和正确性保证。我们介绍RingSQL,一个混合数据生成框架,它结合了模式无关的查询模板与基于LLM的自然语言问题的释义。这种方法在提供广泛的语言多样性的同时,在不同的模式中保持SQL的正确性。在我们的实验中,我们发现使用RingSQL生成的数据训练的模型与使用其他合成数据训练的模型相比,在六个文本到SQL基准测试中的准确性平均提高了+2.3%。我们在https://github.com/nu-c3lab/RingSQL上提供我们的代码。
摘要:Recent advances in text-to-SQL systems have been driven by larger models and improved datasets, yet progress is still limited by the scarcity of high-quality training data. Manual data creation is expensive, and existing synthetic methods trade off reliability and scalability. Template-based approaches ensure correct SQL but require schema-specific templates, while LLM-based generation scales easily but lacks quality and correctness guarantees. We introduce RingSQL, a hybrid data generation framework that combines schema-independent query templates with LLM-based paraphrasing of natural language questions. This approach preserves SQL correctness across diverse schemas while providing broad linguistic variety. In our experiments, we find that models trained using data produced by RingSQL achieve an average gain in accuracy of +2.3% across six text-to-SQL benchmarks when compared to models trained on other synthetic data. We make our code available at https://github.com/nu-c3lab/RingSQL.
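作为示意,下面用一个与具体模式无关的查询模板在两个不同数据库模式上实例化出合法SQL,体现"模板保证SQL正确性、跨模式复用"的思路。模板语法与模式字典格式均为本示例的假设,并非RingSQL的实际规范;自然语言问题的LLM复述环节此处省略。

```python
# 示意:一个与模式无关的查询模板在两个不同数据库模式上实例化。
# 模板语法与模式字典格式均为本示例的假设,并非RingSQL的实际规范。

TEMPLATE = "SELECT {col} FROM {table} WHERE {filter_col} > {value}"

def instantiate(template, schema, value):
    """用任意模式中的具体表名/列名填充模式无关模板,SQL结构天然正确。"""
    numeric = schema["numeric_columns"]
    return template.format(col=numeric[0], table=schema["table"],
                           filter_col=numeric[1], value=value)

schema_a = {"table": "employees", "numeric_columns": ["salary", "age"]}
schema_b = {"table": "products", "numeric_columns": ["price", "stock"]}

print(instantiate(TEMPLATE, schema_a, 30))  # SELECT salary FROM employees WHERE age > 30
print(instantiate(TEMPLATE, schema_b, 10))  # SELECT price FROM products WHERE stock > 10
```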


【4】Generalizable Blood Pressure Estimation from Multi-Wavelength PPG Using Curriculum-Adversarial Learning
标题:使用课程-对抗学习从多波长PPG中进行可推广的血压估计
链接:https://arxiv.org/abs/2509.12518

作者:Zequan Liang,Ruoyu Zhang,Wei Shao,Mahdi Pirayesh Shirazi Nejad,Ehsan Kourkchi,Setareh Rafatirad,Houman Homayoun
备注:In the proceedings of IEEE-EMBS International Conference on Body Sensor Networks 2025
摘要:准确和可推广的血压(BP)估计对于心血管疾病的早期发现和管理至关重要。在这项研究中,我们在公共多波长光电容积描记(PPG)数据集上执行受试者级别的数据分割,并提出了一个基于课程-对抗学习的可推广BP估计框架。我们的方法将课程学习(从高血压分类过渡到BP回归)与域对抗训练(混淆受试者身份以鼓励学习受试者不变特征)相结合。实验表明,多通道融合的性能始终优于单通道模型。在四波长PPG数据集上,我们的方法在严格的受试者级别分割下实现了强大的性能,收缩压(SBP)的平均绝对误差(MAE)为14.2mmHg,舒张压(DBP)为6.4mmHg。此外,消融研究验证了课程和对抗两部分的有效性。这些结果突出了利用多波长PPG中的互补信息和课程-对抗策略进行准确、鲁棒BP估计的潜力。
摘要:Accurate and generalizable blood pressure (BP) estimation is vital for the early detection and management of cardiovascular diseases. In this study, we enforce subject-level data splitting on a public multi-wavelength photoplethysmography (PPG) dataset and propose a generalizable BP estimation framework based on curriculum-adversarial learning. Our approach combines curriculum learning, which transitions from hypertension classification to BP regression, with domain-adversarial training that confuses subject identity to encourage the learning of subject-invariant features. Experiments show that multi-channel fusion consistently outperforms single-channel models. On the four-wavelength PPG dataset, our method achieves strong performance under strict subject-level splitting, with mean absolute errors (MAE) of 14.2mmHg for systolic blood pressure (SBP) and 6.4mmHg for diastolic blood pressure (DBP). Additionally, ablation studies validate the effectiveness of both the curriculum and adversarial components. These results highlight the potential of leveraging complementary information in multi-wavelength PPG and curriculum-adversarial strategies for accurate and robust BP estimation.
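域对抗训练的核心是对受试者身份损失做梯度反转:特征在任务损失上做梯度下降、在身份损失上做梯度上升。下面是一个标量玩具示意(参数与梯度数值均为假设),仅用于说明更新方向,并非论文实现。

```python
# 标量玩具示意:在任务损失上梯度下降、在受试者身份损失上梯度上升
# (即梯度反转),数值均为假设。

def adversarial_update(w, grad_task, grad_subject, lam=1.0, lr=0.1):
    """domain-adversarial更新:w <- w - lr * (grad_task - lam * grad_subject)"""
    return w - lr * (grad_task - lam * grad_subject)

w = adversarial_update(0.5, grad_task=0.2, grad_subject=0.3)
print(round(w, 3))  # 0.51:身份损失的梯度被反转后,推动特征趋向受试者不变
```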


半/弱/无/有监督|不确定性|主动学习(4篇)

【1】AWaRe-SAC: Proactive Slice Admission Control under Weather-Induced Capacity Uncertainty
标题:AWaRe-SAC:天气引发的容量不确定性下的主动切片准入控制
链接:https://arxiv.org/abs/2601.05978

作者:Dror Jacoby,Yanzhi Li,Shuyue Yu,Nicola Di Cicco,Hagit Messer,Gil Zussman,Igor Kadota
摘要:随着新兴应用需要更高的吞吐量和更低的时延,运营商越来越多地在x-haul传输网络内部署毫米波(mmWave)链路,跨越前传、中传和回传段。然而,毫米波频率对天气相关衰减的固有敏感性,特别是雨衰,使严格的服务质量(QoS)要求难以维持。这就产生了一个关键挑战:在未来网络容量不确定的情况下做出接纳决定。为了解决这个问题,我们为受降雨波动影响的毫米波x-haul网络开发了一个主动切片准入控制框架。我们的目标是提高网络性能、确保QoS并优化收入,从而超越标准反应式方法的局限性。该框架集成了未来网络条件的深度学习预测器和基于主动Q学习的切片准入控制机制。我们使用来自密集城区的毫米波x-haul部署的真实数据验证了我们的解决方案,并结合了链路容量衰减和动态切片需求的现实模型。广泛的评估表明,我们的主动解决方案在动态链路条件下实现了2-3倍的长期平均收入,为自适应准入控制提供了一个可扩展且有弹性的框架。
摘要:As emerging applications demand higher throughput and lower latencies, operators are increasingly deploying millimeter-wave (mmWave) links within x-haul transport networks, spanning fronthaul, midhaul, and backhaul segments. However, the inherent susceptibility of mmWave frequencies to weather-related attenuation, particularly rain fading, complicates the maintenance of stringent Quality of Service (QoS) requirements. This creates a critical challenge: making admission decisions under uncertainty regarding future network capacity. To address this, we develop a proactive slice admission control framework for mmWave x-haul networks subject to rain-induced fluctuations. Our objective is to improve network performance, ensure QoS, and optimize revenue, thereby surpassing the limitations of standard reactive approaches. The proposed framework integrates a deep learning predictor of future network conditions with a proactive Q-learning-based slice admission control mechanism. We validate our solution using real-world data from a mmWave x-haul deployment in a dense urban area, incorporating realistic models of link capacity attenuation and dynamic slice demands. Extensive evaluations demonstrate that our proactive solution achieves 2-3x higher long-term average revenue under dynamic link conditions, providing a scalable and resilient framework for adaptive admission control.
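作为示意,下面用一个玩具表格式Q学习刻画"容量约束下的切片准入"决策:有剩余容量时接纳获得收益,容量耗尽时接纳招致QoS惩罚。环境动态、奖励数值与超参数均为示例假设,与论文的深度学习预测器和实际系统无关。

```python
import random

def train(episodes=500, cap=3, seed=1):
    rng = random.Random(seed)
    Q = {(c, a): 0.0 for c in range(cap + 1) for a in (0, 1)}  # a=1: 接纳
    alpha, gamma = 0.2, 0.9
    for _ in range(episodes):
        c = cap  # 每回合开始时的剩余容量
        for _ in range(10):
            # epsilon-贪心:以0.2的概率随机探索
            if rng.random() < 0.2:
                a = rng.choice((0, 1))
            else:
                a = max((0, 1), key=lambda act: Q[(c, act)])
            if a == 1 and c > 0:
                reward, c2 = 1.0, c - 1      # 服务一个切片获得收益
            elif a == 1:
                reward, c2 = -2.0, c         # 容量耗尽仍接纳:QoS惩罚
            else:
                reward, c2 = 0.0, c
            Q[(c, a)] += alpha * (reward + gamma * max(Q[(c2, 0)], Q[(c2, 1)]) - Q[(c, a)])
            c = c2
    return Q

Q = train()
# 有剩余容量时学到接纳,容量耗尽时学到拒绝
print(Q[(3, 1)] > Q[(3, 0)])  # True
```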


【2】Learn to Evolve: Self-supervised Neural JKO Operator for Wasserstein Gradient Flow
标题:学习进化:用于Wasserstein梯度流的自监督神经JKO算子
链接:https://arxiv.org/abs/2601.05583

作者:Xue Feng,Li Wang,Deanna Needell,Rongjie Lai
摘要:Jordan-Kinderlehrer-Otto(JKO)格式为计算Wasserstein梯度流提供了一个稳定的变分框架,但其实际应用往往受限于重复求解JKO子问题的高计算成本。我们提出了一种自监督方法来学习JKO解算子,而无需对任何JKO轨迹进行数值求解。学习到的算子将输入密度直接映射到相应JKO子问题的极小值点,并可迭代应用以高效生成梯度流演化。一个关键挑战是,通常只有若干初始密度可用于训练。为此,我们引入了一种"学习进化"(Learn-to-Evolve)算法,通过在轨迹生成和算子更新之间交替,联合学习JKO算子及其诱导的轨迹。随着训练的进行,生成的数据越来越接近真实的JKO轨迹。同时,这种学习进化策略起到了一种天然的数据增强作用,显著提升了所学算子的泛化能力。数值实验证明了所提方法在各种能量与初始条件选择下的准确性、稳定性和鲁棒性。
摘要:The Jordan-Kinderlehrer-Otto (JKO) scheme provides a stable variational framework for computing Wasserstein gradient flows, but its practical use is often limited by the high computational cost of repeatedly solving the JKO subproblems. We propose a self-supervised approach for learning a JKO solution operator without requiring numerical solutions of any JKO trajectories. The learned operator maps an input density directly to the minimizer of the corresponding JKO subproblem, and can be iteratively applied to efficiently generate the gradient-flow evolution. A key challenge is that only a number of initial densities are typically available for training. To address this, we introduce a Learn-to-Evolve algorithm that jointly learns the JKO operator and its induced trajectories by alternating between trajectory generation and operator updates. As training progresses, the generated data increasingly approximates true JKO trajectories. Meanwhile, this Learn-to-Evolve strategy serves as a natural form of data augmentation, significantly enhancing the generalization ability of the learned operator. Numerical experiments demonstrate the accuracy, stability, and robustness of the proposed method across various choices of energies and initial conditions.
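为便于对照,JKO格式中第 $k$ 步需要求解的变分子问题可写成如下标准形式(这是文献中的通用表述;其中 $\tau$ 为时间步长,$E$ 为能量泛函,$W_2$ 为2-Wasserstein距离):

```latex
\rho^{k+1} \;=\; \operatorname*{arg\,min}_{\rho}\;
\frac{1}{2\tau}\, W_2^2\!\left(\rho, \rho^{k}\right) + E(\rho)
```

论文所学的算子即近似映射 $\rho^{k}\mapsto\rho^{k+1}$;迭代应用该算子即可生成梯度流的演化。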


【3】Imitation Learning for Combinatorial Optimisation under Uncertainty
标题:不确定性下组合优化的模仿学习
链接:https://arxiv.org/abs/2601.05383

作者:Prakash Gawas,Antoine Legrain,Louis-Martin Rousseau
摘要:模仿学习(IL)提供了一个数据驱动的框架,用于近似求解被表述为顺序决策问题(SDP)的大规模组合优化问题的策略,在这类问题中精确求解方法在计算上不可行。在这种情形下,IL的一个核心但尚未充分探索的方面是生成训练示范的专家的角色。现有研究采用了多种多样的专家构造,但缺乏一个统一框架来刻画其建模假设、计算特性以及对学习性能的影响。   本文针对不确定性下组合优化中的IL,提出了一个系统的专家分类法。专家沿三个维度分类:(i)对不确定性的处理方式,包括短视、确定性、全信息、两阶段随机和多阶段随机形式;(ii)最优性水平,区分任务最优专家与近似专家;(iii)与学习者的交互模式,从一次性监督到迭代式交互方案。在此基础上,我们提出了一个广义的数据集聚合(DAgger)算法,支持多专家查询、专家聚合和灵活的交互策略。   所提出的框架在具有随机到达和容量约束的动态医生-病人分配问题上进行了评估。计算实验比较了不同专家类型和交互机制下的学习结果。结果表明,从随机型专家学习的策略始终优于从确定性或全信息专家学习的策略,而交互式学习能以更少的专家示范提升解的质量。当随机优化在计算上变得困难时,聚合确定性专家提供了一种有效的替代方案。
摘要:Imitation learning (IL) provides a data-driven framework for approximating policies for large-scale combinatorial optimisation problems formulated as sequential decision problems (SDPs), where exact solution methods are computationally intractable. A central but underexplored aspect of IL in this context is the role of the \emph{expert} that generates training demonstrations. Existing studies employ a wide range of expert constructions, yet lack a unifying framework to characterise their modelling assumptions, computational properties, and impact on learning performance.   This paper introduces a systematic taxonomy of experts for IL in combinatorial optimisation under uncertainty. Experts are classified along three dimensions: (i) their treatment of uncertainty, including myopic, deterministic, full-information, two-stage stochastic, and multi-stage stochastic formulations; (ii) their level of optimality, distinguishing task-optimal and approximate experts; and (iii) their interaction mode with the learner, ranging from one-shot supervision to iterative, interactive schemes. Building on this taxonomy, we propose a generalised Dataset Aggregation (DAgger) algorithm that supports multiple expert queries, expert aggregation, and flexible interaction strategies.   The proposed framework is evaluated on a dynamic physician-to-patient assignment problem with stochastic arrivals and capacity constraints. Computational experiments compare learning outcomes across expert types and interaction regimes. The results show that policies learned from stochastic experts consistently outperform those learned from deterministic or full-information experts, while interactive learning improves solution quality using fewer expert demonstrations. Aggregated deterministic experts provide an effective alternative when stochastic optimisation becomes computationally challenging.
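下面是一个数据集聚合(DAgger)循环的玩具示意:学习者执行动作产生状态分布,专家对访问到的状态重新标注,标注数据不断聚合。其中的专家、1-近邻学习者与负载环境均为示例假设,仅展示交互式模仿学习的骨架,并非论文的广义DAgger实现。

```python
import random

def expert_action(state):
    """桩专家:把到达的病人分配给负载最轻的槽位。"""
    return min(range(len(state)), key=lambda i: state[i])

def learner_action(dataset, state):
    """用聚合数据集做1-近邻模仿的桩学习者。"""
    if not dataset:
        return 0
    s, a = min(dataset, key=lambda sa: sum((x - y) ** 2 for x, y in zip(sa[0], state)))
    return a

def dagger(n_iters=3, horizon=5, seed=0):
    rng = random.Random(seed)
    dataset = []
    for _ in range(n_iters):
        state = [0, 0, 0]
        for _ in range(horizon):
            # 学习者执行动作;专家对访问到的状态重新标注并聚合
            a_learner = learner_action(dataset, state)
            dataset.append((tuple(state), expert_action(state)))
            state[a_learner] += rng.randint(1, 3)  # 随机到达的负载
    return dataset

data = dagger()
print(len(data))  # 15:3轮迭代 x 每轮5步,全部聚合进训练集
```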


【4】A Critical Examination of Active Learning Workflows in Materials Science
标题:材料科学主动学习工作流程的批判性审查
链接:https://arxiv.org/abs/2601.05946

作者:Akhil S. Nair,Lucas Foppa
摘要:主动学习(AL)在材料科学中发挥着关键作用,支撑诸如构建用于原子模拟的机器学习原子间势、运行自驱动实验室等应用。尽管应用广泛,AL工作流程的可靠性和有效性取决于很少被系统检验的隐式设计假设。在此,我们批判性地评估了部署于材料科学中的AL工作流程,并研究替代模型、采样策略、不确定性量化和评估指标等关键设计选择与其性能之间的关系。通过识别常见陷阱并讨论实用的缓解策略,我们为从业者高效设计、评估和解读材料科学中的AL工作流程提供指导。
摘要:Active learning (AL) plays a critical role in materials science, enabling applications such as the construction of machine-learning interatomic potentials for atomistic simulations and the operation of self-driving laboratories. Despite its widespread use, the reliability and effectiveness of AL workflows depend on implicit design assumptions that are rarely examined systematically. Here, we critically assess AL workflows deployed in materials science and investigate how key design choices, such as surrogate models, sampling strategies, uncertainty quantification and evaluation metrics, relate to their performance. By identifying common pitfalls and discussing practical mitigation strategies, we provide guidance to practitioners for the efficient design, assessment, and interpretation of AL workflows in materials science.


迁移|Zero/Few/One-Shot|自适应(3篇)

【1】Continual Learning of Achieving Forgetting-free and Positive Knowledge Transfer
标题:实现无遗忘与正向知识迁移的持续学习
链接:https://arxiv.org/abs/2601.05623

作者:Zhi Wang,Zhongbin Wu,Yanni Li,Bing Liu,Guangxi Li,Yuping Wang
摘要:现有的关于任务序列持续学习(CL)的研究主要集中在处理灾难性遗忘(CF),以平衡新任务的学习可塑性和旧任务的记忆稳定性。然而,理想的CL智能体不仅应能克服CF,还应促进正向和反向的知识迁移(KT),即将从先前任务中学到的知识用于新任务的学习(即FKT),并利用新任务的知识改进先前任务的性能(即BKT)。为此,本文首先将CL建模为一个优化问题:每个顺序学习任务的目标是在FKT和BKT均为正的约束下实现其最优性能。随后提出了一种新的增强型任务持续学习(ETCL)方法,实现无遗忘且正向的KT。此外,还从理论上估计了可能导致负FKT和BKT的界限。基于这些界限,又提出了一种新的在线任务相似性检测策略,以促进正向KT。为了克服CF,ETCL学习一组任务特定的二进制掩码,为每个任务隔离出一个稀疏子网络,同时保留该任务密集网络的性能。在新任务学习开始时,ETCL尝试将新任务的梯度与先前最相似任务子网络的梯度对齐,以确保正FKT。通过一种新的双目标优化策略和正交梯度投影方法,ETCL仅更新分类层中先前相似任务的权重,以实现正BKT。广泛的评估表明,所提出的ETCL在不相似、相似和混合任务序列上均明显优于强基线。
摘要:Existing research on continual learning (CL) of a sequence of tasks focuses mainly on dealing with catastrophic forgetting (CF) to balance the learning plasticity of new tasks and the memory stability of old tasks. However, an ideal CL agent should not only be able to overcome CF, but also encourage positive forward and backward knowledge transfer (KT), i.e., using the learned knowledge from previous tasks for the new task learning (namely FKT), and improving the previous tasks' performance with the knowledge of the new task (namely BKT). To this end, this paper first models CL as an optimization problem in which each sequential learning task aims to achieve its optimal performance under the constraint that both FKT and BKT should be positive. It then proposes a novel Enhanced Task Continual Learning (ETCL) method, which achieves forgetting-free and positive KT. Furthermore, the bounds that can lead to negative FKT and BKT are estimated theoretically. Based on the bounds, a new strategy for online task similarity detection is also proposed to facilitate positive KT. To overcome CF, ETCL learns a set of task-specific binary masks to isolate a sparse sub-network for each task while preserving the performance of a dense network for the task. At the beginning of a new task learning, ETCL tries to align the new task's gradient with that of the sub-network of the previous most similar task to ensure positive FKT. By using a new bi-objective optimization strategy and an orthogonal gradient projection method, ETCL updates only the weights of previous similar tasks at the classification layer to achieve positive BKT. Extensive evaluations demonstrate that the proposed ETCL markedly outperforms strong baselines on dissimilar, similar, and mixed task sequences.
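任务特定二进制掩码隔离子网络的机制可以用一个极简示意说明:前向传播时仅有掩码为1的权重参与计算,从而为每个任务固定一个稀疏子网络。以下权重与掩码均为玩具数值假设,并非ETCL的实现。

```python
# 玩具示意:二进制掩码隔离稀疏子网络,前向传播只用掩码为1的权重。

def masked_forward(weights, mask, x):
    """只有 mask[i] == 1 的权重参与计算,等价于任务专属的稀疏子网络。"""
    return sum(w * m * x for w, m in zip(weights, mask))

weights = [0.5, -0.2, 0.8, 0.1]
task_mask = [1, 0, 1, 0]  # 该任务只占用四个权重中的两个
print(round(masked_forward(weights, task_mask, 2.0), 2))  # 2.6
```

每个任务持有自己的掩码,更新互不干扰的子网络,即可在不覆盖旧权重的前提下避免遗忘。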


【2】ART: Adaptive Reasoning Trees for Explainable Claim Verification
标题:ART:用于可解释声明验证的自适应推理树
链接:https://arxiv.org/abs/2601.05455

作者:Sahil Wadhwa,Himanshu Kumar,Guanqun Yang,Abbaas Alif Mohamed Nishar,Pranab Mohanty,Swapnil Shinde,Yue Wu
摘要:大型语言模型(LLM)是复杂决策的有力候选者,它们利用了海量的编码知识和出色的零样本(zero-shot)能力。然而,其在高风险环境中的采用受到不透明性的阻碍:输出缺乏可信的解释,也无法被有效质疑以纠正错误,从而削弱了可信度。在本文中,我们提出了ART(自适应推理树),一种用于声明验证的分层方法。该过程从一个根声明开始,分支为支持性和攻击性的子论证。一个论证的强度自底向上通过其子论证的成对锦标赛确定,由法官LLM裁决,从而可以系统地得出最终的、透明且可质疑的裁决,而这在思维链(CoT)等方法中是缺失的。我们在多个数据集上对ART进行了实证验证,分析了不同的论证生成器和比较策略。结果表明,ART的结构化推理优于强基线,为可解释的声明验证建立了新的基准,它更加可靠,并确保整个决策步骤的清晰度。
摘要:Large Language Models (LLMs) are powerful candidates for complex decision-making, leveraging vast encoded knowledge and remarkable zero-shot abilities. However, their adoption in high-stakes environments is hindered by their opacity; their outputs lack faithful explanations and cannot be effectively contested to correct errors, undermining trustworthiness. In this paper, we propose ART (Adaptive Reasoning Trees), a hierarchical method for claim verification. The process begins with a root claim, which branches into supporting and attacking child arguments. An argument's strength is determined bottom-up via a pairwise tournament of its children, adjudicated by a judge LLM, allowing a final, transparent and contestable verdict to be systematically derived which is missing in methods like Chain-of-Thought (CoT). We empirically validate ART on multiple datasets, analyzing different argument generators and comparison strategies. Our findings show that ART's structured reasoning outperforms strong baselines, establishing a new benchmark for explainable claim verification which is more reliable and ensures clarity in the overall decision making step.
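ART自底向上的成对锦标赛裁决可以用如下骨架示意:叶子论证带有强度,父节点的强度由其子论证两两对决的胜者决定。这里的judge是一个确定性桩函数,代替论文中的法官LLM;数据结构、字段名与裁决规则均为示例假设。

```python
# 示意:自底向上的成对锦标赛裁决;judge 是代替法官LLM的确定性桩函数。

def judge(arg_a, arg_b):
    """桩法官:偏好标注强度更高的论证。"""
    return arg_a if arg_a["strength"] >= arg_b["strength"] else arg_b

def verdict(node):
    """自底向上:节点强度由其子论证解析后的成对锦标赛胜者决定。"""
    if not node.get("children"):
        return node
    resolved = [verdict(c) for c in node["children"]]
    winner = resolved[0]
    for challenger in resolved[1:]:
        winner = judge(winner, challenger)
    node["strength"] = winner["strength"]
    node["stance"] = winner["stance"]
    return node

claim = {
    "text": "root claim",
    "children": [
        {"text": "support", "stance": "support", "strength": 0.7},
        {"text": "attack", "stance": "attack", "strength": 0.4},
    ],
}
print(verdict(claim)["stance"])  # support
```

由于每一步对决都可检视,最终裁决沿树可追溯,这正是摘要所说"透明且可质疑"的来源。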


【3】Rapid Adaptation of SpO2 Estimation to Wearable Devices via Transfer Learning on Low-Sampling-Rate PPG
标题:通过低采样率PPG上的迁移学习使SpO2估计快速适配可穿戴设备
链接:https://arxiv.org/abs/2509.12515

作者:Zequan Liang,Ruoyu Zhang,Wei Shao,krishna Karthik,Ehsan Kourkchi,Setareh Rafatirad,Houman Homayoun
备注:In the proceedings of IEEE-EMBS International Conference on Body Sensor Networks 2025
摘要:血氧饱和度(SpO2)是医疗监护的重要指标。传统的SpO2估计方法通常依赖于复杂的临床校准,因此不适合低功耗、可穿戴应用。在本文中,我们提出了一种基于迁移学习的框架,利用低采样率(25Hz)双通道光电容积脉搏波(PPG),使SpO2估计快速适配节能可穿戴设备。我们首先在公共临床数据集上预训练带自注意力的双向长短期记忆(BiLSTM)模型,然后使用从我们的可穿戴We-Be腕带和FDA批准的参考脉搏血氧仪收集的数据对其进行微调。实验结果表明,我们的方法在公共数据集上实现了2.967%的平均绝对误差(MAE),在私有数据集上实现了2.624%的MAE,显著优于传统校准和未迁移的机器学习基线。此外,与100Hz相比,使用25Hz PPG可将功耗降低40%(不含基线功耗)。我们的方法在瞬时SpO2预测中也达到了3.284%的MAE,有效地捕获了快速波动。这些结果表明,无需临床校准即可在可穿戴设备上快速适配准确的低功耗SpO2监测。
摘要:Blood oxygen saturation (SpO2) is a vital marker for healthcare monitoring. Traditional SpO2 estimation methods often rely on complex clinical calibration, making them unsuitable for low-power, wearable applications. In this paper, we propose a transfer learning-based framework for the rapid adaptation of SpO2 estimation to energy-efficient wearable devices using low-sampling-rate (25Hz) dual-channel photoplethysmography (PPG). We first pretrain a bidirectional Long Short-Term Memory (BiLSTM) model with self-attention on a public clinical dataset, then fine-tune it using data collected from our wearable We-Be band and an FDA-approved reference pulse oximeter. Experimental results show that our approach achieves a mean absolute error (MAE) of 2.967% on the public dataset and 2.624% on the private dataset, significantly outperforming traditional calibration and non-transferred machine learning baselines. Moreover, using 25Hz PPG reduces power consumption by 40% compared to 100Hz, excluding baseline draw. Our method also attains an MAE of 3.284% in instantaneous SpO2 prediction, effectively capturing rapid fluctuations. These results demonstrate the rapid adaptation of accurate, low-power SpO2 monitoring on wearable devices without the need for clinical calibration.


强化学习(4篇)

【1】MaxCode: A Max-Reward Reinforcement Learning Framework for Automated Code Optimization
标题:MaxCode:用于自动代码优化的最大奖励强化学习框架
链接:https://arxiv.org/abs/2601.05475

作者:Jiefu Ou,Sapana Chaudhary,Kaj Bostrom,Nathaniel Weir,Shuai Zhang,Huzefa Rangwala,George Karypis
摘要:大型语言模型(LLM)在一般编码任务中表现出强大的能力,但在优化代码时面临两个关键挑战:(i)编写优化代码(如高性能CUDA内核和竞赛级CPU代码)的复杂性需要系统、算法和特定语言方面的专业知识;(ii)需要解释超出二元正确性的性能指标,如时延和设备利用率。在这项工作中,我们探索推理时搜索算法,通过基于执行反馈的迭代细化引导LLM发现更好的解决方案。我们的方法称为MaxCode,它将现有的搜索方法统一在最大奖励强化学习框架下,使观察函数和动作价值函数模块化、便于修改。为了增强观察空间,我们集成了一个自然语言批评模型,将原始执行反馈转化为关于错误和性能瓶颈的诊断见解,以及迄今所见的最佳折扣奖励;两者一起为代码提议函数提供了更丰富的输入。为了改善搜索过程中的探索,我们利用rollout得到的动作价值训练了一个生成式的剩余奖励(reward-to-go)模型,对候选解决方案重新排序。在KernelBench(CUDA)和PIE(C++)优化基准测试上的结果表明,与基线相比,MaxCode提升了优化后代码的性能,在绝对加速值和相对加速排名上分别实现了20.3%和10.1%的相对提升。
摘要:Large Language Models (LLMs) demonstrate strong capabilities in general coding tasks but encounter two key challenges when optimizing code: (i) the complexity of writing optimized code (such as performant CUDA kernels and competition-level CPU code) requires expertise in systems, algorithms and specific languages and (ii) requires interpretation of performance metrics like timing and device utilization beyond binary correctness. In this work, we explore inference-time search algorithms that guide the LLM to discover better solutions through iterative refinement based on execution feedback. Our approach, called MaxCode unifies existing search methods under a max-reward reinforcement learning framework, making the observation and action-value functions modular for modification. To enhance the observation space, we integrate a natural language critique model that converts raw execution feedback into diagnostic insights about errors and performance bottlenecks, and the best-discounted reward seen so far. Together, these provide richer input to the code proposal function. To improve exploration during search, we train a generative reward-to-go model using action values from rollouts to rerank potential solutions. Testing on the KernelBench (CUDA) and PIE (C++) optimization benchmarks shows that MaxCode improves optimized code performance compared to baselines, achieving 20.3% and 10.1% relative improvements in absolute speedup value and relative speedup ranking, respectively.
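最大奖励搜索的骨架可以示意如下:迭代地提议候选、执行评估、保留迄今最优奖励的候选,并把批评信号回馈给提议函数。提议器与评估器在此都是桩函数("运行时间"越小奖励越高),数值均为假设,并非MaxCode的实现。

```python
# 玩具最大奖励细化循环:保留迄今最优候选,并把批评信号回馈给提议器。
# propose / evaluate 均为桩函数,数值为假设。

def propose(prev_code, critique):
    """桩提议器:每步细化削减一部分'运行时间'。"""
    return prev_code - critique

def evaluate(code):
    """桩奖励:'运行时间'越小,奖励越高。"""
    return -code

def max_reward_search(init=10, steps=4):
    best, best_r = init, evaluate(init)
    cand = init
    for _ in range(steps):
        cand = propose(cand, critique=2)
        r = evaluate(cand)
        if r > best_r:  # 最大奖励准则:只在奖励提升时更新最优解
            best, best_r = cand, r
    return best

print(max_reward_search())  # 2
```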


【2】Interactive Distillation for Cooperative Multi-Agent Reinforcement Learning
标题:协作多智能体强化学习的交互式蒸馏
链接:https://arxiv.org/abs/2601.05407

作者:Minwoo Cho,Batuhan Altundas,Matthew Gombolay
摘要:知识蒸馏(KD)有望通过为分散的学生配备一个集中式教师来加速MARL,但面临关键瓶颈。具体而言:(1)在复杂领域合成高性能教学策略存在挑战;(2)教师必须在分布外(OOD)状态下推理时存在困难;(3)分散学生与集中教师的观察空间之间存在不匹配。为了解决这些局限性,我们提出了HINT(分层交互式基于教师的迁移),一种用于集中式训练、分散式执行设置下MARL的新型KD框架。通过利用分层RL,HINT提供了一个可扩展的高性能教师。我们的关键创新是伪离策略(pseudo off-policy)RL,它使教师策略能够同时利用教师和学生的经验进行更新,从而提高OOD适应能力。HINT还应用基于性能的过滤,仅保留与结果相关的指导,减少观察不匹配。我们在具有挑战性的合作领域(例如用于资源分配的FireCommander和用于战术战斗的MARINE)上评估了HINT。在这些基准测试中,HINT的表现优于基线,成功率提高了60%至165%。
摘要:Knowledge distillation (KD) has the potential to accelerate MARL by employing a centralized teacher for decentralized students but faces key bottlenecks. Specifically, there are (1) challenges in synthesizing high-performing teaching policies in complex domains, (2) difficulties when teachers must reason in out-of-distribution (OOD) states, and (3) mismatches between the decentralized students' and the centralized teacher's observation spaces. To address these limitations, we propose HINT (Hierarchical INteractive Teacher-based transfer), a novel KD framework for MARL in a centralized training, decentralized execution setup. By leveraging hierarchical RL, HINT provides a scalable, high-performing teacher. Our key innovation, pseudo off-policy RL, enables the teacher policy to be updated using both teacher and student experience, thereby improving OOD adaptation. HINT also applies performance-based filtering to retain only outcome-relevant guidance, reducing observation mismatches. We evaluate HINT on challenging cooperative domains (e.g., FireCommander for resource allocation, MARINE for tactical combat). Across these benchmarks, HINT outperforms baselines, achieving improvements of 60% to 165% in success rate.


【3】Sequential Bayesian Optimal Experimental Design in Infinite Dimensions via Policy Gradient Reinforcement Learning
标题:基于策略梯度强化学习的无限维序贯贝叶斯最优实验设计
链接:https://arxiv.org/abs/2601.05868

作者:Kaichen Shen,Peng Chen
摘要:求解偏微分方程(PDE)控制的反问题的序贯贝叶斯最优实验设计(SBOED)在计算上极具挑战性,特别是对于无限维随机场参数:高保真方法需要在嵌套的贝叶斯反演与设计循环内重复进行正向和伴随PDE求解。我们将SBOED表述为有限时域马尔可夫决策过程,并通过策略梯度强化学习(PGRL)学习一个摊销的设计策略,从而能够根据实验历史在线选择设计,而无需重复求解SBOED优化问题。为使策略训练和奖励评估具有可扩展性,我们将双重降维(对参数的主动子空间投影和对状态的主成分分析)与一个经调整的、导数信息驱动的潜在注意力神经算子(LANO)代理模型相结合,该代理同时预测参数到解的映射及其雅可比矩阵。我们使用基于拉普拉斯的D-最优性奖励,同时指出,一般而言,其他期望信息增益效用(如KL散度)也可以在同一框架内使用。我们进一步引入了一种基于特征值的评估策略,使用先验样本作为最大后验(MAP)点的代理,在保留准确信息增益估计的同时避免重复求解MAP。针对污染源追踪的序贯多传感器布置的数值实验表明,相比高保真有限元方法约有100倍的加速,性能优于随机传感器布置,且学到的策略在物理上可解释:它发现了一种"上游"追踪策略。
摘要:Sequential Bayesian optimal experimental design (SBOED) for PDE-governed inverse problems is computationally challenging, especially for infinite-dimensional random field parameters. High-fidelity approaches require repeated forward and adjoint PDE solves inside nested Bayesian inversion and design loops. We formulate SBOED as a finite-horizon Markov decision process and learn an amortized design policy via policy-gradient reinforcement learning (PGRL), enabling online design selection from the experiment history without repeatedly solving an SBOED optimization problem. To make policy training and reward evaluation scalable, we combine dual dimension reduction -- active subspace projection for the parameter and principal component analysis for the state -- with an adjusted derivative-informed latent attention neural operator (LANO) surrogate that predicts both the parameter-to-solution map and its Jacobian. We use a Laplace-based D-optimality reward while noting that, in general, other expected-information-gain utilities such as KL divergence can also be used within the same framework. We further introduce an eigenvalue-based evaluation strategy that uses prior samples as proxies for maximum a posteriori (MAP) points, avoiding repeated MAP solves while retaining accurate information-gain estimates. Numerical experiments on sequential multi-sensor placement for contaminant source tracking demonstrate approximately $100\times$ speedup over high-fidelity finite element methods, improved performance over random sensor placements, and physically interpretable policies that discover an ``upstream'' tracking strategy.


【4】Autonomous Discovery of the Ising Model's Critical Parameters with Reinforcement Learning
标题:利用强化学习自主发现伊辛模型的关键参数
链接:https://arxiv.org/abs/2601.05577

作者:Hai Man,Chaobo Wang,Jia-Rui Li,Yuping Tian,Shu-Gang Chen
备注:37 pages, 9 figures. This is the Accepted Manuscript of an article published in J. Stat. Mech
摘要:传统的确定临界参数的方法往往受到人为因素的影响。该研究引入了一个物理启发的自适应强化学习框架,使智能体能够自主地与物理环境进行交互,同时精确地识别伊辛模型中的临界温度和各种类型的临界指数。有趣的是,我们的算法表现出的搜索行为让人想起相变,有效地收敛到目标参数,无论初始条件。实验结果表明,该方法显着优于传统的方法,特别是在环境中的强扰动。这项研究不仅将物理概念融入机器学习以增强算法的可解释性,而且还建立了科学探索的新范式,从人工分析过渡到自主AI发现。
摘要:Traditional methods for determining critical parameters are often influenced by human factors. This research introduces a physics-inspired adaptive reinforcement learning framework that enables agents to autonomously interact with physical environments, simultaneously identifying both the critical temperature and various types of critical exponents in the Ising model with precision. Interestingly, our algorithm exhibits search behavior reminiscent of phase transitions, efficiently converging to target parameters regardless of initial conditions. Experimental results demonstrate that this method significantly outperforms traditional approaches, particularly in environments with strong perturbations. This study not only incorporates physical concepts into machine learning to enhance algorithm interpretability but also establishes a new paradigm for scientific exploration, transitioning from manual analysis to autonomous AI discovery.
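作为参照,二维方格伊辛模型的临界温度有Onsager精确解,可用于核对此类自主搜索智能体应收敛到的目标值(单位为 J/k_B):

```python
import math

# 参照值:二维方格伊辛模型的Onsager精确临界温度(单位 J/k_B),
# 即此类自主搜索智能体应收敛到的目标。
T_c = 2 / math.log(1 + math.sqrt(2))
print(round(T_c, 4))  # 2.2692
```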


元学习(1篇)

【1】TIME: Temporally Intelligent Meta-reasoning Engine for Context Triggered Explicit Reasoning
标题:TIME:用于上下文触发显式推理的时间智能元推理引擎
链接:https://arxiv.org/abs/2601.05300

作者:Susmit Das
备注:14 pages, 3 figures with 27 page appendix. See https://github.com/The-Coherence-Initiative/TIME and https://github.com/The-Coherence-Initiative/TIMEBench for associated code
摘要:面向推理的大型语言模型通常在每个响应开始时,将显式"思考"暴露为冗长的、覆盖整个回合的全局轨迹,要么始终开启,要么在推理时由外部切换。虽然这对算术、编程和问题求解很有用,但这种设计成本高昂,模糊了声明级别的可审计性,并且一旦模型开始呈现答案就无法重新触发显式推理。对话模型在很大程度上也无视时间结构:除非在文本中说明时间,否则几秒钟后的回复与几周后的回复被视为等同。我们介绍TIME(时间智能元推理引擎),一个行为对齐框架,它将显式推理视为由话语和时间线索驱动的上下文敏感资源。TIME通过可选的ISO 8601标签、表示静默间隔的tick回合,以及可出现在回复中任何位置的简短推理块来增强对话。一个四阶段课程(包括一个小型的、最大化多样性的全批量对齐步骤)训练Qwen3稠密模型调用简短的就地推理爆发,并保持面向用户文本的紧凑。我们用TIMEBench进行评估,这是一个时间上有依据的对话基准,用于探测时间顺序、间隔与偏移下的常识、异常检测和连续性。在4B到32B的规模范围内,TIME在思考和非思考模式下都比基础Qwen3提高了TIMEBench分数,同时将推理标记减少了约一个数量级。我们的训练数据和代码可在https://github.com/The-Coherence-Initiative/TIME上获得,TIMEBench可在https://github.com/The-Coherence-Initiative/TIMEBench上获得
摘要:Reasoning oriented large language models often expose explicit "thinking" as long, turn-global traces at the start of every response, either always on or toggled externally at inference time. While useful for arithmetic, programming, and problem solving, this design is costly, blurs claim level auditability, and cannot re-trigger explicit reasoning once the model begins presenting. Dialogue models are also largely blind to temporal structure, treating replies after seconds and replies after weeks as equivalent unless time is stated in text. We introduce TIME, the Temporally Intelligent Meta-reasoning Engine, a behavioral alignment framework that treats explicit reasoning as a context sensitive resource driven by discourse and temporal cues. TIME augments dialogue with optional ISO 8601  tags, tick turns that represent silent gaps, and short  blocks that can appear anywhere in a reply. A four-phase curriculum including a small, maximally diverse full-batch alignment step trains Qwen3 dense models to invoke brief, in-place reasoning bursts and keep user facing text compact. We evaluate with TIMEBench, a temporally grounded dialogue benchmark probing chronology, commonsense under gaps and offsets, anomaly detection, and continuity. Across 4B to 32B scales, TIME improves TIMEBench scores over base Qwen3 in both thinking and no-thinking modes while reducing reasoning tokens by about an order of magnitude. Our training data and code are available at https://github.com/The-Coherence-Initiative/TIME and TIMEBench is available at https://github.com/The-Coherence-Initiative/TIMEBench


医学相关(3篇)

【1】Prompt-Free SAM-Based Multi-Task Framework for Breast Ultrasound Lesion Segmentation and Classification
标题:基于SAM的无提示乳腺超声病变分割和分类多任务框架
链接:https://arxiv.org/abs/2601.05498

作者:Samuel E. Johnny,Bernes L. Atabonfack,Israel Alagbe,Assane Gueye
摘要:由于低对比度、斑点噪声和不同的病变形态,乳腺超声(BUS)成像中的准确肿瘤分割和分类仍然具有挑战性。这项研究提出了一个多任务深度学习框架,该框架使用Segment Anything Model(SAM)视觉编码器的嵌入来联合执行病变分割和诊断分类。与基于提示的SAM变体不同,我们的方法采用了无提示的、完全监督的自适应,其中高维SAM特征通过轻量级卷积头或UNet启发的解码器进行像素级分割解码。分类分支通过掩模引导的注意力来增强,允许模型专注于病变相关特征,同时抑制背景伪影。在PRECISE 2025乳腺超声数据集上进行的实验,每个类分为80%的训练和20%的测试,表明该方法实现了0.887的Dice相似系数(DSC)和92.3%的准确率,在PRECISE挑战排行榜上名列前茅。这些结果表明,基于SAM的表示在结合分割引导的学习时,显著改善了乳腺超声成像中的病变描绘和诊断预测。
摘要:Accurate tumor segmentation and classification in breast ultrasound (BUS) imaging remain challenging due to low contrast, speckle noise, and diverse lesion morphology. This study presents a multi-task deep learning framework that jointly performs lesion segmentation and diagnostic classification using embeddings from the Segment Anything Model (SAM) vision encoder. Unlike prompt-based SAM variants, our approach employs a prompt-free, fully supervised adaptation where high-dimensional SAM features are decoded through either a lightweight convolutional head or a UNet-inspired decoder for pixel-wise segmentation. The classification branch is enhanced via mask-guided attention, allowing the model to focus on lesion-relevant features while suppressing background artifacts. Experiments on the PRECISE 2025 breast ultrasound dataset, split per class into 80 percent training and 20 percent testing, show that the proposed method achieves a Dice Similarity Coefficient (DSC) of 0.887 and an accuracy of 92.3 percent, ranking among the top entries on the PRECISE challenge leaderboard. These results demonstrate that SAM-based representations, when coupled with segmentation-guided learning, significantly improve both lesion delineation and diagnostic prediction in breast ultrasound imaging.
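作为补充,下面给出一段掩模引导注意力思想的极简示意(纯NumPy):用预测的病变掩模作为空间注意力权重对特征图做加权池化,使分类分支聚焦于病变区域。其中的函数名与数组形状均为编者假设,并非论文的实际实现。

```python
import numpy as np

def mask_guided_pooling(features, mask, eps=1e-8):
    """Pool feature maps using a (soft) lesion mask as spatial attention.

    features: (C, H, W) feature maps from an encoder.
    mask:     (H, W) predicted segmentation probabilities in [0, 1].
    Returns a (C,) vector that emphasises lesion regions.
    """
    weights = mask / (mask.sum() + eps)  # normalise to a spatial distribution
    return (features * weights[None, :, :]).sum(axis=(1, 2))

rng = np.random.default_rng(0)
feats = rng.normal(size=(4, 8, 8))
mask = np.zeros((8, 8))
mask[2:5, 2:5] = 1.0                     # toy lesion region
pooled = mask_guided_pooling(feats, mask)
```

对于二值掩模,这等价于对病变区域内的特征取均值,背景像素被完全抑制。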


【2】Autonomous Probe Microscopy with Robust Bag-of-Features Multi-Objective Bayesian Optimization: Pareto-Front Mapping of Nanoscale Structure-Property Trade-Offs
标题:具有稳健特征袋多目标贝叶斯优化的自主探针显微镜:纳米尺度结构-性质权衡的帕累托前沿映射
链接:https://arxiv.org/abs/2601.05528

作者:Kamyar Barakati,Haochen Zhu,C Charlotte Buchanan,Dustin A Gilbert,Philip Rack,Sergei V. Kalinin
备注:25 pages, 5 figures
摘要:组合材料库是生成大家族候选组合物的有效途径,但它们的影响通常受到表征的速度和深度以及从复杂表征数据中提取可操作的结构-性质关系的困难的限制。在这里,我们开发了一个自主扫描探针显微镜(SPM)框架,集成了自动原子力和磁力显微镜(AFM/MFM),以快速探索跨组合传播库的磁性和结构特性。为了在没有明确优化目标的情况下实现系统的自动探索,我们引入了测量表面形态和磁性结构的静态物理信息特征袋(BoF)表示与多目标贝叶斯优化(MOBO)的组合,以发现特征的相对重要性和鲁棒性。由此产生的闭环工作流程选择性地采样组成梯度,并重建与密集网格“地面实况”测量一致的特征景观。由此产生的帕累托结构揭示了多个纳米级目标同时优化,粗糙度,一致性和磁对比度之间的权衡是不可避免的,以及组合物的家庭如何聚类成不同的功能制度,从而将多特征成像数据转化为竞争结构-属性趋势的可解释地图。虽然演示了金钴镍和AFM/MFM,该方法是通用的,可以扩展到其他组合系统,成像方式和功能集,说明如何基于功能的MOBO和自主SPM可以将显微镜图像从静态数据产品转化为实时,多目标材料发现的主动反馈。
摘要 :Combinatorial materials libraries are an efficient route to generate large families of candidate compositions, but their impact is often limited by the speed and depth of characterization and by the difficulty of extracting actionable structure-property relations from complex characterization data. Here we develop an autonomous scanning probe microscopy (SPM) framework that integrates automated atomic force and magnetic force microscopy (AFM/MFM) to rapidly explore magnetic and structural properties across combinatorial spread libraries. To enable automated exploration of systems without a clear optimization target, we introduce a combination of a static physics-informed bag-of-features (BoF) representation of measured surface morphology and magnetic structure with multi-objective Bayesian optimization (MOBO) to discover the relative significance and robustness of features. The resulting closed-loop workflow selectively samples the compositional gradient and reconstructs feature landscapes consistent with dense grid "ground truth" measurements. The resulting Pareto structure reveals where multiple nanoscale objectives are simultaneously optimized, where trade-offs between roughness, coherence, and magnetic contrast are unavoidable, and how families of compositions cluster into distinct functional regimes, thereby turning multi-feature imaging data into interpretable maps of competing structure-property trends. While demonstrated for Au-Co-Ni and AFM/MFM, the approach is general and can be extended to other combinatorial systems, imaging modalities, and feature sets, illustrating how feature-based MOBO and autonomous SPM can transform microscopy images from static data products into active feedback for real-time, multi-objective materials discovery.
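摘要中的帕累托结构可以用一个简单的非支配点筛选来说明。以下为编者给出的纯NumPy示意(假设所有目标均取最大化),并非论文代码:

```python
import numpy as np

def pareto_front(points):
    """Boolean mask of non-dominated points (all objectives maximised).

    A point is dominated if some other point is >= on every objective
    and strictly > on at least one.
    """
    points = np.asarray(points, dtype=float)
    keep = np.ones(len(points), dtype=bool)
    for i in range(len(points)):
        dominated = (np.all(points >= points[i], axis=1)
                     & np.any(points > points[i], axis=1))
        keep[i] = not dominated.any()
    return keep

# toy (roughness-vs-contrast style) objective values for 4 candidates
objectives = np.array([[1.0, 2.0], [2.0, 1.0], [0.0, 0.0], [2.0, 2.0]])
front = pareto_front(objectives)
```

实际的MOBO循环会在每次测量后重新计算这一前沿,并据此选择下一个采样成分。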


【3】Channel Selected Stratified Nested Cross Validation for Clinically Relevant EEG Based Parkinsons Disease Detection
标题:用于临床相关的基于脑电的帕金森病检测的通道选择分层嵌套交叉验证
链接:https://arxiv.org/abs/2601.05276

作者:Nicholas R. Rasmussen,Rodrigue Rizk,Longwei Wang,Arun Singh,KC Santosh
备注:Submitted to IEEE Conference -> posting to Arxiv as normal
摘要:帕金森病的早期检测仍然是临床神经科学中的一个关键挑战,脑电图为人群水平的筛查提供了一种非侵入性和可扩展的途径。虽然机器学习在这一领域显示出了希望,但许多报告的结果都存在方法上的缺陷,最明显的是患者层面的数据泄露,夸大了性能估计,限制了临床翻译。为了解决这些建模缺陷,我们提出了一个统一的评估框架,以嵌套交叉验证为基础,并结合了三个互补的保障措施:(i)患者水平分层,以消除受试者重叠,并确保无偏的泛化,(ii)多层窗口,以协调异构EEG记录,同时保留时间动态,以及(iii)内环通道选择,使原则性的特征减少,而不泄漏信息。应用于三个具有不同数量通道的独立数据集,在此框架下训练的卷积神经网络实现了80.6%的准确率,并在人口块测试下表现出最先进的性能,与文献中的其他方法相当。这种性能强调了嵌套交叉验证的必要性,作为防止偏倚的保障措施,并作为选择患者水平决策最相关信息的原则性手段,提供了可扩展到其他生物医学信号分析领域的可重复基础。
摘要:The early detection of Parkinsons disease remains a critical challenge in clinical neuroscience, with electroencephalography offering a noninvasive and scalable pathway toward population level screening. While machine learning has shown promise in this domain, many reported results suffer from methodological flaws, most notably patient level data leakage, inflating performance estimates and limiting clinical translation. To address these modeling pitfalls, we propose a unified evaluation framework grounded in nested cross validation and incorporating three complementary safeguards: (i) patient level stratification to eliminate subject overlap and ensure unbiased generalization, (ii) multi layered windowing to harmonize heterogeneous EEG recordings while preserving temporal dynamics, and (iii) inner loop channel selection to enable principled feature reduction without information leakage. Applied across three independent datasets with a heterogeneous number of channels, a convolutional neural network trained under this framework achieved 80.6% accuracy and demonstrated state of the art performance under held out population block testing, comparable to other methods in the literature. This performance underscores the necessity of nested cross validation as a safeguard against bias and as a principled means of selecting the most relevant information for patient level decisions, providing a reproducible foundation that can extend to other biomedical signal analysis domains.
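摘要强调的患者级分层可以用分组折叠划分来说明:同一患者的所有EEG窗口必须落在训练集或测试集的同一侧,以避免受试者层面的数据泄露。以下为编者的简化示意(纯NumPy,非论文实现):

```python
import numpy as np

def patient_level_folds(groups, n_splits=3, seed=0):
    """Yield (train_idx, test_idx) where no patient appears on both sides."""
    rng = np.random.default_rng(seed)
    patients = np.array(sorted(set(groups)))
    rng.shuffle(patients)
    groups = np.asarray(groups)
    for held_out in np.array_split(patients, n_splits):
        test = np.isin(groups, held_out)
        yield np.flatnonzero(~test), np.flatnonzero(test)

# 12 EEG windows from 6 patients (2 windows each)
groups = [p for p in range(6) for _ in range(2)]
splits = list(patient_level_folds(groups, n_splits=3))
```

论文中的嵌套版本会在每个外层训练折内再做一次这样的划分,用于通道选择与超参数调优。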


蒸馏|知识提取(2篇)

【1】Distilling Lightweight Domain Experts from Large ML Models by Identifying Relevant Subspaces
标题:通过识别相关子空间从大型ML模型中提取轻量级领域专家
链接:https://arxiv.org/abs/2601.05913

作者:Pattarawat Chormai,Ali Hashemi,Klaus-Robert Müller,Grégoire Montavon
备注:20 pages + supplement
摘要:知识蒸馏涉及将大型高性能AI模型(教师)的预测能力转移到较小的模型(学生),这些模型可以在计算能力有限的环境中运行。在本文中,我们解决的情况下,只有少数几个类及其相关的中间概念是相关的蒸馏。这种情况在实践中很常见,但现有的蒸馏方法很少明确关注相关的子任务。为了解决这一差距,我们引入了“SubDistill”,这是一种新的蒸馏算法,具有改进的数值特性,仅在每一层提取教师模型的相关组件。在CIFAR-100和ImageNet上使用卷积和Transformer模型进行的实验表明,SubDistill在一组代表性的子任务上优于现有的分层蒸馏技术。我们的基准评估得到了可解释的人工智能分析的补充,表明我们提炼的学生模型更接近于原始教师模型的决策结构。
摘要:Knowledge distillation involves transferring the predictive capabilities of large, high-performing AI models (teachers) to smaller models (students) that can operate in environments with limited computing power. In this paper, we address the scenario in which only a few classes and their associated intermediate concepts are relevant to distill. This scenario is common in practice, yet few existing distillation methods explicitly focus on the relevant subtask. To address this gap, we introduce 'SubDistill', a new distillation algorithm with improved numerical properties that only distills the relevant components of the teacher model at each layer. Experiments on CIFAR-100 and ImageNet with Convolutional and Transformer models demonstrate that SubDistill outperforms existing layer-wise distillation techniques on a representative set of subtasks. Our benchmark evaluations are complemented by Explainable AI analyses showing that our distilled student models more closely match the decision structure of the original teacher model.
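"仅蒸馏每层相关组件"的思想可以用一个子空间投影蒸馏损失来示意:先把教师激活投影到相关子空间,再让学生去匹配投影结果。以下代码为编者假设的简化形式,并非SubDistill的实际算法:

```python
import numpy as np

def subspace_distill_loss(teacher_act, student_act, U):
    """MSE between student activations and teacher activations projected
    onto a relevant subspace spanned by the orthonormal columns of U."""
    target = teacher_act @ U @ U.T  # keep only task-relevant directions
    return float(np.mean((student_act - target) ** 2))

rng = np.random.default_rng(1)
teacher = rng.normal(size=(16, 8))            # a batch of teacher activations
U, _ = np.linalg.qr(rng.normal(size=(8, 3)))  # 3 "relevant" directions
student_perfect = teacher @ U @ U.T           # student matching the subspace
loss = subspace_distill_loss(teacher, student_perfect, U)
```

关键在于:学生只需复现教师在相关方向上的行为,与子任务无关的分量不会进入损失。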


【2】Compressing image encoders via latent distillation
标题:通过潜蒸馏压缩图像编码器
链接:https://arxiv.org/abs/2601.05639

作者:Caroline Mazini Rodrigues,Nicolas Keriven,Thomas Maugey
摘要:用于图像压缩的深度学习模型通常在硬件受限的应用中面临实际限制。虽然这些模型实现了高质量的重建,但它们通常是复杂的,重量级的,并且需要大量的训练数据和计算资源。我们提出了一种方法来部分压缩这些网络,通过减少其编码器的大小。我们的方法使用简化的知识蒸馏策略来近似原始模型的潜在空间,使用更少的数据和更短的训练,从重量级编码器中产生轻量级编码器。我们评估所产生的轻量级编码器在两个不同的架构上的图像压缩任务。实验表明,我们的方法保留重建质量和统计保真度优于训练轻量级编码器与原始损失,使其适用于资源有限的环境。
摘要:Deep learning models for image compression often face practical limitations in hardware-constrained applications. Although these models achieve high-quality reconstructions, they are typically complex, heavyweight, and require substantial training data and computational resources. We propose a methodology to partially compress these networks by reducing the size of their encoders. Our approach uses a simplified knowledge distillation strategy to approximate the latent space of the original models with less data and shorter training, yielding lightweight encoders from heavyweight ones. We evaluate the resulting lightweight encoders across two different architectures on the image compression task. Experiments show that our method preserves reconstruction quality and statistical fidelity better than training lightweight encoders with the original loss, making it practical for resource-limited environments.


聚类(1篇)

【1】From Global to Local: Cluster-Aware Learning for Wi-Fi Fingerprinting Indoor Localisation
标题:从全局到局部:用于Wi-Fi指纹室内定位的簇感知学习
链接:https://arxiv.org/abs/2601.05650

作者 :Miguel Matey-Sanz,Joaquín Torres-Sospedra,Joaquín Huerta,Sergio Trilles
备注:20 pages, 9 figures, 6 tables
摘要:Wi-Fi指纹识别仍然是室内定位最实用的解决方案之一,然而,其性能通常受到指纹数据集的大小和异质性、强烈的接收信号强度指示(RSSI)变化性以及大型多楼层环境中引入的模糊性的限制。这些因素显著降低定位精度,特别是在不考虑结构约束的情况下应用全局模型时。本文介绍了一种在定位之前对指纹数据集进行结构化的基于聚类的方法。使用空间或无线电特征对指纹进行分组,并且可以在建筑物或楼层级别应用聚类。在定位阶段,基于最强接入点的聚类估计过程将未见过的指纹分配给最相关的聚类。然后,仅在选定的聚类内执行定位,从而允许学习模型在更小且更连贯的数据子集上操作。在三个公共数据集和几个机器学习模型上评估了该方法的有效性。结果表明定位误差持续减少,特别是在建筑物级策略下,但以降低楼层检测精度为代价。这些结果表明,通过聚类显式地结构化数据集是一种实现可扩展室内定位的有效且灵活的方法。
摘要:Wi-Fi fingerprinting remains one of the most practical solutions for indoor positioning, however, its performance is often limited by the size and heterogeneity of fingerprint datasets, strong Received Signal Strength Indicator variability, and the ambiguity introduced in large and multi-floor environments. These factors significantly degrade localisation accuracy, particularly when global models are applied without considering structural constraints. This paper introduces a clustering-based method that structures the fingerprint dataset prior to localisation. Fingerprints are grouped using either spatial or radio features, and clustering can be applied at the building or floor level. In the localisation phase, a clustering estimation procedure based on the strongest access points assigns unseen fingerprints to the most relevant cluster. Localisation is then performed only within the selected clusters, allowing learning models to operate on reduced and more coherent subsets of data. The effectiveness of the method is evaluated on three public datasets and several machine learning models. Results show a consistent reduction in localisation errors, particularly under building-level strategies, but at the cost of reducing the floor detection accuracy. These results demonstrate that explicitly structuring datasets through clustering is an effective and flexible approach for scalable indoor positioning.
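摘要中"基于最强接入点的聚类估计"可以用一个玩具示例说明:取未见指纹的k个最强AP,与各聚类的主导AP集合比较重叠度,选重叠最大的聚类。函数名与数据均为编者假设:

```python
import numpy as np

def assign_cluster_by_strongest_aps(rss, cluster_top_aps, k=3):
    """Assign an unseen fingerprint (RSS vector, dBm) to the cluster whose
    dominant access points overlap most with its k strongest APs."""
    strongest = set(np.argsort(rss)[-k:].tolist())  # indices of k largest RSS
    overlaps = [len(strongest & set(aps)) for aps in cluster_top_aps]
    return int(np.argmax(overlaps))

# toy RSS over 6 APs; cluster 0 is dominated by APs {0,1,2}, cluster 1 by {3,4,5}
cluster_top_aps = [{0, 1, 2}, {3, 4, 5}]
fingerprint = np.array([-90, -85, -92, -40, -45, -50])  # strongest: APs 3, 4, 5
cluster = assign_cluster_by_strongest_aps(fingerprint, cluster_top_aps)
```

分配完成后,定位模型只需在该聚类的指纹子集上检索或推断。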


联邦学习|隐私保护|加密(2篇)

【1】SAFE: Secure and Accurate Federated Learning for Privacy-Preserving Brain-Computer Interfaces
标题:SAFE:安全准确的联邦学习,用于保护隐私的脑机接口
链接:https://arxiv.org/abs/2601.05789

作者:Tianwang Jia,Xiaoqing Chen,Dongrui Wu
备注:12 pages, 9 figures
摘要:基于脑电图(EEG)的脑机接口(BCI)由于其高效性和可移植性而被广泛采用;然而,其解码算法仍然面临多种挑战,包括泛化不足、对抗性脆弱性和隐私泄露。本文提出了安全且准确的联邦学习(SAFE),这是一种基于联邦学习的方法,通过在模型训练期间将数据保持在本地来保护用户隐私。SAFE采用局部批次特定的归一化来减轻跨受试者特征分布的变化,从而提高模型的泛化能力。它通过联邦对抗训练和对抗权重扰动在输入空间和参数空间中引入扰动,进一步增强了对抗鲁棒性。对来自运动想象(MI)和事件相关电位(ERP)BCI范式的五个EEG数据集的实验表明,SAFE在解码准确性和对抗鲁棒性方面始终优于14种最先进的方法,同时确保隐私保护。值得注意的是,它甚至优于完全不考虑隐私保护的集中式训练方法。据我们所知,SAFE是第一个在不使用来自目标受试者的任何校准数据的情况下,同时实现高解码精度、强大的对抗鲁棒性和可靠的隐私保护的算法,这使得它非常适合现实世界的BCI。
摘要:Electroencephalogram (EEG)-based brain-computer interfaces (BCIs) are widely adopted due to their efficiency and portability; however, their decoding algorithms still face multiple challenges, including inadequate generalization, adversarial vulnerability, and privacy leakage. This paper proposes Secure and Accurate FEderated learning (SAFE), a federated learning-based approach that protects user privacy by keeping data local during model training. SAFE employs local batch-specific normalization to mitigate cross-subject feature distribution shifts and hence improves model generalization. It further enhances adversarial robustness by introducing perturbations in both the input space and the parameter space through federated adversarial training and adversarial weight perturbation. Experiments on five EEG datasets from motor imagery (MI) and event-related potential (ERP) BCI paradigms demonstrated that SAFE consistently outperformed 14 state-of-the-art approaches in both decoding accuracy and adversarial robustness, while ensuring privacy protection. Notably, it even outperformed centralized training approaches that do not consider privacy protection at all. To our knowledge, SAFE is the first algorithm to simultaneously achieve high decoding accuracy, strong adversarial robustness, and reliable privacy protection without using any calibration data from the target subject, making it highly desirable for real-world BCIs.


【2】When the Server Steps In: Calibrated Updates for Fair Federated Learning
标题:当服务器介入时:公平联合学习的校准更新
链接:https://arxiv.org/abs/2601.05352

作者:Tianrun Yu,Kaixiang Zhao,Cheng Zhang,Anjun Gao,Yueyang Quan,Zhuqing Liu,Minghong Fang
摘要:联邦学习(FL)已经成为一种变革性的分布式学习范式,使多个客户端能够在中央服务器的协调下协作训练全局模型,而无需共享原始训练数据。虽然FL提供了显着的优势,但它在确保不同人口群体的公平性方面面临着严峻的挑战。为了解决这些公平性问题,已经提出了各种公平性感知的去偏置方法。然而,这些方法中的许多方法要么需要修改客户的培训协议,要么在其聚合策略中缺乏灵活性。在这项工作中,我们通过引入EquFL来解决这些局限性,EquFL是一种新颖的服务器端去偏置方法,旨在减轻FL系统中的偏差。EquFL允许服务器在从客户端接收模型更新后生成单个校准更新。然后将该校准的更新与聚合的客户端更新集成,以产生减少偏差的调整的全局模型。从理论上讲,我们建立EquFL收敛到FedAvg实现的最优全局模型,并有效地减少了训练轮的公平性损失。从经验上讲,我们证明EquFL显着减轻系统内的偏见,展示其实际有效性。
摘要:Federated learning (FL) has emerged as a transformative distributed learning paradigm, enabling multiple clients to collaboratively train a global model under the coordination of a central server without sharing their raw training data. While FL offers notable advantages, it faces critical challenges in ensuring fairness across diverse demographic groups. To address these fairness concerns, various fairness-aware debiasing methods have been proposed. However, many of these approaches either require modifications to clients' training protocols or lack flexibility in their aggregation strategies. In this work, we address these limitations by introducing EquFL, a novel server-side debiasing method designed to mitigate bias in FL systems. EquFL operates by allowing the server to generate a single calibrated update after receiving model updates from the clients. This calibrated update is then integrated with the aggregated client updates to produce an adjusted global model that reduces bias. Theoretically, we establish that EquFL converges to the optimal global model achieved by FedAvg and effectively reduces fairness loss over training rounds. Empirically, we demonstrate that EquFL significantly mitigates bias within the system, showcasing its practical effectiveness.
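"服务器生成单个校准更新并与聚合的客户端更新集成"这一步骤可以粗略示意如下;其中的混合规则(均值加上加权校准向量)是编者的说明性假设,论文的实际规则可能不同:

```python
import numpy as np

def aggregate_with_calibration(client_updates, calibration, alpha=0.1):
    """FedAvg-style mean of client updates plus a server-side calibrated
    update; the additive mixing rule here is illustrative only."""
    avg = np.mean(client_updates, axis=0)   # standard FedAvg aggregation
    return avg + alpha * calibration        # server-side debiasing term

updates = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 1.0]])
no_cal = aggregate_with_calibration(updates, np.zeros(2))
with_cal = aggregate_with_calibration(updates, np.array([0.0, 3.0]))
```

当校准向量为零时,该规则退化为纯FedAvg,这与论文"收敛到FedAvg最优全局模型"的理论结论在直觉上是一致的。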


推理|分析|理解|解释(6篇)

【1】PaCoRe: Learning to Scale Test-Time Compute with Parallel Coordinated Reasoning
标题:PaCoRe:学习用并行协调推理扩展测试时计算
链接:https://arxiv.org/abs/2601.05593

作者:Jingcheng Hu,Yinmin Zhang,Shijie Shang,Xiaobo Yang,Yue Peng,Zhewei Huang,Hebin Zhou,Xin Wu,Jie Cheng,Fanqi Wan,Xiangwen Kong,Chengyuan Yao,Kaiwen Yan,Ailin Huang,Hongyu Zhou,Qi Han,Zheng Ge,Daxin Jiang,Xiangyu Zhang,Heung-Yeung Shum
摘要:我们介绍并行协调推理(PaCoRe),一个训练和推理框架,旨在克服当代语言模型的一个核心限制:它们无法在固定的上下文窗口下扩展测试时计算(TTC)远远超出顺序推理。PaCoRe通过在多轮中通过消息传递架构协调的大规模并行探索来驱动TTC,从而偏离了传统的顺序范式。每一轮都启动许多并行的推理轨迹,将他们的发现压缩成上下文绑定的消息,并综合这些消息来指导下一轮并最终产生最终答案。通过大规模的基于结果的强化学习进行端到端的训练,该模型掌握了PaCoRe所需的合成能力,并在不超过上下文限制的情况下扩展到数百万个令牌的有效TTC。该方法在不同领域都有很大的改进,尤其是将推理推向了数学前沿系统之外:8B模型在HMMT 2025上达到了94.5%,通过将有效TTC扩展到大约200万个令牌,超过了GPT-5的93.2%。我们开源了模型检查点、训练数据和完整的推理管道,以加速后续工作。
摘要:We introduce Parallel Coordinated Reasoning (PaCoRe), a training-and-inference framework designed to overcome a central limitation of contemporary language models: their inability to scale test-time compute (TTC) far beyond sequential reasoning under a fixed context window. PaCoRe departs from the traditional sequential paradigm by driving TTC through massive parallel exploration coordinated via a message-passing architecture in multiple rounds. Each round launches many parallel reasoning trajectories, compacts their findings into context-bounded messages, and synthesizes these messages to guide the next round and ultimately produce the final answer. Trained end-to-end with large-scale, outcome-based reinforcement learning, the model masters the synthesis abilities required by PaCoRe and scales to multi-million-token effective TTC without exceeding context limits. The approach yields strong improvements across diverse domains, and notably pushes reasoning beyond frontier systems in mathematics: an 8B model reaches 94.5% on HMMT 2025, surpassing GPT-5's 93.2% by scaling effective TTC to roughly two million tokens. We open-source model checkpoints, training data, and the full inference pipeline to accelerate follow-up work.
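PaCoRe的"多轮并行轨迹—压缩为消息—指导下一轮"流程可以用一个确定性的玩具编排器来示意;`solve`与`synthesize`均为编者假设的占位函数,真实系统中各轨迹并行执行且由语言模型完成求解与综合:

```python
def pacore_rounds(solve, synthesize, n_workers=4, n_rounds=2):
    """Toy sketch of PaCoRe-style coordination: each round launches
    parallel trajectories, compacts their findings into a message,
    and feeds it to the next round. Names and structure are illustrative."""
    messages = []
    for _ in range(n_rounds):
        # in a real system these calls run in parallel across workers
        findings = [solve(i, messages) for i in range(n_workers)]
        messages = [synthesize(findings)]   # compact findings into one message
    return messages[0]

# toy workers: each proposes its id plus the best message seen so far;
# synthesis keeps the best finding, so quality improves across rounds
best = pacore_rounds(
    solve=lambda i, msgs: i + (msgs[0] if msgs else 0),
    synthesize=max,
)
```

要点在于:上下文里只保留压缩后的消息而非全部轨迹,因此有效测试时计算可以远超单个上下文窗口。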


【2】DeMa: Dual-Path Delay-Aware Mamba for Efficient Multivariate Time Series Analysis
标题:DeMa:双路径延迟感知Mamba,用于高效的多元时间序列分析
链接:https://arxiv.org/abs/2601.05527

作者:Rui An,Haohao Qu,Wenqi Fan,Xuequn Shang,Qing Li
备注:Under review
摘要:准确、高效的多元时间序列(MTS)分析对于广泛的智能应用越来越重要。在这个领域中,Transformers由于其捕获成对依赖关系的强大能力而成为主要的体系结构。然而,基于Transformer的模型具有二次计算复杂度和高内存开销,限制了其可扩展性和在长期和大规模MTS建模中的实际部署。最近,Mamba已经成为一个有前途的线性时间替代品,具有高表现力。然而,直接将原始Mamba应用于MTS仍然是次优的,这是由于三个关键限制:(i)缺乏明确的跨变量建模,(ii)难以解开纠缠的系列内时间动态和系列间相互作用,以及(iii)对潜在时滞相互作用效应的建模不足。这些问题限制了其在各种MTS任务中的有效性。为了解决这些挑战,我们提出了DeMa,一个双路径延迟感知的Mamba骨干。DeMa保留了Mamba的线性复杂度优势,同时大大提高了其对MTS设置的适用性。具体来说,DeMa引入了三个关键创新:(i)它将MTS分解为系列内的时间动态和系列间的相互作用;(ii)它开发了一个带有Mamba-SSD模块的时间路径,以捕获每个单独系列内的长期动态,从而实现独立于系列的并行计算;以及(iii)它设计了具有Mamba-DALA模块的变量路径,该Mamba-DALA模块集成了延迟感知线性注意力以建模跨变量依赖性。对五个代表性任务(长期和短期预测、数据插补、异常检测和序列分类)的广泛实验表明,DeMa在提供卓越计算效率的同时实现了最先进的性能。
摘要:Accurate and efficient multivariate time series (MTS) analysis is increasingly critical for a wide range of intelligent applications. Within this realm, Transformers have emerged as the predominant architecture due to their strong ability to capture pairwise dependencies. However, Transformer-based models suffer from quadratic computational complexity and high memory overhead, limiting their scalability and practical deployment in long-term and large-scale MTS modeling. Recently, Mamba has emerged as a promising linear-time alternative with high expressiveness. Nevertheless, directly applying vanilla Mamba to MTS remains suboptimal due to three key limitations: (i) the lack of explicit cross-variate modeling, (ii) difficulty in disentangling the entangled intra-series temporal dynamics and inter-series interactions, and (iii) insufficient modeling of latent time-lag interaction effects. These issues constrain its effectiveness across diverse MTS tasks. To address these challenges, we propose DeMa, a dual-path delay-aware Mamba backbone. DeMa preserves Mamba's linear-complexity advantage while substantially improving its suitability for MTS settings. Specifically, DeMa introduces three key innovations: (i) it decomposes the MTS into intra-series temporal dynamics and inter-series interactions; (ii) it develops a temporal path with a Mamba-SSD module to capture long-range dynamics within each individual series, enabling series-independent, parallel computation; and (iii) it designs a variate path with a Mamba-DALA module that integrates delay-aware linear attention to model cross-variate dependencies. Extensive experiments on five representative tasks, long- and short-term forecasting, data imputation, anomaly detection, and series classification, demonstrate that DeMa achieves state-of-the-art performance while delivering remarkable computational efficiency.


【3】Explainable AI: Learning from the Learners
标题:可解释的人工智能:向学习者学习
链接:https://arxiv.org/abs/2601.05525

作者:Ricardo Vinuesa,Steven L. Brunton,Gianmarco Mengaldo
摘要:人工智能现在在几项科学和工程任务中表现优于人类,但其内部表示往往仍然不透明。在这个角度来看,我们认为,可解释的人工智能(XAI),与因果推理相结合,使{\it从学习者学习}。专注于发现,优化和认证,我们展示了基础模型和可解释性方法的组合如何允许提取因果机制,指导稳健的设计和控制,并支持高风险应用程序中的信任和问责制。我们讨论了在解释的忠实性、概括性和可用性方面的挑战,并提出XAI作为人类与人工智能在科学和工程领域合作的统一框架。
摘要:Artificial intelligence now outperforms humans in several scientific and engineering tasks, yet its internal representations often remain opaque. In this Perspective, we argue that explainable artificial intelligence (XAI), combined with causal reasoning, enables {\it learning from the learners}. Focusing on discovery, optimization and certification, we show how the combination of foundation models and explainability methods allows the extraction of causal mechanisms, guides robust design and control, and supports trust and accountability in high-stakes applications. We discuss challenges in faithfulness, generalization and usability of explanations, and propose XAI as a unifying framework for human-AI collaboration in science and engineering.


【4】Optimizing Digital Adjudication through Social Network Analysis: An Empirical Study of Credit Card Disputes in Beijing
标题:通过社交网络分析优化数字审判:北京信用卡纠纷实证研究
链接:https://arxiv.org/abs/2601.05299

作者:Chung Han Tsai,ChengTo Lin,Baowen Zhang,Qingyue Deng,Yunhui Zhao,Zhijia Song
摘要:在司法系统快速数字化的背景下,将大数据整合到裁决中的探索仍然不足,特别是在揭示法律应用的结构逻辑方面。本文通过社会网络分析(SNA)对北京市审理的涉及个人信息保护的信用卡纠纷案件进行了实证分析。通过构建一个法律引证网络,我们揭示了实体法和程序法适用的潜在模式。研究结果表明,SNA可以有效地识别核心法律规范和典型案例,为优化“数字法院”系统提供了一个强大的方法框架。这些见解为通过数据驱动的案件检索和整体司法信息网络提高司法效率和一致性提供了切实可行的途径。
摘要:Amid the rapid digitalization of judicial systems, the integration of big data into adjudication remains underexplored, particularly in uncovering the structural logic of legal applications. This study bridges this gap by employing social network analysis (SNA) to examine credit card disputes involving personal information protection adjudicated in Beijing. By constructing a legal citation network, we reveal the latent patterns of substantive and procedural law application. The findings demonstrate that SNA can effectively identify core legal norms and typify cases, offering a robust methodological framework for optimizing 'Digital Court' systems. These insights provide practical pathways for enhancing judicial efficiency and consistency through data-driven case retrieval and holistic judicial information networks.
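构建法律引证网络并识别核心法律规范,最简单的做法之一是统计法条节点的入度中心性。以下为编者的纯Python示意;其中的案件与法条标签均为虚构示例,并非论文数据:

```python
from collections import Counter

def top_cited_norms(citation_edges, n=2):
    """In-degree ranking over (case, statute) citation edges: the most
    frequently applied legal norms sit at the core of the network."""
    indegree = Counter(statute for _case, statute in citation_edges)
    return indegree.most_common(n)

# hypothetical case -> statute citation edges
edges = [("case1", "Art.196"), ("case2", "Art.196"),
         ("case2", "Art.1034"), ("case3", "Art.196")]
core = top_cited_norms(edges)
```

实际的SNA分析通常还会结合中介中心性、社区发现等指标来对案件进行类型化。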


【5】Cedalion Tutorial: A Python-based framework for comprehensive analysis of multimodal fNIRS & DOT from the lab to the everyday world
标题:Cedalion教程:一个基于Python的框架,用于从实验室到日常世界的多模态fNIRS和DOT的全面分析
链接:https://arxiv.org/abs/2601.05923

作者:E. Middell,L. Carlton,S. Moradi,T. Codina,T. Fischer,J. Cutler,S. Kelley,J. Behrendt,T. Dissanayake,N. Harmening,M. A. Yücel,D. A. Boas,A. von Lühmann
备注:33 pages main manuscript, 180 pages Supplementary Tutorial Notebooks, 12 figures, 6 tables, under review in SPIE Neurophotonics
摘要:功能性近红外光谱(fNIRS)和漫射光学断层扫描(DOT)正在迅速发展,成为日常生活中可穿戴、多模态、数据驱动、人工智能支持的神经成像。然而,目前的分析工具在平台上是分散的,限制了可重复性、互操作性以及与现代机器学习(ML)工作流程的集成。Cedalion是一个基于Python的开源框架,旨在将多模态fNIRS和DOT数据的高级基于模型和数据驱动的分析统一在可重现、可扩展和社区驱动的环境中。Cedalion在基于Python生态系统的单一标准化架构中集成了前向建模、摄影测量光极配准、信号处理、GLM分析、DOT图像重建和基于ML的数据驱动方法。它遵循SNIRF和BIDS标准,支持云端可执行的Jupyter笔记本,并为可扩展的、完全可再现的分析管道提供容器化的工作流程,这些分析管道可以与原始研究出版物一起提供。Cedalion将已建立的光学神经成像管道与scikit-learn和PyTorch等ML框架连接起来,实现了与EEG、MEG和生理数据的无缝多模态融合。它实现了信号质量评估、运动校正、GLM建模和DOT重建的已验证算法,并辅以模拟、数据增强和多模态生理分析模块。自动化文档将每个方法链接到其来源出版物,持续集成测试确保了稳健性。本教程提供了七个完全可执行的笔记本,演示了核心功能。Cedalion提供了一个开放、透明且可由社区扩展的基础,支持可重现、可扩展、云和ML就绪的fNIRS/DOT工作流程,用于基于实验室和现实世界的神经成像。
摘要:Functional near-infrared spectroscopy (fNIRS) and diffuse optical tomography (DOT) are rapidly evolving toward wearable, multimodal, and data-driven, AI-supported neuroimaging in the everyday world. However, current analytical tools are fragmented across platforms, limiting reproducibility, interoperability, and integration with modern machine learning (ML) workflows. Cedalion is a Python-based open-source framework designed to unify advanced model-based and data-driven analysis of multimodal fNIRS and DOT data within a reproducible, extensible, and community-driven environment. Cedalion integrates forward modelling, photogrammetric optode co-registration, signal processing, GLM Analysis, DOT image reconstruction, and ML-based data-driven methods within a single standardized architecture based on the Python ecosystem. It adheres to SNIRF and BIDS standards, supports cloud-executable Jupyter notebooks, and provides containerized workflows for scalable, fully reproducible analysis pipelines that can be provided alongside original research publications. Cedalion connects established optical-neuroimaging pipelines with ML frameworks such as scikit-learn and PyTorch, enabling seamless multimodal fusion with EEG, MEG, and physiological data. It implements validated algorithms for signal-quality assessment, motion correction, GLM modelling, and DOT reconstruction, complemented by modules for simulation, data augmentation, and multimodal physiology analysis. Automated documentation links each method to its source publication, and continuous-integration testing ensures robustness. This tutorial paper provides seven fully executable notebooks that demonstrate core features. Cedalion offers an open, transparent, and community extensible foundation that supports reproducible, scalable, cloud- and ML-ready fNIRS/DOT workflows for laboratory-based and real-world neuroimaging.


【6】A Bayesian Generative Modeling Approach for Arbitrary Conditional Inference
标题:任意条件推理的Bayesian生成建模方法
链接:https://arxiv.org/abs/2601.05355

作者:Qiao Liu,Wing Hung Wong
摘要:现代数据分析越来越需要灵活的条件推理P(X_B|X_A),其中(X_A,X_B)是观测变量X的任意划分。现有的条件推理方法缺乏这种灵活性,因为它们被绑定到固定的条件结构,并且一旦训练就不能执行新的条件推理。为了解决这个问题,我们提出了一种无需重新训练即可进行任意条件推理的贝叶斯生成建模(BGM)方法。BGM通过迭代贝叶斯更新算法学习X的生成模型,其中模型参数和潜在变量被更新直到收敛。一旦训练,任何条件分布都可以在不重新训练的情况下获得。从经验上讲,BGM通过良好校准的预测区间实现了卓越的预测性能,表明单个学习模型可以作为具有不确定性量化的条件预测的通用引擎。我们提供了随机迭代算法的收敛性、统计一致性和条件风险界的理论保证。提出的BGM框架利用人工智能的力量来捕获变量之间的复杂关系,同时遵守贝叶斯原则,成为推进现代数据科学中各种应用的有前途的框架。BGM的代码可以在https://github.com/liuq-lab/bayesgm上免费获得。
摘要:Modern data analysis increasingly requires flexible conditional inference P(X_B | X_A) where (X_A, X_B) is an arbitrary partition of observed variable X. Existing conditional inference methods lack this flexibility as they are tied to a fixed conditioning structure and cannot perform new conditional inference once trained. To solve this, we propose a Bayesian generative modeling (BGM) approach for arbitrary conditional inference without retraining. BGM learns a generative model of X through an iterative Bayesian updating algorithm where model parameters and latent variables are updated until convergence. Once trained, any conditional distribution can be obtained without retraining. Empirically, BGM achieves superior prediction performance with well calibrated predictive intervals, demonstrating that a single learned model can serve as a universal engine for conditional prediction with uncertainty quantification. We provide theoretical guarantees for the convergence of the stochastic iterative algorithm, statistical consistency and conditional-risk bounds. The proposed BGM framework leverages the power of AI to capture complex relationships among variables while adhering to Bayesian principles, emerging as a promising framework for advancing various applications in modern data science. The code for BGM is freely available at https://github.com/liuq-lab/bayesgm.
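"任意划分下的条件推理P(X_B|X_A)"这一概念可以在高斯特例中精确示意:对任意索引划分,条件均值与协方差有闭式解。以下代码只是说明该概念的特例,并非BGM本身的算法:

```python
import numpy as np

def gaussian_conditional(mu, Sigma, idx_b, idx_a, x_a):
    """Mean and covariance of X_B | X_A = x_a under a joint Gaussian,
    for an arbitrary partition (idx_b, idx_a) of the variable indices."""
    mu, Sigma, x_a = map(np.asarray, (mu, Sigma, x_a))
    # regression coefficient K = Sigma_BA @ Sigma_AA^{-1}
    K = Sigma[np.ix_(idx_b, idx_a)] @ np.linalg.inv(Sigma[np.ix_(idx_a, idx_a)])
    cond_mu = mu[idx_b] + K @ (x_a - mu[idx_a])
    cond_cov = Sigma[np.ix_(idx_b, idx_b)] - K @ Sigma[np.ix_(idx_a, idx_b)]
    return cond_mu, cond_cov

mu = np.zeros(2)
Sigma = np.array([[1.0, 0.5], [0.5, 1.0]])
m, C = gaussian_conditional(mu, Sigma, idx_b=[1], idx_a=[0], x_a=[1.0])
```

BGM的意义正在于把这种"训练一次、任意划分条件化"的能力推广到高斯之外的复杂生成模型。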


检测相关(4篇)

【1】Community-Based Model Sharing and Generalisation: Anomaly Detection in IoT Temperature Sensor Networks
标题:基于社区的模型共享与泛化:物联网温度传感器网络中的异常检测
链接:https://arxiv.org/abs/2601.05984

作者:Sahibzada Saadoon Hammad,Joaquín Huerta Guijarro,Francisco Ramos,Michael Gould Carlson,Sergio Trilles Oliver
备注:20 pages, 9 figures, Journal submission
摘要:物联网(IoT)设备的快速部署导致了实时监测环境和城市现象的大规模传感器网络。兴趣社区(CoI)通过对具有相似操作和环境特征的设备进行分组,为组织异构物联网传感器网络提供了一个有前途的范例。这项工作提出了一个基于CoI范式的异常检测框架,使用融合的相似性矩阵将传感器分组到社区中;该矩阵通过斯皮尔曼系数纳入时间相关性,通过高斯距离衰减纳入空间接近度,并纳入海拔相似性。对于每个社区,选择基于最佳轮廓系数的代表性站点,并使用贝叶斯超参数优化和扩展窗口交叉验证来训练三种自动编码器架构(BiLSTM、LSTM和MLP),并在来自同一聚类的站点和其他聚类的最佳代表性站点上进行测试。模型在数据的正常温度模式上进行训练,并通过重建误差分析检测异常。实验结果表明,在所评估的配置中,社区内的性能稳健,而社区之间则观察到差异。总体而言,结果支持基于社区的模型共享在减少计算开销和分析物联网传感器网络模型泛化能力方面的适用性。
摘要:The rapid deployment of Internet of Things (IoT) devices has led to large-scale sensor networks that monitor environmental and urban phenomena in real time. Communities of Interest (CoIs) provide a promising paradigm for organising heterogeneous IoT sensor networks by grouping devices with similar operational and environmental characteristics. This work presents an anomaly detection framework based on the CoI paradigm by grouping sensors into communities using a fused similarity matrix that incorporates temporal correlations via Spearman coefficients, spatial proximity using Gaussian distance decay, and elevation similarities. For each community, representative stations based on the best silhouette are selected and three autoencoder architectures (BiLSTM, LSTM, and MLP) are trained using Bayesian hyperparameter optimization with expanding window cross-validation and tested on stations from the same cluster and the best representative stations of other clusters. The models are trained on normal temperature patterns of the data and anomalies are detected through reconstruction error analysis. Experimental results show a robust within-community performance across the evaluated configurations, while variations across communities are observed. Overall, the results support the applicability of community-based model sharing in reducing computational overhead and to analyse model generalisability across IoT sensor networks.
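摘要中的融合相似性矩阵(斯皮尔曼时间相关性 + 高斯距离衰减的空间接近度 + 海拔相似性)可以示意如下;等权重与带宽参数均为编者的说明性选择,并非论文设置:

```python
import numpy as np

def spearman_matrix(X):
    """Pairwise Spearman correlation between sensor series (rows of X),
    computed as Pearson correlation of the ranks."""
    ranks = np.argsort(np.argsort(X, axis=1), axis=1).astype(float)
    return np.corrcoef(ranks)

def fused_similarity(X, coords, elev, sigma_d=1.0, sigma_e=1.0,
                     weights=(1 / 3, 1 / 3, 1 / 3)):
    """Combine temporal (Spearman), spatial (Gaussian distance decay) and
    elevation similarities into a single matrix for community detection."""
    S_t = spearman_matrix(X)
    D = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    S_s = np.exp(-(D ** 2) / (2 * sigma_d ** 2))
    E = np.abs(elev[:, None] - elev[None, :])
    S_e = np.exp(-(E ** 2) / (2 * sigma_e ** 2))
    return weights[0] * S_t + weights[1] * S_s + weights[2] * S_e

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 50))             # 4 sensors, 50 time steps
coords = rng.uniform(0, 2, size=(4, 2))  # sensor positions
elev = rng.uniform(0, 1, size=4)         # sensor elevations
S = fused_similarity(X, coords, elev)
```

得到的对称矩阵随后可输入谱聚类等社区发现算法。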


【2】Detecting Autism Spectrum Disorder with Deep Eye Movement Features
标题:利用深度眼动特征检测自闭症谱系障碍
链接 :https://arxiv.org/abs/2601.05812

作者:Zhanpei Huang,Taochen Chen,Fangqing Gu,Yiqun Zhang
备注:Accepted to CIS 2025
摘要:自闭症谱系障碍(ASD)是一种神经发育障碍,其特征是社交沟通和行为模式方面的缺陷。眼动数据为ASD检测提供了一种非侵入性诊断工具,因为它本质上是离散的,并表现出短期的时间依赖性,反映了注视点之间的局部注视焦点。这些特征使数据能够更深入地了解微妙的行为标记,将ASD相关模式与典型发展区分开来。眼动信号主要包含短时和局部依赖性。然而,尽管在基于Transformer的模型中广泛应用了堆叠注意力层来捕获长距离依赖关系,但我们的实验结果表明,这种方法在应用于眼动数据时仅产生有限的好处。这可能是因为离散的注视点和凝视焦点的短期依赖性降低了全局注意力机制的效用,使得它们比专注于局部时间模式的架构效率更低。为了有效捕捉区分ASD与典型发展(TD)个体的微妙而复杂的眼动模式,我们设计了具有类感知表示和不平衡感知机制的离散短期序列(DSTS)建模框架。通过对多个眼动数据集的广泛实验,DSTS的性能优于传统的机器学习技术和更复杂的深度学习模型。
摘要:Autism Spectrum Disorder (ASD) is a neurodevelopmental disorder characterized by deficits in social communication and behavioral patterns. Eye movement data offers a non-invasive diagnostic tool for ASD detection, as it is inherently discrete and exhibits short-term temporal dependencies, reflecting localized gaze focus between fixation points. These characteristics enable the data to provide deeper insights into subtle behavioral markers, distinguishing ASD-related patterns from typical development. Eye movement signals mainly contain short-term and localized dependencies. However, despite the widespread application of stacked attention layers in Transformer-based models for capturing long-range dependencies, our experimental results indicate that this approach yields only limited benefits when applied to eye movement data. This may be because discrete fixation points and short-term dependencies in gaze focus reduce the utility of global attention mechanisms, making them less efficient than architectures focusing on local temporal patterns. To efficiently capture subtle and complex eye movement patterns, distinguishing ASD from typically developing (TD) individuals, a discrete short-term sequential (DSTS) modeling framework is designed with Class-aware Representation and Imbalance-aware Mechanisms. Through extensive experiments on several eye movement datasets, DSTS outperforms both traditional machine learning techniques and more sophisticated deep learning models.


【3】Variational Autoencoders for P-wave Detection on Strong Motion Earthquake Spectrograms
标题:用于强震地震谱图P波检测的变分自动编码器
链接:https://arxiv.org/abs/2601.05759

作者:Turkan Simge Ispak,Salih Tileylioglu,Erdem Akagunduz
备注:13 pages, 8 figures, 3 tables
摘要:准确的P波检测对于地震预警至关重要,但由于高噪声水平、有限的标记数据和复杂的波形特征,强震记录带来了挑战。本研究将P波到达检测重新定义为一个自我监督的异常检测任务,以评估架构变化如何调节重建保真度和异常歧视之间的权衡。通过对492个变分自动编码器配置的全面网格搜索,我们表明,虽然跳过连接最大限度地减少了重建误差(平均绝对误差约为0.0012),但它们会导致“过度泛化”,使模型能够重建噪声并掩盖检测信号。相比之下,注意力机制优先考虑全局上下文而不是局部细节,并产生最高的检测性能,曲线下面积为0.875。基于注意力的变分自动编码器在0至40公里近源范围内实现了0.91的曲线下面积,证明了对即时预警应用的高度适用性。这些研究结果表明,有利于全局上下文的像素完美重建的架构约束是必不可少的鲁棒性,自我监督的P波检测。
摘要:Accurate P-wave detection is critical for earthquake early warning, yet strong-motion records pose challenges due to high noise levels, limited labeled data, and complex waveform characteristics. This study reframes P-wave arrival detection as a self-supervised anomaly detection task to evaluate how architectural variations regulate the trade-off between reconstruction fidelity and anomaly discrimination. Through a comprehensive grid search of 492 Variational Autoencoder configurations, we show that while skip connections minimize reconstruction error (Mean Absolute Error approximately 0.0012), they induce "overgeneralization", allowing the model to reconstruct noise and masking the detection signal. In contrast, attention mechanisms prioritize global context over local detail and yield the highest detection performance with an area-under-the-curve of 0.875. The attention-based Variational Autoencoder achieves an area-under-the-curve of 0.91 in the 0 to 40-kilometer near-source range, demonstrating high suitability for immediate early warning applications. These findings establish that architectural constraints favoring global context over pixel-perfect reconstruction are essential for robust, self-supervised P-wave detection.
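The reconstruction-error principle behind this kind of self-supervised anomaly detection can be illustrated with a minimal sketch. The "autoencoder" stand-in below is just a moving-average smoother (an illustrative assumption, not the paper's VAE): it reconstructs slow background well but fails at a sharp onset, so the per-sample reconstruction error spikes near the arrival.

```python
# Sketch of reconstruction-error anomaly scoring, the detection principle the
# paper applies to spectrograms. The "autoencoder" here is a stand-in
# (a moving-average smoother), not the paper's VAE.

def moving_average(signal, window=5):
    """Crude stand-in for an autoencoder reconstruction: smooths the input,
    so it reproduces slow background but not a sharp arrival."""
    half = window // 2
    out = []
    for i in range(len(signal)):
        lo, hi = max(0, i - half), min(len(signal), i + half + 1)
        out.append(sum(signal[lo:hi]) / (hi - lo))
    return out

def anomaly_scores(signal, window=5):
    """Per-sample absolute reconstruction error."""
    recon = moving_average(signal, window)
    return [abs(s - r) for s, r in zip(signal, recon)]

# Flat background with one sharp onset at index 50.
signal = [0.0] * 50 + [1.0] * 50
scores = anomaly_scores(signal)
onset = max(range(len(scores)), key=scores.__getitem__)
print(onset)  # the error peaks at the discontinuity
```

The same thresholding logic carries over when the smoother is replaced by a trained VAE and the signal by a spectrogram.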


【4】Detecting Stochasticity in Discrete Signals via Nonparametric Excursion Theorem
标题:通过非参数偏移定理检测离散信号中的随机性
链接:https://arxiv.org/abs/2601.06009

作者:Sunia Tanweer,Firas A. Khasawneh
摘要:我们开发了一个实用的框架,仅使用单一的离散时间序列来区分扩散随机过程与确定性信号。我们的方法基于经典的连续半鞅的偏移与穿越定理,其将幅度至少为$\varepsilon$的偏移次数$N_\varepsilon$与过程的二次变差$[X]_T$相关联。该标度律普遍适用于所有具有有限二次变差的连续半鞅,包括具有非线性或状态依赖波动率的一般伊藤扩散,但对于确定性系统则显著失效,从而提供了一种有理论保证的方法来区分这些动态,而非主观的基于熵或递归的现有方法。我们构建了一个稳健的数据驱动的扩散检验。该方法将经验偏移计数与理论期望进行比较。由此得到的比率$K(\varepsilon)=N_{\varepsilon}^{\mathrm{emp}}/N_{\varepsilon}^{\mathrm{theory}}$随后用对数-对数斜率偏差加以概括,该偏差衡量与$\varepsilon^{-2}$定律的吻合程度,并据此给出扩散类或非扩散类的分类。我们将此方法应用于典型随机系统、若干周期与混沌映射、带加性白噪声的系统以及随机Duffing系统。该方法是非参数的、无模型的,只依赖于连续半鞅普遍的小尺度结构。
摘要:We develop a practical framework for distinguishing diffusive stochastic processes from deterministic signals using only a single discrete time series. Our approach is based on classical excursion and crossing theorems for continuous semimartingales, which relate the number $N_\varepsilon$ of excursions of magnitude at least $\varepsilon$ to the quadratic variation $[X]_T$ of the process. The scaling law holds universally for all continuous semimartingales with finite quadratic variation, including general Ito diffusions with nonlinear or state-dependent volatility, but fails sharply for deterministic systems -- thereby providing a theoretically certified method of distinguishing between these dynamics, as opposed to the subjective entropy- or recurrence-based state-of-the-art methods. We construct a robust data-driven diffusion test. The method compares the empirical excursion counts against the theoretical expectation. The resulting ratio $K(\varepsilon)=N_{\varepsilon}^{\mathrm{emp}}/N_{\varepsilon}^{\mathrm{theory}}$ is then summarized by a log-log slope deviation measuring the $\varepsilon^{-2}$ law that provides a classification into diffusion-like or not. We demonstrate the method on canonical stochastic systems, some periodic and chaotic maps and systems with additive white noise, as well as the stochastic Duffing system. The approach is nonparametric, model-free, and relies only on the universal small-scale structure of continuous semimartingales.
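The $\varepsilon^{-2}$ diagnostic can be sketched in a few lines of stdlib Python. The excursion counter below (running range with reset) is one simple proxy for an excursion count, chosen purely for illustration: halving $\varepsilon$ roughly quadruples the count for a random walk but only doubles it for a smooth signal, whose count follows the total variation instead.

```python
# Illustrative excursion-counting test, assumed counting convention:
# an excursion of magnitude >= eps is registered each time the running
# range of the signal since the last reset exceeds eps.
import math
import random

def count_excursions(x, eps):
    count = 0
    lo = hi = x[0]
    for v in x[1:]:
        lo, hi = min(lo, v), max(hi, v)
        if hi - lo >= eps:
            count += 1
            lo = hi = v
    return count

random.seed(0)
n, sigma = 100_000, 0.01
walk = [0.0]
for _ in range(n):
    walk.append(walk[-1] + random.choice((-sigma, sigma)))
# Quadratic variation of the walk: [X]_T = n * sigma**2 = 10.

sine = [math.sin(2 * math.pi * 5 * t / n) for t in range(n)]  # smooth signal

# For a diffusion, N_eps scales like [X]_T / eps**2, so halving eps should
# roughly quadruple the count; for a smooth signal it only doubles.
ratio_walk = count_excursions(walk, 0.05) / count_excursions(walk, 0.1)
ratio_sine = count_excursions(sine, 0.05) / count_excursions(sine, 0.1)
print(ratio_walk, ratio_sine)
```

Fitting these counts over several values of $\varepsilon$ and checking the log-log slope against $-2$ is the spirit of the paper's classification rule.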


优化|敛散性(3篇)

【1】IIB-LPO: Latent Policy Optimization via Iterative Information Bottleneck
标题:IIB-LPO:基于迭代信息瓶颈的潜在策略优化
链接:https://arxiv.org/abs/2601.05870

作者:Huilin Deng,Hongchen Luo,Yue Zhu,Long Li,Zhuoyue Chen,Xinghao Zhao,Ming Li,Jihai Zhang,Mengchang Wang,Yang Cao,Yu Kang
摘要:大型语言模型(LLM)推理的可验证奖励强化学习(RLVR)的最新进展受到一个持续挑战的阻碍:探索崩溃。随机rollout的语义同质性通常会使模型陷入狭隘的、过度优化的行为中。虽然现有的方法利用策略熵来鼓励探索,但它们面临固有的局限性。全局熵正则化容易受到奖励黑客攻击的影响,可能导致无意义的冗长输出,而局部的令牌选择性更新则难以克服预训练模型的强归纳偏差。为了解决这个问题,我们提出了通过迭代信息瓶颈的潜在策略优化(IIB-LPO),这是一种将探索从令牌分布的统计扰动转向推理轨迹的拓扑分支的新方法。IIB-LPO在高熵状态下触发潜在分支以使推理路径多样化,并将信息瓶颈原则同时用作轨迹过滤器和自我奖励机制,确保简洁且信息丰富的探索。四个数学推理基准上的实证结果表明,IIB-LPO实现了最先进的性能,在准确率和多样性指标上分别以至多5.3%和7.4%的优势超越了先前的方法。
摘要:Recent advances in Reinforcement Learning with Verifiable Rewards (RLVR) for Large Language Model (LLM) reasoning have been hindered by a persistent challenge: exploration collapse. The semantic homogeneity of random rollouts often traps models in narrow, over-optimized behaviors. While existing methods leverage policy entropy to encourage exploration, they face inherent limitations. Global entropy regularization is susceptible to reward hacking, which can induce meaningless verbosity, whereas local token-selective updates struggle with the strong inductive bias of pre-trained models. To address this, we propose Latent Policy Optimization via Iterative Information Bottleneck (IIB-LPO), a novel approach that shifts exploration from statistical perturbation of token distributions to topological branching of reasoning trajectories. IIB-LPO triggers latent branching at high-entropy states to diversify reasoning paths and employs the Information Bottleneck principle both as a trajectory filter and a self-reward mechanism, ensuring concise and informative exploration. Empirical results across four mathematical reasoning benchmarks demonstrate that IIB-LPO achieves state-of-the-art performance, surpassing prior methods by margins of up to 5.3% in accuracy and 7.4% in diversity metrics.


【2】Orchestrating Tokens and Sequences: Dynamic Hybrid Policy Optimization for RLVR
标题:编排令牌和序列:面向RLVR的动态混合策略优化
链接:https://arxiv.org/abs/2601.05607

作者:Zijun Min,Bingshuai Liu,Ante Wang,Long Zhang,Anxiang Zeng,Haibo Zhang,Jinsong Su
摘要:带有可验证奖励的强化学习(RLVR)为在推理任务中优化大型语言模型提供了一个有前景的框架。然而,现有的RLVR算法侧重于不同的粒度,各有互补的优势和局限。组相对策略优化(GRPO)使用令牌级重要性比率更新策略,这保留了细粒度的信用分配,但通常存在高方差和不稳定性。相比之下,组序列策略优化(GSPO)对响应中的所有令牌应用单一的序列级重要性比率,更好地匹配序列级奖励,但牺牲了令牌级的信用分配。在本文中,我们提出了动态混合策略优化(DHPO),在单一的裁剪代理目标中桥接GRPO和GSPO。DHPO使用加权机制将令牌级和序列级重要性比率结合起来。我们探索了两种混合机制的变体,包括平均混合和熵引导混合。为了进一步稳定训练,我们采用了一种分支特定的裁剪策略,在混合之前将令牌级和序列级比率限制在各自的信任区域内,防止任一分支中的离群值主导更新。在七个具有挑战性的数学推理基准上,对Qwen3系列的稠密模型和MoE模型的实验表明,DHPO始终优于GRPO和GSPO。我们将在本文被接收后发布代码。
摘要:Reinforcement Learning with Verifiable Rewards (RLVR) offers a promising framework for optimizing large language models in reasoning tasks. However, existing RLVR algorithms focus on different granularities, and each has complementary strengths and limitations. Group Relative Policy Optimization (GRPO) updates the policy with token-level importance ratios, which preserves fine-grained credit assignment but often suffers from high variance and instability. In contrast, Group Sequence Policy Optimization (GSPO) applies a single sequence-level importance ratio across all tokens in a response, which better matches sequence-level rewards but sacrifices token-wise credit assignment. In this paper, we propose Dynamic Hybrid Policy Optimization (DHPO) to bridge GRPO and GSPO within a single clipped surrogate objective. DHPO combines token-level and sequence-level importance ratios using weighting mechanisms. We explore two variants of the mixing mechanism, including an averaged mixing and an entropy-guided mixing. To further stabilize training, we employ a branch-specific clipping strategy that constrains token-level and sequence-level ratios within separate trust regions before mixing, preventing outliers in either branch from dominating the update. Across seven challenging mathematical reasoning benchmarks, experiments on both dense and MoE models from the Qwen3 series show that DHPO consistently outperforms GRPO and GSPO. We will release our code upon acceptance of this paper.
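A minimal numeric sketch of the hybrid idea follows. Both branches use assumed forms (a length-normalized geometric mean for the sequence-level ratio, following GSPO's convention, and a simple averaged mix); the paper's exact weighting and its entropy-guided variant are not reproduced here.

```python
import math

def hybrid_clipped_ratios(logp_new, logp_old, eps_tok=0.2, eps_seq=0.1, alpha=0.5):
    """Sketch of DHPO-style mixing (assumed form, not the authors' exact one):
    clip token-level and sequence-level importance ratios in separate trust
    regions, then average them with weight alpha."""
    # Token-level importance ratios pi_new(t) / pi_old(t).
    tok = [math.exp(a - b) for a, b in zip(logp_new, logp_old)]
    # Sequence-level ratio: length-normalized geometric mean of token ratios.
    seq = math.exp((sum(logp_new) - sum(logp_old)) / len(logp_new))
    # Branch-specific clipping before mixing.
    tok_c = [min(max(r, 1 - eps_tok), 1 + eps_tok) for r in tok]
    seq_c = min(max(seq, 1 - eps_seq), 1 + eps_seq)
    return [alpha * t + (1 - alpha) * seq_c for t in tok_c]

mixed = hybrid_clipped_ratios(
    logp_new=[-0.5, -1.0, -2.5], logp_old=[-0.7, -1.1, -1.5])
print(mixed)
```

Because each branch is clipped in its own trust region before averaging, an outlier token ratio cannot drag the mixed ratio outside a bounded band, which is the stabilization motive described above.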


【3】Buffered AUC maximization for scoring systems via mixed-integer optimization
标题:通过混合整数优化的评分系统的缓冲AUC最大化
链接:https://arxiv.org/abs/2601.05544

作者:Moe Shiina,Shunnosuke Ikeda,Yuichi Takano
摘要:评分系统是由少量解释变量组成的线性分类器,每个解释变量被赋予一个小整数系数。该系统具有高度可解释性,并允许通过简单的手算进行预测,而无需计算器。若干先前研究已使用混合整数优化(MIO)技术来开发用于二分类的评分系统;然而,它们并未专注于直接最大化AUC(即受试者工作特征曲线下面积),尽管AUC被认为是评分系统的基本评价指标。我们的目标是建立一个有效的MIO框架,用于构建直接最大化缓冲AUC(bAUC,即AUC的最紧凹下界)的评分系统。我们的优化模型被表述为一个混合整数线性优化(MILO)问题,在限制评分系统中问题数量的组稀疏约束下最大化bAUC。使用公开可用的真实世界数据集的计算实验表明,与基于正则化和逐步回归的基线方法相比,我们的MILO方法能够构建具有更优AUC值的评分系统。这项研究有助于推动MIO技术在开发高度可解释分类模型方面的进步。
摘要:A scoring system is a linear classifier composed of a small number of explanatory variables, each assigned a small integer coefficient. This system is highly interpretable and allows predictions to be made with simple manual calculations without the need for a calculator. Several previous studies have used mixed-integer optimization (MIO) techniques to develop scoring systems for binary classification; however, they have not focused on directly maximizing AUC (i.e., area under the receiver operating characteristic curve), even though AUC is recognized as an essential evaluation metric for scoring systems. Our goal herein is to establish an effective MIO framework for constructing scoring systems that directly maximize the buffered AUC (bAUC) as the tightest concave lower bound on AUC. Our optimization model is formulated as a mixed-integer linear optimization (MILO) problem that maximizes bAUC subject to a group sparsity constraint for limiting the number of questions in the scoring system. Computational experiments using publicly available real-world datasets demonstrate that our MILO method can build scoring systems with superior AUC values compared to the baseline methods based on regularization and stepwise regression. This research contributes to the advancement of MIO techniques for developing highly interpretable classification models.
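Since bAUC is a concave lower bound on the AUC that the scoring system is ultimately judged by, it helps to see what that empirical AUC looks like for a small integer-weight scoring system. The sketch below uses toy weights and questions (not from the paper; bAUC itself and the MILO formulation are not implemented) and computes AUC as the fraction of correctly ranked positive/negative pairs.

```python
def auc(scores_pos, scores_neg):
    """Empirical AUC: fraction of positive/negative pairs ranked correctly,
    counting ties as half a win."""
    wins = sum((p > n) + 0.5 * (p == n) for p in scores_pos for n in scores_neg)
    return wins / (len(scores_pos) * len(scores_neg))

# A toy scoring system: a few binary questions with small integer coefficients.
weights = {"q1": 2, "q2": 1, "q3": -1}
def score(answers):
    return sum(weights[q] * answers[q] for q in weights)

pos = [score(a) for a in [{"q1": 1, "q2": 1, "q3": 0}, {"q1": 1, "q2": 0, "q3": 0}]]
neg = [score(a) for a in [{"q1": 0, "q2": 1, "q3": 1}, {"q1": 0, "q2": 0, "q3": 0}]]
print(auc(pos, neg))
```

The MILO formulation in the paper searches over such integer weight vectors, with a group-sparsity constraint limiting how many questions receive nonzero weight.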


预测|估计(7篇)

【1】Prophet as a Reproducible Forecasting Framework: A Methodological Guide for Business and Financial Analytics
标题:先知作为可重复预测框架:商业和金融分析方法论指南
链接:https://arxiv.org/abs/2601.05929

作者:Sidney Shapiro,Burhanuddin Panvelwala
摘要:在预测研究和实践中,重现性仍然是一个持续的挑战,特别是在商业和金融分析中,预测为高风险决策提供信息。传统的预测方法虽然在理论上可以解释,但通常需要大量的手动调整,并且难以在专有环境中复制。机器学习方法提供了预测的灵活性,但也带来了与可解释性、随机训练过程和跨环境再现性相关的挑战。本文研究了Prophet,这是一个由Meta开发的开源预测框架,它是一个可重复性的解决方案,可以平衡可解释性、标准化工作流程和可访问性。本研究不是提出一种新的算法,而是评估Prophet的加法结构、开源实现和标准化工作流程如何有助于透明和可复制的预测实践。使用公开的金融和零售数据集,我们将Prophet的性能和可解释性与多个ARIMA规范(自动选择,手动指定和季节性变量)和随机森林进行了比较。这种多模型比较为Prophet的相对性能和再现性优势提供了可靠的评估。通过具体的Python示例,我们展示了Prophet如何促进高效的预测工作流程以及与分析管道的集成。这项研究将Prophet置于可重复研究的更广泛背景下。它强调了Prophet作为支持验证、可验证性和方法严谨性的方法构建块的作用。这项工作为研究人员和从业者提供了一个实用的参考框架,用于在基于Python的研究工作流程中进行可重复的预测。
摘要:Reproducibility remains a persistent challenge in forecasting research and practice, particularly in business and financial analytics where forecasts inform high-stakes decisions. Traditional forecasting methods, while theoretically interpretable, often require extensive manual tuning and are difficult to replicate in proprietary environments. Machine learning approaches offer predictive flexibility but introduce challenges related to interpretability, stochastic training procedures, and cross-environment reproducibility. This paper examines Prophet, an open-source forecasting framework developed by Meta, as a reproducibility-enabling solution that balances interpretability, standardized workflows, and accessibility. Rather than proposing a new algorithm, this study evaluates how Prophet's additive structure, open-source implementation, and standardized workflow contribute to transparent and replicable forecasting practice. Using publicly available financial and retail datasets, we compare Prophet's performance and interpretability with multiple ARIMA specifications (auto-selected, manually specified, and seasonal variants) and Random Forest under a controlled and fully documented experimental design. This multi-model comparison provides a robust assessment of Prophet's relative performance and reproducibility advantages. Through concrete Python examples, we demonstrate how Prophet facilitates efficient forecasting workflows and integration with analytical pipelines. The study positions Prophet within the broader context of reproducible research. It highlights Prophet's role as a methodological building block that supports verification, auditability, and methodological rigor. This work provides researchers and practitioners with a practical reference framework for reproducible forecasting in Python-based research workflows.
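Prophet's interpretability rests on its additive structure, y(t) = g(t) + s(t) + noise. The stdlib sketch below imitates that structure on synthetic daily data (OLS trend plus day-of-week means); it illustrates the model form only and is not Prophet's actual Stan-based fitting procedure.

```python
# Stdlib sketch of the additive decomposition y(t) = g(t) + s(t) + noise that
# underlies Prophet (trend plus seasonality). Illustration of the model
# structure only, not Prophet's fitting procedure.

def fit_trend(y):
    """Ordinary least squares for the trend g(t) = a + b*t."""
    n = len(y)
    t_mean, y_mean = (n - 1) / 2, sum(y) / n
    b = (sum((t - t_mean) * (v - y_mean) for t, v in enumerate(y))
         / sum((t - t_mean) ** 2 for t in range(n)))
    return y_mean - b * t_mean, b

def fit_weekly(residuals):
    """Seasonal component s(t): mean residual per day-of-week."""
    return [sum(residuals[d::7]) / len(residuals[d::7]) for d in range(7)]

# Synthetic daily series: trend of 0.5/day plus a weekend bump of +3.
y = [0.5 * t + (3.0 if t % 7 in (5, 6) else 0.0) for t in range(56)]
a, b = fit_trend(y)
season = fit_weekly([v - (a + b * t) for t, v in enumerate(y)])
forecast = [a + b * t + season[t % 7] for t in range(56, 63)]
print(round(b, 2), [round(s, 1) for s in season])
```

Because each fitted component can be inspected separately (recovered slope near 0.5, weekend days clearly elevated in the seasonal vector), the decomposition supports exactly the kind of auditability the paper argues for.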


【2】Tensor-DTI: Enhancing Biomolecular Interaction Prediction with Contrastive Embedding Learning
标题:Tensor-DTI:通过对比嵌入学习增强生物分子相互作用预测
链接:https://arxiv.org/abs/2601.05792

作者:Manel Gil-Sorribes,Júlia Vilalta-Mor,Isaac Filella-Mercè,Robert Soliva,Álvaro Ciudad,Víctor Guallar,Alexis Molina
备注:Accepted at the Generative and Experimental Perspectives for Biomolecular Design Workshop at ICLR 2025 and at the Learning Meaningful Representations of Life Workshop at ICLR 2025
摘要:精确的药物-靶标相互作用(DTI)预测对于计算药物发现至关重要,然而现有模型通常依赖于单一模态的预定义分子描述符或基于序列的嵌入,代表性有限。我们提出了Tensor-DTI,这是一个对比学习框架,它集成了来自分子图、蛋白质语言模型和结合位点预测的多模态嵌入,以改进相互作用建模。Tensor-DTI采用孪生双编码器架构,使其能够捕获化学和结构相互作用特征,同时区分相互作用对与非相互作用对。在多个DTI基准上的评估表明,Tensor-DTI优于现有的基于序列和基于图的模型。我们还在十亿规模的化学库中对CDK2进行了大规模推断实验:即使CDK2未参与训练,Tensor-DTI也能产生化学上合理的命中分布。在针对Glide对接和Boltz-2共折叠模型的富集研究中,Tensor-DTI在CDK2上仍然具有竞争力,并且在严格的家族保留划分下,改善了回收家族外靶标中等比例高亲和力配体所需的筛选预算。此外,我们探索了其在蛋白质-RNA和肽-蛋白质相互作用上的适用性。我们的研究结果强调了将多模态信息与对比目标相结合的好处,以提高相互作用预测的准确性,并为虚拟筛选提供更具可解释性和可靠性感知的模型。
摘要:Accurate drug-target interaction (DTI) prediction is essential for computational drug discovery, yet existing models often rely on single-modality predefined molecular descriptors or sequence-based embeddings with limited representativeness. We propose Tensor-DTI, a contrastive learning framework that integrates multimodal embeddings from molecular graphs, protein language models, and binding-site predictions to improve interaction modeling. Tensor-DTI employs a siamese dual-encoder architecture, enabling it to capture both chemical and structural interaction features while distinguishing interacting from non-interacting pairs. Evaluations on multiple DTI benchmarks demonstrate that Tensor-DTI outperforms existing sequence-based and graph-based models. We also conduct large-scale inference experiments on CDK2 across billion-scale chemical libraries, where Tensor-DTI produces chemically plausible hit distributions even when CDK2 is withheld from training. In enrichment studies against Glide docking and Boltz-2 co-folder, Tensor-DTI remains competitive on CDK2 and improves the screening budget required to recover moderate fractions of high-affinity ligands on out-of-family targets under strict family-holdout splits. Additionally, we explore its applicability to protein-RNA and peptide-protein interactions. Our findings highlight the benefits of integrating multimodal information with contrastive objectives to enhance interaction-prediction accuracy and to provide more interpretable and reliability-aware models for virtual screening.


【3】PiXTime: A Model for Federated Time Series Forecasting with Heterogeneous Data Structures Across Nodes
标题:PiXTime:一种支持跨节点异构数据结构的联邦时间序列预测模型
链接:https://arxiv.org/abs/2601.05613

作者:Yiming Zhou,Mingyue Cheng,Hao Wang,Enhong Chen
摘要:时间序列是非常有价值的,并且很少在节点之间共享,这使得联邦学习成为利用分布式时态数据的有前途的范例。然而,不同的采样标准导致不同的时间粒度和变量集跨节点,阻碍了经典的联邦学习。我们提出了PiXTime,这是一种专为联邦学习设计的新型时间序列预测模型,可以跨具有多粒度和异构变量集的节点进行有效预测。PiXTime采用个性化的Patch Embedding将特定于节点的粒度时间序列映射为统一维度的令牌序列,以供后续共享模型处理,并使用全局VE Table来对齐跨节点的变量类别语义,从而增强跨节点的可传递性。通过基于transformer的共享模型,PiXTime捕获具有任意数量变量的辅助序列的表示,并使用交叉注意力来增强对目标序列的预测。实验表明,PiXTime在联邦设置中实现了最先进的性能,并在八个广泛使用的现实世界的传统基准测试中表现出卓越的性能。
摘要:Time series are highly valuable and rarely shareable across nodes, making federated learning a promising paradigm to leverage distributed temporal data. However, different sampling standards lead to diverse time granularities and variable sets across nodes, hindering classical federated learning. We propose PiXTime, a novel time series forecasting model designed for federated learning that enables effective prediction across nodes with multi-granularity and heterogeneous variable sets. PiXTime employs a personalized Patch Embedding to map node-specific granularity time series into token sequences of a unified dimension for processing by a subsequent shared model, and uses a global VE Table to align variable category semantics across nodes, thereby enhancing cross-node transferability. With a transformer-based shared model, PiXTime captures representations of auxiliary series with arbitrary numbers of variables and uses cross-attention to enhance the prediction of the target series. Experiments show PiXTime achieves state-of-the-art performance in federated settings and demonstrates superior performance on eight widely used real-world traditional benchmarks.
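The personalized patch-embedding idea (node-specific granularities mapped into tokens of one shared dimension) can be sketched as follows. The projection matrices here are illustrative placeholders, not PiXTime's learned parameters.

```python
def patch_tokens(series, patch_len, proj):
    """Split a series into non-overlapping patches and project each patch to a
    unified token dimension with a node-specific linear map `proj`
    (a patch_len x d_model weight matrix). Sketch of the idea only."""
    tokens = []
    for i in range(0, len(series) - patch_len + 1, patch_len):
        patch = series[i:i + patch_len]
        tokens.append([sum(p * w for p, w in zip(patch, col)) for col in zip(*proj)])
    return tokens

d_model = 4
hourly = list(range(48))               # node A: hourly sampling, patch = one day
daily = [float(v) for v in range(14)]  # node B: daily sampling, patch = one week

# Toy node-specific projections (identity-like placeholders).
proj_a = [[1.0 if j == i % d_model else 0.0 for j in range(d_model)] for i in range(24)]
proj_b = [[1.0 if j == i % d_model else 0.0 for j in range(d_model)] for i in range(7)]

tok_a = patch_tokens(hourly, 24, proj_a)
tok_b = patch_tokens(daily, 7, proj_b)
# Different granularities, same token dimension -> one shared model can consume both.
print(len(tok_a), len(tok_b), len(tok_a[0]), len(tok_b[0]))
```

Only the patch length and projection differ per node; everything downstream of the tokens can be shared, which is what makes the federated setup workable despite heterogeneous sampling.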


【4】Good Allocations from Bad Estimates
标题:从错误的估计中获得良好的分配
链接:https://arxiv.org/abs/2601.05597

作者:Sílvia Casacuberta,Moritz Hardt
摘要:条件平均处理效应(CATE)估计是针对异质性人群进行靶向治疗的事实上的金标准。该方法在人群的$M$个不同分层中将治疗效果估计到误差$ε > 0$以内,并按估计治疗效果的降序对个体进行治疗,直到预算耗尽。一般来说,这种方法需要$O(M/ε^2)$个样本。如果目标是将所有治疗效果估计到误差不超过$ε$,则这已是最优。在这项工作中,我们展示了对于自然的治疗效果分布,如何仅用$O(M/ε)$个样本实现与CATE相同的总治疗效果。关键的洞见是,粗略的估计就足以得到接近最优的治疗分配。此外,我们表明预算的灵活性可以进一步降低分配的样本复杂度。最后,我们在各种真实世界的RCT数据集上评估了我们的算法。在所有情况下,它都能用出奇少的样本找到近乎最优的治疗分配。我们的工作突出了治疗效果估计与治疗分配之间的根本区别:后者所需的样本要少得多。
摘要:Conditional average treatment effect (CATE) estimation is the de facto gold standard for targeting a treatment to a heterogeneous population. The method estimates treatment effects up to an error $ε> 0$ in each of $M$ different strata of the population, targeting individuals in decreasing order of estimated treatment effect until the budget runs out. In general, this method requires $O(M/ε^2)$ samples. This is best possible if the goal is to estimate all treatment effects up to an $ε$ error. In this work, we show how to achieve the same total treatment effect as CATE with only $O(M/ε)$ samples for natural distributions of treatment effects. The key insight is that coarse estimates suffice for near-optimal treatment allocations. In addition, we show that budget flexibility can further reduce the sample complexity of allocation. Finally, we evaluate our algorithm on various real-world RCT datasets. In all cases, it finds nearly optimal treatment allocations with surprisingly few samples. Our work highlights the fundamental distinction between treatment effect estimation and treatment allocation: the latter requires far fewer samples.
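The central observation, that coarse estimates suffice for near-optimal allocation, can be seen in a toy example (all numbers invented for illustration): even when per-stratum estimates carry errors large enough to swap the ranks of near-equal strata, the set of treated strata, and hence the total effect, barely changes.

```python
# Sketch of the estimation-vs-allocation distinction. Coarse estimates with
# error up to eps = 0.15 may swap the ranks of neighboring strata, yet the
# greedy allocation still matches the oracle's total effect here.

true_effects = {"s1": 0.90, "s2": 0.70, "s3": 0.50, "s4": 0.30, "s5": 0.10}
estimates    = {"s1": 0.80, "s2": 0.82, "s3": 0.40, "s4": 0.42, "s5": 0.05}

budget = 2  # we can treat two strata
chosen = sorted(estimates, key=estimates.get, reverse=True)[:budget]
achieved = sum(true_effects[s] for s in chosen)
oracle = sum(sorted(true_effects.values(), reverse=True)[:budget])
print(chosen, achieved, oracle)
```

Rank swaps only happen between strata with near-equal true effects, so swapping them costs little total effect; this is why an allocation can tolerate much coarser (hence cheaper) estimates than full CATE estimation requires.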


【5】Prediction of Fault Slip Tendency in CO${_2}$ Storage using Data-space Inversion
标题:利用数据空间反演预测CO${_2}$存储中的断层滑动趋势
链接:https://arxiv.org/abs/2601.05431

作者:Xiaowen He,Su Jiang,Louis J. Durlofsky
摘要:在许多地下作业中,准确评估断层滑动的可能性至关重要。传统的基于模型的历史匹配方法需要生成与观测数据校准的后验地质模型,在带断层的流动-地质力学耦合问题中应用起来可能具有挑战性。在这项工作中,我们实现了一个基于变分自动编码器(VAE)的数据空间反演(DSI)框架,用于预测CO${_2}$存储项目中的压力、应力和应变场以及断层滑动趋势。DSI工作流程所需的主要计算是对O(1000)个先验地质模型进行模拟。然后直接从先验模拟结果和观测数据推断感兴趣量的后验分布,而无需生成后验地质模型。这里使用的模型涉及一个带有两条断层的合成三维系统。非均质渗透率和孔隙度场的实现使用地质统计学软件生成,并且针对每个实现从先验分布中采样不确定的地质力学和断层参数。使用GEOS对这些地质模型进行耦合流动-地质力学模拟。利用先验模拟结果,训练一个带有堆叠卷积长短期记忆层的VAE,以用潜变量表示压力、应变、有效正应力和剪应力场。该VAE参数化与DSI一起用于后验预测,由监测井提供观测到的压力和应变数据。对合成真实模型的后验结果表明,DSI-VAE框架对压力、应变、应力场以及断层滑动趋势给出了准确的预测。该框架还被证明能降低关键地质力学和断层参数的不确定性。
摘要:Accurately assessing the potential for fault slip is essential in many subsurface operations. Conventional model-based history matching methods, which entail the generation of posterior geomodels calibrated to observed data, can be challenging to apply in coupled flow-geomechanics problems with faults. In this work, we implement a variational autoencoder (VAE)-based data-space inversion (DSI) framework to predict pressure, stress and strain fields, and fault slip tendency, in CO${_2}$ storage projects. The main computations required by the DSI workflow entail the simulation of O(1000) prior geomodels. The posterior distributions for quantities of interest are then inferred directly from prior simulation results and observed data, without the need to generate posterior geomodels. The model used here involves a synthetic 3D system with two faults. Realizations of heterogeneous permeability and porosity fields are generated using geostatistical software, and uncertain geomechanical and fault parameters are sampled for each realization from prior distributions. Coupled flow-geomechanics simulations for these geomodels are conducted using GEOS. A VAE with stacked convolutional long short-term memory layers is trained, using the prior simulation results, to represent pressure, strain, effective normal stress and shear stress fields in terms of latent variables. The VAE parameterization is used with DSI for posterior predictions, with monitoring wells providing observed pressure and strain data. Posterior results for synthetic true models demonstrate that the DSI-VAE framework gives accurate predictions for pressure, strain, and stress fields and for fault slip tendency. The framework is also shown to reduce uncertainty in key geomechanical and fault parameters.


【6】GlyRAG: Context-Aware Retrieval-Augmented Framework for Blood Glucose Forecasting
标题:GlyRAG:用于血糖预测的上下文感知检索增强框架
链接:https://arxiv.org/abs/2601.05353

作者:Shovito Barua Soumma,Hassan Ghasemzadeh
摘要:从CGM准确预测血糖对于预防血糖异常事件至关重要,从而实现积极的糖尿病管理。然而,当前的预测模型将使用CGM捕获的血糖读数视为数字序列,忽略上下文或依赖于难以大规模收集和部署的附加传感器/模态。最近,LLM已经显示出对时间序列预测任务的承诺,但它们在糖尿病护理中作为代理上下文提取器的作用在很大程度上仍未被探索。为了解决这些限制,我们提出了GlyRAG,一个上下文感知的,检索增强的预测框架,直接从CGM轨迹中获得血糖动态的语义理解,而不需要额外的传感器模式。GlyRAG采用LLM作为情境化代理来生成临床总结。这些摘要通过语言模型嵌入,并在多模态Transformer架构中与基于块的葡萄糖表示融合,其中交叉翻译损失对齐文本和生理嵌入。然后,检索模块在学习的嵌入空间中识别类似的历史事件,并在进行预测推理之前使用交叉注意来整合这些基于案例的类似物。对两个T1 D队列的广泛评估表明,GlyRAG始终优于最先进的方法,与基线相比,RMSE降低高达39%,RMSE进一步降低1.7%。临床评价显示,GlyRAG将85%的预测置于安全区,并在两个队列中预测血糖异常事件方面实现了51%的改善。这些结果表明,基于LLM的情境化和CGM轨迹检索可以提高长期血糖预测的准确性和临床可靠性,而无需额外的传感器,从而支持未来的糖尿病管理决策支持工具。
摘要:Accurate forecasting of blood glucose from CGM is essential for preventing dysglycemic events, thus enabling proactive diabetes management. However, current forecasting models treat blood glucose readings captured using CGMs as a numerical sequence, either ignoring context or relying on additional sensors/modalities that are difficult to collect and deploy at scale. Recently, LLMs have shown promise for time-series forecasting tasks, yet their role as agentic context extractors in diabetes care remains largely unexplored. To address these limitations, we propose GlyRAG, a context-aware, retrieval-augmented forecasting framework that derives semantic understanding of blood glucose dynamics directly from CGM traces without requiring additional sensor modalities. GlyRAG employs an LLM as a contextualization agent to generate clinical summaries. These summaries are embedded by a language model and fused with patch-based glucose representations in a multimodal transformer architecture with a cross translation loss aligning textual and physiological embeddings. A retrieval module then identifies similar historical episodes in the learned embedding space and uses cross-attention to integrate these case-based analogues prior to making a forecasting inference. Extensive evaluations on two T1D cohorts show that GlyRAG consistently outperforms state-of-the-art methods, achieving up to 39% lower RMSE and a further 1.7% reduction in RMSE over the baseline. Clinical evaluation shows that GlyRAG places 85% of predictions in safe zones and achieves 51% improvement in predicting dysglycemic events across both cohorts. These results indicate that LLM-based contextualization and retrieval over CGM traces can enhance the accuracy and clinical reliability of long-horizon glucose forecasting without the need for extra sensors, thus supporting future agentic decision-support tools for diabetes management.


【7】Machine learning assisted state prediction of misspecified linear dynamical system via modal reduction
标题:通过模式约简的机器学习辅助错误指定线性动力系统的状态预测
链接:https://arxiv.org/abs/2601.05297

作者:Rohan Vitthal Thorat,Rajdip Nayek
摘要:结构动力学的准确预测对于在整个使用寿命期间保持数字孪生保真度至关重要。具有固定标称参数的参数化模型通常由于几何、材料行为、阻尼或边界条件的简化而忽略关键物理效应,从而导致损害预测准确性的模型形式误差(MFE)。这项工作介绍了一个全面的框架MFE估计和校正在高维有限元(FE)为基础的结构动力系统。高斯过程潜力模型(GPLFM)在简化模态域中以非参数方式表示差异,允许对未建模动态进行灵活的数据驱动表征。线性贝叶斯滤波方法联合估计系统状态和差异,将认识和任意的不确定性。为了确保计算的易处理性,FE系统被投影到一个减少的模态基础上,和网格不变的神经网络映射模态状态的差异估计,允许在不同的FE离散化模型整流没有再训练。验证是在五个MFE的错误,包括不正确的梁理论,阻尼误指定,误指定的边界条件,未建模的材料非线性,和局部损伤证明代理模型的位移和旋转预测误差的大幅减少下看不见的激励。所提出的方法提供了一种潜在的手段,以维护数字孪生精度在固有的建模不确定性。
摘要:Accurate prediction of structural dynamics is imperative for preserving digital twin fidelity throughout operational lifetimes. Parametric models with fixed nominal parameters often omit critical physical effects due to simplifications in geometry, material behavior, damping, or boundary conditions, resulting in model form errors (MFEs) that impair predictive accuracy. This work introduces a comprehensive framework for MFE estimation and correction in high-dimensional finite element (FE) based structural dynamical systems. The Gaussian Process Latent Force Model (GPLFM) represents discrepancies non-parametrically in the reduced modal domain, allowing a flexible data-driven characterization of unmodeled dynamics. A linear Bayesian filtering approach jointly estimates system states and discrepancies, incorporating epistemic and aleatoric uncertainties. To ensure computational tractability, the FE system is projected onto a reduced modal basis, and a mesh-invariant neural network maps modal states to discrepancy estimates, permitting model rectification across different FE discretizations without retraining. Validation is undertaken across five MFE scenarios (incorrect beam theory, damping misspecification, misspecified boundary conditions, unmodeled material nonlinearity, and local damage), demonstrating the surrogate model's substantial reduction of displacement and rotation prediction errors under unseen excitations. The proposed methodology offers a potential means to uphold digital twin accuracy amid inherent modeling uncertainties.


其他神经网络|深度学习|模型|建模(21篇)

【1】On the Robustness of Age for Learning-Based Wireless Scheduling in Unknown Environments
标题:未知环境中基于学习的无线调度的时代稳健性
链接:https://arxiv.org/abs/2601.05956

作者:Juaren Steiger,Bin Li
备注:technical report of conference paper
摘要:约束组合多臂强盗模型已被广泛用于解决无线网络及相关领域的问题,包括未知信道条件下以吞吐量优化为目标的无线调度问题。该领域的大多数工作采用一种将强盗学习算法与虚拟队列技术相结合的算法设计策略,以跟踪吞吐量约束的违反程度。这些算法在其算法设计中寻求最小化虚拟队列长度。然而,在信道条件突变的网络中,由此产生的约束可能变得不可行,导致虚拟队列长度无限增长。在本文中,我们的关键观察是:队头年龄(head-of-line age,即虚拟队列中最老数据包的年龄)的动态特性使其在算法设计中比虚拟队列长度更加稳健。因此,我们设计了一种基于学习的调度策略,用队头年龄取代虚拟队列长度。我们证明了我们的策略在i.i.d.网络条件下与最先进的性能相匹配。至关重要的是,我们还表明,即使在信道条件突变的情况下系统仍保持稳定,并能从约束不可行的时期迅速恢复。
摘要:The constrained combinatorial multi-armed bandit model has been widely employed to solve problems in wireless networking and related areas, including the problem of wireless scheduling for throughput optimization under unknown channel conditions. Most work in this area uses an algorithm design strategy that combines a bandit learning algorithm with the virtual queue technique to track the throughput constraint violation. These algorithms seek to minimize the virtual queue length in their algorithm design. However, in networks where channel conditions change abruptly, the resulting constraints may become infeasible, leading to unbounded growth in virtual queue lengths. In this paper, we make the key observation that the dynamics of the head-of-line age, i.e. the age of the oldest packet in the virtual queue, make it more robust when used in algorithm design compared to the virtual queue length. We therefore design a learning-based scheduling policy that uses the head-of-line age in place of the virtual queue length. We show that our policy matches state-of-the-art performance under i.i.d. network conditions. Crucially, we also show that the system remains stable even under abrupt changes in channel conditions and can rapidly recover from periods of constraint infeasibility.
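The contrast between the two bookkeeping quantities can be seen in a small simulation (illustrative dynamics only, not the paper's scheduling policy): a burst of arrivals inflates the queue length by the burst size per slot, while the head-of-line age can grow by at most one per slot, which is the bounded-increment behavior the paper exploits.

```python
# Virtual queue length vs. head-of-line (HoL) age under a toy arrival process.

def simulate(arrivals, service):
    """arrivals[t]: packets arriving at slot t; service[t]: packets served."""
    queue, q_len, hol_age = [], [], []
    for t, (a, s) in enumerate(zip(arrivals, service)):
        queue.extend([t] * a)        # store arrival times
        del queue[:s]                # serve oldest packets first
        q_len.append(len(queue))
        hol_age.append(t - queue[0] if queue else 0)
    return q_len, hol_age

# Feasible period, then an infeasible burst (arrivals > service), then drain.
arrivals = [1] * 5 + [3] * 5 + [0] * 5
service = [1] * 15
q_len, hol_age = simulate(arrivals, service)
print(q_len)
print(hol_age)
```

During the burst the queue length jumps by two per slot, while the HoL age never increases by more than one per slot regardless of burst size; this bounded growth is what makes age-based bookkeeping more forgiving when constraints become temporarily infeasible.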


【2】Performance of a Deep Learning-Based Segmentation Model for Pancreatic Tumors on Public Endoscopic Ultrasound Datasets
标题:基于深度学习的胰腺肿瘤分割模型在公共内窥镜超声数据集中的性能
链接:https://arxiv.org/abs/2601.05937

作者:Pankaj Gupta,Priya Mudgil,Niharika Dutta,Kartik Bose,Nitish Kumar,Anupam Kumar,Jimil Shah,Vaneet Jearth,Jayanta Samanta,Vishal Sharma,Harshal Mandavdhare,Surinder Rana,Saroj K Sinha,Usha Dutta
摘要:背景:胰腺癌是最具侵袭性的癌症之一,生存率低。超声内镜(EUS)是一种重要的诊断方法,但其有效性受到操作者主观性的限制。本研究评估了一种基于Vision Transformer的胰腺肿瘤深度学习分割模型。研究方法:使用具有Vision Transformer主干的USFM框架的分割模型在5折交叉验证中使用17,367个EUS图像(来自两个公共数据集)进行训练和验证。该模型在来自另一个公共数据集的350张EUS图像的独立数据集上进行了测试,这些图像由放射科医生手动分割。预处理包括灰度转换、裁剪和缩放到512x512像素。评价指标包括Dice相似系数(DSC)、交并比(IoU)、敏感性、特异性和准确性。结果如下:在5折交叉验证中,该模型的平均DSC为0.651 +/- 0.738,IoU为0.579 +/- 0.658,灵敏度为69.8%,特异性为98.8%,准确性为97.5%。对于外部验证集,该模型的DSC为0.657(95% CI:0.634-0.769),IoU为0.614(95% CI:0.590-0.689),灵敏度为71.8%,特异性为97.7%。结果是一致的,但9.7%的情况下表现出错误的多个预测。结论:基于Vision Transformer的模型对EUS图像中的胰腺肿瘤分割表现出强大的性能。然而,数据集的异质性和有限的外部验证突出了进一步细化、标准化和前瞻性研究的必要性。
摘要:Background: Pancreatic cancer is one of the most aggressive cancers, with poor survival rates. Endoscopic ultrasound (EUS) is a key diagnostic modality, but its effectiveness is constrained by operator subjectivity. This study evaluates a Vision Transformer-based deep learning segmentation model for pancreatic tumors. Methods: A segmentation model using the USFM framework with a Vision Transformer backbone was trained and validated with 17,367 EUS images (from two public datasets) in 5-fold cross-validation. The model was tested on an independent dataset of 350 EUS images from another public dataset, manually segmented by radiologists. Preprocessing included grayscale conversion, cropping, and resizing to 512x512 pixels. Metrics included Dice similarity coefficient (DSC), intersection over union (IoU), sensitivity, specificity, and accuracy. Results: In 5-fold cross-validation, the model achieved a mean DSC of 0.651 +/- 0.738, IoU of 0.579 +/- 0.658, sensitivity of 69.8%, specificity of 98.8%, and accuracy of 97.5%. For the external validation set, the model achieved a DSC of 0.657 (95% CI: 0.634-0.769), IoU of 0.614 (95% CI: 0.590-0.689), sensitivity of 71.8%, and specificity of 97.7%. Results were consistent, but 9.7% of cases exhibited erroneous multiple predictions. Conclusions: The Vision Transformer-based model demonstrated strong performance for pancreatic tumor segmentation in EUS images. However, dataset heterogeneity and limited external validation highlight the need for further refinement, standardization, and prospective studies.
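For reference, the two overlap metrics reported in this study are straightforward to compute from binary masks; a minimal stdlib sketch with toy masks (not the study's data) follows.

```python
def dice_iou(pred, truth):
    """Dice similarity coefficient and intersection-over-union for binary
    segmentation masks given as flat 0/1 lists."""
    inter = sum(p & t for p, t in zip(pred, truth))
    p_sum, t_sum = sum(pred), sum(truth)
    union = p_sum + t_sum - inter
    dice = 2 * inter / (p_sum + t_sum) if p_sum + t_sum else 1.0
    iou = inter / union if union else 1.0
    return dice, iou

pred  = [0, 1, 1, 1, 0, 0]
truth = [0, 0, 1, 1, 1, 0]
dice, iou = dice_iou(pred, truth)
print(dice, iou)  # Dice = 2*2/(3+3) = 0.666..., IoU = 2/4 = 0.5
```

The two metrics are deterministically linked, IoU = Dice / (2 - Dice), so reporting both mainly aids comparison across papers that favor one or the other.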


【3】Can We Predict Before Executing Machine Learning Agents?
标题:我们可以在执行机器学习代理之前进行预测吗?
链接:https://arxiv.org/abs/2601.05930

作者:Jingsheng Zheng,Jintian Zhang,Yujie Luo,Yuren Mao,Yunjun Gao,Lun Du,Huajun Chen,Ningyu Zhang
备注:Work in progress
摘要:自主机器学习代理已经彻底改变了科学发现,但它们仍然受限于"生成-执行-反馈"范式。以前的方法存在严重的执行瓶颈,因为假设评估严格依赖于昂贵的物理执行。为了绕过这些物理约束,我们从世界模型(World Models)中汲取灵感,将执行先验内在化,以瞬时的预测推理代替昂贵的运行时检查。在这项工作中,我们形式化了"以数据为中心的解决方案偏好"这一任务,并构建了一个包含18,438个成对比较的综合语料库。我们证明,当以经验证的数据分析报告作为提示时,LLM表现出显著的预测能力,达到61.5%的准确率和稳健的置信度校准。最后,我们在FOREAGENT中实例化了这一框架,这是一个采用"预测-再验证"循环的代理,在收敛上实现了6倍加速,同时比基于执行的基线高出6%。我们的代码和数据集将很快在https://github.com/zjunlp/predict-before-execute上公开。
摘要:Autonomous machine learning agents have revolutionized scientific discovery, yet they remain constrained by a Generate-Execute-Feedback paradigm. Previous approaches suffer from a severe Execution Bottleneck, as hypothesis evaluation relies strictly on expensive physical execution. To bypass these physical constraints, we internalize execution priors to substitute costly runtime checks with instantaneous predictive reasoning, drawing inspiration from World Models. In this work, we formalize the task of Data-centric Solution Preference and construct a comprehensive corpus of 18,438 pairwise comparisons. We demonstrate that LLMs exhibit significant predictive capabilities when primed with a Verified Data Analysis Report, achieving 61.5% accuracy and robust confidence calibration. Finally, we instantiate this framework in FOREAGENT, an agent that employs a Predict-then-Verify loop, achieving a 6x acceleration in convergence while surpassing execution-based baselines by +6%. Our code and dataset will be publicly available soon at https://github.com/zjunlp/predict-before-execute.


【4】Auditing Fairness under Model Updates: Fundamental Complexity and Property-Preserving Updates
标题:模型更新下的审计公平性:基本复杂性和属性保留更新
链接:https://arxiv.org/abs/2601.05909

作者:Ayoub Ajarra,Debabrota Basu
摘要:随着机器学习模型越来越多地嵌入社会基础设施,对其偏见进行审计变得日益重要。然而,在现实部署中,审计颇为复杂,因为模型所有者可能会响应不断变化的环境(如金融市场)而自适应地更新模型。这些更新可能改变底层模型类,同时保留某些感兴趣的属性,从而引出一个基本问题:在此类变化下,哪些内容仍可被可靠地审计。   在这项工作中,我们研究任意更新下的群体公平性审计。我们考虑在保持被审计属性不变的前提下修改预审计模型类的一般变化。我们的目标有两方面:(i)通过识别哪些策略性变化保留了被审计属性,来刻画可容许更新的信息复杂度;(ii)使用尽可能少的标注样本高效地估计群体公平性等审计属性。   我们提出了一个基于经验属性优化(EPO)预言机(oracle)的通用PAC审计框架。对于统计均等(statistical parity),我们建立了由SP维刻画的无分布审计界;SP维是一种刻画可容许策略性更新复杂度的新型组合度量。最后,我们证明该框架可自然推广到其他审计目标,包括预测误差和鲁棒风险。
摘要:As machine learning models become increasingly embedded in societal infrastructure, auditing them for bias is of growing importance. However, in real-world deployments, auditing is complicated by the fact that model owners may adaptively update their models in response to changing environments, such as financial markets. These updates can alter the underlying model class while preserving certain properties of interest, raising fundamental questions about what can be reliably audited under such shifts.   In this work, we study group fairness auditing under arbitrary updates. We consider general shifts that modify the pre-audit model class while maintaining invariance of the audited property. Our goals are two-fold: (i) to characterize the information complexity of allowable updates, by identifying which strategic changes preserve the property under audit; and (ii) to efficiently estimate auditing properties, such as group fairness, using a minimal number of labeled samples.   We propose a generic framework for PAC auditing based on an Empirical Property Optimization (EPO) oracle. For statistical parity, we establish distribution-free auditing bounds characterized by the SP dimension, a novel combinatorial measure that captures the complexity of admissible strategic updates. Finally, we demonstrate that our framework naturally extends to other auditing objectives, including prediction error and robust risk.
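被审计的统计均等(statistical parity)本身只是两组间正预测率之差,可用少量带标注样本直接估计。以下极简示意仅说明被估计的量,与论文的EPO框架及SP维分析无关(函数名为笔者假设):

```python
import numpy as np

def statistical_parity_gap(y_pred, group):
    """Empirical |P(y_hat=1 | group=0) - P(y_hat=1 | group=1)| from samples."""
    y_pred = np.asarray(y_pred, dtype=float)
    group = np.asarray(group)
    p0 = y_pred[group == 0].mean()  # positive rate in group 0
    p1 = y_pred[group == 1].mean()  # positive rate in group 1
    return abs(p0 - p1)
```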


【5】GlueNN: gluing patchwise analytic solutions with neural networks
标题:GlueNN:将分片解析解与神经网络粘合起来
链接:https://arxiv.org/abs/2601.05889

作者:Doyoung Kim,Donghee Lee,Hye-Sung Lee,Jiheon Lee,Jaeok Yi
备注:7 pages, 3 figures
摘要:在物理和工程的许多问题中,人们会遇到带有强尺度依赖项的复杂微分方程,无法获得精确的解析解或数值解。一个常见的策略是将求解域划分为若干区域(patch),并在每个区域内简化方程;当每个区域内可以得到近似解析解时,再在交界面处将它们拼接以构造全局解。然而,这种拼接过程可能无法重现正确的解,因为近似形式在拼接边界附近可能失效。在这项工作中,我们提出了一个学习框架,将渐近解析解中的积分常数提升为依赖于尺度的函数。通过用整个求解域上的原始微分方程约束这些系数函数,网络学习到在各渐近区之间平滑插值的全局有效解,从而无需任意的边界拼接。我们在化学动力学和宇宙学的代表性问题上验证了该框架的有效性:它能准确重现全局解,并优于传统的拼接匹配方法。
摘要:In many problems in physics and engineering, one encounters complicated differential equations with strongly scale-dependent terms for which exact analytical or numerical solutions are not available. A common strategy is to divide the domain into several regions (patches) and simplify the equation in each region. When approximate analytic solutions can be obtained in each patch, they are then matched at the interfaces to construct a global solution. However, this patching procedure can fail to reproduce the correct solution, since the approximate forms may break down near the matching boundaries. In this work, we propose a learning framework in which the integration constants of asymptotic analytic solutions are promoted to scale-dependent functions. By constraining these coefficient functions with the original differential equation over the domain, the network learns a globally valid solution that smoothly interpolates between asymptotic regimes, eliminating the need for arbitrary boundary matching. We demonstrate the effectiveness of this framework in representative problems from chemical kinetics and cosmology, where it accurately reproduces global solutions and outperforms conventional matching procedures.
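作为对"拼接"思路的直观示意(并非论文中将积分常数提升为尺度相关函数的方法本身),下面用一个固定的光滑权重把内区与外区的渐近解连接起来;u_inner、u_outer与切换点x0均为笔者假设的玩具示例:

```python
import numpy as np

def glue(u_inner, u_outer, x, x0=1.0, width=0.1):
    """Smoothly interpolate two patchwise asymptotic solutions.
    w(x) -> 1 for x << x0 (inner regime), w(x) -> 0 for x >> x0;
    a tanh-based logistic weight keeps the blend numerically stable."""
    w = 0.5 * (1.0 - np.tanh((x - x0) / (2.0 * width)))
    return w * u_inner(x) + (1.0 - w) * u_outer(x)
```

论文的做法则是让这类"权重/常数"本身成为由微分方程残差约束训练的函数。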


【6】CLewR: Curriculum Learning with Restarts for Machine Translation Preference Learning
标题:CLewR:用于机器翻译偏好学习的带重启课程学习
链接:https://arxiv.org/abs/2601.05858

作者:Alexandra Dragomir,Florin Brad,Radu Tudor Ionescu
摘要:大型语言模型(LLM)在zero-shot多语言机器翻译(MT)中表现出了具有竞争力的性能。一些后续工作通过偏好优化进一步提升了MT性能,但基本忽略了一个关键方面:训练过程中数据样本的呈现顺序。我们通过将课程学习集成到多种最先进的偏好优化算法中来解决这一问题,以提升MT性能。我们提出了一种新的带重启的课程学习策略(CLewR),它在训练过程中多次重复由易到难的课程,以有效缓解对简单样例的灾难性遗忘。我们在多个模型家族(Gemma2、Qwen2.5、Llama3.1)和多种偏好优化技术上展示了一致的增益。我们在https://github.com/alexandra-dragomir/CLewR公开发布代码。
摘要:Large language models (LLMs) have demonstrated competitive performance in zero-shot multilingual machine translation (MT). Some follow-up works further improved MT performance via preference optimization, but they leave a key aspect largely underexplored: the order in which data samples are given during training. We address this topic by integrating curriculum learning into various state-of-the-art preference optimization algorithms to boost MT performance. We introduce a novel curriculum learning strategy with restarts (CLewR), which reiterates easy-to-hard curriculum multiple times during training to effectively mitigate the catastrophic forgetting of easy examples. We demonstrate consistent gains across several model families (Gemma2, Qwen2.5, Llama3.1) and preference optimization techniques. We publicly release our code at https://github.com/alexandra-dragomir/CLewR.
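CLewR的核心调度思想(由易到难排序、并多次重启课程)可以用如下极简示意表达;difficulty打分方式在论文之外,此处作为输入假设给出:

```python
def curriculum_with_restarts(samples, difficulty, n_cycles=3):
    """Order training samples easy-to-hard, then repeat the whole
    curriculum n_cycles times so easy examples are revisited
    (mitigating catastrophic forgetting of easy cases)."""
    ranked = [s for s, _ in sorted(zip(samples, difficulty), key=lambda p: p[1])]
    schedule = []
    for _ in range(n_cycles):
        schedule.extend(ranked)
    return schedule
```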


【7】A Dual Pipeline Machine Learning Framework for Automated Multi Class Sleep Disorder Screening Using Hybrid Resampling and Ensemble Learning
标题:使用混合重采样与集成学习进行自动多类睡眠障碍筛查的双管道机器学习框架
链接:https://arxiv.org/abs/2601.05814

作者:Md Sultanul Islam Ovi,Muhsina Tarannum Munfa,Miftahul Alam Adib,Syed Sabbir Hasan
备注:32 pages, 5 figures, 14 tables
摘要:睡眠障碍(特别是失眠和睡眠呼吸暂停)的准确分类,对于降低长期健康风险和提高患者生活质量十分重要。然而,临床睡眠研究资源密集,难以扩展到人群水平的筛查。本文基于Sleep Health and Lifestyle数据集,提出了一个用于多类睡眠障碍筛查的双管道机器学习框架。该框架由两个并行处理流组成:一个统计管道,使用互信息和线性判别分析以实现线性可分性;另一个基于包装器的管道,应用Boruta特征选择并结合自动编码器进行非线性表示学习。为解决类别不平衡,我们使用混合SMOTETomek重采样策略。实验中,Extra Trees和K近邻的准确率达到98.67%,超过了同一数据集上的最新基线。使用Wilcoxon符号秩检验的统计检验表明,相对于基线配置的改进是显著的,且推理延迟保持在400毫秒以下。这些结果表明,所提出的双管道设计能够支持准确、高效的无创睡眠障碍风险分层自动筛查。
摘要:Accurate classification of sleep disorders, particularly insomnia and sleep apnea, is important for reducing long term health risks and improving patient quality of life. However, clinical sleep studies are resource intensive and are difficult to scale for population level screening. This paper presents a Dual Pipeline Machine Learning Framework for multi class sleep disorder screening using the Sleep Health and Lifestyle dataset. The framework consists of two parallel processing streams: a statistical pipeline that targets linear separability using Mutual Information and Linear Discriminant Analysis, and a wrapper based pipeline that applies Boruta feature selection with an autoencoder for non linear representation learning. To address class imbalance, we use the hybrid SMOTETomek resampling strategy. In experiments, Extra Trees and K Nearest Neighbors achieved an accuracy of 98.67%, outperforming recent baselines on the same dataset. Statistical testing using the Wilcoxon Signed Rank Test indicates that the improvement over baseline configurations is significant, and inference latency remains below 400 milliseconds. These results suggest that the proposed dual pipeline design supports accurate and efficient automated screening for non invasive sleep disorder risk stratification.
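文中处理类别不平衡的SMOTETomek来自imbalanced-learn库(imblearn.combine.SMOTETomek)。其SMOTE过采样部分的思想可用如下numpy示意(简化版本,仅为说明原理,并非该库实现):

```python
import numpy as np

def smote_oversample(X_min, n_new, k=3, rng=None):
    """Minimal SMOTE-style oversampling: each synthetic point is an
    interpolation between a random minority sample and one of its
    k nearest minority-class neighbours."""
    rng = np.random.default_rng(rng)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    k = min(k, n - 1)
    # pairwise distances within the minority class; exclude self
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    nbrs = np.argsort(d, axis=1)[:, :k]
    out = []
    for _ in range(n_new):
        i = rng.integers(n)
        j = nbrs[i, rng.integers(k)]
        lam = rng.random()  # interpolation coefficient in [0, 1)
        out.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.vstack(out)
```

SMOTETomek在此基础上再用Tomek links清理过采样后靠近类边界的样本对。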


【8】Learning Reconstructive Embeddings in Reproducing Kernel Hilbert Spaces via the Representer Theorem
标题:通过表示定理学习再生核希尔伯特空间中的重构嵌入
链接:https://arxiv.org/abs/2601.05811

作者:Enrique Feito-Casares,Francisco M. Melgarejo-Meseguer,José-Luis Rojo-Álvarez
摘要:由于人们对揭示高维数据潜在结构的表示学习方法越来越感兴趣,这项工作提出了在再生核希尔伯特空间(RKHS)内进行基于重构的流形学习的新算法。每个观测首先被重建为RKHS中其他样本的线性组合,通过优化表示定理的向量形式来实现其自动表示属性。一个可分离的运营商值的内核扩展到向量值数据的制定,同时保持一个单一的标量相似性函数的简单性。随后的内核对齐任务将数据投影到低维潜在空间中,其Gram矩阵旨在匹配高维重建内核,从而将RKHS的自动重建几何结构转移到嵌入中。因此,所提出的算法代表了一种扩展的方法的自动表示属性,表现出许多自然数据,通过使用和适应核学习理论的知名成果。模拟(同心圆和瑞士卷)和真实(癌症分子活动和物联网网络入侵)数据集的数值实验为所提出方法的实际有效性提供了经验证据。
摘要:Motivated by the growing interest in representation learning approaches that uncover the latent structure of high-dimensional data, this work proposes new algorithms for reconstruction-based manifold learning within Reproducing-Kernel Hilbert Spaces (RKHS). Each observation is first reconstructed as a linear combination of the other samples in the RKHS, by optimizing a vector form of the Representer Theorem for their autorepresentation property. A separable operator-valued kernel extends the formulation to vector-valued data while retaining the simplicity of a single scalar similarity function. A subsequent kernel-alignment task projects the data into a lower-dimensional latent space whose Gram matrix aims to match the high-dimensional reconstruction kernel, thus transferring the auto-reconstruction geometry of the RKHS to the embedding. Therefore, the proposed algorithms represent an extended approach to the autorepresentation property, exhibited by many natural data, by using and adapting well-known results of Kernel Learning Theory. Numerical experiments on both simulated (concentric circles and swiss-roll) and real (cancer molecular activity and IoT network intrusions) datasets provide empirical evidence of the practical effectiveness of the proposed approach.
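论文中"把每个观测重构为RKHS中其他样本的线性组合"这一步,在给定Gram矩阵时有闭式解:最小化RKHS范数下的重构误差等价于求解一个线性方程组。下面是一个小型numpy示意(RBF核与ridge稳定项均为笔者假设):

```python
import numpy as np

def rbf_gram(X, gamma=1.0):
    """Gram matrix of the RBF kernel k(x, x') = exp(-gamma * ||x - x'||^2)."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def self_representation(K, ridge=1e-6):
    """For each sample i, coefficients c minimizing
    ||phi(x_i) - sum_{j != i} c_j phi(x_j)||^2 in the RKHS,
    solved in closed form from the Gram matrix K."""
    n = K.shape[0]
    C = np.zeros((n, n))
    for i in range(n):
        idx = [j for j in range(n) if j != i]
        Kii = K[np.ix_(idx, idx)] + ridge * np.eye(n - 1)
        C[i, idx] = np.linalg.solve(Kii, K[idx, i])
    return C
```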


【9】GenCtrl -- A Formal Controllability Toolkit for Generative Models
标题:GenCtrl--生成模型的形式化可控性工具包
链接:https://arxiv.org/abs/2601.05637

作者:Emily Cheng,Carmen Amo Alonso,Federico Danieli,Arno Blaas,Luca Zappella,Pau Rodriguez,Xavier Suau
摘要:随着生成模型变得无处不在,对生成过程进行细粒度控制的需求日益迫切。然而,尽管从提示(prompting)到微调的可控生成方法层出不穷,一个基本问题仍未得到回答:这些模型究竟是否可控?在这项工作中,我们提供了一个形式化回答该问题的理论框架。通过将人与模型的交互建模为一个控制过程,我们提出了一种新算法,用于在对话设置下估计模型的可控集。值得注意的是,我们给出了估计误差关于样本复杂度的形式化保证:我们导出了可控集估计的无分布"可能近似正确"(PAC)界,除输出有界性外不依赖任何假设,并适用于任何黑盒非线性控制系统(即任何生成模型)。我们在语言模型和文本到图像生成的多种对话过程控制任务上对该理论框架进行了实证验证。结果表明,模型的可控性出人意料地脆弱,且高度依赖于实验设置。这凸显了严格可控性分析的必要性:应将重点从单纯尝试控制转向首先理解其基本极限。
摘要:As generative models become ubiquitous, there is a critical need for fine-grained control over the generation process. Yet, while controlled generation methods from prompting to fine-tuning proliferate, a fundamental question remains unanswered: are these models truly controllable in the first place? In this work, we provide a theoretical framework to formally answer this question. Framing human-model interaction as a control process, we propose a novel algorithm to estimate the controllable sets of models in a dialogue setting. Notably, we provide formal guarantees on the estimation error as a function of sample complexity: we derive probably-approximately correct bounds for controllable set estimates that are distribution-free, employ no assumptions except for output boundedness, and work for any black-box nonlinear control system (i.e., any generative model). We empirically demonstrate the theoretical framework on different tasks in controlling dialogue processes, for both language models and text-to-image generation. Our results show that model controllability is surprisingly fragile and highly dependent on the experimental setting. This highlights the need for rigorous controllability analysis, shifting the focus from simply attempting control to first understanding its fundamental limits.


【10】Toward an Integrated Cross-Urban Accident Prevention System: A Multi-Task Spatial-Temporal Learning Framework for Urban Safety Management
标题:迈向综合跨城市事故预防系统:城市安全管理的多任务时空学习框架
链接:https://arxiv.org/abs/2601.05521

作者:Jiayu Fang,Zhiqi Shao,Haoning Xi,Boris Choy,Junbin Gao
备注:38pages, 18figures
摘要:由于城市事故数据的异质性、不一致的报告以及固有的聚集、稀疏、周期性和噪声特性,跨城市事故预防系统的开发尤其具有挑战性。这些固有的数据属性,加上分散的治理和不兼容的报告标准,长期以来一直阻碍着综合性跨城市事故预防框架的建立。为弥补这一空白,我们提出了Mamba局部注意力时空网络(MLA-STNet),一个将事故风险预测表述为跨多个城市多任务学习问题的统一系统。MLA-STNet集成了两个互补模块:(i)时空地理Mamba注意力(STG-MA),抑制不稳定的时空波动并加强长程时间依赖性;(ii)时空语义Mamba注意力(STS-MA),通过共享参数设计联合训练所有城市、同时保留各自的语义表示空间,从而缓解跨城市异质性。我们使用来自纽约市和芝加哥的真实数据集,在全天和高频事故时段两种预测场景下,通过75组实验验证了所提框架。与最先进的基线相比,MLA-STNet的RMSE最多降低6%,召回率提高8%,MAP提高5%,同时在50%输入噪声下性能变化保持在1%以内。这些结果表明,MLA-STNet有效地将异构城市数据集统一到一个可扩展、鲁棒且可解释的跨城市事故预防系统中,为协同的数据驱动城市安全管理铺平了道路。
摘要:The development of a cross-city accident prevention system is particularly challenging due to the heterogeneity, inconsistent reporting, and inherently clustered, sparse, cyclical, and noisy nature of urban accident data. These intrinsic data properties, combined with fragmented governance and incompatible reporting standards, have long hindered the creation of an integrated, cross-city accident prevention framework. To address this gap, we propose the Mamba Local-Attention Spatial-Temporal Network (MLA-STNet), a unified system that formulates accident risk prediction as a multi-task learning problem across multiple cities. MLA-STNet integrates two complementary modules: (i) the Spatio-Temporal Geographical Mamba-Attention (STG-MA), which suppresses unstable spatio-temporal fluctuations and strengthens long-range temporal dependencies; and (ii) the Spatio-Temporal Semantic Mamba-Attention (STS-MA), which mitigates cross-city heterogeneity through a shared-parameter design that jointly trains all cities while preserving individual semantic representation spaces. We validate the proposed framework through 75 experiments under two forecasting scenarios, full-day and high-frequency accident periods, using real-world datasets from New York City and Chicago. Compared with the state-of-the-art baselines, MLA-STNet achieves up to 6% lower RMSE, 8% higher Recall, and 5% higher MAP, while maintaining less than 1% performance variation under 50% input noise. These results demonstrate that MLA-STNet effectively unifies heterogeneous urban datasets within a scalable, robust, and interpretable Cross-City Accident Prevention System, paving the way for coordinated and data-driven urban safety management.


【11】Efficient Differentiable Causal Discovery via Reliable Super-Structure Learning
标题:基于可靠超结构学习的高效可微因果发现
链接 :https://arxiv.org/abs/2601.05474

作者:Pingchuan Ma,Qixin Zhang,Shuai Wang,Dacheng Tao
摘要:近来,可微因果发现已成为提升现有方法准确性与效率的一条有前途的路径。然而,当应用于高维数据或含潜在混杂因素的数据时,这些通常基于现成连续优化算法的方法,难以应对庞大的搜索空间、目标函数的复杂性以及图论约束的非平凡性。因此,利用超结构来引导优化过程的研究兴趣激增。尽管如此,如何在合适的粒度上学习适当的超结构,并在各种设置下高效地完成这一点,仍是重大挑战。   在本文中,我们提出ALVGL,一种对可微因果发现管道的新型通用增强。ALVGL采用稀疏加低秩分解来学习数据的精度矩阵,并设计了一个ADMM过程来优化该分解,从精度矩阵中识别与底层因果结构最相关的成分。随后将这些成分组合起来,构建一个可证明包含真实因果图的超结构。该超结构用于初始化搜索空间更集中的标准可微因果发现方法,从而同时提高优化效率与准确性。   我们通过在一系列结构因果模型上实例化ALVGL来展示其通用性,涵盖高斯与非高斯设置、有无不可测混杂因素等情形。在合成与真实数据集上的大量实验表明,ALVGL不仅达到了最先进的精度,还显著提升了优化效率,是可微因果发现的可靠而有效的解决方案。
摘要:Recently, differentiable causal discovery has emerged as a promising approach to improve the accuracy and efficiency of existing methods. However, when applied to high-dimensional data or data with latent confounders, these methods, often based on off-the-shelf continuous optimization algorithms, struggle with the vast search space, the complexity of the objective function, and the nontrivial nature of graph-theoretical constraints. As a result, there has been a surge of interest in leveraging super-structures to guide the optimization process. Nonetheless, learning an appropriate super-structure at the right level of granularity, and doing so efficiently across various settings, presents significant challenges.   In this paper, we propose ALVGL, a novel and general enhancement to the differentiable causal discovery pipeline. ALVGL employs a sparse and low-rank decomposition to learn the precision matrix of the data. We design an ADMM procedure to optimize this decomposition, identifying components in the precision matrix that are most relevant to the underlying causal structure. These components are then combined to construct a super-structure that is provably a superset of the true causal graph. This super-structure is used to initialize a standard differentiable causal discovery method with a more focused search space, thereby improving both optimization efficiency and accuracy.   We demonstrate the versatility of ALVGL by instantiating it across a range of structural causal models, including both Gaussian and non-Gaussian settings, with and without unmeasured confounders. Extensive experiments on synthetic and real-world datasets show that ALVGL not only achieves state-of-the-art accuracy but also significantly improves optimization efficiency, making it a reliable and effective solution for differentiable causal discovery.
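ALVGL通过稀疏加低秩的ADMM分解学习精度矩阵。作为一个大为简化的示意(不含低秩项与ADMM,阈值等参数为笔者假设),下面用精度矩阵的偏相关阈值化得到一个候选超结构:

```python
import numpy as np

def superstructure_from_precision(X, thresh=0.1, ridge=1e-3):
    """Simplified sketch: estimate the precision matrix and threshold its
    partial correlations to obtain a candidate super-structure (a superset
    of edges to hand to a differentiable causal-discovery method)."""
    Xc = X - X.mean(0)
    cov = Xc.T @ Xc / len(X) + ridge * np.eye(X.shape[1])
    prec = np.linalg.inv(cov)
    # normalize to partial correlations for a scale-free threshold
    d = np.sqrt(np.diag(prec))
    pcorr = -prec / np.outer(d, d)
    np.fill_diagonal(pcorr, 0.0)
    return (np.abs(pcorr) > thresh).astype(int)
```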


【12】Inverting Non-Injective Functions with Twin Neural Network Regression
标题:用双神经网络回归反求非内射函数
链接:https://arxiv.org/abs/2601.05378

作者:Sebastian J. Wetzel
摘要:非内射函数不可逆。然而,若将非内射函数限制到其局部内射且满射的子域上,当输入与输出空间维数相同时,它在这些子域上便是可逆的。此外,即使维数不匹配,也往往可以从众多可能解中选出一个偏好解。双神经网络回归天然能够利用这些性质来反求非内射函数:它被训练用于预测对已知输入变量$\mathbf{x}^{\text{anchor}}$的调整量,从而在目标变量从$\mathbf{y}^{\text{anchor}}$变为$\mathbf{y}^{\text{new}}$时得到未知$\mathbf{x}^{\text{new}}$的估计。结合k近邻搜索,我提出了一个确定性框架,为非内射函数的给定目标变量寻找输入参数。该方法通过反求描述玩具问题和机械臂控制的非内射函数得到验证,这些函数或(a)由数据定义,或(b)以数学公式形式已知。
摘要:Non-injective functions are not invertible. However, non-injective functions can be restricted to sub-domains on which they are locally injective and surjective and thus invertible if the dimensionality between input and output spaces are the same. Further, even if the dimensionalities do not match it is often possible to choose a preferred solution from many possible solutions. Twin neural network regression is naturally capable of incorporating these properties to invert non-injective functions. Twin neural network regression is trained to predict adjustments to well known input variables $\mathbf{x}^{\text{anchor}}$ to obtain an estimate for an unknown $\mathbf{x}^{\text{new}}$ under a change of the target variable from $\mathbf{y}^{\text{anchor}}$ to $\mathbf{y}^{\text{new}}$. In combination with k-nearest neighbor search, I propose a deterministic framework that finds input parameters to a given target variable of non-injective functions. The method is demonstrated by inverting non-injective functions describing toy problems and robot arm control that are a) defined by data or b) known as mathematical formula.


【13】The Kernel Manifold: A Geometric Approach to Gaussian Process Model Selection
标题:核流形:高斯过程模型选择的几何方法
链接:https://arxiv.org/abs/2601.05371

作者:Md Shafiqul Islam,Shakti Prasad Padhy,Douglas Allaire,Raymundo Arróyave
摘要:高斯过程(GP)回归是一个强大的非参数贝叶斯框架,但其性能关键取决于协方差核的选择。因此,选择合适的核是模型质量的核心,也仍是概率建模中最具挑战性、计算成本最高的步骤之一。我们提出了一个建立在"核中核"(kernel-of-kernels)几何之上的贝叶斯优化框架,利用GP先验之间基于期望散度的距离来高效探索核空间。对该距离矩阵进行多维缩放(MDS)嵌入,可将离散的核库映射到连续的欧氏流形中,从而实现平滑的贝叶斯优化(BO)。在该表述中,输入空间由核的组合构成,目标为对数边际似然,特征化由MDS坐标给出。当所用散度构成有效度量时,嵌入保留几何结构并产生稳定的BO景观。我们在合成基准、真实时间序列数据集以及预测熔池几何形状的增材制造案例研究上演示了该方法,相对于包括大语言模型(LLM)引导搜索在内的基线,取得了更优的预测精度和不确定性校准。该框架为核搜索建立了可复用的概率几何结构,与GP建模和深度核学习直接相关。
摘要:Gaussian Process (GP) regression is a powerful nonparametric Bayesian framework, but its performance depends critically on the choice of covariance kernel. Selecting an appropriate kernel is therefore central to model quality, yet remains one of the most challenging and computationally expensive steps in probabilistic modeling. We present a Bayesian optimization framework built on kernel-of-kernels geometry, using expected divergence-based distances between GP priors to explore kernel space efficiently. A multidimensional scaling (MDS) embedding of this distance matrix maps a discrete kernel library into a continuous Euclidean manifold, enabling smooth BO. In this formulation, the input space comprises kernel compositions, the objective is the log marginal likelihood, and featurization is given by the MDS coordinates. When the divergence yields a valid metric, the embedding preserves geometry and produces a stable BO landscape. We demonstrate the approach on synthetic benchmarks, real-world time-series datasets, and an additive manufacturing case study predicting melt-pool geometry, achieving superior predictive accuracy and uncertainty calibration relative to baselines including Large Language Model (LLM)-guided search. This framework establishes a reusable probabilistic geometry for kernel search, with direct relevance to GP modeling and deep kernel learning.
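把核间距离矩阵嵌入连续欧氏流形所用的经典MDS(双中心化加特征分解)可示意如下;此处仅演示MDS本身,距离矩阵为笔者假设的玩具输入:

```python
import numpy as np

def classical_mds(D, dim=2):
    """Embed a symmetric distance matrix into R^dim via double
    centering and eigendecomposition (classical MDS)."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (D ** 2) @ J          # double-centered Gram matrix
    w, V = np.linalg.eigh(B)             # ascending eigenvalues
    order = np.argsort(w)[::-1][:dim]    # keep the top-dim components
    L = np.sqrt(np.clip(w[order], 0.0, None))
    return V[:, order] * L
```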


【14】Ontology Neural Networks for Topologically Conditioned Constraint Satisfaction
标题:用于拓扑条件化约束满足的本体神经网络
链接:https://arxiv.org/abs/2601.05304

作者:Jaehong Oh
备注:12 pages, 11 figures
摘要:神经符号推理系统面临一个根本挑战:在满足物理与逻辑约束的同时保持语义一致性。基于我们此前在本体神经网络上的工作,我们提出了一个将拓扑条件化与梯度稳定机制相结合的增强框架。该方法采用Forman-Ricci曲率刻画图拓扑,用深度Delta学习在约束投影期间进行稳定的秩1扰动,并用协方差矩阵自适应进化策略(CMA-ES)进行参数优化。跨多种问题规模的实验评估表明,该方法将平均能量从基线值11.68降至1.15,在约束满足任务中成功率达95%。该框架表现出与随机种子无关的收敛性,并在多达20个节点的问题上保持良好的扩展行为,表明拓扑结构可以在不牺牲可解释性或计算效率的前提下为基于梯度的优化提供指导。
摘要:Neuro-symbolic reasoning systems face fundamental challenges in maintaining semantic coherence while satisfying physical and logical constraints. Building upon our previous work on Ontology Neural Networks, we present an enhanced framework that integrates topological conditioning with gradient stabilization mechanisms. The approach employs Forman-Ricci curvature to capture graph topology, Deep Delta Learning for stable rank-one perturbations during constraint projection, and Covariance Matrix Adaptation Evolution Strategy for parameter optimization. Experimental evaluation across multiple problem sizes demonstrates that the method achieves mean energy reduction to 1.15 compared to baseline values of 11.68, with 95 percent success rate in constraint satisfaction tasks. The framework exhibits seed-independent convergence and graceful scaling behavior up to twenty-node problems, suggesting that topological structure can inform gradient-based optimization without sacrificing interpretability or computational efficiency.
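文中用Forman-Ricci曲率刻画图拓扑。对无权图,其最简单(不含增广高阶项)的边曲率形式为F(u,v)=4−deg(u)−deg(v);以下为一个示意实现(论文可能采用带权或增广版本):

```python
def forman_curvature(edges):
    """Forman-Ricci curvature of each edge in an unweighted graph,
    using the simple (non-augmented) form F(u, v) = 4 - deg(u) - deg(v)."""
    deg = {}
    for u, v in edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    return {(u, v): 4 - deg[u] - deg[v] for u, v in edges}
```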


【15】Studying Illustrations in Manuscripts: An Efficient Deep-Learning Approach
标题:研究手稿中的插图:一种高效的深度学习方法
链接:https://arxiv.org/abs/2601.05269

作者:Yoav Evron,Michal Bar-Asher Siegal,Michael Fire
备注:14 pages, 5 figures
摘要:最近的人工智能(AI)革命为人文学科开辟了变革的可能性,特别是在解锁嵌入在历史手稿中的视觉内容方面。虽然数字档案馆现在提供了前所未有的访问这些材料,但系统地研究大规模插图的能力仍然具有挑战性。我们的研究提出了一种快速和可扩展的AI方法,用于检测,提取和描述数字化手稿中的插图。我们的系统专注于梵蒂冈图书馆等馆藏,能够对数百万页进行高效的可视化分析。我们的管道由三个阶段组成:(1)经过微调的图像分类模型过滤出纯文本页面;(2)高效的对象检测模型识别和裁剪插图;(3)多模式图像字幕模型生成简洁,人类可读的描述。这些都存储在一个可搜索的数据库中,使学者能够通过关键字查询检索相关的视觉材料。通过利用最近人工智能进步的力量,我们实现了以前不切实际的大规模视觉研究,使历史研究,艺术史和文化遗产的学者能够以新的精度和速度探索视觉主题,艺术风格和跨文化影响。通过将我们的管道应用于超过300万页的数字化手稿,我们自动识别并提取了超过20万幅独特的插图。这一规模的处理速度低于每页0.06秒,在效率以及面向视觉研究的可及性方面都大大优于传统的分割技术。我们的工作展示了尖端的人工智能工具如何深刻地重塑学术工作流程,并为数字手稿时代的多学科研究开辟新的途径。
摘要:The recent Artificial Intelligence (AI) revolution has opened transformative possibilities for the humanities, particularly in unlocking the visual content embedded in historical manuscripts. While digital archives now offer unprecedented access to these materials, the ability to systematically study illustrations at a large scale remains challenging. Our study presents a fast and scalable AI approach for detecting, extracting, and describing illustrations in digitized manuscripts. Focusing on collections like the Vatican Library, our system enables efficient visual analysis across millions of pages. Our pipeline consists of three stages: (1) a fine-tuned image classification model filters out text-only pages; (2) an efficient object detection model identifies and crops illustrations; and (3) a multimodal image captioning model generates concise, human-readable descriptions. These are stored in a searchable database, allowing scholars to retrieve relevant visual materials through keyword queries. By harnessing the power of recent AI advancements, we enable large-scale visual research that was previously impractical, empowering scholars in historical studies, art history, and cultural heritage to explore visual motifs, artistic styles, and cross-cultural influences with new precision and speed. Applying our pipeline to over three million digitized manuscript pages, we automatically identified and extracted more than 200,000 unique illustrations. This scale of processing in under 0.06 seconds per page, dramatically outperforms traditional segmentation techniques in both efficiency and accessibility for visual scholarship. Our work demonstrates how cutting-edge AI tools can profoundly reshape scholarly workflows and open new avenues for multidisciplinary research in the age of digital manuscripts.


【16】Tiny Recursive Models on ARC-AGI-1: Inductive Biases, Identity Conditioning, and Test-Time Compute
标题:ARC-AGI-1上的微型递归模型:归纳偏差、身份条件化与测试时计算
链接:https://arxiv.org/abs/2512.11847

作者:Antonio Roye-Azar,Santiago Vargas-Naranjo,Dhruv Ghai,Nithin Balamurugan,Rayan Amir
备注:13 pages, 0 figures, 6 tables
摘要:微型递归模型(TRM)被提出作为求解抽象与推理语料库(ARC)风格任务的大型语言模型的参数高效替代方案。原始工作报告了强劲的性能,并认为递归的潜变量更新能够实现非平凡的推理,但尚不清楚这种性能在多大程度上源于架构、测试时计算或特定任务的先验。在本技术说明中,我们对ARC-AGI-1上的ARC Prize TRM检查点进行了实证分析,报告了四项行为发现和一项效率比较。首先,我们证明测试时增强和多数投票集成贡献了报告性能的相当一部分:1000个样本的投票管道使Pass@1比单遍规范推理提高约11个百分点。其次,谜题身份消融揭示了对任务标识符的严格依赖:用空白或随机标记替换正确的谜题ID会导致准确率归零。第三,递归轨迹分析表明,最终精度的大部分在第一个递归步骤就已达成,且性能在很少几次潜变量更新后即饱和,说明有效递归较浅。第四,在规范增强与重度增强两种设置下的早期训练实验表明,重度增强拓宽了候选解的分布并提高了多样本成功率。最后,我们将TRM与在规范ARC-AGI-1上对Llama 3 8B进行朴素QLoRA微调的结果相比较,发现TRM的非自回归设计在该设置下具有高得多的吞吐量和显著更低的内存占用。总体而言,TRM在ARC-AGI-1上的性能似乎源于效率、特定任务条件化与激进的测试时计算之间的相互作用,而非深层的内部推理。
摘要:Tiny Recursive Models (TRM) were proposed as a parameter-efficient alternative to large language models for solving Abstraction and Reasoning Corpus (ARC) style tasks. The original work reports strong performance and suggests that recursive latent updates enable non-trivial reasoning, but it remains unclear how much of this performance stems from architecture, test-time compute, or task-specific priors. In this technical note, we empirically analyze the ARC Prize TRM checkpoint on ARC-AGI-1 and report four behavioral findings and an efficiency comparison. First, we show that test-time augmentation and majority-vote ensembling account for a substantial fraction of reported performance: the 1000-sample voting pipeline improves Pass@1 by about 11 percentage points over single-pass canonical inference. Second, a puzzle-identity ablation reveals strict dependence on task identifiers: replacing the correct puzzle ID with a blank or random token yields zero accuracy. Third, a recursion trajectory analysis shows that most of the final accuracy is achieved at the first recursion step and that performance saturates after few latent updates, indicating shallow effective recursion. Fourth, early-stage training experiments under canonical versus heavy augmentation regimes suggest that heavy augmentation broadens the distribution of candidate solutions and improves multi-sample success. Finally, we compare TRM with a naive QLoRA fine-tune of Llama 3 8B on canonical ARC-AGI-1, finding that TRM's non-autoregressive design achieves much higher throughput and substantially lower memory usage in this setting. Overall, TRM's ARC-AGI-1 performance appears to arise from an interaction between efficiency, task-specific conditioning, and aggressive test-time compute rather than deep internal reasoning.
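第一条发现中的"多数投票集成"在机制上非常简单:对同一任务采样多份候选解,按精确匹配计票取众数。示意如下(先将网格转为可哈希的元组再计票,属笔者的实现假设):

```python
from collections import Counter

def majority_vote(predictions):
    """Aggregate multiple sampled solutions by exact-match majority vote,
    as in the 1000-sample ARC voting pipeline described above."""
    counts = Counter(predictions)   # predictions must be hashable
    best, _ = counts.most_common(1)[0]
    return best
```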


【17】DeePM: Regime-Robust Deep Learning for Systematic Macro Portfolio Management
标题:DeePM:用于系统宏观投资组合管理的机制鲁棒深度学习
链接:https://arxiv.org/abs/2601.05975

作者:Kieran Wood,Stephen J. Roberts,Stefan Zohren
摘要:DeePM(Deep Portfolio Manager)是一个结构化的深度学习宏观投资组合管理器,经过端到端训练以最大化稳健的风险调整效用。DeePM解决了金融学习中的三个基本挑战:(1)它通过定向延迟(因果筛)机制解决异步"参差过滤"问题,该机制将因果脉冲响应学习置于信息新鲜度之上;(2)它通过宏观经济图先验(Macroeconomic Graph Prior)对抗低信噪比,依据经济学第一性原理约束跨资产依赖关系;(3)它优化一个分布鲁棒目标,其中平滑最差窗口惩罚充当熵值风险(EVaR)的可微代理,即一种窗口鲁棒效用,鼓励在最不利的历史子时段中取得强劲表现。在2010年至2025年对50种多元化期货、采用高度现实交易成本的大规模回测中,DeePM仅使用每日收盘价就取得了约为经典趋势跟踪策略和被动基准两倍的净风险调整收益。此外,DeePM将最先进的Momentum Transformer架构的表现提升了约50%。该模型在2010年代的"CTA(商品交易顾问)冬季"以及2020年后的波动率机制转变中表现出结构性韧性,在疫情、通胀冲击及随后的利率长期高企环境中保持一致表现。消融研究证实,严格滞后的横截面注意力、图先验、对交易成本的原则性处理以及稳健的极小化极大优化,是这种泛化能力的主要驱动因素。
摘要:We propose DeePM (Deep Portfolio Manager), a structured deep-learning macro portfolio manager trained end-to-end to maximize a robust, risk-adjusted utility. DeePM addresses three fundamental challenges in financial learning: (1) it resolves the asynchronous "ragged filtration" problem via a Directed Delay (Causal Sieve) mechanism that prioritizes causal impulse-response learning over information freshness; (2) it combats low signal-to-noise ratios via a Macroeconomic Graph Prior, regularizing cross-asset dependence according to economic first principles; and (3) it optimizes a distributionally robust objective where a smooth worst-window penalty serves as a differentiable proxy for Entropic Value-at-Risk (EVaR) - a window-robust utility encouraging strong performance in the most adverse historical subperiods. In large-scale backtests from 2010-2025 on 50 diversified futures with highly realistic transaction costs, DeePM attains net risk-adjusted returns that are roughly twice those of classical trend-following strategies and passive benchmarks, solely using daily closing prices. Furthermore, DeePM improves upon the state-of-the-art Momentum Transformer architecture by roughly fifty percent. The model demonstrates structural resilience across the 2010s "CTA (Commodity Trading Advisor) Winter" and the post-2020 volatility regime shift, maintaining consistent performance through the pandemic, inflation shocks, and the subsequent higher-for-longer environment. Ablation studies confirm that strictly lagged cross-sectional attention, graph prior, principled treatment of transaction costs, and robust minimax optimization are the primary drivers of this generalization capability.
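摘要中的"平滑最差窗口惩罚"可以理解为对滚动窗口损失取光滑最大值;一种常见做法是log-sum-exp软最大,示意如下(β、窗口长度等均为假设参数,并非论文设定):

```python
import numpy as np

def soft_worst_window(returns, window=5, beta=10.0):
    """Differentiable proxy for the worst rolling-window loss:
    a log-sum-exp soft maximum over windowed losses.
    beta controls sharpness (beta -> inf recovers the hard max)."""
    r = np.asarray(returns, dtype=float)
    losses = np.array([-r[i:i + window].sum()
                       for i in range(len(r) - window + 1)])
    m = losses.max()  # shift for a numerically stable log-sum-exp
    return m + np.log(np.exp(beta * (losses - m)).sum()) / beta
```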


【18】Multi-task Modeling for Engineering Applications with Sparse Data
标题:稀疏数据工程应用的多任务建模
链接:https://arxiv.org/abs/2601.05910

作者:Yigitcan Comlek,R. Murali Krishnan,Sandipp Krishnan Ravi,Amin Moghaddas,Rafael Giorjao,Michael Eff,Anirban Samaddar,Nesar S. Ramachandra,Sandeep Madireddy,Liping Wang
备注:15 pages, 5 figures, 6 tables
摘要:现代工程与科学工作流程通常需要跨相关任务和保真度水平进行同步预测,其中高保真数据稀缺且昂贵,而低保真数据则相对充裕。本文介绍了一个面向以多源、多保真度数据为特征的工程系统的多任务高斯过程(MTGP)框架,以应对数据稀疏和任务相关性各异的挑战。所提框架利用跨输出与跨保真度水平的任务间关系来提升预测性能并降低计算成本。该框架在三个代表性场景中得到验证:Forrester函数基准、3D椭球空洞建模以及搅拌摩擦焊。通过量化并利用任务间关系,所提出的MTGP框架为计算与实验成本高昂的领域提供了一个稳健且可扩展的预测建模方案,支持明智的决策与高效的资源利用。
摘要:Modern engineering and scientific workflows often require simultaneous predictions across related tasks and fidelity levels, where high-fidelity data is scarce and expensive, while low-fidelity data is more abundant. This paper introduces a Multi-Task Gaussian Processes (MTGP) framework tailored for engineering systems characterized by multi-source, multi-fidelity data, addressing challenges of data sparsity and varying task correlations. The proposed framework leverages inter-task relationships across outputs and fidelity levels to improve predictive performance and reduce computational costs. The framework is validated across three representative scenarios: Forrester function benchmark, 3D ellipsoidal void modeling, and friction-stir welding. By quantifying and leveraging inter-task relationships, the proposed MTGP framework offers a robust and scalable solution for predictive modeling in domains with significant computational and experimental costs, supporting informed decision-making and efficient resource utilization.
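多任务GP中耦合任务或保真度水平的一种常见构造是内在共区域化模型(ICM):K((x,t),(x',t')) = B[t,t']·k(x,x')。以下numpy示意展示该核矩阵的构造(RBF基核与任务协方差矩阵B均为笔者假设的示例,并非论文的具体模型):

```python
import numpy as np

def icm_kernel(X1, X2, t1, t2, B, lengthscale=1.0):
    """Intrinsic coregionalization model: K = B[t, t'] * k_RBF(x, x'),
    coupling tasks/fidelities through the task-covariance matrix B."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    k = np.exp(-0.5 * d2 / lengthscale ** 2)   # RBF base kernel
    return B[np.ix_(t1, t2)] * k
```

得到的核矩阵可直接代入标准GP的后验公式,实现任务间的信息共享。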


【19】What Functions Does XGBoost Learn?
标题:XGBoost学习的是哪些函数?
链接:https://arxiv.org/abs/2601.05444

作者:Dohyeong Ki,Adityanand Guntuboyina
摘要:本文为XGBoost隐式学习的函数类建立了严格的理论基础,弥合其经验成功与理论理解之间的差距。我们引入一个无穷维函数类$\mathcal{F}^{d, s}_{\infty-\text{ST}}$,它扩展了有界深度回归树的有限集成;并引入复杂度度量$V^{d, s}_{\infty-\text{XGB}}(\cdot)$,它推广了XGBoost中使用的$L^1$正则化惩罚。我们证明,XGBoost目标的每个最优解同时也是在$\mathcal{F}^{d, s}_{\infty-\text{ST}}$上带惩罚$V^{d, s}_{\infty-\text{XGB}}(\cdot)$的等价惩罚回归问题的最优解,从而将XGBoost解释为隐式地瞄准一个更广的函数类。我们还借助Hardy--Krause变差,对$\mathcal{F}^{d, s}_{\infty-\text{ST}}$与$V^{d, s}_{\infty-\text{XGB}}(\cdot)$给出了基于光滑性的解释。我们证明,在$\{f \in \mathcal{F}^{d, s}_{\infty-\text{ST}}: V^{d, s}_{\infty-\text{XGB}}(f) \le V\}$上的最小二乘估计量达到近乎极小极大最优的收敛速率$n^{-2/3} (\log n)^{4(\min(s, d) - 1)/3}$,从而避免了维数灾难。我们的结果首次对XGBoost背后的函数空间给出严格刻画,阐明了它与经典变差概念的联系,并提出一个重要的开放问题:XGBoost算法本身是否能在该函数类上达到极小极大最优。
摘要:This paper establishes a rigorous theoretical foundation for the function class implicitly learned by XGBoost, bridging the gap between its empirical success and our theoretical understanding. We introduce an infinite-dimensional function class $\mathcal{F}^{d, s}_{\infty-\text{ST}}$ that extends finite ensembles of bounded-depth regression trees, together with a complexity measure $V^{d, s}_{\infty-\text{XGB}}(\cdot)$ that generalizes the $L^1$ regularization penalty used in XGBoost. We show that every optimizer of the XGBoost objective is also an optimizer of an equivalent penalized regression problem over $\mathcal{F}^{d, s}_{\infty-\text{ST}}$ with penalty $V^{d, s}_{\infty-\text{XGB}}(\cdot)$, providing an interpretation of XGBoost as implicitly targeting a broader function class. We also develop a smoothness-based interpretation of $\mathcal{F}^{d, s}_{\infty-\text{ST}}$ and $V^{d, s}_{\infty-\text{XGB}}(\cdot)$ in terms of Hardy--Krause variation. We prove that the least squares estimator over $\{f \in \mathcal{F}^{d, s}_{\infty-\text{ST}}: V^{d, s}_{\infty-\text{XGB}}(f) \le V\}$ achieves a nearly minimax-optimal rate of convergence $n^{-2/3} (\log n)^{4(\min(s, d) - 1)/3}$, thereby avoiding the curse of dimensionality. Our results provide the first rigorous characterization of the function space underlying XGBoost, clarify its connection to classical notions of variation, and identify an important open problem: whether the XGBoost algorithm itself achieves minimax optimality over this class.


【20】A brief note on learning problem with global perspectives
标题:关于具有全局视角的学习问题的简要注记
链接:https://arxiv.org/abs/2601.05441

作者:Getachew K. Befekadu
备注:7 Pages with 1 Figure
摘要:这篇简短的笔记考虑了动态优化的委托-代理设置下的学习问题,其中允许代理人对学习过程持有全局视角,即能够根据事物的相对重要性、或基于委托人共享的聚合信息所反映的真实关系来看待事物。而对聚合中各代理的学习过程施加影响的委托人,其主要任务是求解一个高层优化问题,该问题被表述为条件矩约束模型下的经验似然估计,并同时考虑代理在样本外的预测性能以及仅委托人可用的一组私有数据集的信息。特别地,我们给出了刻画这一抽象委托-代理学习框架背后学习过程所必需的一套连贯的数学论证,同时也承认仍有一些概念性和理论性问题有待解决。
摘要:This brief note considers the problem of learning with dynamic-optimizing principal-agent setting, in which the agents are allowed to have global perspectives about the learning process, i.e., the ability to view things according to their relative importances or in their true relations based-on some aggregated information shared by the principal. Whereas, the principal, which is exerting an influence on the learning process of the agents in the aggregation, is primarily tasked to solve a high-level optimization problem posed as an empirical-likelihood estimator under conditional moment restrictions model that also accounts information about the agents' predictive performances on out-of-samples as well as a set of private datasets available only to the principal. In particular, we present a coherent mathematical argument which is necessary for characterizing the learning process behind this abstract principal-agent learning framework, although we acknowledge that there are a few conceptual and theoretical issues still need to be addressed.


【21】On the use of case estimate and transactional payment data in neural networks for individual loss reserving
标题:在用于个体损失准备金评估的神经网络中使用案件估计与交易支付数据
链接:https://arxiv.org/abs/2601.05274

作者:Benjamin Avanzi,Matthew Lambrianidis,Greg Taylor,Bernard Wong
摘要:在精算准备金文献中,使用基于个体索赔数据训练的神经网络已变得越来越流行。我们考虑如何在神经网络模型中最好地输入历史支付数据。此外,案件估计也以时间序列的形式提供,我们将分析扩展到评估其预测能力。在本文中,我们比较了基于汇总交易训练的前馈神经网络,与能够分析单笔索赔完整支付历史和/或案件估计发展历史的循环神经网络。我们通过在由SPLICE(Avanzi, Taylor and Wang, 2023)模拟的多个可比的高度复杂数据集上训练并比较模型的性能得出结论。我们发现有证据表明,案件估计能显著改善预测,但为神经网络配备记忆只带来微弱的改进。尽管案件估计的流程和质量在不同保险公司之间会有很大差异,我们提供了一种评估其价值的标准化方法。
摘要:The use of neural networks trained on individual claims data has become increasingly popular in the actuarial reserving literature. We consider how to best input historical payment data in neural network models. Additionally, case estimates are also available in the format of a time series, and we extend our analysis to assessing their predictive power. In this paper, we compare a feed-forward neural network trained on summarised transactions to a recurrent neural network equipped to analyse a claim's entire payment history and/or case estimate development history. We draw conclusions from training and comparing the performance of the models on multiple, comparable highly complex datasets simulated from SPLICE (Avanzi, Taylor and Wang, 2023). We find evidence that case estimates will improve predictions significantly, but that equipping the neural network with memory only leads to meagre improvements. Although the case estimation process and quality will vary significantly between insurers, we provide a standardised methodology for assessing their value.


其他(13篇)

【1】An Empirical Study on Preference Tuning Generalization and Diversity Under Domain Shift
标题:领域转移下偏好调整泛化性与多样性的实证研究
链接:https://arxiv.org/abs/2601.05882

作者:Constantinos Karouzos,Xingwei Tan,Nikolaos Aletras
摘要:偏好调整通过优化显式偏好信号而非仅仅优化似然,使预训练语言模型与人类对质量、有用性或安全性的判断对齐。先前的工作表明,在训练域之外进行评估时,偏好调整会降低性能并降低有用性。然而,各种适应策略能在多大程度上缓解这种领域转移仍未被探索。我们通过对领域转移下的对齐泛化进行全面而系统的研究来应对这一挑战。我们在摘要生成和问答的有用性任务上,比较了五种流行的对齐目标以及从源域到目标域的多种适应策略,包括目标域监督微调和伪标签。我们的研究结果揭示了在领域转移下不同对齐目标之间泛化能力的系统性差异。我们表明,基于伪标签的适应策略可以大大减少领域转移带来的性能退化。
摘要:Preference tuning aligns pretrained language models to human judgments of quality, helpfulness, or safety by optimizing over explicit preference signals rather than likelihood alone. Prior work has shown that preference-tuning degrades performance and reduces helpfulness when evaluated outside the training domain. However, the extent to which adaptation strategies mitigate this domain shift remains unexplored. We address this challenge by conducting a comprehensive and systematic study of alignment generalization under domain shift. We compare five popular alignment objectives and various adaptation strategies from source to target, including target-domain supervised fine-tuning and pseudo-labeling, across summarization and question-answering helpfulness tasks. Our findings reveal systematic differences in generalization across alignment objectives under domain shift. We show that adaptation strategies based on pseudo-labeling can substantially reduce domain-shift degradation


【2】A New Family of Poisson Non-negative Matrix Factorization Methods Using the Shifted Log Link
标题:一类新的使用移位对数连接函数的泊松非负矩阵分解方法
链接:https://arxiv.org/abs/2601.05845

作者:Eric Weine,Peter Carbonetto,Rafael A. Irizarry,Matthew Stephens
摘要:泊松非负矩阵分解(NMF)是一种广泛使用的方法,用于寻找计数数据的可解释的"基于部分"的分解。虽然泊松NMF存在许多变体,但现有方法都假设分解中的各"部分"以相加的方式组合。这一假设在某些情形下是自然的,在另一些情形下则不然。在这里,我们引入带移位对数连接函数的泊松NMF来放松这一假设。移位对数连接函数只有一个调节参数,随着该参数的变化,模型从假设各部分相加地组合(即标准泊松NMF)逐渐过渡到假设各部分以更接近乘法的方式组合。我们给出了用极大似然拟合该模型的算法,以及一种能大幅减少大型稀疏数据集计算时间的近似(计算量与数据矩阵中非零元素的数量成比例)。我们在多种真实数据集上演示了这些新方法。我们的例子显示了泊松NMF中连接函数的选择会如何实质性地影响结果,以及在某些情形下,相比标准的加性连接,使用移位对数连接函数可以提高可解释性。
摘要:Poisson non-negative matrix factorization (NMF) is a widely used method to find interpretable "parts-based" decompositions of count data. While many variants of Poisson NMF exist, existing methods assume that the "parts" in the decomposition combine additively. This assumption may be natural in some settings, but not in others. Here we introduce Poisson NMF with the shifted-log link function to relax this assumption. The shifted-log link function has a single tuning parameter, and as this parameter varies the model changes from assuming that parts combine additively (i.e., standard Poisson NMF) to assuming that parts combine more multiplicatively. We provide an algorithm to fit this model by maximum likelihood, and also an approximation that substantially reduces computation time for large, sparse datasets (computations scale with the number of non-zero entries in the data matrix). We illustrate these new methods on a variety of real datasets. Our examples show how the choice of link function in Poisson NMF can substantively impact the results, and how in some settings the use of a shifted-log link function may improve interpretability compared with the standard, additive link.
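作为对照,下面用NumPy给出标准(加性连接)泊松NMF的Lee-Seung乘法更新的最小示意。摘要中的移位对数连接函数正是对这一加性假设的推广;此处的函数名、初始化方式与迭代次数均为本示例的假设,并非论文实现。

```python
import numpy as np

def poisson_nmf(X, rank, n_iter=200, eps=1e-10, seed=0):
    """标准(加性连接)泊松NMF的乘法更新:在泊松/KL损失下拟合 X ≈ WH。
    乘法更新按构造保持 W、H 非负。"""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    W = rng.uniform(0.1, 1.0, (m, rank))
    H = rng.uniform(0.1, 1.0, (rank, n))
    for _ in range(n_iter):
        WH = W @ H + eps
        W *= (X / WH) @ H.T / (H.sum(axis=1) + eps)   # W 更新
        WH = W @ H + eps
        H *= W.T @ (X / WH) / (W.sum(axis=0)[:, None] + eps)  # H 更新
    return W, H

# 用一个真实秩为 3 的泊松计数矩阵做演示
rng = np.random.default_rng(1)
X = rng.poisson(rng.uniform(0.5, 2.0, (30, 3)) @ rng.uniform(0.5, 2.0, (3, 20)))
W, H = poisson_nmf(X, rank=3)
```

乘法更新保持因子非负,这正是"基于部分"分解可解释性的来源。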


【3】mHC-lite: You Don't Need 20 Sinkhorn-Knopp Iterations
标题:mHC-lite:您不需要20次Sinkhorn-Knopp迭代
链接:https://arxiv.org/abs/2601.05732

作者:Yongyi Yang,Jianyang Gao
摘要:超连接(HC)通过引入动态残差矩阵来推广残差连接,该动态残差矩阵把多个残差流中的信息混合在一起,从而加速深度神经网络的收敛。然而,无约束的残差矩阵可能损害训练稳定性。为了解决这个问题,DeepSeek的流形约束超连接(mHC)通过迭代的Sinkhorn-Knopp(SK)归一化,把这些矩阵近似地投影到Birkhoff多面体上。我们指出了这种方法的两个局限:(i)有限次SK迭代不能保证精确的双随机性,留下的近似误差会随网络深度累积并破坏稳定性;(ii)高效的SK实现需要高度专门化的CUDA内核,抬高了工程门槛并降低了可移植性。受Birkhoff-von Neumann定理启发,我们提出了mHC-lite:一种简单的重参数化,把双随机矩阵显式构造为置换矩阵的凸组合。这种方法按构造保证精确的双随机性,并且只用原生矩阵运算即可实现。大量实验表明,mHC-lite在性能上匹配或超过mHC,同时以朴素实现获得更高的训练吞吐量,并消除了在HC和mHC中观察到的残余不稳定性。代码已公开:https://github.com/FFTYYY/mhc-lite。
摘要:Hyper-Connections (HC) generalizes residual connections by introducing dynamic residual matrices that mix information across multiple residual streams, accelerating convergence in deep neural networks. However, unconstrained residual matrices can compromise training stability. To address this, DeepSeek's Manifold-Constrained Hyper-Connections (mHC) approximately projects these matrices onto the Birkhoff polytope via iterative Sinkhorn--Knopp (SK) normalization. We identify two limitations of this approach: (i) finite SK iterations do not guarantee exact doubly stochasticity, leaving an approximation gap that can accumulate through network depth and undermine stability; (ii) efficient SK implementation requires highly specialized CUDA kernels, raising engineering barriers and reducing portability. Motivated by the Birkhoff--von Neumann theorem, we propose mHC-lite, a simple reparameterization that explicitly constructs doubly stochastic matrices as convex combinations of permutation matrices. This approach guarantees exact doubly stochasticity by construction and can be implemented using only native matrix operations. Extensive experiments demonstrate that mHC-lite matches or exceeds mHC in performance while achieving higher training throughput with a naive implementation and eliminating the residual instabilities observed in both HC and mHC. The code is publicly available at https://github.com/FFTYYY/mhc-lite.
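摘要所述的核心构造可以用几行NumPy示意:按Birkhoff-von Neumann定理,对若干置换矩阵取softmax系数加权的凸组合,得到的矩阵按构造精确双随机,无需任何SK迭代(函数名与置换基的选取均为本示例的假设)。

```python
import numpy as np

def doubly_stochastic(logits, perms):
    """把可学习的 logits 经 softmax 变成凸组合系数,
    再对置换矩阵加权求和:结果按构造是精确的双随机矩阵。"""
    w = np.exp(logits - logits.max())
    w = w / w.sum()                      # softmax:系数非负且和为 1
    return sum(wi * P for wi, P in zip(w, perms))

n = 4
rng = np.random.default_rng(0)
# 随机取 6 个 n 阶置换矩阵作为基
perms = [np.eye(n)[rng.permutation(n)] for _ in range(6)]
M = doubly_stochastic(rng.normal(size=6), perms)
# 行和与列和均精确为 1,无需 Sinkhorn-Knopp 迭代
assert np.allclose(M.sum(axis=0), 1) and np.allclose(M.sum(axis=1), 1)
```

反向传播只需穿过softmax与矩阵加权求和,全部是原生张量运算,这正是摘要强调的可移植性优势。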


【4】Visualising Information Flow in Word Embeddings with Diffusion Tensor Imaging
标题:利用扩散张量成像可视化单词嵌入中的信息流
链接:https://arxiv.org/abs/2601.05713

作者:Thomas Fabian
摘要:理解大型语言模型(LLM)如何表示自然语言是自然语言处理(NLP)研究的核心挑战。许多现有方法从LLM中提取词嵌入,用散点图可视化嵌入空间,并比较某些词的相对位置。然而,这种做法只考虑单个词而非完整的自然语言表达,因此忽略了词的使用上下文。在这里,我们提出了一种新工具:通过把扩散张量成像(DTI)应用于词嵌入,来分析和可视化自然语言表达中的信息流。我们发现DTI能够揭示信息如何在词嵌入之间流动。跟踪LLM各层内的信息流可以比较不同的模型结构,并揭示剪除LLM中利用不足的层的机会。此外,我们的模型还揭示了代词消解和隐喻检测等任务在信息流上的差异。我们的结果表明,该模型能对LLM如何表示真实的自然语言表达提供新的洞见,超越了对孤立词嵌入的比较,并提高了NLP模型的可解释性。
摘要:Understanding how large language models (LLMs) represent natural language is a central challenge in natural language processing (NLP) research. Many existing methods extract word embeddings from an LLM, visualise the embedding space via point-plots, and compare the relative positions of certain words. However, this approach only considers single words and not whole natural language expressions, thus disregards the context in which a word is used. Here we present a novel tool for analysing and visualising information flow in natural language expressions by applying diffusion tensor imaging (DTI) to word embeddings. We find that DTI reveals how information flows between word embeddings. Tracking information flows within the layers of an LLM allows for comparing different model structures and revealing opportunities for pruning an LLM's under-utilised layers. Furthermore, our model reveals differences in information flows for tasks like pronoun resolution and metaphor detection. Our results show that our model permits novel insights into how LLMs represent actual natural language expressions, extending the comparison of isolated word embeddings and improving the interpretability of NLP models.


【5】Quantifying and Inducing Shape Bias in CNNs via Max-Pool Dilation
标题:通过最大池膨胀量化和诱导CNN中的形状偏差
链接:https://arxiv.org/abs/2601.05599

作者:Takito Sawada,Akinori Iwata,Masahiro Okuda
备注:Accepted to IEVC 2026. 4 pages, 1 figure, 3 tables
摘要:众所周知,卷积神经网络(CNN)表现出强烈的纹理偏差,偏好局部模式而非全局形状信息,这是其卷积架构固有的倾向。虽然这种偏差对纹理丰富的自然图像有益,但它通常会降低在插图和草图等形状主导数据上的性能。尽管先前的工作已经提出形状偏置模型来缓解这个问题,但这些方法缺乏一个定量指标来识别哪些数据集真正会从这类修改中受益。为弥补这一空白,我们提出一个数据驱动的度量,通过计算每幅图像的亮度通道与其L0平滑版本之间的结构相似性指数(SSIM)来量化数据集的形状-纹理平衡。在此度量的基础上,我们进一步引入一种计算高效的自适应方法:在保持卷积权重冻结的同时,通过修改最大池化操作的膨胀率来促进形状偏置。实验结果表明,这种方法能一致地提高形状主导数据集上的分类精度,特别是在完全微调不切实际的低数据情形下,因为它只需训练最终的分类层。
摘要:Convolutional Neural Networks (CNNs) are known to exhibit a strong texture bias, favoring local patterns over global shape information--a tendency inherent to their convolutional architecture. While this bias is beneficial for texture-rich natural images, it often degrades performance on shape-dominant data such as illustrations and sketches. Although prior work has proposed shape-biased models to mitigate this issue, these approaches lack a quantitative metric for identifying which datasets would actually benefit from such modifications. To address this gap, we propose a data-driven metric that quantifies the shape-texture balance of a dataset by computing the Structural Similarity Index (SSIM) between each image's luminance channel and its L0-smoothed counterpart. Building on this metric, we further introduce a computationally efficient adaptation method that promotes shape bias by modifying the dilation of max-pooling operations while keeping convolutional weights frozen. Experimental results show that this approach consistently improves classification accuracy on shape-dominant datasets, particularly in low-data regimes where full fine-tuning is impractical, requiring training only the final classification layer.
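下面给出该形状-纹理度量思路的最小示意:计算图像与其平滑版本之间的(全局)SSIM,形状主导的图像在平滑后变化小、SSIM高。注意这里用简单的均值滤波代替论文中的L0平滑,且未使用滑动窗口SSIM,均为本示例的简化假设。

```python
import numpy as np

def global_ssim(x, y, L=1.0):
    """整幅图像上的全局 SSIM(不加滑动窗口),常数取 SSIM 原文默认值。"""
    C1, C2 = (0.01 * L) ** 2, (0.03 * L) ** 2
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + C1) * (2 * cov + C2)) / ((mx**2 + my**2 + C1) * (vx + vy + C2))

def box_blur(img, k=3):
    """简单均值滤波,这里仅代替论文中的 L0 平滑做演示(假设)。"""
    pad = k // 2
    p = np.pad(img, pad, mode="edge")
    out = np.zeros_like(img)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            out[i, j] = p[i:i + k, j:j + k].mean()
    return out

rng = np.random.default_rng(0)
texture = rng.uniform(size=(32, 32))                  # 纹理主导:平滑后变化大
shape = np.zeros((32, 32)); shape[8:24, 8:24] = 1.0   # 形状主导:平滑后基本不变
s_tex = global_ssim(texture, box_blur(texture))
s_shape = global_ssim(shape, box_blur(shape))
```

`s_shape` 明显高于 `s_tex`,直观对应摘要所说的"形状-纹理平衡"度量。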


【6】Autoregressive Ranking: Bridging the Gap Between Dual and Cross Encoders
标题:自回归排序:弥合双编码器和交叉编码器之间的差距
链接:https://arxiv.org/abs/2601.05588

作者:Benjamin Rozonoyer,Chong You,Michael Boratko,Himanshu Jain,Nilesh Gupta,Srinadh Bhojanapalli,Andrew McCallum,Felix Yu
备注:22 pages, 5 figures
摘要:双编码器和交叉编码器长期以来一直是信息检索(IR)的支柱,但正受到LLM涌现能力的挑战。我们称之为逐点生成式排序的一种基于LLM的方法(只生成单个docID长度的令牌而非整个列表,从而通过束搜索实现排序),在结合效率与表达力优势的同时,利用了因果Transformer的上下文能力。尽管有充分证据表明预训练LLM非常适合排序,但我们发现绝大多数基于LLM的方法都依赖下一个令牌预测,这一损失函数从根本上与排序无关(在逐点监督下尤其如此)。在本文中,我们首先证明使用多令牌docID的逐点生成式排序在表达力上优于双编码器。随后我们提出SToICaL(一种简单的令牌-条目校准损失),它能在逐点设置下同时在条目和令牌两个层面引入排序感知的监督。我们在源自WordNet(Fellbaum, 1998)和ESCI(Reddy等人, arXiv:2206.06588)的排序任务上进行了一系列实验。SToICaL的两个变体成功抑制了无效docID生成的概率,并在top-1检索之外的常见排序指标上取得改进。
摘要:Dual and cross encoders have long been mainstays of information retrieval (IR), but are being challenged by the emergent capabilities of LLMs. An LLM-based approach we term pointwise generative ranking - generating tokens the length of a single docID as opposed to a list in order to enable ranking via beam search - combines efficiency and expressivity benefits while leveraging the in-context capabilities of Causal Transformers. Although there is ample evidence to suggest that pretrained LLMs are well-suited for ranking, we find that the vast majority of LLM-based approaches rely on next-token prediction, a loss function which is fundamentally rank-agnostic (and especially so with pointwise supervision). In this paper, we first prove that the expressivity of pointwise generative ranking with multi-token docIDs is superior to that of dual encoders. We then propose SToICaL - a Simple Token-Item Calibrated Loss - which can incorporate rank-aware supervision at both the item and token levels within the pointwise setup. We run a suite of experiments on ranking tasks derived from WordNet (Fellbaum, 1998) and ESCI (Reddy et al., arXiv:2206.06588). Two variants of SToICaL successfully suppress the probability of invalid docID generations and improve on common ranking metrics beyond top-1 retrieval.


【7】Poisson Hyperplane Processes with Rectified Linear Units
标题:带有修正线性单元的泊松超平面过程
链接:https://arxiv.org/abs/2601.05586

作者:Shufei Ge,Shijia Wang,Lloyd Elliott
摘要:神经网络在各种分类和回归任务中表现出最先进的性能。修正线性单元(ReLU)通常用作神经网络模型中隐藏层的激活函数。在本文中,我们建立了泊松超平面过程(PHP)与双层ReLU神经网络之间的联系。我们证明了带高斯先验的PHP是双层ReLU神经网络的一种替代概率表示。此外,我们借助分解命题表明,由PHP构造的双层神经网络可以扩展到大规模问题。最后,我们提出了一种用于贝叶斯推理的退火序贯蒙特卡罗算法。数值实验表明,我们提出的方法优于经典的双层ReLU神经网络。模型实现见https://github.com/ShufeiGe/Pois_Relu.git。
摘要:Neural networks have shown state-of-the-art performances in various classification and regression tasks. Rectified linear units (ReLU) are often used as activation functions for the hidden layers in a neural network model. In this article, we establish the connection between the Poisson hyperplane processes (PHP) and two-layer ReLU neural networks. We show that the PHP with a Gaussian prior is an alternative probabilistic representation to a two-layer ReLU neural network. In addition, we show that a two-layer neural network constructed by PHP is scalable to large-scale problems via the decomposition propositions. Finally, we propose an annealed sequential Monte Carlo algorithm for Bayesian inference. Our numerical experiments demonstrate that our proposed method outperforms the classic two-layer ReLU neural network. The implementation of our proposed model is available at https://github.com/ShufeiGe/Pois_Relu.git.
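摘要中双层ReLU网络与超平面的对应关系可以直接写出:每个隐藏单元 max(0, &lt;a, x&gt; - t) 由一张超平面(单位法向量a、偏移t)决定,输出是它们的加权和。下面的NumPy草图仅作示意:随机超平面与高斯输出权重对应PHP表示中的随机对象,具体参数化方式为本示例的假设。

```python
import numpy as np

def relu_network(X, normals, offsets, weights, bias=0.0):
    """两层 ReLU 网络:f(x) = b + Σ_k w_k · max(0, <a_k, x> - t_k)。
    每个隐藏单元对应超平面 {x : <a_k, x> = t_k};对超平面集合做
    泊松过程建模即得到摘要中 PHP 与此类网络的对应(示意)。"""
    pre = X @ normals.T - offsets           # (n_samples, n_hyperplanes)
    return bias + np.maximum(pre, 0.0) @ weights

rng = np.random.default_rng(0)
K, d = 50, 2
normals = rng.normal(size=(K, d))
normals /= np.linalg.norm(normals, axis=1, keepdims=True)  # 单位法向量
offsets = rng.uniform(-1, 1, K)            # 超平面的有符号偏移
weights = rng.normal(size=K)               # 对应高斯先验下的输出权重
y = relu_network(rng.normal(size=(100, d)), normals, offsets, weights)
```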


【8】Generalized Canonical Polyadic Tensor Decompositions with General Symmetry
标题:具有一般对称性的广义典型多元张量分解
链接:https://arxiv.org/abs/2601.05335

作者:Alex Mulrooney,David Hong
备注:This work has been submitted to the IEEE for possible publication. 11 pages, 5 figures
摘要:典范多元(CP)张量分解是发现张量数据中潜在低维结构的主力算法。在传统CP分解中,这是通过在最小二乘损失下用一个低秩张量拟合数据来实现的。广义CP(GCP)分解通过允许更合适的一般损失函数来推广这一方法,例如对二值数据和计数数据建模,或提高对离群点的稳健性。然而,GCP分解没有显式考虑张量中的任何对称性,而对称性在现代应用中经常出现。例如,把动态图的邻接矩阵沿时间堆叠形成的张量,自然会沿对应于图节点的两个模式呈现对称性。在本文中,我们提出了一种对称GCP(SymGCP)分解,允许一般形式的对称性,即沿模式的任意子集对称。SymGCP通过在分解中强制相应的对称性来考虑对称性。我们推导了SymGCP的梯度,使其能够借助现有的张量核进行高效的一次性整体优化。梯度的形式还引出多种随机近似,使我们能够开发可扩展到大张量的随机SymGCP算法。我们通过在合成数据和真实数据上的多种实验展示了所提SymGCP算法的实用性。
摘要:Canonical Polyadic (CP) tensor decomposition is a workhorse algorithm for discovering underlying low-dimensional structure in tensor data. This is accomplished in conventional CP decomposition by fitting a low-rank tensor to data with respect to the least-squares loss. Generalized CP (GCP) decompositions generalize this approach by allowing general loss functions that can be more appropriate, e.g., to model binary and count data or to improve robustness to outliers. However, GCP decompositions do not explicitly account for any symmetry in the tensors, which commonly arises in modern applications. For example, a tensor formed by stacking the adjacency matrices of a dynamic graph over time will naturally exhibit symmetry along the two modes corresponding to the graph nodes. In this paper, we develop a symmetric GCP (SymGCP) decomposition that allows for general forms of symmetry, i.e., symmetry along any subset of the modes. SymGCP accounts for symmetry by enforcing the corresponding symmetry in the decomposition. We derive gradients for SymGCP that enable its efficient computation via all-at-once optimization with existing tensor kernels. The form of the gradients also leads to various stochastic approximations that enable us to develop stochastic SymGCP algorithms that can scale to large tensors. We demonstrate the utility of the proposed SymGCP algorithms with a variety of experiments on both synthetic and real data.
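摘要提到为对称因子推导梯度以支持一次性整体优化。下面在一个特例下写出该梯度并用有限差分验证:最小二乘损失、前两个模式对称的三阶张量(对称位置的两项梯度贡献合并为一项,故出现系数4)。一般损失与任意对称模式子集的情形见论文本身,此处仅为示意。

```python
import numpy as np

def sym_cp_loss_and_grads(T, A, C):
    """三阶张量 T ≈ Σ_r a_r ∘ a_r ∘ c_r(前两个模式对称,共享因子 A)
    的最小二乘损失及闭式梯度。对称模式 i、j 的两项贡献相等,合并后
    给出系数 4;这正是一次性整体优化所需的梯度形式(示意)。"""
    R = np.einsum("ir,jr,kr->ijk", A, A, C) - T   # 残差张量
    loss = (R ** 2).sum()
    gA = 4 * np.einsum("ijk,jr,kr->ir", R, A, C)  # 共享对称因子的梯度
    gC = 2 * np.einsum("ijk,ir,jr->kr", R, A, A)  # 非对称模式因子的梯度
    return loss, gA, gC

# 用有限差分验证梯度公式
rng = np.random.default_rng(0)
A = rng.normal(size=(5, 2)); C = rng.normal(size=(4, 2))
T = rng.normal(size=(5, 5, 4)); T = 0.5 * (T + T.transpose(1, 0, 2))  # 对称化
loss, gA, gC = sym_cp_loss_and_grads(T, A, C)
eps = 1e-6
A2 = A.copy(); A2[1, 0] += eps
num = (sym_cp_loss_and_grads(T, A2, C)[0] - loss) / eps
assert abs(num - gA[1, 0]) < 1e-3 * max(1.0, abs(gA[1, 0]))
```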


【9】MoEBlaze: Breaking the Memory Wall for Efficient MoE Training on Modern GPUs
标题:MoEBlaze:打破内存墙,在现代GPU上进行高效的MoE训练
链接:https://arxiv.org/abs/2601.05296

作者:Jiyuan Zhang,Yining Liu,Siqi Yan,Lisen Deng,Jennifer Cao,Shuqi Yang,Min Ni,Bi Xue,Shen Li
摘要:普遍存在的“内存墙”瓶颈在现代大规模专家混合(MoE)架构中被显著放大。MoE固有的架构稀疏性导致稀疏算术计算,并且还引入大量激活存储器开销-由大型令牌路由缓冲区以及对物化和缓冲中间张量的需求驱动。这种内存压力限制了GPU上可以容纳的最大批处理大小和序列长度,并且还导致过多的数据移动,从而阻碍了性能和有效的模型扩展。我们提出了MoEBlaze,一个内存高效的MoE训练框架,通过共同设计的系统方法来解决这些问题:(i)端到端的令牌分派和MoE训练方法,优化数据结构,以消除中间缓冲区和激活物化,以及(ii)共同设计的内核与智能激活检查点,以减轻内存占用,同时实现更好的性能。我们证明,MoEBlaze可以实现超过4倍的加速比和超过50%的内存节省相比,现有的MoE框架。
摘要:The pervasive "memory wall" bottleneck is significantly amplified in modern large-scale Mixture-of-Experts (MoE) architectures. MoE's inherent architectural sparsity leads to sparse arithmetic compute and also introduces substantial activation memory overheads -- driven by large token routing buffers and the need to materialize and buffer intermediate tensors. This memory pressure limits the maximum batch size and sequence length that can fit on GPUs, and also results in excessive data movements that hinders performance and efficient model scaling. We present MoEBlaze, a memory-efficient MoE training framework that addresses these issues through a co-designed system approach: (i) an end-to-end token dispatch and MoE training method with optimized data structures to eliminate intermediate buffers and activation materializing, and (ii) co-designed kernels with smart activation checkpoint to mitigate memory footprint while simultaneously achieving better performance. We demonstrate that MoEBlaze can achieve over 4x speedups and over 50% memory savings compared to existing MoE frameworks.


【10】Improving User Experience with Personalized Review Ranking and Summarization
标题:通过个性化的评论排名和总结改善用户体验
链接:https://arxiv.org/abs/2601.05261

作者:Muhammad Mufti,Omar Hammad,Mahfuzur Rahman
摘要 :在线消费者评论通过提供对产品质量、可用性和性能的见解,在指导购买决策方面发挥着至关重要的作用。然而,越来越多的用户生成评论导致信息过载,使消费者难以识别符合其特定偏好的内容。现有的评论排名系统通常依赖于有用性投票、星级和新近度等指标,但这些指标无法捕捉单个用户的兴趣,并且通常单独对待文本情感和评级信号。本研究针对这些局限性,提出了一个个性化的框架,集成了审查排名和抽象总结,以提高决策效率。该系统首先通过对星级和评论内容的混合分析来建模每个用户的情绪。同时,用户偏好来自历史评论,使用句子嵌入和聚类,形成与主题和情感维度一致的语义配置文件。相关性评分算法根据情感和方面相似性将这些配置文件与看不见的评论进行匹配。然后汇总匹配度最高的评论,以反映个人兴趣。一项有70名参与者的用户研究表明,个性化方法提高了满意度、感知相关性和决策信心,同时减少了阅读时间。结果突出了该方法在缓解信息过载和提供针对用户特定偏好的内容方面的有效性,强调了其在丰富的决策环境中增强用户体验的价值。
摘要:Online consumer reviews play a crucial role in guiding purchase decisions by offering insights into product quality, usability, and performance. However, the increasing volume of user-generated reviews has led to information overload, making it difficult for consumers to identify content that aligns with their specific preferences. Existing review ranking systems typically rely on metrics such as helpfulness votes, star ratings, and recency, but these fail to capture individual user interests and often treat textual sentiment and rating signals separately. This research addresses these limitations by proposing a personalized framework that integrates review ranking and abstractive summarization to enhance decision-making efficiency. The proposed system begins by modeling each user's sentiment through a hybrid analysis of star ratings and review content. Simultaneously, user preferences were derived from historical reviews using sentence embeddings and clustering, forming semantic profiles aligned with thematic and sentiment dimensions. A relevance scoring algorithm matched these profiles with unseen reviews based on sentiment and aspect similarity. Top-matched reviews were then summarized to reflect individual interests. A user study with 70 participants demonstrated that the personalized approach improved satisfaction, perceived relevance, and decision-making confidence, while reducing time spent reading. The results highlight the method's effectiveness in alleviating information overload and delivering content tailored to user-specific preferences, emphasizing its value in enhancing user experience in review-rich decision-making environments.
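摘要描述的"聚类得到语义画像 + 相似度打分"流程可以粗略示意如下:用朴素k-means把用户历史评论的嵌入聚成画像中心,再以候选评论与最近中心的余弦相似度作为相关性得分。此处的嵌入用合成向量代替句子编码器输出,且省略了情感匹配项,均为本示例的假设。

```python
import numpy as np

def relevance_scores(history_emb, candidate_emb, n_iter=20):
    """把历史评论嵌入聚成 2 个语义画像(朴素 k-means,最远点初始化),
    并以候选评论与最近画像中心的余弦相似度作为相关性得分(示意)。"""
    centers = [history_emb[0]]
    d0 = np.linalg.norm(history_emb - centers[0], axis=1)
    centers.append(history_emb[d0.argmax()])      # 最远点作为第二个初始中心
    centers = np.array(centers)
    for _ in range(n_iter):                       # 朴素 k-means 迭代
        d = np.linalg.norm(history_emb[:, None] - centers[None], axis=2)
        labels = d.argmin(axis=1)
        for c in range(len(centers)):
            if (labels == c).any():
                centers[c] = history_emb[labels == c].mean(axis=0)
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
    return np.array([max(cos(e, c) for c in centers) for e in candidate_emb])

rng = np.random.default_rng(1)
# 两类历史评论(两个"画像"),嵌入分别围绕 +1 向量和 -1 向量
history = np.vstack([rng.normal(1.0, 0.1, (5, 8)), rng.normal(-1.0, 0.1, (5, 8))])
cands = np.vstack([np.ones((1, 8)),                    # 贴近画像一
                   -np.ones((1, 8)),                   # 贴近画像二
                   np.tile([1.0, -1.0], 4)[None, :]])  # 与两画像都近乎正交
scores = relevance_scores(history, cands)
```

与任一画像相近的候选评论得分高,与所有画像都不相关的候选评论得分低,之后即可按得分排序并送入摘要模块。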


【11】Towards Realistic Guarantees: A Probabilistic Certificate for SmoothLLM
标题:迈向现实的保证:SmoothLLM的概率证书
链接:https://arxiv.org/abs/2511.18721

作者:Adarsh Kumarappan,Ayushi Mehrotra
摘要:SmoothLLM防御提供了针对越狱攻击的认证保证,但它依赖于严格的"k-不稳定"假设,这在实践中很少成立。这种强假设可能会限制所提供安全证书的可信度。在这项工作中,我们通过引入一个更现实的概率框架"(k, $\varepsilon$)-不稳定"来解决这一限制,以认证针对从基于梯度(GCG)到语义(PAIR)的多种越狱攻击的防御。我们通过结合攻击成功的经验模型,推导出SmoothLLM防御概率的一个新的、数据驱动的下界,提供了一个更可信、更实用的安全证书。通过引入(k, $\varepsilon$)-不稳定的概念,我们的框架为从业者提供了可操作的安全保证,使他们能够设置更好地反映LLM真实世界行为的认证阈值。最终,这项工作提供了一种实用且有理论基础的机制,使LLM更能抵抗对其安全对齐的利用,这是安全AI部署中的一个关键挑战。
摘要:The SmoothLLM defense provides a certification guarantee against jailbreaking attacks, but it relies on a strict `k-unstable' assumption that rarely holds in practice. This strong assumption can limit the trustworthiness of the provided safety certificate. In this work, we address this limitation by introducing a more realistic probabilistic framework, `(k, $\varepsilon$)-unstable,' to certify defenses against diverse jailbreaking attacks, from gradient-based (GCG) to semantic (PAIR). We derive a new, data-informed lower bound on SmoothLLM's defense probability by incorporating empirical models of attack success, providing a more trustworthy and practical safety certificate. By introducing the notion of (k, $\varepsilon$)-unstable, our framework provides practitioners with actionable safety guarantees, enabling them to set certification thresholds that better reflect the real-world behavior of LLMs. Ultimately, this work contributes a practical and theoretically-grounded mechanism to make LLMs more resistant to the exploitation of their safety alignments, a critical challenge in secure AI deployment.
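在一个简化的独立性假设下(每份随机扰动副本以经验成功率p被攻破,聚合采用多数表决),SmoothLLM式防御的成功概率可用二项分布直接计算,如下草图所示。这只是展示证书如何随副本数与攻击成功率变化的玩具模型,并非论文中(k, ε)-不稳定框架的实现。

```python
from math import comb

def defense_probability(n_copies, p_attack):
    """SmoothLLM 式多数表决防御的成功概率(简化模型,假设):
    n 份随机扰动副本相互独立,每份以概率 p_attack 被越狱;
    当被攻破的副本不超过半数时,聚合输出判定为安全。"""
    k_max = n_copies // 2                      # 允许被攻破的最大副本数
    return sum(comb(n_copies, k) * p_attack**k * (1 - p_attack)**(n_copies - k)
               for k in range(k_max + 1))

# 经验攻击成功率越低、副本数越多,证书越强
print(defense_probability(5, 0.2), defense_probability(11, 0.2))
```

把逐点的经验成功率换成对攻击成功的经验建模,就得到摘要所说的"数据驱动的下界"这一思路。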


【12】Dynamic Inclusion and Bounded Multi-Factor Tilts for Robust Portfolio Construction
标题:用于稳健投资组合构建的动态纳入与有界多因子倾斜
链接:https://arxiv.org/abs/2601.05428

作者:Roberto Garrone
备注:28 pages, 7 figures, algorithmic portfolio construction framework emphasizing robustness, explicit constraints, and implementability
摘要:本文提出了一个投资组合构建框架,旨在在估计误差、非平稳性和现实交易约束下保持稳健。该方法将动态资产合格性、确定性再平衡,以及施加在等权基线上的有界多因子倾斜结合起来。资产合格性被形式化为对投资组合构建的状态依赖约束,使因子暴露能够根据流动性、波动性和横截面广度等可观察的市场条件进行内生调整。该框架不估计预期收益或协方差,而是依赖横截面排名和硬性结构边界来控制集中度、换手率和脆弱性。由此得到的方法完全算法化、透明且可直接实施。它为参数化优化和无约束多因子模型提供了一种面向稳健性的替代方案,特别适用于以稳定性和运营可行性为首要目标的长期配置。
摘要:This paper proposes a portfolio construction framework designed to remain robust under estimation error, non-stationarity, and realistic trading constraints. The methodology combines dynamic asset eligibility, deterministic rebalancing, and bounded multi-factor tilts applied to an equal-weight baseline. Asset eligibility is formalized as a state-dependent constraint on portfolio construction, allowing factor exposure to adjust endogenously in response to observable market conditions such as liquidity, volatility, and cross-sectional breadth. Rather than estimating expected returns or covariances, the framework relies on cross-sectional rankings and hard structural bounds to control concentration, turnover, and fragility. The resulting approach is fully algorithmic, transparent, and directly implementable. It provides a robustness-oriented alternative to parametric optimization and unconstrained multi-factor models, particularly suited for long-horizon allocations where stability and operational feasibility are primary objectives.
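摘要中"等权基线 + 横截面排名 + 有界倾斜"的组合可以用几行代码示意:把各因子做横截面排名并映射到[-1, 1],取平均得到倾斜信号,在等权之上做有界缩放后归一化。倾斜幅度tilt与偏离上限cap等参数均为本示例的假设。

```python
import numpy as np

def tilted_weights(factor_scores, tilt=0.2, cap=2.0):
    """等权基线上的有界多因子倾斜(示意实现):
    各因子做横截面排名并映射到 [-1, 1],取平均得到综合倾斜信号 s,
    权重 w_i ∝ clip(1 + tilt·s_i, 1/cap, cap),再归一化为满仓无杠杆。"""
    n, _ = factor_scores.shape
    ranks = factor_scores.argsort(axis=0).argsort(axis=0)   # 逐因子横截面排名
    signal = (2 * ranks / (n - 1) - 1).mean(axis=1)         # 映射到 [-1,1] 后取平均
    w = np.clip(1 + tilt * signal, 1 / cap, cap) / n        # 对等权的有界偏离
    return w / w.sum()                                      # 归一化,权重和为 1

rng = np.random.default_rng(0)
scores = rng.normal(size=(20, 3))      # 20 只资产、3 个因子的截面得分
w = tilted_weights(scores)
```

只用排名而不用预期收益估计,正是摘要所强调的面向稳健性的设计取向。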


【13】Archetypal cases for questionnaires with nominal multiple choice questions
标题:带有名义多项选择题的问卷的原型案例
链接:https://arxiv.org/abs/2601.05392

作者:Aleix Alcacer,Irene Epifanio
备注:Statistical Methods for Data Analysis and Decision Sciences. Third Conference of the Statistics and Data Science Group of the Italian Statistical Society. Milan, April 2-3, 2025
摘要:原型分析是一种探索性工具,把一组观测解释为纯(极端)模式的凸组合。当这些模式对应于样本中的真实观测时,它们被称为原型样本(archetypoids)。我们首次提出将原型样本分析应用于名义观测,特别是从包含单选的名义多项选择题的问卷中识别原型案例。与其在多元数据中的应用类似,这一做法可以增强我们对名义数据集的理解。我们将该方法与原型分析和概率原型分析进行比较,并用一个真实例子(德国信贷数据集)展示这一方法的好处。
摘要:Archetypal analysis serves as an exploratory tool that interprets a collection of observations as convex combinations of pure (extreme) patterns. When these patterns correspond to actual observations within the sample, they are termed archetypoids. For the first time, we propose applying archetypoid analysis to nominal observations, specifically for identifying archetypal cases from questionnaires featuring nominal multiple-choice questions with a single possible answer. This approach can enhance our understanding of a nominal data set, similar to its application in multivariate contexts. We compare this methodology with the use of archetype analysis and probabilistic archetypal analysis and demonstrate the benefits of this methodology using a real-world example: the German credit dataset.


机器翻译由腾讯交互翻译提供,仅供参考

点击“阅读原文”获取带摘要的学术速递

Python社区是高质量的Python/Django开发社区
本文地址:http://www.python88.com/topic/191595