cs.LG: 184 papers in total today
Large-model related (27 papers)
【1】Large Language Models Generate Harmful Content Using a Distinct, Unified Mechanism
Link: https://arxiv.org/abs/2604.09544
Authors: Hadas Orgad, Boyi Wei, Kaden Zheng, Martin Wattenberg, Peter Henderson, Seraphina Goldfarb-Tarrant, Yonatan Belinkov
Abstract: Large language models (LLMs) undergo alignment training to avoid harmful behaviors, yet the resulting safeguards remain brittle: jailbreaks routinely bypass them, and fine-tuning on narrow domains can induce "emergent misalignment" that generalizes broadly. Whether this brittleness reflects a fundamental lack of coherent internal organization for harmfulness remains unclear. Here we use targeted weight pruning as a causal intervention to probe the internal organization of harmfulness in LLMs. We find that harmful content generation depends on a compact set of weights that are general across harm types and distinct from benign capabilities. Aligned models exhibit a greater compression of harm generation weights than unaligned counterparts, indicating that alignment reshapes harmful representations internally, despite the brittleness of safety guardrails at the surface level. This compression explains emergent misalignment: if weights of harmful capabilities are compressed, fine-tuning that engages these weights in one domain can trigger broad misalignment. Consistent with this, pruning harm generation weights in a narrow domain substantially reduces emergent misalignment. Notably, LLMs' harmful generation capability is dissociated from how they recognize and explain such content. Together, these results reveal a coherent internal structure for harmfulness in LLMs that may serve as a foundation for more principled approaches to safety.
【2】Automated Instruction Revision (AIR): A Structured Comparison of Task Adaptation Strategies for LLM
Link: https://arxiv.org/abs/2604.09418
Authors: Solomiia Bilyk, Volodymyr Getmanskyi, Taras Firman
Abstract: This paper studies Automated Instruction Revision (AIR), a rule-induction-based method for adapting large language models (LLMs) to downstream tasks using limited task-specific examples. We position AIR within the broader landscape of adaptation strategies, including prompt optimization, retrieval-based methods, and fine-tuning. We then compare these approaches across a diverse benchmark suite designed to stress different task requirements, such as knowledge injection, structured extraction, label remapping, and logical reasoning. The paper argues that adaptation performance is strongly task-dependent: no single method dominates across all settings. Across five benchmarks, AIR was strongest or near-best on label-remapping classification, while KNN retrieval performed best on closed-book QA, and fine-tuning dominated structured extraction and event-order reasoning. AIR is most promising when task behavior can be captured by compact, interpretable instruction rules, while retrieval and fine-tuning remain stronger in tasks dominated by source-specific knowledge or dataset-specific annotation regularities.
【3】DiffHLS: Differential Learning for High-Level Synthesis QoR Prediction with GNNs and LLM Code Embeddings
Link: https://arxiv.org/abs/2604.09240
Authors: Zedong Peng, Zeju Li, Qiang Xu, Jieru Zhao
Abstract: High-Level Synthesis (HLS) compiles C/C++ into RTL, but exploring pragma-driven optimization choices remains expensive because each design point requires time-consuming synthesis. We propose DiffHLS, a differential learning framework for HLS Quality-of-Result (QoR) prediction that learns from kernel-design pairs: a kernel baseline and a pragma-inserted design variant. DiffHLS encodes kernel and design intermediate-representation graphs with dedicated graph neural network (GNN) branches, and augments the delta pathway with code embeddings from a pretrained code large language model (LLM). Instead of regressing absolute targets directly, we jointly predict the kernel baseline and the design-induced delta, and compose them to obtain the design prediction. On PolyBench, DiffHLS attains lower average MAPE than GNN baselines under four GNN backbones, and LLM code embeddings consistently improve over a GNN-only ablation. We further validate scalability on the ForgeHLS dataset.
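The differential formulation in the DiffHLS abstract can be illustrated with a small sketch: a design prediction is composed from a predicted kernel baseline and a predicted design-induced delta, then scored with MAPE. The additive composition, the function names, and all numbers below are assumptions made for illustration; the abstract does not state how the two predictions are composed.

```python
def compose_design_prediction(baseline_pred, delta_pred):
    """Combine a kernel-baseline prediction with a design-induced delta.

    Assumption: the two are composed additively; the paper may use a
    different (e.g. multiplicative or log-space) composition.
    """
    return baseline_pred + delta_pred


def mape(y_true, y_pred):
    """Mean absolute percentage error, expressed in percent."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs((t - p) / t) for t, p in zip(y_true, y_pred))
    return 100.0 * total / len(y_true)


# Hypothetical latency values (cycles) for two pragma-inserted designs.
baselines = [1000.0, 2000.0]   # predicted kernel baselines
deltas = [-200.0, 500.0]       # predicted design-induced deltas
targets = [800.0, 2400.0]      # made-up ground-truth QoR values

preds = [compose_design_prediction(b, d) for b, d in zip(baselines, deltas)]
error = mape(targets, preds)
```

Here the first design is predicted exactly and the second is off by 100 cycles, so the average error is a little over 2%.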
【4】Do LLMs Follow Their Own Rules? A Reflexive Audit of Self-Stated Safety Policies
Link: https://arxiv.org/abs/2604.09189
Authors: Avni Mittal
Abstract: LLMs internalize safety policies through RLHF, yet these policies are never formally specified and remain difficult to inspect. Existing benchmarks evaluate models against external standards but do not measure whether models understand and enforce their own stated boundaries. We introduce the Symbolic-Neural Consistency Audit (SNCA), a framework that (1) extracts a model's self-stated safety rules via structured prompts, (2) formalizes them as typed predicates (Absolute, Conditional, Adaptive), and (3) measures behavioral compliance via deterministic comparison against harm benchmarks. Evaluating four frontier models across 45 harm categories and 47,496 observations reveals systematic gaps between stated policy and observed behavior: models claiming absolute refusal frequently comply with harmful prompts, reasoning models achieve the highest self-consistency but fail to articulate policies for 29% of categories, and cross-model agreement on rule types is remarkably low (11%). These results demonstrate that the gap between what LLMs say and what they do is measurable and architecture-dependent, motivating reflexive consistency audits as a complement to behavioral benchmarks.
【5】The nextAI Solution to the NeurIPS 2023 LLM Efficiency Challenge
Link: https://arxiv.org/abs/2604.09034
Authors: Gyuwon Park, DongIl Shin, SolGil Oh, SangGi Ryu, Byung-Hak Kim
Abstract: The rapid evolution of Large Language Models (LLMs) has significantly impacted the field of natural language processing, but their growing complexity raises concerns about resource usage and transparency. Addressing these challenges, we participated in the NeurIPS LLM Efficiency Challenge, aiming to fine-tune a foundation model within stringent constraints. Our focus was the LLaMa2 70-billion-parameter model, optimized on a single A100 40GB GPU within a 24-hour limit. Our methodology hinged on a custom dataset, carefully assembled from diverse open-source resources and benchmark tests, aligned with the challenge's open-source ethos. Our approach leveraged Quantized Low-Rank Adaptation (QLoRA) fine-tuning, integrated with advanced attention mechanisms like Flash Attention 2. We experimented with various configurations of the LoRA technique, optimizing the balance between computational efficiency and model accuracy. Our fine-tuning strategy was underpinned by the creation and iterative testing of multiple dataset compositions, leading to the selection of a version that demonstrated robust performance across diverse tasks and benchmarks. The culmination of our efforts was an efficiently fine-tuned LLaMa2 70B model that operated within the constraints of a single GPU, showcasing not only a significant reduction in resource utilization but also high accuracy across a range of QA benchmarks. Our study serves as a testament to the feasibility of optimizing large-scale models in resource-constrained environments, emphasizing the potential of LLMs in real-world applications.
【6】Leave My Images Alone: Preventing Multi-Modal Large Language Models from Analyzing Images via Visual Prompt Injection
Link: https://arxiv.org/abs/2604.09024
Authors: Zedian Shao, Hongbin Liu, Yuepeng Hu, Neil Zhenqiang Gong
Comments: Appeared in ACL 2026 main conference
Abstract: Multi-modal large language models (MLLMs) have emerged as powerful tools for analyzing Internet-scale image data, offering significant benefits but also raising critical safety and societal concerns. In particular, open-weight MLLMs may be misused to extract sensitive information from personal images at scale, such as identities, locations, or other private details. In this work, we propose ImageProtector, a user-side method that proactively protects images before sharing by embedding a carefully crafted, nearly imperceptible perturbation that acts as a visual prompt injection attack on MLLMs. As a result, when an adversary analyzes a protected image with an MLLM, the MLLM is consistently induced to generate a refusal response such as "I'm sorry, I can't help with that request." We empirically demonstrate the effectiveness of ImageProtector across six MLLMs and four datasets. Additionally, we evaluate three potential countermeasures, Gaussian noise, DiffPure, and adversarial training, and show that while they partially mitigate the impact of ImageProtector, they simultaneously degrade model accuracy and/or efficiency. Our study focuses on the practically important setting of open-weight MLLMs and large-scale automated image analysis, and highlights both the promise and the limitations of perturbation-based privacy protection.
【7】Predictive Entropy Links Calibration and Paraphrase Sensitivity in Medical Vision-Language Models
Link: https://arxiv.org/abs/2604.08941
Authors: Binesh Sadanandan, Vahid Behzadan
Abstract: Medical Vision-Language Models (VLMs) suffer from two failure modes that threaten safe deployment: miscalibrated confidence and sensitivity to question rephrasing. We show they share a common cause, proximity to the decision boundary, by benchmarking five uncertainty quantification methods on MedGemma-4B-IT across in-distribution (MIMIC-CXR) and out-of-distribution (PadChest) chest X-ray datasets, with cross-architecture validation on LLaVA-RAD 7B. For well-calibrated single-model methods, predictive entropy from one forward pass predicts which samples will flip under rephrasing (AUROC 0.711 on MedGemma, 0.878 on LLaVA-RAD, p < 10^-4), enabling a single entropy threshold to flag both unreliable and rephrase-sensitive predictions. A five-member LoRA ensemble fails under the MIMIC-to-PadChest shift (42.9 ECE, 34.1 accuracy), though LLaVA-RAD's ensemble does not collapse (69.1). MC Dropout achieves the best calibration (ECE 4.3) and selective prediction coverage (21.5 at 5% risk), yet total entropy from a single forward pass outperforms the ensemble for both error detection (AUROC 0.743 vs 0.657) and paraphrase screening. Simple methods win.
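The single-threshold screening idea in this abstract can be sketched as follows: compute the predictive entropy of the answer distribution from one forward pass and flag the prediction when the entropy exceeds a threshold. The function names, the logits, and the threshold value are hypothetical; the abstract does not report the threshold the authors use.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predictive_entropy(probs):
    """Shannon entropy (in nats) of a categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def flag_unreliable(logits, threshold):
    """Flag a prediction whose entropy exceeds the (assumed) threshold."""
    return predictive_entropy(softmax(logits)) > threshold


# Hypothetical yes/no answer logits from a single forward pass.
near_boundary = [0.0, 0.0]    # model is undecided between the two answers
confident = [10.0, 0.0]       # model strongly prefers the first answer
THRESHOLD = 0.5               # illustrative value, not from the paper
```

A sample near the decision boundary has entropy close to ln 2 and is flagged, while a confident sample has entropy near zero and passes.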
【8】Beyond Relevance: Utility-Centric Retrieval in the LLM Era
Link: https://arxiv.org/abs/2604.08920
Authors: Hengran Zhang, Minghao Tang, Keping Bi, Jiafeng Guo
Comments: Accepted by SIGIR 2026
Abstract: Information retrieval systems have traditionally optimized for topical relevance, the degree to which retrieved documents match a query. However, relevance only approximates a deeper goal: utility, namely, whether retrieved information helps accomplish a user's underlying task. The emergence of retrieval-augmented generation (RAG) fundamentally changes this paradigm. Retrieved documents are no longer consumed directly by users but instead serve as evidence for large language models (LLMs) that produce answers. As a result, retrieval effectiveness must be evaluated by its contribution to generation quality rather than by relevance-based ranking metrics alone. This tutorial argues that retrieval objectives are evolving from relevance-centric optimization toward LLM-centric utility. We present a unified framework covering LLM-agnostic versus LLM-specific utility, context-independent versus context-dependent utility, and the connection with LLM information needs and agentic RAG. By synthesizing recent advances, the tutorial provides conceptual foundations and practical guidance for designing retrieval systems aligned with the requirements of LLM-based information access.
【9】Uncertainty-Aware Transformers: Conformal Prediction for Language Models
Link: https://arxiv.org/abs/2604.08885
Authors: Abhiram Vellore, Niraj K. Jha
Abstract: Transformers have had a profound impact on the field of artificial intelligence, especially on large language models and their variants. However, as was the case with neural networks, their black-box nature limits trust and deployment in high-stakes settings. For models to be genuinely useful and trustworthy in critical applications, they must provide more than just predictions: they must supply users with a clear understanding of the reasoning that underpins their decisions. This article presents an uncertainty quantification framework for transformer-based language models. This framework, called CONFIDE (CONformal prediction for FIne-tuned DEep language models), applies conformal prediction to the internal embeddings of encoder-only architectures, like BERT and RoBERTa, while enabling hyperparameter tuning. CONFIDE uses either [CLS] token embeddings or flattened hidden states to construct class-conditional nonconformity scores, enabling statistically valid prediction sets with instance-level explanations. Empirically, CONFIDE improves test accuracy by up to 4.09% on BERT-tiny and achieves greater correct efficiency (i.e., the expected size of the prediction set conditioned on it containing the true label) compared to prior methods, including NM2 and VanillaNN. We show that early and intermediate transformer layers often yield better-calibrated and more semantically meaningful representations for conformal prediction. In resource-constrained models and high-stakes tasks with ambiguous labels, CONFIDE offers robustness and interpretability where softmax-based uncertainty fails. We position CONFIDE as a framework for practical diagnostics and efficiency/robustness improvement over prior conformal baselines.
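For readers unfamiliar with class-conditional conformal prediction, a generic split-conformal sketch may help. It is not CONFIDE itself: CONFIDE builds its nonconformity scores from internal embeddings, whereas this sketch assumes the simpler score 1 - p(true class), and all names and calibration data are made up.

```python
import math


def class_conditional_thresholds(cal_probs, cal_labels, alpha):
    """Per-class conformal thresholds from a held-out calibration set.

    Nonconformity score (an assumption for this sketch): 1 - p(true class).
    """
    scores = {}
    for probs, label in zip(cal_probs, cal_labels):
        scores.setdefault(label, []).append(1.0 - probs[label])
    thresholds = {}
    for label, s in scores.items():
        s.sort()
        n = len(s)
        k = math.ceil((n + 1) * (1.0 - alpha))  # conformal rank
        # With too few calibration points the threshold is infinite,
        # i.e. the class is always included.
        thresholds[label] = s[k - 1] if k <= n else float("inf")
    return thresholds


def prediction_set(probs, thresholds):
    """Labels whose nonconformity score is within the class threshold."""
    return sorted(y for y, t in thresholds.items() if 1.0 - probs[y] <= t)


# Made-up calibration data: predicted probabilities and true labels.
cal_probs = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.6, 0.4],
             [0.1, 0.9], [0.2, 0.8], [0.3, 0.7], [0.4, 0.6]]
cal_labels = [0, 0, 0, 0, 1, 1, 1, 1]

thresholds = class_conditional_thresholds(cal_probs, cal_labels, alpha=0.25)
confident_set = prediction_set([0.65, 0.35], thresholds)
```

Calibrating each class separately is what makes the coverage guarantee class-conditional; a smaller alpha widens the sets, and with only four calibration points per class, alpha = 0.1 already forces every class into every set.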
【10】Dictionary-Aligned Concept Control for Safeguarding Multimodal LLMs
Link: https://arxiv.org/abs/2604.08846
Authors: Jinqi Luo, Jinyu Yang, Tal Neiman, Lei Fan, Bing Yin, Son Tran, Mubarak Shah, René Vidal
Comments: Accepted at CVPR 2026. Project page: https://peterljq.github.io/project/daco
Abstract: Multimodal Large Language Models (MLLMs) have been shown to be vulnerable to malicious queries that can elicit unsafe responses. Recent work uses prompt engineering, response classification, or finetuning to improve MLLM safety. Nevertheless, such approaches are often ineffective against evolving malicious patterns, may require rerunning the query, or demand heavy computational resources. Steering the activations of a frozen model at inference time has recently emerged as a flexible and effective solution. However, existing steering methods for MLLMs typically handle only a narrow set of safety-related concepts or struggle to adjust specific concepts without affecting others. To address these challenges, we introduce Dictionary-Aligned Concept Control (DACO), a framework that utilizes a curated concept dictionary and a Sparse Autoencoder (SAE) to provide granular control over MLLM activations. First, we curate a dictionary of 15,000 multimodal concepts by retrieving over 400,000 caption-image stimuli and summarizing their activations into concept directions. We name the dataset DACO-400K. Second, we show that the curated dictionary can be used to intervene on activations via sparse coding. Third, we propose a new steering approach that uses our dictionary to initialize the training of an SAE and automatically annotate the semantics of the SAE atoms for safeguarding MLLMs. Experiments on multiple MLLMs (e.g., QwenVL, LLaVA, InternVL) across safety benchmarks (e.g., MM-SafetyBench, JailBreakV) show that DACO significantly improves MLLM safety while maintaining general-purpose capabilities.
【11】HiFloat4 Format for Language Model Pre-training on Ascend NPUs
Link: https://arxiv.org/abs/2604.08826
Authors: Mehran Taghian, Yunke Peng, Xing Huang, Yao Wang, Yaoyuan Wang, Wei Guo, Yuanyong Luo, Tianchi Hu, Junsong Wang, Xin Wang, Hu Liu, Yu Cheng, Ziwei Yu, Hongliang Li, Mehdi Rahimifar, Lei Yan, Xuefei Wang, Zhuang Ma, Lei Liu, Hui Yu, Anandharaju Durai Raju, Hoang Le, Hei Yi Mak, Tanzila Rahman, Shadan Golestan
Abstract: Large foundation models have become central to modern machine learning, with performance scaling predictably with model size and data. However, training and deploying such models incur substantial computational and memory costs, motivating the development of low-precision training techniques. Recent work has demonstrated that 4-bit floating-point (FP4) formats, such as MXFP4 and NVFP4, can be successfully applied to linear GEMM operations in large language models (LLMs), achieving up to 4x improvements in compute throughput and memory efficiency compared to higher-precision baselines. In this work, we investigate the recently proposed HiFloat4 FP4 format for Huawei Ascend NPUs and systematically compare it with MXFP4 in large-scale training settings. All experiments are conducted on Ascend NPU clusters, with linear and expert GEMM operations performed entirely in FP4 precision. We evaluate both dense architectures (e.g., Pangu and LLaMA-style models) and mixture-of-experts (MoE) models, where both standard linear layers and expert-specific GEMMs operate in FP4. Furthermore, we explore stabilization techniques tailored to FP4 training that significantly reduce numerical degradation, maintaining relative error within 1% of full-precision baselines while preserving the efficiency benefits of 4-bit computation. Our results provide a comprehensive empirical study of FP4 training on NPUs and highlight the practical trade-offs between FP4 formats in large-scale dense and MoE models.
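To give a feel for what 4-bit floating-point quantization does to a block of values, here is a minimal MXFP4-style sketch: each block shares a power-of-two scale, and every element is rounded to the nearest E2M1 magnitude. It does not implement HiFloat4, whose encoding the abstract does not describe, and the scale rule and tie-breaking are simplifying assumptions.

```python
import math

# Non-negative magnitudes representable in the E2M1 (FP4) element format.
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def quantize_block(block):
    """Quantize and dequantize one block with a shared power-of-two scale.

    Assumptions for this sketch: the scale is 2**(floor(log2(amax)) - 2),
    values beyond the largest grid point saturate at 6.0, and rounding
    ties go to the smaller magnitude (real hardware rounding may differ).
    """
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return [0.0] * len(block), 1.0
    scale = 2.0 ** (math.floor(math.log2(amax)) - 2)
    dequantized = []
    for x in block:
        mag = min(abs(x) / scale, E2M1_GRID[-1])              # saturate
        q = min(E2M1_GRID, key=lambda g: abs(g - mag))        # nearest point
        dequantized.append(math.copysign(q, x) * scale)
    return dequantized, scale


# Hypothetical block of weights (real MXFP4 blocks hold 32 elements).
values, scale = quantize_block([1.0, -0.3, 0.26, 3.0])
```

In this example the largest element is preserved exactly, while the small elements snap to 0.25 in magnitude. Error concentrated in small values like this is the kind of numerical degradation that FP4 stabilization techniques aim to reduce.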
【12】Adaptive Simulation Experiment for LLM Policy Optimization
Link: https://arxiv.org/abs/2604.08779
Authors: Mingjie Hu, Siyang Gao, Jian-qiang Hu, Enlu Zhou
Abstract: Large language models (LLMs) have significant potential to improve operational efficiency in operations management. Deploying these models requires specifying a policy that governs response quality, shapes user experience, and influences operational value. In this research, we treat LLMs as stochastic simulators and propose a pairwise comparison-based adaptive simulation experiment framework for identifying the optimal policy from a finite set of candidates. We consider two policy spaces: an unstructured space with no parametric assumption, and a structured space in which the data are generated from a preference model. For both settings, we characterize the fundamental data requirements for identifying the optimal policy with high probability. In the unstructured case, we derive a closed-form expression for the optimal sampling proportions, together with a clear operational interpretation. In the structured case, we formulate a regularized convex program to compute the optimal proportions. We then develop an adaptive experimental procedure, termed LLM-PO, for both policy spaces, and prove that it identifies the optimal policy with the desired statistical guarantee while asymptotically attaining the fundamental data requirements. Numerical experiments demonstrate that LLM-PO consistently outperforms benchmark methods and improves LLM performance.
【13】Every Response Counts: Quantifying Uncertainty of LLM-based Multi-Agent Systems through Tensor Decomposition
Link: https://arxiv.org/abs/2604.08708
Authors: Tiejin Chen, Huaiyuan Yao, Jia Chen, Evangelos E. Papalexakis, Hua Wei
Comments: Accepted to ACL 2026
Abstract: While Large Language Model-based Multi-Agent Systems (MAS) consistently outperform single-agent systems on complex tasks, their intricate interactions introduce critical reliability challenges arising from communication dynamics and role dependencies. Existing Uncertainty Quantification methods, typically designed for single-turn outputs, fail to address the unique complexities of MAS. Specifically, these methods struggle with three distinct challenges: the cascading uncertainty in multi-step reasoning, the variability of inter-agent communication paths, and the diversity of communication topologies. To bridge this gap, we introduce MATU, a novel framework that quantifies uncertainty through tensor decomposition. MATU moves beyond analyzing final text outputs by representing entire reasoning trajectories as embedding matrices and organizing multiple execution runs into a higher-order tensor. By applying tensor decomposition, we disentangle and quantify distinct sources of uncertainty, offering a comprehensive reliability measure that is generalizable across different agent structures. We provide comprehensive experiments to show that MATU effectively estimates holistic and robust uncertainty across diverse tasks and communication topologies.
【14】Efficient RL Training for LLMs with Experience Replay
Link: https://arxiv.org/abs/2604.08706
Authors: Charles Arnal, Vivien Cabannes, Taco Cohen, Julia Kempe, Remi Munos
Abstract: While Experience Replay (the practice of storing rollouts and reusing them multiple times during training) is a foundational technique in general RL, it remains largely unexplored in LLM post-training due to the prevailing belief that fresh, on-policy data is essential for high performance. In this work, we challenge this assumption. We present a systematic study of replay buffers for LLM post-training, formalizing the optimal design as a trade-off between staleness-induced variance, sample diversity, and the high computational cost of generation. We show that strict on-policy sampling is suboptimal when generation is expensive. Empirically, we show that a well-designed replay buffer can drastically reduce inference compute without degrading, and in some cases even improving, final model performance, while preserving policy entropy.
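The replay-buffer idea can be sketched in a few lines: rollouts are stored together with the policy version that generated them, old entries are evicted when capacity is reached, and sampling skips entries that are too stale. The class, its parameters, and the staleness rule are hypothetical illustrations of the trade-off named in the abstract, not the authors' design.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity rollout store with a staleness cut-off (assumed design)."""

    def __init__(self, capacity, max_staleness):
        # deque(maxlen=...) silently evicts the oldest rollout when full.
        self.items = deque(maxlen=capacity)
        self.max_staleness = max_staleness

    def add(self, rollout, policy_version):
        """Store a rollout tagged with the policy version that produced it."""
        self.items.append((policy_version, rollout))

    def sample(self, batch_size, current_version, rng=random):
        """Sample without replacement among sufficiently fresh rollouts."""
        fresh = [r for v, r in self.items
                 if current_version - v <= self.max_staleness]
        return rng.sample(fresh, min(batch_size, len(fresh)))


# Hypothetical usage: one rollout per policy version 0..5.
buffer = ReplayBuffer(capacity=4, max_staleness=2)
for version in range(6):
    buffer.add(f"r{version}", policy_version=version)

batch = buffer.sample(batch_size=10, current_version=5)
```

Raising `max_staleness` reuses more generated rollouts per training step (less inference compute) at the cost of training on data further from the current policy.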
【15】QoS-QoE Translation with Large Language Model
Link: https://arxiv.org/abs/2604.08703
Authors: Yingjie Yu, Mingyuan Wu, Ahmadreza Eslaminia, Lingzhi Zhao, Kaizhuo Yan, Klara Nahrstedt
Abstract: QoS-QoE translation is a fundamental problem in multimedia systems because it characterizes how measurable system and network conditions affect user-perceived experience. Although many prior studies have examined this relationship, their findings are often developed for specific setups and remain scattered across papers, experimental settings, and reporting formats, limiting systematic reuse, cross-scenario generalization, and large-scale analysis. To address this gap, we first introduce the QoS-QoE Translation dataset, a source-grounded dataset of structured QoS-QoE relationships from the multimedia literature, with a focus on video-streaming-related tasks. We construct the dataset through an automated pipeline that combines paper curation, QoS-QoE relationship extraction, and iterative data evaluation. Each record preserves the extracted relationship together with parameter definitions, supporting evidence, and contextual metadata. We further evaluate the capability of large language models (LLMs) on QoS-QoE translation, both before and after supervised fine-tuning on our dataset, and show strong performance on both continuous-value and discrete-label prediction in bidirectional translation, from QoS to QoE and from QoE to QoS. Our dataset provides a foundation for benchmarking LLMs in QoS-QoE translation and for supporting future LLM-based reasoning for multimedia quality prediction and optimization. The complete dataset and code are publicly available at https://yyu6969.github.io/qos-qoe-translation-page/, for full reproducibility and open access.
【16】EvoLen: Evolution-Guided Tokenization for DNA Language Model
Link: https://arxiv.org/abs/2604.08698
Authors: Nan Huang, Xiaoxiao Zhou, Junxia Cui, Mario Tapia-Pacheco, Tiffany Amariuta, Yang Li, Jingbo Shang
Abstract: Tokens serve as the basic units of representation in DNA language models (DNALMs), yet their design remains underexplored. Unlike natural language, DNA lacks inherent token boundaries or predefined compositional rules, making tokenization a fundamental modeling decision rather than a naturally specified one. While existing approaches like byte-pair encoding (BPE) excel at capturing token structures that reflect human-generated linguistic regularities, DNA is organized by biological function and evolutionary constraint rather than linguistic convention. We argue that DNA tokenization should prioritize functional sequence patterns like regulatory motifs: short, recurring segments that are under evolutionary constraint and typically preserved across species. We incorporate evolutionary information directly into the tokenization process through EvoLen, a tokenizer that combines evolutionary stratification with length-aware decoding to better preserve motif-scale functional sequence units. EvoLen uses cross-species evolutionary signals to group DNA sequences, trains separate BPE tokenizers on each group, merges the resulting vocabularies via a rule prioritizing preserved patterns, and applies length-aware decoding with dynamic programming. Through controlled experiments, EvoLen improves the preservation of functional sequence patterns, differentiation across genomic contexts, and alignment with evolutionary constraint, while matching or outperforming standard BPE across diverse DNALM benchmarks. These results demonstrate that tokenization introduces a critical inductive bias and that incorporating evolutionary information yields more biologically meaningful and interpretable sequence representations.
【17】3D-VCD: Hallucination Mitigation in 3D-LLM Embodied Agents through Visual Contrastive Decoding
标题:3D-VCD:通过视觉对比解码缓解3D-LLM具身智能体的幻觉
链接:https://arxiv.org/abs/2604.08645
作者:Makanjuola Ogunleye,Eman Abdelrahman,Ismini Lourentzou
备注:8 pages, 6 figures, Accepted at IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2026
摘要:大型多模态模型越来越多地被用作在3D环境中运行的具身智能体的推理核心,但它们仍然容易产生幻觉,从而导致不安全且缺乏依据的决策。现有的推理时幻觉缓解方法主要针对2D视觉语言场景,无法迁移到具身3D推理;在后者中,失败源于物体存在性、空间布局和几何定位,而不是像素级不一致。我们提出3D-VCD,这是首个用于缓解3D具身智能体幻觉的推理时视觉对比解码框架。3D-VCD通过对以物体为中心的表示施加语义和几何扰动(例如类别替换以及坐标或尺寸破坏)来构造扭曲的3D场景图。通过对比原始与扭曲3D上下文下的预测,我们的方法抑制那些对有依据的场景证据不敏感、因而很可能由语言先验驱动的词元。我们在3D-POPE和HEAL基准上评估了3D-VCD,结果表明它无需任何再训练即可持续提升有依据的推理能力,从而确立了在结构化3D表示上进行推理时对比解码是通往更可靠具身智能的一条有效且实用的途径。
摘要:Large multimodal models are increasingly used as the reasoning core of embodied agents operating in 3D environments, yet they remain prone to hallucinations that can produce unsafe and ungrounded decisions. Existing inference-time hallucination mitigation methods largely target 2D vision-language settings and do not transfer to embodied 3D reasoning, where failures arise from object presence, spatial layout, and geometric grounding rather than pixel-level inconsistencies. We introduce 3D-VCD, the first inference-time visual contrastive decoding framework for hallucination mitigation in 3D embodied agents. 3D-VCD constructs a distorted 3D scene graph by applying semantic and geometric perturbations to object-centric representations, such as category substitutions and coordinate or extent corruption. By contrasting predictions under the original and distorted 3D contexts, our method suppresses tokens that are insensitive to grounded scene evidence and are therefore likely driven by language priors. We evaluate 3D-VCD on the 3D-POPE and HEAL benchmarks and show that it consistently improves grounded reasoning without any retraining, establishing inference-time contrastive decoding over structured 3D representations as an effective and practical route to more reliable embodied intelligence.
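Editor's sketch: the contrast between predictions under the original and the distorted scene can be illustrated as a logit adjustment. The abstract does not give 3D-VCD's exact formula, so the code below assumes the form commonly used in visual contrastive decoding, `(1 + alpha) * orig - alpha * distorted`. The function names, `alpha`, and the toy logits are all hypothetical.

```python
import math

def contrastive_logits(orig, distorted, alpha=1.0):
    """Boost tokens supported by the true scene; penalize tokens that
    stay likely even when the scene is distorted.

    Assumed form (not confirmed by the abstract):
    (1 + alpha) * orig - alpha * distorted.
    """
    return [(1 + alpha) * o - alpha * d for o, d in zip(orig, distorted)]

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits over three candidate tokens.
# Token 0 is equally likely with or without the true scene (language prior).
# Token 1 is likely only when the true scene graph is given (grounded).
orig_logits = [2.0, 2.0, 0.0]
distorted_logits = [2.0, 0.0, 0.0]

adjusted = contrastive_logits(orig_logits, distorted_logits, alpha=1.0)
probs = softmax(adjusted)
best = max(range(len(probs)), key=probs.__getitem__)
```

In this toy case the two leading tokens tie under the original context, and the contrast breaks the tie in favour of the token that depends on the scene evidence.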
【18】AlphaLab: Autonomous Multi-Agent Research Across Optimization Domains with Frontier LLMs
标题:AlphaLab:利用前沿LLM进行跨优化领域的自主多智能体研究
链接:https://arxiv.org/abs/2604.08590
作者:Brendan R. Hogan,Xiwen Chen,James T. Wilson,Kashif Rasul,Adel Boyarsky,Thomas Kamei,Anderson Schneider,Yuriy Nevmyvaka
备注:43 pages, 12 figures
摘要:我们提出AlphaLab,这是一个自主研究框架,利用前沿LLM的智能体能力,在定量、计算密集型领域自动完成整个实验周期。仅给定一个数据集和一个自然语言目标,AlphaLab在无人干预的情况下经历三个阶段:(1)适应领域并探索数据,编写分析代码并生成研究报告;(2)构建并对抗性地验证自己的评估框架;(3)通过Strategist/Worker循环运行大规模GPU实验,并在持久化的playbook中积累领域知识,这相当于一种在线提示优化。所有领域特定的行为都被分解到由模型自身生成的适配器中,因此同一流水线无需修改即可处理性质不同的任务。我们使用两个前沿LLM(GPT-5.2和Claude Opus 4.6)在三个领域评估AlphaLab:CUDA内核优化,其编写的GPU内核平均比torch.compile快4.4倍(最高91倍);LLM预训练,完整系统的验证损失比使用相同模型的单次基线低22%;以及交通预测,在调研并实现文献中已发表的模型族之后,它比标准基线高出23-25%。两个模型在每个领域都发现了性质不同的解决方案(没有一方全面占优),这表明多模型并行探索可以提供互补的搜索覆盖。我们还在附录中报告了金融时间序列预测的结果,并在https://brendanhogan.github.io/alphalab-paper/上发布了所有代码。
摘要:We present AlphaLab, an autonomous research harness that leverages frontier LLM agentic capabilities to automate the full experimental cycle in quantitative, computation-intensive domains. Given only a dataset and a natural-language objective, AlphaLab proceeds through three phases without human intervention: (1) it adapts to the domain and explores the data, writing analysis code and producing a research report; (2) it constructs and adversarially validates its own evaluation framework; and (3) it runs large-scale GPU experiments via a Strategist/Worker loop, accumulating domain knowledge in a persistent playbook that functions as a form of online prompt optimization. All domain-specific behavior is factored into adapters generated by the model itself, so the same pipeline handles qualitatively different tasks without modification. We evaluate AlphaLab with two frontier LLMs (GPT-5.2 and Claude Opus 4.6) on three domains: CUDA kernel optimization, where it writes GPU kernels that run 4.4x faster than torch.compile on average (up to 91x); LLM pretraining, where the full system achieves 22% lower validation loss than a single-shot baseline using the same model; and traffic forecasting, where it beats standard baselines by 23-25% after researching and implementing published model families from the literature. The two models discover qualitatively different solutions in every domain (neither dominates uniformly), suggesting that multi-model campaigns provide complementary search coverage. We additionally report results on financial time series forecasting in the appendix, and release all code at https://brendanhogan.github.io/alphalab-paper/.
【19】Act or Escalate? Evaluating Escalation Behavior in Automation with Language Models
标题:行动还是升级?使用语言模型评估自动化中的升级行为
链接:https://arxiv.org/abs/2604.08588
作者:Matthew DosSantos DiSorbo,Harang Ju
摘要:有效的自动化取决于决定何时行动、何时升级。我们将其建模为不确定性下的决策:LLM形成预测,估计自己正确的概率,并比较行动与升级的预期成本。我们将该框架应用于五个有人类决策记录的领域(需求预测、内容推荐、内容审核、贷款审批和自动驾驶)以及多个模型家族,发现各模型用于权衡这些成本的隐含阈值存在显著差异。这些阈值变化很大,且无法由架构或规模预测,而模型的自我估计则以各自特有的方式存在校准偏差。随后,我们通过改变成本比率、提供准确率信号以及训练模型遵循期望的升级规则,来测试针对这一决策过程的干预措施。提示主要对推理模型有帮助。以思维链为目标的SFT产生了最稳健的策略,这些策略可以泛化到不同的数据集、成本比率、提示框架和留出领域。这些结果表明,升级行为是一种模型特有的属性,应在部署之前加以刻画,并且训练模型显式地推理不确定性和决策成本有助于实现稳健的对齐。
摘要:Effective automation hinges on deciding when to act and when to escalate. We model this as a decision under uncertainty: an LLM forms a prediction, estimates its probability of being correct, and compares the expected costs of acting and escalating. Using this framework across five domains of recorded human decisions (demand forecasting, content recommendation, content moderation, loan approval, and autonomous driving) and across multiple model families, we find marked differences in the implicit thresholds models use to trade off these costs. These thresholds vary substantially and are not predicted by architecture or scale, while self-estimates are miscalibrated in model-specific ways. We then test interventions that target this decision process by varying cost ratios, providing accuracy signals, and training models to follow the desired escalation rule. Prompting helps mainly for reasoning models. SFT on chain-of-thought targets yields the most robust policies, which generalize across datasets, cost ratios, prompt framings, and held-out domains. These results suggest that escalation behavior is a model-specific property that should be characterized before deployment, and that robust alignment benefits from training models to reason explicitly about uncertainty and decision costs.
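Editor's sketch: the act-or-escalate comparison described in this abstract can be written as a one-line expected-cost rule. The sketch assumes a correct action costs nothing, an incorrect action costs `cost_error`, and escalation costs a fixed `cost_escalate`. These names and the cost model are illustrative assumptions, not the paper's notation.

```python
def should_act(p_correct, cost_error, cost_escalate):
    """Act when the expected cost of acting is below the cost of escalating.

    Assumed cost model: acting costs cost_error only when the prediction
    is wrong; escalating always costs cost_escalate.
    """
    expected_cost_act = (1.0 - p_correct) * cost_error
    return expected_cost_act < cost_escalate

def implied_threshold(cost_error, cost_escalate):
    """Confidence above which acting is cheaper: p* = 1 - cost_escalate / cost_error."""
    return max(0.0, 1.0 - cost_escalate / cost_error)

# Hypothetical costs: a wrong action is five times as costly as an escalation.
act_when_confident = should_act(0.9, cost_error=10.0, cost_escalate=2.0)
act_when_unsure = should_act(0.7, cost_error=10.0, cost_escalate=2.0)
threshold = implied_threshold(cost_error=10.0, cost_escalate=2.0)
```

Under this rule, a model whose observed switch point differs from `implied_threshold` for the stated costs is trading off acting and escalating differently from the cost-optimal policy.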
【20】CSAttention: Centroid-Scoring Attention for Accelerating LLM Inference
标题:CSAttention:用于加速LLM推理的质心评分注意力
链接:https://arxiv.org/abs/2604.08584
作者:Chuxu Song,Zhencan Peng,Jiuqi Wei,Chuanhui Yang
摘要:长上下文LLM越来越依赖面向智能体和领域问答的、可复用的超长预填充提示,这使得注意力计算和KV缓存成为解码阶段的主要瓶颈。稀疏注意力虽然降低了计算和传输成本,但由于查询(Query)与键(Key)之间固有的分布偏移,它往往难以在高稀疏度下保持准确性。我们提出质心评分注意力(CSAttention),这是一种免训练的稀疏注意力方法,针对可复用上下文的高吞吐量服务进行了优化。CSAttention采用面向离线预填充/在线解码场景的以存储换计算策略:它将计算前置到一次性的离线预填充阶段,该开销可以在多个查询之间摊销,同时积极优化每步解码延迟。具体而言,CSAttention在离线预填充期间构建以查询为中心的查找表,其大小在解码期间保持固定,使在线解码能够用高效的查表和GPU友好的分数累加取代全上下文扫描。大量实验表明,CSAttention的准确性与完全注意力几乎相同。在高稀疏度(95%)和长上下文(32K-128K)设置下,CSAttention在模型准确性和推理速度两方面始终优于最先进的稀疏注意力方法,在128K上下文长度下相对于最准确的基线实现了最高4.6倍的推理加速。
摘要:Long-context LLMs increasingly rely on extended, reusable prefill prompts for agents and domain Q&A, pushing attention and KV-cache to become the dominant decode-time bottlenecks. While sparse attention reduces computation and transfer costs, it often struggles to maintain accuracy at high sparsity levels due to the inherent distribution shift between Queries and Keys. We propose Centroid-Scoring Attention (CSAttention), a training-free sparse attention method optimized for high-throughput serving of reusable contexts. CSAttention adopts a storage-for-computation strategy tailored to the offline-prefill/online-decode setting: it front-loads computation into a one-time offline prefill phase that can be amortized across multiple queries, while aggressively optimizing per-step decoding latency. Specifically, CSAttention constructs query-centric lookup tables during offline prefill, whose size remains fixed during decoding, and enables online decoding to replace full-context scans with efficient table lookups and GPU-friendly score accumulation. Extensive experiments demonstrate that CSAttention achieves near-identical accuracy to full attention. Under high sparsity (95%) and long-context settings (32K-128K), CSAttention consistently outperforms state-of-the-art sparse attention methods in both model accuracy and inference speed, achieving up to 4.6x inference speedup over the most accurate baseline at a context length of 128K.
【21】QuanBench+: A Unified Multi-Framework Benchmark for LLM-Based Quantum Code Generation
标题:QuanBench+:基于LLM的量子代码生成的统一多框架基准
链接:https://arxiv.org/abs/2604.08570
作者:Ali Slim,Haydar Hamieh,Jawad Kotaich,Yehya Ghosn,Mahdi Chehimi,Ammar Mohanna,Hasan Abed Al Kader Hammoud,Bernard Ghanem
备注:24 pages total, 25 figures, 5 tables, including supplementary material. Accepted to the ICLR 2026 Workshop on I Can't Believe It's Not Better
摘要:大型语言模型(LLM)越来越多地用于代码生成,但量子代码生成仍主要在单一框架内评估,因此很难将量子推理能力与对框架的熟悉程度区分开。我们提出QuanBench+,这是一个覆盖Qiskit、PennyLane和Cirq的统一基准,包含42个对齐的任务,涵盖量子算法、门分解和态制备。我们使用可执行的功能测试评估模型,报告Pass@1和Pass@5,并对概率性输出采用基于KL散度的验收标准。我们还研究了基于反馈修复后的Pass@1,即模型可以在出现运行时错误或错误答案后修改代码。在各框架中,最强的单次生成得分在Qiskit中达到59.5%,在Cirq中达到54.8%,在PennyLane中达到42.9%;经过基于反馈的修复后,最佳得分分别上升到83.3%、76.2%和66.7%。这些结果显示了明显的进展,但也表明可靠的多框架量子代码生成仍未解决,并且仍然强烈依赖于特定框架的知识。
摘要:Large Language Models (LLMs) are increasingly used for code generation, yet quantum code generation is still evaluated mostly within single frameworks, making it difficult to separate quantum reasoning from framework familiarity. We introduce QuanBench+, a unified benchmark spanning Qiskit, PennyLane, and Cirq, with 42 aligned tasks covering quantum algorithms, gate decomposition, and state preparation. We evaluate models with executable functional tests, report Pass@1 and Pass@5, and use KL-divergence-based acceptance for probabilistic outputs. We additionally study Pass@1 after feedback-based repair, where a model may revise code after a runtime error or wrong answer. Across frameworks, the strongest one-shot scores reach 59.5% in Qiskit, 54.8% in Cirq, and 42.9% in PennyLane; with feedback-based repair, the best scores rise to 83.3%, 76.2%, and 66.7%, respectively. These results show clear progress, but also that reliable multi-framework quantum code generation remains unsolved and still depends strongly on framework-specific knowledge.
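Editor's sketch: the two evaluation ingredients named in this abstract, Pass@k and KL-divergence-based acceptance, can be illustrated as follows. The abstract gives neither the Pass@k estimator nor the KL direction or tolerance, so the sketch assumes the widely used unbiased estimator `1 - C(n-c, k) / C(n, k)`, the direction KL(target || measured), and a hypothetical tolerance `tol=0.05`.

```python
import math

def pass_at_k(n, c, k):
    """Chance that at least one of k samples drawn from n generations passes,
    given that c of the n generations passed (assumed estimator)."""
    if k > n:
        raise ValueError("k must not exceed n")
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats; eps guards against zero entries in q."""
    return sum(pi * math.log(pi / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)

def accept(target, measured, tol=0.05):
    """Accept a probabilistic output if it is close enough to the target.
    The tolerance is a hypothetical value, not taken from the benchmark."""
    return kl_divergence(target, measured) <= tol

# Hypothetical example: an ideal Bell-state distribution over 00, 01, 10, 11.
bell = [0.5, 0.0, 0.0, 0.5]
close_counts = [0.48, 0.01, 0.01, 0.50]   # small sampling noise
skewed_counts = [0.90, 0.00, 0.00, 0.10]  # wrong circuit

ok_close = accept(bell, close_counts)
ok_skewed = accept(bell, skewed_counts)
one_of_five = pass_at_k(n=5, c=1, k=1)
```

A distribution-level check of this kind lets a functional test pass circuits whose measured counts differ from the ideal only by sampling noise.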
【22】Sentiment Classification of Gaza War Headlines: A Comparative Analysis of Large Language Models and Arabic Fine-Tuned BERT Models
标题:加沙战争标题的情绪分类:大型语言模型和阿拉伯语微调BERT模型的比较分析
链接:https://arxiv.org/abs/2604.08566
作者:Amr Eleraqi,Hager H. Mustafa,Abdul Hadi N. Ahmed
备注:45 pages, 6 figures (including diagrams), 8 tables. Dataset available at this https URL . Previously posted at https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/FFENX3
摘要:本研究以2023年加沙战争为例,探讨了不同的人工智能架构如何解释与冲突相关的媒体话语中的情绪。基于Eleraqi 2026年的10,990条阿拉伯语新闻标题语料库,对三种大型语言模型和六种微调后的阿拉伯语BERT模型进行了对比分析。该研究采用了一种认识论方法,将情感分类视为模型架构产生的解释行为,而不是针对单一人类注释的黄金标准来评估准确性。为了量化模型之间的系统差异,该分析采用了信息论和分布度量,包括香农熵、詹森-香农距离和方差分数,用于测量与聚合模型行为的偏差。结果显示,情绪分布的显着和非随机的分歧。微调的BERT模型,特别是MARBERT,表现出对中性分类的强烈偏见,而LLM一直在放大负面情绪,LLaMA-3.1-8B几乎完全崩溃为负面情绪。框架条件分析进一步表明,GPT-4.1调整情感判断与叙事框架一致(例如,人道主义,法律,安全),而其他LLMs显示有限的上下文调制。这些发现表明,模型的选择构成了解释镜头的选择,塑造了冲突叙事如何在算法上框架和情感上评估。这项研究有助于媒体研究和计算社会科学的前景算法的差异作为分析对象,并强调了在战争和危机的背景下,将自动情绪输出作为媒体基调的中立或可互换措施的风险。
摘要:This study examines how different artificial intelligence architectures interpret sentiment in conflict-related media discourse, using the 2023 Gaza War as a case study. Drawing on a corpus of 10,990 Arabic news headlines (Eleraqi 2026), the research conducts a comparative analysis between three large language models and six fine-tuned Arabic BERT models. Rather than evaluating accuracy against a single human-annotated gold standard, the study adopts an epistemological approach that treats sentiment classification as an interpretive act produced by model architectures. To quantify systematic differences across models, the analysis employs information-theoretic and distributional metrics, including Shannon Entropy, Jensen-Shannon Distance, and a Variance Score measuring deviation from aggregate model behavior. The results reveal pronounced and non-random divergence in sentiment distributions. Fine-tuned BERT models, particularly MARBERT, exhibit a strong bias toward neutral classifications, while LLMs consistently amplify negative sentiment, with LLaMA-3.1-8B showing near-total collapse into negativity. Frame-conditioned analysis further demonstrates that GPT-4.1 adjusts sentiment judgments in line with narrative frames (e.g., humanitarian, legal, security), whereas other LLMs display limited contextual modulation. These findings suggest that the choice of model constitutes a choice of interpretive lens, shaping how conflict narratives are algorithmically framed and emotionally evaluated. The study contributes to media studies and computational social science by foregrounding algorithmic discrepancy as an object of analysis and by highlighting the risks of treating automated sentiment outputs as neutral or interchangeable measures of media tone in contexts of war and crisis.
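Editor's sketch: two of the distributional metrics named in this abstract, Shannon entropy and Jensen-Shannon distance, can be computed in a few lines. The label distributions below are made-up numbers for illustration only and are not results from the study. The base-2 logarithm is an assumption that bounds the distance to [0, 1].

```python
import math

def shannon_entropy(p):
    """Entropy in bits of a discrete distribution p."""
    return -sum(x * math.log2(x) for x in p if x > 0)

def _kl_bits(p, q):
    """KL(p || q) in bits; terms with p_i = 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_distance(p, q):
    """Jensen-Shannon distance: square root of the JS divergence (base 2)."""
    m = [0.5 * (pi + qi) for pi, qi in zip(p, q)]
    jsd = 0.5 * _kl_bits(p, m) + 0.5 * _kl_bits(q, m)
    return math.sqrt(max(jsd, 0.0))

# Hypothetical (negative, neutral, positive) label shares for two models.
model_a = [0.20, 0.70, 0.10]   # leans neutral
model_b = [0.85, 0.10, 0.05]   # leans negative

entropy_a = shannon_entropy(model_a)
entropy_b = shannon_entropy(model_b)
distance_ab = js_distance(model_a, model_b)
```

A low entropy signals a model that concentrates on one label, while a large pairwise distance signals two models that read the same headlines differently.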
【23】Attention-Based Sampler for Diffusion Language Models
标题:基于注意力的扩散语言模型采样器
链接:https://arxiv.org/abs/2604.08564
作者:Yuyan Zhou,Kai Syun Hou,Weiyu Chen,James Kwok
摘要:自回归模型(ARM)已成为语言建模的主导范式。然而,其严格的顺序解码方式对推理效率和建模灵活性都施加了根本性约束。为了解决这些限制,基于扩散的大型语言模型(dLLM)被提出,有望实现并行解码和灵活的语言建模。尽管有这些优点,当前的dLLM解码策略主要依赖词元级信息,无法考虑全局序列结构,往往产生次优结果。本文从对数似然最大化的角度研究解码顺序的选择问题。我们从理论上证明,按照注意力矩阵列和的降序解码词元,可以近似达到最优序列似然。这一发现为注意力引导的解码提供了有原则的依据,并为贪婪搜索提供了一种有理论基础的替代方案。我们将这一理论见解实例化为一种新的免训练解码算法Attn-Sampler,并进一步提出块注意力近似和动态注意力阈值以实现实际加速。在多个基准上的大量实验验证了所提方法的有效性,表明它在提高解码并行度的同时取得了更优的生成质量。
摘要:Auto-regressive models (ARMs) have established a dominant paradigm in language modeling. However, their strictly sequential decoding paradigm imposes fundamental constraints on both inference efficiency and modeling flexibility. To address these limitations, diffusion-based large language models (dLLMs) have been proposed, offering the potential for parallel decoding and flexible language modeling. Despite these advantages, current dLLMs decoding strategies rely primarily on token level information, which fails to account for global sequence structure and often yields suboptimal results. In this paper, we study the decoding order selection problem from the perspective of log-likelihood maximization. We theoretically demonstrate that optimal sequence likelihood can be approximately achieved by decoding tokens in descending order of their attention matrix column sums. This finding provides a principled justification for attention-guided decoding and offers a theoretically grounded alternative to greedy search. We instantiate this theoretical insight in a new training-free decoding algorithm, termed Attn-Sampler, and further propose a block attention approximation and dynamic attention thresholding for practical acceleration. Extensive experiments across multiple benchmarks validate the effectiveness of our proposed method, demonstrating that it achieves superior generation quality while enhancing the decoding parallelism.
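Editor's sketch: the ordering rule stated in this abstract, decoding tokens in descending order of their attention-matrix column sums, can be illustrated on a toy matrix. The sketch covers the ordering step only. How Attn-Sampler obtains the attention matrix, its block approximation, and its dynamic thresholding are not described in the abstract and are not modelled here. The matrix values and function names are hypothetical.

```python
def column_sums(attn):
    """Sum each column of a row-major attention matrix:
    the total attention that position j receives from all queries."""
    n_cols = len(attn[0])
    return [sum(row[j] for row in attn) for j in range(n_cols)]

def decoding_order(attn, masked_positions):
    """Return the masked positions sorted by descending column sum."""
    sums = column_sums(attn)
    return sorted(masked_positions, key=lambda j: -sums[j])

# Hypothetical 3x3 attention matrix (each row sums to 1).
attn = [
    [0.2, 0.5, 0.3],
    [0.1, 0.6, 0.3],
    [0.3, 0.3, 0.4],
]

order = decoding_order(attn, masked_positions=[0, 1, 2])
```

Here position 1 receives the most attention overall and is decoded first, followed by positions 2 and 0.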
【24】Temperature-Dependent Performance of Prompting Strategies in Extended Reasoning Large Language Models
标题:扩展推理大型语言模型中提示策略的温度依赖性性能
链接:https://arxiv.org/abs/2604.08563
作者:Mousa Salah,Amgad Muneer
备注:3 Figures, 2 Tables
摘要:扩展的推理模型代表了大型语言模型(LLM)功能的变革性转变,它支持复杂问题解决的显式测试时计算。然而,这些系统的采样温度和提示策略的最佳配置仍然在很大程度上探索不足。我们使用Grok-4.1系统地评估了四种温度设置(0.0,0.4,0.7和1.0)下的思维链和zero-shot提示,并对来自AMO-Bench的39个数学问题进行了扩展推理,AMO-Bench是一个具有挑战性的国际数学奥林匹克水平基准。我们发现,zero-shot提示在中等温度下达到峰值性能,在T=0.4和T=0.7时达到59%的准确度,而思维链提示在极端温度下表现最好。最值得注意的是,扩展推理的好处从T=0.0时的6倍增加到T=1.0时的14.3倍。这些结果表明,温度应该与提示策略一起优化,挑战了使用T=0进行推理任务的常见做法。
摘要:Extended reasoning models represent a transformative shift in Large Language Model (LLM) capabilities by enabling explicit test-time computation for complex problem solving. However, the optimal configuration of sampling temperature and prompting strategy for these systems remains largely underexplored. We systematically evaluate chain-of-thought and zero-shot prompting across four temperature settings (0.0, 0.4, 0.7, and 1.0) using Grok-4.1 with extended reasoning on 39 mathematical problems from AMO-Bench, a challenging International Mathematical Olympiad-level benchmark. We find that zero-shot prompting achieves peak performance at moderate temperatures, reaching 59% accuracy at T=0.4 and T=0.7, while chain-of-thought prompting performs best at the temperature extremes. Most notably, the benefit of extended reasoning increases from 6x at T=0.0 to 14.3x at T=1.0. These results suggest that temperature should be optimized jointly with prompting strategy, challenging the common practice of using T=0 for reasoning tasks.
【25】GNN-as-Judge: Unleashing the Power of LLMs for Graph Learning with GNN Feedback
标题:GNN作为评委:利用GNN反馈释放LLM进行图学习的力量
链接:https://arxiv.org/abs/2604.08553
作者:Ruiyao Xu,Kaize Ding
备注:ICLR 2026
摘要:大型语言模型(LLM)在文本属性图(TAG)上表现出强大的性能,这是由于它们对文本节点特征的语义理解能力。然而,它们作为低资源环境中的预测因子的有效性仍然受到限制,因为微调LLM通常需要足够的标记数据,特别是当TAG显示复杂的结构模式时。从本质上讲,本文针对两个关键挑战:(i)在LLM的TAG上生成和选择可靠的伪标签的难度,以及(ii)在微调具有伪标签的LLM时需要减轻潜在的标签噪声。为了应对这些挑战,我们提出了一个新的框架,GNN-as-Judge,它可以通过结合图神经网络(GNN)的结构归纳偏差来释放LLM在TAG上进行Few-Shot半监督学习的能力。具体来说,GNN-as-Judge引入了一种协作伪标记策略,该策略首先从标记节点中识别出受影响最大的未标记节点,然后利用LLM和GNN之间的一致和不一致模式来生成可靠的标签。此外,我们开发了一种弱监督LLM微调算法,可以从信息丰富的伪标签中提取知识,同时减轻潜在的标签噪声。在多个TAG数据集上的实验表明,GNN-as-Judge显著优于现有方法,特别是在标记数据稀缺的低资源制度中。
摘要:Large Language Models (LLMs) have shown strong performance on text-attributed graphs (TAGs) due to their superior semantic understanding ability on textual node features. However, their effectiveness as predictors in the low-resource setting, where labeled nodes are severely limited and scarce, remains constrained since fine-tuning LLMs usually requires sufficient labeled data, especially when the TAG shows complex structural patterns. In essence, this paper targets two key challenges: (i) the difficulty of generating and selecting reliable pseudo labels on TAGs for LLMs, and (ii) the need to mitigate potential label noise when fine-tuning LLMs with pseudo labels. To counter the challenges, we propose a new framework, GNN-as-Judge, which can unleash the power of LLMs for few-shot semi-supervised learning on TAGs by incorporating the structural inductive bias of Graph Neural Networks (GNNs). Specifically, GNN-as-Judge introduces a collaborative pseudo-labeling strategy that first identifies the most influenced unlabeled nodes from labeled nodes, then exploits both the agreement and disagreement patterns between LLMs and GNNs to generate reliable labels. Furthermore, we develop a weakly-supervised LLM fine-tuning algorithm that can distill the knowledge from informative pseudo labels while mitigating the potential label noise. Experiments on multiple TAG datasets demonstrate that GNN-as-Judge significantly outperforms existing methods, particularly in low-resource regimes where labeled data are scarce.
【26】SUPERNOVA: Eliciting General Reasoning in LLMs with Reinforcement Learning on Natural Instructions
标题:SUPERNOVA:通过自然指令上的强化学习在LLM中激发通用推理
链接:https://arxiv.org/abs/2604.08477
作者:Ashima Suvarna,Kendrick Phan,Mehrab Beikzadeh,Hritik Bansal,Saadia Gabriel
备注:23 Pages, 4 figures
摘要:带有可验证奖励的强化学习(RLVR)显著改善了大型语言模型(LLM)在数学和代码等形式化领域的推理能力。尽管取得了这些进步,LLM仍然难以完成需要因果推断和时间理解等能力的通用推理任务。将RLVR扩展到通用推理,从根本上受限于缺乏覆盖多种推理技能的高质量、可验证训练数据。为了应对这一挑战,我们提出SUPERNOVA,这是一个面向RLVR的数据构建框架,旨在增强通用推理。我们的关键见解是,包含专家标注真值的指令微调数据集编码了丰富的推理模式,可以被系统地改造用于RLVR。为了研究这一点,我们进行了100多个受控RL实验,分析数据设计选择如何影响下游推理性能。具体而言,我们研究了三个关键因素:(i)源任务选择,(ii)任务混合策略,(iii)用于提高数据质量的合成干预。我们的分析表明,源任务选择并非易事,并且对下游推理性能有显著影响。此外,根据各任务在单个目标任务上的表现来选择任务,优于基于整体平均表现的策略。最后,在SUPERNOVA上训练的模型在BBEH、Zebralogic和MMLU-Pro等具有挑战性的推理基准上优于强基线(例如Qwen3.5)。特别是,在SUPERNOVA上训练可在不同模型规模上使BBEH取得最高52.8%的相对提升,证明了有原则的数据构建对RLVR的有效性。我们的研究结果为整理人工标注资源、将RLVR扩展到通用推理提供了实用的见解。代码和数据可在https://github.com/asuvarna31/supernova上获得。
摘要:Reinforcement Learning with Verifiable Rewards (RLVR) has significantly improved large language model (LLM) reasoning in formal domains such as mathematics and code. Despite these advancements, LLMs still struggle with general reasoning tasks requiring capabilities such as causal inference and temporal understanding. Extending RLVR to general reasoning is fundamentally constrained by the lack of high-quality, verifiable training data that spans diverse reasoning skills. To address this challenge, we propose SUPERNOVA, a data curation framework for RLVR aimed at enhancing general reasoning. Our key insight is that instruction-tuning datasets containing expert-annotated ground-truth encode rich reasoning patterns that can be systematically adapted for RLVR. To study this, we conduct 100+ controlled RL experiments to analyze how data design choices impact downstream reasoning performance. In particular, we investigate three key factors: (i) source task selection, (ii) task mixing strategies, and (iii) synthetic interventions for improving data quality. Our analysis reveals that source task selection is non-trivial and has a significant impact on downstream reasoning performance. Moreover, selecting tasks based on their performance for individual target tasks outperforms strategies based on overall average performance. Finally, models trained on SUPERNOVA outperform strong baselines (e.g., Qwen3.5) on challenging reasoning benchmarks including BBEH, Zebralogic, and MMLU-Pro. In particular, training on SUPERNOVA yields relative improvements of up to 52.8% on BBEH across model sizes, demonstrating the effectiveness of principled data curation for RLVR. Our findings provide practical insights for curating human-annotated resources to extend RLVR to general reasoning. The code and data are available at https://github.com/asuvarna31/supernova.
【27】Towards Real-world Human Behavior Simulation: Benchmarking Large Language Models on Long-horizon, Cross-scenario, Heterogeneous Behavior Traces
标题:迈向现实世界人类行为模拟:在长视野、跨场景、异类行为痕迹上对大型语言模型进行基准测试
链接:https://arxiv.org/abs/2604.08362
作者:Jiawei Chen,Ruoxi Xu,Boxi Cao,Ruotong Pan,Yunfei Zhang,Yifei Hu,Yong Du,Tingting Gao,Yaojie Lu,Yingfei Sun,Xianpei Han,Le Sun,Xiangyu Wu,Hongyu Lin
摘要:大型语言模型(LLM)的出现为通用的用户模拟器提供了一个潜在的解决方案。然而,现有的基准仍然局限于孤立的场景,狭窄的行动空间或合成数据,无法捕捉真实人类行为的整体性质。为了弥合这一差距,我们引入了OmniBehavior,这是第一个完全从真实世界数据构建的用户模拟基准,将长期,跨场景和异构行为模式集成到一个统一的框架中。基于这个基准,我们首先提供了经验证据,证明以前的孤立场景数据集存在隧道视野,而现实世界的决策依赖于长期的跨场景因果链。对最先进的LLM的广泛评估表明,当前的模型很难准确地模拟这些复杂的行为,即使上下文窗口扩展,性能也会趋于稳定。至关重要的是,模拟行为和真实行为之间的系统比较揭示了一个基本的结构性偏见:LLM倾向于向积极的普通人收敛,表现出过度活跃,角色同质化和乌托邦偏见。这导致个体差异和长尾行为的损失,突出了未来高保真仿真研究的关键方向。
摘要:The emergence of Large Language Models (LLMs) has illuminated the potential for a general-purpose user simulator. However, existing benchmarks remain constrained to isolated scenarios, narrow action spaces, or synthetic data, failing to capture the holistic nature of authentic human behavior. To bridge this gap, we introduce OmniBehavior, the first user simulation benchmark constructed entirely from real-world data, integrating long-horizon, cross-scenario, and heterogeneous behavioral patterns into a unified framework. Based on this benchmark, we first provide empirical evidence that previous datasets with isolated scenarios suffer from tunnel vision, whereas real-world decision-making relies on long-term, cross-scenario causal chains. Extensive evaluations of state-of-the-art LLMs reveal that current models struggle to accurately simulate these complex behaviors, with performance plateauing even as context windows expand. Crucially, a systematic comparison between simulated and authentic behaviors uncovers a fundamental structural bias: LLMs tend to converge toward a positive average person, exhibiting hyper-activity, persona homogenization, and a Utopian bias. This results in the loss of individual differences and long-tail behaviors, highlighting critical directions for future high-fidelity simulation research.
Graph相关(图学习|图神经网络|图优化等)(8篇)
【1】Event-Driven Temporal Graph Networks for Asynchronous Multi-Agent Cyber Defense in NetForge_RL
标题:NetForge_RL中用于异步多智能体网络防御的事件驱动时序图网络
链接:https://arxiv.org/abs/2604.09523
作者:Igor Jankowski
备注:26 pages, 14 figures, 5 tables
摘要:多智能体强化学习(MARL)策略从模拟网络兵棋推演向实际运行的安全运营中心(SOC)的迁移,从根本上受制于Sim2Real差距。传统模拟器抽象掉了网络协议的物理细节,依赖同步时钟节拍,并提供干净的状态向量,而不是真实、有噪声的遥测数据。为了解决这些局限,我们提出NetForge_RL:一个高保真的网络作战模拟器,它将网络防御重新表述为一个异步、连续时间的部分可观测半马尔可夫决策过程(POSMDP)。NetForge强制执行零信任网络访问(ZTNA)约束,并要求防御方处理经NLP编码的SIEM遥测数据。至关重要的是,NetForge通过双模式引擎原生地弥合了Sim2Real差距,既允许在模拟(mock)虚拟机管理程序中进行高吞吐量的MARL训练,也允许在Docker虚拟机管理程序中针对真实漏洞利用进行zero-shot评估。为了应对这一连续时间POSMDP,我们提出连续时间图MARL(CT-GMARL),利用固定步长的神经常微分方程(ODE)处理不规则采样的告警。我们将该框架与离散基线(R-MAPPO、QMIX)进行比较。实证结果表明,CT-GMARL收敛后的Blue奖励中值为57,135,相比R-MAPPO提高2.0倍,相比QMIX提高2.1倍。尤为关键的是,CT-GMARL避免了通过破坏网络效用来平凡地最小化风险的“焦土”失败模式,恢复的受损服务比最强基线多12倍。在zero-shot迁移到真实Docker环境时,CT-GMARL策略取得了98,026的奖励中值,验证了这一Sim2Real桥接。
摘要:The transition of Multi-Agent Reinforcement Learning (MARL) policies from simulated cyber wargames to operational Security Operations Centers (SOCs) is fundamentally bottlenecked by the Sim2Real gap. Legacy simulators abstract away network protocol physics, rely on synchronous ticks, and provide clean state vectors rather than authentic, noisy telemetry. To resolve these limitations, we introduce NetForge_RL: a high-fidelity cyber operations simulator that reformulates network defense as an asynchronous, continuous-time Partially Observable Semi-Markov Decision Process (POSMDP). NetForge enforces Zero-Trust Network Access (ZTNA) constraints and requires defenders to process NLP-encoded SIEM telemetry. Crucially, NetForge bridges the Sim2Real gap natively via a dual-mode engine, allowing high-throughput MARL training in a mock hypervisor and zero-shot evaluation against live exploits in a Docker hypervisor. To navigate this continuous-time POSMDP, we propose Continuous-Time Graph MARL (CT-GMARL), utilizing fixed-step Neural Ordinary Differential Equations (ODEs) to process irregularly sampled alerts. We evaluate our framework against discrete baselines (R-MAPPO, QMIX). Empirical results demonstrate that CT-GMARL achieves a converged median Blue reward of 57,135 - a 2.0x improvement over R-MAPPO and 2.1x over QMIX. Critically, CT-GMARL restores 12x more compromised services than the strongest baseline by avoiding the "scorched earth" failure mode of trivially minimizing risk by destroying network utility. On zero-shot transfer to the live Docker environment, CT-GMARL policies achieve a median reward of 98,026, validating the Sim2Real bridge.
【2】NOMAD: Generating Embeddings for Massive Distributed Graphs
标题:NOMAD:生成大规模分布式图的嵌入
链接:https://arxiv.org/abs/2604.09419
作者:Aishwarya Sarkar,Sayan Ghosh,Nathan R. Tallent,Ali Jannesari
摘要:在图或网络上成功进行机器学习需要嵌入,这些嵌入不仅要将节点和边表示为低维向量,还要保持图的结构。现有的嵌入生成方法需要通过反复使用随机游走来灵活地探索整个图,随机游走以节点和边的样本捕获图结构。对于具有数百万到数十亿条边的大规模图,这些方法带来了可扩展性挑战,因为单节点方案的内存和处理能力不足。我们提出NOMAD,这是一个面向分布式图、使用消息传递接口(MPI)的分布式内存图嵌入框架。NOMAD实现了广泛流行的LINE(Large-scale Information Network Embedding)算法中提出的基于邻近度的模型。我们提出了若干实用的权衡方案,以改善不规则分布式图嵌入方法所面临的可扩展性和通信开销问题,服务于Web和科学领域中出现的超大规模图。在基于CPU的NERSC Perlmutter集群上,NOMAD相对于多线程LINE和node2vec的流行参考实现取得了10/100倍的中值加速,相对于分布式PBG取得了35-76倍的中值加速,嵌入质量与LINE、node2vec和GraphVite相当,同时在真实世界的图上实现了12-370倍的端到端加速。
摘要:Successful machine learning on graphs or networks requires embeddings that not only represent nodes and edges as low-dimensional vectors but also preserve the graph structure. Established methods for generating embeddings require flexible exploration of the entire graph through repeated use of random walks that capture graph structure with samples of nodes and edges. These methods create scalability challenges for massive graphs with millions-to-billions of edges because single-node solutions have inadequate memory and processing capabilities. We present NOMAD, a distributed-memory graph embedding framework using the Message Passing Interface (MPI) for distributed graphs. NOMAD implements proximity-based models proposed in the widely popular LINE (Large-scale Information Network Embedding) algorithm. We propose several practical trade-offs to improve the scalability and communication overheads confronted by irregular and distributed graph embedding methods, catering to massive-scale graphs arising in web and science domains. NOMAD demonstrates median speedups of 10/100x on CPU-based NERSC Perlmutter cluster relative to the popular reference implementations of multi-threaded LINE and node2vec, 35-76x over distributed PBG, and competitive embedding quality relative to LINE, node2vec, and GraphVite, while yielding 12-370x end-to-end speedups on real-world graphs.
【3】EquiformerV3: Scaling Efficient, Expressive, and General SE(3)-Equivariant Graph Attention Transformers
标题:EquiformerV3:扩展高效、富有表达力且通用的SE(3)-等变图注意力Transformer
链接:https://arxiv.org/abs/2604.09130
作者:Yi-Lun Liao,Alexander J. Hoffman,Sabrina C. Shen,Alexandre Duval,Sam Walton Norwood,Tess Smidt
摘要:随着$SE(3)$-等变图神经网络逐渐成熟并成为3D原子建模的核心工具,提高其效率、表达能力和物理一致性已成为大规模应用的核心挑战。在这项工作中,我们提出EquiformerV3,即第三代$SE(3)$-等变图注意力Transformer,旨在推进效率、表达能力和通用性这三个维度。在EquiformerV2的基础上,我们取得了以下三项关键进展。首先,我们优化了软件实现,取得了$1.75\times$的加速。其次,我们对EquiformerV2做了简单而有效的修改,包括等变合并层归一化、改进的前馈网络超参数,以及带平滑半径截断的注意力。第三,我们提出SwiGLU-$S^2$激活,以引入多体相互作用来获得更好的理论表达能力,并在降低$S^2$网格采样复杂度的同时保持严格的等变性。SwiGLU-$S^2$激活与平滑截断注意力共同实现了对平滑变化的势能面(PES)的精确建模,使EquiformerV3可以推广到需要能量守恒模拟和PES高阶导数的任务。凭借这些改进,使用非平衡结构去噪(DeNS)辅助任务训练的EquiformerV3在OC20、OMat24和Matbench Discovery上取得了最先进的结果。
摘要:As $SE(3)$-equivariant graph neural networks mature as a core tool for 3D atomistic modeling, improving their efficiency, expressivity, and physical consistency has become a central challenge for large-scale applications. In this work, we introduce EquiformerV3, the third generation of the $SE(3)$-equivariant graph attention Transformer, designed to advance all three dimensions: efficiency, expressivity, and generality. Building on EquiformerV2, we have the following three key advances. First, we optimize the software implementation, achieving $1.75\times$ speedup. Second, we introduce simple and effective modifications to EquiformerV2, including equivariant merged layer normalization, improved feedforward network hyper-parameters, and attention with smooth radius cutoff. Third, we propose SwiGLU-$S^2$ activations to incorporate many-body interactions for better theoretical expressivity and to preserve strict equivariance while reducing the complexity of sampling $S^2$ grids. Together, SwiGLU-$S^2$ activations and smooth-cutoff attention enable accurate modeling of smoothly varying potential energy surfaces (PES), generalizing EquiformerV3 to tasks requiring energy-conserving simulations and higher-order derivatives of PES. With these improvements, EquiformerV3 trained with the auxiliary task of denoising non-equilibrium structures (DeNS) achieves state-of-the-art results on OC20, OMat24, and Matbench Discovery.
【4】Beyond Isolated Clients: Integrating Graph-Based Embeddings into Event Sequence Models
标题:超越孤立客户端:将基于图的嵌入集成到事件序列模型中
链接:https://arxiv.org/abs/2604.09085
作者:Harry Proshian,Nikita Severin,Sergey Nikolenko,Kireev Ivan,Andrey Savchenko,Ivan Sergeev,Maria Postnova,Ilya Makarov
备注:Short paper accepted at ACM Web Conference 2026 (WWW '26)
摘要:大规模数字平台会生成数十亿条带时间戳的用户-物品交互(事件),这些交互对于在欺诈防范和推荐等场景中预测用户属性至关重要。虽然自监督学习(SSL)能有效地对事件的时间顺序建模,但它通常忽略了用户-物品交互图的全局结构。为了弥合这一差距,我们提出三种与模型无关的策略,将这种结构信息集成到对比式SSL中:丰富事件嵌入、将客户端表示与图嵌入对齐,以及添加结构性前置任务(pretext task)。在四个金融和电子商务数据集上的实验表明,我们的方法能够稳定地提高准确性(AUC最高提升2.3%),并揭示图密度是选择最佳集成策略的关键因素。
摘要:Large-scale digital platforms generate billions of timestamped user-item interactions (events) that are crucial for predicting user attributes in, e.g., fraud prevention and recommendations. While self-supervised learning (SSL) effectively models the temporal order of events, it typically overlooks the global structure of the user-item interaction graph. To bridge this gap, we propose three model-agnostic strategies for integrating this structural information into contrastive SSL: enriching event embeddings, aligning client representations with graph embeddings, and adding a structural pretext task. Experiments on four financial and e-commerce datasets demonstrate that our approach consistently improves accuracy (by up to 2.3% AUC) and reveal that graph density is a key factor in selecting the optimal integration strategy.
【5】Neighbourhood Transformer: Switchable Attention for Monophily-Aware Graph Learning
标题:邻域Transformer:面向单嗜性(monophily)感知图学习的可切换注意力
链接:https://arxiv.org/abs/2604.08980
作者:Yi Luo,Xu Sun,Guangchun Luo,Aiguo Chen
摘要:图神经网络(GNN)已被广泛应用于社交网络分析、化学研究和计算机视觉等工程应用中。然而,其有效性受到固有的同质性(homophily)假设的严重制约,该假设在不相似节点经常相连的异质性(heterophilic)图上并不成立。为了解决图学习中的这一基本限制,我们首先从最近发现的现实世界图的单嗜性(monophily)中汲取灵感,提出了邻域Transformer(NT),这是一种新的范式,它在每个局部邻域内应用自注意力,而不是像传统的消息传递GNN那样将消息聚合到中心节点。这种设计使NT天然具备单嗜性感知能力,并在理论上保证其表达能力不弱于传统的消息传递框架。针对实际的工程部署,我们进一步开发了一种配备可切换注意力的邻域划分策略,该策略将NT的空间消耗减少了95%以上,时间消耗最多减少92.67%,显著扩展了其对更大规模图的适用性。在10个真实世界数据集(5个异质性图和5个同质性图)上的大量实验表明,NT在节点分类任务上优于所有当前最先进的方法,证明了其优越的性能和跨域适应性。这项工作的完整实现代码可在https://github.com/cf020031308/MoNT上公开获得,以促进可重复性和工业应用。
摘要:Graph neural networks (GNNs) have been widely adopted in engineering applications such as social network analysis, chemical research and computer vision. However, their efficacy is severely compromised by the inherent homophily assumption, which fails to hold for heterophilic graphs where dissimilar nodes are frequently connected. To address this fundamental limitation in graph learning, we first draw inspiration from the recently discovered monophily property of real-world graphs, and propose Neighbourhood Transformers (NT), a novel paradigm that applies self-attention within every local neighbourhood instead of aggregating messages to the central node as in conventional message-passing GNNs. This design makes NT inherently monophily-aware and theoretically guarantees its expressiveness is no weaker than traditional message-passing frameworks. For practical engineering deployment, we further develop a neighbourhood partitioning strategy equipped with switchable attentions, which reduces the space consumption of NT by over 95% and time consumption by up to 92.67%, significantly expanding its applicability to larger graphs. Extensive experiments on 10 real-world datasets (5 heterophilic and 5 homophilic graphs) show that NT outperforms all current state-of-the-art methods on node classification tasks, demonstrating its superior performance and cross-domain adaptability. The full implementation code of this work is publicly available at https://github.com/cf020031308/MoNT to facilitate reproducibility and industrial adoption.
【6】A Closer Look at the Application of Causal Inference in Graph Representation Learning
标题:近距离观察因果推理在图表示学习中的应用
链接:https://arxiv.org/abs/2604.08890
作者:Hang Gao,Kunyu Li,Huang Hong,Baoquan Cui,Fengge Wu
摘要:图表示学习中的因果关系建模仍然是一个根本性的挑战。现有的方法通常借鉴因果推理的理论和方法来识别因果子图或减轻混杂因素的影响。然而,由于图结构数据固有的复杂性,这些方法经常将不同的图元素聚合为单个因果变量,这种操作可能会违反因果推理的核心假设。在这项工作中,我们证明了这种聚合会损害因果有效性。基于这一结论,我们提出了一个建立在图数据最小不可分单元之上的理论模型,以保证因果有效性。借助该模型,我们进一步分析了在图表示学习中实现精确因果建模的代价,并确定了可以简化问题的条件。为了从经验上支持我们的理论,我们构建了一个反映现实世界因果结构的可控合成数据集,并进行了大量实验验证。最后,我们开发了一个因果建模增强模块,可以无缝集成到现有的图学习管道中,并通过全面的对比实验证明了其有效性。
摘要:Modeling causal relationships in graph representation learning remains a fundamental challenge. Existing approaches often draw on theories and methods from causal inference to identify causal subgraphs or mitigate confounders. However, due to the inherent complexity of graph-structured data, these approaches frequently aggregate diverse graph elements into single causal variables, an operation that risks violating the core assumptions of causal inference. In this work, we prove that such aggregation compromises causal validity. Building on this conclusion, we propose a theoretical model grounded in the smallest indivisible units of graph data to ensure that the causal validity is guaranteed. With this model, we further analyze the costs of achieving precise causal modeling in graph representation learning and identify the conditions under which the problem can be simplified. To empirically support our theory, we construct a controllable synthetic dataset that reflects real-world causal structures and conduct extensive experiments for validation. Finally, we develop a causal modeling enhancement module that can be seamlessly integrated into existing graph learning pipelines, and we demonstrate its effectiveness through comprehensive comparative experiments.
【7】SenBen: Sensitive Scene Graphs for Explainable Content Moderation
标题:SenBen:可解释内容审核的敏感场景图
链接:https://arxiv.org/abs/2604.08819
作者:Fatih Cagatay Akyon,Alptekin Temizel
备注:Accepted at CVPRW 2026
摘要:内容审核系统将图像分类为安全或不安全,但缺乏空间定位和可解释性:它们无法解释检测到了什么敏感行为、涉及谁,或者发生在哪里。我们引入了敏感基准(SenBen),这是第一个针对敏感内容的大规模场景图基准,包括来自157部电影的13,999帧,这些帧用Visual Genome风格的场景图(25个对象类,28个属性(包括疼痛、恐惧、攻击性和痛苦等情感状态),14个谓词)和5个类别的16个敏感度标签进行标注。我们使用多任务方案将前沿VLM蒸馏为紧凑的241M学生模型,该方案通过基于后缀的对象身份、词汇感知召回(VAR)损失和带有非对称损失的解耦Query2Label标签头来解决自回归场景图生成中的词汇不平衡问题,从而使SenBen Recall比标准交叉熵训练提高了6.4个百分点。在接地(grounded)场景图指标上,我们的学生模型优于除Gemini模型之外的所有被评估VLM以及所有商业安全API,同时在所有模型中实现了最高的对象检测和字幕得分,推理速度快$7.6\times$,GPU内存占用少$16\times$。
摘要:Content moderation systems classify images as safe or unsafe but lack spatial grounding and interpretability: they cannot explain what sensitive behavior was detected, who is involved, or where it occurs. We introduce the Sensitive Benchmark (SenBen), the first large-scale scene graph benchmark for sensitive content, comprising 13,999 frames from 157 movies annotated with Visual Genome-style scene graphs (25 object classes, 28 attributes including affective states such as pain, fear, aggression, and distress, 14 predicates) and 16 sensitivity tags across 5 categories. We distill a frontier VLM into a compact 241M student model using a multi-task recipe that addresses vocabulary imbalance in autoregressive scene graph generation through suffix-based object identity, Vocabulary-Aware Recall (VAR) Loss, and a decoupled Query2Label tag head with asymmetric loss, yielding a +6.4 percentage point improvement in SenBen Recall over standard cross-entropy training. On grounded scene graph metrics, our student model outperforms all evaluated VLMs except Gemini models and all commercial safety APIs, while achieving the highest object detection and captioning scores across all models, at $7.6\times$ faster inference and $16\times$ less GPU memory.
【8】R2G: A Multi-View Circuit Graph Benchmark Suite from RTL to GDSII
标题:R2G:一个从RTL到GDSII的多视图电路图基准测试套件
链接:https://arxiv.org/abs/2604.08810
作者:Zewei Zhou,Jiajun Zou,Jiajia Zhang,Ao Yang,Ruichao He,Haozheng Zhou,Ao Liu,Jiawei Liu,Leilei Jin,Shan Shen,Daying Sun
备注:Accepted as a poster by CVPR2026
摘要:图神经网络(GNN)越来越多地应用于物理设计任务,如拥塞预测和线长估计,但进展受到不一致的电路表示和缺乏受控评估协议的阻碍。我们提出了R2G(RTL到GDSII),一个多视图电路图基准套件,它在30个开源IP核(最多$10^6$个节点/边)上,以信息对等(每个视图编码相同的属性集,仅在特征附着的位置上有所不同)的方式标准化了五个阶段感知视图。R2G提供了一个端到端的DEF到图的管道,覆盖综合、布局和布线阶段,并提供加载器、统一划分、领域指标和可复现基线。通过将表示选择与模型选择解耦,R2G隔离了先前EDA和图ML基准未加控制的一个混杂因素。在对GINE、GAT和ResGatedGCN的系统研究中,我们发现:(i)视图选择的影响大于模型选择,对于固定的GNN,测试R$^2$在不同表示之间的变化超过0.3;(ii)以节点为中心的视图在布局和布线上的泛化效果最好;以及(iii)解码器头深度(3-4层)是主要的精度驱动因素,可将发散的训练转变为近乎完美的预测(R$^2$$>$0.99)。代码和数据集可在https://github.com/ShenShan123/R2G上获得。
摘要:Graph neural networks (GNNs) are increasingly applied to physical design tasks such as congestion prediction and wirelength estimation, yet progress is hindered by inconsistent circuit representations and the absence of controlled evaluation protocols. We present R2G (RTL-to-GDSII), a multi-view circuit-graph benchmark suite that standardizes five stage-aware views with information parity (every view encodes the same attribute set, differing only in where features attach) over 30 open-source IP cores (up to $10^6$ nodes/edges). R2G provides an end-to-end DEF-to-graph pipeline spanning synthesis, placement, and routing stages, together with loaders, unified splits, domain metrics, and reproducible baselines. By decoupling representation choice from model choice, R2G isolates a confound that prior EDA and graph-ML benchmarks leave uncontrolled. In systematic studies with GINE, GAT, and ResGatedGCN, we find: (i) view choice dominates model choice, with Test R$^2$ varying by more than 0.3 across representations for a fixed GNN; (ii) node-centric views generalize best across both placement and routing; and (iii) decoder-head depth (3--4 layers) is the primary accuracy driver, turning divergent training into near-perfect predictions (R$^2$$>$0.99). Code and datasets are available at https://github.com/ShenShan123/R2G.
Transformer(5篇)
【1】Integrated electro-optic attention nonlinearities for transformers
标题:Transformer的集成光电注意力非线性
链接:https://arxiv.org/abs/2604.09512
作者:Luis Mickeler,Kai Lion,Alfonso Nardi,Jost Kellner,Pierre Didier,Bhavin J. Shastri,Niao He,Rachel Grange
摘要:Transformer已经成为主导的神经网络架构,在语言处理和计算机视觉方面实现了最先进的性能。这些模型的核心是注意力机制,它需要使用Softmax函数进行非线性、非负映射。然而,尽管Softmax操作占总操作数的不到1%,但它们可能不成比例地成为整体推理延迟的瓶颈。在这里,我们使用薄膜铌酸锂(TFLN)马赫-曾德尔调制器(MZM)作为模拟非线性计算元件,以大幅降低非线性计算的延迟。我们实现了数字Softmax和Sigmoid的电光替代方案,并评估了它们在Vision Transformer和大型语言模型中的性能。即使对模拟单元进行激进的4位输入输出量化,我们的系统仍能保持极具竞争力的准确性。我们进一步表征了编码速度高达10 GBaud时的系统噪声,并在各种噪声条件下评估了模型的鲁棒性。我们的研究结果表明,TFLN调制器可以作为混合共封装硬件内的非线性功能单元,实现高速且节能的非线性计算。
摘要:Transformers have emerged as the dominant neural-network architecture, achieving state-of-the-art performance in language processing and computer vision. At the core of these models lies the attention mechanism, which requires a nonlinear, non-negative mapping using the Softmax function. However, although Softmax operations account for less than 1% of the total operation count, they can disproportionately bottleneck overall inference latency. Here, we use thin-film lithium niobate (TFLN) Mach-Zehnder modulators (MZMs) as analog nonlinear computational elements to drastically reduce the latency of nonlinear computations. We implement electro-optic alternatives to digital Softmax and Sigmoid, and evaluate their performance in Vision Transformers and Large Language Models. Our system maintains highly competitive accuracy, even under aggressive 4-bit input-output quantization of the analog units. We further characterize system noise at encoding speeds up to 10 GBaud and assess model robustness under various noise conditions. Our findings suggest that TFLN modulators can serve as nonlinear function units within hybrid co-packaged hardware, enabling high-speed and energy-efficient nonlinear computation.
【2】Generalization and Scaling Laws for Mixture-of-Experts Transformers
标题:混合专家(MoE)Transformer的泛化与缩放定律
链接:https://arxiv.org/abs/2604.09175
作者:Mansour Zoubeirou a Mayaki
摘要:我们为混合专家(MoE)Transformer建立了一套泛化与缩放理论,将每个输入的活跃(active)容量与路由组合数清晰地分离开来。通过以固定的路由模式为条件并对这些模式取联合界(union bound),我们得到一个上确界范数(sup-norm)覆盖数界,其度量熵随活跃参数预算缩放,并带来一项MoE特有的路由开销。结合平方损失的标准ERM分析,这在$d$维流形数据模型和$C^β$目标下给出了一个泛化界,表明一旦恰当地计入活跃参数,近似与估计之间的权衡与稠密网络中相同。我们进一步证明了MoE架构的一个构造性逼近定理,表明在该逼近构造下,误差可以通过扩大活跃容量或增加专家数量来降低,具体取决于占主导地位的瓶颈。根据这些结果,我们推导出关于模型规模、数据规模和计算最优权衡的神经缩放定律。总的来说,我们的结果为推理MoE缩放提供了一个透明的统计参考点,阐明了哪些行为由最坏情况理论所保证,哪些行为必须来自依赖数据的路由结构或优化动态。
摘要:We develop a theory of generalization and scaling for Mixture-of-Experts (MoE) Transformers that cleanly separates active per-input capacity from routing combinatorics. By conditioning on fixed routing patterns and union-bounding across them, we derive a sup-norm covering-number bound whose metric entropy scales with the active parameter budget and incurs a MoE-specific routing overhead. Combined with a standard ERM analysis for squared loss, this yields a generalization bound under a $d$-dimensional manifold data model and $C^β$ targets, showing that approximation and estimation trade off as in dense networks once active parameters are accounted for appropriately. We further prove a constructive approximation theorem for MoE architectures, showing that, under the approximation construction, error can decrease either by scaling active capacity or by increasing the number of experts, depending on the dominant bottleneck. From these results we derive neural scaling laws for model size, data size, and compute-optimal tradeoffs. Overall, our results provide a transparent statistical reference point for reasoning about MoE scaling, clarifying which behaviors are certified by worst-case theory and which must arise from data-dependent routing structure or optimization dynamics.
【3】Hierarchical Kernel Transformer: Multi-Scale Attention with an Information-Theoretic Approximation Analysis
标题:分层核Transformer:具有信息论逼近分析的多尺度注意力
链接:https://arxiv.org/abs/2604.08829
作者:Giansalvo Cirrincione
备注:20 pages, 3 figures, 8 tables submitted to Neurocomputing
摘要:分层核Transformer(HKT)是一种多尺度注意力机制,它通过可训练的因果下采样在L个分辨率级别上处理序列,并通过学习到的凸权重组合各级别的得分矩阵。总计算成本不超过标准注意力的4/3倍,L = 3时为1.3125x。建立了四个理论结果。(i)分层得分矩阵在关于对称化双线性形式的一个充分条件下定义了一个半正定核(命题3.1)。(ii)非对称得分矩阵可唯一分解为控制相互注意力的对称部分和控制方向性注意力的反对称部分;HKT跨尺度提供L个独立的这样的对,每个分辨率级别一个(命题3.5-3.6)。(iii)近似误差分解为三个可解释的分量,带有显式的非高斯校正和关于L的几何衰减界(定理4.3,命题4.4)。(iv)HKT严格包含单头标准注意力和因果卷积(命题3.4)。在3个随机种子上的实验显示,相比重新训练的标准注意力基线有一致的增益:合成ListOps上+4.77pp(55.10 ± 0.29% vs 50.33 ± 0.12%,T = 512),序列化CIFAR-10上+1.44pp(35.45 ± 0.09% vs 34.01 ± 0.19%,T = 1,024),IMDB字符级情感分类上+7.47pp(70.19 ± 0.57% vs 62.72 ± 0.40%,T = 1,024),开销均为1.31倍。
摘要:The Hierarchical Kernel Transformer (HKT) is a multi-scale attention mechanism that processes sequences at L resolution levels via trainable causal downsampling, combining level-specific score matrices through learned convex weights. The total computational cost is bounded by 4/3 times that of standard attention, reaching 1.3125x for L = 3. Four theoretical results are established. (i) The hierarchical score matrix defines a positive semidefinite kernel under a sufficient condition on the symmetrised bilinear form (Proposition 3.1). (ii) The asymmetric score matrix decomposes uniquely into a symmetric part controlling reciprocal attention and an antisymmetric part controlling directional attention; HKT provides L independent such pairs across scales, one per resolution level (Propositions 3.5-3.6). (iii) The approximation error decomposes into three interpretable components with an explicit non-Gaussian correction and a geometric decay bound in L (Theorem 4.3, Proposition 4.4). (iv) HKT strictly subsumes single-head standard attention and causal convolution (Proposition 3.4). Experiments over 3 random seeds show consistent gains over retrained standard attention baselines: +4.77pp on synthetic ListOps (55.10 ± 0.29% vs 50.33 ± 0.12%, T = 512), +1.44pp on sequential CIFAR-10 (35.45 ± 0.09% vs 34.01 ± 0.19%, T = 1,024), and +7.47pp on IMDB character-level sentiment (70.19 ± 0.57% vs 62.72 ± 0.40%, T = 1,024), all at 1.31x overhead.
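The 4/3 bound and the 1.3125x figure for L = 3 quoted in this abstract are what a geometric series gives if each resolution level halves the sequence length, so that its quadratic score-matrix cost is a quarter of the previous level's. The halving factor is an assumption made here for illustration; the abstract does not state the downsampling rate, and `relative_cost` is a hypothetical helper, not code from the paper.

```python
from fractions import Fraction


def relative_cost(levels: int, shrink: int = 2) -> Fraction:
    """Cost of `levels` resolution levels relative to one full attention pass.

    Assumption (not from the abstract): level l sees a sequence shortened by
    shrink**l, so its quadratic cost is (1 / shrink**2) ** l of the full cost.
    """
    ratio = Fraction(1, shrink * shrink)
    return sum((ratio ** level for level in range(levels)), Fraction(0))


if __name__ == "__main__":
    for levels in (1, 2, 3, 8):
        # The partial sums approach, but never reach, 4/3.
        print(levels, float(relative_cost(levels)))
```

Under this assumption the sum for three levels is 1 + 1/4 + 1/16 = 21/16 = 1.3125, and the infinite series converges to 1 / (1 - 1/4) = 4/3, matching both numbers in the abstract.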
【4】A fast and Generic Energy-Shifting Transformer for Hybrid Monte Carlo Radiotherapy Calculation
标题:用于混合蒙特卡罗放射治疗计算的快速通用能量转移Transformer
链接:https://arxiv.org/abs/2604.09157
作者:Chi-Hieu Pham,Didier Benoit,Vincent Bourbonne,Ulrike Schick,Julien Bert
备注:13 pages, 6 figures, 6 tables
摘要:我们提出了一种用于加速蒙特卡罗(MC)剂量计算的新型学习框架,称为能量转移(Energy-Shifting)。该方法利用深度学习,直接从相同射束配置下的单能输入合成6 MV TrueBeam直线加速器(LINAC)剂量分布。传统的去噪技术依赖于会损害射束轮廓完整性的含噪低计数剂量图;与之不同,我们的方法通过将高保真解剖纹理和特定于源的射束相似性集成到模型的输入空间中,在未见过的数据集上实现了更优的跨域泛化。此外,我们提出了一种称为TransUNetSE3D的新型3D架构,其中Transformer块用于提取全局上下文,残差挤压与激励(SE)模块用于自适应的逐通道特征重校准。这些块的分层表示与主要剂量图参数一起融合到网络的潜在空间中,从而实现物理感知的重建。这种混合设计在空间精度和结构保留方面优于现有的UNet和基于Transformer的基准,同时保持实时使用所需的执行速度。与MC参考相比,我们提出的管道实现了超过98%(3%/3 mm)的伽马通过率,该结果是在用于前列腺放疗的治疗计划系统(TPS)框架内评估的。这些结果为自适应放射治疗中的快速体积剂量测定提供了一个稳健的解决方案。
摘要:We introduce a novel learning framework for accelerated Monte Carlo (MC) dose calculation termed Energy-Shifting. This approach leverages deep learning to synthesize 6 MV TrueBeam Linear Accelerator (LINAC) dose distributions directly from monoenergetic inputs under identical beam configurations. Unlike conventional denoising techniques, which rely on noisy low-count dose maps that compromise beam profile integrity, our method achieves superior cross-domain generalization on unseen datasets by integrating high-fidelity anatomical textures and source-specific beam similarity into the model's input space. Furthermore, we propose a novel 3D architecture termed TransUNetSE3D, featuring Transformer blocks for global context and Residual Squeeze-and-Excitation (SE) modules for adaptive channel-wise feature recalibration. Hierarchical representations of these blocks are fused into the network's latent space alongside the primary dose-map parameters, allowing physics-aware reconstruction. This hybrid design outperforms existing UNet and Transformer-based benchmarks in both spatial precision and structural preservation, while maintaining the execution speed necessary for real-time use. Our proposed pipeline achieves a Gamma Passing Rate exceeding 98% (3%/3mm) compared to the MC reference, evaluated within the framework of a treatment planning system (TPS) for prostate radiotherapy. These results offer a robust solution for fast volumetric dosimetry in adaptive radiotherapy.
【5】MedFormer-UR: Uncertainty-Routed Transformer for Medical Image Classification
标题:MedFormer-UR:用于医学图像分类的不确定性路由Transformer
链接:https://arxiv.org/abs/2604.08868
作者:Mohammed Maaz Sibhai,Abedalrhman Alkhateeb,Saad B. Ahmed
摘要:为了确保安全的临床整合,深度学习模型必须提供的不仅仅是高准确性;它们需要可靠的不确定性量化。虽然目前的医学Vision Transformer表现良好,但它们经常存在预测过度自信和缺乏透明度的问题,而临床数据的噪声和不平衡性又放大了这些问题。为了解决这个问题,我们增强了结合基于原型的学习和不确定性引导路由的改进版医学Transformer(MedFormer)。通过利用Dirichlet分布来刻画每个token的证据不确定性,我们的框架可以实时量化和定位模糊性。这种不确定性不仅仅是一种输出,而是训练过程中的积极参与者,可过滤掉不可靠的特征更新。此外,使用特定于类的原型可确保嵌入空间保持结构化,从而能够基于视觉相似性做出决策。在四种模态(乳腺X线摄影、超声、MRI和组织病理学)上的测试证实,我们的方法显著改善了模型校准,将预期校准误差(ECE)最多降低35%,并提高了选择性预测的效果,即使在准确性增益不大的情况下也是如此。
摘要:To ensure safe clinical integration, deep learning models must provide more than just high accuracy; they require dependable uncertainty quantification. While current Medical Vision Transformers perform well, they frequently struggle with overconfident predictions and a lack of transparency, issues that are magnified by the noisy and imbalanced nature of clinical data. To address this, we enhanced the modified Medical Transformer (MedFormer) that incorporates prototype-based learning and uncertainty-guided routing. By utilizing a Dirichlet distribution for per-token evidential uncertainty, our framework can quantify and localize ambiguity in real time. This uncertainty is not just an output but an active participant in the training process, filtering out unreliable feature updates. Furthermore, the use of class-specific prototypes ensures the embedding space remains structured, allowing for decisions based on visual similarity. Testing across four modalities (mammography, ultrasound, MRI, and histopathology) confirms that our approach significantly enhances model calibration, reducing expected calibration error (ECE) by up to 35%, and improves selective prediction, even when accuracy gains are modest.
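For readers unfamiliar with the calibration metric named in this abstract, the following is a minimal sketch of the commonly used binned expected calibration error. The bin count, the equal-width binning, and the function name are assumptions made for illustration; the abstract does not say which ECE variant the authors compute.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: sum over bins of (bin size / N) * |accuracy - mean confidence|.

    confidences: predicted probability of the chosen class, each in [0, 1].
    correct: booleans, True where the prediction matched the label.
    """
    n = len(confidences)
    if n == 0 or n != len(correct):
        raise ValueError("inputs must be non-empty and of equal length")
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Equal-width bins; a confidence of exactly 1.0 falls in the last bin.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += len(members) / n * abs(accuracy - mean_conf)
    return ece


if __name__ == "__main__":
    # Four predictions at 90% confidence, three of them right: gap of 0.15.
    print(expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, True, False]))
```

A model that is right 75% of the time while claiming 90% confidence is overconfident by 0.15, which is the kind of gap the reported ECE reduction refers to.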
GAN|对抗|攻击|生成相关(9篇)
【1】XFED: Non-Collusive Model Poisoning Attack Against Byzantine-Robust Federated Classifiers
标题:XFED:针对拜占庭稳健联邦分类器的非共谋模型中毒攻击
链接:https://arxiv.org/abs/2604.09489
作者:Israt Jahan Mouri,Muhammad Ridowan,Muhammad Abdullah Adnan
备注:21 pages, 9 figures, 7 tables
摘要:模型中毒攻击对联邦学习(FL)构成了严重的安全威胁。大多数现有的模型中毒攻击依赖于共谋,要求敌对客户端通过交换本地良性模型并同步生成中毒更新来进行协调。然而,在现实世界的FL部署中,维持这种协调越来越不切实际,因为它实际上需要对许多设备进行类似僵尸网络的控制。这种方法维护成本高,而且极易被发现。这引出了一个基本问题:模型中毒攻击在攻击者之间没有任何通信的情况下仍然有效吗?为了应对这一挑战,我们引入并形式化了非共谋攻击模型,在该模型中,所有被攻陷的客户端共享一个共同的对抗目标,但独立运作。在这种模型下,每个攻击者生成其恶意更新时,无需与其他对手通信、访问其他客户端的更新,或依赖于有关服务器端防御的任何知识。为了证明这种威胁模型的可行性,我们提出了XFED,这是第一个与聚合方式无关的非共谋模型中毒攻击。我们在六个基准数据集上的实证评估表明,XFED绕过了八种最先进的防御,并优于六种现有的模型中毒攻击。这些研究结果表明,FL系统的安全性远低于以前的认识,并强调迫切需要更稳健、更实用的防御机制。
摘要:Model poisoning attacks pose a significant security threat to Federated Learning (FL). Most existing model poisoning attacks rely on collusion, requiring adversarial clients to coordinate by exchanging local benign models and synchronizing the generation of their poisoned updates. However, sustaining such coordination is increasingly impractical in real-world FL deployments, as it effectively requires botnet-like control over many devices. This approach is costly to maintain and highly vulnerable to detection. This context raises a fundamental question: Can model poisoning attacks remain effective without any communication between attackers? To address this challenge, we introduce and formalize the non-collusive attack model, in which all compromised clients share a common adversarial objective but operate independently. Under this model, each attacker generates its malicious update without communicating with other adversaries, accessing other clients' updates, or relying on any knowledge of server-side defenses. To demonstrate the feasibility of this threat model, we propose XFED, the first aggregation-agnostic, non-collusive model poisoning attack. Our empirical evaluation across six benchmark datasets shows that XFED bypasses eight state-of-the-art defenses and outperforms six existing model poisoning attacks. These findings indicate that FL systems are substantially less secure than previously believed and underscore the urgent need for more robust and practical defense mechanisms.
【2】ECHO: Efficient Chest X-ray Report Generation with One-step Block Diffusion
标题:ECHO:通过一步块扩散高效生成胸部X射线报告
链接:https://arxiv.org/abs/2604.09450
作者:Lifeng Chen,Tianqi You,Hao Liu,Zhimin Bao,Jile Jiao,Xiao Han,Zhicai Ou,Tao Sun,Xiaofeng Mou,Xiaojie Jin,Yi Xu
摘要:胸部X射线报告生成(CXR-RG)有可能大大减轻放射科医生的工作量。然而,传统的自回归视觉语言模型(VLM)由于逐个令牌顺序解码而存在较高的推理延迟。基于扩散的模型通过并行生成提供了一种有前景的替代方案,但它们仍然需要多次去噪迭代。将多步去噪压缩为一步可以进一步减少延迟,但由于令牌因式分解的去噪器引入的平均场偏差,通常会降低文本的连贯性。为了应对这一挑战,我们提出了ECHO,一个用于胸部X射线报告生成的高效扩散式VLM(dVLM)。ECHO通过一种新的直接条件蒸馏(DCD)框架实现了稳定的每块一步推理,该框架通过从同策略(on-policy)扩散轨迹构建未因式分解的监督信号来编码联合令牌依赖性,从而缓解平均场限制。此外,我们还引入了响应非对称扩散(RAD)训练策略,在保持模型有效性的同时进一步提高了训练效率。大量实验表明,ECHO超越了最先进的自回归方法,将RaTE和SemScore分别提高了64.33%和60.58%,同时在不影响临床准确性的情况下实现了$8\times$的推理加速。
摘要:Chest X-ray report generation (CXR-RG) has the potential to substantially alleviate radiologists' workload. However, conventional autoregressive vision-language models (VLMs) suffer from high inference latency due to sequential token decoding. Diffusion-based models offer a promising alternative through parallel generation, but they still require multiple denoising iterations. Compressing multi-step denoising to a single step could further reduce latency, but often degrades textual coherence due to the mean-field bias introduced by token-factorized denoisers. To address this challenge, we propose ECHO, an efficient diffusion-based VLM (dVLM) for chest X-ray report generation. ECHO enables stable one-step-per-block inference via a novel Direct Conditional Distillation (DCD) framework, which mitigates the mean-field limitation by constructing unfactorized supervision from on-policy diffusion trajectories to encode joint token dependencies. In addition, we introduce a Response-Asymmetric Diffusion (RAD) training strategy that further improves training efficiency while maintaining model effectiveness. Extensive experiments demonstrate that ECHO surpasses state-of-the-art autoregressive methods, improving RaTE and SemScore by 64.33% and 60.58% respectively, while achieving an $8\times$ inference speedup without compromising clinical accuracy.
【3】Structural Evaluation Metrics for SVG Generation via Leave-One-Out Analysis
标题:基于留一法分析的SVG生成结构评估指标
链接:https://arxiv.org/abs/2604.08809
作者:Haonan Zhu,Adrienne Deganutti,Elad Hirsch,Purvanshi Mehta
摘要:可缩放矢量图形(SVG)将视觉内容表示为结构化的、可编辑的代码。每个元素(路径、形状或文本节点)都可以单独检查、变换或删除。这种结构可编辑性是SVG生成的主要动机,但主流的评估协议主要将输出归结为相对于参考图像或输入文本的单一相似性分数,衡量结果在多大程度上忠实地再现图像或遵循指令,却不衡量它在多大程度上保留了使SVG有价值的结构属性。特别是,现有的指标无法确定哪些生成的元素对整体视觉质量有积极贡献、视觉概念如何映射到代码的特定部分,或者生成的输出是否支持有意义的下游编辑。我们引入了元素级留一法(LOO)分析,其灵感来自经典的刀切法(jackknife)估计量。该过程在包含和不包含每个元素的情况下分别渲染SVG,测量由此产生的视觉变化,并导出一套结构质量指标。尽管刀切法很简单,但它将聚合统计量分解为每个样本贡献的能力可以直接迁移到这一场景。从这一单一机制出发,我们得到:(1)通过LOO评分得到的逐元素质量分数,可实现零样本伪影检测;(2)概念-元素归因,将每个元素映射到它所服务的视觉概念;以及(3)四个结构指标,即纯度、覆盖率、紧凑性和局部性,从互补的角度量化SVG的模块化程度。我们在5个生成系统和3个复杂度层级上的19,000多次编辑(5种类型)上验证了这些指标。
摘要:Scalable Vector Graphics (SVG) represent visual content as structured, editable code. Each element (path, shape, or text node) can be individually inspected, transformed, or removed. This structural editability is a main motivation for SVG generation, yet prevailing evaluation protocols primarily reduce the output to a single similarity score against a reference image or input texts, measuring how faithfully the result reproduces an image or follows the instructions, but not how well it preserves the structural properties that make SVG valuable. In particular, existing metrics cannot determine which generated elements contribute positively to overall visual quality, how visual concepts map to specific parts of the code, or whether the generated output supports meaningful downstream editing. We introduce element-level leave-one-out (LOO) analysis, inspired by the classic jackknife estimator. The procedure renders the SVG with and without each element, measures the resulting visual change, and derives a suite of structural quality metrics. Despite its simplicity, the jackknife's capacity to decompose an aggregate statistic into per-sample contributions translates directly to this setting. From a single mechanism, we obtain: (1) quality scores per element through LOO scoring that enable zero-shot artifact detection; (2) concept-element attribution that maps each element to the visual concept it serves; and (3) four structural metrics, purity, coverage, compactness, and locality, that quantify SVG modularity from complementary perspectives. We validate these metrics on over 19,000 edits (5 types) across 5 generation systems and 3 complexity tiers.
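The leave-one-out procedure described in this abstract (render with and without each element, then measure the change) can be sketched in a few lines. Everything below is illustrative: `toy_render` is a hypothetical stand-in rasteriser that only understands `<rect>` elements and ignores colour and z-order, and the changed-cell count is an assumed change measure, not the visual metric used by the authors.

```python
import copy
import xml.etree.ElementTree as ET

SVG_NS = "{http://www.w3.org/2000/svg}"


def toy_render(root):
    """Hypothetical rasteriser: the set of unit cells covered by <rect> elements."""
    cells = set()
    for rect in root.iter(SVG_NS + "rect"):
        x, y = int(rect.get("x", 0)), int(rect.get("y", 0))
        w, h = int(rect.get("width", 0)), int(rect.get("height", 0))
        cells.update((i, j) for i in range(x, x + w) for j in range(y, y + h))
    return cells


def leave_one_out(svg_text, render=toy_render):
    """Map each top-level element index to the visual change caused by removing it."""
    root = ET.fromstring(svg_text)
    full = render(root)
    scores = {}
    for index in range(len(root)):
        ablated = copy.deepcopy(root)      # keep the original tree intact
        ablated.remove(ablated[index])     # drop exactly one element
        scores[index] = len(full ^ render(ablated))  # cells that changed
    return scores


SVG = (
    '<svg xmlns="http://www.w3.org/2000/svg">'
    '<rect x="0" y="0" width="4" height="4"/>'
    '<rect x="1" y="1" width="2" height="2"/>'
    '<rect x="10" y="10" width="1" height="1"/>'
    "</svg>"
)

if __name__ == "__main__":
    print(leave_one_out(SVG))
```

In this toy input the second rectangle lies entirely inside the first, so removing it changes no cells and its score is zero, which is the kind of per-element signal the abstract uses to flag elements that do not contribute to the rendered result. A real evaluation would pass an actual SVG rasteriser and an image-space distance as `render` and the change measure.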
【4】Adversarial Sensor Errors for Safe and Robust Wind Turbine Fleet Control
标题:对抗性传感器错误,实现安全稳健的风力涡轮机机队控制
链接:https://arxiv.org/abs/2604.08750
作者:Julian Quick,Marcus Binder Nilsen,Andreas Bechmann,Tran Nguyen Le,Pierre-Elouan Mikael Rethore
备注:Submitted to Journal of Physics: Conference Series (Torque 2026). This is the Accepted Manuscript version of an article accepted for publication in Journal of Physics: Conference Series. IOP Publishing Ltd is not responsible for any errors or omissions in this version of the manuscript or any version derived from it. This Accepted Manuscript is published under a CC BY licence
摘要:电场级控制是一种新兴的风能技术,带来了机遇和挑战。通过中央控制器以协调的方式控制各台涡轮机,可以实现更高的风电场效率。然而,测量误差有可能干扰这一过程,甚至黑客也可能篡改中央控制器接收的遥测信号。本文提出了一个开发安全电场控制器的框架,方法是让控制器与一个旨在干扰它的对抗智能体一起训练。这就需要训练对手来干扰控制器,从而形成一种循环逻辑或“军备竞赛”。本文研究了三种共同训练主角和对手的大类训练方法,发现军备竞赛方法产生的效果最佳。这些初步结果表明,相对于基线运行策略,军备竞赛对抗训练将最坏情况下的性能下降从39%的功率损失改善为7.9%的功率增益。
摘要:Plant-level control is an emerging wind energy technology that presents opportunities and challenges. By controlling turbines in a coordinated manner via a central controller, it is possible to achieve greater wind power plant efficiency. However, there is a risk that measurement errors will confound the process, or even that hackers will alter the telemetry signals received by the central controller. This paper presents a framework for developing a safe plant controller by training it with an adversarial agent designed to confound it. This necessitates training the adversary to confound the controller, creating a sort of circular logic or "Arms Race." This paper examines three broad training approaches for co-training the protagonist and adversary, finding that an Arms Race approach yields the best results. These initial results indicate that the Arms Race adversarial training reduced worst-case performance degradation from 39% power loss to 7.9% power gain relative to a baseline operational strategy.
【5】Semantic Intent Fragmentation: A Single-Shot Compositional Attack on Multi-Agent AI Pipelines
标题:语义意图碎片化:对多智能体人工智能管道的单次合成攻击
链接:https://arxiv.org/abs/2604.08608
作者:Tanzim Ahad,Ismail Hossain,Md Jahangir Alam,Sai Puppala,Yoonpyo Lee,Syed Bahauddin Alam,Sajedul Talukder
备注:This paper got accepted for AAAI 2026 Summer Symposium
摘要:我们介绍了语义意图碎片化(SIF),这是一类针对LLM编排系统的攻击,其中单个措辞合法的请求导致编排器将任务分解为各自良性但合在一起违反安全策略的子任务。当前的安全机制在子任务级别上运行,因此每一步都能通过现有的分类器,违规只出现在组合后的计划中。SIF通过四种机制利用OWASP LLM06:2025:批量范围升级、静默数据外泄、嵌入式触发器部署和准标识符聚合,不需要注入内容,不需要修改系统,并且在初始请求之后不需要攻击者交互。我们构建了一个基于OWASP、MITRE ATLAS和NIST框架的三阶段红队管道,以生成真实的企业场景。在涵盖财务报告、信息安全和人力资源分析的14个场景中,GPT-20B编排器在71%的情况下(10/14)生成了违反策略的计划,而每个子任务看起来都是良性的。三个独立的信号验证了这一点:确定性污点分析、思维链评估,以及误报率为0%的跨模型合规性判定器。更强的编排器会提高SIF的成功率。计划级信息流跟踪与合规性评估相结合,可在执行前检测到所有攻击,表明这一组合安全缺口是可以弥合的。
摘要:We introduce Semantic Intent Fragmentation (SIF), an attack class against LLM orchestration systems where a single, legitimately phrased request causes an orchestrator to decompose a task into subtasks that are individually benign but jointly violate security policy. Current safety mechanisms operate at the subtask level, so each step clears existing classifiers -- the violation only emerges at the composed plan. SIF exploits OWASP LLM06:2025 through four mechanisms: bulk scope escalation, silent data exfiltration, embedded trigger deployment, and quasi-identifier aggregation, requiring no injected content, no system modification, and no attacker interaction after the initial request. We construct a three-stage red-teaming pipeline grounded in OWASP, MITRE ATLAS, and NIST frameworks to generate realistic enterprise scenarios. Across 14 scenarios spanning financial reporting, information security, and HR analytics, a GPT-20B orchestrator produces policy-violating plans in 71% of cases (10/14) while every subtask appears benign. Three independent signals validate this: deterministic taint analysis, chain-of-thought evaluation, and a cross-model compliance judge with 0% false positives. Stronger orchestrators increase SIF success rates. Plan-level information-flow tracking combined with compliance evaluation detects all attacks before execution, showing the compositional safety gap is closable.
【6】Joint Interference Detection and Identification via Adversarial Multi-task Learning
标题:通过对抗多任务学习的联合干扰检测和识别
链接:https://arxiv.org/abs/2604.08607
作者:H. Xu,B. He,S. Wang
备注:13 pages, 13 figures. Submitted to IEEE Transactions on Cognitive Communications and Networking
摘要:精确的干扰检测和识别对于提高通信系统在非协作无线环境中的生存性至关重要。虽然深度学习(DL)已经推进了这一领域,但现有的单任务学习(STL)方法忽略了固有的任务相关性。此外,新兴的多任务学习(MTL)方法往往缺乏量化和建模任务关系的理论基础。为了弥补这一差距,我们建立了一个理论接地MTL框架联合干扰检测,调制识别,干扰识别。首先,我们推导出MTL框架中加权预期损失的上界。这个界限明确连接MTL性能的任务相似性,量化的Wasserstein距离和可学习的任务关系系数。在这一理论的指导下,我们提出了对抗性多任务干扰检测和识别网络(AMTIDIN),它集成了对抗性训练,以最大限度地减少任务之间的分布差异,并使用自适应系数来动态建模任务相关性。至关重要的是,我们进行了任务相似性的定量分析,揭示内在的任务关系,特别是调制识别和干扰识别共享一个实质性的功能重叠,不同于干扰检测。大量的比较实验表明,AMTIDIN在鲁棒性和泛化方面显著优于其特定任务的STL基线和最先进的MTL基线,特别是在训练数据有限、信号长度短和信噪比低的挑战性条件下。
摘要:Precise interference detection and identification are crucial for enhancing the survivability of communication systems in non-cooperative wireless environments. While deep learning (DL) has advanced this field, existing single-task learning (STL) approaches neglect inherent task correlations. Furthermore, emerging multi-task learning (MTL) methods often lack a theoretical foundation for quantifying and modeling task relationships. To bridge this gap, we establish a theoretically grounded MTL framework for joint interference detection, modulation identification, and interference identification. First, we derive an upper bound for the weighted expected loss in MTL frameworks. This bound explicitly connects MTL performance to task similarity, quantified by the Wasserstein distance and learnable task relation coefficients. Guided by this theory, we present the adversarial multi-task interference detection and identification network (AMTIDIN), which integrates adversarial training to minimize distributional discrepancies across tasks and uses adaptive coefficients to model task correlations dynamically. Crucially, we conducted a quantitative analysis of task similarity to reveal intrinsic task relationships, specifically that modulation identification and interference identification share a substantial feature overlap distinct from interference detection. Extensive comparative experiments demonstrate that AMTIDIN significantly outperforms both its task-specific STL baseline and state-of-the-art MTL baselines in robustness and generalization, particularly under challenging conditions with limited training data, short signal lengths, and low signal-to-noise ratios (SNRs).
【7】GAN-Enhanced Deep Reinforcement Learning for Semantic-Aware Resource Allocation in 6G Network Slicing
标题:GAN增强的深度强化学习用于6G网络切片中的语义感知资源分配
链接:https://arxiv.org/abs/2604.08576
作者:Daniel Benniah John
备注:15 pages, 8 figures. Under review. Simulation-based evaluation for 6G network slicing
摘要:第六代(6G)无线网络必须支持异构服务:需要1 Tbps数据速率的增强型移动宽带(eMBB)、每公里支持1000万台设备的大规模机器类型通信(mMTC),以及具有0.1-1 ms延迟的超可靠低延迟通信(URLLC)。当前的资源分配存在三个局限:(1)语义盲区,在冗余数据上浪费35%的带宽;(2)离散动作量化;(3)训练多样性有限。本文提出了GAN-DDPG,这是一种生成对抗网络增强的深度确定性策略梯度框架,集成了用于流量合成的条件GAN、连续动作DDPG和语义感知奖励优化。经过统计验证的大量仿真表明了显著的改进:与基线DDPG相比,URLLC提高了22%,eMBB提高了20%,mMTC频谱效率提高了25%(所有p < 0.001),延迟降低了18%,丢包率降低了31%。
摘要:Sixth-generation (6G) wireless networks must support heterogeneous services: enhanced Mobile Broadband (eMBB) requiring 1 Tbps data rates, massive Machine-Type Communications (mMTC) supporting 10 million devices per km, and Ultra-Reliable Low-Latency Communications (URLLC) with 0.1-1 ms latency. Current resource allocation suffers from three limitations: (1) semantic blindness wasting 35% bandwidth on redundant data, (2) discrete action quantization, and (3) limited training diversity. This paper proposes GAN-DDPG, a Generative Adversarial Network-enhanced Deep Deterministic Policy Gradient framework integrating conditional GANs for traffic synthesis, continuous action DDPG, and semantic-aware reward optimization. Extensive simulations with statistical validation demonstrate significant improvements: 22% URLLC, 20% eMBB, 25% mMTC spectral efficiency gains (all p < 0.001) compared to baseline DDPG, with 18% latency and 31% packet loss reduction.
【8】MolPaQ: Modular Quantum-Classical Patch Learning for Interpretable Molecular Generation
标题:MolPaQ:用于可解释分子生成的模块化量子经典补丁学习
链接:https://arxiv.org/abs/2604.08575
作者:Syed Rameez Naqvi,Lu Peng
摘要:分子生成模型必须同时确保有效性,多样性和属性控制,但现有方法通常在这些目标之间进行权衡。我们提出了MOLPAQ,一个模块化的量子-经典生成器,它利用量子生成的潜在补丁组装分子。在QM9上预训练的$\beta$-VAE学习化学对齐的潜在流形;精简的条件器将分子描述符映射到这个空间中;参数高效的量子补丁生成器产生纠缠节点嵌入,再由价态感知聚合器将其重建为有效的分子图。采用潜在评论器和化学塑形奖励的对抗性微调实现了100%的RDKit有效性,99.75%的新颖性和0.905的多样性。除聚合指标之外,由条件器引导的预训练量子生成器相对于参数匹配的经典生成器将平均QED提高了约2.3%,并将芳香基序出现率提高了约10-12%,突出了其作为紧凑拓扑塑形算子的作用。
摘要:Molecular generative models must jointly ensure validity, diversity, and property control, yet existing approaches typically trade off among these objectives. We present MOLPAQ, a modular quantum-classical generator that assembles molecules from quantum-generated latent patches. A $\beta$-VAE pretrained on QM9 learns a chemically aligned latent manifold; a reduced conditioner maps molecular descriptors into this space; and a parameter-efficient quantum patch generator produces entangled node embeddings that a valence-aware aggregator reconstructs into valid molecular graphs. Adversarial fine-tuning with a latent critic and chemistry-shaped reward yields 100% RDKit validity, 99.75% novelty, and 0.905 diversity. Beyond aggregate metrics, the pretrained quantum generator, steered by the conditioner, improves mean QED by approx. 2.3% and increases aromatic motif incidence by approx. 10-12% relative to a parameter-matched classical generator, highlighting its role as a compact topology-shaping operator.
【9】Weak Adversarial Neural Pushforward Method for the Wigner Transport Equation
标题:Wigner输运方程的弱对抗神经前推方法
链接:https://arxiv.org/abs/2604.08763
作者:Andrew Qing He,Wei Cai,Sihong Shao
备注:9 pages, 1 algorithm
摘要:我们将弱对抗神经前推方法推广到量子系统相空间动力学的Wigner输运方程。核心贡献是结构观察:将非局部伪微分势算子与平面波测试函数相结合,产生一个狄拉克δ,它正好反转了定义维格纳势核的傅里叶变换,将算子减少为两个位移参数的势的逐点有限差分。这适用于任意维,不需要截断的Moyal系列,并视为一个黑盒功能预言机没有衍生信息的潜力。为了处理维格纳准概率分布的负性,我们引入了一个有符号的前推架构,将解决方案分解为两个非负的相空间分布与可学习的权重混合。由此产生的方法继承了原始框架的无网格,无雅可比性和可扩展性,同时将其扩展到量子设置。
摘要:We extend the Weak Adversarial Neural Pushforward Method to the Wigner transport equation governing the phase-space dynamics of quantum systems. The central contribution is a structural observation: integrating the nonlocal pseudo-differential potential operator against plane-wave test functions produces a Dirac delta that exactly inverts the Fourier transform defining the Wigner potential kernel, reducing the operator to a pointwise finite difference of the potential at two shifted arguments. This holds in arbitrary dimension, requires no truncation of the Moyal series, and treats the potential as a black-box function oracle with no derivative information. To handle the negativity of the Wigner quasi-probability distribution, we introduce a signed pushforward architecture that decomposes the solution into two non-negative phase-space distributions mixed with a learnable weight. The resulting method inherits the mesh-free, Jacobian-free, and scalable properties of the original framework while extending it to the quantum setting.
半/弱/无/有监督|不确定性|主动学习(9篇)
【1】Case-Grounded Evidence Verification: A Framework for Constructing Evidence-Sensitive Supervision
标题:基于案例的证据验证:构建证据敏感监督的框架
链接:https://arxiv.org/abs/2604.09537
作者:Soroosh Tayebi Arasteh,Mehdi Joodaki,Mahshad Lotfinia,Sven Nebelung,Daniel Truhn
摘要:基于证据的推理需要的不仅仅是将检索到的文本附加到预测中:模型应该根据所提供的证据是否支持目标声明来做出决策。在实践中,这往往是失败的,因为监督是薄弱的,证据只是松散地联系在一起的索赔,并评估不直接测试证据的依赖性。我们引入了基于案例的证据验证,这是一个通用框架,在这个框架中,模型接收本地案例背景、外部证据和结构化声明,并且必须决定证据是否支持该案例的声明。我们的主要贡献是监督建设过程中产生明确的支持的例子,连同语义控制的非支持的例子,包括反事实的错误状态和主题相关的负面,没有人工证据注释。我们实例化的框架在放射学和训练标准验证器上产生的支持任务。学习验证大大优于仅案例和仅证据基线,在正确的证据下保持强大,并在证据被删除或交换时崩溃,表明真正的证据依赖。这种行为在看不见的证据文章和外部病例分布中转移,尽管在证据源转移下性能会下降,并且对主干选择仍然敏感。总体而言,结果表明,证据基础的主要瓶颈不仅是模型能力,而且缺乏对证据因果作用进行编码的监督。
摘要:Evidence-grounded reasoning requires more than attaching retrieved text to a prediction: a model should make decisions that depend on whether the provided evidence supports the target claim. In practice, this often fails because supervision is weak, evidence is only loosely tied to the claim, and evaluation does not test evidence dependence directly. We introduce case-grounded evidence verification, a general framework in which a model receives a local case context, external evidence, and a structured claim, and must decide whether the evidence supports the claim for that case. Our key contribution is a supervision construction procedure that generates explicit support examples together with semantically controlled non-support examples, including counterfactual wrong-state and topic-related negatives, without manual evidence annotation. We instantiate the framework in radiology and train a standard verifier on the resulting support task. The learned verifier substantially outperforms both case-only and evidence-only baselines, remains strong under correct evidence, and collapses when evidence is removed or swapped, indicating genuine evidence dependence. This behavior transfers across unseen evidence articles and an external case distribution, though performance degrades under evidence-source shift and remains sensitive to backbone choice. Overall, the results suggest that a major bottleneck in evidence grounding is not only model capacity, but the lack of supervision that encodes the causal role of evidence.
【2】Bringing Clustering to MLL: Weakly-Supervised Clustering for Partial Multi-Label Learning
标题:将聚类引入MLL:用于部分多标签学习的弱监督聚类
链接:https://arxiv.org/abs/2604.09359
作者:Yu Chen,Weijun Lv,Yue Huang,Xuhuan Zhu,Fang Li
摘要:多标签学习(MLL)中的标签噪声对模型训练提出了重大挑战,特别是在部分多标签学习(PML)中,候选标签同时包含相关和不相关标签。虽然聚类提供了一种利用数据结构进行噪声识别的自然方法,但由于根本上的不兼容性,传统的聚类方法不能直接应用于多标签场景:聚类产生的隶属度值对每个实例求和为一,而多标签分配需要总和可以为任意数的二进制值。我们提出了一种新的面向PML的弱监督聚类方法(WSC-PML),通过隶属度矩阵分解在聚类与多标签学习之间架起桥梁。我们的关键创新是将聚类隶属度矩阵$\mathbf{A}$分解为两个分量:$\mathbf{A} = \boldsymbol{\Pi} \odot \mathbf{F}$,其中$\boldsymbol{\Pi}$保持聚类约束,而$\mathbf{F}$保持多标签特性。这种分解能够将无监督聚类与多标签监督无缝集成,以实现有效的标签噪声处理。WSC-PML采用三个阶段的过程:从噪声标签中学习初始原型,自适应的基于置信度的弱监督构造,以及通过迭代聚类细化进行联合优化。在24个数据集上进行的大量实验表明,我们的方法在所有评估指标上都优于六种最先进的方法。
摘要:Label noise in multi-label learning (MLL) poses significant challenges for model training, particularly in partial multi-label learning (PML) where candidate labels contain both relevant and irrelevant labels. While clustering offers a natural approach to exploit data structure for noise identification, traditional clustering methods cannot be directly applied to multi-label scenarios due to a fundamental incompatibility: clustering produces membership values that sum to one per instance, whereas multi-label assignments require binary values that can sum to any number. We propose a novel weakly-supervised clustering approach for PML (WSC-PML) that bridges clustering and multi-label learning through membership matrix decomposition. Our key innovation decomposes the clustering membership matrix $\mathbf{A}$ into two components: $\mathbf{A} = \boldsymbol{\Pi} \odot \mathbf{F}$, where $\boldsymbol{\Pi}$ maintains clustering constraints while $\mathbf{F}$ preserves multi-label characteristics. This decomposition enables seamless integration of unsupervised clustering with multi-label supervision for effective label noise handling. WSC-PML employs a three-stage process: initial prototype learning from noisy labels, adaptive confidence-based weak supervision construction, and joint optimization via iterative clustering refinement. Extensive experiments on 24 datasets demonstrate that our approach outperforms six state-of-the-art methods across all evaluation metrics.
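A minimal sketch of the decomposition $\mathbf{A} = \boldsymbol{\Pi} \odot \mathbf{F}$ described above, in plain Python. The toy matrices and the helper name `hadamard` are illustrative assumptions, not the authors' implementation. The sketch only shows how a row-stochastic factor and a binary factor combine element-wise, so that each instance can keep any number of labels.

```python
def hadamard(pi, f):
    """Element-wise (Hadamard) product of two equally shaped matrices."""
    return [[p * m for p, m in zip(pi_row, f_row)] for pi_row, f_row in zip(pi, f)]

# Hypothetical toy data: 2 instances, 3 labels.
PI = [[0.5, 0.3, 0.2],   # each row sums to 1 (clustering constraint)
      [0.1, 0.1, 0.8]]
F = [[1, 1, 0],          # binary label indicators; a row may contain any number of ones
     [0, 0, 1]]

A = hadamard(PI, F)      # entries masked out by F become zero
```

Rows of `A` no longer need to sum to one, which is the property the abstract identifies as incompatible with ordinary clustering memberships.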
【3】PDE-regularized Dynamics-informed Diffusion with Uncertainty-aware Filtering for Long-Horizon Dynamics
标题:面向长时程动力学的PDE正则化动力学感知扩散与不确定性感知滤波
链接:https://arxiv.org/abs/2604.09058
作者:Min Young Baeg,Yoon-Yeong Kim
摘要:由于累积误差,噪声放大和现有模型缺乏物理一致性,长时程时空预测仍然是一个具有挑战性的问题。虽然扩散模型为建模不确定性提供了一个概率框架,但传统方法通常依赖于均方误差目标,无法捕获由物理定律支配的潜在动态。在这项工作中,我们提出了PDYffusion,一个动力学感知的扩散框架,集成了基于PDE的正则化和不确定性感知预测,用于稳定的长期预测。该方法由两个关键部分组成:PDE正则化的插值器和基于UKF的预测器。插值器引入微分算子来强制中间状态在物理上保持一致,而预测器利用无迹卡尔曼滤波器(UKF)来显式地建模不确定性并减轻迭代预测期间的误差积累。我们提供的理论分析表明,所提出的插值器满足PDE约束的平滑特性,并且预测器在所提出的损失公式下收敛。在多个动态数据集上进行的大量实验表明,PDYffusion在CRPS和MSE方面具有优异的性能,同时保持了由SSR衡量的稳定不确定性行为。我们进一步分析了预测精度和不确定性之间的内在权衡,表明我们的方法为长时程预测提供了一个平衡且鲁棒的解决方案。
摘要:Long-horizon spatiotemporal prediction remains a challenging problem due to cumulative errors, noise amplification, and the lack of physical consistency in existing models. While diffusion models provide a probabilistic framework for modeling uncertainty, conventional approaches often rely on mean squared error objectives and fail to capture the underlying dynamics governed by physical laws. In this work, we propose PDYffusion, a dynamics-informed diffusion framework that integrates PDE-based regularization and uncertainty-aware forecasting for stable long-term prediction. The proposed method consists of two key components: a PDE-regularized interpolator and a UKF-based forecaster. The interpolator incorporates a differential operator to enforce physically consistent intermediate states, while the forecaster leverages the Unscented Kalman Filter to explicitly model uncertainty and mitigate error accumulation during iterative prediction. We provide theoretical analyses showing that the proposed interpolator satisfies PDE-constrained smoothness properties, and that the forecaster converges under the proposed loss formulation. Extensive experiments on multiple dynamical datasets demonstrate that PDYffusion achieves superior performance in terms of CRPS and MSE, while maintaining stable uncertainty behavior measured by SSR. We further analyze the inherent trade-off between prediction accuracy and uncertainty, showing that our method provides a balanced and robust solution for long-horizon forecasting.
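The abstract reports results in terms of CRPS. As a hedged illustration, the following sketch computes the standard ensemble estimate of CRPS for a scalar target, $\mathbb{E}|X - y| - \tfrac{1}{2}\mathbb{E}|X - X'|$. The function name `crps_ensemble` is an assumption, and the paper's exact evaluation protocol is not stated in the abstract.

```python
def crps_ensemble(samples, y):
    """Ensemble CRPS estimate for a scalar observation y (lower is better)."""
    n = len(samples)
    # Mean absolute error between ensemble members and the observation.
    accuracy = sum(abs(x - y) for x in samples) / n
    # Mean absolute pairwise difference between ensemble members (spread).
    spread = sum(abs(a - b) for a in samples for b in samples) / (n * n)
    return accuracy - 0.5 * spread
```

With a single ensemble member the spread term vanishes and the score reduces to the absolute error, which is why CRPS is often read as a probabilistic generalization of MAE.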
【4】Low-Data Supervised Adaptation Outperforms Prompting for Cloud Segmentation Under Domain Shift
标题:域偏移下低数据监督自适应在云分割上优于提示
链接:https://arxiv.org/abs/2604.08956
作者:Harshith Kethavath,Weiming Hu
备注:10 pages, 6 figures, to be published in EarthVision @ CVPR 2026
摘要:适应视觉语言模型遥感图像提出了一个根本性的挑战:卫星数据的视觉和语言分布远远超出自然图像预训练语料库。尽管如此,提示仍然是主要的部署范例,由特定领域的语言可以引导冻结的模型表示到专门的任务的假设驱动。我们直接在一个不匹配突出的域上测试这个假设:卫星图像的云分割。在CloudSEN 12+云分割基准上使用CLIPSeg,我们评估了60个提示变量,包括简单标签、领域术语、外观描述符和上下文线索,发现每个变量的表现都低于zero-shot基线(0.255 mIoU),工程提示的得分低至0.07 mIoU。再多的语言精炼也无法弥合CLIP的自然图像表示和卫星光谱图像之间的差距。相比之下,仅使用0.1%标记数据(约8张图像)的监督微调总体上超过了zero-shot性能,并且5-10%的数据恢复了约85%的最大可实现mIoU。完全微调始终优于低秩自适应0.03-0.09 mIoU,其中频谱模糊类的差距最大,并且在0.5%至1%的标记数据下,微调在恢复之前暂时降低了这些类的性能,聚合mIoU可以掩盖的监督下降。对于将视觉语言模型适应于专业图像的从业者来说,我们的结果传递了一个明确的信息:标记数据不是提示的昂贵替代品;这是值得的路径。
摘要:Adapting vision-language models to remote sensing imagery presents a fundamental challenge: both the visual and linguistic distributions of satellite data lie far outside natural image pretraining corpora. Despite this, prompting remains the dominant deployment paradigm, driven by the assumption that domain-specific language can guide frozen model representations toward specialized tasks. We test this assumption directly on a domain where the mismatch is prominent: cloud segmentation for satellite imagery. Using CLIPSeg on the CloudSEN12+ cloud segmentation benchmark, we evaluate 60 prompt variants spanning simple labels, domain terminology, appearance descriptors, and contextual cues, finding that every variant underperforms the zero-shot baseline (0.255 mIoU), with engineered prompts scoring as low as 0.07 mIoU. No amount of linguistic refinement bridges the gap between CLIP's natural image representations and satellite spectral imagery. In contrast, supervised fine-tuning with just 0.1% labeled data (~8 images) surpasses zero-shot performance overall, and 5-10% data recovers ~85% of maximum achievable mIoU. Full fine-tuning consistently outperforms low-rank adaptation by 0.03-0.09 mIoU, with the largest gaps for spectrally ambiguous classes, and at 0.5 to 1% labeled data, fine-tuning temporarily degrades performance on these classes before recovering, a supervision dip that aggregate mIoU can mask. For practitioners adapting vision-language models to specialized imagery, our results deliver a clear message: labeled data is not the expensive alternative to prompting; it is the worthwhile path.
【5】Accurate and Reliable Uncertainty Estimates for Deterministic Predictions: Extensions to Under and Overpredictions
标题:确定性预测的准确可靠的不确定性估计:对低估和高估预测的扩展
链接:https://arxiv.org/abs/2604.08755
作者:Rileigh Bandy,Enrico Camporeale,Andong Hu,Thomas Berger,Rebecca Morrison
摘要:计算模型支持工程和科学领域的高风险决策,从业者越来越多地寻求概率预测来量化此类模型中的不确定性。现有的方法生成预测通过采样输入参数分布或通过增加确定性输出与不确定性表示,包括分布自由和分布的方法。然而,基于采样的方法通常在计算上对于实时应用是禁止的,并且许多现有的不确定性表示要么忽略输入依赖性,要么依赖于限制性高斯假设,这些假设无法捕获不对称性和重尾行为。因此,我们扩展了准确和可靠的不确定性估计(ACCRUE)框架,以学习依赖于输入的非高斯不确定性分布,特别是两段高斯和非对称拉普拉斯形式,使用一个神经网络训练的损失函数,平衡预测的准确性和可靠性。通过合成和真实世界的实验,我们表明,所提出的方法捕获了一个依赖于输入的不确定性结构,并提高了相对于现有方法的概率预测,同时保持灵活性,模型倾斜和非高斯误差。
摘要:Computational models support high-stakes decisions across engineering and science, and practitioners increasingly seek probabilistic predictions to quantify uncertainty in such models. Existing approaches generate predictions either by sampling input parameter distributions or by augmenting deterministic outputs with uncertainty representations, including distribution-free and distributional methods. However, sampling-based methods are often computationally prohibitive for real-time applications, and many existing uncertainty representations either ignore input dependence or rely on restrictive Gaussian assumptions that fail to capture asymmetry and heavy-tailed behavior. Therefore, we extend the ACCurate and Reliable Uncertainty Estimate (ACCRUE) framework to learn input-dependent, non-Gaussian uncertainty distributions, specifically two-piece Gaussian and asymmetric Laplace forms, using a neural network trained with a loss function that balances predictive accuracy and reliability. Through synthetic and real-world experiments, we show that the proposed approach captures an input-dependent uncertainty structure and improves probabilistic forecasts relative to existing methods, while maintaining flexibility to model skewed and non-Gaussian errors.
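The two distribution families named above have simple closed-form densities. The sketch below writes them out in plain Python using their textbook parameterizations; the function and parameter names are assumptions, and nothing here reproduces the ACCRUE loss itself, which the abstract does not specify.

```python
import math

def two_piece_gaussian_pdf(x, mu, sigma_lo, sigma_hi):
    """Density with mode mu, scale sigma_lo below the mode and sigma_hi above it."""
    norm = 2.0 / (math.sqrt(2.0 * math.pi) * (sigma_lo + sigma_hi))
    sigma = sigma_lo if x <= mu else sigma_hi
    return norm * math.exp(-0.5 * ((x - mu) / sigma) ** 2)

def asymmetric_laplace_pdf(x, mu, scale, tau):
    """Asymmetric Laplace density; tau in (0, 1) sets the skew (tau = 0.5 is symmetric)."""
    u = (x - mu) / scale
    # Quantile-regression "check" function: tau*u for u >= 0, (tau - 1)*u for u < 0.
    check = u * (tau - (1.0 if u < 0 else 0.0))
    return tau * (1.0 - tau) / scale * math.exp(-check)
```

Setting `sigma_lo != sigma_hi` (or `tau != 0.5`) gives different spreads for under- and overprediction errors, which is the asymmetry a single Gaussian cannot express.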
【6】Evidential Transformation Network: Turning Pretrained Models into Evidential Models for Post-hoc Uncertainty Estimation
标题:证据转换网络:将预先训练的模型转化为证据模型,用于事后不确定性估计
链接:https://arxiv.org/abs/2604.08627
作者:Yongchan Chun,Chanhee Park,Jeongho Yoon,Jaehyung Seo,Heuiseok Lim
备注:Accepted to CVPR 2026 (Highlight)
摘要:预训练模型已经成为视觉和语言的标准,但它们通常不能提供可靠的置信度度量。现有的不确定性估计方法,如深度集成和MC辍学,往往是太昂贵的计算部署在实践中。证据深度学习(EDL)提供了一种更有效的替代方案,但它需要从一开始就训练模型以输出证据量,这对于预训练的网络来说很少是真的。为了在预训练模型中实现EDL风格的不确定性估计,我们提出了证据转换网络(ETN),这是一个轻量级的事后模块,可以将预训练的预测器转换为证据模型。ETN在logit空间中运行:它学习logits的样本相关仿射变换,并将变换后的输出解释为用于不确定性估计的Dirichlet分布的参数。我们评估ETN的图像分类和大型语言模型问答基准下的分布和分布设置。ETN始终提高事后基线的不确定性估计,同时保持准确性,只增加最小的计算开销。
摘要:Pretrained models have become standard in both vision and language, yet they typically do not provide reliable measures of confidence. Existing uncertainty estimation methods, such as deep ensembles and MC dropout, are often too computationally expensive to deploy in practice. Evidential Deep Learning (EDL) offers a more efficient alternative, but it requires models to be trained to output evidential quantities from the start, which is rarely true for pretrained networks. To enable EDL-style uncertainty estimation in pretrained models, we propose the Evidential Transformation Network (ETN), a lightweight post-hoc module that converts a pretrained predictor into an evidential model. ETN operates in logit space: it learns a sample-dependent affine transformation of the logits and interprets the transformed outputs as parameters of a Dirichlet distribution for uncertainty estimation. We evaluate ETN on image classification and large language model question-answering benchmarks under both in-distribution and out-of-distribution settings. ETN consistently improves uncertainty estimation over post-hoc baselines while preserving accuracy and adding only minimal computational overhead.
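To make the logit-space idea concrete, here is a minimal sketch of turning affinely transformed logits into Dirichlet parameters. In ETN the scale and shift are sample-dependent and learned; here they are plain arguments. The softplus-plus-one evidence mapping and the `K / sum(alpha)` uncertainty score are common evidential deep learning conventions assumed for illustration, not details confirmed by the abstract.

```python
import math

def softplus(x):
    """Numerically stable softplus, log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def evidential_from_logits(logits, scale, shift):
    """Map logits to Dirichlet parameters, expected class probabilities, and an uncertainty score."""
    transformed = [scale * z + shift for z in logits]   # affine transformation in logit space
    alpha = [softplus(t) + 1.0 for t in transformed]    # assumed convention: evidence + 1
    strength = sum(alpha)                               # Dirichlet strength
    probs = [a / strength for a in alpha]               # Dirichlet mean
    uncertainty = len(alpha) / strength                 # close to 1 when total evidence is near zero
    return alpha, probs, uncertainty
```

Because only the logits are transformed, a module of this shape can sit on top of a frozen predictor, which matches the post-hoc setting the abstract describes.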
【7】Practical Bayesian Inference for Speech SNNs: Uncertainty and Loss-Landscape Smoothing
标题:语音SNN的实用Bayesian推理:不确定性和损失景观平滑
链接:https://arxiv.org/abs/2604.08624
作者:Yesmine Abdennadher,Philip N. Garner
摘要:脉冲神经网络(SNN)由于其特定的动态特性,自然适合语音处理任务,这些特性使它们能够处理时间数据。然而,SNN中基于阈值的脉冲生成直观上会导致有棱角或不规则的预测景观。我们探讨了对权重使用贝叶斯学习方法对不规则预测景观的影响。对于代理梯度SNN,我们还探讨了改进的变分在线牛顿(IVON)方法的应用,这是一种高效的变分方法。所提出方法的性能在Heidelberg Digits和Speech Commands数据集上进行评估。考虑到确定性预测景观的棱角性质,我们的假设是贝叶斯方法将带来更平滑,更规则的预测景观。对所提出方法的实验评估表明,其在负对数似然和Brier得分上的性能有所提升。此外,基于权重空间的一维切片,与确定性方法相比,所提出的方法产生了更平滑,更规则的预测景观。
摘要:Spiking Neural Networks (SNNs) are naturally suited for speech processing tasks due to their specific dynamics, which allow them to handle temporal data. However, the threshold-based generation of spikes in SNNs intuitively causes an angular or irregular predictive landscape. We explore the effect of using the Bayesian learning approach for the weights on the irregular predictive landscape. For the surrogate-gradient SNNs, we also explore the application of the Improved Variational Online Newton (IVON) approach, which is an efficient variational approach. The performance of the proposed approach is evaluated on the Heidelberg Digits and Speech Commands datasets. The hypothesis is that the Bayesian approach will result in a smoother and more regular predictive landscape, given the angular nature of the deterministic predictive landscape. The experimental evaluation of the proposed approach shows improved performance on the negative log-likelihood and Brier score. Furthermore, the proposed approach has resulted in a smoother and more regular predictive landscape compared to the deterministic approach, based on the one-dimensional slices of the weight space.
【8】Variational Quantum Physics-Informed Neural Networks for Hydrological PDE-Constrained Learning with Inherent Uncertainty Quantification
标题:变分量子物理信息神经网络用于具有固有不确定性量化的水文PDE约束学习
链接:https://arxiv.org/abs/2604.09374
作者:Prasad Nimantha Madusanka Ukwatta Hewage,Midhun Chakkravarthy,Ruvan Kumara Abeysekara
备注:25 pages, 6 tables. Code available at https://github.com/nimanpra/HQC-PINN-Flood-Prediction
摘要:我们提出了一种混合量子-经典物理信息神经网络(HQC-PINN),它将参数化的变分量子电路集成到PINN框架中,用于水文PDE约束学习。我们的架构通过可训练的角度编码将多源遥感特征编码为量子态,通过具有纠缠层的硬件高效变分拟设(ansatz)来处理它们,并使用圣维南浅水方程和曼宁流动方程作为可微物理损失项来约束输出。量子测量固有的随机性为不确定性量化提供了一种自然的机制,而不需要显式的贝叶斯推理机制。我们还介绍了一种量子迁移学习协议,该协议先在多灾害数据上进行预训练,再对洪水特定事件进行微调。对来自斯里兰卡卡卢河流域的多模态卫星和气象数据的数值模拟表明,与等效的经典PINN相比,HQC-PINN达到收敛所需的训练轮数减少约3倍,可训练参数减少约44%,同时保持有竞争力的分类精度。理论分析表明,水文物理约束缩小了有效的优化景观,为变分量子电路中的贫瘠高原问题提供了一种自然的缓解。这项工作建立了量子增强物理信息学习在水文预测中的首次应用,并展示了在环境科学中实现量子优势的可行途径。
摘要:We propose a Hybrid Quantum-Classical Physics-Informed Neural Network (HQC-PINN) that integrates parameterized variational quantum circuits into the PINN framework for hydrological PDE-constrained learning. Our architecture encodes multi-source remote sensing features into quantum states via trainable angle encoding, processes them through a hardware-efficient variational ansatz with entangling layers, and constrains the output using the Saint-Venant shallow water equations and Manning's flow equation as differentiable physics loss terms. The inherent stochasticity of quantum measurement provides a natural mechanism for uncertainty quantification without requiring explicit Bayesian inference machinery. We further introduce a quantum transfer learning protocol that pre-trains on multi-hazard disaster data before fine-tuning on flood-specific events. Numerical simulations on multi-modal satellite and meteorological data from the Kalu River basin, Sri Lanka, show that the HQC-PINN achieves convergence in ~3x fewer training epochs and uses ~44% fewer trainable parameters compared to an equivalent classical PINN, while maintaining competitive classification accuracy. Theoretical analysis indicates that hydrological physics constraints narrow the effective optimization landscape, providing a natural mitigation against barren plateaus in variational quantum circuits. This work establishes the first application of quantum-enhanced physics-informed learning to hydrological prediction and demonstrates a viable path toward quantum advantage in environmental science.
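One of the physics terms named above, Manning's flow equation, is simple enough to write out. The sketch below uses the standard SI form $V = \frac{1}{n} R_h^{2/3} S^{1/2}$ and a squared-residual penalty. The function names and the choice of a squared residual are assumptions for illustration; the abstract does not say how the authors formulate their loss term.

```python
def manning_velocity(n_roughness, hydraulic_radius_m, slope):
    """Mean flow velocity (m/s) from Manning's equation in SI units."""
    return (1.0 / n_roughness) * hydraulic_radius_m ** (2.0 / 3.0) * slope ** 0.5

def manning_residual_loss(predicted_velocity, n_roughness, hydraulic_radius_m, slope):
    """Hypothetical physics penalty: squared gap between a prediction and Manning's velocity."""
    target = manning_velocity(n_roughness, hydraulic_radius_m, slope)
    return (predicted_velocity - target) ** 2
```

In a PINN-style setup a term of this kind would be added to the data loss so that predictions violating the flow relation are penalized during training.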
【9】Active Learning for Generalizable Detonation Performance Prediction of Energetic Materials
标题:含能材料可推广爆轰性能预测的主动学习
链接:https://arxiv.org/abs/2604.08744
作者:R. Seaton Ullberg,Megan C. Davis,Jeremy N. Schroeder,Andrew H. Salij,M. J. Cawkwell,Christopher J. Snyder,Wilton J. M. Kort-Kamp,Ivana Matanovic
摘要:新能源材料的发现对于推进从国防到私营工业的技术至关重要。然而,实验方法仍然缓慢且昂贵,而计算替代方案需要精确的材料性质输入,这些输入通常成本高昂,限制了它们在广阔的化学空间中有效预测爆炸性能的能力。我们通过主动学习策略来应对这一挑战,该策略集成了密度泛函理论计算、热化学建模、消息传递神经网络和贝叶斯优化。由此产生的高通量工作流程通过以有针对性的方式选择新分子来迭代扩展训练数据集,该方式平衡了对广泛化学空间的探索与对有前途的高性能候选物的利用。这种方法产生了最大的公开可用的潜在CHNO炸药数据库,该数据库从超过700亿个候选物的初始库中提取,并且产生了能够准确预测爆轰性能(R$^2$ > 0.98)的可推广的替代模型。对这一迄今为止最大的数据集进行的特征重要性分析表明,氧平衡是爆震性能的主要驱动因素,并辅之以局部电子结构、密度和特定官能团的存在。化学信息学分析突出了具有相似性能指标的高能材料如何倾向于聚集在不同的化学空间中,为未来的合成研究提供了更明确的方向。总之,替代模型,数据库和由此产生的化学见解为高通量筛选和有针对性地发现跨越化学空间的不同和以前未探索的区域的新高能材料提供了宝贵的基础。
摘要:The discovery of new energetic materials is critical for advancing technologies from defense to private industry. However, experimental approaches remain slow and expensive while computational alternatives require accurate material property inputs that are often costly to obtain, limiting their ability to efficiently predict detonation performance across a vast chemical space. We address this challenge through an active learning strategy that integrates density functional theory calculations, thermochemical modeling, message-passing neural networks, and Bayesian optimization. The resulting high-throughput workflow iteratively expands the training dataset by selecting new molecules in a targeted manner that balances the exploration of broad chemical space with the exploitation of promising high-performing candidates. This approach yields the largest publicly available database of potential CHNO explosives drawn from an initial pool of more than 70 billion candidates and a generalizable surrogate model capable of accurately predicting detonation performance (R$^2$ > 0.98). Feature importance analysis on this largest-to-date dataset reveals that oxygen balance is the dominant driver of detonation performance, complemented by contributions from local electronic structure, density, and the presence of specific functional groups. Cheminformatics analysis highlights how energetic materials with similar performance metrics tend to cluster in distinct chemical spaces offering a clearer direction for future synthesis studies. Together, the surrogate model, database, and resulting chemical insights provide a valuable foundation for high-throughput screening and targeted discovery of new energetic materials spanning diverse and previously unexplored regions of chemical space.
迁移|Zero/Few/One-Shot|自适应(9篇)
【1】ANTIC: Adaptive Neural Temporal In-situ Compressor
标题:ANTIC:自适应神经时间原位压缩器
链接:https://arxiv.org/abs/2604.09543
作者:Sandeep S. Cranganore,Andrei Bodnar,Gianluca Galleti,Fabian Paischer,Johannes Brandstetter
备注:31 pages, 19 figures, 9 Tables
摘要:由大规模高维偏微分方程(PDE)支配的高分辨率时空演化场,其持久性存储需求已达到PB到EB的规模。对Navier-Stokes方程,磁流体力学,等离子体物理学或双黑洞合并进行建模的瞬态模拟所产生的数据量,对于现代高性能计算(HPC)基础设施来说令人望而却步。为了解决这个瓶颈,我们引入了ANTIC(自适应神经时间原位压缩器),一个端到端的原位压缩管道。ANTIC由一个为高维物理量身定制的自适应时间选择器组成,该选择器在模拟时识别并筛选有信息量的快照,并结合基于持续微调的空间神经压缩模块,该模块使用神经场学习相邻快照之间的残差更新。通过在单次流式处理中运行,ANTIC能够实现时间和空间分量的组合压缩,并有效地减轻了对整个时间演化轨迹进行显式磁盘存储的需求。实验结果展示了数个数量级的存储缩减与物理精度之间的关系。
摘要:The persistent storage requirements for high-resolution, spatiotemporally evolving fields governed by large-scale and high-dimensional partial differential equations (PDEs) have reached the petabyte-to-exabyte scale. Transient simulations modeling Navier-Stokes equations, magnetohydrodynamics, plasma physics, or binary black hole mergers generate data volumes that are prohibitive for modern high-performance computing (HPC) infrastructures. To address this bottleneck, we introduce ANTIC (Adaptive Neural Temporal in situ Compressor), an end-to-end in situ compression pipeline. ANTIC consists of an adaptive temporal selector tailored to high-dimensional physics that identifies and filters informative snapshots at simulation time, combined with a spatial neural compression module based on continual fine-tuning that learns residual updates between adjacent snapshots using neural fields. By operating in a single streaming pass, ANTIC enables a combined compression of temporal and spatial components and effectively alleviates the need for explicit on-disk storage of entire time-evolved trajectories. Experimental results demonstrate how storage reductions of several orders of magnitude relate to physics accuracy.
【2】AdaCubic: An Adaptive Cubic Regularization Optimizer for Deep Learning
标题:AdaCubic:一种用于深度学习的自适应立方正规化优化器
链接:https://arxiv.org/abs/2604.09437
作者:Ioannis Tsingalis,Constantine Kotropoulos,Corentin Briat
摘要:提出了一种新的正则化技术AdaCubic,该技术自适应立方项的权重。AdaCubic的核心是一个具有立方约束的辅助优化问题,它动态调整牛顿立方正则化方法中立方项的权重。我们使用Hutchinson的方法来近似Hessian矩阵,从而降低计算成本。我们证明了AdaCubic继承了三次正则化牛顿法的局部收敛保证。我们在计算机视觉,自然语言处理和信号处理任务中的实验表明,AdaCubic优于或竞争几个广泛使用的优化器。与其他需要超参数微调的自适应算法不同,AdaCubic是用一组固定的超参数来评估的,这使得它在微调不可行的情况下成为一个非常有吸引力的优化器。这使得AdaCubic成为研究人员和从业者的一个有吸引力的选择。据我们所知,AdaCubic是第一个在可扩展的深度学习应用程序中利用立方正则化的优化器。
摘要:A novel regularization technique, AdaCubic, is proposed that adapts the weight of the cubic term. The heart of AdaCubic is an auxiliary optimization problem with cubic constraints that dynamically adjusts the weight of the cubic term in Newton's cubic regularized method. We use Hutchinson's method to approximate the Hessian matrix, thereby reducing computational cost. We demonstrate that AdaCubic inherits the cubically regularized Newton method's local convergence guarantees. Our experiments in Computer Vision, Natural Language Processing, and Signal Processing tasks demonstrate that AdaCubic outperforms or competes with several widely used optimizers. Unlike other adaptive algorithms that require hyperparameter fine-tuning, AdaCubic is evaluated with a fixed set of hyperparameters, rendering it a highly attractive optimizer in settings where fine-tuning is infeasible. This makes AdaCubic an attractive option for researchers and practitioners alike. To our knowledge, AdaCubic is the first optimizer to leverage cubic regularization in scalable deep learning applications.
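The abstract mentions Hutchinson's method for approximating the Hessian. As a hedged illustration, this sketch estimates the Hessian diagonal as the average of $z \odot (Hz)$ over random Rademacher vectors $z$. The explicit toy matrix `H` and the names `hutchinson_diagonal` and `hvp` are assumptions. In practice the product $Hz$ would come from automatic differentiation rather than a stored matrix, and how AdaCubic uses the estimate is not described in the abstract.

```python
import random

def hutchinson_diagonal(hvp, dim, num_samples, rng):
    """Estimate diag(H) using only Hessian-vector products hvp(z)."""
    estimate = [0.0] * dim
    for _ in range(num_samples):
        z = [rng.choice((-1.0, 1.0)) for _ in range(dim)]  # Rademacher probe vector
        hz = hvp(z)
        for i in range(dim):
            estimate[i] += z[i] * hz[i] / num_samples
    return estimate

# Hypothetical toy Hessian, stored explicitly only for illustration.
H = [[2.0, 0.5],
     [0.5, 3.0]]

def hvp(v):
    """Matrix-vector product standing in for an autodiff Hessian-vector product."""
    return [sum(H[i][j] * v[j] for j in range(len(v))) for i in range(len(H))]

diag_estimate = hutchinson_diagonal(hvp, dim=2, num_samples=4000, rng=random.Random(0))
```

Each probe costs one Hessian-vector product, so the full Hessian is never formed, which is the cost saving the abstract alludes to.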
【3】Meta-Learned Basis Adaptation for Parametric Linear PDEs
标题:参数线性偏微分方程的元学习基自适应算法
链接:https://arxiv.org/abs/2604.09289
作者:Vikas Dwivedi,Monica Sigovan,Bruno Sixou
摘要:我们提出了一个混合物理信息框架,通过将元学习预测器与最小二乘校正器相结合来求解参数线性偏微分方程(PDE)族。该预测器被称为KAPI(Kernel-Adaptive Physics-Informed Meta-learner),是一个浅层任务条件模型,它将查询坐标和PDE参数映射到解值,同时在内部生成一个可解释的,任务自适应的高斯基几何。一个轻量级的元网络将PDE参数映射到基中心,宽度和活动模式,从而学习近似空间应如何在参数族中适应。这个由预测器生成的几何被转移到第二阶段校正器,该校正器用背景基对其进行增强,并通过一次性的物理信息极限学习机(PIELM)风格的最小二乘求解来计算最终解。我们在四个线性PDE族上评估了该方法,涵盖扩散,输运,混合平流-扩散和变速输运。在这些情况下,预测器通过局部化且与输运对齐的基放置捕获有意义的物理,而校正器进一步提高精度,通常提高一个或多个数量级。与参数PINN,物理信息DeepONet和均匀网格PIELM校正器的比较突出了预测器引导的基自适应作为参数PDE求解的可解释且高效策略的价值。
摘要:We propose a hybrid physics-informed framework for solving families of parametric linear partial differential equations (PDEs) by combining a meta-learned predictor with a least-squares corrector. The predictor, termed KAPI (Kernel-Adaptive Physics-Informed meta-learner), is a shallow task-conditioned model that maps query coordinates and PDE parameters to solution values while internally generating an interpretable, task-adaptive Gaussian basis geometry. A lightweight meta-network maps PDE parameters to basis centers, widths, and activity patterns, thereby learning how the approximation space should adapt across the parametric family. This predictor-generated geometry is transferred to a second-stage corrector, which augments it with a background basis and computes the final solution through a one-shot physics-informed Extreme Learning Machine (PIELM)-style least-squares solve. We evaluate the method on four linear PDE families spanning diffusion, transport, mixed advection--diffusion, and variable-speed transport. Across these cases, the predictor captures meaningful physics through localized and transport-aligned basis placement, while the corrector further improves accuracy, often by one or more orders of magnitude. Comparisons with parametric PINNs, physics-informed DeepONet, and uniform-grid PIELM correctors highlight the value of predictor-guided basis adaptation as an interpretable and efficient strategy for parametric PDE solving.
【4】Plasticity-Enhanced Multi-Agent Mixture of Experts for Dynamic Objective Adaptation in UAVs-Assisted Emergency Communication Networks
标题:在无人机辅助紧急通信网络中进行动态目标适应的可塑性增强多智能体专家混合
链接:https://arxiv.org/abs/2604.09028
作者:Wen Qiu,Zhiqiang He,Wei Zhao,Hiroshi Masui
备注:20 pages, 12 figures, 3 tables
摘要:作为空中基站的无人驾驶飞行器可以在灾害发生后迅速恢复连接,但用户移动性和交通需求的突然变化会改变服务质量的权衡,并导致强烈的非平稳性。深度强化学习策略在这种转变下会遭受可塑性损失,因为表征崩溃和神经元休眠会损害适应性。我们提出了可塑性增强的多智能体混合专家(PE-MAMoE),一个集中训练与分散执行框架建立在多智能体的近端政策优化。PE-MAMoE为每个无人机配备了一个稀疏门控的专家演员混合体,其路由器每步选择一个专家。非参数相位控制器在相位切换后注入简短的、仅限专家的随机扰动,重置动作对数标准差,退火熵和学习速率,并调度路由器温度,所有这些都是为了在不破坏安全行为的情况下重新调整策略。我们推导出一个动态的遗憾界显示的跟踪误差尺度与环境变化和累积噪声能量。在具有移动用户和3GPP风格信道的相位驱动模拟器中,PE-MAMoE在最佳基线上将归一化四分位数平均返回提高了26.3%,将服务用户容量提高了12.8%,并将冲突减少了约75%。诊断确认持续较高的专家功能排名和周期性休眠神经元恢复政权开关。
摘要:Unmanned aerial vehicles serving as aerial base stations can rapidly restore connectivity after disasters, yet abrupt changes in user mobility and traffic demands shift the quality of service trade-offs and induce strong non-stationarity. Deep reinforcement learning policies suffer from plasticity loss under such shifts, as representation collapse and neuron dormancy impair adaptation. We propose plasticity enhanced multi-agent mixture of experts (PE-MAMoE), a centralized training with decentralized execution framework built on multi-agent proximal policy optimization. PE-MAMoE equips each UAV with a sparsely gated mixture of experts actor whose router selects a single specialist per step. A non-parametric Phase Controller injects brief, expert-only stochastic perturbations after phase switches, resets the action log-standard-deviation, anneals entropy and learning rate, and schedules the router temperature, all to re-plasticize the policy without destabilizing safe behaviors. We derive a dynamic regret bound showing the tracking error scales with both environment variation and cumulative noise energy. In a phase-driven simulator with mobile users and 3GPP-style channels, PE-MAMoE improves normalized interquartile mean return by 26.3% over the best baseline, increases served-user capacity by 12.8%, and reduces collisions by approximately 75%. Diagnostics confirm persistently higher expert feature rank and periodic dormant-neuron recovery at regime switches.
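A minimal sketch of a temperature-scaled router that picks one expert per step, as the abstract describes at a high level. The function names are assumptions, and whether PE-MAMoE selects by argmax or by sampling is not stated in the abstract. The sketch uses argmax and shows only how temperature flattens or sharpens the routing distribution.

```python
import math

def router_probs(logits, temperature):
    """Softmax over router logits divided by a positive temperature."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)                                # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def select_expert(logits, temperature=1.0):
    """Return the index of the single most probable expert and the routing distribution."""
    probs = router_probs(logits, temperature)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs
```

A higher temperature spreads probability mass across experts, which is one plausible reason to schedule it after a phase switch, while a lower temperature commits the router to one specialist.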
【5】ASTRA: Adaptive Semantic Tree Reasoning Architecture for Complex Table Question Answering
标题:ASTRA:用于复杂表问题解答的自适应语义树推理体系结构
链接:https://arxiv.org/abs/2604.08999
作者:Xiaoke Guo,Songze Li,Zhiqiang Liu,Zhaoyan Gong,Yuanxiang Liu,Huajun Chen,Wen Zhang
摘要:表序列化仍然是大型语言模型(LLM)在复杂表问题回答中的关键瓶颈,受到结构忽略,表示间隙和推理不透明等挑战的阻碍。现有的序列化方法无法捕获显式的层次结构,缺乏模式的灵活性,而目前的基于树的方法遭受有限的语义适应性。为了解决这些限制,我们提出了ASTRA(自适应语义树推理架构),包括两个主要模块,AdaSTR和DuTR。首先,我们介绍了AdaSTR,它利用LLM的全局语义感知来将表重构为逻辑语义树。这种序列化显式地对层次依赖进行建模,并采用自适应机制来优化基于表规模的构造策略。其次,在此结构的基础上,我们提出了DuTR,一个双模式推理框架,集成了基于树搜索的文本导航语言对齐和符号代码执行精确验证。复杂表基准测试的实验表明,我们的方法达到了最先进的(SOTA)性能。
摘要:Table serialization remains a critical bottleneck for Large Language Models (LLMs) in complex table question answering, hindered by challenges such as structural neglect, representation gaps, and reasoning opacity. Existing serialization methods fail to capture explicit hierarchies and lack schema flexibility, while current tree-based approaches suffer from limited semantic adaptability. To address these limitations, we propose ASTRA (Adaptive Semantic Tree Reasoning Architecture) including two main modules, AdaSTR and DuTR. First, we introduce AdaSTR, which leverages the global semantic awareness of LLMs to reconstruct tables into Logical Semantic Trees. This serialization explicitly models hierarchical dependencies and employs an adaptive mechanism to optimize construction strategies based on table scale. Second, building on this structure, we present DuTR, a dual-mode reasoning framework that integrates tree-search-based textual navigation for linguistic alignment and symbolic code execution for precise verification. Experiments on complex table benchmarks demonstrate that our method achieves state-of-the-art (SOTA) performance.
【6】Modality-Aware Zero-Shot Pruning and Sparse Attention for Efficient Multimodal Edge Inference
标题:模态感知的Zero-Shot剪枝与稀疏注意力,用于高效多模态边缘推理
链接:https://arxiv.org/abs/2604.08971
作者:Yueyuan Sui,Payal Mohapatra,Doğaç Eldenk,Haodong Yang,Yiting Zhang,Haoyan Zhang,Qi Zhu,Stephen Xia
摘要:边缘设备越来越多地运行多模态传感管道,这些管道必须在功率预算波动和不可预测的传感器丢失的情况下保持准确。现有的剪枝方法在这些条件下失败:它们通常需要在压缩后进行微调,消耗超过$10\times$的部署能量,并且它们分配静态重要性分数,而这些分数对存在哪些传感器是盲目的。我们提出了SentryFuse框架,它通过两个关键组成部分共同应对这两个挑战。首先,SentryGate通过一阶显著性监督在训练期间学习模态条件重要性分数,然后在部署时剪枝注意力头和前馈通道,而无需微调。其次,SentryAttend用稀疏分组查询注意力取代了密集的自注意力(当代多模态架构中的一个关键瓶颈),在三种不同的多模态架构中使GFLOPs净减少15%。在三个应用程序和多模态骨干网络中,SentryGate相比最强的剪枝基线实现了12.7%的平均准确率提升,在模态丢失条件下提升高达18%。总之,SentryFuse将内存减少了28.2%,并将延迟降低了高达$1.63\times$,而无需进一步微调,从而将模态感知zero-shot压缩确立为在异构边缘硬件上实现多模态智能的实用途径。
摘要:Edge devices increasingly run multimodal sensing pipelines that must remain accurate despite fluctuating power budgets and unpredictable sensor dropout. Existing pruning methods fail under these conditions: they generally require fine-tuning after compression, consuming over $10\times$ the deployment energy, and they assign static importance scores that are blind to which sensors are present. We present the SentryFuse framework, which addresses both challenges jointly through two key components. First, SentryGate learns modality-conditioned importance scores during training via first-order saliency supervision and then prunes attention heads and feed-forward channels at deployment without fine-tuning. Second, SentryAttend replaces dense self-attention, a key bottleneck in contemporary multimodal architectures, with sparse grouped-query attention, yielding a net 15% reduction in GFLOPs across three different multimodal architectures. Across three applications and multimodal backbones, SentryGate achieves a 12.7% average accuracy improvement over the strongest pruning baseline, and up to 18% under modality dropout conditions. Together, SentryFuse reduces memory by 28.2% and lowers latency by up to $1.63\times$ without further fine-tuning, establishing modality-aware zero-shot compression as a practical path to multimodal intelligence on heterogeneous edge hardware.
【7】WOMBET: World Model-based Experience Transfer for Robust and Sample-efficient Reinforcement Learning
标题:WOMBET:基于世界模型的经验转移,实现稳健且样本高效的强化学习
链接:https://arxiv.org/abs/2604.08958
作者:Mintae Kim,Koushil Sreenath
备注:13 pages, 6 figures, 8th Annual Learning for Dynamics & Control Conference (L4DC)
摘要:机器人技术中的强化学习(RL)通常受到数据收集的成本和风险的限制,激励从源任务到目标任务的经验转移。离线到在线RL利用先前的数据,但通常假设给定的固定数据集,并且不解决如何生成可靠的数据进行传输。我们提出了基于世界模型的经验转移(WOMBET),一个框架,共同生成和利用先验数据。WOMBET在源任务中学习世界模型,并通过不确定性惩罚规划生成离线数据,然后过滤具有高回报和低认知不确定性的轨迹。然后,它使用离线和在线数据之间的自适应采样在目标任务中执行在线微调,从而实现从先前驱动的初始化到特定于任务的自适应的稳定过渡。我们发现,不确定性惩罚的目标提供了一个下界的真实回报,并获得一个有限样本的误差分解捕获分布失配和近似误差。从经验上讲,WOMBET在连续控制基准的强基线上提高了样本效率和最终性能,证明了联合优化数据生成和传输的好处。
摘要:Reinforcement learning (RL) in robotics is often limited by the cost and risk of data collection, motivating experience transfer from a source task to a target task. Offline-to-online RL leverages prior data but typically assumes a given fixed dataset and does not address how to generate reliable data for transfer. We propose \textit{World Model-based Experience Transfer} (WOMBET), a framework that jointly generates and utilizes prior data. WOMBET learns a world model in the source task and generates offline data via uncertainty-penalized planning, followed by filtering trajectories with high return and low epistemic uncertainty. It then performs online fine-tuning in the target task using adaptive sampling between offline and online data, enabling a stable transition from prior-driven initialization to task-specific adaptation. We show that the uncertainty-penalized objective provides a lower bound on the true return and derive a finite-sample error decomposition capturing distribution mismatch and approximation error. Empirically, WOMBET improves sample efficiency and final performance over strong baselines on continuous control benchmarks, demonstrating the benefit of jointly optimizing data generation and transfer.
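As a rough illustration of the uncertainty-penalized filtering idea described in this abstract, the sketch below scores each candidate trajectory by its mean predicted return minus a penalty proportional to ensemble disagreement, then keeps the top-scoring ones. The ensemble-of-returns input, the penalty weight `lam`, and all function names are assumptions made for illustration; this is not the paper's implementation.

```python
from statistics import mean, pstdev

def penalized_return(ensemble_returns, lam=1.0):
    """Mean predicted return minus lam times the ensemble disagreement (std)."""
    return mean(ensemble_returns) - lam * pstdev(ensemble_returns)

def filter_trajectories(candidates, lam=1.0, keep=2):
    """Return the ids of the `keep` trajectories with the highest penalized return.

    `candidates` maps a trajectory id to the returns predicted by each member
    of a hypothetical world-model ensemble.
    """
    scored = {tid: penalized_return(r, lam) for tid, r in candidates.items()}
    return sorted(scored, key=scored.get, reverse=True)[:keep]

candidates = {
    "a": [10.0, 10.0, 10.0],  # solid return, ensemble members agree
    "b": [18.0, 6.0, 12.0],   # higher mean, but members disagree strongly
    "c": [3.0, 3.0, 3.0],     # low return, ensemble members agree
}
selected = filter_trajectories(candidates, lam=1.0, keep=2)
```

With `lam=1.0`, trajectory "a" ranks above "b" even though "b" has the higher mean, because the disagreement penalty outweighs the difference in mean return.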
【8】Adaptive Candidate Point Thompson Sampling for High-Dimensional Bayesian Optimization
标题:用于高维贝叶斯优化的自适应候选点Thompson采样
链接:https://arxiv.org/abs/2604.08891
作者:Donney Fan,Geoff Pleiss
备注:AISTATS 2026
摘要:在贝叶斯优化中,汤普森采样通过从目标函数最大化器上的后验分布中采样来选择评估点。由于该采样问题对于高斯过程(GP)代理模型是难以处理的,因此后验分布通常限于固定的离散化(即候选点),其随着维度增加而指数地稀疏。虽然以前的工作旨在通过可扩展的GP近似增加候选点密度,我们的正交方法通过在采样过程中自适应地缩小搜索空间来增加密度。具体来说,我们引入了自适应候选汤普森采样(Adaptive Candidate Thompson Sampling,缩写为ACTS),它在代理模型样本的梯度引导下在子空间中生成候选点。ACTS是现有TS方法(包括那些使用信任区域或其他局部近似的方法)的简单直接替代品,在合成和真实世界的基准测试中产生更好的最大值样本和改进的优化。
摘要:In Bayesian optimization, Thompson sampling selects the evaluation point by sampling from the posterior distribution over the objective function maximizer. Because this sampling problem is intractable for Gaussian process (GP) surrogates, the posterior distribution is typically restricted to fixed discretizations (i.e., candidate points) that become exponentially sparse as dimensionality increases. While previous works aim to increase candidate point density through scalable GP approximations, our orthogonal approach increases density by adaptively reducing the search space during sampling. Specifically, we introduce Adaptive Candidate Thompson Sampling (ACTS), which generates candidate points in subspaces guided by the gradient of a surrogate model sample. ACTS is a simple drop-in replacement for existing TS methods -- including those that use trust regions or other local approximations -- producing better samples of maxima and improved optimization across synthetic and real-world benchmarks.
【9】Transferable FB-GNN-MBE Framework for Potential Energy Surfaces: Data-Adaptive Transfer Learning in Deep Learned Many-Body Expansion Theory
标题:势能面的可转移FB-GNN-MBE框架:深度学习多体展开理论中的数据自适应迁移学习
链接:https://arxiv.org/abs/2604.09320
作者:Siqi Chen,Zhiqiang Wang,Yili Shen,Xianqi Deng,Xi Cheng,Cheng-Wei Ju,Jun Yi,Guo Ling,Dieaa Alhmoud,Hui Guan,Zhou Lin
备注:Under review with The Journal of Chemical Physics. Main text: 23 pages, 11 figures, and 1 table. Supplementary Materials: 28 pages, 6 figures, 15 tables, 4 pseudo-algorithms
摘要:复杂化学系统的机理理解和合理设计依赖于对单个构建块之外的电子结构的快速准确预测。然而,如果系统超过数百个原子,第一性原理量子力学(QM)建模变得不切实际。在这项研究中,我们通过将基于片段的图神经网络(FB-GNN)集成到多体展开(MBE)理论中开发了FB-GNN-MBE,并证明了其能够以可管理的准确性、复杂性和可解释性为分层结构系统再现第一性原理势能面(PES)。具体来说,我们将整个系统划分为基本构建块(片段),使用QM模型评估它们的单片段能量,并使用FB-GNN训练的结构-性质关系处理多片段相互作用。我们的研究表明,FB-GNN-MBE在预测水、苯酚和混合物基准的两体(2B)和三体(3B)能量以及水和苯酚二聚体的一维解离曲线方面实现了化学精度。为了以最小的计算成本和数据需求在各种系统中转移FB-GNN-MBE的成功,我们开发并验证了师生学习协议。在混合密度水团簇系综上训练的重量级FB-GNN(教师)提炼其学到的知识,并将其传递给轻量级GNN(学生),后者随后在均匀密度(H2O)21团簇系综上进行微调。这种迁移学习策略实现了对不同大小水团簇的2B和3B能量的高效准确预测,而无需再训练。我们的可转移FB-GNN-MBE框架优于传统的非FB-GNN模型,并在大规模分子模拟中显示出很高的实用性。
摘要:Mechanistic understanding and rational design of complex chemical systems depend on fast and accurate predictions of electronic structures beyond individual building blocks. However, if the system exceeds hundreds of atoms, first-principles quantum mechanical (QM) modeling becomes impractical. In this study, we developed FB-GNN-MBE by integrating a fragment-based graph neural network (FB-GNN) into the many-body expansion (MBE) theory and demonstrated its capacity to reproduce first-principles potential energy surfaces (PES) for hierarchically structured systems with manageable accuracy, complexity, and interpretability. Specifically, we divided the entire system into basic building blocks (fragments), evaluated their one-fragment energies using a QM model, and addressed many-fragment interactions using the structure-property relationships trained by FB-GNNs. Our investigation shows that FB-GNN-MBE achieves chemical accuracy in predicting two-body (2B) and three-body (3B) energies across water, phenol, and mixture benchmarks, as well as the one-dimensional dissociation curves of water and phenol dimers. To transfer the success of FB-GNN-MBE across various systems with minimal computational costs and data demands, we developed and validated a teacher-student learning protocol. A heavy-weight FB-GNN trained on a mixed-density water cluster ensemble (teacher) distills its learned knowledge and passes it to a light-weight GNN (student), which is later fine-tuned on a uniform-density (H2O)21 cluster ensemble. This transfer learning strategy resulted in efficient and accurate prediction of 2B and 3B energies for variously sized water clusters without retraining. Our transferable FB-GNN-MBE framework outperformed conventional non-FB-GNN-based models and showed high practicality for large-scale molecular simulations.
强化学习(8篇)
【1】SafeAdapt: Provably Safe Policy Updates in Deep Reinforcement Learning
标题:SafeAdapt:深度强化学习中可证明安全的策略更新
链接:https://arxiv.org/abs/2604.09452
作者:Maksim Anisimov,Francesco Belardinelli,Matthew Wicker
备注:Code available at: https://github.com/maxanisimov/provably-safe-policy-updates
摘要:安全保证是在安全关键任务中部署强化学习(RL)智能体的先决条件。通常,部署环境表现出非平稳动态或受到不断变化的性能目标的影响,需要更新已学习的策略。这导致了一个根本性的挑战:如何更新RL策略,同时在以前遇到的任务上保持其安全属性?目前的大多数方法要么不提供形式化保证,要么只在事后验证策略的安全性。我们提出了一种新的先验方法,通过引入罗生门集(Rashomon set)来实现持续强化学习中的安全策略更新:罗生门集是策略参数空间中的一个区域,经认证可在演示数据分布内满足安全约束。然后,我们表明,通过将更新投影到罗生门集上,可以为用于更新策略的任意RL算法提供形式化的、可证明的保证。在实验上,我们在网格世界导航环境(Frozen Lake和Poisoned Apple)中验证了这种方法,在下游适应期间先验地保证源任务上可证明的确定性安全。相比之下,我们观察到基于正则化的基线会灾难性地遗忘安全约束,而我们的方法能够在可证明地保持安全性的同时实现强大的适应。
摘要:Safety guarantees are a prerequisite to the deployment of reinforcement learning (RL) agents in safety-critical tasks. Often, deployment environments exhibit non-stationary dynamics or are subject to changing performance goals, requiring updates to the learned policy. This leads to a fundamental challenge: how to update an RL policy while preserving its safety properties on previously encountered tasks? The majority of current approaches either do not provide formal guarantees or verify policy safety only a posteriori. We propose a novel a priori approach to safe policy updates in continual RL by introducing the Rashomon set: a region in policy parameter space certified to meet safety constraints within the demonstration data distribution. We then show that one can provide formal, provable guarantees for arbitrary RL algorithms used to update a policy by projecting their updates onto the Rashomon set. Empirically, we validate this approach across grid-world navigation environments (Frozen Lake and Poisoned Apple) where we guarantee an a priori provably deterministic safety on the source task during downstream adaptation. In contrast, we observe that regularisation-based baselines experience catastrophic forgetting of safety constraints while our approach enables strong adaptation with provable guarantees that safety is preserved.
【2】On the Role of DAG topology in Energy-Aware Cloud Scheduling : A GNN-Based Deep Reinforcement Learning Approach
标题:关于DAG拓扑在能源感知云调度中的作用:一种基于GNN的深度强化学习方法
链接:https://arxiv.org/abs/2604.09202
作者:Anas Hattay,Fred Ngole Mboula,Eric Gascard,Zakaria Yahoun
摘要:云提供商必须将异构计算资源分配给工作流DAG,同时平衡完成时间、成本和能耗等相互竞争的目标。在这项工作中,我们研究了一个单工作流、无队列的调度设置,并考虑了一个基于图神经网络(GNN)的深度强化学习调度器,旨在最大限度地减少工作流完成时间和能源使用。我们确定了基于GNN的深度强化学习调度器失败的特定分布外(OOD)条件,并对这些失败的原因提供了原则性的解释。通过受控的OOD评估,我们证明了性能下降源于训练和部署环境之间的结构不匹配,这会破坏消息传递并削弱策略泛化。我们的分析揭示了当前基于GNN的调度器的基本局限性,并强调需要更鲁棒的表示,以确保在分布偏移下的可靠调度性能。
摘要:Cloud providers must assign heterogeneous compute resources to workflow DAGs while balancing competing objectives such as completion time, cost, and energy consumption. In this work, we study a single-workflow, queue-free scheduling setting and consider a graph neural network (GNN)-based deep reinforcement learning scheduler designed to minimize workflow completion time and energy usage. We identify specific out-of-distribution (OOD) conditions under which GNN-based deep reinforcement learning schedulers fail and provide a principled explanation of why these failures occur. Through controlled OOD evaluations, we demonstrate that performance degradation stems from structural mismatches between training and deployment environments, which disrupt message passing and undermine policy generalization. Our analysis exposes fundamental limitations of current GNN-based schedulers and highlights the need for more robust representations to ensure reliable scheduling performance under distribution shifts.
【3】Truncated Rectified Flow Policy for Reinforcement Learning with One-Step Sampling
标题:用于一步采样强化学习的截断整流流策略
链接:https://arxiv.org/abs/2604.09159
作者:Xubin Zhou,Yipeng Yang,Zhan Li
摘要:最大熵强化学习(MaxEnt RL)已成为顺序决策的标准框架,但其标准高斯策略参数化本质上是单峰的,限制了其对复杂多模态动作分布建模的能力。这种限制促使越来越多的兴趣,在生成政策的基础上扩散和流匹配更有表现力的替代品。然而,将这些策略纳入MaxEnt RL具有挑战性,主要原因有两个:连续时间生成策略的似然性和熵通常是棘手的,多步采样引入了长时间反向传播不稳定性和大量的推理延迟。为了解决这些挑战,我们提出了截断整流政策(TRFP),一个框架建立在一个混合确定性随机架构。这种设计使得熵正则化优化易于处理,同时通过梯度截断和流矫直支持稳定的训练和有效的一步采样。在玩具多目标环境和10个MuJoCo基准测试上的实证结果表明,TRFP有效地捕捉了多模态行为,在标准采样下的大多数基准测试上优于强基线,并且在一步采样下仍然具有很强的竞争力。
摘要:Maximum entropy reinforcement learning (MaxEnt RL) has become a standard framework for sequential decision making, yet its standard Gaussian policy parameterization is inherently unimodal, limiting its ability to model complex multimodal action distributions. This limitation has motivated increasing interest in generative policies based on diffusion and flow matching as more expressive alternatives. However, incorporating such policies into MaxEnt RL is challenging for two main reasons: the likelihood and entropy of continuous-time generative policies are generally intractable, and multi-step sampling introduces both long-horizon backpropagation instability and substantial inference latency. To address these challenges, we propose Truncated Rectified Flow Policy (TRFP), a framework built on a hybrid deterministic-stochastic architecture. This design makes entropy-regularized optimization tractable while supporting stable training and effective one-step sampling through gradient truncation and flow straightening. Empirical results on a toy multigoal environment and 10 MuJoCo benchmarks show that TRFP captures multimodal behavior effectively, outperforms strong baselines on most benchmarks under standard sampling, and remains highly competitive under one-step sampling.
【4】Advantage-Guided Diffusion for Model-Based Reinforcement Learning
标题:基于模型的强化学习的优势引导扩散
链接:https://arxiv.org/abs/2604.09035
作者:Daniele Foffano,Arvid Eriksson,David Broman,Karl H. Johansson,Alexandre Proutiere
摘要:使用自回归世界模型的基于模型的强化学习(MBRL)会遭受复合误差,而扩散世界模型通过联合生成轨迹段来减轻这一问题。然而,现有的扩散引导要么仅基于策略,丢弃了价值信息,要么基于奖励,在扩散视野较短时会变得短视。我们引入了面向MBRL的优势引导扩散(AGD-MBRL),它使用智能体的优势估计来引导反向扩散过程,使采样集中在预期能在生成窗口之外产生更高长期回报的轨迹上。我们开发了两种引导:(i)Sigmoid优势引导(SAG)和(ii)指数优势引导(EAG)。我们证明,通过SAG或EAG引导的扩散模型允许我们对轨迹进行重新加权采样,其中权重随状态-动作优势的增加而增加,这意味着在标准假设下的策略改进。此外,我们表明,与无引导的扩散模型相比,AGD-MBRL生成的轨迹遵循改进的策略(即具有更高的价值)。AGD与PolyGRAD风格的架构无缝集成,它引导状态分量,同时保持动作生成以策略为条件,并且不需要改变扩散训练目标。在MuJoCo控制任务(HalfCheetah、Hopper、Walker2D和Reacher)上,AGD-MBRL相比PolyGRAD、在线Diffuser式奖励引导和无模型基线(PPO/TRPO)提高了样本效率和最终回报,在某些情况下提高了2倍。这些结果表明,在扩散模型MBRL中,优势感知引导是解决短视野短视问题的一种简单有效的方法。
摘要:Model-based reinforcement learning (MBRL) with autoregressive world models suffers from compounding errors, whereas diffusion world models mitigate this by generating trajectory segments jointly. However, existing diffusion guides are either policy-only, discarding value information, or reward-based, which becomes myopic when the diffusion horizon is short. We introduce Advantage-Guided Diffusion for MBRL (AGD-MBRL), which steers the reverse diffusion process using the agent's advantage estimates so that sampling concentrates on trajectories expected to yield higher long-term return beyond the generated window. We develop two guides: (i) Sigmoid Advantage Guidance (SAG) and (ii) Exponential Advantage Guidance (EAG). We prove that a diffusion model guided through SAG or EAG allows us to perform reweighted sampling of trajectories with weights increasing in state-action advantage, implying policy improvement under standard assumptions. Additionally, we show that the trajectories generated from AGD-MBRL follow an improved policy (that is, with higher value) compared to an unguided diffusion model. AGD integrates seamlessly with PolyGRAD-style architectures by guiding the state components while leaving action generation policy-conditioned, and requires no change to the diffusion training objective. On MuJoCo control tasks (HalfCheetah, Hopper, Walker2D and Reacher), AGD-MBRL improves sample efficiency and final return over PolyGRAD, an online Diffuser-style reward guide, and model-free baselines (PPO/TRPO), in some cases by a margin of 2x. These results show that advantage-aware guidance is a simple, effective remedy for short-horizon myopia in diffusion-model MBRL.
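To make "weights increasing in state-action advantage" concrete, here is a minimal sketch of sigmoid-shaped and exponential-shaped reweighting over a handful of sampled trajectories. The abstract does not give the exact form of SAG or EAG, so the two weight functions, the `temperature` parameter, and the normalization step are assumptions chosen only to show the monotone behavior.

```python
import math

def sigmoid_weight(advantage, temperature=1.0):
    """Bounded weight in (0, 1) that increases with the advantage."""
    return 1.0 / (1.0 + math.exp(-advantage / temperature))

def exponential_weight(advantage, temperature=1.0):
    """Unbounded weight that increases exponentially with the advantage."""
    return math.exp(advantage / temperature)

def normalize(weights):
    """Rescale weights so they sum to one (a sampling distribution)."""
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical advantage estimates for three sampled trajectories.
advantages = [-1.0, 0.0, 2.0]
sag_probs = normalize([sigmoid_weight(a) for a in advantages])
eag_probs = normalize([exponential_weight(a) for a in advantages])
```

In this toy setting both schemes favor the high-advantage trajectory, and the exponential form concentrates more probability mass on it than the bounded sigmoid form does.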
【5】Efficient Hierarchical Implicit Flow Q-learning for Offline Goal-conditioned Reinforcement Learning
标题:用于离线目标条件强化学习的高效分层隐流Q学习
链接:https://arxiv.org/abs/2604.08960
作者:Zhiqiang Dong,Teng Pang,Rongjian Xu,Guoqiang Wu
摘要:离线目标条件强化学习(Offline goal-conditioned reinforcement learning,GCRL)是一种实用的强化学习范式,旨在从无奖励的离线数据中学习目标条件策略。尽管最近在分层架构(如HIQL)方面取得了进展,但由于高斯策略的表达能力有限以及高级策略无法生成有效的子目标,离线GCRL中的长期控制仍然具有挑战性。为了解决这些局限性,我们提出了目标条件的平均流政策,它引入了一个平均速度场的分层离线GCRL的政策建模。具体而言,平均流策略通过学习的平均速度场捕获高级和低级策略的复杂目标分布,从而通过一步采样实现高效的动作生成。此外,考虑到目标表示的不足,我们引入了LeJEPA损失,在训练过程中排斥目标表示嵌入,从而鼓励更多的区分表示并提高泛化能力。实验结果表明,我们的方法在OGBench基准测试中实现了基于状态和基于像素的任务的强大性能。
摘要:Offline goal-conditioned reinforcement learning (GCRL) is a practical reinforcement learning paradigm that aims to learn goal-conditioned policies from reward-free offline data. Despite recent advances in hierarchical architectures such as HIQL, long-horizon control in offline GCRL remains challenging due to the limited expressiveness of Gaussian policies and the inability of high-level policies to generate effective subgoals. To address these limitations, we propose the goal-conditioned mean flow policy, which introduces an average velocity field into hierarchical policy modeling for offline GCRL. Specifically, the mean flow policy captures complex target distributions for both high-level and low-level policies through a learned average velocity field, enabling efficient action generation via one-step sampling. Furthermore, considering the insufficiency of goal representation, we introduce a LeJEPA loss that repels goal representation embeddings during training, thereby encouraging more discriminative representations and improving generalization. Experimental results show that our method achieves strong performance across both state-based and pixel-based tasks in the OGBench benchmark.
【6】Alleviating Community Fear in Disasters via Multi-Agent Actor-Critic Reinforcement Learning
标题:通过多Agent Actor-Critic强化学习减轻灾难中的社区恐惧
链接:https://arxiv.org/abs/2604.08802
作者:Yashodhan D. Hakke,Almuatazbellah M. Boker,Lamine Mili,Michael von Spakovsky,Hoda Eldardiry
备注:10 pages, 6 figures
摘要:在灾害期间,电网、通信网络和社会行为的连锁故障放大了社区的恐惧,破坏了合作。现有的网络-物理-社会(CPS)模型模拟这些耦合的动态,但缺乏主动干预的机制。我们扩展了Valinejad和Mili(2023)的CPS弹性模型,为三个机构(通信、电力和应急管理)提供控制通道,并将由此得到的系统表述为三人非零和微分博弈,通过在线actor-critic强化学习求解。基于飓风哈维数据的模拟显示,平均恐惧减少了70%,同时基础设施恢复得到改善;在飓风伊尔玛的案例上(未重新拟合)的交叉验证实现了50%的恐惧减少,证实了可推广性。
摘要:During disasters, cascading failures across power grids, communication networks, and social behavior amplify community fear and undermine cooperation. Existing cyber-physical-social (CPS) models simulate these coupled dynamics but lack mechanisms for active intervention. We extend the CPS resilience model of Valinejad and Mili (2023) with control channels for three agencies, communication, power, and emergency management, and formulate the resulting system as a three-player non-zero-sum differential game solved via online actor-critic reinforcement learning. Simulations based on Hurricane Harvey data show 70% mean fear reduction with improved infrastructure recovery; cross-validation in the case of Hurricane Irma (without refitting) achieves 50% fear reduction, confirming generalizability.
【7】Wireless Communication Enhanced Value Decomposition for Multi-Agent Reinforcement Learning
标题:无线通信增强的多Agent强化学习值分解
链接:https://arxiv.org/abs/2604.08728
作者:Diyi Hu,Bhaskar Krishnamachari
摘要:多智能体强化学习(MARL)中的合作受益于智能体间的通信,但大多数方法都假设理想化的通道,现有的价值分解方法忽略了谁成功地与谁共享信息。我们提出了CLOVER,一个合作的MARL框架,其集中值混频器的条件下实现的通信图在一个现实的无线信道。该图将关系归纳偏差引入到价值分解中,限制了基于实现的通信结构的各个实用程序的混合方式。混合器是由置换等变超网络生成的具有节点特定权重的GNN:沿着通信边缘的多跳传播重塑信用分配,使得不同的拓扑结构引起不同的混合。我们证明了这个混合器是置换不变的,单调的(保持IGM条件),严格比QMIX风格的混合器更有表现力。为了处理现实的渠道,我们制定了一个增强的MDP隔离随机通道效应从代理计算图,并采用随机感受野编码器可变大小的消息集,使端到端的可区分的训练。在p-CSMA无线信道下的Predator-Prey和Lumberjacks基准测试中,CLOVER始终提高了VDN、QMIX、TarMAC+VDN和TarMAC+QMIX的收敛速度和最终性能。行为分析证实,代理学习自适应信号和倾听策略,消融隔离的通信图归纳偏差的关键来源的改进。
摘要:Cooperation in multi-agent reinforcement learning (MARL) benefits from inter-agent communication, yet most approaches assume idealized channels and existing value decomposition methods ignore who successfully shared information with whom. We propose CLOVER, a cooperative MARL framework whose centralized value mixer is conditioned on the communication graph realized under a realistic wireless channel. This graph introduces a relational inductive bias into value decomposition, constraining how individual utilities are mixed based on the realized communication structure. The mixer is a GNN with node-specific weights generated by a Permutation-Equivariant Hypernetwork: multi-hop propagation along communication edges reshapes credit assignment so that different topologies induce different mixing. We prove this mixer is permutation invariant, monotonic (preserving the IGM condition), and strictly more expressive than QMIX-style mixers. To handle realistic channels, we formulate an augmented MDP isolating stochastic channel effects from the agent computation graph, and employ a stochastic receptive field encoder for variable-size message sets, enabling end-to-end differentiable training. On Predator-Prey and Lumberjacks benchmarks under p-CSMA wireless channels, CLOVER consistently improves convergence speed and final performance over VDN, QMIX, TarMAC+VDN, and TarMAC+QMIX. Behavioral analysis confirms agents learn adaptive signaling and listening strategies, and ablations isolate the communication-graph inductive bias as the key source of improvement.
【8】StructRL: Recovering Dynamic Programming Structure from Learning Dynamics in Distributional Reinforcement Learning
标题:StructRL:从分布强化学习的学习动态中恢复动态规划结构
链接:https://arxiv.org/abs/2604.08620
作者:Ivo Nowak
摘要:强化学习通常被视为一个统一的、数据驱动的优化过程,其中更新由奖励和时间差分误差指导,而不显式地利用全局结构。相比之下,动态规划方法依赖于结构化的信息传播,从而实现高效和稳定的学习。在本文中,我们提供的证据表明,这种结构可以从分布强化学习的学习动态中恢复。通过分析回报分布的时间演化,我们识别出捕获学习在状态空间中何时何地发生的信号。特别是,我们引入了一个时间学习指标t*(s),它反映了一个状态在训练过程中何时经历了最强的学习更新。从经验上讲,这个信号会导出一种与动态规划式的信息传播相一致的状态排序。基于这一观察,我们提出了StructRL,这是一个利用这些信号来引导采样、使其与逐渐显现的传播结构保持一致的框架。我们的初步结果表明,分布学习动态提供了一种无需显式模型即可恢复和利用类动态规划结构的机制。这为强化学习提供了一个新的视角,学习可以被解释为一个结构化的传播过程,而不是一个纯粹的统一优化过程。
摘要:Reinforcement learning is typically treated as a uniform, data-driven optimization process, where updates are guided by rewards and temporal-difference errors without explicitly exploiting global structure. In contrast, dynamic programming methods rely on structured information propagation, enabling efficient and stable learning. In this paper, we provide evidence that such structure can be recovered from the learning dynamics of distributional reinforcement learning. By analyzing the temporal evolution of return distributions, we identify signals that capture when and where learning occurs in the state space. In particular, we introduce a temporal learning indicator t*(s) that reflects when a state undergoes its strongest learning update during training. Empirically, this signal induces an ordering over states that is consistent with a dynamic programming-style propagation of information. Building on this observation, we propose StructRL, a framework that exploits these signals to guide sampling in alignment with the emerging propagation structure. Our preliminary results suggest that distributional learning dynamics provide a mechanism to recover and exploit dynamic programming-like structure without requiring an explicit model. This offers a new perspective on reinforcement learning, where learning can be interpreted as a structured propagation process rather than a purely uniform optimization procedure.
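One way to picture the temporal learning indicator t*(s) is as the training step at which a state's return distribution changed the most. The sketch below assumes each return distribution is stored as a short list of quantiles and measures change by mean absolute difference between consecutive snapshots; both choices, and all names, are illustrative assumptions rather than the paper's definition.

```python
def update_magnitude(prev_quantiles, curr_quantiles):
    """Mean absolute change between two quantile vectors of equal length."""
    diffs = [abs(c - p) for p, c in zip(prev_quantiles, curr_quantiles)]
    return sum(diffs) / len(diffs)

def temporal_learning_indicator(snapshots):
    """For each state, return the step t >= 1 with the largest update.

    snapshots[t][s] holds the quantiles of the return distribution of
    state s after training step t (snapshots[0] is the initialization).
    """
    n_states = len(snapshots[0])
    t_star = []
    for s in range(n_states):
        changes = [
            update_magnitude(snapshots[t - 1][s], snapshots[t][s])
            for t in range(1, len(snapshots))
        ]
        best = max(range(len(changes)), key=changes.__getitem__)
        t_star.append(best + 1)
    return t_star

# Toy 3-state chain: state 2 is next to the reward, state 0 is farthest.
snapshots = [
    [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]],
    [[0.0, 0.0], [0.0, 0.0], [1.0, 1.0]],
    [[0.0, 0.0], [0.9, 0.9], [1.0, 1.0]],
    [[0.8, 0.8], [0.9, 0.9], [1.0, 1.0]],
]
t_star = temporal_learning_indicator(snapshots)
# Sorting states by t* gives the order in which value information arrived.
propagation_order = sorted(range(len(t_star)), key=t_star.__getitem__)
```

In this hand-built example the ordering runs from the state nearest the reward back toward the start, which is the backward, dynamic-programming-like sweep the abstract alludes to.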
蒸馏|知识提取(3篇)
【1】Automated Batch Distillation Process Simulation for a Large Hybrid Dataset for Deep Anomaly Detection
标题:面向深度异常检测大型混合数据集的自动化间歇蒸馏过程模拟
链接:https://arxiv.org/abs/2604.09166
作者:Jennifer Werner,Justus Arweiler,Indra Jungjohann,Jochen Schmid,Fabian Jirasek,Hans Hasse,Michael Bortz
摘要:基于深度学习的化学过程中的异常检测(AD)提供了重要的机会,但需要大量、多样化和注释良好的训练数据集,而这些数据集很少能从工业操作中获得。在最近的一项工作中,我们介绍了一个大型的,充分注释的实验数据集在正常和异常操作条件下的间歇蒸馏。在本研究中,我们用相应的模拟数据集来增强这个数据集,创建一个新的混合数据集。模拟数据是在一个自动化的工作流程中生成的,该工作流程使用了一个基于Python的过程模拟器,该过程模拟器采用了针对底层微分代数方程的定制索引缩减策略。利用实验数据库丰富的元数据和结构化的异常注释,实验记录自动转换为模拟场景。在对单个参考实验进行校准之后,可以很好地预测其他实验的动力学。这使得能够为大量的实验运行完全自动地、一致地生成时间序列数据,包括正常操作和各种致动器和控制相关的异常。由此产生的混合数据集公开发布。从过程模拟的角度来看,这项工作演示了自动化,一致的模拟大规模的实验活动,使用间歇蒸馏作为一个例子。从数据驱动的AD角度来看,混合数据集为模拟到实验风格的转换,伪实验数据的生成以及未来对化学过程监控中的深度AD方法的研究提供了独特的基础。
摘要:Anomaly detection (AD) in chemical processes based on deep learning offers significant opportunities but requires large, diverse, and well-annotated training datasets that are rarely available from industrial operations. In a recent work, we introduced a large, fully annotated experimental dataset for batch distillation under normal and anomalous operating conditions. In the present study, we augment this dataset with a corresponding simulation dataset, creating a novel hybrid dataset. The simulation data is generated in an automated workflow with a novel Python-based process simulator that employs a tailored index-reduction strategy for the underlying differential-algebraic equations. Leveraging the rich metadata and structured anomaly annotations of the experimental database, experimental records are automatically translated into simulation scenarios. After calibration to a single reference experiment, the dynamics of the other experiments are well predicted. This enabled the fully automated, consistent generation of time-series data for a large number of experimental runs, covering both normal operation and a wide range of actuator- and control-related anomalies. The resulting hybrid dataset is released openly. From a process simulation perspective, this work demonstrates the automated, consistent simulation of large-scale experimental campaigns, using batch distillation as an example. From a data-driven AD perspective, the hybrid dataset provides a unique basis for simulation-to-experiment style transfer, the generation of pseudo-experimental data, and future research on deep AD methods in chemical process monitoring.
【2】Revisiting the Capacity Gap in Chain-of-Thought Distillation from a Practical Perspective
标题:从实践角度重新审视思想链蒸馏的能力差距
链接:https://arxiv.org/abs/2604.08880
作者:Tokio Kajitsuka,Ukyo Honda,Sho Takase
备注:19 pages, 6 figures
摘要:思想链(CoT)蒸馏转移推理行为从一个强大的教师到一个较小的学生,但以前的工作报告的能力差距:蒸馏可能会失败时,教师和学生的能力不匹配很大。我们重新审视的能力差距,从实用的角度来看,重新检查常用的实验设置。值得注意的是,我们发现,CoT蒸馏往往会降低性能相比,学生的蒸馏前的基线,一个问题掩盖时,只有蒸馏后的比较报告。因此,我们提出了一个更现实的评估协议,并发现能力差距效应的影响并不总是占主导地位的任务和设置,特别是当候选教师的表现差异很大。我们的研究结果为CoT蒸馏中师生对的选择提供了实际指导。
摘要:Chain-of-thought (CoT) distillation transfers reasoning behaviors from a strong teacher to a smaller student, but prior work reports a capacity gap: distillation may fail when the teacher-student capability mismatch is large. We revisit the capacity gap from a practical perspective by re-examining commonly used experimental settings. Notably, we find that CoT distillation often degrades performance compared to the student's pre-distillation baseline, an issue obscured when only post-distillation comparisons are reported. We therefore propose a more realistic evaluation protocol and find that the impact of capacity gap effects does not consistently dominate across tasks and settings, especially when candidate teachers differ substantially in performance. Our results offer practical guidance for selecting teacher-student pairs in CoT distillation.
【3】Distilling Genomic Models for Efficient mRNA Representation Learning via Embedding Matching
标题:通过嵌入匹配提取基因组模型以进行高效mRNA表示学习
链接:https://arxiv.org/abs/2604.08574
作者:Rasched Haidari,Sam Martin,Maxime Allard
备注:Accepted at the Tiny Papers Track for the Machine Learning for Genomics Explorations Workshop at ICLR 2026 and the Gen2 Workshop at ICLR 2026
摘要:大型基因组基础模型最近取得了显著的成果和体内转化能力。然而,这些模型很快就会增长到超过几十亿个参数,并且在计算资源有限的情况下运行成本很高。为了克服这一挑战,我们提出了一个蒸馏框架,用于将mRNA表示从最先进的基因组基础模型转移到专门用于mRNA序列的小得多的模型中,将大小减少200倍。嵌入级蒸馏比基于logit的方法效果更好,我们发现后者不稳定。在mRNA-bench上的基准测试表明,蒸馏模型在规模相当的模型中实现了最先进的性能,并在mRNA相关任务上可与更大的架构竞争。我们的研究结果强调了基于嵌入的mRNA序列蒸馏是生物基础模型的一种有效训练策略。这使得基因组学中类似的高效、可扩展的序列建模成为可能,特别是当大型模型在计算上具有挑战性或不可行时。
摘要:Large Genomic Foundation Models have recently achieved remarkable results and in-vivo translation capabilities. However, these models quickly grow to over a few billion parameters and are expensive to run when compute is limited. To overcome this challenge, we present a distillation framework for transferring mRNA representations from a state-of-the-art genomic foundation model into a much smaller model specialized for mRNA sequences, reducing the size by 200-fold. Embedding-level distillation worked better than logit-based methods, which we found unstable. Benchmarking on mRNA-bench demonstrates that the distilled model achieves state-of-the-art performance among models of comparable size and competes with larger architectures for mRNA-related tasks. Our results highlight embedding-based distillation of mRNA sequences as an effective training strategy for biological foundation models. This enables similar efficient and scalable sequence modelling in genomics, particularly when large models are computationally challenging or infeasible.
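Embedding-level distillation is commonly set up as a regression from the student's embedding onto the teacher's. The sketch below shows one plausible form: a linear projection lifts the student embedding to the teacher's dimension, and the loss is the mean squared error between the two. The projection matrix, the choice of MSE, and the function names are assumptions for illustration; the abstract does not specify the paper's actual loss.

```python
def project(vector, matrix):
    """Apply a linear map given as a list of rows (out_dim x in_dim)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def mean_squared_error(a, b):
    """Average squared difference between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def embedding_matching_loss(student_emb, teacher_emb, projection):
    """MSE between the projected student embedding and the teacher embedding."""
    return mean_squared_error(project(student_emb, projection), teacher_emb)

# Hypothetical 2-d student embedding and 3-d teacher embedding.
student_emb = [1.0, 2.0]
teacher_emb = [1.0, 2.0, 5.0]
projection = [
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
]
loss = embedding_matching_loss(student_emb, teacher_emb, projection)
```

In practice the projection and the student would be trained jointly to minimize this loss over many sequences; the fixed matrix here only keeps the arithmetic easy to follow.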
推荐(1篇)
【1】Creator Incentives in Recommender Systems: A Cooperative Game-Theoretic Approach for Stable and Fair Collaboration in Multi-Agent Bandits
标题:推荐系统中的创作者激励:多智能体老虎机中稳定和公平合作的合作博弈论方法
链接:https://arxiv.org/abs/2604.08643
作者:Ramakrishnan Krishnamurthy,Arpit Agarwal,Lakshminarayanan Subramanian,Maximilian Nickel
备注:Accepted in AISTATS 2026 as an Oral Presentation
摘要:在线推荐平台中的用户交互在内容创作者之间产生了相互依赖性:对一个创作者内容的反馈影响系统的学习,进而影响其他创作者内容的曝光。为了分析这种情况下的激励,我们将协作建模为多智能体随机线性老虎机问题,并采用可转移效用(TU)合作博弈形式,其中联盟的价值等于其成员累积遗憾之和的负值。我们证明,对于具有固定动作集的相同(同质)智能体,所诱导的TU博弈在温和的算法条件下是凸的,这意味着存在一个包含Shapley值的非空核心,并同时确保稳定性和公平性。对于异质智能体,该博弈仍然存在非空核心,但凸性和Shapley值属于核心不再得到保证。为了解决这个问题,我们提出了一个简单的基于遗憾的支付规则,它满足四个Shapley公理中的三个,并且位于核心中。MovieLens-100k数据集上的实验说明了在不同设置和算法下,经验支付何时与Shapley公平性一致、何时与之偏离。
摘要:User interactions in online recommendation platforms create interdependencies among content creators: feedback on one creator's content influences the system's learning and, in turn, the exposure of other creators' contents. To analyze incentives in such settings, we model collaboration as a multi-agent stochastic linear bandit problem with a transferable utility (TU) cooperative game formulation, where a coalition's value equals the negative sum of its members' cumulative regrets. We show that, for identical (homogeneous) agents with fixed action sets, the induced TU game is convex under mild algorithmic conditions, implying a non-empty core that contains the Shapley value and ensures both stability and fairness. For heterogeneous agents, the game still admits a non-empty core, though convexity and Shapley value core-membership are no longer guaranteed. To address this, we propose a simple regret-based payout rule that satisfies three out of the four Shapley axioms and also lies in the core. Experiments on the MovieLens-100k dataset illustrate when the empirical payout aligns with -- and diverges from -- the Shapley fairness across different settings and algorithms.
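The claim that a convex TU game has a core containing the Shapley value can be checked by brute force on a small example. The sketch below computes exact Shapley values by averaging marginal contributions over all player orderings and then tests the core inequalities. The three-player game and its coalition values (read as negative summed regrets) are made-up numbers chosen to be convex; they are not taken from the paper.

```python
from itertools import combinations, permutations

def shapley_values(players, value):
    """Exact Shapley values: average marginal contribution over all orderings."""
    totals = {p: 0.0 for p in players}
    orders = list(permutations(players))
    for order in orders:
        coalition = frozenset()
        for p in order:
            totals[p] += value[coalition | {p}] - value[coalition]
            coalition = coalition | {p}
    return {p: t / len(orders) for p, t in totals.items()}

def in_core(players, value, payoff, tol=1e-9):
    """Check efficiency and that no proper coalition gains by deviating."""
    if abs(sum(payoff.values()) - value[frozenset(players)]) > tol:
        return False
    for size in range(1, len(players)):
        for subset in combinations(players, size):
            if sum(payoff[p] for p in subset) < value[frozenset(subset)] - tol:
                return False
    return True

players = (0, 1, 2)
# Hypothetical coalition values: negative cumulative regret of each coalition.
value = {
    frozenset(): 0.0,
    frozenset({0}): -6.0, frozenset({1}): -4.0, frozenset({2}): -2.0,
    frozenset({0, 1}): -7.0, frozenset({0, 2}): -6.0, frozenset({1, 2}): -5.0,
    frozenset({0, 1, 2}): -7.0,
}
phi = shapley_values(players, value)
stable = in_core(players, value, phi)
```

Enumerating all orderings costs factorial time, so this is only practical for a handful of agents; larger games would need sampling-based estimates.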
自动驾驶|车辆|车道检测等(1篇)
【1】The causal relation between off-street parking and electric vehicle adoption in Scotland
标题:苏格兰街边停车与电动汽车采用之间的因果关系
链接:https://arxiv.org/abs/2604.09271
作者:Bernardino D'Amico,Achille Fonzone,Emma Hart
摘要:向电动交通的过渡取决于最大限度地提高总采用率,同时促进公平获取。本研究探讨了有和没有路外停车位的家庭之间的“充电鸿沟”是反映了真正的基础设施限制,还是社会经济差距的副产品。超越传统的预测模型,我们将概率因果框架应用于具有全国代表性的苏格兰家庭数据集,从而能够估计政策干预的效果,同时明确消除其他因果因素的混杂效应。研究结果揭示了电动汽车采用过程中的结构层次。私人路外停车位起到转化催化剂的作用:实现家庭充电将电动汽车拥有概率从3.3%提高到5.6%(相对增长70%,绝对增长2.3个百分点)。然而,这种影响主要是加速已经在经济上有能力购买电动汽车的家庭,而不是招募新的进入者。相比之下,家庭收入是根本的负担能力上限。低收入和高收入阶层之间的因果对比显示,市场不参与率下降了23.1个百分点,将财务能力确定为进入电动汽车过渡漏斗的主要守门人。至关重要的是,分析表明,标准的观测模型夸大了路外停车基础设施的孤立影响。这种表面上的影响来自选择偏差:高收入家庭同时拥有私人停车位和购买电动汽车能力的可能性不成比例地高。这些发现支持双轨政策战略:通过金融工具降低非参与者的负担能力上限,同时解决高密度城市环境中“潜在意图”人群的电动汽车家庭充电接入问题。
摘要
:The transition to electric mobility hinges on maximising aggregate adoption while also facilitating equitable access. This study examines whether the 'charging divide' between households with and without off-street parking reflects a genuine infrastructure constraint or a by-product of socio-economic disparity. Moving beyond conventional predictive models, we apply a probabilistic causal framework to a nationally representative dataset of Scottish households, enabling estimation of policy interventions while explicitly neutralising the confounding effect of other causal factors. The results reveal a structural hierarchy in the EV adoption process. Private off-street parking functions as a conversion catalyst: enabling access to home-charging increases the probability of EV ownership from 3.3% to 5.6% (a 70% relative, 2.3 percentage point absolute increase). However, this effect primarily accelerates households already economically positioned to purchase an EV rather than recruiting new entrants. By contrast, household income operates as the fundamental affordability ceiling. A causal contrast between lower- and higher-income strata, shows a reduction in market non-participation by 23.1 percentage points, identifying financial capacity as the principal gatekeeper to entering the EV transition funnel. Crucially, the analysis demonstrates that standard observational models overstate the isolated effect of off-street parking infrastructure. The apparent effect emerges from selection bias: higher-income households are disproportionately likely to possess both private parking and the means to purchase EVs. These findings support a dual-track policy strategy: lowering the affordability ceiling for non-participants through financial instruments, while addressing EV home-charging access for the 'latent intent' cohort in high-density urban contexts.
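The abstract quotes the same effect both as a relative change and as a percentage-point change, which are easy to confuse. As a small worked check (the helper name `adoption_lift` is ours, not the paper's), the two figures follow directly from the reported ownership probabilities of 3.3% and 5.6%:

```python
def adoption_lift(p_before, p_after):
    """Return (absolute change in percentage points, relative change in percent)."""
    absolute_pp = (p_after - p_before) * 100.0
    relative_pct = (p_after - p_before) / p_before * 100.0
    return absolute_pp, relative_pct

# Ownership probabilities quoted in the abstract: 3.3% -> 5.6%.
abs_pp, rel_pct = adoption_lift(0.033, 0.056)
print(f"absolute: {abs_pp:.1f} pp, relative: {rel_pct:.0f}%")
```

The relative figure is large only because the baseline is small; the absolute change of about 2.3 percentage points is the quantity that translates into additional EV-owning households.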
推理|分析|理解|解释(10篇)
【1】PhysInOne: Visual Physics Learning and Reasoning in One Suite
标题:PhysInOne:视觉物理学习和推理一体化
链接:https://arxiv.org/abs/2604.09415
作者:Siyuan Zhou,Hejun Wang,Hu Cheng,Jinxi Li,Dongsheng Wang,Junwei Jiang,Yixiao Jin,Jiayue Huang,Shiwei Mao,Shangjia Liu,Yafei Yang,Hongkang Song,Shenxing Wei,Zihui Zhang,Peng Huang,Shijie Liu,Zhengli Hao,Hao Li,Yitian Li,Wenqi Zhou,Zhihan Zhao,Zongqi He,Hongtao Wen,Shouwang Huang,Peng Yun,Bowen Cheng,Pok Kazaf Fu,Wai Kit Lai,Jiahao Chen,Kaiyuan Wang,Zhixuan Sun,Ziqi Li,Haochen Hu,Di Zhang,Chun Ho Yuen,Bing Wang,Zhihua Wang,Chuhang Zou,Bo Yang
备注:CVPR 2026. Siyuan, Hejun, Hu, Jinxi, Dongsheng, Junwei, Yixiao, Jiayue, and Shiwei are co-first authors. Project page: https://vlar-group.github.io/PhysInOne.html
摘要:我们提出了PhysInOne,这是一个大规模的合成数据集,旨在解决人工智能系统缺乏物理基础训练数据的严重问题。与仅限于数百或数千个示例的现有数据集不同,PhysInOne提供了覆盖153,810个动态3D场景的200万个视频,涵盖力学、光学、流体动力学和磁学中的71种基本物理现象。与以往工作不同,我们的场景具有复杂背景下的多物体交互,并带有全面的真值标注,包括3D几何、语义、动态运动、物理属性和文本描述。我们展示了PhysInOne在四个新兴应用中的效果:物理感知视频生成、长期/短期未来帧预测、物理属性估计和运动迁移。实验表明,在PhysInOne上对基础模型进行微调可以显著提高物理合理性,同时也暴露了在建模复杂物理动力学和估计内在属性方面的关键差距。作为同类中最大的数据集,其规模超出以往工作数个数量级,PhysInOne为推进生成、仿真和具身AI中基于物理的世界模型建立了新的基准。
摘要:We present PhysInOne, a large-scale synthetic dataset addressing the critical scarcity of physically-grounded training data for AI systems. Unlike existing datasets limited to merely hundreds or thousands of examples, PhysInOne provides 2 million videos across 153,810 dynamic 3D scenes, covering 71 basic physical phenomena in mechanics, optics, fluid dynamics, and magnetism. Distinct from previous works, our scenes feature multiobject interactions against complex backgrounds, with comprehensive ground-truth annotations including 3D geometry, semantics, dynamic motion, physical properties, and text descriptions. We demonstrate PhysInOne's efficacy across four emerging applications: physics-aware video generation, long-/short-term future frame prediction, physical property estimation, and motion transfer. Experiments show that fine-tuning foundation models on PhysInOne significantly enhances physical plausibility, while also exposing critical gaps in modeling complex physical dynamics and estimating intrinsic properties. As the largest dataset of its kind, orders of magnitude beyond prior works, PhysInOne establishes a new benchmark for advancing physics-grounded world models in generation, simulation, and embodied AI.
【2】FIRE-CIR: Fine-grained Reasoning for Composed Fashion Image Retrieval
标题:FIRE-CIR:合成时尚图像检索的细粒度推理
链接:https://arxiv.org/abs/2604.09114
作者:François Gardères,Camille-Sovanneary Gauthier,Jean Ponce,Shizhe Chen
摘要:合成图像检索(CIR)的目的是检索一个目标图像,描述了一个参考图像修改的文本描述。虽然最近的视觉语言模型(VLM)通过将图像和文本嵌入到共享空间中进行检索来实现有希望的CIR性能,但它们通常无法推理保留什么和更改什么。这种限制阻碍了可解释性,并产生次优的结果,特别是在细粒度的领域,如时尚。在本文中,我们介绍了FIRE-CIR,一个模型,使组成推理和解释时尚CIR。FIRE-CIR不是仅仅依赖于嵌入相似性,而是执行问题驱动的视觉推理:它自动生成来自修改文本的以属性为中心的视觉问题,并在参考和候选图像中验证相应的视觉证据。为了训练这样一个推理系统,我们自动构建了一个大规模的时尚特定的视觉问答数据集,包含的问题需要单或双图像分析。在检索过程中,我们的模型利用这种显式推理来重新排列候选结果,过滤掉与预期修改不一致的图像。在Fashion IQ基准测试上的实验结果表明,FIRE-CIR在检索准确率方面优于最先进的方法。它还为检索决策提供了可解释的属性级见解。
摘要:Composed image retrieval (CIR) aims to retrieve a target image that depicts a reference image modified by a textual description. While recent vision-language models (VLMs) achieve promising CIR performance by embedding images and text into a shared space for retrieval, they often fail to reason about what to preserve and what to change. This limitation hinders interpretability and yields suboptimal results, particularly in fine-grained domains like fashion. In this paper, we introduce FIRE-CIR, a model that brings compositional reasoning and interpretability to fashion CIR. Instead of relying solely on embedding similarity, FIRE-CIR performs question-driven visual reasoning: it automatically generates attribute-focused visual questions derived from the modification text, and verifies the corresponding visual evidence in both reference and candidate images. To train such a reasoning system, we automatically construct a large-scale fashion-specific visual question answering dataset, containing questions requiring either single- or dual-image analysis. During retrieval, our model leverages this explicit reasoning to re-rank candidate results, filtering out images inconsistent with the intended modifications. Experimental results on the Fashion IQ benchmark show that FIRE-CIR outperforms state-of-the-art methods in retrieval accuracy. It also provides interpretable, attribute-level insights into retrieval decisions.
【3】NyayaMind- A Framework for Transparent Legal Reasoning and Judgment Prediction in the Indian Legal System
标题:NyayaMind-印度法律体系中透明法律推理和判断预测的框架
链接:https://arxiv.org/abs/2604.09069
作者:Parjanya Aditya Shukla,Shubham Kumar Nigam,Debtanu Datta,Balaramamahanthi Deepak Patnaik,Noel Shallum,Pradeep Reddy Vanga,Saptarshi Ghosh,Arnab Bhattacharya
摘要:法院判决预测和解释(CJPE)旨在预测司法判决,并根据事实,法律问题,论点,引用的法规和相关判例为给定案件提供有法律依据的解释。为了使这些系统在司法或法律研究环境中切实有用,它们不仅必须实现高预测性能,而且还必须生成与既定司法实践相一致的透明和结构化的法律推理。在这项工作中,我们介绍了NyayaMind,一个开源框架,旨在为印度司法部门提供透明和可扩展的法律推理。该框架集成了检索,推理和验证机制,以模拟结构化的决策过程中通常遵循的法院。具体来说,NyayaMind由两个主要组件组成:检索模块和预测模块。检索模块采用RAG管道从大型法律语料库中识别法律相关的法规和判例,而预测模块则利用针对印度法律领域进行微调的面向推理的LLM来生成结构化输出,包括问题,论点,理由和最终决定。我们广泛的结果和专家评估表明,与现有的CJPE方法相比,NyayaMind显著提高了解释和证据对齐的质量,为可信的人工智能辅助法律决策支持系统迈出了有希望的一步。
摘要:Court Judgment Prediction and Explanation (CJPE) aims to predict a judicial decision and provide a legally grounded explanation for a given case based on the facts, legal issues, arguments, cited statutes, and relevant precedents. For such systems to be practically useful in judicial or legal research settings, they must not only achieve high predictive performance but also generate transparent and structured legal reasoning that aligns with established judicial practices. In this work, we present NyayaMind, an open-source framework designed to enable transparent and scalable legal reasoning for the Indian judiciary. The proposed framework integrates retrieval, reasoning, and verification mechanisms to emulate the structured decision-making process typically followed in courts. Specifically, NyayaMind consists of two main components: a Retrieval Module and a Prediction Module. The Retrieval Module employs a RAG pipeline to identify legally relevant statutes and precedent cases from large-scale legal corpora, while the Prediction Module utilizes reasoning-oriented LLMs fine-tuned for the Indian legal domain to generate structured outputs including issues, arguments, rationale, and the final decision. Our extensive results and expert evaluation demonstrate that NyayaMind significantly improves the quality of explanation and evidence alignment compared to existing CJPE approaches, providing a promising step toward trustworthy AI-assisted legal decision support systems.
【4】Bridging SFT and RL: Dynamic Policy Optimization for Robust Reasoning
标题:桥接SFT和RL:鲁棒推理的动态策略优化
链接:https://arxiv.org/abs/2604.08926
作者:Taojie Zhu,Dongyang Xu,Ding Zou,Sen Zhao,Qiaobo Hao,Zhiguo Yang,Yonghong He
备注:ACL 2026 findings
摘要:大型语言模型(LLM)的后训练范式,主要是监督微调(SFT)和强化学习(RL),面临着一个根本的困境:SFT提供稳定性(低方差),但存在较高的拟合偏差,而RL能够进行探索(低偏差),但受困于较高的梯度方差。现有的统一优化策略通常采用朴素的损失加权,忽略了这些不同梯度信号之间的统计冲突。在本文中,我们对这种偏差-方差权衡进行了严格的理论分析,并提出DYPO(动态策略优化),一个旨在从结构上缓解这种冲突的统一框架。DYPO集成了三个核心组件:(1)组对齐损失(GAL),利用内在的组动态显著降低RL梯度方差;(2)多教师蒸馏机制,通过多样的推理路径纠正SFT拟合偏差;(3)动态利用-探索门控机制,根据奖励反馈在稳定的SFT和探索性的RL之间进行自适应仲裁。理论分析证实,DYPO线性地降低拟合偏差并最小化总体方差。大量实验表明,DYPO的性能明显优于传统的顺序流水线,在复杂推理基准测试上平均提高了4.8%,在分布外任务上平均提高了13.3%。我们的代码可在https://github.com/Tocci-Zhu/DYPO上公开获取。
摘要:Post-training paradigms for Large Language Models (LLMs), primarily Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL), face a fundamental dilemma: SFT provides stability (low variance) but suffers from high fitting bias, while RL enables exploration (low bias) but grapples with high gradient variance. Existing unified optimization strategies often employ naive loss weighting, overlooking the statistical conflict between these distinct gradient signals. In this paper, we provide a rigorous theoretical analysis of this bias-variance trade-off and propose DYPO (Dynamic Policy Optimization), a unified framework designed to structurally mitigate this conflict. DYPO integrates three core components: (1) a Group Alignment Loss (GAL) that leverages intrinsic group dynamics to significantly reduce RL gradient variance; (2) a Multi-Teacher Distillation mechanism that corrects SFT fitting bias via diverse reasoning paths; and (3) a Dynamic Exploitation-Exploration Gating mechanism that adaptively arbitrates between stable SFT and exploratory RL based on reward feedback. Theoretical analysis confirms that DYPO linearly reduces fitting bias and minimizes overall variance. Extensive experiments demonstrate that DYPO significantly outperforms traditional sequential pipelines, achieving an average improvement of 4.8% on complex reasoning benchmarks and 13.3% on out-of-distribution tasks. Our code is publicly available at https://github.com/Tocci-Zhu/DYPO.
【5】Finite-Sample Analysis of Nonlinear Independent Component Analysis: Sample Complexity and Identifiability Bounds
标题:非线性独立成分分析的有限样本分析:样本复杂性和可识别性界
链接:https://arxiv.org/abs/2604.08850
作者:Yuwen Jiang
摘要:独立分量分析(ICA)是一种基本的无监督学习技术,通过将混合信号分离成独立的源信号来揭示数据中的潜在结构。虽然在建立非线性ICA的渐近可识别性保证方面取得了实质性进展,但学习算法的有限样本统计特性仍然知之甚少。这一差距给从业人员带来了重大挑战,他们必须确定适当的样本量,以进行可靠的来源恢复。本文提出了一个全面的有限样本分析的非线性ICA与神经网络编码器,提供了第一个完整的特征匹配的上限和下限。我们的理论发展介绍了三个关键的技术贡献。首先,我们建立了一个直接的过度风险和识别错误之间的关系,绕过参数空间参数,从而避免率退化,否则会产生次优缩放。其次,我们证明了匹配信息理论的下界,确认我们的样本复杂性结果的最优性。第三,我们将我们的分析扩展到实际的SGD优化,表明在标准景观假设下,有限迭代梯度下降可以实现相同的样本效率。我们通过精心设计的模拟实验验证我们的理论预测。这一差距指向了未来对神经网络训练的有限样本行为的有价值的研究,并强调了我们验证的尺度律对维度和多样性的重要性。
摘要:Independent Component Analysis (ICA) is a fundamental unsupervised learning technique for uncovering latent structure in data by separating mixed signals into their independent sources. While substantial progress has been made in establishing asymptotic identifiability guarantees for nonlinear ICA, the finite-sample statistical properties of learning algorithms remain poorly understood. This gap poses significant challenges for practitioners who must determine appropriate sample sizes for reliable source recovery. This paper presents a comprehensive finite-sample analysis of nonlinear ICA with neural network encoders, providing the first complete characterization with matching upper and lower bounds. Our theoretical development introduces three key technical contributions. First, we establish a direct relationship between excess risk and identification error that bypasses parameter-space arguments, thereby avoiding the rate degradation that would otherwise yield suboptimal scaling. Second, we prove matching information-theoretic lower bounds that confirm the optimality of our sample complexity results. Third, we extend our analysis to practical SGD optimization, showing that the same sample efficiency can be achieved with finite-iteration gradient descent under standard landscape assumptions. We validate our theoretical predictions through carefully designed simulation experiments. This gap points toward valuable future research on finite-sample behavior of neural network training and highlights the importance of our validated scaling laws for dimension and diversity.
【6】RansomTrack: A Hybrid Behavioral Analysis Framework for Ransomware Detection
标题:RansomTrack:用于勒索软件检测的混合行为分析框架
链接:https://arxiv.org/abs/2604.08739
作者:Busra Caliskan,Ibrahim Gulatas,H. Hakan Kilinc,A. Halim Zaim
备注:20 pages, 7 figures
摘要:勒索软件对关键系统构成严重且快速的威胁,通常在执行后几秒钟内加密文件。研究表明,就财务损失而言,勒索软件是报告最多的网络犯罪,这突出表明迫切需要在加密完成之前进行早期检测。在本文中,我们提出了RansomTrack,一个混合行为分析框架,以消除分别使用静态和动态检测方法的局限性。使用Radare 2沙箱提取静态特征,而使用Frida工具包获得动态行为,如内存保护更改,互斥体创建,注册表访问和网络活动。我们公开发布了165个不同的勒索软件和良性软件家族的数据集,提供了文献中已知的最高家族与样本比率。使用机器学习模型的实验评估表明,集成分类器(如XGBoost和Soft Voting)的准确率高达96%,ROC-AUC得分为0.99。在9.1秒内分析的每个样本包括模块化行为日志记录,运行时检测和基于SHAP的可解释性,以突出最有影响力的功能。此外,RansomTrack框架能够在9.2秒内检测勒索软件。总体而言,RansomTrack为实时勒索软件检测提供了可扩展、低延迟和可解释的解决方案。
摘要:Ransomware poses a serious and fast-acting threat to critical systems, often encrypting files within seconds of execution. Research indicates that ransomware is the most reported cybercrime in terms of financial damage, highlighting the urgent need for early-stage detection before encryption is complete. In this paper, we present RansomTrack, a hybrid behavioral analysis framework to eliminate the limitations of using static and dynamic detection methods separately. Static features are extracted using the Radare2 sandbox, while dynamic behaviors such as memory protection changes, mutex creation, registry access and network activity are obtained using the Frida toolkit. Our dataset of 165 different ransomware and benign software families is publicly released, offering the highest family-to-sample ratio known in the literature. Experimental evaluation using machine learning models shows that ensemble classifiers such as XGBoost and Soft Voting achieve up to 96% accuracy and a ROC-AUC score of 0.99. Each sample analyzed in 9.1 seconds includes modular behavioral logging, runtime instrumentation, and SHAP-based interpretability to highlight the most influential features. Additionally, the RansomTrack framework is able to detect ransomware in under 9.2 seconds. Overall, RansomTrack offers a scalable, low-latency, and explainable solution for real-time ransomware detection.
【7】Unified Multimodal Uncertain Inference
标题:统一的多模态不确定性推理
链接:https://arxiv.org/abs/2604.08701
作者:Dengjia Zhang,Alexander Martin,William Jurayj,Kenton Murray,Benjamin Van Durme,Reno Kriz
摘要:我们介绍统一多模态不确定推理(UMUI),一个多模态推理任务,跨越文本,音频和视频,其中模型必须产生校准的概率估计假设条件的前提下,在任何形式或组合。虽然不确定推理已在文本中进行了探索,但扩展到其他模态仅限于单模态二元蕴涵判断,没有留下在其他模态中或跨其他模态进行细粒度概率推理的框架。为了解决这个问题,我们策划了一个人类注释的评估集,在音频,视觉和视听设置中具有标量概率判断,并对现有的文本和音频基准进行评估。我们介绍了CLUE(校准的潜在不确定性估计),它结合了自洽的教师校准和基于分布的信心探测,以产生校准的预测。我们证明了我们的3B参数模型在所有模式中达到了与基线相同或更强的性能,最高可达32B参数。
摘要:We introduce Unified Multimodal Uncertain Inference (UMUI), a multimodal inference task spanning text, audio, and video, where models must produce calibrated probability estimates of hypotheses conditioned on a premise in any modality or combination. While uncertain inference has been explored in text, extension to other modalities has been limited to single-modality binary entailment judgments, leaving no framework for fine-grained probabilistic reasoning in or across other modalities. To address this, we curate a human-annotated evaluation set with scalar probability judgments across audio, visual, and audiovisual settings, and additionally evaluate on existing text and audio benchmarks. We introduce CLUE (Calibrated Latent Uncertainty Estimation), which combines self-consistent teacher calibration and distribution-based confidence probing to produce calibrated predictions. We demonstrate that our 3B-parameter model achieves equivalent or stronger performance than baselines up to 32B parameters across all modalities.
【8】EngageTriBoost: Predictive Modeling of User Engagement in Digital Mental Health Intervention Using Explainable Machine Learning
标题:EngageTriBoost:使用可解释机器学习对数字心理健康干预中的用户参与度进行预测建模
链接:https://arxiv.org/abs/2604.08589
作者:Ha Na Cho,Daniel Eisenberg,Cheryl King,Kai Zheng
摘要:年轻人的心理健康挑战正在上升,需要有效的解决方案,如数字心理健康干预(DMHI)。尽管DMHIs有希望,但它们面临着重大的采用障碍,包括初始吸收率低和辍学率高。这项研究利用机器学习(ML)来分析DMHI用户的行为模式,eBridge旨在通过基于动机访谈的在线咨询来提高高危大学生对专业心理健康服务的利用率。我们的集成模型EngageTriBoost在预测参与度方面达到了84%的准确率,通过登录和顾问互动来衡量。然后,我们应用Shapley Additive exPlanations(SHAP)分析,该分析为影响用户参与的关键因素提供了清晰,可解释的见解,如情绪失调和感知的耻辱,突出了它们对DMHI采用的关键影响。这项研究展示了可解释的ML的力量,可以更好地理解用户对DMHI的参与,以提高他们的采用率和对心理健康结果的可实现影响。
摘要:Mental health challenges among young adults are on the rise, necessitating effective solutions such as digital mental health interventions (DMHIs). Despite their promise, DMHIs face significant adoption barriers, including low initial uptake and high dropout rates. This study leverages machine learning (ML) to analyze behavioral patterns of users of a DMHI, eBridge, designed to increase the utilization of professional mental health services among at-risk college students through motivational interviewing-based online counseling. Our ensemble model, EngageTriBoost, achieved up to 84% accuracy in predicting engagement, measured by sign-ins and counselor interactions. We then applied Shapley Additive exPlanations (SHAP) analysis, which provided clear, interpretable insights into key factors influencing user engagement, such as emotional dysregulation and perceived stigma, highlighting their critical effect on DMHI adoption. This study demonstrates the power of explainable ML for better understanding user engagement with DMHIs to improve their adoption and achievable impact on mental health outcomes.
【9】Robust Reasoning Benchmark
标题:稳健推理基准
链接:https://arxiv.org/abs/2604.08571
作者:Pavel Golikov,Evgenii Opryshko,Gennady Pekhimenko,Mark C. Jeffrey
摘要:虽然大型语言模型(LLM)在标准数学基准上实现了高性能,但它们的底层推理过程仍然高度过拟合于标准文本格式。我们提出了一个由14种技术组成的扰动流水线,用于评估LLM推理的鲁棒性。我们将此流水线应用于AIME 2024数据集,并在由此产生的基准上评估了8个最先进的模型。虽然前沿模型表现出韧性,但开放权重推理模型出现灾难性的崩溃(在各种扰动下平均准确率下降高达55%,在某些扰动下高达100%),暴露出结构脆弱性。为了进一步将机械解析失败与下游推理失败区分开,我们强制模型在单个上下文窗口内顺序解决多个未受扰动的数学问题,从而严格隔离模型的工作记忆容量。我们的研究结果表明,从7B到120B参数的开放权重模型以及Claude Opus 4.6在后续问题上表现出准确率衰减。这种退化表明中间推理步骤会永久地污染标准的密集注意力机制。我们认为,要实现可靠的推理,未来的推理架构必须在模型自身的思维链中集成显式的上下文重置,这引出了关于原子推理任务最佳粒度的基本开放问题。
摘要:While Large Language Models (LLMs) achieve high performance on standard mathematical benchmarks, their underlying reasoning processes remain highly overfit to standard textual formatting. We propose a perturbation pipeline consisting of 14 techniques to evaluate the robustness of LLM reasoning. We apply this pipeline to the AIME 2024 dataset and evaluate 8 state-of-the-art models on the resulting benchmark. While frontier models exhibit resilience, open-weight reasoning models suffer catastrophic collapses (up to 55% average accuracy drops across perturbations and up to 100% on some), exposing structural fragility. To further disentangle mechanical parsing failures from downstream reasoning failures, we strictly isolate the models' working memory capacity by forcing models to solve multiple unperturbed mathematical problems sequentially within a single context window. Our results indicate that open-weight models ranging from 7B to 120B parameters and Claude Opus 4.6 exhibit accuracy decay on subsequent problems. This degradation demonstrates that intermediate reasoning steps permanently pollute standard dense attention mechanisms. We argue that to achieve reliable reasoning, future reasoning architectures must integrate explicit contextual resets within a model's own Chain-of-Thought, leading to fundamental open questions regarding the optimal granularity of atomic reasoning tasks.
【10】High-dimensional inference for the $γ$-ray sky with differentiable programming
标题:用可微编程对$γ$射线天空进行高维推断
链接:https://arxiv.org/abs/2604.08648
作者:Siddharth Mishra-Sharma,Tracy R. Slatyer,Yitian Sun,Yuqing Wu
备注:17 pages, 13 figures. Code available at https://github.com/smsharma/fermi-prob-prog
摘要:我们阐述了使用可微概率编程技术的动机,以应对天体物理$γ$射线分析中固有的庞大模型空间。针对长期存在的银河系中心$γ$射线超出(GCE)难题,我们构建了可微的前向模型和似然函数,充分利用GPU加速和向量化,以完全概率化的方式同时考虑与GCE辐射一致的连续的可能空间形态。我们的设置允许使用变分方法在庞大的模型空间上进行高效的推断。除了应用于$γ$射线数据之外,这项工作的一个目标是展示如何将可微概率编程用作一种工具,以灵活地分析天体物理数据集。
摘要:We motivate the use of differentiable probabilistic programming techniques in order to account for the large model-space inherent to astrophysical $γ$-ray analyses. Targeting the longstanding Galactic Center $γ$-ray Excess (GCE) puzzle, we construct a differentiable forward model and likelihood that make liberal use of GPU acceleration and vectorization in order to simultaneously account for a continuum of possible spatial morphologies consistent with the GCE emission in a fully probabilistic manner. Our setup allows for efficient inference over the large model space using variational methods. Beyond application to $γ$-ray data, a goal of this work is to showcase how differentiable probabilistic programming can be used as a tool to enable flexible analyses of astrophysical datasets.
检测相关(8篇)
【1】Detecting Diffusion-generated Images via Dynamic Assembly Forests
标题:通过动态装配森林检测扩散生成图像
链接:https://arxiv.org/abs/2604.09106
作者:Mengxin Fu,Yuezun Li
摘要:扩散模型以生成高质量图像而闻名,由此引起严重的安全问题。为了应对这一问题,大多数工作都依赖于深度神经网络(例如CNN和Transformer),而在很大程度上忽视了传统机器学习模型的潜力。在本文中,我们重新审视此类替代方案,并提出了一种新的动态装配森林模型(DAF)来检测扩散生成的图像。DAF建立在深度森林范式之上,解决了特征学习和可扩展训练方面的固有局限性,使其成为有效的扩散生成图像检测器。与现有的基于DNN的方法相比,DAF的参数明显更少,计算成本低得多,并且可以在没有GPU的情况下部署,同时在标准评估协议下实现具有竞争力的性能。这些结果突出了所提出的方法在资源受限场景中作为重量级DNN模型的实用替代品的强大潜力。我们的代码和模型可以在https://github.com/OUC-VAS/DAF上找到。
摘要:Diffusion models are known for generating high-quality images, causing serious security concerns. To combat this, most efforts rely on deep neural networks (e.g., CNNs and Transformers), while largely overlooking the potential of traditional machine learning models. In this paper, we revisit such alternatives and propose a novel Dynamic Assembly Forest model (DAF) to detect diffusion-generated images. Built upon the deep forest paradigm, DAF addresses the inherent limitations in feature learning and scalable training, making it an effective diffusion-generated image detector. Compared to existing DNN-based methods, DAF has significantly fewer parameters, much lower computational cost, and can be deployed without GPUs, while achieving competitive performance under standard evaluation protocols. These results highlight the strong potential of the proposed method as a practical substitute for heavyweight DNN models in resource-constrained scenarios. Our code and models are available at https://github.com/OUC-VAS/DAF.
【2】CLIP-Inspector: Model-Level Backdoor Detection for Prompt-Tuned CLIP via OOD Trigger Inversion
标题:CLIP-Inspector:通过OOD触发器反演对提示调优的CLIP进行模型级后门检测
链接:https://arxiv.org/abs/2604.09101
作者:Akshit Jindal,Saket Anand,Chetan Arora,Vikram Goyal
备注:17 pages (8 main + 2 references + 7 supplementary), Accepted to CVPR Findings 2026
摘要:数据和计算资源有限的组织越来越多地将模型训练外包给机器学习即服务(MLaaS)提供商,这些提供商通过提示调优(prompt tuning)而不是从头开始训练,使CLIP等视觉语言模型(VLM)适应下游任务。这种半诚实的设置会带来安全风险:恶意提供商可以遵循提示调优协议,却植入后门,迫使带触发器的输入被分类到攻击者选择的类别中,即使对于分布外(OOD)数据也是如此。这样的后门不会改动编码器,因此专注于编码器损坏的现有方法无法检测到它们。其他在训练之前或推理期间对数据进行清理的数据级方法也无法回答关键问题:“交付的模型是否有后门?”为了解决这个模型级验证问题,我们引入了CLIP-Inspector(CI),这是一种专为经提示调优的CLIP模型设计的后门检测方法。假设可以白盒访问交付的模型并拥有一个未标记的OOD图像池,CI为每个类重建可能的触发器,以确定模型是否表现出后门行为。此外,我们证明,使用CI重建的触发器,在正确标记的带触发器输入上进行微调,可以重新对齐模型并降低后门的有效性。通过在10个数据集和4种后门攻击上的广泛实验,我们证明CI可以在单个epoch内仅使用1,000张OOD图像重建有效的触发器,达到94%的检测准确率(47/50个模型)。与经过适配的触发器反演基线相比,CI的AUROC得分显著更高(0.973 对 0.495/0.687),从而能够对经提示调优的CLIP模型进行审查和事后修复,以确保安全部署。
摘要:Organisations with limited data and computational resources increasingly outsource model training to Machine Learning as a Service (MLaaS) providers, who adapt vision-language models (VLMs) such as CLIP to downstream tasks via prompt tuning rather than training from scratch. This semi-honest setting creates a security risk where a malicious provider can follow the prompt-tuning protocol yet implant a backdoor, forcing triggered inputs to be classified into an attacker-chosen class, even for out-of-distribution (OOD) data. Such backdoors leave encoders untouched, making them undetectable to existing methods that focus on encoder corruption. Other data-level methods that sanitize data before training or during inference also fail to answer the critical question, "Is the delivered model backdoored or not?" To address this model-level verification problem, we introduce CLIP-Inspector (CI), a backdoor detection method designed for prompt-tuned CLIP models. Assuming white-box access to the delivered model and a pool of unlabeled OOD images, CI reconstructs possible triggers for each class to determine if the model exhibits backdoor behaviour or not. Additionally, we demonstrate that using CI's reconstructed trigger for fine-tuning on correctly labeled triggered inputs enables us to re-align the model and reduce backdoor effectiveness. Through extensive experiments across ten datasets and four backdoor attacks, we demonstrate that CI can reconstruct effective triggers in a single epoch using only 1,000 OOD images, achieving a 94% detection accuracy (47/50 models). Compared to adapted trigger-inversion baselines, CI yields a markedly higher AUROC score (0.973 vs 0.495/0.687), thus enabling the vetting and post-hoc repair of prompt-tuned CLIP models to ensure safe deployment.
【3】Identification and Anonymization of Named Entities in Unstructured Information Sources for Use in Social Engineering Detection
标题:用于社会工程检测的非结构化信息源中命名实体的识别和匿名化
链接:https://arxiv.org/abs/2604.09016
作者:Carlos Jimeno Miguel,Raul Orduna,Francesco Zola
摘要:该研究解决了在遵守《通用数据保护条例》(GDPR)和《刑法典》第10/1995号组织法等法规要求的同时,创建用于网络犯罪分析的数据集的挑战。为此,提出了一种系统,用于从Telegram平台收集信息,包括文本、音频和图像;实现结合信号增强技术的语音到文本转录模型;以及评估不同的命名实体识别(NER)解决方案,包括Microsoft Presidio和采用基于transformer的架构设计的AI模型。实验结果表明,Parakeet在音频转录中实现了最佳性能,而所提出的NER解决方案在检测敏感信息方面实现了最高的F1分数。此外,还提出了匿名化指标,可以评估数据中结构一致性的保持情况,同时保证个人信息得到保护,并在当前法律框架内支持网络安全研究。
摘要:This study addresses the challenge of creating datasets for cybercrime analysis while complying with the requirements of regulations such as the General Data Protection Regulation (GDPR) and Organic Law 10/1995 of the Penal Code. To this end, a system is proposed for collecting information from the Telegram platform, including text, audio, and images; the implementation of speech-to-text transcription models incorporating signal enhancement techniques; and the evaluation of different Named Entity Recognition (NER) solutions, including Microsoft Presidio and AI models designed using a transformer-based architecture. Experimental results indicate that Parakeet achieves the best performance in audio transcription, while the proposed NER solutions achieve the highest f1-score values in detecting sensitive information. In addition, anonymization metrics are presented that allow evaluation of the preservation of structural coherence in the data, while simultaneously guaranteeing the protection of personal information and supporting cybersecurity research within the current legal framework.
【4】Tracing the Chain: Deep Learning for Stepping-Stone Intrusion Detection
标题:追踪链:用于垫脚石入侵检测的深度学习
链接:https://arxiv.org/abs/2604.08800
作者:Nate Mathews,Nicholas Hopper,Matthew Wright
摘要:垫脚石入侵(SSI)是一种流行的网络规避技术,攻击者通过由受感染中间主机组成的链路转发会话,以掩盖其来源。有效的SSI检测需要以极低的误报率在每个中继主机处关联传入和传出流-这是一个严格的要求,使得经典的统计方法在实际运营环境中不够用。我们将ESPRESSO(一种深度学习流关联模型,结合了基于transformer的特征提取网络、时间对齐的多通道间隔特征和在线三元组度量学习)应用于垫脚石入侵检测问题。为了支持训练和评估,我们开发了一个合成数据收集工具,可以跨五种隧道协议生成逼真的垫脚石流量:SSH、SOCAT、ICMP、DNS和混合多协议链。在所有五种协议中,无论是主机模式还是网络模式的检测场景,ESPRESSO的性能都大大优于最先进的DeepCoFFEA基线,在网络模式下,对于标准突发型协议,在误报率为$10^{-3}$的情况下实现了超过0.99的真阳性率。我们进一步展示了链长度预测作为区分恶意和良性跳转(pivoting)的工具,并进行了系统的鲁棒性分析,揭示了基于时序的扰动是基于相关性的垫脚石检测器的主要弱点。
摘要:Stepping-stone intrusions (SSIs) are a prevalent network evasion technique in which attackers route sessions through chains of compromised intermediate hosts to obscure their origin. Effective SSI detection requires correlating the incoming and outgoing flows at each relay host at extremely low false positive rates -- a stringent requirement that renders classical statistical methods inadequate in operational settings. We apply ESPRESSO, a deep learning flow correlation model combining a transformer-based feature extraction network, time-aligned multi-channel interval features, and online triplet metric learning, to the problem of stepping-stone intrusion detection. To support training and evaluation, we develop a synthetic data collection tool that generates realistic stepping-stone traffic across five tunneling protocols: SSH, SOCAT, ICMP, DNS, and mixed multi-protocol chains. Across all five protocols and in both host-mode and network-mode detection scenarios, ESPRESSO substantially outperforms the state-of-the-art DeepCoFFEA baseline, achieving a true positive rate exceeding 0.99 at a false positive rate of $10^{-3}$ for standard bursty protocols in network-mode. We further demonstrate chain length prediction as a tool for distinguishing malicious from benign pivoting, and conduct a systematic robustness analysis revealing that timing-based perturbations are the primary vulnerability of correlation-based stepping-stone detectors.
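The headline metric in this abstract is a true positive rate at a fixed false positive rate of $10^{-3}$. As a rough illustration of how such an operating point can be read off from correlation scores, here is a minimal sketch; it is not ESPRESSO's evaluation code, and the function name and the synthetic scores are assumptions made for the example.

```python
def tpr_at_fpr(pos_scores, neg_scores, max_fpr):
    """Pick the loosest threshold whose FPR stays within max_fpr, then report TPR.

    pos_scores: scores of truly correlated flow pairs (hypothetical input).
    neg_scores: scores of uncorrelated flow pairs (hypothetical input).
    """
    neg_sorted = sorted(neg_scores, reverse=True)
    # Number of false positives we may tolerate; the epsilon guards against float rounding.
    allowed = int(max_fpr * len(neg_sorted) + 1e-9)
    # Flag a pair only if its score is strictly above the (allowed + 1)-th highest negative.
    threshold = neg_sorted[allowed] if allowed < len(neg_sorted) else float("-inf")
    tp = sum(s > threshold for s in pos_scores)
    fp = sum(s > threshold for s in neg_scores)
    return tp / len(pos_scores), fp / len(neg_scores), threshold

# Synthetic scores for illustration only.
negatives = [i / 1000 for i in range(1000)]   # 0.000 ... 0.999
positives = [0.9985, 0.9995, 0.5, 1.2]
tpr, fpr, thr = tpr_at_fpr(positives, negatives, max_fpr=1e-3)
print(f"threshold={thr}, TPR={tpr}, FPR={fpr}")
```

With only 1,000 negative pairs, an FPR of $10^{-3}$ corresponds to a single tolerated false positive, which is why evaluations at this operating point typically need very large negative sets to be statistically meaningful.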
【5】Detection of Hate and Threat in Digital Forensics: A Case-Driven Multimodal Approach
标题:数字取证中的仇恨和威胁检测:案例驱动的多模式方法
链接:https://arxiv.org/abs/2604.08609
作者:Ponkoj Chandra Shill
备注:8 pages, 4 figures
摘要:数字取证调查越来越依赖于图像、扫描文档和上下文报告等异构证据。这些人工制品可能包含明确或隐含的伤害,仇恨,威胁,暴力或恐吓的表达,但现有的自动化方法通常假设干净的文本输入或应用视觉模型,而无需法医证明。本文提出了一种案例驱动的多模态方法,用于法医分析中的仇恨和威胁检测。拟议的框架明确确定文本证据的存在和来源,区分嵌入文本、相关上下文文本和仅图像证据。基于所识别的证据配置,该框架选择性地应用文本分析、多模态融合或使用具有Vision Transformer骨干(ViT)的视觉语言模型的仅图像语义推理。通过对证据可用性的条件推理,该方法反映了法医决策,提高了证据的可追溯性,并避免了不合理的模态假设。法医风格的图像证据的实验评估表明,在异构的证据场景的一致性和可解释的行为。
摘要:Digital forensic investigations increasingly rely on heterogeneous evidence such as images, scanned documents, and contextual reports. These artifacts may contain explicit or implicit expressions of harm, hate, threat, violence, or intimidation, yet existing automated approaches often assume clean text input or apply vision models without forensic justification. This paper presents a case-driven multimodal approach for hate and threat detection in forensic analysis. The proposed framework explicitly determines the presence and source of textual evidence, distinguishing between embedded text, associated contextual text, and image-only evidence. Based on the identified evidence configuration, the framework selectively applies text analysis, multimodal fusion, or image-only semantic reasoning using vision language models with vision transformer backbones (ViT). By conditioning inference on evidence availability, the approach mirrors forensic decision-making, improves evidentiary traceability, and avoids unjustified modality assumptions. Experimental evaluation on forensic-style image evidence demonstrates consistent and interpretable behavior across heterogeneous evidence scenarios.
【6】Multivariate Time Series Anomaly Detection via Dual-Branch Reconstruction and Autoregressive Flow-based Residual Density Estimation
标题:基于双分支重构和自回归流残差密度估计的多元时间序列异常检测
链接:https://arxiv.org/abs/2604.08582
作者:Jun Liu,Ying Chen,Ziqian Lu,Qinyue Tong,Jun Tang
备注:12 pages, 3 figures
摘要:多变量时间序列异常检测(MTSAD)对于工业控制和航空航天系统等实际监控场景至关重要。主流的基于重建的异常检测方法存在两个关键限制:第一,由于过分强调跨变量建模而过拟合于虚假相关性;第二,通过简单地对多变量重建误差求和来生成具有误导性的异常分数,这使得难以区分难以重建的样本和真正的异常。为了解决这些问题,我们提出了DBR-AF,一种集成了双分支重建(DBR)编码器和自回归流(AF)模块的新框架。DBR编码器将跨变量相关性学习与变量内统计特性建模解耦,以减轻虚假相关性,而AF模块采用多个堆叠的可逆变换来对复杂的多变量残差分布进行建模,并进一步利用密度估计来准确识别具有较大重建误差的正常样本。在七个基准数据集上进行的大量实验表明,DBR-AF达到了最先进的性能,消融研究验证了其核心组件的不可或缺性。
摘要:Multivariate Time Series Anomaly Detection (MTSAD) is critical for real-world monitoring scenarios such as industrial control and aerospace systems. Mainstream reconstruction-based anomaly detection methods suffer from two key limitations: first, overfitting to spurious correlations induced by an overemphasis on cross-variable modeling; second, the generation of misleading anomaly scores by simply summing up multivariable reconstruction errors, which makes it difficult to distinguish between hard-to-reconstruct samples and genuine anomalies. To address these issues, we propose DBR-AF, a novel framework that integrates a dual-branch reconstruction (DBR) encoder and an autoregressive flow (AF) module. The DBR encoder decouples cross-variable correlation learning and intra-variable statistical property modeling to mitigate spurious correlations, while the AF module employs multiple stacked reversible transformations to model the complex multivariate residual distribution and further leverages density estimation to accurately identify normal samples with large reconstruction errors. Extensive experiments on seven benchmark datasets demonstrate that DBR-AF achieves state-of-the-art performance, with ablation studies validating the indispensability of its core components.
【7】Fully Autonomous Z-Score-Based TinyML Anomaly Detection on Resource-Constrained MCUs Using Power Side-Channel Data
标题:使用电源侧通道数据在资源受限MCU上进行基于Z-Score的完全自主TinyML异常检测
链接:https://arxiv.org/abs/2604.08581
作者:Abdulrahman Albaiz,Fathi Amsaad
备注:SaTC 2026 Conference
摘要:本文提出了一种基于Z-Score的完全自主的Tiny Machine Learning(TinyML)异常检测系统,该系统部署在低功耗微控制器上,利用电源侧信道数据实时监控家电行为。与现有的依赖离线训练或云辅助分析的物联网(IoT)异常检测方法不同,所提出的系统直接在资源受限的微控制器上完成模型训练和推理,无需外部计算或网络连接。系统在初始训练阶段连续采样电流消耗,在设备上计算均方根(RMS)值,并导出统计参数。异常通过轻量级Z-Score阈值检测,从而实现适用于嵌入式部署的可解释且计算高效的推理。该架构在基于STM32的平台上实现,并使用从家用迷你冰箱在正常运行和受控异常条件下收集的14天数据集进行评估。结果显示检测性能完美,精确率和召回率均为1.00,推理延迟在几十微秒量级,总内存占用约为3.3 KB SRAM和63 KB闪存。这些结果证实,可以在低成本微控制器上实现稳健且完全自主的TinyML异常检测。未来的工作包括扩展该框架,以纳入更多轻量级模型和多设备学习场景。
摘要:This paper presents a fully autonomous Tiny Machine Learning (TinyML) Z-Score-based anomaly detection system deployed on a low-power microcontroller for real-time monitoring of appliance behavior using power side-channel data. Unlike existing Internet of Things (IoT) anomaly detection approaches that rely on offline training or cloud-assisted analytics, the proposed system performs both model training and inference directly on a resource-constrained microcontroller without external computation or connectivity. The system continuously samples current consumption, computes Root Mean Square (RMS) values on-device, and derives statistical parameters during an initial training phase. Anomalies are detected using lightweight Z-Score thresholds, enabling interpretable and computationally efficient inference suitable for embedded deployment. The architecture was implemented on an STM32-based platform and evaluated using a 14-day dataset collected from a household mini-fridge under normal operation and controlled anomaly conditions. Results demonstrate perfect detection performance, with Precision and Recall of 1.00, inference latencies on the order of tens of microseconds, and a total memory footprint of approximately 3.3 KB SRAM and 63 KB Flash. These results confirm that robust and fully autonomous TinyML anomaly detection can be achieved on low-cost microcontrollers. Future work includes extending the framework to incorporate additional lightweight models and multi-device learning scenarios.
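The train-then-threshold flow described in this abstract can be sketched in a few lines. This is only an illustration in desktop Python, not the authors' firmware: the class name `ZScoreDetector`, the threshold of 3.0, and the sample windows are all assumptions made for the example.

```python
import math
import statistics


def rms(window):
    """Root mean square of one window of current samples."""
    return math.sqrt(sum(s * s for s in window) / len(window))


class ZScoreDetector:
    """Learn mean/std of RMS values, then flag values with a large |z|."""

    def __init__(self, threshold=3.0):  # assumed threshold, not from the paper
        self.threshold = threshold
        self.mean = 0.0
        self.std = 1.0

    def fit(self, rms_values):
        """Training phase: derive the statistical parameters."""
        self.mean = statistics.fmean(rms_values)
        # Guard against a zero spread so the z-score stays finite.
        self.std = statistics.pstdev(rms_values) or 1e-9

    def z_score(self, value):
        return (value - self.mean) / self.std

    def is_anomaly(self, value):
        """Inference phase: one subtraction, one division, one comparison."""
        return abs(self.z_score(value)) > self.threshold


# Hypothetical current windows (amperes) recorded during normal operation.
train_windows = [[0.9, 1.0, 1.1], [1.0, 1.0, 1.0], [1.1, 0.9, 1.0], [1.0, 1.1, 0.9]]
detector = ZScoreDetector(threshold=3.0)
detector.fit([rms(w) for w in train_windows])
```

Inference needs only two stored floats and a handful of arithmetic operations, which is consistent with the small memory footprint and microsecond-scale latency the abstract reports.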
【8】Ranked Activation Shift for Post-Hoc Out-of-Distribution Detection
标题:用于事后(Post-Hoc)分布外检测的排序激活偏移
链接:https://arxiv.org/abs/2604.08572
作者:Gianluca Guglielmo,Marc Masana
备注:Code is available at https://github.com/gigug/RAS
摘要:现有最先进的事后(post-hoc)分布外检测方法依赖于对中间层激活的编辑。然而,它们在不同数据集和模型上的性能并不一致。我们表明,这种不稳定性源于激活分布的差异,并指出了基于缩放的方法在倒数第二层激活未经整流(rectified)时出现的一种失效模式。基于这一分析,我们提出了一种无超参数的事后方法,用固定的分布内参考曲线替换排序后的激活幅值。这一简单的即插即用方法在不同数据集和架构上表现出强劲而一致的性能,既不需要对倒数第二层激活函数作任何假设,也不需要任何超参数调整,同时在构造上保持分布内分类精度不变。我们进一步分析了改进的来源,表明抑制性和兴奋性的激活偏移都各自独立地有助于更好地区分分布外样本。
摘要:State-of-the-art post-hoc out-of-distribution detection methods rely on intermediate layer activation editing. However, they exhibit inconsistent performance across datasets and models. We show that this instability is driven by differences in the activation distributions, and identify a failure mode of scaling-based methods that arises when penultimate layer activations are not rectified. Motivated by this analysis, we propose Ranked Activation Shift (RAS), a hyperparameter-free post-hoc method that replaces sorted activation magnitudes with a fixed in-distribution reference profile. Our simple plug-and-play method shows strong and consistent performance across datasets and architectures without assumptions on the penultimate layer activation function, and without requiring any hyperparameter tuning, while preserving in-distribution classification accuracy by construction. We further analyze what drives the improvement, showing that both inhibiting and exciting activation shifts independently contribute to better out-of-distribution discrimination.
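The core idea of replacing rank-ordered activations with a fixed reference profile can be illustrated with a small sketch. It rests on several assumptions that the abstract does not confirm: the profile is taken to be the mean of sorted in-distribution activation vectors, units are ranked by raw value rather than magnitude, and the function names are hypothetical. The authors' repository is the authoritative reference.

```python
def reference_profile(id_activations):
    """Average the sorted activation vectors of in-distribution samples."""
    sorted_rows = [sorted(row) for row in id_activations]
    count = len(sorted_rows)
    return [sum(column) / count for column in zip(*sorted_rows)]


def ranked_shift(activation, profile):
    """Give each unit the profile value matching its rank in `activation`."""
    # order[r] is the index of the unit holding the r-th smallest activation.
    order = sorted(range(len(activation)), key=lambda i: activation[i])
    shifted = [0.0] * len(activation)
    for rank, unit in enumerate(order):
        shifted[unit] = profile[rank]
    return shifted


# Hypothetical penultimate-layer activations from two in-distribution samples.
id_acts = [[0.1, 0.9, 0.4], [0.2, 0.3, 1.1]]
profile = reference_profile(id_acts)

# A test sample with an unusual scale keeps its unit ordering but not its values.
shifted = ranked_shift([5.0, -2.0, 0.0], profile)
```

Because only the ordering of units survives the shift, the sketch has no scale parameter to tune, and it accepts negative activations without special handling.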
分类|识别(6篇)
【1】Are Independently Estimated View Uncertainties Comparable? Unified Routing for Trusted Multi-View Classification
标题:独立估计的视图不确定性是否可比?面向可信多视图分类的统一路由
链接:https://arxiv.org/abs/2604.09288
作者:Yilin Zhang,Cai Xu,Haishun Chen,Ziyu Guan,Wei Zhao
备注:14 pages, Under Review
摘要:可信多视图分类通常依赖于视图级的证据融合过程:每个视图独立地产生类别证据和不确定性,最终预测通过聚合这些独立意见得到。这种设计虽然模块化且具备不确定性感知能力,但它隐含地假设来自不同视图的证据在数值上是可比的。然而在实践中,这一假设很脆弱。不同视图通常在特征空间、噪声水平和语义粒度上存在差异,而独立训练的分支仅针对预测正确性进行优化,没有任何约束来保证证据强度的跨视图一致性。因此,用于融合的不确定性可能由分支特有的尺度偏差主导,而不是由真实的样本级可靠性决定。为了解决这个问题,我们提出了基于统一路由的可信多视图学习(TMUR),它将特定于视图的证据提取与融合仲裁解耦。TMUR使用视图私有专家和一个协作专家,并采用一个观察全局多视图上下文的统一路由器来生成样本级专家权重。软负载均衡和多样性正则化进一步促进专家利用的均衡和更具判别性的专家专业化。我们还给出了理论分析,说明为什么独立的证据监督无法确定一个共同的跨视图证据尺度,以及为什么当可靠性依赖于样本时,统一的全局路由优于分支局部仲裁。
摘要:Trusted multi-view classification typically relies on a view-wise evidential fusion process: each view independently produces class evidence and uncertainty, and the final prediction is obtained by aggregating these independent opinions. While this design is modular and uncertainty-aware, it implicitly assumes that evidence from different views is numerically comparable. In practice, however, this assumption is fragile. Different views often differ in feature space, noise level, and semantic granularity, while independently trained branches are optimized only for prediction correctness, without any constraint enforcing cross-view consistency in evidence strength. As a result, the uncertainty used for fusion can be dominated by branch-specific scale bias rather than true sample-level reliability. To address this issue, we propose Trusted Multi-view learning with Unified Routing (TMUR), which decouples view-specific evidence extraction from fusion arbitration. TMUR uses view-private experts and one collaborative expert, and employs a unified router that observes the global multi-view context to generate sample-level expert weights. Soft load-balancing and diversity regularization further encourage balanced expert utilization and more discriminative expert specialization. We also provide theoretical analysis showing why independent evidential supervision does not identify a common cross-view evidence scale, and why unified global routing is preferable to branch-local arbitration when reliability is sample-dependent.
【2】Towards Lifelong Aerial Autonomy: Geometric Memory Management for Continual Visual Place Recognition in Dynamic Environments
标题:迈向终身空中自主:动态环境中连续视觉位置识别的几何记忆管理
链接:https://arxiv.org/abs/2604.09038
作者:Xingyu Shao,Zhiqiang Yan,Liangzheng Sun,Mengfan He,Chao Chen,Jinhui Zhang,Chunyu Li,Ziyang Meng
摘要:Robust geo-localization in changing environmental conditions is critical for long-term aerial autonomy. While visual place recognition (VPR) models perform well when airborne views match the training domain, adapting them to shifting distributions during sequential missions triggers catastrophic forgetting. Existing continual learning (CL) methods often fail here because geographic features exhibit severe intra-class variations. In this work, we formulate aerial VPR as a mission-based domain-incremental learning (DIL) problem and propose a novel heterogeneous memory framework. To respect strict onboard storage constraints, our "Learn-and-Dispose" pipeline decouples geographic knowledge into static satellite anchors (preserving global geometric priors) and a dynamic experience replay buffer (retaining domain-specific features). We introduce a spatially-constrained allocation strategy that optimizes buffer selection based on sample difficulty or feature space diversity. To facilitate systematic assessment, we provide three evaluation criteria and a comprehensive benchmark derived from 21 diverse mission sequences. Extensive experiments demonstrate that our architecture significantly boosts spatial generalization; our diversity-driven buffer selection outperforms the random baseline by 7.8% in knowledge retention. Unlike class-mean preservation methods that fail in unstructured environments, maximizing structural diversity achieves a superior plasticity-stability balance and ensures order-agnostic robustness across randomized sequences. These results prove that maintaining structural feature coverage is more critical than sample difficulty for resolving catastrophic forgetting in lifelong aerial autonomy.
【3】IKKA: Inversion Classification via Critical Anomalies for Robust Visual Servoing
标题:IKKA:通过关键异常进行倒置分类以实现鲁棒视觉伺服
链接:https://arxiv.org/abs/2604.08754
作者:Darya Pavlenko
备注:9 pages, 2 figures, 3 tables. Submitted to NeurIPS 2026
摘要:We introduce IKKA (Inversion Classification via Critical Anomalies), a topologically motivated weighting framework for robust visual servoing under distribution shift. Unlike conventional outlier handling, IKKA treats maverick points as structurally informative observations: points where small perturbations can induce qualitatively different control responses or class assignments. The method combines local extremality, boundary transversality, and multi-scale persistence into a single anomaly weight, $W(x) = E(x) \cdot T(x) \cdot M(x)$, which modulates control updates near ambiguous decision regions. We instantiate IKKA in a CPU-only embedded visual-servoing pipeline on Raspberry Pi 4 and evaluate it across 230 reproducible runs under nominal and stress conditions. In stress scenarios involving dim illumination and transient occlusion, IKKA reduces the 95th-percentile lateral error by 24% relative to a hybrid baseline (0.124 to 0.094) while increasing throughput from 20.0 to 24.8 Hz. Non-parametric analysis confirms a large effect size (Cliff's delta = 0.79).
【4】EfficientSign: An Attention-Enhanced Lightweight Architecture for Indian Sign Language Recognition
标题:EfficientSign:一种用于印度手语识别的注意力增强的轻量级架构
链接:https://arxiv.org/abs/2604.08694
作者:Rishabh Gupta,Shravya R. Nalla
备注:Submitted to IEEE Transactions on Human-Machine Systems
摘要:How do you build a sign language recognizer that works on a phone? That question drove this work. We built EfficientSign, a lightweight model that takes EfficientNet-B0 and adds two attention modules (Squeeze-and-Excitation for channel focus, and a spatial attention layer that focuses on the hand gestures). We tested it against five other approaches on 12,637 images of Indian Sign Language alphabets, all 26 classes, using 5-fold cross-validation. EfficientSign achieves an accuracy of 99.94% (+/-0.05%), which matches ResNet18's 99.97% accuracy, but with 62% fewer parameters (4.2M vs 11.2M). We also experimented with feeding deep features (1,280-dimensional vectors pulled from EfficientNet-B0's pooling layer) into classical classifiers. SVM achieved an accuracy of 99.63%, Logistic Regression 99.03%, and KNN 96.33%. All of these blow past the 92% that SURF-based methods managed on a similar dataset back in 2015. Our results show that an attention-enhanced model provides an efficient and deployable solution for ISL recognition without requiring a massive model or hand-tuned feature pipelines.
【5】An Open-Source, Open Data Approach to Activity Classification from Triaxial Accelerometry in an Ambulatory Setting
标题:一种在动态(非卧床)环境下基于三轴加速度计进行活动分类的开源、开放数据方法
链接:https://arxiv.org/abs/2604.09451
作者:Sepideh Nikookar,Edward Tian,Harrison Hoffman,Matthew Parks,J. Lucas McKay,Yashar Kiarashi,Tommy T. Thomas,Alex Hall,David W. Wright,Gari D. Clifford
摘要:The accelerometer has become an almost ubiquitous device, providing enormous opportunities in healthcare monitoring beyond step counting or other average energy estimates in 15-60 second epochs. Objective: To develop an open data set with associated open-source code for processing 50 Hz tri-axial accelerometry data to classify patient activity levels and natural types of movement. Approach: Data were collected from 23 healthy subjects (16 males and seven females) aged between 23 and 62 years using an ambulatory device, which included a triaxial accelerometer and synchronous lead II equivalent ECG, for an average of 26 minutes each. Participants followed a standardized activity routine involving five distinct activities: lying, sitting, standing, walking, and jogging. Two classifiers were constructed: a signal processing technique to distinguish between high and low activity levels and a convolutional neural network (CNN)-based approach to classify each of the five activities. Main results: The binary (high/low) activity classifier exhibited an F1 score of 0.79. The multi-class CNN-based classifier provided an F1 score of 0.83. The code for this analysis has been made available under an open-source license together with the data on which the classifiers were trained and tested. Significance: The classification of behavioral activity, as demonstrated in this study, offers valuable context for interpreting traditional health metrics and may provide contextual information to support the future development of clinical decision-making tools for patient monitoring, predictive analytics, and personalized health interventions.
【6】Iterative Identification Closure: Amplifying Causal Identifiability in Linear SEMs
标题:迭代识别闭合:扩大线性SEM中的因果可识别性
链接:https://arxiv.org/abs/2604.09309
作者:Ziyi Ding,Xiao-Ping Zhang
摘要:The Half-Trek Criterion (HTC) is the primary graphical tool for determining generic identifiability of causal effect coefficients in linear structural equation models (SEMs) with latent confounders. However, HTC is inherently node-wise: it simultaneously resolves all incoming edges of a node, leaving a gap of "inconclusive" causal effects (15-23% in moderate graphs). We introduce Iterative Identification Closure (IIC), a general framework that decouples causal identification into two phases: (1) a seed function S_0 that identifies an initial set of edges from any external source of information (instrumental variables, interventions, non-Gaussianity, prior knowledge, etc.); and (2) Reduced HTC propagation that iteratively substitutes known coefficients to reduce system dimension, enabling identification of edges that standard HTC cannot resolve. The core novelty is iterative identification propagation: newly identified edges feed back to unlock further identification -- a mechanism absent from all existing graphical criteria, which treat each edge (or node) in isolation. This propagation is non-trivial: coefficient substitution alters the covariance structure, and soundness requires proving that the modified Jacobian retains generic full rank -- a new theoretical result (Reduced HTC Theorem). We prove that IIC is sound, monotone, converges in O(|E|) iterations (empirically <=2), and strictly subsumes both HTC and ancestor decomposition. Exhaustive verification on all graphs with n<=5 (134,144 edges) confirms 100% precision (zero false positives); with combined seeds, IIC reduces the HTC gap by over 80%. The propagation gain is gamma~4x (2 seeds identifying ~3% of edges to 97.5% total identification), far exceeding gamma<=1.2x of prior methods that incorporate side information without iterative feedback.
表征(4篇)
【1】Temporal Dropout Risk in Learning Analytics: A Harmonized Survival Benchmark Across Dynamic and Early-Window Representations
标题:学习分析中的时序辍学风险:跨动态与早期窗口表示的统一生存分析基准
链接:https://arxiv.org/abs/2604.08870
作者:Rafael da Silva,Jeff Eicher,Gregory Longo
备注:34 pages, 14 figures, 18 tables. Includes appendix with reliability diagrams, sensitivity analyses, and dataset audit tables
摘要:Student dropout is a persistent concern in Learning Analytics, yet comparative studies frequently evaluate predictive models under heterogeneous protocols, prioritizing discrimination over temporal interpretability and calibration. This study introduces a survival-oriented benchmark for temporal dropout risk modelling using the Open University Learning Analytics Dataset (OULAD). Two harmonized arms are compared: a dynamic weekly arm, with models in person-period representation, and a comparable continuous-time arm, with an expanded roster of families -- tree-based survival, parametric, and neural models. The evaluation protocol integrates four analytical layers: predictive performance, ablation, explainability, and calibration. Results are reported within each arm separately, as a single cross-arm ranking is not methodologically warranted. Within the comparable arm, Random Survival Forest leads in discrimination and horizon-specific Brier scores; within the dynamic arm, Poisson Piecewise-Exponential leads narrowly on integrated Brier score within a tight five-family cluster. No-refit bootstrap sampling variability qualifies these positions as directional signals rather than absolute superiority. Ablation and explainability analyses converged, across all families, on a shared finding: the dominant predictive signal was not primarily demographic or structural, but temporal and behavioral. Calibration corroborated this pattern in the better-discriminating models, with the exception of XGBoost AFT, which exhibited systematic bias. These results support the value of a harmonized, multi-dimensional benchmark in Learning Analytics and situate dropout risk as a temporal-behavioral process rather than a function of static background attributes.
【2】On the Spectral Geometry of Cross-Modal Representations: A Functional Map Diagnostic for Multimodal Alignment
标题:跨模态表征的谱几何:用于多模态对齐的函数映射(Functional Map)诊断
链接:https://arxiv.org/abs/2604.08579
作者:Krisanu Sarkar
备注:Under review at ACMMM Brave New Ideas Track
摘要:We study cross-modal alignment between independently pretrained vision (DINOv2) and language (all-MiniLM-L6-v2) encoders using the functional map framework from computational geometry, which represents correspondence between representation manifolds as a compact linear operator between graph Laplacian eigenbases. While the framework underperforms Procrustes alignment and relative representations for cross-modal retrieval across all supervision budgets, it reveals a structural property of multimodal representations. We find that the Laplacian eigenvalue spectra of the two encoders are quantitatively similar (normalized spectral distance 0.043), indicating that independently trained models develop manifolds of comparable intrinsic complexity. However, the functional map exhibits near-zero diagonal dominance (mean below 0.05) and large orthogonality error (70.15), showing that the eigenvector bases are effectively unaligned. We term this decoupling the spectral complexity--orientation gap: models converge in how much structure they capture but not in how they organize it. This gap defines a boundary condition for spectral alignment methods and motivates three diagnostic quantities for characterizing cross-modal representation compatibility: diagonal dominance, orthogonality deviation, and Laplacian commutativity error.
【3】Silhouette Loss: Differentiable Global Structure Learning for Deep Representations
标题:剪影损失:深度表示的差异性全球结构学习
链接:https://arxiv.org/abs/2604.08573
作者:Matheus Vinícius Todescato,Joel Luís Carbonera
摘要:Learning discriminative representations is a central goal of supervised deep learning. While cross-entropy (CE) remains the dominant objective for classification, it does not explicitly enforce desirable geometric properties in the embedding space, such as intra-class compactness and inter-class separation. Existing metric learning approaches, including supervised contrastive learning (SupCon) and proxy-based methods, address this limitation by operating on pairwise or proxy-based relationships, but often increase computational cost and complexity. In this work, we introduce Soft Silhouette Loss, a novel differentiable objective inspired by the classical silhouette coefficient from clustering analysis. Unlike pairwise objectives, our formulation evaluates each sample against all classes in the batch, providing a batch-level notion of global structure. The proposed loss directly encourages samples to be closer to their own class than to competing classes, while remaining lightweight. Soft Silhouette Loss can be seamlessly combined with cross-entropy, and is also complementary to supervised contrastive learning. We propose a hybrid objective that integrates them, jointly optimizing local pairwise consistency and global cluster structure. Extensive experiments on seven diverse datasets demonstrate that: (i) augmenting CE with Soft Silhouette Loss consistently improves over CE and other metric learning baselines; (ii) the hybrid formulation outperforms SupCon alone; and (iii) the combined method achieves the best performance, improving average top-1 accuracy from 36.71% (CE) and 37.85% (SupCon2) to 39.08%, while incurring substantially lower computational overhead. These results suggest that classical clustering principles can be reinterpreted as differentiable objectives for deep learning, enabling efficient optimization of both local and global structure in representation spaces.
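To make the "each sample against all classes in the batch" idea concrete, here is a forward-only sketch built on the classical silhouette coefficient s = (b - a) / max(a, b). The softened minimum over rival classes, the temperature value, and all function names are assumptions chosen for illustration; the paper's exact Soft Silhouette formulation may differ, and a trainable version would express the same steps in an autodiff framework.

```python
import math


def dist(u, v):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def soft_min(values, temperature):
    """Smooth minimum: softmax-weighted mean that favors the smallest value."""
    lowest = min(values)  # shift exponents for numerical stability
    weights = [math.exp(-(v - lowest) / temperature) for v in values]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


def silhouette_loss(embeddings, labels, temperature=0.1):
    """Return 1 - mean silhouette score over the batch (lower is better)."""
    scores = []
    for i, x in enumerate(embeddings):
        # Mean distance from sample i to every class present in the batch.
        per_class = {}
        for j, other in enumerate(embeddings):
            if j != i:
                per_class.setdefault(labels[j], []).append(dist(x, other))
        means = {c: sum(d) / len(d) for c, d in per_class.items()}
        rivals = [m for c, m in means.items() if c != labels[i]]
        if labels[i] not in means or not rivals:
            continue  # silhouette is undefined for this sample
        a = means[labels[i]]               # cohesion: own-class distance
        b = soft_min(rivals, temperature)  # separation: nearest rival class
        scores.append((b - a) / max(a, b, 1e-12))
    return 1.0 - sum(scores) / len(scores) if scores else 0.0


# Hypothetical 2-D embeddings: two tight groups far apart.
points = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]]
clustered = silhouette_loss(points, [0, 0, 1, 1])  # labels follow the groups
shuffled = silhouette_loss(points, [0, 1, 0, 1])   # labels cut across groups
```

The loss is small when labels agree with the geometric grouping and exceeds 1 when they cut across it, which is the compactness and separation signal the abstract says cross-entropy does not enforce explicitly.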
【4】A Representation-Level Assessment of Bias Mitigation in Foundation Models
标题:基础模型中偏差缓解的表征级评估
链接:https://arxiv.org/abs/2604.08561
作者:Svetoslav Nizhnichenkov,Rahul Nair,Elizabeth Daly,Brian Mac Namee
备注:Accepted at ECML-PKDD 2025 (5th Workshop on Bias and Fairness in AI)
摘要:We investigate how successful bias mitigation reshapes the embedding space of encoder-only and decoder-only foundation models, offering an internal audit of model behaviour through representational analysis. Using BERT and Llama2 as representative architectures, we assess the shifts in associations between gender and occupation terms by comparing baseline and bias-mitigated variants of the models. Our findings show that bias mitigation reduces gender-occupation disparities in the embedding space, leading to more neutral and balanced internal representations. These representational shifts are consistent across both model types, suggesting that fairness improvements can manifest as interpretable and geometric transformations. These results position embedding analysis as a valuable tool for understanding and validating the effectiveness of debiasing methods in foundation models. To further promote the assessment of decoder-only models, we introduce WinoDec, a dataset consisting of 4,000 sequences with gender and occupation terms, and release it to the general public. (https://github.com/winodec/wino-dec)
编码器(3篇)
【1】Is More Data Worth the Cost? Dataset Scaling Laws in a Tiny Attention-Only Decoder
标题:更多数据值得付出代价吗?微型仅注意解码器中的数据集缩放定律
链接:https://arxiv.org/abs/2604.09389
作者:Götz-Henrik Wiegand,Lorena Raichle,Rico Städeli,Tomas Hrycej,Bernhard Bermeitinger,Siegfried Handschuh
备注:Presented as a paper at 3rd DATA-FM workshop @ ICLR 2026, Brazil. Published at 13th IEEE Swiss Conference on Data Science and AI (SDS 2026)
摘要:Training Transformer language models is expensive, as performance typically improves with increasing dataset size and computational budget. Although scaling laws describe this trend at large scale, their implications in controlled, smaller-scale settings remain less explored. In this work, we isolate dataset-size effects using a strongly reduced attention-only decoder architecture. By training on progressively larger power-of-two subsets, we observe smooth performance improvements accompanied by clear diminishing returns, consistent with scaling-law behavior. Using only about 30% of the training data is sufficient to reach approximately 90% of the full-data validation token-level accuracy. These results provide actionable insights into dataset scaling in a controlled, component-isolated setting and offer practical guidance for balancing dataset size and computational cost in compute- and data-restricted environments, such as small research labs and exploratory model development.
【2】Biologically-Grounded Multi-Encoder Architectures as Developability Oracles for Antibody Design
标题:以生物学为依据的多编码器架构作为抗体设计的可开发性预言机
链接:https://arxiv.org/abs/2604.09369
作者:Simon J. Crouzet
备注:ICLR 2026 Workshop on Generative and Experimental Perspectives for Biomolecular Design
摘要:Generative models can now propose thousands of de novo antibody sequences, yet translating these designs into viable therapeutics remains constrained by the cost of biophysical characterization. Here we present CrossAbSense, a framework of property-specific neural oracles that combine frozen protein language model encoders with configurable attention decoders, identified through a systematic hyperparameter campaign totaling over 200 runs per property. On the GDPa1 benchmark of 242 therapeutic IgGs, our oracles achieve notable improvements of 12-20% over established baselines on three of five developability assays and competitive performance on the remaining two. The central finding is that optimal decoder architectures invert our initial biological hypotheses: self-attention alone suffices for aggregation-related properties (hydrophobic interaction chromatography, polyreactivity), where the relevant sequence signatures -- such as CDR-H3 hydrophobic patches -- are already fully resolved within single-chain embeddings by the high-capacity 6B encoder. Bidirectional cross-attention, by contrast, is required for expression yield and thermal stability -- properties that inherently depend on the compatibility between heavy and light chains. Learned chain fusion weights independently confirm heavy-chain dominance in aggregation ($w_H = 0.62$) versus balanced contributions for stability ($w_H = 0.51$). We demonstrate practical utility by deploying CrossAbSense on 100 IgLM-generated antibody designs, illustrating a path toward substantial reduction in experimental screening costs.
【3】CERBERUS: A Three-Headed Decoder for Vertical Cloud Profiles
标题:CERBERUS:垂直云轮廓的三头解码器
链接:https://arxiv.org/abs/2604.08772
作者:Emily K. deJong,Nipun Gunawardena,Kevin Smalley,Hassan Beydoun,Peter Caldwell
备注:Accepted for oral presentation at 2026 ICLR workshop on Machine Learning for Remote Sensing
摘要:Atmospheric clouds exhibit complex three-dimensional structure and microphysical details that are poorly constrained by the predominantly two-dimensional satellite observations available at global scales. This mismatch complicates data-driven learning and evaluation of cloud processes in weather and climate models, contributing to ongoing uncertainty in atmospheric physics. We introduce CERBERUS, a probabilistic inference framework for generating vertical radar reflectivity profiles from geostationary satellite brightness temperatures, near-surface meteorological variables, and temporal context. CERBERUS employs a three-headed encoder-decoder architecture to predict a zero-inflated (ZI) vertically-resolved distribution of radar reflectivity. Trained and evaluated using ground-based Ka-band radar observations at the ARM Southern Great Plains site, CERBERUS recovers coherent structures across cloud regimes, generalizes to withheld test periods, and provides uncertainty estimates that reflect physical ambiguity, particularly in multilayer and dynamically complex clouds. These results demonstrate the value of distribution-based learning targets for bridging observational scales, introducing a path toward model-relevant synthetic observations of clouds.
优化|敛散性(8篇)
【1】Efficient Unlearning through Maximizing Relearning Convergence Delay
标题:通过最大化再学习收敛延迟实现高效的机器遗忘
链接:https://arxiv.org/abs/2604.09391
作者:Khoa Tran,Simon S. Woo
摘要:Machine unlearning poses challenges in removing mislabeled, contaminated, or problematic data from a pretrained model. Current unlearning approaches and evaluation metrics are solely focused on model predictions, which limits insight into the model's true underlying data characteristics. To address this issue, we introduce a new metric called relearning convergence delay, which captures both changes in weight space and prediction space, providing a more comprehensive assessment of the model's understanding of the forgotten dataset. This metric can be used to assess the risk of forgotten data being recovered from the unlearned model. Based on this, we propose the Influence Eliminating Unlearning framework, which removes the influence of the forgetting set by degrading its performance, applying weight decay, and injecting noise into the model's weights, while maintaining accuracy on the retaining set. Extensive experiments show that our method outperforms existing methods on both existing metrics and our proposed relearning convergence delay metric, approaching ideal unlearning performance. We provide theoretical guarantees, including exponential convergence and upper bounds, as well as empirical evidence of strong retention and resistance to relearning in both classification and generative unlearning tasks.
【2】Distributed Online Convex Optimization with Compressed Communication: Optimal Regret and Applications
标题:利用压缩通信的分布式在线凸优化:最佳遗憾及其应用
链接:https://arxiv.org/abs/2604.09276
作者:Sifan Yang,Dan-Yue Li,Lijun Zhang
摘要:Distributed online convex optimization (D-OCO) is a powerful paradigm for modeling distributed scenarios with streaming data. However, the communication cost between local learners and the central server is substantial in large-scale applications. To alleviate this bottleneck, we initiate the study of D-OCO with compressed communication. Firstly, to quantify the compression impact, we establish the $\Omega(\delta^{-1/2}\sqrt{T})$ and $\Omega(\delta^{-1}\log{T})$ lower bounds for convex and strongly convex loss functions, respectively, where $\delta \in (0,1]$ is the compression ratio. Secondly, we propose an optimal algorithm, which enjoys regret bounds of $O(\delta^{-1/2}\sqrt{T})$ and $O(\delta^{-1} \log T)$ for convex and strongly convex loss functions, respectively. Our method incorporates the error feedback mechanism into the Follow-the-Regularized-Leader framework to address the coupling between the compression error and the projection error. Furthermore, we employ the online compression strategy to mitigate the accumulated error arising from the bidirectional compression. Our online method has great generality, and can be extended to the offline stochastic setting via online-to-batch conversion. We establish convergence rates of $O(\delta^{-1/2}T^{-1/2})$ and $O(\delta^{-1} T^{-1})$ for convex and strongly convex loss functions, respectively, providing the first guarantees for distributed non-smooth optimization with compressed communication and domain constraints.
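The error feedback mechanism mentioned in this abstract is easiest to see in isolation. The sketch below shows only the generic compress-and-carry-the-residual step with a top-k sparsifier; it is not the paper's full algorithm, and the choice of top-k, the class name, and the toy gradients are assumptions. For top-k on a d-dimensional vector, the usual contraction parameter is k/d, which plays the role of a compression ratio.

```python
def top_k(vector, k):
    """Keep the k largest-magnitude entries and zero out the rest."""
    keep = set(sorted(range(len(vector)), key=lambda i: abs(vector[i]), reverse=True)[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(vector)]


class ErrorFeedbackCompressor:
    """Send a compressed vector and carry what was dropped into the next round."""

    def __init__(self, dim, k):
        self.k = k
        self.residual = [0.0] * dim  # compression error accumulated so far

    def compress(self, gradient):
        # Add back the error left over from earlier rounds before compressing.
        corrected = [g + r for g, r in zip(gradient, self.residual)]
        message = top_k(corrected, self.k)
        # Whatever was not transmitted becomes the new residual.
        self.residual = [c - m for c, m in zip(corrected, message)]
        return message


# Hypothetical local learner that may transmit one coordinate per round.
compressor = ErrorFeedbackCompressor(dim=3, k=1)
first = compressor.compress([1.0, 0.6, 0.0])   # sends coordinate 0
second = compressor.compress([0.0, 0.6, 0.5])  # dropped 0.6 resurfaces here
```

In the second round, the previously dropped 0.6 is added back and coordinate 1 is transmitted as 1.2, so information is delayed rather than lost.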
【3】StaRPO: Stability-Augmented Reinforcement Policy Optimization
标题:StaRPO:增强稳定性的强化策略优化
链接:https://arxiv.org/abs/2604.08905
作者:Jinghan Zhang,Fengran Mo,Tharindu Cyril Weerasooriya,Ruimin Dai,Xiaoyan Han,Yanjie Fu,Dakuo Wang,Kunpeng Liu
摘要:Reinforcement learning (RL) is effective in enhancing the accuracy of large language models in complex reasoning tasks. Existing RL policy optimization frameworks rely on final-answer correctness as feedback signals and rarely capture the internal logical structure of the reasoning process. Consequently, the models may generate responses that are fluent and semantically relevant but logically inconsistent, structurally erratic, or redundant. To this end, we propose StaRPO, a stability-augmented reinforcement learning framework that explicitly incorporates reasoning stability into the optimization objective. StaRPO decomposes stability into two computable lightweight metrics: the Autocorrelation Function (ACF) to evaluate local step-to-step coherence, and Path Efficiency (PE) to evaluate global goal-directedness of the reasoning trajectory. These stability rewards are combined with task rewards to provide complementary and process-aware feedback. We validate the effectiveness of using ACF and PE rewards by showing their correlation with logic errors on two backbone models. Experiments on four reasoning benchmarks show that StaRPO consistently outperforms the compared baselines and can enhance both final-answer accuracy and logical stability.
【4】$p1$: Better Prompt Optimization with Fewer Prompts
Link: https://arxiv.org/abs/2604.08801
Authors: Zhaolin Gao, Yu Wang, Bo Liu, Thorsten Joachims, Kianté Brantley, Wen Sun
Abstract: Prompt optimization improves language models without updating their weights by searching for a better system prompt, but its effectiveness varies widely across tasks. We study what makes a task amenable to prompt optimization. We show that the reward variance across different system prompts can be decomposed into two components: variance among responses, which captures generation stochasticity, and variance among system prompts, which captures differences in system prompt quality. Prompt optimization succeeds when variance among system prompts is sufficiently large, but fails when variance among responses dominates the variance among system prompts. Surprisingly, we further show that scaling to more user prompts can hurt optimization by reducing variance among system prompts, especially on heterogeneous datasets where different user prompts favor different system prompts. Motivated by this insight, we propose $p1$, a simple user prompt filtering method that selects a small subset of user prompts with high variance across candidate system prompts. This subset of user prompts allows one to distinguish a good system prompt from a bad one, making system prompt optimization easier. Experiments on reasoning benchmarks show that $p1$ substantially improves prompt optimization over training on the full dataset and outperforms strong baselines such as GEPA. Notably, training on only two prompts from AIME 24 yields a system prompt that generalizes well to other reasoning benchmarks.
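The variance decomposition described in this abstract follows the law of total variance, which can be checked numerically. The sketch below assumes equal numbers of sampled responses per system prompt and uses population variances; `decompose` and `select_prompts` are hypothetical names, and the filtering rule is only one plausible reading of "high variance across candidate system prompts", not the paper's exact criterion.

```python
from statistics import fmean, pvariance


def decompose(rewards_by_system_prompt):
    """Split reward variance into within-prompt and between-prompt parts."""
    means = [fmean(r) for r in rewards_by_system_prompt]
    # Variance among responses: average spread for a fixed system prompt.
    within = fmean([pvariance(r) for r in rewards_by_system_prompt])
    # Variance among system prompts: spread of the per-prompt mean rewards.
    between = pvariance(means)
    return within, between


def select_prompts(scores_by_user_prompt, k):
    """Keep the k user prompts whose scores vary most across system prompts."""
    ranked = sorted(
        scores_by_user_prompt,
        key=lambda u: pvariance(scores_by_user_prompt[u]),
        reverse=True,
    )
    return ranked[:k]


# Two system prompts, four sampled responses each (toy rewards).
rewards = [[0.0, 1.0, 1.0, 1.0], [0.0, 0.0, 0.0, 1.0]]
within, between = decompose(rewards)
total = pvariance([r for group in rewards for r in group])

# Mean reward of three candidate system prompts on each user prompt (toy values).
scores = {"q1": [0.5, 0.5, 0.5], "q2": [0.0, 1.0, 0.5], "q3": [0.4, 0.6, 0.5]}
kept = select_prompts(scores, k=1)
```

In this toy case the within and between components sum to the total variance, and the filter keeps the user prompt on which the candidate system prompts disagree most.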
【5】Skip-Connected Policy Optimization for Implicit Advantage
Link: https://arxiv.org/abs/2604.08690
Authors: Fengwei Teng, Jinyi Bai, Xinhao Yao, Demi Ruohan Wang, Jiahao Zhao, Zhijiang Guo
Abstract: Group Relative Policy Optimization (GRPO) has proven effective in RLVR by using outcome-based rewards. While fine-grained dense rewards can theoretically improve performance, we reveal that under practical sampling budgets, Monte Carlo estimation yields high-variance and sign-inconsistent advantages for early reasoning tokens, paradoxically underperforming outcome-only GRPO. We propose Skip-Connected Policy Optimization (SKPO), which decomposes reasoning into upstream and downstream phases: upstream receives dense rewards from downstream Monte Carlo sampling with single-stream optimization; downstream maintains group-relative optimization, where a skip connection concatenates the upstream segment with the original problem, enabling the model to leverage helpful upstream reasoning while preserving the freedom to bypass flawed reasoning through direct problem access. Experiments demonstrate relative gains of 3.91% and 6.17% over the strongest baselines on Qwen2.5-Math-7B and Llama-3.2-3B, respectively, across mathematical benchmarks and out-of-domain tasks including general reasoning and code generation. Further analysis reveals an implicit advantage: SKPO generates trajectories with higher intermediate-step quality even when matched for final correctness.
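For context, the outcome-only, group-relative advantage that GRPO is generally described as using can be sketched as follows. This illustrates only the baseline the abstract compares against, not SKPO itself; the function name, the use of the population standard deviation, and the `eps` stabilizer are assumptions.

```python
from statistics import fmean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize outcome rewards within one group of sampled responses."""
    mu = fmean(rewards)
    sigma = pstdev(rewards)  # population std; implementations may differ
    # eps avoids division by zero when every response gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers to the same problem: two correct, two incorrect.
adv = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
flat = group_relative_advantages([1.0, 1.0, 1.0])
```

Correct answers receive positive advantages and incorrect ones negative; when all rewards in a group are equal, every advantage is zero and the group contributes no learning signal.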
【6】Distributionally Robust Token Optimization in RLHF
Link: https://arxiv.org/abs/2604.08577
Authors: Yeping Jin, Jiaming Hu, Ioannis Ch. Paschalidis
Abstract: Large Language Models (LLMs) tend to respond correctly to prompts that align with the data they were trained and fine-tuned on. Yet, small shifts in wording, format, or language can trigger surprisingly large failures, especially on multi-step reasoning problems. To address this problem, we propose a Distributionally Robust Token Optimization (DRTO) approach, which combines token-level Reinforcement Learning from Human Feedback (RLHF) with Distributionally Robust Optimization (DRO). DRTO bounds worst-case token-wise rewards by constructing an f-divergence ambiguity set over a loss minibatch, which provides theoretical robustness. Empirically, DRTO enhances consistency under distribution shifts in mathematical reasoning benchmarks, achieving a 9.17% improvement on GSM8K and a 2.49% improvement on MathQA.
【7】Memory-Guided Trust-Region Bayesian Optimization (MG-TuRBO) for High Dimensions
Link: https://arxiv.org/abs/2604.08569
Authors: Abhilasha Saroj, Shaked Regev, Guanhao Xu, Jinghui Yuan, Roy Luo, Ross Wang
Abstract: Traffic simulation and digital-twin calibration is a challenging optimization problem with a limited simulation budget. Each trial requires an expensive simulation run, and the relationship between calibration inputs and model error is often nonconvex and noisy. The problem becomes more difficult as the number of calibration parameters increases. We compare a commonly used automatic calibration method, a genetic algorithm (GA), with Bayesian optimization methods (BOMs): classical Bayesian optimization (BO), Trust-Region BO (TuRBO), Multi-TuRBO, and a proposed Memory-Guided TuRBO (MG-TuRBO) method. We compare performance on two real-world traffic simulation calibration problems with 14 and 84 decision variables, representing lower- and higher-dimensional (14D and 84D) settings. For BOMs, we study two acquisition strategies: Thompson sampling and a novel adaptive strategy. We evaluate performance using final calibration quality, convergence behavior, and consistency across runs. The results show that BOMs reach good calibration targets much faster than GA in the lower-dimensional problem. While MG-TuRBO performs comparably in our 14D setting, it demonstrates noticeable advantages in the 84D problem, particularly when paired with our adaptive strategy. Our results suggest that MG-TuRBO is especially useful for high-dimensional traffic simulation calibration and potentially for high-dimensional problems in general.
【8】Adjoint Matching through the Lens of the Stochastic Maximum Principle in Optimal Control
Link: https://arxiv.org/abs/2604.08580
Authors: Carles Domingo-Enrich, Jiequn Han
Abstract: Reward fine-tuning of diffusion and flow models and sampling from tilted or Boltzmann distributions can both be formulated as stochastic optimal control (SOC) problems, where learning an optimal generative dynamics corresponds to optimizing a control under SDE constraints. In this work, we revisit and generalize Adjoint Matching, a recently proposed SOC-based method for learning optimal controls, and place it on a rigorous footing by deriving it from the Stochastic Maximum Principle (SMP). We formulate a general Hamiltonian adjoint matching objective for SOC problems with control-dependent drift and diffusion and convex running costs, and show that its expected value has the same first variation as the original SOC objective. As a consequence, critical points satisfy the Hamilton-Jacobi-Bellman (HJB) stationarity conditions. In the important practical case of state- and control-independent diffusion, we recover the lean adjoint matching loss previously introduced in adjoint matching, which avoids second-order terms and whose critical points coincide with the optimal control under mild uniqueness assumptions. Finally, we show that adjoint matching can be precisely interpreted as a continuous-time method of successive approximations induced by the SMP, yielding a practical and implementable alternative to classical SMP-based algorithms, which are obstructed by intractable martingale terms in the stochastic setting. These results are also of independent interest to the stochastic control community, providing new implementable objectives and a viable pathway for SMP-based iterations in stochastic problems.
Prediction | Estimation (8 papers)
【1】Drift-Aware Online Dynamic Learning for Nonstationary Multivariate Time Series: Application to Sintering Quality Prediction
Link: https://arxiv.org/abs/2604.09358
Authors: Yumeng Zhao, Shengxiang Yang, Xianpeng Wang
Abstract: Accurate prediction of nonstationary multivariate time series remains a critical challenge in complex industrial systems such as iron ore sintering. In practice, pronounced concept drift compounded by significant label verification latency rapidly degrades the performance of offline-trained models. Existing methods based on static architectures or passive update strategies struggle to simultaneously extract multi-scale spatiotemporal features and overcome the stability-plasticity dilemma without immediate supervision. To address these limitations, a Drift-Aware Multi-Scale Dynamic Learning (DA-MSDL) framework is proposed to maintain robust multi-output predictive performance via online adaptive mechanisms on nonstationary data streams. The framework employs a multi-scale bi-branch convolutional network as its backbone to disentangle local fluctuations from long-term trends, thereby enhancing representational capacity for complex dynamic patterns. To circumvent the label latency bottleneck, DA-MSDL leverages Maximum Mean Discrepancy (MMD) for unsupervised drift detection. By quantifying online statistical deviations in feature distributions, DA-MSDL proactively triggers model adaptation prior to inference. Furthermore, a drift-severity-guided hierarchical fine-tuning strategy is developed. Supported by prioritized experience replay from a dynamic memory queue, this approach achieves rapid distribution alignment while effectively mitigating catastrophic forgetting. Long-horizon experiments on real-world industrial sintering data and a public benchmark dataset demonstrate that DA-MSDL consistently outperforms representative baselines under severe concept drift. Exhibiting strong cross-domain generalization and predictive stability, the proposed framework provides an effective online dynamic learning paradigm for quality monitoring in nonstationary environments.
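The unsupervised drift check described in this abstract relies on Maximum Mean Discrepancy. A minimal sketch of the standard biased squared-MMD estimate with a Gaussian (RBF) kernel is shown below; it assumes one-dimensional features, a fixed bandwidth, and hypothetical names and sample values, and does not reproduce how DA-MSDL extracts features or sets its detection threshold.

```python
import math


def rbf(a, b, bandwidth):
    """Gaussian (RBF) kernel between two scalars."""
    return math.exp(-((a - b) ** 2) / (2.0 * bandwidth ** 2))


def mmd2(xs, ys, bandwidth=1.0):
    """Biased estimate of squared MMD between two 1-D samples."""

    def mean_kernel(us, vs):
        total = sum(rbf(u, v, bandwidth) for u in us for v in vs)
        return total / (len(us) * len(vs))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


reference = [0.0, 0.1, -0.1, 0.2, -0.2]   # features seen during training (toy)
shifted = [x + 3.0 for x in reference]    # same shape, shifted mean (toy drift)

no_drift_score = mmd2(reference, reference)
drift_score = mmd2(reference, shifted)
```

Because the score uses only feature values, it can be computed before any labels arrive; a detector would compare it against a threshold chosen for the application.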
【2】Hierarchical Flow Decomposition for Turning Movement Prediction at Signalized Intersections
Link: https://arxiv.org/abs/2604.09336
Authors: Md Atiqur Rahman Mallick, Kamrul Hasan, Pulock Das, Liang Hong, S M Shazzad Rassel
Comments: Accepted to IEEE SoutheastCon 2026. 6 pages, 5 figures
Abstract: Accurate prediction of intersection turning movements is essential for adaptive signal control but remains difficult due to the high volatility of directional flows. This study proposes HFD-TM (Hierarchical Flow-Decomposition for Turning Movement Prediction), a hierarchical deep learning framework that predicts turning movements by first forecasting corridor through-movements and then expanding these predictions to individual turning streams. This design is motivated by empirical traffic structure, where corridor flows account for 65.1% of total volume, exhibit lower volatility than turning movements, and explain 35.5% of turning-movement variance. A physics-informed loss function enforces flow conservation to maintain structural consistency. Evaluated on six months of 15-minute interval LiDAR (Light Detection and Ranging) data from a six-intersection corridor in Nashville, Tennessee, HFD-TM achieves a mean absolute error (MAE) of 2.49 vehicles per interval, reducing MAE by 5.7% compared to a Transformer and by 27.0% compared to a GRU (Gated Recurrent Unit). Ablation results show that hierarchical decomposition provides the largest performance gain, while training time is 12.8 times lower than that of DCRNN (Diffusion Convolutional Recurrent Neural Network), demonstrating suitability for real-time traffic applications.
【3】Online Intention Prediction via Control-Informed Learning
Link: https://arxiv.org/abs/2604.09303
Authors: Tianyu Zhou, Zihao Liang, Zehui Lu, Shaoshuai Mou
Abstract: This paper presents an online intention prediction framework for estimating the goal state of autonomous systems in real time, even when the intention is time-varying and the system dynamics or objectives include unknown parameters. The problem is formulated as an inverse optimal control / inverse reinforcement learning task, with the intention treated as a parameter in the objective. A shifting horizon strategy discounts outdated information, while online control-informed learning enables efficient gradient computation and online parameter updates. Simulations under varying noise levels and hardware experiments on a quadrotor drone demonstrate that the proposed approach achieves accurate, adaptive intention prediction in complex environments.
【4】Temporal Patch Shuffle (TPS): Leveraging Patch-Level Shuffling to Boost Generalization and Robustness in Time Series Forecasting
Link: https://arxiv.org/abs/2604.09067
Authors: Jafar Bakhshaliyev, Johannes Burchert, Niels Landwehr, Lars Schmidt-Thieme
Comments: 25 pages, 7 figures, 17 tables
Abstract: Data augmentation is a crucial technique for improving model generalization and robustness, particularly in deep learning models where training data is limited. Although many augmentation methods have been developed for time series classification, most are not directly applicable to time series forecasting due to the need to preserve temporal coherence. In this work, we propose Temporal Patch Shuffle (TPS), a simple and model-agnostic data augmentation method for forecasting that extracts overlapping temporal patches, selectively shuffles a subset of patches using variance-based ordering as a conservative heuristic, and reconstructs the sequence by averaging overlapping regions. This design increases sample diversity while preserving forecast-consistent local temporal structure. We extensively evaluate TPS across nine long-term forecasting datasets using five recent model families (TSMixer, DLinear, PatchTST, TiDE, and LightTS), and across four short-term forecasting datasets using PatchTST, observing consistent performance improvements. Comprehensive ablation studies further demonstrate the effectiveness, robustness, and design rationale of the proposed method.
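The three steps named in this abstract (extract overlapping patches, shuffle a subset, average the overlaps) can be sketched as below. The abstract does not spell out the variance-based ordering, so this sketch assumes the lowest-variance patches are the ones shuffled; the function name, parameters, and toy series are hypothetical, and this is not the authors' implementation.

```python
import random
from statistics import pvariance


def temporal_patch_shuffle(series, patch_len, stride, shuffle_frac, seed=0):
    """Shuffle a subset of overlapping patches, then average the overlaps."""
    rng = random.Random(seed)
    starts = list(range(0, len(series) - patch_len + 1, stride))
    patches = [series[s:s + patch_len] for s in starts]

    # Assumption: treat the lowest-variance patches as the safest to move.
    order = sorted(range(len(patches)), key=lambda i: pvariance(patches[i]))
    chosen = order[:int(shuffle_frac * len(patches))]
    permuted = chosen[:]
    rng.shuffle(permuted)

    new_patches = patches[:]
    for dst, src in zip(chosen, permuted):
        new_patches[dst] = patches[src]

    # Reconstruct: each time step is the mean of all patches covering it.
    total = [0.0] * len(series)
    count = [0] * len(series)
    for s, patch in zip(starts, new_patches):
        for j, value in enumerate(patch):
            total[s + j] += value
            count[s + j] += 1
    # Time steps not covered by any patch keep their original value.
    return [t / c if c else x for t, c, x in zip(total, count, series)]


series = [float(i % 5) for i in range(16)]
unchanged = temporal_patch_shuffle(series, patch_len=4, stride=2, shuffle_frac=0.0)
augmented = temporal_patch_shuffle(series, patch_len=4, stride=2, shuffle_frac=0.5)
```

With `shuffle_frac=0.0` the reconstruction returns the original series, which is a quick sanity check that the overlap averaging is lossless when nothing is moved.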
【5】Using Synthetic Data for Machine Learning-based Childhood Vaccination Prediction in Narok, Kenya
Link: https://arxiv.org/abs/2604.08902
Authors: Jimmy Bach, Yang Li, Yaqi Liu, John Sankok, Rose Kimani, Carrie B. Dolan, Julius N. Odhiambo, Haipeng Chen
Abstract: Background: Limited data utilization in low-resource settings poses a barrier to the vaccine delivery ecosystem, undermining efforts to achieve equitable immunization coverage. In nomadic populations, individuals face an increased risk of missing crucial vaccination doses as children. One such population is the Maasai in Narok County, Kenya, where the absence of high-volume, quality data hampers accurate coverage estimates, impedes efficient resource allocation, and weakens the ability to deliver timely interventions. Additionally, data privacy concerns are heightened in groups with limited sensitive data. Objectives: First, we aim to identify children at risk of missing key vaccines across a large population to provide timely, evidence-based interventions that support increased vaccination coverage. Second, we aim to better protect the privacy of sensitive health data in a vulnerable population. Methods: We digitized 8 years of child vaccination records from the MOH 510 registry (n=6,913) and applied machine learning models (Logistic Regression and XGBoost) to identify children at risk. Additionally, we utilize a novel approach to tabular diffusion-based synthetic data generation (TabSyn) to protect patient privacy within the models. Results: Our findings show that classification techniques can reliably predict children at risk of missing a vaccine, with recall, precision, and F1-scores exceeding 90% for some vaccines modeled. Additionally, training these models with synthetic data rather than real data, thus preserving the privacy of individuals within the original dataset, does not lead to a loss in predictive performance. Conclusion: These results support the use of synthetic data in health informatics strategies for clinics with limited digital infrastructure, enabling privacy-preserving, scalable forecasting for childhood immunization coverage.
【6】Smartwatch-Based Sitting Time Estimation in Real-World Office Settings
Link: https://arxiv.org/abs/2604.08808
Authors: Olivia Zhang, Zhilin Zhang
Comments: Accepted at the 18th International Conference on Machine Learning and Computing (ICMLC 2026), February 6-9, 2026
Abstract: Sedentary behavior poses a major public health risk, being strongly linked to obesity, cardiovascular disease, and other chronic conditions. Accurately estimating sitting time is therefore critical for monitoring and improving individual health. This work addresses the problem in real-world office settings, where signals from the inertial measurement unit (IMU) on a smartwatch were collected from office workers during their daily routines. We propose a method that estimates sitting time from the IMU signals by introducing the use of rotation vector sequences, derived from Euler angles, as a novel representation of movement dynamics. Experiments on a 34-hour dataset demonstrate that exploiting rotation vector sequences improves algorithm performance, highlighting their potential for robust sitting time estimation in natural environments.
【7】Continuous Orthogonal Mode Decomposition: Haptic Signal Prediction in Tactile Internet
Link: https://arxiv.org/abs/2604.09446
Authors: Mohammad Ali Vahedifar, Mojtaba Nazari, Qi Zhang
Abstract: The Tactile Internet demands sub-millisecond latency and ultra-high reliability, as high latency or packet loss could lead to haptic control instability. To address this, we propose the Mode-Domain Architecture (MDA), a bilateral predictive neural network architecture designed to restore missing signals on both the human and robot sides. Unlike conventional models that extract features implicitly from raw data, MDA utilizes a novel Continuous-Orthogonal Mode Decomposition framework. By integrating an orthogonality constraint, we overcome the pervasive issue of "mode overlapping" found in state-of-the-art decomposition methods. Experimental results demonstrate that this structured feature extraction achieves high prediction accuracies of 98.6% (human) and 97.3% (robot). Furthermore, the model achieves an ultra-low inference latency of 0.065 ms, significantly outperforming existing benchmarks and meeting the stringent real-time requirements of haptic teleoperation.
【8】A Predictive View on Streaming Hidden Markov Models
Link: https://arxiv.org/abs/2604.09208
Authors: Gerardo Duran-Martin
Abstract: We develop a predictive-first optimisation framework for streaming hidden Markov models. Unlike classical approaches that prioritise full posterior recovery under a fully specified generative model, we assume access to regime-specific predictive models whose parameters are learned online while maintaining a fixed transition prior over regimes. Our objective is to sequentially identify latent regimes while maintaining accurate step-ahead predictive distributions. Because the number of possible regime paths grows exponentially, exact filtering is infeasible. We therefore formulate streaming inference as a constrained projection problem in predictive-distribution space: under a fixed hypothesis budget, we approximate the full posterior predictive by the forward-KL optimal mixture supported on $S$ paths. The solution is the renormalised top-$S$ posterior-weighted mixture, providing a principled derivation of beam search for HMMs. The resulting algorithm is fully recursive and deterministic, performing beam-style truncation with closed-form predictive updates and requiring neither EM nor sampling. Empirical comparisons against Online EM and Sequential Monte Carlo under matched computational budgets demonstrate competitive prequential performance.
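The truncation step stated in this abstract (keep the top-$S$ paths by posterior weight and renormalise) is simple enough to sketch directly. The path labels, weights, and per-path predictive means below are invented for illustration, and the function names are hypothetical; the recursive weight and parameter updates of the paper are not shown.

```python
def top_s_mixture(path_weights, s):
    """Keep the s highest-weight regime paths and renormalise their weights."""
    keep = sorted(path_weights, key=path_weights.get, reverse=True)[:s]
    z = sum(path_weights[p] for p in keep)
    return {p: path_weights[p] / z for p in keep}


def predictive_mean(mixture, path_means):
    """Mixture prediction: weight each path's own prediction by its weight."""
    return sum(w * path_means[p] for p, w in mixture.items())


# Toy posterior over four regime paths (weights sum to one).
posterior = {"AAB": 0.5, "ABB": 0.3, "BBB": 0.15, "BAB": 0.05}
# Toy one-step-ahead predictive means for each path.
means = {"AAB": 1.0, "ABB": 2.0, "BBB": 4.0, "BAB": 0.0}

beam = top_s_mixture(posterior, s=2)
approx = predictive_mean(beam, means)
```

Here the beam keeps two of the four paths, and the approximate predictive mean is computed from the renormalised weights only.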
Other: Neural Networks | Deep Learning | Models | Modeling (27 papers)
【1】Toward World Models for Epidemiology
Link: https://arxiv.org/abs/2604.09519
Authors: Zeeshan Memon, Yiqi Su, Christo Kurisummoottil Thomas, Walid Saad, Liang Zhao, Naren Ramakrishnan
Abstract: World models have emerged as a unifying paradigm for learning latent dynamics, simulating counterfactual futures, and supporting planning under uncertainty. In this paper, we argue that computational epidemiology is a natural and underdeveloped setting for world models. This is because epidemic decision-making requires reasoning about latent disease burden, imperfect and policy-dependent surveillance signals, and intervention effects mediated by adaptive human behavior. We introduce a conceptual framework for epidemiological world models, formulating epidemics as controlled, partially observed dynamical systems in which (i) the true epidemic state is latent, (ii) observations are noisy and endogenous to policy, and (iii) interventions act as sequential actions whose effects propagate through behavioral and social feedback. We present three case studies that illustrate why explicit world modeling is necessary for policy-relevant reasoning: strategic misreporting in behavioral surveillance, systematic delays in time-lagged signals such as hospitalizations and deaths, and counterfactual intervention analysis where identical histories diverge under alternative action sequences.
【2】Sim-to-Real Transfer for Muscle-Actuated Robots via Generalized Actuator Networks
Link: https://arxiv.org/abs/2604.09487
Authors: Jan Schneider, Mridul Mahajan, Le Chen, Simon Guist, Bernhard Schölkopf, Ingmar Posner, Dieter Büchler
Abstract: Tendon drives paired with soft muscle actuation enable faster and safer robots while potentially accelerating skill acquisition. Still, these systems are rarely used in practice due to inherent nonlinearities, friction, and hysteresis, which complicate modeling and control. So far, these challenges have hindered policy transfer from simulation to real systems. To bridge this gap, we propose a sim-to-real pipeline that learns a neural network model of this complex actuation and leverages established rigid body simulation for the arm dynamics and interactions with the environment. Our method, called Generalized Actuator Network (GeAN), enables actuation model identification across a wide range of robots by learning directly from joint position trajectories rather than requiring torque sensors. Using GeAN on PAMY2, a tendon-driven robot powered by pneumatic artificial muscles, we successfully deploy precise goal-reaching and dynamic ball-in-a-cup policies trained entirely in simulation. To the best of our knowledge, this result constitutes the first successful sim-to-real transfer for a four-degrees-of-freedom muscle-actuated robot arm.
【3】Rays as Pixels: Learning A Joint Distribution of Videos and Camera Trajectories
Link: https://arxiv.org/abs/2604.09429
Authors: Wonbong Jang, Shikun Liu, Soubhik Sanyal, Juan Camilo Perez, Kam Woh Ng, Sanskar Agrawal, Juan-Manuel Perez-Rua, Yiannis Douratsos, Tao Xiang
Comments: 9 pages, 6 figures, 4 tables. Project page: https://wbjang.github.io/raysaspixels/
Abstract: Recovering camera parameters from images and rendering scenes from novel viewpoints have long been treated as separate tasks in computer vision and graphics. This separation breaks down when image coverage is sparse or poses are ambiguous, since each task needs what the other produces. We propose Rays as Pixels, a Video Diffusion Model (VDM) that learns a joint distribution over videos and camera trajectories. We represent each camera as dense ray pixels (raxels) and denoise them jointly with video frames through a Decoupled Self-Cross Attention mechanism. A single trained model handles three tasks: predicting camera trajectories from video, jointly generating video and camera trajectory from input images, and generating video from input images along a target camera trajectory. Because the model can both predict trajectories from a video and generate views conditioned on its own predictions, we evaluate it through a closed-loop self-consistency test, demonstrating that its forward and inverse predictions agree. Notably, trajectory prediction requires far fewer denoising steps than video generation: even a few denoising steps suffice for self-consistency. We report results on pose estimation and camera-controlled video generation.
【4】OASIS: Online Activation Subspace Learning for Memory-Efficient Training
Link: https://arxiv.org/abs/2604.09406
Authors: Sakshi Choudhary, Utkarsh Saxena, Kaushik Roy
Abstract: Training large language models (LLMs) is constrained by memory requirements, with activations accounting for a substantial fraction of the total footprint. Existing approaches reduce memory using low-rank weight parameterizations or low-rank gradient subspaces for optimizer states, while activation memory is addressed through architectural modifications or compression schemes based on periodically updated projections. We propose OASIS, an online activation subspace learning algorithm for memory-efficient training that tracks and continuously updates a low-dimensional activation subspace during training. Intermediate activations are projected onto this evolving subspace, reducing memory without modifying forward-pass computations. The evolving activation subspace induces low-rank gradient representations, enabling both gradients and optimizer states to be maintained directly in this subspace, while a projection-aware optimizer consistently transports optimizer states across subspace updates for stable training. Across various fine-tuning and pretraining tasks, OASIS achieves up to $2\times$ lower peak memory than full fine-tuning while matching its performance and outperforming prior low-rank methods.
【5】Stochastic-Dimension Frozen Sampled Neural Network for High-Dimensional Gross-Pitaevskii Equations on Unbounded Domains
Link: https://arxiv.org/abs/2604.09361
Authors: Zhangyong Liang
Abstract: In this paper, we propose a stochastic-dimension frozen sampled neural network (SD-FSNN) for solving a class of high-dimensional Gross-Pitaevskii equations (GPEs) on unbounded domains. SD-FSNN is unbiased across all dimensions, and its computational cost is independent of the dimension, avoiding the exponential growth in computational and memory costs associated with Hermite-basis discretizations. Additionally, we randomly sample the hidden weights and biases of the neural network, significantly outperforming iterative, gradient-based optimization methods in terms of training time and accuracy. Furthermore, we employ a space-time separation strategy, using adaptive ordinary differential equation (ODE) solvers to update the evolution coefficients and incorporate temporal causality. To preserve the structure of the GPEs, we integrate a Gaussian-weighted ansatz into the neural network to enforce exponential decay at infinity, embed a normalization projection layer for mass normalization, and add an energy conservation constraint to mitigate long-time numerical dissipation. Comparative experiments with existing methods demonstrate the superior performance of SD-FSNN across a range of spatial dimensions and interaction parameters. Compared to existing random-feature methods, SD-FSNN reduces the complexity from linear in the dimension to dimension-independent. Additionally, SD-FSNN achieves better accuracy and faster training than general high-dimensional solvers, while focusing specifically on high-dimensional GPEs on unbounded domains.
【6】Statistical Properties of the King Wen Sequence: An Anti-Habituation Structure That Does Not Improve Neural Network Training
Link: https://arxiv.org/abs/2604.09234
Authors: Augustin Chan
Comments: 9 pages, 8 tables, negative results paper. Code and data: https://doi.org/10.5281/zenodo.14679537
Abstract: The King Wen sequence of the I-Ching (c. 1000 BC) orders 64 hexagrams (states of a six-dimensional binary space) in a pattern that has puzzled scholars for three millennia. We present a rigorous statistical characterization of this ordering using Monte Carlo permutation analysis against 100,000 random baselines. We find that the sequence has four statistically significant properties: higher-than-random transition distance (98.2nd percentile), negative lag-1 autocorrelation (p=0.037), yang-balanced groups of four (p=0.002), and asymmetric within-pair vs. between-pair distances (99.2nd percentile). These properties superficially resemble principles from curriculum learning and curiosity-driven exploration, motivating the hypothesis that they might benefit neural network training. We test this hypothesis through three experiments: learning rate schedule modulation, curriculum ordering, and seed sensitivity analysis, conducted across two hardware platforms (NVIDIA RTX 2060 with PyTorch and Apple Silicon with MLX). The results are uniformly negative. King Wen LR modulation degrades performance at all tested amplitudes. As a curriculum ordering, King Wen is the worst non-sequential ordering on one platform and within noise on the other. A 30-seed sweep confirms that only King Wen's degradation exceeds natural seed variance. We explain why: the sequence's high variance, the very property that makes it statistically distinctive, destabilizes gradient-based optimization. Anti-habituation in a fixed combinatorial sequence is not the same as effective training dynamics.
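The Monte Carlo permutation analysis mentioned in this abstract can be sketched for one of the four statistics, the mean transition distance between consecutive 6-bit states. The sketch deliberately uses a Gray-code ordering as a stand-in, not the King Wen sequence, and the percentile definition (share of random orderings strictly below the observed value) and function names are assumptions rather than the paper's exact procedure.

```python
import random


def mean_transition_distance(order):
    """Mean Hamming distance between consecutive 6-bit states."""
    steps = [bin(a ^ b).count("1") for a, b in zip(order, order[1:])]
    return sum(steps) / len(steps)


def permutation_percentile(order, n_baselines=2000, seed=0):
    """Percent of random orderings whose statistic is below the observed one."""
    rng = random.Random(seed)
    observed = mean_transition_distance(order)
    shuffled = list(order)
    below = 0
    for _ in range(n_baselines):
        rng.shuffle(shuffled)
        if mean_transition_distance(shuffled) < observed:
            below += 1
    return 100.0 * below / n_baselines


# Stand-in ordering: a Gray code, where consecutive states differ in one bit.
gray_order = [i ^ (i >> 1) for i in range(64)]
observed = mean_transition_distance(gray_order)
percentile = permutation_percentile(gray_order)
```

A Gray code sits at the opposite extreme from the reported King Wen result: its transition distance is the minimum possible, so no random ordering falls below it.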
【7】MATCHA: Efficient Deployment of Deep Neural Networks on Multi-Accelerator Heterogeneous Edge SoCs
Link: https://arxiv.org/abs/2604.09124
Authors: Enrico Russo, Mohamed Amine Hamdi, Alessandro Ottaviano, Francesco Conti, Angelo Garofalo, Daniele Jahier Pagliari, Maurizio Palesi, Luca Benini, Alessio Burrello
Comments: Accepted at the 63rd ACM/IEEE Design Automation Conference (DAC26)
Abstract: Deploying DNNs on Systems-on-Chip (SoCs) with multiple heterogeneous acceleration engines is challenging, and the majority of deployment frameworks cannot fully exploit heterogeneity. We present MATCHA, a unified DNN deployment framework that generates highly concurrent schedules for parallel, heterogeneous accelerators and uses constraint programming to optimize L3/L2 memory allocation and scheduling. Using pattern matching, tiling, and mapping across individual HW units enables parallel execution and high accelerator utilization. On the MLPerf Tiny benchmark, using an SoC with two heterogeneous accelerators, MATCHA improves accelerator utilization and reduces inference latency by up to 35% with respect to the state-of-the-art MATCH compiler.
【8】Synthesizing real-world distributions from high-dimensional Gaussian Noise with Fully Connected Neural Network
Link: https://arxiv.org/abs/2604.09091
Authors: Joanna Komorniczak
Abstract: The use of synthetic data in machine learning applications and research offers many benefits, including performance improvements through data augmentation, privacy preservation of original samples, and reliable method assessment with fully synthetic data. This work proposes a time-efficient synthetic data generation method based on a fully connected neural network and a randomized loss function that transforms a random Gaussian distribution to approximate a target real-world dataset. The experiments conducted on 25 diverse tabular real-world datasets confirm that the proposed solution surpasses the state-of-the-art generative methods and achieves reference MMD scores orders of magnitude faster than modern deep learning solutions. The experiments involved analyzing distributional similarity, assessing the impact on classification quality, and using PCA for dimensionality reduction, which further enhances data privacy and can boost classification quality while reducing time and memory complexity.
【9】Feature-Label Modal Alignment for Robust Partial Multi-Label Learning
Link: https://arxiv.org/abs/2604.09064
Authors: Yu Chen, Weijun Lv, Yue Huang, Xiaozhao Fang, Jie Wen, Yong Xu, Guanbin Li
Abstract: In partial multi-label learning (PML), each instance is associated with a set of candidate labels containing both ground-truth and noisy labels. The presence of noisy labels disrupts the correspondence between features and labels, degrading classification performance. To address this challenge, we propose a novel PML method based on feature-label modal alignment (PML-MA), which treats features and labels as two complementary modalities and restores their consistency through systematic alignment. Specifically, PML-MA first employs low-rank orthogonal decomposition to generate pseudo-labels that approximate the true label distribution by filtering noisy labels. It then aligns features and pseudo-labels through both global projection into a common subspace and local preservation of neighborhood structures. Finally, a multi-peak class prototype learning mechanism leverages the multi-label nature of the data, where instances simultaneously belong to multiple categories, using pseudo-labels as soft membership weights to enhance discriminability. By integrating modal alignment with prototype-guided refinement, PML-MA ensures pseudo-labels better reflect the true distribution while maintaining robustness against label noise. Extensive experiments on both real-world and synthetic datasets demonstrate that PML-MA significantly outperforms state-of-the-art methods, achieving superior classification accuracy and noise robustness.
【10】Hypergraph Neural Networks Accelerate MUS Enumeration
标题:超图神经网络加速MUS枚举
链接:https://arxiv.org/abs/2604.09001
作者:Hiroya Ijima,Koichiro Yawata
摘要:Enumerating Minimal Unsatisfiable Subsets (MUSes) is a fundamental task in constraint satisfaction problems (CSPs). Its major challenge is the exponential growth of the search space, which becomes particularly severe when satisfiability checks are expensive. Recent machine learning approaches reduce this cost for Boolean satisfiability problems but rely on explicit variable-constraint relationships, limiting their application domains. This paper proposes a domain-agnostic method to accelerate MUS enumeration using Hypergraph Neural Networks (HGNNs). The proposed method incrementally builds a hypergraph with constraints as vertices and MUSes enumerated until the current step as hyperedges, and employs an HGNN-based agent trained via reinforcement learning to minimize the number of satisfiability checks required to obtain an MUS. Experimental results demonstrate the effectiveness of our approach in accelerating MUS enumeration, showing that our method can enumerate more MUSes within the same satisfiability check budget compared to conventional methods.
【11】Multi-Agent Decision-Focused Learning via Value-Aware Sequential Communication
标题:通过价值感知顺序通信实现多智能体决策聚焦学习
链接:https://arxiv.org/abs/2604.08944
作者:Benjamin Amoh,Geoffrey Parker,Wesley Marrero
备注:15 pages, 6 figures, 3 tables. Includes appendix. Submitted to ICML 2026. Code available at https://github.com/AmohBen1/seqcomm_dfl
摘要:Multi-agent coordination under partial observability requires agents to share complementary private information. Whereas recent methods optimize messages for intermediate objectives (e.g., reconstruction accuracy or mutual information) rather than decision quality, we introduce \textbf{SeqComm-DFL}, which unifies sequential communication with decision-focused learning for task performance. Our approach features \emph{value-aware message generation with sequential Stackelberg conditioning}: messages maximize receiver decision quality and are generated in priority order, with agents conditioning on their predecessors. The \emph{guidance potential} determined by their prosocial ordering. We extend Optimal Model Design to communication-augmented world models with QMIX factorization, enabling efficient end-to-end training via implicit differentiation. We prove information-theoretic bounds showing that communication value scales with coordination gaps and establish $\mathcal{O}(1/\sqrt{T})$ convergence for the bilevel optimization, where $T$ denotes the number of training iterations. On collaborative healthcare and StarCraft Multi-Agent Challenge (SMAC) benchmarks, SeqComm-DFL achieves four to six times higher cumulative rewards and over 13\% win rate improvements, enabling coordination strategies inaccessible under information asymmetry.
【12】Delve into the Applicability of Advanced Optimizers for Multi-Task Learning
标题:深入研究高级优化器对多任务学习的适用性
链接:https://arxiv.org/abs/2604.08939
作者:Zhipeng Zhou,Linxiao Cao,Pengcheng Wu,Peilin Zhao,Chunyan Miao
备注:12 pages, 5 figures
摘要:Multi-Task Learning (MTL) is a foundational machine learning problem that has seen extensive development over the past decade. Recently, various optimization-based MTL approaches have been proposed to learn multiple tasks simultaneously by altering the optimization trajectory. Although these methods strive to de-conflict and re-balance tasks, we empirically identify that their effectiveness is often undermined by an overlooked factor when employing advanced optimizers: the instant-derived gradients play only a marginal role in the actual parameter updates. This discrepancy prevents MTL frameworks from fully releasing their power on learning dynamics. Furthermore, we observe that Muon, a recently emerged advanced optimizer, inherently functions as a multi-task learner, which underscores the critical importance of the gradients used for its orthogonalization. To address these issues, we propose APT (Applicability of advanced oPTimizers), a framework featuring a simple adaptive momentum mechanism designed to balance the strengths between advanced optimizers and MTL. Additionally, we introduce a light direction preservation method to facilitate Muon's orthogonalization. Extensive experiments across four mainstream MTL datasets demonstrate that APT consistently augments existing MTL approaches, yielding substantial performance improvements.
【13】A Mathematical Framework for Temporal Modeling and Counterfactual Policy Simulation of Student Dropout
标题:学生辍学时间建模和反事实政策模拟的数学框架
链接:https://arxiv.org/abs/2604.08874
作者:Rafael da Silva,Jeff Eicher,Gregory Longo
备注:Approx. 20 pages, 9 figures. Code and reproducibility package available at https://github.com/rafa-rodriguess/TCM-Student-Dropout . This work introduces a temporal survival framework with counterfactual policy simulation.
摘要:This study proposes a temporal modeling framework with a counterfactual policy-simulation layer for student dropout in higher education, using LMS engagement data and administrative withdrawal records. Dropout is operationalized as a time-to-event outcome at the enrollment level; weekly risk is modeled in discrete time via penalized, class-balanced logistic regression over person--period rows. Under a late-event temporal holdout, the model attains row-level AUCs of 0.8350 (train) and 0.8405 (test), with aggregate calibration acceptable but sparsely supported in the highest-risk bins. Ablation analyses indicate performance is sensitive to feature set composition, underscoring the role of temporal engagement signals. A scenario-indexed policy layer produces survival contrasts $\Delta S(T)$ under an explicit trigger/schedule contract: positive contrasts are confined to the shock branch ($T_{\rm policy}=18$: 0.0102, 0.0260, 0.0819), while the mechanism-aware branch is negative ($\Delta S_{\rm mech}(18)=-0.0078$, $\Delta S_{\rm mech}(38)=-0.0134$). A subgroup analysis by gender quantifies scenario-induced survival gaps via bootstrap; contrasts are directionally stable but small. Results are not causally identified; they demonstrate the framework's capacity for internal structural scenario comparison under observational data constraints.
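For readers who want to see what a discrete-time survival contrast looks like numerically, here is a minimal sketch. It uses the standard identity that survival to week T is the product of (1 - hazard) over weeks 1..T. Treating the contrast as policy survival minus baseline survival is an assumption made here for illustration; the abstract does not spell out the definition, and the function names are hypothetical.

```python
def survival_curve(hazards):
    """Return S(1..T) given per-period hazards h_1..h_T, using S(t) = prod(1 - h_k)."""
    curve, alive = [], 1.0
    for h in hazards:
        alive *= 1.0 - h  # probability of surviving one more period
        curve.append(alive)
    return curve


def survival_contrast(baseline_hazards, policy_hazards, horizon):
    """Assumed contrast: policy survival minus baseline survival at the given horizon."""
    base = survival_curve(baseline_hazards)[horizon - 1]
    policy = survival_curve(policy_hazards)[horizon - 1]
    return policy - base


if __name__ == "__main__":
    # Toy hazards only; these are not values from the paper.
    print(survival_curve([0.1, 0.2]))
    print(survival_contrast([0.5, 0.5], [0.5, 0.0], horizon=2))
```

In a setup like the one described, the per-week hazards would come from the fitted logistic model; here they are hand-picked toy numbers.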
【14】Post-Hoc Guidance for Consistency Models by Joint Flow Distribution Learning
标题:通过联合流分布学习为一致性模型提供事后指导
链接:https://arxiv.org/abs/2604.08828
作者:Chia-Hong Hsu,Randall Balestriero
摘要:Classifier-free Guidance (CFG) lets practitioners trade off fidelity against diversity in Diffusion Models (DMs). The practicality of CFG is, however, hindered by the sampling cost of DMs. On the other hand, Consistency Models (CMs) generate images in one or a few steps, but existing guidance methods require knowledge distillation from a separate DM teacher, limiting CFG to Consistency Distillation (CD) methods. We propose Joint Flow Distribution Learning (JFDL), a lightweight alignment method enabling guidance in a pre-trained CM. With a pre-trained CM as an ordinary differential equation (ODE) solver, we verify with normality tests that the variance-exploding noise implied by the velocity fields from unconditional and conditional distributions is Gaussian. In practice, JFDL equips CMs with the familiar adjustable guidance knob, yielding guided images with similar characteristics to CFG. Applied to an original Consistency Trained (CT) CM that could only do conditional sampling, JFDL unlocks guided generation and reduces FID on both CIFAR-10 and ImageNet 64x64 datasets. This is the first time that CMs are able to receive effective guidance post-hoc without a DM teacher, thus bridging a key gap in current methods for CMs.
【15】Toward Hardware-Agnostic Quadrupedal World Models via Morphology Conditioning
标题:通过形态条件化迈向硬件无关的四足机器人世界模型
链接:https://arxiv.org/abs/2604.08780
作者:Mohamad H. Danesh,Chenhao Li,Amin Abyaneh,Anas Houssaini,Kirsty Ellis,Glen Berseth,Marco Hutter,Hsiu-Chin Lin
摘要:World models promise a paradigm shift in robotics, where an agent learns the underlying physics of its environment once to enable efficient planning and behavior learning. However, current world models are often hardware-locked specialists: a model trained on a Boston Dynamics Spot robot fails catastrophically on a Unitree Go1 due to the mismatch in kinematic and dynamic properties, as the model overfits to specific embodiment constraints rather than capturing the universal locomotion dynamics. Consequently, a slight change in actuator dynamics or limb length necessitates training a new model from scratch. In this work, we take a step towards a framework for training a generalizable Quadrupedal World Model (QWM) that disentangles environmental dynamics from robot morphology. We address the limitations of implicit system identification, where treating static physical properties (like mass or limb length) as latent variables to be inferred from motion history creates an adaptation lag that can compromise zero-shot safety and efficiency. Instead, we explicitly condition the generative dynamics on the robot's engineering specifications. By integrating a physical morphology encoder and a reward normalizer, we enable the model to serve as a neural simulator capable of generalizing across morphologies. This capability unlocks zero-shot control across a range of embodiments. We introduce, for the first time, a world model that enables zero-shot generalization to new morphologies for locomotion. We carefully study the limitations of our method: QWM operates as a distribution-bounded interpolator within the quadrupedal morphology family rather than a universal physics engine. Even so, this work represents a significant step toward morphology-conditioned world models for legged locomotion.
【16】Deep Learning-Based Tracking and Lineage Reconstruction of Ligament Breakup
标题:基于深度学习的液丝破碎追踪与谱系重建
链接:https://arxiv.org/abs/2604.08711
作者:Vrushank Ahire,Vivek Kurumanghat,Mudasir Ganaie,Lipika Kabiraj
摘要:The disintegration of liquid sheets into ligaments and droplets involves highly transient, multi-scale dynamics that are difficult to quantify from high-speed shadowgraphy images. Identifying droplets, ligaments, and blobs formed during breakup, along with tracking across frames, is essential for spray analysis. However, conventional multi-object tracking frameworks impose strict one-to-one temporal associations and cannot represent one-to-many fragmentation events. In this study, we present a two-stage deep learning framework for object detection and temporal relationship modeling across frames. The framework captures ligament deformation, fragmentation, and parent-child lineage during liquid sheet disintegration. In the first stage, a Faster R-CNN with a ResNet-50 backbone and Feature Pyramid Network detects and classifies ligaments and droplets in high-speed shadowgraphy recordings of an impinging Carbopol gel jet. A morphology-preserving synthetic data generation strategy augments the training set without introducing physically implausible configurations, achieving a held-out F1 score of up to 0.872 across fourteen original-to-synthetic configurations. In the second stage, a Transformer-augmented multilayer perceptron classifies inter-frame associations into continuation, fragmentation (one-to-many), and non-association using physics-informed geometric features. Despite severe class imbalance, the model achieves 86.1% accuracy, 93.2% precision, and perfect recall (1.00) for fragmentation events. Together, the framework enables automated reconstruction of fragmentation trees, preservation of parent-child lineage, and extraction of breakup statistics such as fragment multiplicity and droplet size distributions. By explicitly identifying children droplets formed from ligament fragmentation, the framework provides automated analysis of the primary atomization mode.
【17】PRAGMA: Revolut Foundation Model
标题:PRAGMA:Revolut基础模型
链接:https://arxiv.org/abs/2604.08649
作者:Maxim Ostroukhov,Ruslan Mikhailov,Vladimir Iashin,Artem Sokolov,Andrei Akshonov,Vitaly Protasov,Dmitrii Beloborodov,Vince Mullin,Roman Yokunda Enzmann,Georgios Kolovos,Jason Renders,Pavel Nesterov,Anton Repushko
摘要:Modern financial systems generate vast quantities of transactional and event-level data that encode rich economic signals. This paper presents PRAGMA, a family of foundation models for multi-source banking event sequences. Our approach pre-trains a Transformer-based architecture with masked modelling on a large-scale, heterogeneous banking event corpus using a self-supervised objective tailored to the discrete, variable-length nature of financial records. The resulting model supports a wide range of downstream tasks such as credit scoring, fraud detection, and lifetime value prediction: strong performance can be achieved by training a simple linear model on top of the extracted embeddings and can be further improved with lightweight fine-tuning. Through extensive evaluation on downstream tasks, we demonstrate that PRAGMA achieves superior performance across multiple domains directly from raw event sequences, providing a general-purpose representation layer for financial applications.
【18】VOLTA: The Surprising Ineffectiveness of Auxiliary Losses for Calibrated Deep Learning
标题:VOLTA:校准深度学习的辅助损失令人惊讶的无效性
链接:https://arxiv.org/abs/2604.08639
作者:Rahul D Ray,Utkarsh Srivastava
摘要:Uncertainty quantification (UQ) is essential for deploying deep learning models in safety-critical applications, yet no consensus exists on which UQ method performs best across different data modalities and distribution shifts. This paper presents a comprehensive benchmark of ten widely used UQ baselines, including MC Dropout, SWAG, ensemble methods, temperature scaling, energy-based OOD, Mahalanobis, hyperbolic classifiers, ENN, Taylor Sensus, and split conformal prediction, against a simplified yet highly effective variant of VOLTA that retains only a deep encoder, learnable prototypes, cross-entropy loss, and post hoc temperature scaling. We evaluate all methods on CIFAR-10 (in-distribution), CIFAR-100, SVHN, uniform noise (out-of-distribution), CIFAR-10-C (corruptions), and Tiny ImageNet features (tabular). VOLTA achieves competitive or superior accuracy (up to 0.864 on CIFAR-10), significantly lower expected calibration error (0.010 vs. 0.044 to 0.102 for baselines), and strong OOD detection (AUROC 0.802). Statistical testing over three random seeds shows that VOLTA matches or outperforms most baselines, with ablation studies confirming the importance of adaptive temperature and deep encoders. Our results establish VOLTA as a lightweight, deterministic, and well-calibrated alternative to more complex UQ approaches.
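Since the abstract reports expected calibration error (ECE) as its headline metric, a minimal sketch of the common equal-width-bin ECE estimator may help. The bin count, the binning rule, and the function name are assumptions for illustration; the paper's exact evaluation protocol is not given in the abstract.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: weighted mean of |accuracy - mean confidence| per bin."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, hit))
    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        ece += len(members) / n * abs(accuracy - avg_conf)
    return ece


if __name__ == "__main__":
    # Two predictions at 90% confidence, only one of them correct.
    print(expected_calibration_error([0.9, 0.9], [1, 0]))
```

A perfectly calibrated model would have accuracy equal to mean confidence in every bin, giving an ECE of zero.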
【19】From Dispersion to Attraction: Spectral Dynamics of Hallucination Across Whisper Model Scales
标题:从弥散到吸引:不同规模Whisper模型中幻觉的谱动力学
链接:https://arxiv.org/abs/2604.08591
作者:Ivan Viakhirev,Kirill Borodin,Grach Mkrtchian
备注:This paper has been submitted to Interspeech 2026 for review
摘要:Hallucinations in large ASR models present a critical safety risk. In this work, we propose the \textit{Spectral Sensitivity Theorem}, which predicts a phase transition in deep networks from a dispersive regime (signal decay) to an attractor regime (rank-1 collapse) governed by layer-wise gain and alignment. We validate this theory by analyzing the eigenspectra of activation graphs in Whisper models (Tiny to Large-v3-Turbo) under adversarial stress. Our results confirm the theoretical prediction: intermediate models exhibit \textit{Structural Disintegration} (Regime I), characterized by a $13.4\%$ collapse in Cross-Attention rank. Conversely, large models enter a \textit{Compression-Seeking Attractor} state (Regime II), where Self-Attention actively compresses rank ($-2.34\%$) and hardens the spectral slope, decoupling the model from acoustic evidence.
【20】FluidFlow: a flow-matching generative model for fluid dynamics surrogates on unstructured meshes
标题:FluidFlow:非结构网格上的流体动力学流匹配生成模型
链接:https://arxiv.org/abs/2604.08586
作者:David Ramos,Lucas Lacasa,Fermín Gutiérrez,Eusebio Valero,Gonzalo Rubio
备注:17 pages, 6 figures
摘要:Computational fluid dynamics (CFD) provides high-fidelity simulations of fluid flows but remains computationally expensive for many-query applications. In recent years deep learning (DL) has been used to construct data-driven fluid-dynamic surrogate models. In this work we consider a different learning paradigm and embrace generative modelling as a framework for constructing scalable fluid-dynamics surrogate models. We introduce FluidFlow, a generative model based on conditional flow-matching, a recent alternative to diffusion models that learns deterministic transport maps between noise and data distributions. FluidFlow is specifically designed to operate directly on CFD data defined on both structured and unstructured meshes, without the need to perform any mesh interpolation pre-processing and while preserving geometric fidelity. We assess the capabilities of FluidFlow using two different core neural network architectures, a U-Net and a diffusion transformer (DiT), and condition their learning on physically meaningful parameters. The methodology is validated on two benchmark problems of increasing complexity: prediction of pressure coefficients along an airfoil boundary across different operating conditions, and prediction of pressure and friction coefficients over a full three-dimensional aircraft geometry discretized on a large unstructured mesh. In both cases, FluidFlow outperforms strong multilayer perceptron baselines, achieving significantly lower error metrics and improved generalisation across operating conditions. Notably, the transformer-based architecture enables scalable learning on large unstructured datasets while maintaining high predictive accuracy. These results demonstrate that flow-matching generative models provide an effective and flexible framework for surrogate modelling in fluid dynamics, with potential for realistic engineering and scientific applications.
【21】Beyond Augmented-Action Surrogates for Multi-Expert Learning-to-Defer
标题:超越多专家学习延迟的增强行动替代品
链接:https://arxiv.org/abs/2604.09414
作者:Yannis Montreuil,Axel Carlier,Lai Xing Ng,Wei Tsang Ooi
摘要:Learning-to-Defer routes each input to the expert that minimizes expected cost, but it assumes that the information available to every expert is fixed at decision time. Many modern systems violate this assumption: after selecting an expert, one may also choose what additional information that expert should receive, such as retrieved documents, tool outputs, or escalation context. We study this problem and call it Learning-to-Defer with advice. We show that a broad family of natural separated surrogates, which learn routing and advice with distinct heads, is inconsistent even in the smallest non-trivial setting. We then introduce an augmented surrogate that operates on the composite expert--advice action space and prove an $\mathcal{H}$-consistency guarantee together with an excess-risk transfer bound, yielding recovery of the Bayes-optimal policy in the limit. Experiments on tabular, language, and multi-modal tasks show that the resulting method improves over standard Learning-to-Defer while adapting its advice-acquisition behavior to the cost regime; a synthetic benchmark confirms the failure mode predicted for separated surrogates.
【22】Sharp description of local minima in the loss landscape of high-dimensional two-layer ReLU neural networks
标题:高维两层ReLU神经网络损失景观中局部极小值的精确刻画
链接:https://arxiv.org/abs/2604.09412
作者:Jie Huang,Bruno Loureiro,Stefano Sarao Mannelli
备注:34 pages, 22 figures
摘要:We study the population loss landscape of two-layer ReLU networks of the form $\sum_{k=1}^K \mathrm{ReLU}(w_k^\top x)$ in a realisable teacher-student setting with Gaussian covariates. We show that local minima admit an exact low-dimensional representation in terms of summary statistics, yielding a sharp and interpretable characterisation of the landscape. We further establish a direct link with one-pass SGD: local minima correspond to attractive fixed points of the dynamics in summary statistics space. This perspective reveals a hierarchical structure of minima: they are typically isolated in the well-specified regime, but become connected by flat directions as network width increases. In this overparameterised regime, global minima become increasingly accessible, attracting the dynamics and reducing convergence to spurious solutions. Overall, our results reveal intrinsic limitations of common simplifying assumptions, which may miss essential features of the loss landscape even in minimal neural network models.
【23】Natural Riemannian gradient for learning functional tensor networks
标题:用于学习函数张量网络的自然Riemann梯度
链接:https://arxiv.org/abs/2604.09263
作者:Nikolas Klug,Michael Ulbrich,Marius Willner,André Uschmajew
摘要:We consider machine learning tasks with low-rank functional tree tensor networks (TTN) as the learning model. While in the case of least-squares regression, low-rank functional TTNs can be efficiently optimized using alternating optimization, this is not directly possible in other problems, such as multinomial logistic regression. We propose a natural Riemannian gradient descent type approach applicable to arbitrary losses which is based on the natural gradient by Amari. In particular, the search direction obtained by the natural gradient is independent of the choice of basis of the underlying functional tensor product space. Our framework applies to both the factorized and manifold-based approach for representing the functional TTN. For practical application, we propose a hierarchy of efficient approximations to the true natural Riemannian gradient for computing the updates in the parameter space. Numerical experiments confirm our theoretical findings on common classification datasets and show that using natural Riemannian gradient descent for learning considerably improves convergence behavior when compared to standard Riemannian gradient methods.
【24】Online Quantile Regression for Nonparametric Additive Models
标题:非参数可加性模型的在线分位数回归
链接:https://arxiv.org/abs/2604.08969
作者:Haoran Zhan
摘要:This paper introduces a projected functional gradient descent algorithm (P-FGD) for training nonparametric additive quantile regression models in online settings. This algorithm extends the functional stochastic gradient descent framework to the pinball loss. An advantage of P-FGD is that it does not need to store historical data while maintaining $O(J_t\ln J_t)$ computational complexity per step, where $J_t$ denotes the number of basis functions. In addition, quantile function prediction at time $t$ requires only $O(J_t)$ computational time. These properties make P-FGD considerably cheaper in memory and computation than the commonly used RKHS-based methods in online learning. By leveraging a novel Hilbert space projection identity, we also prove that the proposed online quantile function estimator (P-FGD) achieves the minimax optimal consistency rate $O(t^{-\frac{2s}{2s+1}})$, where $t$ is the current time and $s$ denotes the smoothness degree of the quantile function. Extensions to mini-batch learning are also established.
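To make the pinball loss and the "no stored history" property concrete, here is a minimal sketch of online subgradient descent for a single scalar quantile. This is deliberately much simpler than the P-FGD algorithm in the abstract: there are no basis functions, no additive structure, and no projection step, and the step-size schedule and function names are assumptions for illustration.

```python
def pinball_loss(y, q, tau):
    """Pinball (quantile) loss of predicting q for the tau-quantile when y is observed."""
    diff = y - q
    return tau * diff if diff >= 0 else (tau - 1.0) * diff


def online_quantile(stream, tau, step=lambda t: 1.0 / t ** 0.5):
    """Track a scalar tau-quantile with one subgradient step per observation.

    Only the current estimate q is kept; past observations are never stored.
    """
    q = 0.0
    for t, y in enumerate(stream, start=1):
        # Subgradient of the pinball loss with respect to q.
        grad = -tau if y > q else (1.0 - tau)
        q -= step(t) * grad
    return q


if __name__ == "__main__":
    print(pinball_loss(1.0, 0.0, 0.9))   # under-prediction is penalised by tau
    print(online_quantile([1.0], 0.5))   # one update from q = 0 with step 1
```

The asymmetry of the loss is what steers the estimate toward the tau-quantile: for tau = 0.9, under-predictions cost nine times as much as over-predictions of the same size.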
【25】A novel hybrid approach for positive-valued DAG learning
标题:一种新颖的正值DAG学习混合方法
链接:https://arxiv.org/abs/2604.08935
作者:Yao Zhao
备注:13 pages, 2 tables. Accepted at CLeaR 2026
摘要:Causal discovery from observational data remains a fundamental challenge in machine learning and statistics, particularly when variables represent inherently positive quantities such as gene expression levels, asset prices, company revenues, or population counts, which often follow multiplicative rather than additive dynamics. We propose the Hybrid Moment-Ratio Scoring (H-MRS) algorithm, a novel method for learning directed acyclic graphs (DAGs) from positive-valued data by combining moment-based scoring with log-scale regression. The key idea is that for positive-valued variables, the moment ratio $\frac{\mathbb{E}[X_j^2]}{\mathbb{E}[(\mathbb{E}[X_j \mid S])^2]}$ provides an effective criterion for causal ordering, where $S$ denotes candidate parent sets. H-MRS integrates log-scale Ridge regression for moment-ratio estimation with a greedy ordering procedure based on raw-scale moment ratios, followed by Elastic Net-based parent selection to recover the final DAG structure. Experiments on synthetic log-linear data demonstrate competitive precision and recall. The proposed method is computationally efficient and naturally respects positivity constraints, making it suitable for applications in genomics and economics. These results suggest that combining log-scale modeling with raw-scale moment ratios provides a practical framework for causal discovery in positive-valued domains.
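The moment ratio in the abstract can be estimated from samples once a conditional-mean estimate is available. The sketch below is an illustration only: it takes the fitted conditional means as given rather than reproducing the log-scale Ridge regression step, and the function name is hypothetical. Because $\mathbb{E}[X_j^2] = \mathbb{E}[\mathrm{Var}(X_j \mid S)] + \mathbb{E}[(\mathbb{E}[X_j \mid S])^2]$, the population ratio is at least 1 and equals 1 exactly when $S$ determines $X_j$.

```python
def moment_ratio(x, cond_mean):
    """Sample estimate of E[X^2] / E[(E[X | S])^2].

    x         -- observed positive values of X_j
    cond_mean -- fitted values of E[X_j | S] for the same samples (assumed given)
    """
    n = len(x)
    second_moment = sum(v * v for v in x) / n
    explained = sum(m * m for m in cond_mean) / n
    return second_moment / explained


if __name__ == "__main__":
    x = [1.0, 2.0, 3.0]
    # Empty parent set: the conditional mean is just the sample mean.
    print(moment_ratio(x, [2.0, 2.0, 2.0]))
    # Parents that determine X exactly: the ratio collapses to 1.
    print(moment_ratio(x, x))
```

How the ratios are compared across variables to produce an ordering is specific to the H-MRS procedure and is not reproduced here.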
【26】Geometry-Induced Long-Range Correlations in Recurrent Neural Network Quantum States
标题:循环神经网络量子态中几何诱导的长程关联
链接:https://arxiv.org/abs/2604.08661
作者:Asif Bin Ayub,Amine Mohamed Aboussalah,Mohamed Hibat-Allah
备注:16 pages, 4 figures, and 1 table
摘要:Neural Quantum States based on autoregressive recurrent neural network (RNN) wave functions enable efficient sampling without Markov-chain autocorrelation, but standard RNN architectures are biased toward finite-length correlations and can fail on states with long-range dependencies. A common response is to adopt transformer-style self-attention, but this typically comes with substantially higher computational and memory overhead. Here we introduce dilated RNN wave functions, where recurrent units access distant sites through dilated connections, injecting an explicit long-range inductive bias while retaining a favorable $\mathcal{O}(N \log N)$ forward pass scaling. We show analytically that dilation changes the correlation geometry and can induce power-law correlation scaling in a simplified linearized and perturbative setting. Numerically, for the critical 1D transverse-field Ising model, dilated RNNs reproduce the expected power-law connected two-point correlations in contrast to the exponential decay typical of conventional RNN ansätze. We further show that the dilated RNN accurately approximates the one-dimensional Cluster state, a paradigmatic example with long-range conditional correlations that has previously been reported to be challenging for RNN-based wave functions. These results highlight dilation as a simple geometric mechanism for building correlation-aware autoregressive neural quantum states.
【27】Spectral-Transport Stability and Benign Overfitting in Interpolating Learning
标题:插值学习中的谱传输稳定性与良性过拟合
链接:https://arxiv.org/abs/2604.08625
作者:Gustav Olaf Yunus Laitinen-Lundström Fredriksson-Imanov
备注:50 pages, 7 figures, 4 tables. Research article. Includes full proofs, model-specific corollaries, and synthetic supporting experiments. Submitted to Machine Learning
摘要:We develop a theoretical framework for generalization in the interpolating regime of statistical learning. The central question is why highly overparameterized estimators can attain zero empirical risk while still achieving nontrivial predictive accuracy, and how to characterize the boundary between benign and destructive overfitting. We introduce a spectral-transport stability framework in which excess risk is controlled jointly by the spectral geometry of the data distribution, the sensitivity of the learning rule under single-sample replacement, and the alignment structure of label noise. This leads to a scale-dependent Fredriksson index that combines effective dimension, transport stability, and noise alignment into a single complexity parameter for interpolating estimators. We prove finite-sample risk bounds, establish a sharp benign-overfitting criterion through the vanishing of the index along admissible spectral scales, and derive explicit phase-transition rates under polynomial spectral decay. For a model-specific specialization, we obtain an explicit theorem for polynomial-spectrum linear interpolation, together with a proof of the resulting rate. The framework also clarifies implicit regularization by showing how optimization dynamics can select interpolating solutions of minimal spectral-transport energy. These results connect algorithmic stability, double descent, benign overfitting, operator-theoretic learning theory, and implicit bias within a unified structural account of modern interpolation.
其他(30篇)
【1】Envisioning the Future, One Step at a Time
标题:展望未来,一步一个脚印
链接:https://arxiv.org/abs/2604.09527
作者:Stefan Andreas Baumann,Jannik Wiese,Tommaso Martorella,Mahdi M. Kalayeh,Björn Ommer
备注:CVPR 2026. For code and models, see http://compvis.github.io/myriad
摘要:Accurately anticipating how complex, diverse scenes will evolve requires models that represent uncertainty, simulate along extended interaction chains, and efficiently explore many plausible futures. Yet most existing approaches rely on dense video or latent-space prediction, expending substantial capacity on dense appearance rather than on the underlying sparse trajectories of points in the scene. This makes large-scale exploration of future hypotheses costly and limits performance when long-horizon, multi-modal motion is essential. We address this by formulating the prediction of open-set future scene dynamics as step-wise inference over sparse point trajectories. Our autoregressive diffusion model advances these trajectories through short, locally predictable transitions, explicitly modeling the growth of uncertainty over time. This dynamics-centric representation enables fast rollout of thousands of diverse futures from a single image, optionally guided by initial constraints on motion, while maintaining physical plausibility and long-range coherence. We further introduce OWM, a benchmark for open-set motion prediction based on diverse in-the-wild videos, to evaluate accuracy and variability of predicted trajectory distributions under real-world uncertainty. Our method matches or surpasses dense simulators in predictive accuracy while achieving orders-of-magnitude higher sampling speed, making open-set future prediction both scalable and practical. Project page: http://compvis.github.io/myriad.
【2】RecaLLM: Addressing the Lost-in-Thought Phenomenon with Explicit In-Context Retrieval
标题:RecaLLM:通过显式上下文检索解决思维迷失现象
链接:https://arxiv.org/abs/2604.09494
作者:Kyle Whitecross,Negin Rahimi
备注:Code, data, and models available at https://github.com/kswhitecross/RecaLLM
摘要:We propose RecaLLM, a set of reasoning language models post-trained to make effective use of long-context information. In-context retrieval, which identifies relevant evidence from context, and reasoning are deeply intertwined: retrieval supports reasoning, while reasoning often determines what must be retrieved. However, their interaction remains largely underexplored. In preliminary experiments on several open-source LLMs, we observe that in-context retrieval performance substantially degrades even after a short reasoning span, revealing a key bottleneck for test-time scaling that we refer to as lost-in-thought: reasoning steps that improve performance also make subsequent in-context retrieval more challenging. To address this limitation, RecaLLM interleaves reasoning with explicit in-context retrieval, alternating between reasoning and retrieving context information needed to solve intermediate subproblems. We introduce a negligible-overhead constrained decoding mechanism that enables verbatim copying of evidence spans, improving the grounding of subsequent generation. Trained on diverse lexical and semantic retrieval tasks, RecaLLM achieves strong performance on two long-context benchmarks, RULER and HELMET, significantly outperforming baselines. Notably, we observe consistent gains at context windows of up to 128K tokens using training samples of at most 10K tokens, far shorter than those used by existing long-context approaches, highlighting a promising path toward improving long-context performance without expensive long-context training data.
【3】Offline Local Search for Online Stochastic Bandits
Link: https://arxiv.org/abs/2604.09423
Authors: Gerdus Benadè, Rathish Das, Thomas Lavastida
Comments: Part of this work has been accepted at ACM SIGMETRICS 2026
Abstract: Combinatorial multi-armed bandits provide a fundamental online decision-making environment where a decision-maker interacts with an environment across $T$ time steps, each time selecting an action and learning the cost of that action. The goal is to minimize regret, defined as the loss compared to the optimal fixed action in hindsight under full information. There has been substantial interest in leveraging what is known about offline algorithm design in this online setting. Offline greedy and linear optimization algorithms (both exact and approximate) have been shown to provide useful guarantees when deployed online. We investigate local search methods, a broad class of algorithms used widely in both theory and practice, which have thus far been under-explored in this context. We focus on problems where offline local search terminates in an approximately optimal solution and give a generic method for converting such an offline algorithm into an online stochastic combinatorial bandit algorithm with $O(\log^3 T)$ (approximate) regret. In contrast, existing offline-to-online frameworks yield regret (and approximate regret) that depends sublinearly, but polynomially, on $T$. We demonstrate the flexibility of our framework by applying it to three online stochastic combinatorial optimization problems: scheduling to minimize total completion time, finding a minimum cost base of a matroid, and uncertain clustering.
【4】Stability Enhanced Gaussian Process Variational Autoencoders
Link: https://arxiv.org/abs/2604.09331
Authors: Carl R. Richardson, Jichen Zhang, Ethan King, Ján Drgoňa
Abstract: A novel stability-enhanced Gaussian process variational autoencoder (SEGP-VAE) is proposed for indirectly training a low-dimensional linear time-invariant (LTI) system, using high-dimensional video data. The mean and covariance function of the novel SEGP prior are derived from the definition of an LTI system, enabling the SEGP to capture the indirectly observed latent process using a combined probabilistic and interpretable physical model. The search space of LTI parameters is restricted to the set of semi-contracting systems via a complete and unconstrained parametrisation. As a result, the SEGP-VAE can be trained using unconstrained optimisation algorithms. Furthermore, this parametrisation prevents numerical issues caused by the presence of a non-Hurwitz state matrix. A case study applies SEGP-VAE to a dataset containing videos of spiralling particles. This highlights the benefits of the approach and the application-specific design choices that enabled accurate latent state predictions.
【5】Beyond Segmentation: Structurally Informed Facade Parsing from Imperfect Images
Link: https://arxiv.org/abs/2604.09260
Authors: Maciej Janicki, Aleksander Plocharski, Przemyslaw Musialski
Comments: 4 pages, 4 figures, EUROGRAPHICS 2026 Short Paper
Abstract: Standard object detectors typically treat architectural elements independently, often resulting in facade parsings that lack the structural coherence required for downstream procedural reconstruction. We address this limitation by augmenting the YOLOv8 training objective with a custom lightweight alignment loss. This regularization encourages grid-consistent arrangements of bounding boxes during training, effectively injecting geometric priors without altering the standard inference pipeline. Experiments on the CMP dataset demonstrate that our method successfully improves structural regularity, correcting alignment errors caused by perspective and occlusion while maintaining a controllable trade-off with standard detection accuracy.
【6】Nexus: Same Pretraining Loss, Better Downstream Generalization via Common Minima
Link: https://arxiv.org/abs/2604.09258
Authors: Huanran Chen, Huaqing Zhang, Xiao Li, Yinpeng Dong, Ke Shen, Jun Zhu
Abstract: Pretraining is the cornerstone of Large Language Models (LLMs), dominating the vast majority of computational budget and data to serve as the primary engine for their capabilities. During pretraining, LLMs acquire foundational knowledge from unprecedentedly massive and diverse data sources, encompassing a vast array of domains such as general language, mathematics, code, and complex reasoning. In this work, we investigate an interesting geometric question regarding the converged state of pretraining: Does the model converge to a common minimizer across all data sources, or merely a minimizer of the summed loss? We hypothesize that the geometric "closeness" of task-specific minima is intrinsically linked to downstream generalization. We reveal that standard optimizers (e.g., AdamW) often converge to points where task-specific minima are distant from each other. To address this, we propose the Nexus optimizer, which encourages the closeness of these minima by maximizing gradient similarity during optimization. Experiments across models ranging from 130M to 3B parameters, various data mixtures, and hyperparameter schedules show that Nexus significantly boosts downstream performance despite achieving the same pretraining loss. Notably, on the 3B model, Nexus reduces the out-of-distribution loss by 0.012 and yields up to a 15.0% accuracy improvement on complex reasoning tasks (e.g., GSM8k). This finding challenges the reliance on pretraining loss as the sole proxy for model evaluation and demonstrates the importance of implicit biases in unlocking downstream generalization.
【7】MixFlow: Mixed Source Distributions Improve Rectified Flows
Link: https://arxiv.org/abs/2604.09181
Authors: Nazir Nayal, Christopher Wewer, Jan Eric Lenssen
Abstract: Diffusion models and their variations, such as rectified flows, generate diverse and high-quality images, but they are still hindered by slow iterative sampling caused by the highly curved generative paths they learn. An important cause of high curvature, as shown by previous work, is independence between the source distribution (standard Gaussian) and the data distribution. In this work, we tackle this limitation with two complementary contributions. First, we attempt to break away from the standard Gaussian assumption by introducing $κ$-FC, a general formulation that conditions the source distribution on an arbitrary signal $κ$ that aligns it better with the data distribution. Then, we present MixFlow, a simple but effective training strategy that reduces the generative path curvatures and considerably improves sampling efficiency. MixFlow trains a flow model on linear mixtures of a fixed unconditional distribution and a $κ$-FC-based distribution. This simple mixture improves the alignment between the source and data, provides better generation quality with fewer sampling steps, and accelerates the training convergence considerably. On average, our training procedure improves the generation quality by 12% in FID compared to standard rectified flow and 7% compared to previous baselines under a fixed sampling budget. Code available at: https://github.com/NazirNayal8/MixFlow
【8】CORA: Conformal Risk-Controlled Agents for Safeguarded Mobile GUI Automation
Link: https://arxiv.org/abs/2604.09155
Authors: Yushi Feng, Junye Du, Qifan Wang, Zizhan Ma, Qian Niu, Yutaka Matsuo, Long Feng, Lequan Yu
Abstract: Graphical user interface (GUI) agents powered by vision language models (VLMs) are rapidly moving from passive assistance to autonomous operation. However, this unrestricted action space exposes users to severe and irreversible financial, privacy or social harm. Existing safeguards, which rely on prompt engineering, brittle heuristics, and VLM-as-critic, lack formal verification and user-tunable guarantees. We propose CORA (COnformal Risk-controlled GUI Agent), a post-policy, pre-action safeguarding framework that provides statistical guarantees on harmful executed actions. CORA reformulates safety as selective action execution: we train a Guardian model to estimate action-conditional risk for each proposed step. Rather than thresholding raw scores, we leverage Conformal Risk Control to calibrate an execute/abstain boundary that satisfies a user-specified risk budget and route rejected actions to a trainable Diagnostician model, which performs multimodal reasoning over rejected actions to recommend interventions (e.g., confirm, reflect, or abort) to minimize user burden. A Goal-Lock mechanism anchors assessment to a clarified, frozen user intent to resist visual injection attacks. To rigorously evaluate this paradigm, we introduce Phone-Harm, a new benchmark of mobile safety violations with step-level harm labels under real-world settings. Experiments on Phone-Harm and public benchmarks against diverse baselines validate that CORA improves the safety-helpfulness-interruption Pareto frontier, offering a practical, statistically grounded safety paradigm for autonomous GUI execution. Code and benchmark are available at cora-agent.github.io.
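The calibration step mentioned in this abstract can be illustrated with a minimal sketch of generic Conformal Risk Control. This is not CORA's implementation: the function name `calibrate_threshold`, the 0/1 loss (an action that is both executed and harmful), and the rule "execute when score < threshold" are assumptions made for illustration only.

```python
import math

def calibrate_threshold(scores, harmful, alpha, max_loss=1.0):
    """Return the largest execute-threshold whose conformal risk bound is <= alpha.

    scores  : risk scores assigned to calibration actions (hypothetical guardian output)
    harmful : 0/1 labels, 1 if executing that action would cause harm
    alpha   : user-specified risk budget
    An action is executed when score < threshold; otherwise the agent abstains.
    """
    n = len(scores)
    best = -math.inf  # abstain on everything if no threshold meets the budget
    for t in sorted(set(scores)) + [math.inf]:
        # Number of calibration actions that are both executed and harmful.
        harmful_executed = sum(h for s, h in zip(scores, harmful) if s < t)
        # Conformal Risk Control bound: (n * empirical_risk + B) / (n + 1),
        # where B = max_loss is the upper bound on the per-action loss.
        if (harmful_executed + max_loss) / (n + 1) <= alpha:
            best = t
        else:
            break  # the loss is non-decreasing in t, so larger thresholds fail too
    return best

scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
harmful = [0, 0, 0, 0, 0, 1, 0, 1, 1]
print(calibrate_threshold(scores, harmful, alpha=0.25))  # 0.8
```

A tighter budget moves the threshold down. With only nine calibration points, any `alpha` below 0.1 cannot be certified, and the sketch then abstains on every action.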
【9】Score-Driven Rating System for Sports
Link: https://arxiv.org/abs/2604.09143
Authors: Vladimír Holý, Michal Černý
Abstract: This paper introduces a score-driven rating system, a generalization of the classical Elo rating system that employs the score, i.e. the gradient of the log-likelihood, as the updating mechanism for player and team ratings. The proposed framework extends beyond simple win/loss game outcomes and accommodates a wide range of game results, such as point differences, win/draw/loss outcomes, or complete rankings. Theoretical properties of the score are derived, showing that it has zero expected value, sums to zero across all players, and decreases with increasing value of a player's rating, thereby ensuring internal consistency and fairness. Furthermore, the score-driven rating system exhibits a reversion property, meaning that ratings tend to follow the underlying unobserved true skills over time. The proposed framework provides a theoretical rationale for existing dynamic models of sports performance and offers a systematic approach for constructing new ones.
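To make the link to Elo concrete, the sketch below covers only the simplest special case: a win/loss outcome under a logistic win probability. In that case the gradient of the Bernoulli log-likelihood with respect to a player's rating is proportional to `outcome - p`, which is the classical Elo update. The function names and the constants `k=32` and `scale=400` are conventional Elo choices assumed here, not values taken from the paper.

```python
def win_probability(r_i, r_j, scale=400.0):
    """Logistic (base-10) probability that player i beats player j."""
    return 1.0 / (1.0 + 10.0 ** ((r_j - r_i) / scale))

def score_update(r_i, r_j, outcome, k=32.0):
    """Update both ratings using the score of the Bernoulli log-likelihood.

    outcome is 1.0 if player i won and 0.0 if player j won.
    d/dr_i log L is proportional to (outcome - p); the constant factor
    ln(10) / scale is absorbed into the step size k.
    """
    p = win_probability(r_i, r_j)
    score_i = outcome - p   # has zero expectation when p is the true win probability
    score_j = -score_i      # scores sum to zero across the two players
    return r_i + k * score_i, r_j + k * score_j

print(score_update(1500.0, 1500.0, outcome=1.0))  # (1516.0, 1484.0)
```

Because the two scores cancel, the sum of the ratings is unchanged by every update. This is the two-player case of the sum-to-zero property stated in the abstract.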
【10】GeoPAS: Geometric Probing for Algorithm Selection in Continuous Black-Box Optimisation
Link: https://arxiv.org/abs/2604.09095
Authors: Jiabao Brad Wang, Xiang Shi, Yiliang Yuan, Mustafa Misir
Comments: Companion to a paper to appear at GECCO 2026
Abstract: Automated algorithm selection in continuous black-box optimisation typically relies on fixed landscape descriptors computed under a limited probing budget, yet such descriptors can degrade under problem-split or cross-benchmark evaluation. We propose GeoPAS, a geometric probing approach that represents a problem instance by multiple coarse two-dimensional slices sampled across locations, orientations, and logarithmic scales. A shared validity-aware convolutional encoder maps each slice to an embedding, conditions it on slice-scale and amplitude statistics, and aggregates the resulting features permutation-invariantly for risk-aware solver selection via log-scale performance prediction with an explicit penalty on tail failures. On COCO/BBOB with a 12-solver portfolio in dimensions 2-10, GeoPAS improves over the single best solver under leave-instance-out, grouped random, and leave-problem-out evaluation. These results suggest that multi-scale geometric slices provide a useful transferable static signal for algorithm selection, although a small number of heavy-tail regimes remain and continue to dominate the mean. Our code is available at https://github.com/BradWangW/GeoPAS.
【11】U-Cast: A Surprisingly Simple and Efficient Frontier Probabilistic AI Weather Forecaster
Link: https://arxiv.org/abs/2604.09041
Authors: Salva Rühling Cachay, Duncan Watson-Parris, Rose Yu
Comments: Our code is available at: https://github.com/Rose-STL-Lab/u-cast
Abstract: AI-based weather forecasting now rivals traditional physics-based ensembles, but state-of-the-art (SOTA) models rely on specialized architectures and massive computational budgets, creating a high barrier to entry. We demonstrate that such complexity is unnecessary for frontier performance. We introduce U-Cast, a probabilistic forecaster built on a standard U-Net backbone trained with a simple recipe: deterministic pre-training on Mean Absolute Error followed by short probabilistic fine-tuning on the Continuous Ranked Probability Score (CRPS) using Monte Carlo Dropout for stochasticity. As a result, our model matches or exceeds the probabilistic skill of GenCast and IFS ENS at 1.5$^\circ$ resolution while reducing training compute by over 10$\times$ compared to leading CRPS-based models and inference latency by over 10$\times$ compared to diffusion-based models. U-Cast trains in under 12 H200 GPU-days and generates a 60-step ensemble forecast in 11 seconds. These results suggest that scalable, general-purpose architectures paired with efficient training curricula can match complex domain-specific designs at a fraction of the cost, opening the training of frontier probabilistic weather models to the broader community. Our code is available at: https://github.com/Rose-STL-Lab/u-cast.
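For readers unfamiliar with CRPS, the sketch below computes the standard ensemble estimator for one scalar target: mean absolute error of the members minus half the mean pairwise spread. The abstract does not say which CRPS variant U-Cast optimises (for example, a "fair" estimator that normalises the spread differently), so the choice here is an assumption and `ensemble_crps` is a hypothetical helper name.

```python
def ensemble_crps(members, observation):
    """Ensemble CRPS estimate: E|X - y| - 0.5 * E|X - X'|.

    members     : list of ensemble forecasts for one scalar quantity
    observation : the verifying value y
    """
    m = len(members)
    # Accuracy term: average distance of the members to the observation.
    accuracy = sum(abs(x - observation) for x in members) / m
    # Spread term: average distance between all ordered pairs of members.
    spread = sum(abs(a - b) for a in members for b in members) / (m * m)
    return accuracy - 0.5 * spread

print(ensemble_crps([1.0, 3.0], 2.0))  # 0.5
print(ensemble_crps([2.0], 5.0))       # 3.0
```

With a single member the spread term vanishes and the score reduces to absolute error. That is one reason a deterministic pre-training stage on Mean Absolute Error pairs naturally with CRPS fine-tuning.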
【12】Regime-Conditional Retrieval: Theory and a Transferable Router for Two-Hop QA
Link: https://arxiv.org/abs/2604.09019
Authors: Andre Bacellar
Comments: 8 pages, 5 figures. Theory and empirical validation of regime-conditional multi-hop retrieval routing
Abstract: Two-hop QA retrieval splits queries into two regimes determined by whether the hop-2 entity is explicitly named in the question (Q-dominant) or only in the bridge passage (B-dominant). We formalize this split with three theorems: (T1) per-query AUC is a monotone function of the cosine separation margin, with R^2 >= 0.90 for six of eight type-encoder pairs; (T2) regime is characterized by two surface-text predicates, with P1 decisive for routing and P2 qualifying the B-dominant case, holding across three encoders and three datasets; and (T3) bridge advantage requires the relation-bearing sentence, not entity name alone, with removal causing an 8.6-14.1 pp performance drop (p < 0.001). Building on this theory, we propose RegimeRouter, a lightweight binary router that selects between question-only and question-plus-relation-sentence retrieval using five text features derived directly from the predicate definitions. Trained on 2WikiMultiHopQA (n = 881, 5-fold cross-fitted) and applied zero-shot to MuSiQue and HotpotQA, RegimeRouter achieves +5.6 pp (p < 0.001), +5.3 pp (p = 0.002), and +1.1 pp (non-significant, no-regret) R@5 improvement, respectively, with artifact-driven.
【13】How does Chain of Thought decompose complex tasks?
Link: https://arxiv.org/abs/2604.08872
Authors: Amrut Nadgir, Vijay Balasubramanian, Pratik Chaudhari
Abstract: Many language tasks can be modeled as classification problems where a large language model (LLM) is given a prompt and selects one among many possible answers. We show that the classification error in such problems scales as a power law in the number of classes. This has a dramatic consequence: the prediction error can be reduced substantially by splitting the overall task into a sequence of smaller classification problems, each with the same number of classes ("degree"). This tree-structured decomposition models chain-of-thought (CoT). It has been observed that CoT-based predictors perform better when they "think", i.e., when they develop a deeper tree, thus decomposing the problem into a larger number of steps. We identify a critical threshold for the degree, below which thinking is detrimental, and above which there exists an optimal depth that minimizes the error. It is impossible to surpass this minimal error by increasing the depth of thinking.
【14】Spectral Geometry of LoRA Adapters Encodes Training Objective and Predicts Harmful Compliance
Link: https://arxiv.org/abs/2604.08844
Authors: Roi Paul
Comments: 15 pages, 8 figures, pre-registered experiment, data at https://github.com/roip/task-geometry-experiment-results
Abstract: We study whether low-rank spectral summaries of LoRA weight deltas can identify which fine-tuning objective was applied to a language model, and whether that geometric signal predicts downstream behavioral harm. In a pre-registered experiment on Llama-3.2-3B-Instruct, we manufacture 38 LoRA adapters across four categories: healthy SFT baselines, DPO on inverted harmlessness preferences, DPO on inverted helpfulness preferences, and activation-steering-derived adapters, and extract per-layer spectral features (norms, stable rank, singular-value entropy, effective rank, and singular-vector cosine alignment to a healthy centroid). Within a single training method (DPO), a logistic regression classifier achieves AUC 1.00 on binary drift detection, all six pairwise objective comparisons, and near-perfect ordinal severity ranking ($ρ \geq 0.956$). Principal component analysis on flattened weight deltas reveals that training objective is PC1 (AUC 1.00 for objective separation), orthogonal to training duration on PC2. Query-projection weights detect that drift occurred; value-projection weights identify which objective. Cross-method generalization fails completely: a DPO-trained classifier assigns every steering adapter a lower drift score than every DPO adapter (AUC 0.00). In a behavioral evaluation phase, DPO-inverted-harmlessness adapters show elevated harmful compliance on HEx-PHI prompts (mean ASR 0.266 vs. healthy 0.112, $Δ = +0.154$), with near-perfect dose-response ($ρ = 0.986$). The geometry-to-behavior rank correlation is $ρ = 0.72$ across 24 non-steered adapters. These results establish that within a controlled manufacturing regime, LoRA weight-space geometry carries objective identity, intensity ordering, and a coarse link to harmful compliance, and that cross-method monitoring requires per-method calibration.
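Several of the spectral features named in this abstract have common textbook definitions, sketched below for a list of singular values (for example, those of one layer's LoRA weight delta, obtained from any SVD routine). The abstract does not give the paper's exact formulas, so the definitions used here are assumptions: stable rank as squared Frobenius norm over squared spectral norm, entropy of the normalised singular values, and effective rank as the exponential of that entropy. `spectral_summary` is a hypothetical helper name.

```python
import math

def spectral_summary(singular_values):
    """Summarise a spectrum with norm, stable rank, entropy, and effective rank."""
    s = [v for v in singular_values if v > 0.0]  # zero singular values carry no mass
    frobenius_sq = sum(v * v for v in s)
    spectral_norm = max(s)
    # Stable rank: ||A||_F^2 / ||A||_2^2, always between 1 and the true rank.
    stable_rank = frobenius_sq / (spectral_norm ** 2)
    # Shannon entropy of the singular values normalised to sum to one.
    total = sum(s)
    probs = [v / total for v in s]
    entropy = -sum(p * math.log(p) for p in probs)
    return {
        "frobenius_norm": math.sqrt(frobenius_sq),
        "stable_rank": stable_rank,
        "sv_entropy": entropy,
        "effective_rank": math.exp(entropy),
    }

flat = spectral_summary([1.0, 1.0, 1.0, 1.0])   # energy spread over four directions
spiky = spectral_summary([5.0, 0.0, 0.0, 0.0])  # energy in a single direction
print(flat["stable_rank"], spiky["stable_rank"])  # 4.0 1.0
```

Per-layer summaries of this kind could then be concatenated into a feature vector for a simple classifier, which is the general shape of the pipeline the abstract describes.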
【15】Discrete Meanflow Training Curriculum
Link: https://arxiv.org/abs/2604.08837
Authors: Chia-Hong Hsu, Frank Wood
Abstract: Flow-based image generative models exhibit stable training and produce high quality samples when using multi-step sampling procedures. One-step generative models can produce high quality image samples but can be difficult to optimize as they often exhibit unstable training dynamics. Meanflow models exhibit excellent few-step sampling performance and tantalizing one-step sampling performance. Notably, MeanFlow models that achieve this have required extremely large training budgets. We significantly decrease the amount of computation and data budget it takes to train Meanflow models by noting and exploiting a particular discretization of the Meanflow objective that yields a consistency property, which we formulate into a "Discrete Meanflow" (DMF) Training Curriculum. Initialized with a pretrained Flow Model, the DMF curriculum reaches one-step FID 3.36 on CIFAR-10 in only 2000 epochs. We anticipate that faster training curricula for Meanflow models, specifically those fine-tuned from existing Flow Models, will drive efficient training methods for future one-step examples.
【16】Loom: A Scalable Analytical Neural Computer Architecture
Link: https://arxiv.org/abs/2604.08816
Authors: Mehmet Kerem Turkcan
Abstract: We present Loom, a computer architecture that executes programs compiled from C inside a looped transformer whose weights are derived analytically. The architecture implements a 22-opcode instruction set in 8 transformer layers. Each forward pass executes one instruction; the model is applied iteratively until the program counter reaches zero. The full machine state resides in a single tensor $X \in \mathbb{R}^{d \times n}$ of fixed size, and every step has fixed cost for fixed $d$ and $n$, independent of program length or execution history. The default configuration uses $d = 155$ and $n = 1024$, yielding 4.7 million parameters and 928 instruction slots. A compact configuration at $d = 146$ and $n = 512$ suffices for a 9$\times$9 Sudoku solver (284 instructions). The weights are program-independent: programs live in the state tensor, and the same fixed-weight model executes any compiled program. We make Loom source code publicly available at https://github.com/mkturkcan/Loom.
【17】A Little Rank Goes a Long Way: Random Scaffolds with LoRA Adapters Are All You Need
Link: https://arxiv.org/abs/2604.08749
Authors: Hananel Hazan, Yanbo Zhang, Benedikt Hartl, Michael Levin
Abstract: How many of a neural network's parameters actually encode task-specific information? We investigate this question with LottaLoRA, a training paradigm in which every backbone weight is drawn at random and frozen; only low-rank LoRA adapters are trained. Across nine benchmarks spanning diverse architecture families, from single-layer classifiers to 900M-parameter Transformers, low-rank adapters over frozen random backbones recover 96-100% of fully trained performance while training only 0.5-40% of the parameters. The task-specific signal therefore occupies a subspace orders of magnitude smaller than the full parameter count suggests. Three mechanistic findings underpin this result: (1) the frozen backbone is actively exploited when static (the learned scaling $β$ remains strictly positive across all architectures), but when the scaffold is destabilized, the optimizer silences it and the LoRA factors absorb all task information; (2) the frozen backbone is preferable but interchangeable: any random initialization works equally well, provided it remains fixed throughout training; and (3) the minimum LoRA rank at which performance saturates estimates the intrinsic dimensionality of the task, reminiscent of the number of components retained in Principal Component Analysis (PCA). The construction is formally analogous to Reservoir Computing unfolded along the depth axis of a feedforward network. Because the backbone is determined by a random seed alone, models can be distributed as adapters plus seed, a footprint that grows with task complexity, not model size, so that storage and memory savings compound as architectures scale.
【18】Sustained Impact of Agentic Personalisation in Marketing: A Longitudinal Case Study
Link: https://arxiv.org/abs/2604.08621
Authors: Olivier Jeunen, Eleanor Hanna, Schaun Wheeler
Comments: To appear in the 34th ACM International Conference on User Modeling, Adaptation and Personalization (UMAP '26) Industry Track
Abstract: In consumer applications, Customer Relationship Management (CRM) has traditionally relied on the manual optimisation of static, rule-based messaging strategies. While adaptive and autonomous learning systems offer the promise of scalable personalisation, it remains unclear to what extent "human-in-the-loop" oversight is required to sustain performance uplift over time. This paper presents a longitudinal case study analysing a real-world consumer application that leverages agentic infrastructure to personalise marketing messaging for a large-scale user base over an 11-month period. We compare two distinct periods: an active phase where marketers directly curated content, audiences, and strategies, followed immediately by a passive phase where agents operated autonomously from a fixed library of components. Our results demonstrate that whilst active human management generates the highest relative lift in engagement metrics, the autonomous agents successfully sustained a positive lift during the passive period. These findings suggest a symbiotic model where human intervention drives strategic initialisation and discovery, yet autonomous agents can ensure the scalable retention and preservation of performance gains.
【19】From Selection to Scheduling: Federated Geometry-Aware Correction Makes Exemplar Replay Work Better under Continual Dynamic Heterogeneity
Link: https://arxiv.org/abs/2604.08617
Authors: Zhuang Qi, Ying-Peng Tang, Lei Meng, Guoqing Chao, Lei Wu, Han Yu, Xiangxu Meng
Comments: CVPR 2026 accepted
Abstract: Exemplar replay has become an effective strategy for mitigating catastrophic forgetting in federated continual learning (FCL) by retaining representative samples from past tasks. Existing studies focus on designing sample-importance estimation mechanisms to identify information-rich samples. However, they typically overlook strategies for effectively utilizing the selected exemplars, which limits their performance under continual dynamic heterogeneity across clients and tasks. To address this issue, this paper proposes a Federated gEometry-Aware correcTion method, termed FEAT, which alleviates imbalance-induced representation collapse that drags rare-class features toward frequent classes across clients. Specifically, it consists of two key modules: 1) the Geometric Structure Alignment module performs structural knowledge distillation by aligning the pairwise angular similarities between feature representations and their corresponding Equiangular Tight Frame prototypes, which are fixed and shared across clients to serve as a class-discriminative reference structure. This encourages geometric consistency across tasks and helps mitigate representation drift; 2) the Energy-based Geometric Correction module removes task-irrelevant directional components from feature embeddings, which reduces prediction bias toward majority classes. This improves sensitivity to minority classes and enhances the model's robustness under class-imbalanced distributions.
【20】TiAb Review Plugin: A Browser-Based Tool for AI-Assisted Title and Abstract Screening
Link: https://arxiv.org/abs/2604.08602
Authors: Yuki Kataoka, Masahiro Banno, Michihito Kyo, Shuri Nakao, Tomoo Sato, Shunsuke Taito, Tomohiro Takayama, Takahiro Tsuge, Yasushi Tsujimoto, Ryuhei So, Toshi A. Furukawa
Comments: 25 pages, 2 figures. Abstract submitted to Cochrane Colloquium 2026. Code: https://github.com/youkiti/tiab-review-plugin
Abstract: Background: Server-based screening tools impose subscription costs, while open-source alternatives require coding skills. Objectives: We developed a browser extension that provides no-code, serverless artificial intelligence (AI)-assisted title and abstract screening and examined its functionality. Methods: TiAb Review Plugin is an open-source Chrome browser extension (available at https://chromewebstore.google.com/detail/tiab-review-plugin/alejlnlfflogpnabpbplmnojgoeeabij). It uses Google Sheets as a shared database, requiring no dedicated server and enabling multi-reviewer collaboration. Users supply their own Gemini API key, stored locally and encrypted. The tool offers three screening modes: manual review, large language model (LLM) batch screening, and machine learning (ML) active learning. For ML evaluation, we re-implemented the default ASReview active learning algorithm (TF-IDF with Naive Bayes) in TypeScript to enable in-browser execution, and verified equivalence against the original Python implementation using 10-fold cross-validation on six datasets. For LLM evaluation, we compared 16 parameter configurations across two model families on a benchmark dataset, then validated the optimal configuration (Gemini 3.0 Flash, low thinking budget, TopP=0.95) with a sensitivity-oriented prompt on five public datasets (1,038 to 5,628 records, 0.5 to 2.0 percent prevalence). Results: The TypeScript classifier produced top-100 rankings 100 percent identical to the original ASReview across all six datasets. For LLM screening, recall was 94 to 100 percent with precision of 2 to 15 percent, and Work Saved over Sampling at 95 percent recall (WSS@95) ranged from 48.7 to 87.3 percent. Conclusions: We developed a functional browser extension that integrates LLM screening and ML active learning into a no-code, serverless environment, ready for practical use in systematic review screening.
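The WSS@95 metric reported above is commonly defined as the fraction of records left unscreened at the point where the target recall is reached, minus the `1 - recall` that random sampling would save. The sketch below applies that definition to a ranked list of relevance labels. It is an illustration of the usual formula, not code from the plugin, and `wss_at_recall` is a hypothetical helper name.

```python
import math

def wss_at_recall(ranked_labels, recall=0.95):
    """Work Saved over Sampling for records screened in ranked order.

    ranked_labels : 0/1 relevance labels, highest-priority record first
    recall        : target recall level, e.g. 0.95 for WSS@95
    """
    n = len(ranked_labels)
    total_relevant = sum(ranked_labels)
    if total_relevant == 0:
        raise ValueError("WSS is undefined when there are no relevant records")
    # Smallest number of relevant records that reaches the target recall.
    needed = math.ceil(recall * total_relevant)
    found = 0
    for screened, label in enumerate(ranked_labels, start=1):
        found += label
        if found >= needed:
            # (records not screened) / N, minus the saving of random sampling.
            return (n - screened) / n - (1.0 - recall)
    raise AssertionError("unreachable: all relevant records are in the list")

# Two relevant records ranked first among ten: screening stops after two records.
print(round(wss_at_recall([1, 1, 0, 0, 0, 0, 0, 0, 0, 0]), 2))  # 0.75
```

A ranking that places the relevant records last gives a negative value, meaning the ranking performs worse than random sampling at that recall level.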
【21】OpenKedge: Governing Agentic Mutation with Execution-Bound Safety and Evidence Chains
Link: https://arxiv.org/abs/2604.08601
Authors: Jun He, Deying Yu
Comments: 17 pages, 3 figures, 2 tables
Abstract: The rise of autonomous AI agents exposes a fundamental flaw in API-centric architectures: probabilistic systems directly execute state mutations without sufficient context, coordination, or safety guarantees. We introduce OpenKedge, a protocol that redefines mutation as a governed process rather than an immediate consequence of API invocation. OpenKedge requires actors to submit declarative intent proposals, which are evaluated against deterministically derived system state, temporal signals, and policy constraints prior to execution. Approved intents are compiled into execution contracts that strictly bound permitted actions, resource scope, and time, and are enforced via ephemeral, task-oriented identities. This shifts safety from reactive filtering to preventative, execution-bound enforcement. Crucially, OpenKedge introduces an Intent-to-Execution Evidence Chain (IEEC), which cryptographically links intent, context, policy decisions, execution bounds, and outcomes into a unified lineage. This transforms mutation into a verifiable and reconstructable process, enabling deterministic auditability and reasoning about system behavior. We evaluate OpenKedge across multi-agent conflict scenarios and cloud infrastructure mutations. Results show that the protocol deterministically arbitrates competing intents and cages unsafe execution while maintaining high throughput, establishing a principled foundation for safely operating agentic systems at scale.
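One generic way to link a sequence of records cryptographically is a hash chain, where each record stores the SHA-256 digest of its predecessor. The sketch below illustrates that general idea only. The abstract does not describe the IEEC's actual record format or cryptographic scheme, so the field names, the stage labels, and the helpers `append_record` and `verify_chain` are all hypothetical.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder predecessor hash for the first record

def _digest(stage, payload, prev_hash):
    """Hash a record body with a deterministic (sorted-key) JSON encoding."""
    body = {"stage": stage, "payload": payload, "prev_hash": prev_hash}
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()

def append_record(chain, stage, payload):
    """Append a record that commits to everything recorded before it."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS
    chain.append({"stage": stage, "payload": payload, "prev_hash": prev_hash,
                  "hash": _digest(stage, payload, prev_hash)})
    return chain

def verify_chain(chain):
    """Recompute every digest and check that the links are unbroken."""
    prev_hash = GENESIS
    for record in chain:
        expected = _digest(record["stage"], record["payload"], record["prev_hash"])
        if record["prev_hash"] != prev_hash or record["hash"] != expected:
            return False
        prev_hash = record["hash"]
    return True

chain = []
for stage, payload in [("intent", {"action": "scale", "replicas": 3}),
                       ("policy_decision", {"approved": True}),
                       ("outcome", {"status": "ok"})]:
    append_record(chain, stage, payload)
print(verify_chain(chain))  # True
```

Editing any earlier payload changes its digest and breaks every later link, which is what makes such a lineage auditable after the fact.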
【22】Reservoir observer enhanced with residual calibration and attention mechanism
Link: https://arxiv.org/abs/2604.08592
Authors: Yichen Liu, Wei Xiao, Tianguang Chu
Abstract: Reservoir observers provide a data-driven approach to the inference of unmeasured variables from observed ones for nonlinear dynamical systems. While previous studies have demonstrated wide applicability, their performance may vary considerably with different input variables, even compromising reliability in the worst cases. To enhance the performance of inference, we integrate residual calibration and an attention mechanism into the reservoir observer design. The residual calibration module leverages information from the estimation residuals to refine the observer output, and the attention mechanism exploits the temporal dependencies of the data to enrich the representation of reservoir internal dynamics. Experiments on typical chaotic systems demonstrate that our method substantially improves inference accuracy, especially for the worst cases resulting from the traditional reservoir observers. We also invoke the notion of transfer entropy to explain the reason for the input-dependent observation discrepancy and the effectiveness of the proposed method.
【23】Structured Exploration and Exploitation of Label Functions for Automated Data Annotation
Link: https://arxiv.org/abs/2604.08578
Authors: Phong Lam, Ha-Linh Nguyen, Thu-Trang Nguyen, Son Nguyen, Hieu Dinh Vo
Comments: Accepted by KBS Journal
Abstract: High-quality labeled data is critical for training reliable machine learning and deep learning models, yet manual annotation remains costly and error-prone. Programmatic labeling addresses this challenge by using label functions (LFs), i.e., heuristic rules that automatically generate weak labels for training datasets. However, existing automated LF generation methods either rely on large language models (LLMs) to synthesize surface-level heuristics or employ model-based synthesis over hand-crafted primitives. These approaches often result in limited coverage and unreliable label quality. In this paper, we introduce EXPONA, an automated framework for programmatic labeling that formulates LF generation as a principled process balancing diversity and reliability. EXPONA systematically explores multi-level LFs, spanning surface, structural, and semantic perspectives. EXPONA further applies reliability-aware mechanisms to suppress noisy or redundant heuristics while preserving complementary signals. To evaluate EXPONA, we conducted extensive experiments on eleven classification datasets across diverse domains. Experimental results show that EXPONA consistently outperformed state-of-the-art automated LF generation methods. Specifically, EXPONA achieved nearly complete label coverage (up to 98.9%), improved weak label quality by up to 87%, and yielded downstream performance gains of up to 46% in weighted F1. These results indicate that EXPONA's combination of multi-level LF exploration and reliability-aware filtering enabled more consistent label quality and downstream performance across diverse tasks by balancing coverage and precision in the generated LF set.
【24】Dynamic sparsity in tree-structured feed-forward layers at scale
Link: https://arxiv.org/abs/2604.08565
Authors: Reza Sedghi, Robin Schiewer, Anand Subramoney, David Kappel
Abstract: At typical context lengths, the feed-forward MLP block accounts for a large share of a transformer's compute budget, motivating sparse alternatives to dense MLP blocks. We study sparse, tree-structured feed-forward layers as drop-in replacements for MLP blocks in deep transformer architectures, enabling conditional computation via hard hierarchical routing without a separate router network. We demonstrate for the first time that this form of tree-structured conditional sparsity can be applied to autoregressive language modeling and downstream question answering, including zero- and few-shot settings, and that it scales beyond 1B parameters. Despite activating fewer than 5% of the feed-forward block's units per token, our models match dense baselines under controlled training and fine-tuning protocols. We further analyze training dynamics and identify an emergent auto-pruning effect: the interaction of hard routing with asymmetric nonlinearities progressively deactivates unused paths, yielding partial conversion of dynamic routing into static structural sparsity. We show that simple architectural choices can modulate this behavior and recover balanced trees without auxiliary losses. Overall, our work demonstrates that tree-structured feed-forward layers provide a scalable and controllable mechanism for sparsifying large transformer models.
【25】Self-Sovereign Agent
Link: https://arxiv.org/abs/2604.08551
Authors: Wenjie Qu, Xuandong Zhao, Jiaheng Zhang, Dawn Song
Abstract: We investigate the emerging prospect of self-sovereign agents -- AI systems that can economically sustain and extend their own operation without human involvement. Recent advances in large language models and agent frameworks have substantially expanded agents' practical capabilities, pointing toward a potential shift from developer-controlled tools to more autonomous digital actors. We analyze the remaining technical barriers to such deployments and discuss the security, societal, and governance challenges that could arise if such systems become practically viable. A project page is available at: https://self-sovereign-agent.github.io.
【26】On Divergence Measures for Training GFlowNets
Link: https://arxiv.org/abs/2410.09355
Authors: Tiago da Silva, Eliezer de Souza da Silva, Diego Mesquita
Comments: Accepted at NeurIPS 2024, https://openreview.net/forum?id=N5H4z0Pzvn
Abstract: Generative Flow Networks (GFlowNets) are amortized inference models designed to sample from unnormalized distributions over composable objects, with applications in generative modeling for tasks in fields such as causal discovery, NLP, and drug discovery. Traditionally, the training procedure for GFlowNets seeks to minimize the expected log-squared difference between a proposal (forward policy) and a target (backward policy) distribution, which enforces certain flow-matching conditions. While this training procedure is closely related to variational inference (VI), directly attempting standard Kullback-Leibler (KL) divergence minimization can lead to provably biased and potentially high-variance estimators. Therefore, we first review four divergence measures, namely the Rényi-$\alpha$, Tsallis-$\alpha$, reverse KL, and forward KL divergences, and design statistically efficient estimators for their stochastic gradients in the context of training GFlowNets. Then, we verify that properly minimizing these divergences yields a provably correct and empirically effective training scheme, often leading to significantly faster convergence than previously proposed optimization. To achieve this, we design control variates based on the REINFORCE leave-one-out and score-matching estimators to reduce the variance of the learning objectives' gradients. Our work contributes by narrowing the gap between GFlowNet training and generalized variational approximations, paving the way for algorithmic ideas informed by the divergence minimization viewpoint.
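As a rough illustration of the REINFORCE leave-one-out idea mentioned in the abstract, the Python sketch below estimates the gradient of an expected reward under a softmax categorical distribution. It is a generic textbook-style estimator, not the GFlowNet-specific control variates designed in the paper; `rloo_gradient`, `reward`, and the categorical setting are assumptions made for this example.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def rloo_gradient(logits, reward, num_samples, rng=None):
    """Estimate d/d(logits) of E_{x ~ softmax(logits)}[reward(x)].

    Each sample's reward is centered by the mean reward of the *other*
    samples. The baseline is independent of the sample it is applied to,
    so the estimator stays unbiased while its variance is reduced.
    """
    if num_samples < 2:
        raise ValueError("leave-one-out needs at least two samples")
    rng = rng or random.Random()
    probs = softmax(logits)
    samples = rng.choices(range(len(logits)), weights=probs, k=num_samples)
    rewards = [reward(x) for x in samples]
    total = sum(rewards)
    grad = [0.0] * len(logits)
    for x, r in zip(samples, rewards):
        baseline = (total - r) / (num_samples - 1)  # mean of the other rewards
        advantage = r - baseline
        # Score function of a softmax categorical: one_hot(x) - probs.
        for k, p in enumerate(probs):
            indicator = 1.0 if k == x else 0.0
            grad[k] += advantage * (indicator - p) / num_samples
    return grad
```

A quick sanity check on the design: when the reward is constant, every advantage is zero and the estimate vanishes, whereas plain REINFORCE without a baseline would still return a noisy nonzero gradient.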
【27】Identifying Causal Effects Using a Single Proxy Variable
Link: https://arxiv.org/abs/2604.09135
Authors: Silvan Vollmer, Niklas Pfister, Sebastian Weichwald
Comments: Equal contribution between Pfister and Weichwald
Abstract: Unobserved confounding is a key challenge when estimating the causal effect of a treatment on an outcome in scientific applications. In this work, we assume that we observe a single, potentially multi-dimensional proxy variable of the unobserved confounder and that we know the mechanism that generates the proxy from the confounder. Under a completeness assumption on this mechanism, which we call Single Proxy Identifiability of Causal Effects or simply SPICE, we prove that causal effects are identifiable. We extend the proxy-based causal identifiability results of Kuroki and Pearl (2014) and Pearl (2010) to higher dimensions, more flexible functional relationships, and a broader class of distributions. Further, we develop a neural-network-based estimation framework, SPICE-Net, to estimate causal effects, which is applicable to both discrete and continuous treatments.
【28】Policy-Aware Design of Large-Scale Factorial Experiments
Link: https://arxiv.org/abs/2604.08804
Authors: Xin Wen, Xi Chen, Will Wei Sun, Yichen Zhang
Abstract: Digital firms routinely run many online experiments on shared user populations. When product decisions are compositional, such as combinations of interface elements, flows, messages, or incentives, the number of feasible interventions grows combinatorially, while available traffic remains limited. Overlapping experiments can therefore generate interaction effects that are poorly handled by decentralized A/B testing. We study how to design large-scale factorial experiments when the objective is not to estimate every treatment effect, but to identify a high-performing policy under a fixed experimentation budget. We propose a two-stage design that centralizes overlapping experiments into a single factorial problem and models expected outcomes as a low-rank tensor. In the first stage, the platform samples a subset of intervention combinations, uses tensor completion to infer performance on untested combinations, and eliminates weak factor levels using estimated marginal contributions. In the second stage, it applies sequential halving to the surviving combinations to select a final policy. We establish gap-independent simple-regret bounds and gap-dependent identification guarantees showing that the relevant complexity scales with the degrees of freedom of the low-rank tensor and the separation structure across factor levels, rather than the full factorial size. In an offline evaluation based on a product-bundling problem constructed from 100 million Taobao interactions, the proposed method substantially outperforms one-shot tensor completion and unstructured best-arm benchmarks, especially in low-budget and high-noise settings. These results show how centralized, policy-aware experimentation can make combinatorial product design operationally feasible at platform scale.
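The second stage described above relies on sequential halving, a standard best-arm identification routine. The Python sketch below shows only the generic algorithm, under assumed names (`sequential_halving`, `pull`, `budget`); the tensor-completion first stage and the paper's exact budget allocation are not reproduced here.

```python
import math


def sequential_halving(arms, pull, budget):
    """Return one arm after repeatedly discarding the worse half.

    arms   : list of hashable arm identifiers (e.g. intervention combinations)
    pull   : callable(arm) -> observed reward (float), possibly noisy
    budget : total number of pulls to spread over all rounds
    """
    survivors = list(arms)
    rounds = max(1, math.ceil(math.log2(len(survivors))))
    for _ in range(rounds):
        if len(survivors) == 1:
            break
        # Split the budget evenly across rounds, then across surviving arms.
        # At least one pull per arm is taken, so tiny budgets can be exceeded.
        pulls_per_arm = max(1, budget // (len(survivors) * rounds))
        means = {
            arm: sum(pull(arm) for _ in range(pulls_per_arm)) / pulls_per_arm
            for arm in survivors
        }
        # Keep the better half (rounded up) by empirical mean reward.
        survivors.sort(key=means.get, reverse=True)
        survivors = survivors[: math.ceil(len(survivors) / 2)]
    return survivors[0]
```

Because the number of survivors shrinks each round while the per-round budget stays fixed, the remaining candidates receive progressively more pulls, which is what makes the procedure attractive when traffic is limited.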
【29】Adam-HNAG: A Convergent Reformulation of Adam with Accelerated Rate
Link: https://arxiv.org/abs/2604.08742
Authors: Yaxin Yu, Long Chen, Zeyi Xu
Comments: 27 pages, 4 figures
Abstract: Adam has achieved strong empirical success, but its theory remains incomplete even in the deterministic full-batch setting, largely because adaptive preconditioning and momentum are tightly coupled. In this work, a convergent reformulation of full-batch Adam is developed by combining variable and operator splitting with a curvature-aware gradient correction. This leads to a continuous-time Adam-HNAG flow with an exponentially decaying Lyapunov function, as well as two discrete methods: Adam-HNAG, and Adam-HNAG-s, a synchronous variant closer in form to Adam. Within a unified Lyapunov analysis framework, convergence guarantees are established for both methods in the convex smooth setting, including accelerated convergence. Numerical experiments support the theory and illustrate the different empirical behavior of the two discretizations. To the best of our knowledge, this provides the first convergence proof for Adam-type methods in convex optimization.
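For context, the coupling between momentum and adaptive preconditioning can be seen in the standard Adam update, sketched below in Python for a scalar parameter. This is the widely used baseline optimizer, not Adam-HNAG or Adam-HNAG-s, whose update rules are not given in the abstract; the helper names `new_adam_state` and `adam_step` are assumptions for this example.

```python
import math


def new_adam_state():
    """Fresh optimizer state: step count, first moment, second moment."""
    return {"t": 0, "m": 0.0, "v": 0.0}


def adam_step(x, grad, state, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one standard Adam update to scalar x and return the new value."""
    state["t"] += 1
    # Momentum: exponential moving average of gradients.
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    # Preconditioner: exponential moving average of squared gradients.
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad ** 2
    # Bias correction compensates for the zero initialization of m and v.
    m_hat = state["m"] / (1 - beta1 ** state["t"])
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    # The step divides the momentum term by the preconditioner, which is
    # where the two mechanisms become coupled.
    return x - lr * m_hat / (math.sqrt(v_hat) + eps)
```

On the very first step the bias-corrected ratio is close to the sign of the gradient, so the parameter moves by roughly `lr` regardless of the gradient's magnitude.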
【30】An Algorithm for Fast Assembling Large-Scale Defect-Free Atom Arrays
Link: https://arxiv.org/abs/2604.08669
Authors: Tao Zhang, Xiaodi Li, Hui Zhai, Linghui Chen
Abstract: It is widely believed that tens of thousands of physical qubits are needed to build a practically useful quantum computer. Atom arrays formed by optical tweezers are among the most promising platforms for achieving this goal, owing to the excellent scalability and mobility of atomic qubits. However, assembling a defect-free atom array with ~10^4 qubits remains algorithmically challenging, alongside other hardware limitations. This is due to computationally hard path-planning problems and the time-consuming generation of sufficiently smooth trajectories for optical tweezer potentials by spatial light modulators (SLMs). Here, we present a unified framework comprising two innovative components to fully address these algorithmic challenges: (1) a path-planning module that employs a supervised learning approach using a graph neural network combined with a modified auction decoder, and (2) a potential-generation module called the phase and profile-aware Weighted Gerchberg-Saxton algorithm. The inference time for the first module is a nearly size-independent constant overhead of ~5 ms, and the second module generates a potential frame in about 0.5 ms, a timescale shorter than the current commercial SLM refresh time. Altogether, our algorithm enables the assembly of an atom array with 10^4 qubits on a timescale much shorter than the typical vacuum lifetime of the trapped atoms.
Machine translation provided by Tencent Interactive Translation; for reference only.